A year ago, every AI conference had a prompt engineering track. People traded magic phrases like baseball cards. “Think step by step.” “You are a world-class expert.” It felt like a discipline. It was a parlor trick.
I have since shipped enough production agents to say this plainly: the prompt wording is almost never the thing that breaks. What breaks is context assembly. What you retrieved, how you compressed it, what you put in the window, and in what order. That is the real job, and it has a real name now. Anthropic’s Applied AI team published the canonical framing in September 2025, and it lines up exactly with what I see in the field.
The window is a budget, not a bucket
The single most useful idea in that piece is that context is a finite resource with an “attention budget.” Every token you add spends down that budget. Stuff the window with marginally relevant junk and you do not get a smarter agent. You get a dumber one.
Anthropic gives the failure mode a name: “context rot.” As token volume climbs, recall degrades. The model starts missing the one fact that mattered because it is buried under forty things that did not. I have watched this happen in production more times than I can count. A demo that dazzled a room with a clean ten-document context collapses three weeks later when the retrieval layer is pulling sixty documents and the signal is gone.
Prompt engineers treat the window like a bucket. Pour everything in, hope the model sorts it out. Context engineers treat it like a budget. Every token has to earn its seat.
What actually moves the needle
In production, the levers that matter are not phrasings. They are systems.
Retrieval is the first one. The shift Anthropic describes is from pre-loading everything to “just in time” loading: keep lightweight identifiers in the window, fetch the heavy content only when the agent actually needs it. This is the difference between an agent that costs a fortune and times out, and one that stays sharp because it is only ever looking at what is relevant right now.
Compaction is the second. Long-horizon tasks drown in their own history. A multi-hour agent run accumulates tool outputs, intermediate reasoning, and dead ends until the window is full of noise. The fix is to summarize, reinitialize, and carry forward only what matters. If you have ever watched an agent get worse the longer it runs, you have watched a missing compaction layer.
Memory and state are the third. The agents that survive contact with real users are the ones that write structured notes to external storage and pull them back after a context reset. The window is working memory. It was never meant to be long-term storage. Treating it as both is how you get an agent that forgets the user’s name halfway through the session.
Notice what is missing from that list. Clever prompt phrasing. It is not that the system prompt is irrelevant. Anthropic’s point about hitting the “right altitude,” concrete enough to guide behavior, loose enough not to be brittle, is correct. But that is a one-time calibration. The pipeline feeding the window is a living system you tune every day.
The CODN nobody prices in
Here is what gets lost in the prompt-versus-context debate. There is a Cost of Doing Nothing, and it is not zero.
Every week a team spends A/B testing prompt phrasings is a week the retrieval layer stays untested, the compaction strategy stays unbuilt, and the memory architecture stays a TODO. The agent keeps looking impressive in the demo and keeps failing on the long tail in production. The gap between those two states is the CODN, and it compounds. You do not pay it as a single bill. You pay it as eroded trust, as the pilot that never graduates to deployment, as the executive who quietly concludes that agents do not work.
Prompt engineering let you fake competence in a controlled environment. Context engineering is what you need when the environment stops being controlled. One is a sentence you wrote once. The other is an infrastructure problem you own forever.
The teams that win the next two years will not be the ones with the best prompt library. They will be the ones who treat the context window like the scarce, expensive, leaky resource it actually is, and who build the retrieval, compaction, and memory systems to defend it. Stop tuning sentences. Start engineering what the model sees, because that is the only thing it ever knew.