I guess I am mostly enjoying learning the fundamentals of AI stuff, even though I disagree with the direction it is going.
But I am struggling to put into words how alarming I find the comments on threads like this — all sorts of good-natured anecdotes about how XYZ works for them that are more like the suggestions in pet care or cookery threads on Facebook.
(Or worse still, like any Facebook 3D printing group: anyone who prints but wants to understand what is actually going on will know what I mean, I think)
Any shared sense of rigour is just completely torpedoed by the LLM world, particularly the cloud LLM world it seems, and we are reduced to cargo culting. Nobody is any more right or wrong than anyone else.
Have you tried cleaning your context with dawn dish soap, letting it dry and then adding a layer of glue stick?
--
ETA: I don't want to sound so mean about people who try to help, here or in facebook groups. I guess I just find these threads so different to threads on more or less any other topic, where someone's suggestion can be debated or refined by other commenters and then someone will explain a thing about how bash history selections work that will change your entire life. With these threads they devolve to "isn't it weird that threatening it works?"
> Any shared sense of rigour is just completely torpedoed by the LLM world, particularly the cloud LLM world it seems, and we are reduced to cargo culting. Nobody is any more right or wrong than anyone else.
There was always some of this in the tech world, long before LLMs came along.
I've sat in so many meetings when decisions were made based on "that's what _slightly more prestigious company_ does" rather than objective measurable criteria. (And the evidence that the thing in question wasn't universally followed by _slightly more prestigious company_ carried surprisingly little weight).
Absolutely I agree there has always been some cargo culting going on; that's true of all process-oriented businesses.
But people are now individually acting this way on their desks on an hour by hour basis. LLMs make cargo-culting inevitable because they are inscrutable and opaque.
There is always this sense in the LLM-proponent world that LLMs are at any moment as bad as they are ever going to be; line goes up.
But it seems clear that the gap between perceived and measurable productivity is still likely spent in poking entrails with a stick.
We are so used to probabilistic tools that have significant setup time before they become valuable and save us loads of time that we're at risk of repeatedly writing off that setup time without seeing the rewards, believing that one day it will actually work out that way.
(Which is most recognisable from the early JS frontend frameworks era.)
Meantime here we have an article that shows that a thing (longer context windows) that people thought would functionally solve a problem so we would get the value from all that setup does not, in fact, very meaningfully kick it down the road, and the comments are still full of entrails-and-stick work.
> Any shared sense of rigour is just completely torpedoed by the LLM world
Consider that this shared sense of rigour you have in mind is illusory, and LLMs and their context struggles are simply revealing this. I see precious little rigour in any of the 'tech' world I've lived in for decades. The tools proliferate, paradigms emerge and die and reemerge, and whatever stick you consider using to measure any of it has competitors with different units. Past the physics of power and signaling, and the prevailing cost of a silicon wafer, we are almost all, relative to a small number of much older disciplines, muddlers of various degrees of skill.
I've found dealing with context limits relatively easy: specify and confine. LLMs need clear specifications and strong guidance to produce good work.
But that's just my current muddling take on the practice. Perhaps, 90 days from now, even this burden will be gone, and a simple prompt will generate world class operating systems, programming languages and a formal basis in mathematics for both.
The arbitrary and non-deterministic nature of LLM workflows gives me full on ick. As an old embedded/systems guy I have always prioritized determinism and repeatability in my workflows.
But damn, agents are amazing and I'm enjoying being a "thought process designer". I'm not going back. Even if AI development stops today my career will never be the same.
I felt the same way about the non-determinism but realized it can be really beneficial to have a machine that can fairly reliably turn non-determinism into determinism.
I’m working on a tiny agent harness at home to learn and the process of taking human speech and turning it into agent tool calls that output something generally deterministic depending on how the tool is defined is so interesting.
One of the big takeaways is you really only have to rely on the non-determinism<->determinism translation layer once when you switch between the two domains. You can obviously rely on it more if you want, and that’s probably faster because determinism is hard, but you don’t need too do that.
That sounds very cool. It’s sometimes baffling that LLMs can’t use tools reliably. Serena and Semble both require some arcane instructions to coerce Claude Code into compliance. Just stop trying to pipe nonsense commands into each other, man!
This has always been a thing with IT advice, though - the more complex a system and the outcome, the harder it is to clearly define "better" or "worse". Add in the fact that LLMs are intensely and emphatically non-deterministic and LLM guidance basically becomes gardening advice.
Heck, even the 'benchmarks' are mostly somebody's attempt to crystallize their vibes with varying amounts of success.
> LLMs are intensely and emphatically non-deterministic and LLM guidance basically becomes gardening advice
Have you ever tried doing evals on moderately complex but bounded tasks?
I spent some time doing it when testing these "token reducing" tools like Headroom, RTK etc. as well as customizing my Pi tools. What I found interesting was that despite LLMs being deterministic, for a given toolset and prompt, the results were highly consistent for a given eval, across multiple models (I tested at the time using GPT 5.4 mini, 5.5, 5.3 codex, Gemini 3 flash, initially running sets of 5 evals on each task but once I realized how consistent the results were, dropping to sets of 3.
Aside: in my tests, RTK and Headroom made the overall context use higher for roughly equivalent results. The context use for those specific toolcalls went down but the number of model turns and overall context use went up.
It's not just you! Here's a lovely quote from an influential paper, "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence." I think people went through a similar phase with steam engines. Lot's of practical engineering and heuristics to explain what works, before the emergence of a solid theoretical foundation (thermodynamics) to explain why.
If you want my best guess: I think large context windows cannot be trained properly. There's not enough material, nor computing power, to train such large networks (to the same degree as small windows).
What sense of rigour is going to be in a field (LLM usage as a user) where models, context sizes, tooling and broadly "rules" (scary quotes) change every few weeks? There is no literal change to have a scientific approach to anything, churn is too high, there are papers about model XYZ v 12345 from a few months ago that are already old because there is model ABC on version 54321 that addresses half of the issue shown in the paper and add 3 new problems though.
I feel your frustration for sure and agree to a large extent. Any attempts I’ve made to try to formalize any LLM-based workflows has resulted in me being again dismayed that no one seems to have any real idea of how or why certain things work or don’t work. So I just go back to /plan and “write this down in a markdown document for posterity before we iterate on the implementation”, hoping that maybe next month there might be something a little more rigorous with some kind of rational backing.
> Have you tried cleaning your context with dawn dish soap
I don’t do the glue stick thing at all because I don’t need to, but Dawn really seems to do a good job at getting my Bambu build plate working again. I didn’t seek it out specifically, I already had some for doing dishes. IPA hadn’t worked so I tried Dawn and it has gotten me back having prints stick multiple times now. Not quite up to N=30 yet.
first of all, LLM-assisted coding is less than 3 years old. 3 years ago all we had was GPT-4 with 8192 token context, which wasn't enough for most things.
and second of all...
>Any shared sense of rigour is just completely torpedoed by the LLM world, particularly the cloud LLM world it seems, and we are reduced to cargo culting. Nobody is any more right or wrong than anyone else.
what "sense of rigour"? it's way too soon to put those rose-tinted glasses on.
>what "sense of rigour"? it's way too soon to put those rose-tinted glasses on.
I don't think OP is claiming that prior to LLM coding everything in the software development world was super rigorous (I assume that's effectively what you mean with the "rose-tinted glasses" comment). But rigor was actually possible and in a deterministic way too, which is fundamentally impossible with LLMs. You can build all kinds of guardrails and processes around LLMs that make it somewhat approach rigor again, but it's still fundamentally based on a bunch of statistical probabilities instead of deterministic, repeatable results.
All of the methods I see to mitigate the fundamental and inherent issues of LLMs seem roughly equivalent to the kind of crap you see in astrology groups or palm reading etc. You need Venus and Mercury to be in alignment while Mars is retrograde if you want to be able to get the right results from your token predictor.
Sure, but LLMs are non-deterministic in ways that no sane human ever would be. See the "Is it better to drive or walk to the carwash" scenario from a few months ago as one of many, many examples. Or a personal example I encountered just a week ago: I asked Claude (Opus 4.8 in case any of the "you aren't using the latest model that totally fixes that issue" types are interested) to convert a bunch of DB calls that currently use raw ADO.NET calls to use Dapper instead.
The projects in this repo were on .NET 4.8.1 and were still using the older format for the .csproj file instead of the newer (and far better) "SDK-style" format that Microsoft introduced a few years ago. It tried to use the dotnet CLI to add references to Dapper, even though the older format of .csproj doesn't work with that. The dotnet CLI returned errors about trying to add the package references for Dapper, which Claude completely ignored while continuing to try and convert the ADO.NET calls to Dapper. And at the end it tried building the project, which of course failed, and then it confidently informed me that the conversion had been completed successfully and that the build completed successfully and all tests were passing successfully, even though the output from the build it had done immediately prior clearly told the LLM otherwise.
A real human, despite being non-deterministic, would have caught the issue at multiple stages. They would have seen the error when trying to add the reference. If they ignored that then they would have seen the red squiggly lines all over the (deterministic) IDE telling them there was something wrong, along with autocomplete for Dapper calls not working. And if they continued to ignore those and managed to keep going anyways, they would have clearly seen that the build failed, with tons of errors specifically about references to Dapper failing to resolve. An LLM keeps going on its merry way in ways that effectively 0 humans would.
Programming has already become this way. Opinions about different languages and architectures are taste, or sometimes even just vibes. Few try to actually ask “can I quantify whether microservices or monoliths are better in terms of either maintainability or scaling?”
A lot of this is a result of systems having long ago exceeded the complexity threshold of things people can hold in their heads. There are too many layers, subsystems, languages, APIs, all glued together. Attempts at radical simplification fail because each of those layers and subsystems has features or behaviors someone needs, and a lot of it isn’t even documented.
AI takes this to the extreme. I’ve already learned that certain models have “personalities.” Some are more likely to go with you on magical journeys into hallucination while others are more critical. Some are better at detail while others seem better at abstraction but fall over on detail. Some are better instruction followers. All their quirks are complex and the systems themselves are impossible to understand.
Computer systems are becoming organic, biological.
"Feeping creaturism" has always been a problem, for sure.
But those technologies are layers, and there are reliable things that sometimes bubble across the boundaries — type hints, better code patterns to trigger compiler optimisation, interesting tricks with key column selection — and someone with expertise from that layer below can explain why, and their advice will always work in situations that are sufficiently similar.
You are right about AI personalities. Obvious even with the open weights models. Gemma and Qwen write code and documentation like people from different cultures. Because I guess they are a bit like that.
> But I am struggling to put into words how alarming I find the comments on threads like this — all sorts of good-natured anecdotes about how XYZ works for them that are more like the suggestions in pet care or cookery threads on Facebook.
It will always be this way going forward. Everyone thinks differently about problems. In the past we had experts and only they could do the work at a high level. But now we have many people that are cranking out expert level solutions without much knowledge. Worrying about the minutia is a dying trend.
Edit: I see I touched a nerve. But that is how it is now. You can't fight reality.
Your argument is that superstition is the way of the future and technical rigor no longer applies.
Because that's what OP is talking about. Superstition presented as factual advice instead of the technically rigorous and scientific fact that already exists.
You're being downvoted because you don't understand this fact, or indeed understand what you're saying at all.
I'll spell it out for you: technically and scientifically rigorous facts do actually exist, even in regards to LLMs. We can, in fact, obtain scientific and objective facts about how LLMs perform. It can be rigorously proven that certain context habits affect certain tasks positively or negatively. Your argument is that none of this matters more than superstition. And you're surprised that arguing to a room full of engineers and scientists that science is dead and superstition is the one true way forward gives you negative response.
There aren't any good facts that exist regarding LLMs. It's a black box. Also, do not presume to know what I understand or don't understand from one comment.
> I'll spell it out for you
You are a rude and crude individual. I am not interested in discussing anything further with you.
It's a black box, but you can run tests to quantify the behaviour and establish, for example, that a certain model is X% more likely to give a certain behaviour.
At some level, we've always delegated worrying about the minutiae to someone who builds the tool that is one or two levels below.
I usually don't have to worry about compiler optimisations because compiler experts do that; sometimes they appear in a thread about code and say "compiler guy here — if you write your code like this the compiler can optimise it".
And that person will be provably right (or wrong), in that context. And it'll be the same each time you run the test!
I just… ehh. You make a good point and I worry you are not wrong. It's all so different.
I like my 3D printing analogy much more than I wish I did.
I've been able to avoid context size issues by applying one simple constraint to my agent loop. What I do is prevent all tool calling in the user's top-level conversation thread. Anything that needs to tool call must happen in a recursive invoke of the agent, which returns whatever results to caller.
I can keep the same high level conversation going for an entire day over a million LOC+ codebase without ever hitting meaningful token limits. No compaction or summarization tricks needed. I can burn 50 million tokens in recursive calls and still not touch 100k tokens in my root conversation thread.
There is some rework needed to "bootstrap" the agent each time it has to descend back into Narnia, but this is still far more efficient than carrying around one big flat context that tries to cover everything all the time.
Recursion is very effective at controlling token use, but it can only go so far. I've not observed any uplift for recursive depth beyond 1. I have seen the agent attempt it a few times, but the practical performance is simply not there. External symbolic recursion does not appear to be something the frontier models have been trained for. They are fantastic at emulating recursion in context, but we don't want that if we are trying to achieve a reduction in token use.
You can do this in opencode and pi (haven't used), by defining your own agents or overriding the built-in ones, so in your primary agent you can disable all tools and give it good instructions for how to delegate
I imagine most harnesses should have a way to do this today, if they don't, get a new one. OpenCode i.e. is highly customizable, Claude and VS Code both support a ton as well including custom agents (though unclear if you can create custom top-level in claude-code)
Thanks, those don't deterministically prevent the main loop from using tools thought, unless I'm wrong that's just prompting the main agent on when to use specialized sub agents
you can configure tools, thinking, permissions et al on a per agent basis in the frontmatter
the main agent would be very different, basically an orchestrator, and you are "loop engineering" it, and turning off all the things for this main agent besides being able to run subagents
For anyone using Claude Code, ask it to do all the work in workflows (it has a tool for that), they released that feature together with Opus 4.8 and it also seems a bit better at doing long tasks as well. The main conversation just orchestrates the work at that point.
You can also just ask it to do work in a subagent. It will write a plan and launch the subagent to do the actual code, keeping it out of the main context.
In addition, you can co-author a plan for a biggish chunk of work, divided into stages, have it launch a subagent for phase 1 and check its work, then ESC-ESC to go back to just after you wrote the plan and have it do phase 2. Repeat until done. This keeps the overall goal in the main context for the review, but clears out previous reviews. Kind of like a workflow but with more control.
Claude Code seems to automatically do this in some cases. It seems to have some heuristic "will eat a lot of context" where it decides to dispatch a sub agent.
I see it pretty frequently in troubleshooting and data analysis flows where it will dump the data collection and aggregation into a sub agent then pull out a summarized result.
I'll do something similar where I have the main agent maintain context in a design doc/markdown file and update as it goes along. Then I can clear/restart/handoff at will
Might depend on the model. Haiku doesn’t like to delegate unless you ask it to. I have a custom command for “delegate plan, delegate code, delegate review”, but launching it with Haiku gives me mediocre results.
[User] Actual human prompt
[Agent] Attempted use of tool & hand slap
[Agent] call(projection of user's prompt relative to discovered tool constraints)
["User"] Prompt from above call
[Agent] Legal tool use
[Agent] ... until satisfied
[Agent] return(summary that satisfies the prompt for this level of execution)
[Agent] Additional call() invokes possible depending on returned summary
[Agent] Final return(summary) from root ends this turn of conversation and user sees summary
[User] Next turn of conversation initiated by actual human
I have a different way, but still trying to figure out how well it works. Instead of going into recursion, the agent is allowed to restart the thread by doing the summarize/debrief/reflect pass, writing key findings into persistent memory and rewriting the prompt whenever the context goes too large or it gets stuck. Recursion with TCO if you may.
In a way it's a generalization of the spec-driver approach, but in addition to the the formal spec the carryover buffer lives in the memory.
This is interesting to me because reducing context & token usage is in the user's best interest but not in the financial interest of AI vendors. I am not an expert but it sounds like your "one simple trick" would fix context issues and allow much tighter control over token usage. Thanks for being willing to share this tip in an HN comment, changing how those in the know use AI agents going forward -- it's hard to keep up!
> This is interesting to me because reducing context & token usage is in the user's best interest but not in the financial interest of AI vendors.
AI vendors still need to compete with each other both in terms of token cost and competency. An agent that is costly and less effective by wasting tokens is less competitive.
How do you get the agent to stick to it without constantly rejecting tool calls with the same description? I've tried a similar setup a number of times and it tends to forget about this constraint very quickly.
The tool itself enforces the constraint. This is deterministic. If an agent tries to read a big fat file in root, it gets an error from that tool's implementation that reiterates the requirement.
I don't bother warning it in the system prompt anymore. It's pointless. I let it bump its head as required. A few hundred tokens and the agent is back on track each time.
If the model isn't following the system/developer prompts easily, you might want to try a bigger/better model, tends to mostly be about model quality if it doesn't follow what you tell it to. Besides that, conflicting directions in the system/developer prompts can lead to the model seemingly ignoring instructions too.
This has not been my experience with Opus since Anthropic released the 1M token context window for use under the subscription plans. I routinely push past 500k tokens, even sometimes up to around 800k tokens, and don't see this problem. I've seen it to some extent when getting truly near the limit, up around and above 900k tokens, though what I see isn't as severe as the author seems to see.
(And I rarely fill the context window that far anyway when working on a single task, or a series of tasks that are related enough to warrant the same context; more typical is anywhere between 200k and 600k or so.)
I'm not saying that no one ever has this experience, but it's odd to me that some people see it so often that it warrants giving it a name.
I don't use Claude Code. I use my own handwritten agent (formerly using Pi) and know every token that goes into it. There are zero memories to confuse it. The system prompt is 200 tokens and completely self consistent.
Plus I've found that the only time models go above 100k tokens anyway is when they've started looping at which point it's much better to go back anyway.
Anecdotally most models know their recall is terrible (or have been trained to act as such), that's why they constantly reread files before editing or while reasoning.
I read it as a models performance being random and observed differences in the opinions are the results of the overinterpretation of the random outcomes.
I think however that some people seem to be always lucky which indicates that it is not random but rather some fixed differences between people and their environments.
I think that's issue, rather than 60K being small.
Most of the actual edits/changes I request to codex are solved within 100-150K tokens, beyond 200K I'd definitively try to restart the session as soon as I could as all models are horrible once you get across ~20% of the total context size. And this is while working on +million LOC codebases.
Problem I guess is that there is no solid and concrete evidence of this (to me [and others seemingly] obvious) degradation, but should be easy to prove, yet no one has time to sit down and show it :)
But the likelihood of a model getting minor details wrong once you're above some magical threshold between 15-20%, seems to skyrocket, and I hit that issue sufficient amount of times that now my workflow is trying to prevent that.
what are y'all doing to hit that? Do you just not give it any pointers and let it churn away? What kind of context are you handing off?
I routinely get claude to do things pretty decently and finish up easily in the 4-5 digit range of tokens. It seems to be doing the right kind of thing to not waste its time looking at 1000 files.
I usually see this when the context gets "tainted" as I call it. The model gets stuck on a bad path and there's no way to bring it back without clearing the context and starting again.
Frequently it'll be something as small as 1 sentence of a prompt many messages ago.
When cases like that happen, I reset the context and try to be explicit about assumptions and requirements to keep it off the "tainted" path. Other times it's actually useful and agents will do things they normally wouldn't do once the state is tainted. For instance, if you're testing a chat bot's ability to stay on topic, you can seed the context early with what you want it to do. It generally will refuse initially but later on in the conversation it will still silently take that seeded context into account almost "subconsciously" and become more likely to do the thing it originally refused.
I'm always a bit confused when people say things like this. 60k token is often more than the initial context I feed the model with. And I don't think I ever had a productive session that began under 150k tokens.
Bit of what makes it so fun, our experiences seem to wildly differ! On one hand, you have experiences like yours, but then my own experience is that I never had a productive session when the scope grows beyond 150K tokens! If I needed 60K just as a starting context, I'd take that to mean the suggested change is way to large, and if the model cannot solve the entire thing within maybe 15-20% of the total context size, divide and conquer is needed otherwise there will be a lot of time wasted to patch things up when things are "completed".
Yeah indeed it's very interesting. And the 60k initial context don't even contain the suggested change yet. For me if I don't do this the current models tend to fixate and local patches instead of tracing symbols and making a holistic model of what a change interacts with in the codebase
I hate to do the "you're holding it wrong" trope, but I think you might have something misconfigured somewhere unless you missed a 0, because just past 60k tokens is such a small context window to be seeing issue in.
Do you have any old documentation that it's picking up and referencing? If you set all claude settings back to default do you see the same issue?
Opus 4.6 was on drugs past 200k, I skipped 4.7, 4.8 did good up to ~350k, and Fable did great beyond 400k, in my limited testing. The quality does appear to be trending upwards.
agreed. the claudes have been getting better and better with every release in this regard.
opus 4.5 would start failing tool calls when approaching its 200k limit, opus 4.6 could get to ~300k before getting confused, opus 4.7 i could stretch to around 400k the dumb zone started, with opus 4.8 i've had sessions get over 500k comfortably.
admittedly we only had limited time with fable, but i had a couple sessions get into 800-900k just fine.
I often push past 300k or so and I’ve absolutely worked at 800k but it’s an observable problem. Large context windows can work depending on the problem but I do feel more effective biasing towards small ones <300k.
I have a custom build command for a rust project (yarn build:lib) and my experience is 120k for GLM and roughly 200-300k for Opus. After that, they default to cargo build.
My projects have specific build/verify steps as well, and after a certain point Claude forgets to run them. I’m going to try a “No brown M&Ms” hook to halt Claude if it tries to run the default command instead of the instructed commands from CLAUDE.md. Perhaps this will be a good signal that a compacted or fresh session is needed at that point to avoid mistakes.
I mean, that’s basically the magic of the harness. The whole thing that skyrocketted the intelligence is that the harness (cli tool) prevent the LLM from editing the file before reading it.
Can you imagine even a junior making such a mistake?
There’s a simple way to solve this: just use Codex. The auto-compaction is really good, and lets threads go on for a long time without losing track. In case you do notice a session is starting to go off track, it’s straightforward to make a new session, ask it to summarize an old session into an AGENTS.md, and start it from there.
Opus in recent versions is fine beyond 100k, but I usually do try to keep it under 200k.
But, this is also why so-called "memory" systems are usually a mistake that make the models dumber. They don't have memory, they only have context, and every irrelevant fact you shove into the context is less context for the problem. Less distractions, better results.
The way to have the agent remember things is to have it document its work, like a human developer would do if they wanted their project to be friendly to other developers working on it. Good developer docs with an index page and a good plan with checklists, in concise Markdown files, checked in to the repo is the ideal memory for models and the ideal docs you need to figure out WTF the model has been up to. Helps with code review, too, whether by humans or another model. There's no down side.
At least for me, Opus keeps writing stuff to memories, only to consistently forget checking those memories before doing the same mistake again. This ("remember to check memories!") is of course then again written as a memory... Clearly not a very well working system, yep.
Yeah, I see it write stuff to memory pretty regularly, maybe it works sometimes, but for things I want it to stop doing or always do, I make it impossible to do otherwise via lint or some style enforcement, or via a test that fails if code shows up that violates the constraint.
But, it does a good job following existing conventions in a codebase, as long as they're really consistent. So the more actively you enforce that consistency the more likely it is to do the right thing without memories or prompting.
I don't like "never do" or "always do" type rules in AGENTS.md or in memory, as it often over-interprets them and ties itself in knots trying to satisfy an impossible set of goals.
In my own multi agent framework I use cheap models to check the responses of the expensive models, as well as using multiple expensive models adversarially in debate. The cheap models are great at spotting eg the model getting stuck in the alternate between two broken ideas or not following code conventions or missing a step in the skill and so on. I’m currently working on making them detect user corrections and police that going forward to intervene when the expensive models forget the thing you just corrected them about etc.
I've explicitly banned Opus from creating memories unprompted, as it would often save info that's incorrect and which would then be propagated to future sessions until caught. Ugh x 10.
Almost every comment here is appealing to personal experience. By contrast, OP refers to two studies that compare performance on some kind of standardised test over a range of models.
Can't speak to how good those tests are, but they can't be worse than anecdotal evidence for something as vague/subjective as LLM performance.
I'll respond with more anecdotal evidence, the Llama family has been terrible at following directions in all the tests I've done--not sure about the other models in RULER.
In the Chroma results, they look at Sonnet 4 which was also terrible in my experience. The same prompt that worked perfectly in Sonnet 4.5 would fail miserably in Sonnet 4
Would be good to see newer tests with both SOTA and open weight. The SOTA ones always seem to follow directions and stay on topic better but it'd be good to have some data to back it up.
I'm getting a lot of mileage out of basically acting like the AI's Product Manager, and insisting that it writes up short PRDs for every feature we propose to build. That gives it a reference over time of everything that has been built, but also makes it less liable to drift with each one. Each one gets its own conversation. For me this is a happy medium between stopping it going off the rails but also making sure it can reference past decisions when it needs to. The one thing I dislike about Pocock's method (not to use PRDs so much but to have an in depth discussion to get alignment) first is it wastes a lot of the best window on that initial back and forth.
Is it adhoc or you use more structured approaches like openspec? I also tend to work on a plan first, but it stays as in-session todo, which is hard to reference later.
It's ad hoc / my own framework, just found something which works for me. The exact structure is
- Work Mode - HITL/AFK
- Problem Statement
- Who It Affects - Primary / Secondary User
- User Stories
- Business Case
- Why Now
- Success Critera
- In Scope/Out of Scope [Out of Scope v. important)
- Thinnest Slice (This I've found super valuable, means you max out the amount of 'product' for your buck and avoid diminishing marginal returns or overbuilding. Often I will build this)
- Eigenfeature - What is the larger feature we _could_ (but probably won't) which would solve for this use case and other stuff I might not have thought of
- Technical Notes
- Deps
- Schema Changes
- Risks
- Final Recommendation [go / no go, including on scope]
There's a note in my Claude / Agents MD which says no net new feature gets introduced without this and I get it to move through a pipeline of folders (active, approved, shipped, proposed etc). All runs in a system of MD files and have even created a little MD Kanban from the metadata!
I guess I've stumbled into something similar. Though I don't have a fixed format like yours. I first do a lot of back and forth to generate what I call a design document also includes rationales for various points or decisions. I use both Claude and Codex to iterate on this until I'm happy. The end result includes a lot of what you mention.
I then start a fresh conversation, make it analyze the design document and code, and for larger changes, generate a high-level implementation document which includes concrete phases or steps. I review this plan and iterate if necessary.
Then for each phase I make it generate a detailed plan for that phase and save it along side the other documents. Once the phase is over, I make it write a summary of what was done, decisions made and reasons for it. And typically a good point to compact the model's context.
These documents gives additional context for when I make another model do code review, and help illuminate drift or gaps from the main design document.
I found myself in a similar workflow.
Depending on the task at hand (starting a new project, enhancement, maintenance), I let the agent create/read the markdown files that I keep updated (AGENT, STATE, ROADMAP, DESIGN, ARCHITECTURE, (CODESTYLE if I plan to modi it myself)). Then I choose the various roles that I need in this session and and have a planning phase. After that, the agent is starting implement the changes and I have a manual correction phase.
This flow works for my needs, building idea demos, prototypes or tools for my own sake.
I don't let agent code in our main code base where everything is still hand tailored. That's a conscious decision.
I noticed that the cheaper models (flash, ...) are quite hard to hold back changing files. A question for possible options sometimes results in "yes, I'll go with option A" without asking back.
Frontier models on the other hand love to plan and ask you deliberately for your consent.
I use pi.dev with almost no skills at all to understand how models really work and "feel" to work with.
Working in the era of 200k context window meant I had to narrowly scope tasks to fit in the context window, forcing me to think about how to reduce complexity and naturally resulting in atomic work. 1M context windows and the promise that the latest models are "better at long running tasks" made me lazy in how I scope tasks and quality got worse. I now went back to narrow-scoping one session per task and zero compaction, trying not to go past 400k context window. If I end up with a long session, I was likely too ambitious and should have broken up the task.
Considerations about what goes on in agents internally will probably not be part of software development for long.
Personally, I already see LLMs and agents as blackboxes. I give each feature request to multiple LLMs and then compare the results. I don't manually use "sessions" at all. I just look at the outcome. When I dislike it, I "git reset --hard", change my prompts and restart the feature request.
To have an ongoing sense of which agents perform best, I keep a log and calculate an ELO score of which agents meet my demands best. This score is imporant to me, not so much how the agent achieves it.
Unless we do our own benchmarks, we have to take all the marketing fluff from the frontier labs at face value, and all public benchmarks degrade eventually as labs optimize towards them. OP’s approach is wasteful because it is brute force, but post says that an ELO is kept, so this is also an experiment, and I don‘t see what‘s wrong with that. You learn which model performs well in which settings which may save resources later. It‘s also wasteful to keep working with the wrong model/harness/tools for too long.
In an interactive session, adding "Fine, but make the button red" after the model generated a first solution more than doubles the tokens used. As the model now not only gets the original code and the feature request but also the updated code plus the change request as input tokens.
Sending a feature request to an LLM and then sending the feature request again with "The button shall be red" only doubles the tokens used.
I wrote my own agent, and it sends data to LLMs in this order: "General Prompts (How to write good code)" + "The Code" + "The Feature Request". This means the KV cache will be used even when the feature request changes.
And output tokens are usually way less than the input tokens.
So I think that my approach is very lightweight on token usage compared to an interactive session.
It would be interesting to measure it for the other agents out there. Sending a feature request two times vs an interactive session.
That’s usually not true due to caching. It may be true if you leave a large gap in between, but if you send “make it red” right after, then it’s purely incremental
The cost is nothing compared to the outcome and time savings. What I see is that people with no money want to jump into this pool but they aren't having a good time. That is generally the case when you are poor.
I do my own framework and spend a lot of time trying to debug this and it’s not so much the context size in hard numbers but rather the probability that there is debris or wrong directions in the window that are drowning out the things the user thinks are important.
This manifests in the llm that keeps going back to doing the thing that failed when they tried it just before the last approach etc. The frequency of things in the context window give weight even if they are the wrong things.
I have a lot of tricks like not giving the llm lots of tools but rather giving it a tool it can use to search for tools etc.
But the bigger solution is in process where you use something like superpowers to force the llm through stages and you control the context that carries forward.
I think of the context window as a pot of soup that you add ingredients to between meals. If you have a relatively focused recipe and you are able to add only the ingredients you want, the soup stays good. If you or the agent add an ingredient that isn't fresh, it is going to be difficult to salvage and it is better to start over with a new pot.
It is not that agents can't function with a large context window, they can if that information generally has a desirable signal (like a large initial document or a well-focused session). Mistakes and the confusing signals that come out of fixing mistakes are why performance degrades. I start to trust the context window less not as a matter of size but the amount of friction we run into. The friction can be random but it is more often an issue with the path that I have us on.
I doubt the dropoff is as large as 100k tokens. I start a new session and paste the best results from the previous one as soon as as LLM makes more than a couple of missteps. Theres too much focus on fixing what's wrong rather than going back to what worked and amending in a different way.
If you don't point out what's wrong I find the LLM will go into great technical detail which consumes a lot of tokens, but not 'see the wood for the trees'.
It seems to me human beings also have mechanisms to compact context, which may be why we can forget what we came into a room for when going through doorways. I think it would be interesting to research which markers we use to compartmentalize our thinking.
I built a very small personal extension for Pi [1] that gives me a /last command. It clears the entire session, only retaining the agent's last output message. This allows me to do manual "compaction". Basically I tell the agent something like "state the plan as discussed with references to files that should be edited", and call /last, then tell it to implement.
> the dumb zone, where attention drops off and the model starts forgetting what you told it five minutes ago
I use opus 1m context all day every day at work and I simply have never encountered this. I don’t even think about context windows anymore I just let it do what it wants re compaction. Hard for me to understand where this article is coming from.
I dislike the non-specificity of "models" here. Different models have different attention architectures, and can therefore have significant differences in long-context behavior. It's true that long context is an issue can most models do drop off in quality, but I would not extrapolate behavior of old models to new ones.
I'm actually doing a big refactoring in a project where if everything gets loaded (code / docs), the context gets like 750k filled (Opus 4.8), and then the agent has the remaining ~200k to do actual coding, until I have to reset. I haven't finished the work but I'm like 80% there, and it seems the progress is good and the quality is also good, verified by doing some performance tests and a lot of comparisons between outputs between the original code and the new one.
Maybe I could achieve better and quicker results with keeping the context in the proper zone, but trying it will have to wait until the next project.
Funny to read about that superpowers repo, since only yesterday I wrote skills to do some markdown-plan centered aproach. I feel like smallish local models are getting capable of lots of things now, but they need lots of structure for resiliency.
Yeah I’ve been using gpt-5.3-codex-spark in Codex lately and it can be surprisingly good and it’s super fast. However it needs more explicit instructions.
I've had no problem with Claude Code Opus 4.8 effort max using 20% token context (200k) on software development tasks (all stages). I aways load core source files and the ones we are working on up front. Around 20%, I make it autoprepare for a new session and clear.
Admittedly I have been doing this precautiously, based on anecdotal evidence, not because I had bad experiences with longer context deterioration myself.
In the brief time I had access to Fable 5, it went on long running tasks (>45 mins) into the 30-40% zone without apparent context coherence problems.
Why is it surprising that, at some point, more information will lead to worse performance?
It seems obvious. Moreover, in a simple model, it seems like whatever tokens you do add have to have MORE information than the average in the existing window.
In a non-trivial model (and this is the model I would choose), since you are adding them to the end, they likely have to have MUCH more information.
I /clear all the time out of habit. I want to be able to get the thing done with minimal context. It also means you can do it again slightly different if needed, you know the seed conditions for the task.
100% with the author on that one, albeit the performance decay seems to depend on the type of task for me. Simple plumbing tasks seem to run okay with longer running contexts.
Also, some colleagues were playing around with RTK (https://github.com/rtk-ai/rtk), which decreases the amount of token used by tool calls and, although it seems an interesting idea, I am pretty sure there are many caveats. Although, I believe if these type of tools prove to be efficient enough, perhaps harnesses will have them natively.
Considering how expensive context is in terms of compute, I wonder why (and if ) vendors don't invest more into context engineering.
When it comes to source code, I feel like LLMs could just as well work with something like minified source code, if an LLM is trained on programming well, I think there's no reason why something like a variable should be represented by something more than a single token. Comments can be discarded, etc. In fact considering embeddings for LLMs are very rich, I think common ops could be reduced to a single token.
Imo that's why LLMs are soo good at reverse engineering. A lot of the time, assembly (with symbols) is pretty close to the source code, but compressed and encoded, and if you're familiar with the patterns of your compiler, reversing it is not that difficult.
Anyways, context engineering could be huge boon to input token curation imo (and maybe it already is)
The approach we're taking to deal with this very real context rot is using a bunch of related techniques which we call transposing the agent loop: https://alejo.ch/3jt
In essence, we run many short agent loops, generating their prompts dynamically from structured data. Each loop advances the state in a small step towards the final goal.
Can anybody explain me why just not limit the context window to something smaller instead of all that context engineering? It forces things to be constrained.
I wonder how much this depends on the quality and consistency of the context?
For example, it may be the case that a long context full of useful information relevant to the task is completely fine, perhaps even beneficial. And if the context contains a bunch of unrelated tangents and conflicting instructions, then it will be detrimental.
Have there been studies on what makes models get dumber? To what extent is context length to blame vs context quality?
There's an env var you can set in Claude Code to bring the autocompact threshold down, effectively setting your own max context window. I have it at 400k.
Perhaps compacting the context can be made in multiple requests over smaller and overlapping chunks to avoid using the 'dumb zone', and for yielding a better result.
Even taking the author's criticism about large context windows for granted, which in my experience are exaggerated, they are still a huge UX improvement over short windows. That reason alone is enough for me to support them.
In my own testing I have seen peak performance happen usually within 15-20% of the intended context limit, albeit there are a few optimizations depending on the task quality.
Long context generation is a sampling problem. Set your opencode to use a modern sampler like min_p or newer and you'll see models behave better at longer context.
i let the main loop spawn sub terminal via tmux to prevent large contexts. it's great to divide tasks in small patterns and consolidate it step by step.
The problem with "context rot" is that its existence and severity is purely anecdotal. As far as I know, nobody has actually measured context rot systematically. The only thing we know is that memory degrades somewhat in long contexts, via things like needle in haystack tests. But that's not the same issue. Context rot is usually taken to mean that the model gets dumber even if it doesn't need to remember specific things in its context window.
This would be really easy to measure. Just take some standard benchmarks, but fill up the context beforehand. Is the benchmark performance degraded? If so, by how much?
It's pretty hard to measure because most context rot comes from related context and the model has to be able to figure which parts are truly relevant, which ones are relevant but stale, which ones to ignore etc.
Each relevant thing is basically a rule. Trying to so something with 500 rules is what's hard.
If you take a standard benchmark and just prepend a random book to it, it will not capture that
context window size isnt quite the issue though, its that the attention mass kinda spreads out too much and everything kinda converges to a sortah global average region full of what we know to be slop! theres some really cool ways at the harness or model layer to mitigate this. just isnt really prioritized by the labs often.
I guess I am mostly enjoying learning the fundamentals of AI stuff, even though I disagree with the direction it is going.
But I am struggling to put into words how alarming I find the comments on threads like this — all sorts of good-natured anecdotes about how XYZ works for them that are more like the suggestions in pet care or cookery threads on Facebook.
(Or worse still, like any Facebook 3D printing group: anyone who prints but wants to understand what is actually going on will know what I mean, I think)
Any shared sense of rigour is just completely torpedoed by the LLM world, particularly the cloud LLM world it seems, and we are reduced to cargo culting. Nobody is any more right or wrong than anyone else.
Have you tried cleaning your context with dawn dish soap, letting it dry and then adding a layer of glue stick?
--
ETA: I don't want to sound so mean about people who try to help, here or in facebook groups. I guess I just find these threads so different to threads on more or less any other topic, where someone's suggestion can be debated or refined by other commenters and then someone will explain a thing about how bash history selections work that will change your entire life. With these threads they devolve to "isn't it weird that threatening it works?"
> Any shared sense of rigour is just completely torpedoed by the LLM world, particularly the cloud LLM world it seems, and we are reduced to cargo culting. Nobody is any more right or wrong than anyone else.
There was always some of this in the tech world, long before LLMs came along.
I've sat in so many meetings when decisions were made based on "that's what _slightly more prestigious company_ does" rather than objective measurable criteria. (And the evidence that the thing in question wasn't universally followed by _slightly more prestigious company_ carried surprisingly little weight).
Absolutely I agree there has always been some cargo culting going on; that's true of all process-oriented businesses.
But people are now individually acting this way on their desks on an hour by hour basis. LLMs make cargo-culting inevitable because they are inscrutable and opaque.
There is always this sense in the LLM-proponent world that LLMs are at any moment as bad as they are ever going to be; line goes up.
But it seems clear that the gap between perceived and measurable productivity is still likely spent in poking entrails with a stick.
We are so used to probabilistic tools that have significant setup time before they become valuable and save us loads of time that we're at risk of repeatedly writing off that setup time without seeing the rewards, believing that one day it will actually work out that way.
(Which is most recognisable from the early JS frontend frameworks era.)
Meantime here we have an article that shows that a thing (longer context windows) that people thought would functionally solve a problem so we would get the value from all that setup does not, in fact, very meaningfully kick it down the road, and the comments are still full of entrails-and-stick work.
> Any shared sense of rigour is just completely torpedoed by the LLM world
Consider that this shared sense of rigour you have in mind is illusory, and LLMs and their context struggles are simply revealing this. I see precious little rigour in any of the 'tech' world I've lived in for decades. The tools proliferate, paradigms emerge and die and reemerge, and whatever stick you consider using to measure any of it has competitors with different units. Past the physics of power and signaling, and the prevailing cost of a silicon wafer, we are almost all, relative to a small number of much older disciplines, muddlers of various degrees of skill.
I've found dealing with context limits relatively easy: specify and confine. LLMs need clear specifications and strong guidance to produce good work.
But that's just my current muddling take on the practice. Perhaps, 90 days from now, even this burden will be gone, and a simple prompt will generate world class operating systems, programming languages and a formal basis in mathematics for both.
The arbitrary and non-deterministic nature of LLM workflows gives me full on ick. As an old embedded/systems guy I have always prioritized determinism and repeatability in my workflows.
But damn, agents are amazing and I'm enjoying being a "thought process designer". I'm not going back. Even if AI development stops today my career will never be the same.
I felt the same way about the non-determinism but realized it can be really beneficial to have a machine that can fairly reliably turn non-determinism into determinism.
I’m working on a tiny agent harness at home to learn and the process of taking human speech and turning it into agent tool calls that output something generally deterministic depending on how the tool is defined is so interesting.
One of the big takeaways is you really only have to rely on the non-determinism<->determinism translation layer once when you switch between the two domains. You can obviously rely on it more if you want, and that’s probably faster because determinism is hard, but you don’t need too do that.
That sounds very cool. It’s sometimes baffling that LLMs can’t use tools reliably. Serena and Semble both require some arcane instructions to coerce Claude Code into compliance. Just stop trying to pipe nonsense commands into each other, man!
It's like working with humans.
Can't help but feel like a lot of people who are deep in IT made it there because they hated working with humans.
This has always been a thing with IT advice, though - the more complex a system and the outcome, the harder it is to clearly define "better" or "worse". Add in the fact that LLMs are intensely and emphatically non-deterministic and LLM guidance basically becomes gardening advice.
Heck, even the 'benchmarks' are mostly somebody's attempt to crystallize their vibes with varying amounts of success.
> LLMs are intensely and emphatically non-deterministic and LLM guidance basically becomes gardening advice
Have you ever tried doing evals on moderately complex but bounded tasks?
I spent some time doing it when testing these "token reducing" tools like Headroom, RTK etc. as well as customizing my Pi tools. What I found interesting was that despite LLMs being deterministic, for a given toolset and prompt, the results were highly consistent for a given eval, across multiple models (I tested at the time using GPT 5.4 mini, 5.5, 5.3 codex, Gemini 3 flash, initially running sets of 5 evals on each task but once I realized how consistent the results were, dropping to sets of 3.
Aside: in my tests, RTK and Headroom made the overall context use higher for roughly equivalent results. The context use for those specific toolcalls went down but the number of model turns and overall context use went up.
Gardening advice. Better analogy.
It's not just you! Here's a lovely quote from an influential paper, "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence." I think people went through a similar phase with steam engines. Lot's of practical engineering and heuristics to explain what works, before the emergence of a solid theoretical foundation (thermodynamics) to explain why.
https://arxiv.org/pdf/2002.05202
If you want my best guess: I think large context windows cannot be trained properly. There's not enough material, nor computing power, to train such large networks (to the same degree as small windows).
What sense of rigour is going to be in a field (LLM usage as a user) where models, context sizes, tooling and broadly "rules" (scary quotes) change every few weeks? There is no literal change to have a scientific approach to anything, churn is too high, there are papers about model XYZ v 12345 from a few months ago that are already old because there is model ABC on version 54321 that addresses half of the issue shown in the paper and add 3 new problems though.
With benchmarks, you can re-run them after a change. A measurement in a paper will go out of date quickly unless turned into a benchmark.
This lack of rigour feels a lot like “did you try restarting the computer? Most of the time, others tried restarting the computer and it works”
I feel your frustration for sure and agree to a large extent. Any attempts I’ve made to try to formalize any LLM-based workflows has resulted in me being again dismayed that no one seems to have any real idea of how or why certain things work or don’t work. So I just go back to /plan and “write this down in a markdown document for posterity before we iterate on the implementation”, hoping that maybe next month there might be something a little more rigorous with some kind of rational backing.
> Have you tried cleaning your context with dawn dish soap
I don’t do the glue stick thing at all because I don’t need to, but Dawn really seems to do a good job at getting my Bambu build plate working again. I didn’t seek it out specifically, I already had some for doing dishes. IPA hadn’t worked so I tried Dawn and it has gotten me back having prints stick multiple times now. Not quite up to N=30 yet.
first of all, LLM-assisted coding is less than 3 years old. 3 years ago all we had was GPT-4 with 8192 token context, which wasn't enough for most things.
and second of all...
>Any shared sense of rigour is just completely torpedoed by the LLM world, particularly the cloud LLM world it seems, and we are reduced to cargo culting. Nobody is any more right or wrong than anyone else.
what "sense of rigour"? it's way too soon to put those rose-tinted glasses on.
>what "sense of rigour"? it's way too soon to put those rose-tinted glasses on.
I don't think OP is claiming that prior to LLM coding everything in the software development world was super rigorous (I assume that's effectively what you mean with the "rose-tinted glasses" comment). But rigor was actually possible and in a deterministic way too, which is fundamentally impossible with LLMs. You can build all kinds of guardrails and processes around LLMs that make it somewhat approach rigor again, but it's still fundamentally based on a bunch of statistical probabilities instead of deterministic, repeatable results.
All of the methods I see to mitigate the fundamental and inherent issues of LLMs seem roughly equivalent to the kind of crap you see in astrology groups or palm reading etc. You need Venus and Mercury to be in alignment while Mars is retrograde if you want to be able to get the right results from your token predictor.
Astrology? And I thought I was being overly harsh with the 3D printing comparison ;-)
Aren’t human coders non-deterministic? There’s no guarantee two people with otherwise identical levels of experience will always write identical code.
Any software engineering practice that had enough review and feedback to work with humans should work more or less the same with AI coders.
It’s when someone tries replacing an entire team or an entire process with a single prompt that they get in trouble.
>Aren’t human coders non-deterministic?
Sure, but LLMs are non-deterministic in ways that no sane human ever would be. See the "Is it better to drive or walk to the carwash" scenario from a few months ago as one of many, many examples. Or a personal example I encountered just a week ago: I asked Claude (Opus 4.8 in case any of the "you aren't using the latest model that totally fixes that issue" types are interested) to convert a bunch of DB calls that currently use raw ADO.NET calls to use Dapper instead.
The projects in this repo were on .NET 4.8.1 and were still using the older format for the .csproj file instead of the newer (and far better) "SDK-style" format that Microsoft introduced a few years ago. It tried to use the dotnet CLI to add references to Dapper, even though the older format of .csproj doesn't work with that. The dotnet CLI returned errors about trying to add the package references for Dapper, which Claude completely ignored while continuing to try and convert the ADO.NET calls to Dapper. And at the end it tried building the project, which of course failed, and then it confidently informed me that the conversion had been completed successfully and that the build completed successfully and all tests were passing successfully, even though the output from the build it had done immediately prior clearly told the LLM otherwise.
A real human, despite being non-deterministic, would have caught the issue at multiple stages. They would have seen the error when trying to add the reference. If they ignored that then they would have seen the red squiggly lines all over the (deterministic) IDE telling them there was something wrong, along with autocomplete for Dapper calls not working. And if they continued to ignore those and managed to keep going anyways, they would have clearly seen that the build failed, with tons of errors specifically about references to Dapper failing to resolve. An LLM keeps going on its merry way in ways that effectively 0 humans would.
TBD on if the calculator can properly review and participate in the feedback loop with itself.
They also don't learn, so they never get less unpredictable. You can't give the senior robot the production keys and expect it won't delete prod.
Programming has already become this way. Opinions about different languages and architectures are taste, or sometimes even just vibes. Few try to actually ask “can I quantify whether microservices or monoliths are better in terms of either maintainability or scaling?”
A lot of this is a result of systems having long ago exceeded the complexity threshold of things people can hold in their heads. There are too many layers, subsystems, languages, APIs, all glued together. Attempts at radical simplification fail because each of those layers and subsystems has features or behaviors someone needs, and a lot of it isn’t even documented.
AI takes this to the extreme. I’ve already learned that certain models have “personalities.” Some are more likely to go with you on magical journeys into hallucination while others are more critical. Some are better at detail while others seem better at abstraction but fall over on detail. Some are better instruction followers. All their quirks are complex and the systems themselves are impossible to understand.
Computer systems are becoming organic, biological.
"Feeping creaturism" has always been a problem, for sure.
But those technologies are layers, and there are reliable things that sometimes bubble across the boundaries — type hints, better code patterns to trigger compiler optimisation, interesting tricks with key column selection — and someone with expertise from that layer below can explain why, and their advice will always work in situations that are sufficiently similar.
You are right about AI personalities. Obvious even with the open weights models. Gemma and Qwen write code and documentation like people from different cultures. Because I guess they are a bit like that.
It's in the hype train's interest to keep the actual value unknowable. If you quantify what you're paying for then the FOMO is greatly reduced.
> But I am struggling to put into words how alarming I find the comments on threads like this — all sorts of good-natured anecdotes about how XYZ works for them that are more like the suggestions in pet care or cookery threads on Facebook.
It will always be this way going forward. Everyone thinks differently about problems. In the past we had experts and only they could do the work at a high level. But now we have many people that are cranking out expert level solutions without much knowledge. Worrying about the minutia is a dying trend.
Edit: I see I touched a nerve. But that is how it is now. You can't fight reality.
Your argument is that superstition is the way of the future and technical rigor no longer applies.
Because that's what OP is talking about. Superstition presented as factual advice instead of the technically rigorous and scientific fact that already exists.
You're being downvoted because you don't understand this fact, or indeed understand what you're saying at all.
I'll spell it out for you: technically and scientifically rigorous facts do actually exist, even in regards to LLMs. We can, in fact, obtain scientific and objective facts about how LLMs perform. It can be rigorously proven that certain context habits affect certain tasks positively or negatively. Your argument is that none of this matters more than superstition. And you're surprised that arguing to a room full of engineers and scientists that science is dead and superstition is the one true way forward gives you negative response.
There aren't any good facts that exist regarding LLMs. It's a black box. Also, do not presume to know what I understand or don't understand from one comment.
> I'll spell it out for you
You are a rude and crude individual. I am not interested in discussing anything further with you.
It's a black box, but you can run tests to quantify the behaviour and establish, for example, that a certain model is X% more likely to give a certain behaviour.
At some level, we've always delegated worrying about the minutiae to someone who builds the tool that is one or two levels below.
I usually don't have to worry about compiler optimisations because compiler experts do that; sometimes they appear in a thread about code and say "compiler guy here — if you write your code like this the compiler can optimise it".
And that person will be provably right (or wrong), in that context. And it'll be the same each time you run the test!
I just… ehh. You make a good point and I worry you are not wrong. It's all so different.
I like my 3D printing analogy much more than I wish I did.
I've been able to avoid context size issues by applying one simple constraint to my agent loop. What I do is prevent all tool calling in the user's top-level conversation thread. Anything that needs to tool call must happen in a recursive invoke of the agent, which returns whatever results to caller.
I can keep the same high level conversation going for an entire day over a million LOC+ codebase without ever hitting meaningful token limits. No compaction or summarization tricks needed. I can burn 50 million tokens in recursive calls and still not touch 100k tokens in my root conversation thread.
There is some rework needed to "bootstrap" the agent each time it has to descend back into Narnia, but this is still far more efficient than carrying around one big flat context that tries to cover everything all the time.
Recursion is very effective at controlling token use, but it can only go so far. I've not observed any uplift for recursive depth beyond 1. I have seen the agent attempt it a few times, but the practical performance is simply not there. External symbolic recursion does not appear to be something the frontier models have been trained for. They are fantastic at emulating recursion in context, but we don't want that if we are trying to achieve a reduction in token use.
This makes intuitive sense. Can I ask what harness you're using that allows you to configure the constraint and how?
It's a custom agent loop. There are no other parties involved here. Just vanilla C#/.NET and the OpenAI DLL.
I would also be really interested in seeing this if you’re willing to share it.
Are you going to open source it
You can do this in opencode and pi (haven't used), by defining your own agents or overriding the built-in ones, so in your primary agent you can disable all tools and give it good instructions for how to delegate
I imagine most harnesses should have a way to do this today, if they don't, get a new one. OpenCode i.e. is highly customizable, Claude and VS Code both support a ton as well including custom agents (though unclear if you can create custom top-level in claude-code)
https://opencode.ai/docs/agents/
https://code.claude.com/docs/en/sub-agents
https://code.visualstudio.com/docs/agent-customization/custo...
Thanks, those don't deterministically prevent the main loop from using tools thought, unless I'm wrong that's just prompting the main agent on when to use specialized sub agents
you can configure tools, thinking, permissions et al on a per agent basis in the frontmatter
the main agent would be very different, basically an orchestrator, and you are "loop engineering" it, and turning off all the things for this main agent besides being able to run subagents
for opencode:
https://opencode.ai/docs/agents/#permissions (what tools, mcp, etc...)
https://opencode.ai/docs/agents/#task-permissions (what subagents it can call)
https://opencode.ai/docs/agents/#additional (thinking effort)
For anyone using Claude Code, ask it to do all the work in workflows (it has a tool for that), they released that feature together with Opus 4.8 and it also seems a bit better at doing long tasks as well. The main conversation just orchestrates the work at that point.
You can also just ask it to do work in a subagent. It will write a plan and launch the subagent to do the actual code, keeping it out of the main context.
In addition, you can co-author a plan for a biggish chunk of work, divided into stages, have it launch a subagent for phase 1 and check its work, then ESC-ESC to go back to just after you wrote the plan and have it do phase 2. Repeat until done. This keeps the overall goal in the main context for the review, but clears out previous reviews. Kind of like a workflow but with more control.
Claude Code seems to automatically do this in some cases. It seems to have some heuristic "will eat a lot of context" where it decides to dispatch a sub agent.
I see it pretty frequently in troubleshooting and data analysis flows where it will dump the data collection and aggregation into a sub agent then pull out a summarized result.
I'll do something similar where I have the main agent maintain context in a design doc/markdown file and update as it goes along. Then I can clear/restart/handoff at will
Might depend on the model. Haiku doesn’t like to delegate unless you ask it to. I have a custom command for “delegate plan, delegate code, delegate review”, but launching it with Haiku gives me mediocre results.
So what does the top level thread look like? "Make foo() do bar" (Subagent invoked) "Job finished!"
The top level and N+1 looks like:
I have a different way, but still trying to figure out how well it works. Instead of going into recursion, the agent is allowed to restart the thread by doing the summarize/debrief/reflect pass, writing key findings into persistent memory and rewriting the prompt whenever the context goes too large or it gets stuck. Recursion with TCO if you may.
In a way it's a generalization of the spec-driver approach, but in addition to the the formal spec the carryover buffer lives in the memory.
Kiro does this automatically from what I can tell using it
This is interesting to me because reducing context & token usage is in the user's best interest but not in the financial interest of AI vendors. I am not an expert but it sounds like your "one simple trick" would fix context issues and allow much tighter control over token usage. Thanks for being willing to share this tip in an HN comment, changing how those in the know use AI agents going forward -- it's hard to keep up!
> This is interesting to me because reducing context & token usage is in the user's best interest but not in the financial interest of AI vendors.
AI vendors still need to compete with each other both in terms of token cost and competency. An agent that is costly and less effective by wasting tokens is less competitive.
The tokens are still being burnt, they're just doing so in a parallel dimension from the users main context window.
It's true that the initial tool response still has the same amount of tokens but it doesn't keep dragged along in the longer-lived top context.
The real benefit is being able to use a cheaper, but good enough, model with a specific system prompt dedicated to that task.
How do you get the agent to stick to it without constantly rejecting tool calls with the same description? I've tried a similar setup a number of times and it tends to forget about this constraint very quickly.
The tool itself enforces the constraint. This is deterministic. If an agent tries to read a big fat file in root, it gets an error from that tool's implementation that reiterates the requirement.
I don't bother warning it in the system prompt anymore. It's pointless. I let it bump its head as required. A few hundred tokens and the agent is back on track each time.
If the model isn't following the system/developer prompts easily, you might want to try a bigger/better model, tends to mostly be about model quality if it doesn't follow what you tell it to. Besides that, conflicting directions in the system/developer prompts can lead to the model seemingly ignoring instructions too.
Which tools? Even file reads and writes?
Especially these things.
The only tools permissible to root in my scheme are call() and return().
Is it in pi.dev? Don't thinking tokens still take up context?
How do you get something like this set up?
This has not been my experience with Opus since Anthropic released the 1M token context window for use under the subscription plans. I routinely push past 500k tokens, even sometimes up to around 800k tokens, and don't see this problem. I've seen it to some extent when getting truly near the limit, up around and above 900k tokens, though what I see isn't as severe as the author seems to see.
(And I rarely fill the context window that far anyway when working on a single task, or a series of tasks that are related enough to warrant the same context; more typical is anywhere between 200k and 600k or so.)
I'm not saying that no one ever has this experience, but it's odd to me that some people see it so often that it warrants giving it a name.
I see this said often and find it insane given how many times I find opus models making basic recall mistakes at <100k tokens.
Personally I consider < 60k to be the smart zone for opus. This is worse for opus 4.7 and 4.8 cause of the more granular tokenizer
60k is tiny, if it's making recall mistakes that early then you might have some false memories or incorrect instructions in your CLAUDE.md.
60k isn't much bigger than the system prompt.
I don't use Claude Code. I use my own handwritten agent (formerly using Pi) and know every token that goes into it. There are zero memories to confuse it. The system prompt is 200 tokens and completely self consistent.
Plus I've found that the only time models go above 100k tokens anyway is when they've started looping at which point it's much better to go back anyway.
Anecdotally most models know their recall is terrible (or have been trained to act as such), that's why they constantly reread files before editing or while reasoning.
Yeah 60k is ludicrous, I've barely seeded the context at that point and I don't see context related degradation until well into the 600-700k.
In this thread: People tossing coins independently and fighting over the result they got.
No it's not.
It seems that people have different workflows or repos, or memories or prompts or expectations.
For what it’s worth, as a third party I read your and qsera’s comments as saying the same thing.
Maybe I misread the comment then.
I read it as a models performance being random and observed differences in the opinions are the results of the overinterpretation of the random outcomes.
I think however that some people seem to be always lucky which indicates that it is not random but rather some fixed differences between people and their environments.
> I've barely seeded the context at that point
I think that's issue, rather than 60K being small.
Most of the actual edits/changes I request to codex are solved within 100-150K tokens, beyond 200K I'd definitively try to restart the session as soon as I could as all models are horrible once you get across ~20% of the total context size. And this is while working on +million LOC codebases.
Problem I guess is that there is no solid and concrete evidence of this (to me [and others seemingly] obvious) degradation, but should be easy to prove, yet no one has time to sit down and show it :)
But the likelihood of a model getting minor details wrong once you're above some magical threshold between 15-20%, seems to skyrocket, and I hit that issue sufficient amount of times that now my workflow is trying to prevent that.
what are y'all doing to hit that? Do you just not give it any pointers and let it churn away? What kind of context are you handing off?
I routinely get claude to do things pretty decently and finish up easily in the 4-5 digit range of tokens. It seems to be doing the right kind of thing to not waste its time looking at 1000 files.
>you might have some false memories or incorrect instructions in your CLAUDE.md
did you internalize what was wrong with that quote when it was said? does it apply here?
>making basic recall mistakes at <100k tokens.
I usually see this when the context gets "tainted" as I call it. The model gets stuck on a bad path and there's no way to bring it back without clearing the context and starting again.
Frequently it'll be something as small as 1 sentence of a prompt many messages ago.
When cases like that happen, I reset the context and try to be explicit about assumptions and requirements to keep it off the "tainted" path. Other times it's actually useful and agents will do things they normally wouldn't do once the state is tainted. For instance, if you're testing a chat bot's ability to stay on topic, you can seed the context early with what you want it to do. It generally will refuse initially but later on in the conversation it will still silently take that seeded context into account almost "subconsciously" and become more likely to do the thing it originally refused.
I'm always a bit confused when people say things like this. 60k token is often more than the initial context I feed the model with. And I don't think I ever had a productive session that began under 150k tokens.
Bit of what makes it so fun, our experiences seem to wildly differ! On one hand, you have experiences like yours, but then my own experience is that I never had a productive session when the scope grows beyond 150K tokens! If I needed 60K just as a starting context, I'd take that to mean the suggested change is way to large, and if the model cannot solve the entire thing within maybe 15-20% of the total context size, divide and conquer is needed otherwise there will be a lot of time wasted to patch things up when things are "completed".
Yeah indeed it's very interesting. And the 60k initial context don't even contain the suggested change yet. For me if I don't do this the current models tend to fixate and local patches instead of tracing symbols and making a holistic model of what a change interacts with in the codebase
Not specific to Opus but yes it would make mistakes. I usually try to keep context window under 10%
I hate to do the "you're holding it wrong" trope, but I think you might have something misconfigured somewhere unless you missed a 0, because just past 60k tokens is such a small context window to be seeing issue in.
Do you have any old documentation that it's picking up and referencing? If you set all claude settings back to default do you see the same issue?
Opus 4.6 was on drugs past 200k, I skipped 4.7, 4.8 did good up to ~350k, and Fable did great beyond 400k, in my limited testing. The quality does appear to be trending upwards.
> Opus 4.6 was on drugs past 200k
Which drugs?
The way it hallucinates stuff, it'd probably be something in the LSD family. ;)
Combine it with meth and sleep deprivation and that could explain it.
Shrooms, sometimes crack
agreed. the claudes have been getting better and better with every release in this regard.
opus 4.5 would start failing tool calls when approaching its 200k limit, opus 4.6 could get to ~300k before getting confused, opus 4.7 i could stretch to around 400k the dumb zone started, with opus 4.8 i've had sessions get over 500k comfortably.
admittedly we only had limited time with fable, but i had a couple sessions get into 800-900k just fine.
I often push past 300k or so and I’ve absolutely worked at 800k but it’s an observable problem. Large context windows can work depending on the problem but I do feel more effective biasing towards small ones <300k.
Thats another problem of this post, the author mentions Claude but not explicitely what models...
100k tokens "by lunch" is also not my finding, the newer models will hit that already right in the initial exploratory phase
Really depends on the project.
I found "by lunch" odd too, but considering that Claude wrote the article, it's not going to know specifics.
I’ve had similar experiences with Fable. 70%+ context used out of 1M, still sharp and no memory issues.
I have a custom build command for a rust project (yarn build:lib) and my experience is 120k for GLM and roughly 200-300k for Opus. After that, they default to cargo build.
My projects have specific build/verify steps as well, and after a certain point Claude forgets to run them. I’m going to try a “No brown M&Ms” hook to halt Claude if it tries to run the default command instead of the instructed commands from CLAUDE.md. Perhaps this will be a good signal that a compacted or fresh session is needed at that point to avoid mistakes.
I mean, that’s basically the magic of the harness. The whole thing that skyrocketted the intelligence is that the harness (cli tool) prevent the LLM from editing the file before reading it.
Can you imagine even a junior making such a mistake?
As the gamblers say at the poker table: If you can't figure out who the mark is when you site down...
There’s a simple way to solve this: just use Codex. The auto-compaction is really good, and lets threads go on for a long time without losing track. In case you do notice a session is starting to go off track, it’s straightforward to make a new session, ask it to summarize an old session into an AGENTS.md, and start it from there.
Opus in recent versions is fine beyond 100k, but I usually do try to keep it under 200k.
But, this is also why so-called "memory" systems are usually a mistake that make the models dumber. They don't have memory, they only have context, and every irrelevant fact you shove into the context is less context for the problem. Less distractions, better results.
The way to have the agent remember things is to have it document its work, like a human developer would do if they wanted their project to be friendly to other developers working on it. Good developer docs with an index page and a good plan with checklists, in concise Markdown files, checked in to the repo is the ideal memory for models and the ideal docs you need to figure out WTF the model has been up to. Helps with code review, too, whether by humans or another model. There's no down side.
At least for me, Opus keeps writing stuff to memories, only to consistently forget checking those memories before doing the same mistake again. This ("remember to check memories!") is of course then again written as a memory... Clearly not a very well working system, yep.
Yeah, I see it write stuff to memory pretty regularly, maybe it works sometimes, but for things I want it to stop doing or always do, I make it impossible to do otherwise via lint or some style enforcement, or via a test that fails if code shows up that violates the constraint.
But, it does a good job following existing conventions in a codebase, as long as they're really consistent. So the more actively you enforce that consistency the more likely it is to do the right thing without memories or prompting.
I don't like "never do" or "always do" type rules in AGENTS.md or in memory, as it often over-interprets them and ties itself in knots trying to satisfy an impossible set of goals.
In my own multi agent framework I use cheap models to check the responses of the expensive models, as well as using multiple expensive models adversarially in debate. The cheap models are great at spotting eg the model getting stuck in the alternate between two broken ideas or not following code conventions or missing a step in the skill and so on. I’m currently working on making them detect user corrections and police that going forward to intervene when the expensive models forget the thing you just corrected them about etc.
I've explicitly banned Opus from creating memories unprompted, as it would often save info that's incorrect and which would then be propagated to future sessions until caught. Ugh x 10.
“Memory” systems are a way for developers to feel like they are contributing to AI
Almost every comment here is appealing to personal experience. By contrast, OP refers to two studies that compare performance on some kind of standardised test over a range of models.
Can't speak to how good those tests are, but they can't be worse than anecdotal evidence for something as vague/subjective as LLM performance.
I'll respond with more anecdotal evidence, the Llama family has been terrible at following directions in all the tests I've done--not sure about the other models in RULER.
In the Chroma results, they look at Sonnet 4 which was also terrible in my experience. The same prompt that worked perfectly in Sonnet 4.5 would fail miserably in Sonnet 4
Would be good to see newer tests with both SOTA and open weight. The SOTA ones always seem to follow directions and stay on topic better but it'd be good to have some data to back it up.
But the studies are in 2024 and 2025. They don’t apply to current Claude models.
I'm getting a lot of mileage out of basically acting like the AI's Product Manager, and insisting that it writes up short PRDs for every feature we propose to build. That gives it a reference over time of everything that has been built, but also makes it less liable to drift with each one. Each one gets its own conversation. For me this is a happy medium between stopping it going off the rails but also making sure it can reference past decisions when it needs to. The one thing I dislike about Pocock's method (not to use PRDs so much but to have an in depth discussion to get alignment) first is it wastes a lot of the best window on that initial back and forth.
Is it adhoc or you use more structured approaches like openspec? I also tend to work on a plan first, but it stays as in-session todo, which is hard to reference later.
It's ad hoc / my own framework, just found something which works for me. The exact structure is
- Work Mode - HITL/AFK
- Problem Statement
- Who It Affects - Primary / Secondary User
- User Stories
- Business Case
- Why Now
- Success Critera
- In Scope/Out of Scope [Out of Scope v. important)
- Thinnest Slice (This I've found super valuable, means you max out the amount of 'product' for your buck and avoid diminishing marginal returns or overbuilding. Often I will build this)
- Eigenfeature - What is the larger feature we _could_ (but probably won't) which would solve for this use case and other stuff I might not have thought of
- Technical Notes
- Deps
- Schema Changes
- Risks
- Final Recommendation [go / no go, including on scope]
There's a note in my Claude / Agents MD which says no net new feature gets introduced without this and I get it to move through a pipeline of folders (active, approved, shipped, proposed etc). All runs in a system of MD files and have even created a little MD Kanban from the metadata!
I guess I've stumbled into something similar. Though I don't have a fixed format like yours. I first do a lot of back and forth to generate what I call a design document also includes rationales for various points or decisions. I use both Claude and Codex to iterate on this until I'm happy. The end result includes a lot of what you mention.
I then start a fresh conversation, make it analyze the design document and code, and for larger changes, generate a high-level implementation document which includes concrete phases or steps. I review this plan and iterate if necessary.
Then for each phase I make it generate a detailed plan for that phase and save it along side the other documents. Once the phase is over, I make it write a summary of what was done, decisions made and reasons for it. And typically a good point to compact the model's context.
These documents gives additional context for when I make another model do code review, and help illuminate drift or gaps from the main design document.
I found myself in a similar workflow. Depending on the task at hand (starting a new project, enhancement, maintenance), I let the agent create/read the markdown files that I keep updated (AGENT, STATE, ROADMAP, DESIGN, ARCHITECTURE, (CODESTYLE if I plan to modi it myself)). Then I choose the various roles that I need in this session and and have a planning phase. After that, the agent is starting implement the changes and I have a manual correction phase.
This flow works for my needs, building idea demos, prototypes or tools for my own sake. I don't let agent code in our main code base where everything is still hand tailored. That's a conscious decision.
I noticed that the cheaper models (flash, ...) are quite hard to hold back changing files. A question for possible options sometimes results in "yes, I'll go with option A" without asking back. Frontier models on the other hand love to plan and ask you deliberately for your consent.
I use pi.dev with almost no skills at all to understand how models really work and "feel" to work with.
Is there back-and-forth? How long do these get? Can you share an example?
Working in the era of 200k context window meant I had to narrowly scope tasks to fit in the context window, forcing me to think about how to reduce complexity and naturally resulting in atomic work. 1M context windows and the promise that the latest models are "better at long running tasks" made me lazy in how I scope tasks and quality got worse. I now went back to narrow-scoping one session per task and zero compaction, trying not to go past 400k context window. If I end up with a long session, I was likely too ambitious and should have broken up the task.
Considerations about what goes on in agents internally will probably not be part of software development for long.
Personally, I already see LLMs and agents as blackboxes. I give each feature request to multiple LLMs and then compare the results. I don't manually use "sessions" at all. I just look at the outcome. When I dislike it, I "git reset --hard", change my prompts and restart the feature request.
To have an ongoing sense of which agents perform best, I keep a log and calculate an ELO score of which agents meet my demands best. This score is imporant to me, not so much how the agent achieves it.
This is an absolutely crazy wasteful thing to do considering the actual cost of all that inference and nothing to be proud of.
Unless we do our own benchmarks, we have to take all the marketing fluff from the frontier labs at face value, and all public benchmarks degrade eventually as labs optimize towards them. OP’s approach is wasteful because it is brute force, but post says that an ELO is kept, so this is also an experiment, and I don‘t see what‘s wrong with that. You learn which model performs well in which settings which may save resources later. It‘s also wasteful to keep working with the wrong model/harness/tools for too long.
It is the other way round.
In an interactive session, adding "Fine, but make the button red" after the model generated a first solution more than doubles the tokens used. As the model now not only gets the original code and the feature request but also the updated code plus the change request as input tokens.
Sending a feature request to an LLM and then sending the feature request again with "The button shall be red" only doubles the tokens used.
The cost is far from linear though. Because of prompt caching and the fact that generally output tokens are a lot more expensive than input tokens.
Agreed that it is not linear.
I wrote my own agent, and it sends data to LLMs in this order: "General Prompts (How to write good code)" + "The Code" + "The Feature Request". This means the KV cache will be used even when the feature request changes.
And output tokens are usually way less than the input tokens.
So I think that my approach is very lightweight on token usage compared to an interactive session.
It would be interesting to measure it for the other agents out there. Sending a feature request two times vs an interactive session.
"Make the button red" probably doesn't need an LLM at all.
One tends to use LLMs for everything in practice. It‘s inconvenient to switch mode of operation
That’s usually not true due to caching. It may be true if you leave a large gap in between, but if you send “make it red” right after, then it’s purely incremental
Probably like 1% of the energy an average person spends on driving.
Average american is what you mean
The cost is nothing compared to the outcome and time savings. What I see is that people with no money want to jump into this pool but they aren't having a good time. That is generally the case when you are poor.
come on now, we can't just not escape the permanent underclass by using our brains, we've also got to use up all the resources while doing it.
What kind of projects/code do you have them work on?
Asking because I could guess that approach would be ok for the types of front end work that doesn't require much security or other validation.
But it sounds like it wouldn't be suitable for work in regulated industries or anything that needs to have extreme care taken.
?
Which model is leading the pack for you?
From the SOTA model providers, I only use OpenAI and Google. And between gpt-5.5 and gemini-3.1-pro-preview, gpt-5.5 is currently leading.
Yes context management is key.
I do my own framework and spend a lot of time trying to debug this and it’s not so much the context size in hard numbers but rather the probability that there is debris or wrong directions in the window that are drowning out the things the user thinks are important.
This manifests in the llm that keeps going back to doing the thing that failed when they tried it just before the last approach etc. The frequency of things in the context window give weight even if they are the wrong things.
I have a lot of tricks like not giving the llm lots of tools but rather giving it a tool it can use to search for tools etc.
But the bigger solution is in process where you use something like superpowers to force the llm through stages and you control the context that carries forward.
I think of the context window as a pot of soup that you add ingredients to between meals. If you have a relatively focused recipe and you are able to add only the ingredients you want, the soup stays good. If you or the agent add an ingredient that isn't fresh, it is going to be difficult to salvage and it is better to start over with a new pot.
It is not that agents can't function with a large context window, they can if that information generally has a desirable signal (like a large initial document or a well-focused session). Mistakes and the confusing signals that come out of fixing mistakes are why performance degrades. I start to trust the context window less not as a matter of size but the amount of friction we run into. The friction can be random but it is more often an issue with the path that I have us on.
Hmm iirc if you ask Claude it itself recommends one conversation per task.
That’s what I did intuitively anyway.
I doubt the dropoff is as large as 100k tokens. I start a new session and paste the best results from the previous one as soon as as LLM makes more than a couple of missteps. Theres too much focus on fixing what's wrong rather than going back to what worked and amending in a different way.
If you don't point out what's wrong I find the LLM will go into great technical detail which consumes a lot of tokens, but not 'see the wood for the trees'.
It seems to me human beings also have mechanisms to compact context, which may be why we can forget what we came into a room for when going through doorways. I think it would be interesting to research which markers we use to compartmentalize our thinking.
I built a very small personal extension for Pi [1] that gives me a /last command. It clears the entire session, only retaining the agent's last output message. This allows me to do manual "compaction". Basically I tell the agent something like "state the plan as discussed with references to files that should be edited", and call /last, then tell it to implement.
[1] https://pi.dev/
> the dumb zone, where attention drops off and the model starts forgetting what you told it five minutes ago
I use opus 1m context all day every day at work and I simply have never encountered this. I don’t even think about context windows anymore I just let it do what it wants re compaction. Hard for me to understand where this article is coming from.
This has not been my experience and I do not think any of the methodologies testing for this do so usefully.
I dislike the non-specificity of "models" here. Different models have different attention architectures, and can therefore have significant differences in long-context behavior. It's true that long context is an issue can most models do drop off in quality, but I would not extrapolate behavior of old models to new ones.
I'm actually doing a big refactoring in a project where if everything gets loaded (code / docs), the context gets like 750k filled (Opus 4.8), and then the agent has the remaining ~200k to do actual coding, until I have to reset. I haven't finished the work but I'm like 80% there, and it seems the progress is good and the quality is also good, verified by doing some performance tests and a lot of comparisons between outputs between the original code and the new one.
Maybe I could achieve better and quicker results with keeping the context in the proper zone, but trying it will have to wait until the next project.
Funny to read about that superpowers repo, since only yesterday I wrote skills to do some markdown-plan centered aproach. I feel like smallish local models are getting capable of lots of things now, but they need lots of structure for resiliency.
Yeah I’ve been using gpt-5.3-codex-spark in Codex lately and it can be surprisingly good and it’s super fast. However it needs more explicit instructions.
I've had no problem with Claude Code Opus 4.8 effort max using 20% token context (200k) on software development tasks (all stages). I aways load core source files and the ones we are working on up front. Around 20%, I make it autoprepare for a new session and clear.
Admittedly I have been doing this precautiously, based on anecdotal evidence, not because I had bad experiences with longer context deterioration myself.
In the brief time I had access to Fable 5, it went on long running tasks (>45 mins) into the 30-40% zone without apparent context coherence problems.
Why is it surprising that, at some point, more information will lead to worse performance?
It seems obvious. Moreover, in a simple model, it seems like whatever tokens you do add have to have MORE information than the average in the existing window.
In a non-trivial model (and this is the model I would choose), since you are adding them to the end, they likely have to have MUCH more information.
Proof as always is an exercise to the reader.
I /clear all the time out of habit. I want to be able to get the thing done with minimal context. It also means you can do it again slightly different if needed, you know the seed conditions for the task.
100% with the author on that one, albeit the performance decay seems to depend on the type of task for me. Simple plumbing tasks seem to run okay with longer running contexts.
Also, some colleagues were playing around with RTK (https://github.com/rtk-ai/rtk), which decreases the amount of token used by tool calls and, although it seems an interesting idea, I am pretty sure there are many caveats. Although, I believe if these type of tools prove to be efficient enough, perhaps harnesses will have them natively.
Considering how expensive context is in terms of compute, I wonder why (and if ) vendors don't invest more into context engineering.
When it comes to source code, I feel like LLMs could just as well work with something like minified source code, if an LLM is trained on programming well, I think there's no reason why something like a variable should be represented by something more than a single token. Comments can be discarded, etc. In fact considering embeddings for LLMs are very rich, I think common ops could be reduced to a single token.
Imo that's why LLMs are soo good at reverse engineering. A lot of the time, assembly (with symbols) is pretty close to the source code, but compressed and encoded, and if you're familiar with the patterns of your compiler, reversing it is not that difficult.
Anyways, context engineering could be huge boon to input token curation imo (and maybe it already is)
The approach we're taking to deal with this very real context rot is using a bunch of related techniques which we call transposing the agent loop: https://alejo.ch/3jt
In essence, we run many short agent loops, generating their prompts dynamically from structured data. Each loop advances the state in a small step towards the final goal.
I think it's Your mileage may vary.
Few of the best sessions I have ever had with claude went into 700-800k territory.
I frequently reach 400-600k without visible (to me) signs of quality regression.
Can anybody explain me why just not limit the context window to something smaller instead of all that context engineering? It forces things to be constrained.
I wonder how much this depends on the quality and consistency of the context?
For example, it may be the case that a long context full of useful information relevant to the task is completely fine, perhaps even beneficial. And if the context contains a bunch of unrelated tangents and conflicting instructions, then it will be detrimental.
Have there been studies on what makes models get dumber? To what extent is context length to blame vs context quality?
> The number on the box gets bigger every release.
Not really tho right? Since we got to 1m context in mid 2025 nearly no one has gone higher.
There's an env var you can set in Claude Code to bring the autocompact threshold down, effectively setting your own max context window. I have it at 400k.
100K seems quite much.
I had the impression, models would get inconsistent after just 3000 words.
Perhaps compacting the context can be made in multiple requests over smaller and overlapping chunks to avoid using the 'dumb zone', and for yielding a better result.
Even taking the author's criticism about large context windows for granted, which in my experience are exaggerated, they are still a huge UX improvement over short windows. That reason alone is enough for me to support them.
Maybe this is the line, we'll hit eventually. Maybe the models become smarter, but the context will sit.
Evaluating the Sensitivity of LLMs to Prior Context
https://arxiv.org/abs/2506.00069
In my own testing I have seen peak performance happen usually within 15-20% of the intended context limit, albeit there are a few optimizations depending on the task quality.
It is a lot like giving a person instructions, the more you tell them, the more they will forget the specifics.
Long context generation is a sampling problem. Set your opencode to use a modern sampler like min_p or newer and you'll see models behave better at longer context.
Why is it a "dumb zone"?
What in the models causes this 'dumbing down'?
i let the main loop spawn sub terminal via tmux to prevent large contexts. it's great to divide tasks in small patterns and consolidate it step by step.
The problem with "context rot" is that its existence and severity is purely anecdotal. As far as I know, nobody has actually measured context rot systematically. The only thing we know is that memory degrades somewhat in long contexts, via things like needle in haystack tests. But that's not the same issue. Context rot is usually taken to mean that the model gets dumber even if it doesn't need to remember specific things in its context window.
This would be really easy to measure. Just take some standard benchmarks, but fill up the context beforehand. Is the benchmark performance degraded? If so, by how much?
It's pretty hard to measure because most context rot comes from related context and the model has to be able to figure which parts are truly relevant, which ones are relevant but stale, which ones to ignore etc.
Each relevant thing is basically a rule. Trying to so something with 500 rules is what's hard.
If you take a standard benchmark and just prepend a random book to it, it will not capture that
context window size isnt quite the issue though, its that the attention mass kinda spreads out too much and everything kinda converges to a sortah global average region full of what we know to be slop! theres some really cool ways at the harness or model layer to mitigate this. just isnt really prioritized by the labs often.
Is there any chance that this is because training corpus largely consists of documents shorter than the advertised context windows?
> dumb zone
Reminds me the sign, "Do not dumb here. No dumb zone."
aka Softmax context rot
Hasn’t been my experience at all - 1M window is a very clear upgrade working with Claude code.
Even better, don't trust LLMs at all.