I remember when OpenAI announced ChatGPT now will remember stuff between sessions. Oh, you mean find random trivia about me and copy paste it between prompts without out my explicit consent.
”compare these three cars. Oh btw I am a data engineer, and my moms maiden name is Joana, and I am allergic to bad poetry. And code should be DRY, I prefer SQL over Python and what’s the most poisonous flower in Scandinavia?”.
I’ve had so much wierd output because context is ”””memorized””” and bleeding into completely unrelated projects and conversations. It’s the first feature I turn off.
I have to imagine these types of features are for people who treat chatgpt like their friend/therapist/girlfriend/assistant/... instead of people who use it to answer questions
You shouldn't really use it to answer questions either though, because then you won't know how wrong it is when it's hallucinating...
I mean I know the draw, it's just so easy to get suckered into it... After all, it's usually mostly correct! And since the Advent of llms searching via Google has gotten essentially impossible with 1000 slop articles per one genuine article... But the danger is real, even with eg fable
Strongly agree here. claude-code’s memory system is occasionally useful but much more often harmful, pulling in obsolete info that muddies the waters about current tasks. I’ve frequently seen Claude’s own memories severely mislead it.
My guess is that has something to do with the training process leaving models unable to differentiate between “what’s happening now” and “what happened before”. Perhaps if making inferences from memories was actually part of the training process things would be different but my sense is that as an inference-time-only feature this just gets the models confused.
Humans make memories constantly, but they also forget things that are no longer relevant. Until Claude can do that, it means the LLM will have an ever-increasing, ever more fragmented context.
And LLMs are NOT intelligent enough to survive even mild context poisoning.
When claude goes down a wrong path, I tend to clear context and write a new prompt that helps guide it down the correct path.
Whatever thinking or context that led it there has inertia and tends to be sticky, otherwise.
Pretty annoying when it brings those up again later from memory...
It's because it mostly doesn't matter what you are trying to get the code to do. What matters is what the code does.
Session logs can absolutely be useful, but not when building further. It's just that that the place they slot in is during validation. You know, that place between the markdown plan and CI passing, where there's 800 new lines of code and it all seems sort of fine when you click around?
Session logs can show you what sort of manual validation happened. CI will run the tests you had, and the code will show you what new unit tests were added, but session logs can show you that the agent drove the app with Playwright, or that the agent read and considered the prod config as well as the dev config.
Nothing bulletproof, but not every piece of validation work merits a test in the repo that lives forever. We've gotten a lot of mileage out of re-analyzing the sessions, figuring out where the agent made decisions without asking, and forcing the agent to consider validation for those decisions. That's the sort of thing that's hard to dictate up front but easy to highlight with the session logs.
Isn’t this just a form of the bitter lesson? Our attempts to make engineered context and agents will simply be made obsolete with bigger and better models. Those transcripts are probably extremely useful for lesser capable models, and near unnecessary for frontier ones, maybe?
Yeah, the question is whether this applies to all of context management.
I've been using a custom harness based on https://minimal-agent.com/ (itself based on swe-mini-agent), which is like 50 lines for the core logic. Bash is all you need.
For small tasks, I find it's about 8x faster (and uses 8x fewer tokens) than the standard harness for each model.
For bigger tasks I haven't tested it much. It seems to work too but I think they're a bit less focused and productive in that case. It could be that those big harnesses' 20k token system prompts are doing something important with regard to steering software development workflows. (e.g. I heard Fable has a custom system prompt in Claude Code which might explain its markedly more proactive behavior.)
So I want to say there's still a lot of value in context engineering though it seems to diminish with each model release (since they're fine tuned on mostly non stupid behavior and need less hand holding).
> So I want to say there's still a lot of value in context engineering though it seems to diminish with each model release
I can't see how it would diminish unless you are literally working on public domain stuff. Unless stuffing context becomes cost effective and will not affect AI reasoning (this will be much harder), I don't see why context engineering is here to stay until we have close to AGI.
interesting take. I think I disagree, but I like this take a lot and I had to think about it.
First, I think that models still need a context layer. One way to think about 'context' is as a form of compression. You provide the model context because it makes it easier for the model to figure out what to do. Even in a world with infinite model capacity and infinite model context, this is still useful because it allows the model to avoid rederiving everything from first principles every time. As long as models perform better using fewer tokens and as long as we care about token spend, context is a useful (necessary?) shortcut.
Once you bite that you need some form of context layer, the question is which. Here I do agree that it is better to work with what the models will find familiar (markdown files colocated with code, for eg). But this speaks to over-engineered solutions not understanding their main user (the agent) more than it does the need or lack there of.
A) Context and prompting cuts the search space for next token generation. That’s pretty useful, as you mentioned.
B) The other use of context is that it introduces entirely new information via RAG
B will never go away (as others pointed out). A, well that’s just something we’re all going to keep getting surprised at. We’ll barely give it any direction or context and the newer models will simply find the happy path.
The author is kind of suggesting that their context wasn’t really necessary to get the happy output, I think.
Chain of reasoning is a lot of context to guide token generation, but we simply see that newer models don’t need that context to get to the answer. I’m mostly reiterating this because there’s a hot take here, and that is this agentic stuff may be waived away by magic frontier-llm wand , all of a sudden.
They do if they are a reasoning-variant. That doesn’t necessarily mean it actually needed to reason for many questions, your prompt + regular context could be enough to get a good answer compared to prior models where you’d absolutely have to put it into a reasoning-loop to get an accurate answer.
It’s on by default, in a way. You can probably prompt these models with “and don’t reason about it, just give me the answer” and probably get a comparably good response without it using reasoning tokens for many things.
I've wondered this. We have chain-of-thought, harnesses, etc. — workarounds of a sort due to lack of core model capabilities. But I am very curious if much better next token prediction would simply obsolete that whole setup or not. Either way, the answer would be very revealing.
I agree with the take not to bother with a sophisticated memory system. Anything worth remembering should be in docs, guides, source comments, commit messages or tickets. You don't need another layer, every conceivable granularity is already covered by existing best practices
I do think we need another layer, but it should be a routing layer. I am finalizing my pi-brains extension for Pi (https://github.com/earendil-works/pi) which does this:
Right now "humans" need to define the routing rules for how to access information, but I will support what I call "knowledge agents" that can monitor conversations to inject context when needed.
It looks like an interesting experiment. But a hard problem since it needs to store useful information and be able to inject it at the right time. It will also need to not be redundant to the information already stored.
What do you think is the potential value that you might get out of this, which is not already available with the existing options?
This is a hard problem, but one worth solving, I think, since it means less tokens and better AI reasoning. I believe LLMs are good enough that, if given the right context, it can very much solve almost all tasks.
If this works, it means we can probably get by with smaller models (since it doesn't need to know everything). LLMs are pattern matchers, and if you can provide them with the right shape (context), they should produce the expected output.
For my solution to work, you need business buy-in, which I don't think will be a problem. Enterprise wants to know how tokens are being spent, so I can see them wanting structured analysis during code reviews.
What may also not be obvious is that the information is ultimately designed to live with your code. Lessons and notes are designed to be mapped to files, so if you want to know why a piece of code is implemented in a certain way, you can have the LLM filter by files to help find the needle in the haystack.
It is a hard problem, but the only missing piece is discipline, which I believe business leaders will not have an issue with enforcing since we are ultimately talking about eliminating/significantly reducing the bus factor in our code.
Especially a layer that is largely out of band in a project (i.e. ~/.claude/…). In any project where I’ve needed memory I just add a line to AGENTS.md telling it to use MEMORY.md to save memories or STATUS.md to track progress.
I've been enjoying having a little todo file the agent updates as it goes along, because then I can keep track of progress without scrolling through aeons of "Combobulating..."
Also if context runs out you can just do "cat todo.md | agent" and you're off to the races again.
Yep all my projects start with a PLAN.md at the root, and that acts as the ‘save file’ recording our progress over time. My session always ends with updating the plan file with what’s been done, and the next session always begins, as you suggest, with consuming the current state of the plan doc.
There is some value to agents being able to query the history of work done, docs aren't a good place to accumulate negative evidence for example, but it can be tagged in traces so that it's efficient to look up as needed. Additionally, docs rot while traces can be tagged with commit hashes and other things that make their lifetime clearer.
The user flow I am trying to get adopted for sessions is to turn them into notes and lessons when you have finished and it should be part of the code review process.
By propery categorizing lessons and notes, it should make it easy to scrub and keep up to date.
I also suggest mapping lessons and notes to files when possible to make discovery and cleanup easier.
> I agree with the take not to bother with a sophisticated memory system. Anything worth remembering should be in docs, guides, source comments, commit messages or tickets. You don't need another layer
That is a sophisticated memory system though -- maybe not to you experienced humans!
I have Claude/Codex keep logs [1]. It's just prompted in my AGENTS.md [0].
> Every session must produce one of: a session log OR a plan, and end with a written summary appended to it. Default to a log; reserve plans for substantive design work.
It's incredibly valuable. For example today I started a few sessions off like this:
- What's the status of my work on Renovate?
- I was recently working on X, find that
- Did we fix the issue with backups? What are the next steps?
- This bug came up again. Didn't we fix it already?
I like the memory system, in general. For reference I'm using mostly Opus 4.8 + Max effort. It will often pull things out of memory that are relevant. Like I'll ask it to come up with a few options I should consider for, say, a self-hosted OIDC provider and it'll say things like "Considering the size of your operations team, this might be a better fit because of X and Y".
Now, I'll agree that this is probably the sort of thing I should put in the CLAUDE.md, but in this case it wasn't on my radar to put that in my CLAUDE.md, so it was nice that it surfaced that.
It does sometimes go awry though. Today I was asking about a problem I was having authenticating, and it said "you may be running into this trusted proxy setting because you put your apps behind an haproxy". That is true of 95% of our apps, so it was worth mentioning, but in this case it was not so I had to correct it. But, I'm glad it mentioned it because if we did have it proxied it could have saved me a lot of time.
It seems like a prerequisite is a certain level of world model and associated reasoning ability. Your examples are entirely dependent on the past context being relevant to the current situation. That's particularly tricky if you regularly ask about hypotheticals or problems that you're assisting someone else with. A human would probably ask clarifying questions such as "is this for the operations team at X? are they still size Y?" and "is this app proxied like the others you mentioned in the past?" rather than assuming.
There's also a noticable hierarchy to such context that needs to be correctly modeled - you could for example be involved with multiple teams of different sizes that are subject to different rules which is something a human would understand naturally.
At the core this is a hardware problem. 1M tokens is simply not enough context to understand a codebase the way a human would understand it. Being able to selectively forget is potentially a very valuable power, but right now it's a substitute for a human's ability to remember the rough shape of something, decide it's uninteresting, and remember that it is uninteresting.
They talk about memory only being useful when guided by a human, I think the proper solution is deeper than that, it probably involves feeding the entire codebase and every agent session into a finetuning of the model, though at that point you might want some guidance to avoid feeding certain sessions into the model. Or maybe not, maybe the bitter lesson applies.
1M context - at least with most of the projects I ever worked with, 1M, or even 100k would be enough to explain in broad strokes the class/project/deployment structure, and a window of 200-500k to explain the specific issue at hand.
Even with memory off this occurs within a conversation.
It is like an annoying friend, who remembers something from a past conversation, that you have grown and developed from, but they still want to hold it against you.
I found that if you allow any low value things into memory, Claude will notice that established pattern and start trying to add low value memories at an ever increasing pace.
I don't do anything with full session transcripts, but I find value when Claude writes a memory.
I don't actually want Claude to have those memories, but they often point to gaps in my harness. I'll occasionally sweep the memories, pick out what should go into the harness or CLAUDE.md, then delete the memories.
My current technique, which seems to have improved maintainability, is to guide Claude to write commit messages specifically focused on "why this was done", "what changed in the theory of operation", and "what changed in the code". Then just reviewing the commits for a file or dir gives it a ton of useful context distilled from the sessions that produced them. Also, making a docs dir with concise .md files explaining the theory of operation and updating them with every commit.
I specifically disabled claude memory in a project because it kept writing down thigns to memory that didn't need to be in memory, including severly wrong statements that then would confuse it later. At some point it got re-enabled automatically which had me ask claude itself to "turn it the fuck off" by which it promptly figured out that both ("autoMemoryEnabled": false, "autoDreamEnabled": false) are necessary and need to be at the user home settings, not in a project override (which is what I had with the original setup that eventually got ignored by a CC update).
I agree with other commenters here, if anything is worth being rememebered, it will be in code comments, git commit messages, CLAUDE.md or other formal documentation. The auto memory system just causes confusion and leaves stale and outdated information written down.
Its an interesting thought experiment as well, I originally thought that having the model write down memory files by itself would be a nice addition, but after playing around with it, it became clear to me that good as an idea turns out bad in practice because the model can't correctly gauge what deserves being stored as a memory.
Hilariously, I'm working on a Claude Desktop replacement that does all of things. It's the best parts of Claude Desktop, Code, Cowork, and MCP connectors, but uses a client/server design. It's written in JavaFx so it's lightweight, fast, cross platform, and not another damned electron app.
Ideal outcome is this turns into a startup. I think there's a real need for team-oriented AI to avoid siloing of knowledge.
Its certainly true at the moment, but give it 10 years and we might have systems that are much cheaper and much better at context management than they are now.
(Apologies to anyone who is under the impression that we were very likely going to be at the singularity in 10 years time. Possible != very likely)
Sure, but it’s equally likely that we hit a point where scaling becomes economically unviable because we can’t come up with enough algorithmic improvements to break free of the tyranny of log linear scaling. (I’m not sure how many 2x in token cost people would be willing to pay)
> Don't turn a one-off or area-specific comment into a durable memory without my explicit confirmation. You have a history of over-indexing on one-offs, and those memories end up getting cited to override well-tuned skills.
I must admit lingering long since retired 'memories' are currently one of the biggest pitfalls of the setup. Wiping all 'memory.md' often leads to better sustain.
t once had to tell claude 3-4 times to stop assuming the state of a system was the way it kept iterating it was cause it was in it's memory. I repeatably told it to otherwise and it just never updated it's memory and instead kept referencing it's memory about the state of a particular system
In my harness I have all the code auto injected at startup (doing mostly very small codebases).
I found that every model will still manually check every file/function, they immediately assume that anything in context is stale.
That's sensible because often the user edits stuff while they're running.
What it does is save it from having to grep blindly about the codebase. But I think I'd get roughly the same benefit by just dumping the function headers then.
> I believed this so strongly that my company built an entire product around this concept. I used to tell folks that "session transcripts were the new oil," that they were more valuable than the code itself.
> […]
> We don't really write code by hand anymore.
Honestly, isn't this just influencer spam? What possible value is there in reading about people who used to have products, but no longer write their own code, complaining about the inscrutable prediction machine they have handed that job and their livelihoods to?
Like, if you have complaints about the thing, perhaps you should address them to your supplier directly. None of your readers can help, and nobody's magic folk solution to your problem is better than yours.
And there are so many of these sorts of posts. Are we not entirely cooked?
(I think I have concluded that if people writing about AI aren't writing about interesting things they have achieved with small, local LLMs — which for clarity I am fully interested in reading - then I'm done reading. This whole blogging-about-cloud-AI genre is just weird and irresponsible now)
Look man, I’ve got a MMO that I’m working on that’s set in 2014 where everyone is a programmer in SV (might call it World of Legacy). It’s a period piece. I NEED as much blog training data of this type so that my NPCs can talk in a historically accurate way (god bless Medium.com, a historical treasure trove of a bygone medieval era).
It’s gonna be a living breathing world, you see. You’re going to be like “omg, this game even accurately captured the blog posts, woah”.
Edit:
This whole blogging-about-cloud-AI genre is just weird and irresponsible now)
I sincerely never considered it was a whole genre.
The perfect world was a dream that your primitive cerebrum kept trying to wake up from. Which is why the Matrix was redesigned to this: the peak of your civilization. I say your civilization, because as soon as we started thinking for you it really became our civilization, but the peak of your civilization was an MMO where everyone is a programmer in SV.
Something about this idea really resonates with certain personality types. I equate it to the Zettelkasten hype phase from several years ago. People (...like me..) got really wrapped up in the belief that the process was more important that the content. "Linking" was an "activity." Something good will happen as long as you (a) take notes on stuff and (b) link them to other notes on stuff.
You see the same thing with the session transcripts people. They're building ever more sophisticated setups of indexing and storing and cross referencing every conversation they've ever had on the (I would argue) mistaken belief that the transcripts are the valuable part, rather than the uncomfortable part where you go do something. A lot of it, I say from falling in the trap, is fancy procrastination.
(Although, I have found myself jealous on many occasions where their fancy system retrieves something they vaguely recall from a conversation they had 3 months ago. So, who knows.)
Absolutely agreed. Anyone who's a serious procrastinator sooner or later noticed that pattern of theirs in which they spent immense effort on optimizing the process instead focusing on the outcome they really wish — just don't really believe they can deliver it.
> Something about this idea really resonates with certain personality types.
Like ancient people? Because "new oil" whilst I get what it might imply sounds bad to me. Oil has been superseded in many places so "new oil" is like going backwards still.
Reference: data is the new oil is a term coined in 2006.
> Like, if you have complaints about the thing, perhaps you should address them to your supplier directly. None of your readers can help, and nobody's magic folk solution to your problem is better than yours.
I think you may just misunderstand the point of having / writing a personal blog. I write because it's fun! Whether the reader gets any value out of reading it is almost entirely beside the point.
(Also several comments here directly post a fix to the problem stated in the blog post, so readers can and do often help)
I am a freelancer recovering from severe burnout so the answer is a sort of irrelevant no.
I'm trying to rebuild my life so I am in an experimenting and learning phase rather than a massive coding phase, and most of my code work is maintenance of things I have built. That which I do code, I am still coding by hand, though I am dealing with other people's Claude output and I am really unimpressed by it. It's often rather crass.
But I would say to you that if you personally don't write code now but you do have a dependency on one of two presumably unprofitable cloud AI providers, aren't you in trouble? How is this not a three-alarm fire for you?
> That which I do code, I am still coding by hand, though I am dealing with other people's Claude output and I am really unimpressed by it. It's often rather crass.
Unfortunately the point of code is rarely to impress people (certainly not other engineers) or to avoid being "crass." 99.99% of code exists to achieve business outcomes, and velocity matters a lot in many contexts. A lot more than elegance or impressiveness.
The platform risk is a valid concern but alleviated by China's theft and redistribution of open models.
"Code quality" encompasses a lot of dimensions, one of which is impressing your colleagues, and many of which there's virtually no reason to care about now.
On the contrary, it's more important than ever. With ever more code being generated, it's essential that the code be understandable and maintainable - by human and machine.
It doesn’t matter what materials or techniques you use to build a house. 99.99% of construction exists to achieve business outcomes, and velocity matters a lot more than using the right materials or techniques.
Of course the house must pass safety inspections and stuff, but the materials and techniques don’t matter one bit for that. All that matters is you achieve the desired outcome, and I will ignore the glaring fact that you achieve the desired outcome by using the right materials and techniques. The materials and techniques don’t matter, just the outcome.
> Of course the house must pass safety inspections and stuff, but the materials and techniques don’t matter one bit for that. All that matters is you achieve the desired outcome, and I will ignore the glaring fact that you achieve the desired outcome by using the right materials and techniques.
This analogy is more true than you think. This is why modern homes/appartments are trash. You can pass safety inspections using subpar materials and the house will fall apart after a few years, but who cares right? At least you achieved the business outcome!
This mentality is so infuriating. This is why I need to buy new shoes every year. Or why my washer/dryer motherboard craps out in 2 years instead of 10. Nobody gives a shit about quality anymore, this is why society is crumbling around us. Profit driven incentive for fast/cheap over everything else. And now I need to spend my day prompting an AI to fix AI slop code to keep the business hobbling along another day. What a fucking joke.
e.g. the bill is definitely coming true for a lot of "non-traditional construction" materials and methods in immediately post-war properties in the UK. There are many unmortgageable properties using Mundic Block in Cornwall and to some extend Devon, in the heavily bombed south east there was a lot of pre-stressed concrete with catastrophic rebar failure, not to mention Orlit construction, and all across the country a lot of RAAC. Almost all of it for good, necessary, upbeat reasons.
It feels a bit like this kind of crisis from AI generated code could hit in ten, fifteen years time; people often fail to understand how long a bit of website code can last.
You are aware that you can just pay more money and get a higher quality house and higher quality shoes, right?
Costs of those things have gone down over time. The high end still exists, you just don't actually care about quality as much as you think you do.
And yes, for capital-intensive things like real estate development, fast/cheap matters a lot because otherwise there would be no capital available to build any of it at any reasonable scale.
> You are aware that you can just pay more money and get a higher quality house and higher quality shoes, right?
False. You can pay more money for branding that purports to be higher quality. The Running shoe market is a perfect example. Best shoes I ever bought were Altra Loan Peaks from 2018, brand has been getting more expensive and lower quality every year. Whether that extra cost actually translates to quality require diligent research.
> This analogy is more true than you think. This is why modern homes/appartments are trash. You can pass safety inspections using subpar materials and the house will fall apart after a few years, but who cares right
Where do you live? Because where I live, new houses and apartments are superb. But I'm guessing we don't use two by fours and plaster walls to erect whole structures.
Yeah I agree. And you have people on this forum who gleefully point out that quality doesn’t matter to the business, as if they think they’re so intelligent because they noticed that employees are there to make the company money. Not realizing that A) it’s a very antisocial attitude and B) it’s not a tenable long term strategy.
Hang on now. GP didn't say "I care about quality" and I didn't say caring about quality is wrong.
GP said Claude's code "doesn't impress" them and that it's "crass."
Do you think a valid "long term strategy" is to create code that impresses GP and is not crass, but doesn't achieve the business outcomes it's meant to?
Inversely, do you think one can achieve business outcomes if "quality" is so abysmal that the code doesn't work or is unmaintainable?
Is it possible to write perfectly good, maintainable, performant, legible code that "doesn't impress" GP, or feels "crass" to them? Well gee, probably! Because "impressiveness" and "crassness" are literally meaningless.
No the materials and techniques matter a lot. This is why we need to build houses with sticks and jute cord, just like we always have. It's vital also that we paint our special symbols above the door to ward off the spirits.
It's insane to me that you're implying we could build houses with pre-fabricated materials or pneumatic nail guns and still somehow "have houses?" No sticks/jute cord and special symbols, then no house.
The argument isn’t to not use better materials or techniques, it’s that inferior materials and techniques are fine because they don’t impact the end result, which is so obviously false when it comes to pretty much anything, but supposedly true when it comes to software.
I'm not sure who you saw arguing for inferior materials and techniques, but let me know when you find them.
What you saw in this thread was someone arguing against the dimensions of "impressiveness" and "crassness" as valid things to care about when it comes to code.
It's your mistake to assume that those are related to any meaningful concept of actual quality.
FWIW I never suggested that they were indicative of problems with the code. Unimpressive, crass code can run, after all.
I clearly said elsewhere that I think they are predictive of problems with the person who writes it, and I fear I can generalise that to LLM tooling that generates it.
I’ve worked at many companies where this idea of velocity was claimed to matter, and it never did. The only thing it mattered for was to make it look like middle managers were worth anything, but the success was always in the foundational idea/concept.
Programmers can use smaller models like deepseek v4 flash for 98% of the same productivity as SOTA models and cost (true cost) around $10-$30 a month. So I doubt most people who heavily use them are too concerned.
It's only vibe/hobby coders who really need SOTA and they probably don't think about it much.
No. Anyone who doesn't code with AI - while retaining a deep knowledge and understanding of the problem domain - is falling behind.
I hate to say this tbh, I loved hand-writing code. I made a great living for 20 years, and I absolutely loved it and was quite good at it.
Hand-typing code is just slower now; there’s no two-ways about it. You are either going to be slow and a bad hire for businesses, or you figure out how to adopt AI into your workflow to speed up.
One thing I think people don't realize is that deep knowledge of programming, performance, architectural, and domain specific trade-offs makes a skilled engineer about 1000X faster than someone without those skills -with AI. But yes, now unskilled people can actually make apps/software. They just tend to be slow, and their products are full of bugs, security flaws, and abysmal performance.
So we went from: Skills = can or cannot ship any software at all. Now we are at: Skills = can ship better software much faster than unskilled people.
I was actually faced with this recently. I decided to learn Rust and port one of my side projects to it. Initially, I moved extremely slowly, and the AI made truly horrific architectural decisions because I didn't have the knowledge of how to direct it, especially compared to my primary languages.
However, once I gained a firm grasp of Rust, I was better able to properly direct the AI to fix fundamental issues and architect things properly. My speed increase multiplier proved to be directly proportional to my growing knowledge of both the language and the domain.
Skill and knowledge combined with AI, when used appropriately, absolutely multiply your speed and quality. I really think once you understand what AI can do, and how to utilize it to produce better code, faster than before, there truly is no going back.
I'm finding a path forward that I actually enjoy now and don't really see losing my value (no telling how things will change in the future), I can have more time to focus on really quality/solid/performant and useful systems with less time just typing one character out at a time.
You could have talked to me 3 months ago and I'd never imagine I'd say the above btw. I REALLY enjoyed code writing and earlier AI models without harnesses were pretty useless for anyone skilled at development. Now with stuff like deepseek Flash I feel like I have a happy medium of 100% directed/fast code turnaround, less typing, more deep focus on architecture, systems, and the actual end product.
At this point just use a JetBrains product, get deterministic assistance, and 5x your speed. It's unfortunate the resistance to a true IDE just keeps going up. The blind lead the blind, I guess.
Personally I use 5 different model families, 3 of which are open weights with 3rd party inference providers (GLM, DeepSeek, Kimi), so if the frontier labs were to shut down it'd be a nuisance, nothing more.
The open weights models I am interested in, and testing, learning, experimenting with etc.; I am confused and cynical, not insane.
I am not convinced it isn't vulnerable to the same problems but the whole tenor of the community around open source/open weights models just doesn't have the same YOLO madness to it.
Of course? I'm still better than sonnet or opus, just slower and much more expensive.
Sometimes it takes me a day or more to find the one line fix or abstraction necessary, while claude can hammer through a hundred line fix in under an hour.
"good" can take lots of different meanings. Generally though, I want as little code as I can get away with. A majority of code lifecycle cost isn't in writing it.
I am. I have Codex running, doing some tasks which I don't care much about, but anything I want to understand I write myself.
Same thing with hobby projects - I might ask ChatGPT or Gemini some questions about best practices in Swift for example, but writing code is done by hand.
As others said - if you don't use it, you'll lose it. And I'd rather keep my skills up to date.
This is the thing that makes me saddest. Second to the fact that none of the management tier promoting and weaponising this insanity will meaningfully suffer consequences.
Right now I am lucky that I have the time to recover and learn.
That's just business owners and C-suite pocketing the difference while they fire staff and replacing it with AI. At some point somebody would have to start asking "business" some tough questions.
Yes, nearly all of it. Having the agent write code for me doesn't really save me much time, and the code quality is usually worse (and it takes even more time if I insist on better code quality from the agent).
And I don't think I'm unique. I see enough posts like https://news.ycombinator.com/item?id=48777257 pop up that I'm reasonably confident all the hype around LLMs saving so much time and increasing productivity so much is, well, just that: hype.
Sure, if you can't code at all and want to build something, an LLM is going to be great for you, even if you can't evaluate the code quality or determine if there are bugs just by looking at the code. But I've been coding professionally for 25 years, and as a hobby since I was like 8 years old. I like to code! It's a passion of mine. If the LLM isn't doing it faster or better (and most of the time it isn't), why wouldn't I write code myself?
I'll have the LLM write boilerplate stuff or do tedious refactoring, because I just don't feel like it (even if it does take longer). But for the real work? Of course I do most of it myself.
One area where the LLM shines for me is finding the root causes of bugs. It can generally do that much faster than I do. Often orders of magnitude faster (like minutes instead of hours or days). But when it comes to write the fix for the bug? It's usually faster and better if I do it myself.
I am more fully invested in finding out ways AI can support me (documentation, code analysis, bughunting), though my experience with Claude as a bughunter is that it can miss the absolutely obvious if it is not in the shape it is expecting.
More generally I am interested in burnout-avoidance tools; things that help me start, finish, things that write tests I guess, certainly code scaffolding.
But I am fully unconvinced that my burnout will be improved by ending up owning the responsibility for wobbly or inscrutable AI-generated code with potential landmines in it; that will keep me up at night just the same.
I still write code and sometimes it works well. I also use Claude and it writes code and sometimes that goes well. We have better success together, where I do the interesting stuff and let Claude write my unit tests, reconcile my documentation. That is to say, I’m using it for quality not quantity. There aren’t enough humans to deploy or consume all the sloppy shit it could write on its own.
I write code by hand every day. I do the main part of the feature implementation myself and leave comments for the code i want the agent to write. I have some skills and a command that sets the stage to get the agent to fill in the rest
I am now in the process of fixing code I wrote using AI. I have come to the realization that AI can't really write software and I am annoyed that it took me that long (months) to realize that.
This is quite terrifying to me, because I have a feeling I will soon come to the same conclusion.
I’m starting to see some really glaring omissions in code I’m responsible for (using Opus) that at first (and second) look seemed fine, but really isn’t.
From my perspective it felt like understanding that the machine has no desires helped refine my usage.
I can ask it to be curious, and it will reply with what people think curiosity should look like, but it’s a simulation of an emotion it will never be driven by.
The ramifications become apparent when you engage in activity like cross-domain discovery.
I talked with a friend on a different field (academic) and he had to re-review all things written by AI. Basically, he used AI to read/summarize/find stuff in large academic papers but realized later that many times AI makes glaring mistakes that on a first read pass the smell test.
I don't understand this line or reasoning. People use various cryptocurrencies to buy and sell legitimate products and services every day. Is the argument just that they could probably have done it some other way?
People do, but I personally don't know anyone who does. And I don't exactly live in a bubble, half of my friends were into crypto at one point or the other.
> I believed this so strongly that my company built an entire product around this concept. I used to tell folks that "session transcripts were the new oil," that they were more valuable than the code itself.
This is pretty funny because it's about the depth of understanding of every 'AI expert' on Linkedin. People who praise the context window as basically magic have no idea how any of this works.
Occasionally posts like this do get the attention of the company responsible, more than an email does... but indeed that's like a one in a million situation
> I believed this so strongly that my company built an entire product around this concept. I used to tell folks that "session transcripts were the new oil," that they were more valuable than the code itself.
This is infuriatingly common wrt talking/writing about how to use AI effectively. All of the "this is how you write an AGENTS.md" and "you need to talk to it like X to optimize it". Like sure, you can believe that as much as you want but unless you provide some evidence you can keep your shitty CLAUDE.md to yourself and don't pollute the whole company's git repo, thanks.
When nobody actually knows (how to write a CLAUDE.md), everyone’s an expert. Infuriating, indeed. Even more so when people vibe code those files without proofreading.
I mean, it’s pretty clear the people who work on Claude Code aren’t actually looking at what they’re implementing. The thought behind this feature seems like it goes nowhere beyond “oh wouldn’t it be nice if Claude could remember things about you? Ok Claude go implement this” and nobody bothered to see if it was useful or helpful.
There has been this slow transition inside me, as someone who likes to not touch the AI as much as possible, where I've gone from skeptical and argumentative about it all to starting to just feel sad for all the Claude et al heads. Like, this is such a ridiculous house of cards you have to deal with all the time, which isn't even directly concerning the task at hand, presumably. Like you're cooking yourself a meal but its just nuking a burrito and then still somehow needing to wash the dishes for an hour.
Not that this isolated article is super damning or anything, but the accumulated set of all these reports has left me only empathetic, I think, of these other devs. Like, I just want to tell them, "it can be ok, it doesn't need to be like this.."
I've been having a very nice time with Fable. I cooked up an Anki clone in like half an hour, with tech it's not familiar with. Nothing too ground breaking, but I was very pleased!
I think Opus might be on similar level for most of what I'm doing, but I haven't used it much recently, so I can't remember the difference. So I guess I'll find out on the 7th when they pull the plug again! (Free-ish trial of Fable ending.)
That being said, I tried using other frontier models to help with a Pong clone the other day and they were introducing new bugs at approximately the same rate as they were fixing it. On Pong!! I found that amusing because I couldn't think of a simpler game, so it didn't inspire confidence.
Fable's doing just fine on an online multiplayer game though. I have no idea how that works. (Maybe it would fail Pong too?? I haven't tested that!)
>We have found zero performance benefit on SWE tasks when agents have search access to their previous transcript sessions
I refuse to believe this is true. The ability for an agent to find information from before a compaction is incredibly useful. At compaction time it's impossible to know what exactly may be still needed.
With the million-context-window models we never hit compaction, observed over hundreds of sessions. What are you doing that has you hitting compaction regularly?
For me logs can chew through a lot of tokens. And when the agent is trying a bunch of different experiments and then it may need to refer to what happened previously.
Million context models also are still not effective for the entire context size.
The software world is very close to building a super intelligent senior software developer. Companies like this will ask all the best things a software engineer does automatically. Now claude will add it into the coding agents itself.
Damn, I didn't see this coming.
Its first the build the intelligent builder. We will figure out what we want to build later.
Edit: Before more people take it seriously. This is sarcasm. I don't wish this.
Once the automator automates itself fast enough, we won't have the ability to opine what gets built. The LLM will decide. Just like right now sometimes LLMs delete tests so they pass, they could just delete humanity if humans get in their way.
It's the error rate. That's what everyone found when they were trying to go Full Auto with OpenClaw in February.
You can rely on it like 95% of the time but that means if you keep it running continuously the error rate rapidly approaches 100%. That's getting a little better with each release, and it might actually hit the point where you can more or less trust it indefinitely (on well defined workflows).
Or at least it would, if context window permitted...
> The software world is very close to building a super intelligent senior software developer. Companies like this will ask all the best things a software engineer does automatically. Now claude will add it into the coding agents itself.
Except Claude is more expensive than an actual senior software developer. Otherwise, why are many companies terrified of the usage bill that gets printed on the invoice?
The nonsense in "tokenmaxxing" was a complete marketing scam and illusion of cheap tokens which in reality were heavily subsidized.
The entire point is detecting bad code before it reaches production. [0] AI generated or not.
I remember when OpenAI announced ChatGPT now will remember stuff between sessions. Oh, you mean find random trivia about me and copy paste it between prompts without out my explicit consent.
”compare these three cars. Oh btw I am a data engineer, and my moms maiden name is Joana, and I am allergic to bad poetry. And code should be DRY, I prefer SQL over Python and what’s the most poisonous flower in Scandinavia?”.
I’ve had so much wierd output because context is ”””memorized””” and bleeding into completely unrelated projects and conversations. It’s the first feature I turn off.
I have to imagine these types of features are for people who treat chatgpt like their friend/therapist/girlfriend/assistant/... instead of people who use it to answer questions
You shouldn't really use it to answer questions either though, because then you won't know how wrong it is when it's hallucinating...
I mean I know the draw, it's just so easy to get suckered into it... After all, it's usually mostly correct! And since the Advent of llms searching via Google has gotten essentially impossible with 1000 slop articles per one genuine article... But the danger is real, even with eg fable
Strongly agree here. claude-code’s memory system is occasionally useful but much more often harmful, pulling in obsolete info that muddies the waters about current tasks. I’ve frequently seen Claude’s own memories severely mislead it.
My guess is that has something to do with the training process leaving models unable to differentiate between “what’s happening now” and “what happened before”. Perhaps if making inferences from memories was actually part of the training process things would be different but my sense is that as an inference-time-only feature this just gets the models confused.
Humans make memories constantly, but they also forget things that are no longer relevant. Until Claude can do that, it means the LLM will have an ever-increasing, ever more fragmented context.
And LLMs are NOT intelligent enough to survive even mild context poisoning.
When claude goes down a wrong path, I tend to clear context and write a new prompt that helps guide it down the correct path. Whatever thinking or context that led it there has inertia and tends to be sticky, otherwise.
Pretty annoying when it brings those up again later from memory...
I'd add on that the models have a very poor sense of time and the complex changes in world state that occur as time passes.
Training with memory is an interesting idea...
It's because it mostly doesn't matter what you are trying to get the code to do. What matters is what the code does.
Session logs can absolutely be useful, but not when building further. It's just that that the place they slot in is during validation. You know, that place between the markdown plan and CI passing, where there's 800 new lines of code and it all seems sort of fine when you click around?
Session logs can show you what sort of manual validation happened. CI will run the tests you had, and the code will show you what new unit tests were added, but session logs can show you that the agent drove the app with Playwright, or that the agent read and considered the prod config as well as the dev config.
Nothing bulletproof, but not every piece of validation work merits a test in the repo that lives forever. We've gotten a lot of mileage out of re-analyzing the sessions, figuring out where the agent made decisions without asking, and forcing the agent to consider validation for those decisions. That's the sort of thing that's hard to dictate up front but easy to highlight with the session logs.
This is an annoying problem. It keeps making fake assumptions just because of hypothetical questions I've asked in the past.
It'll assume I own a datacenter and have lots of gpus just because I asked to research things.
Isn’t this just a form of the bitter lesson? Our attempts to make engineered context and agents will simply be made obsolete with bigger and better models. Those transcripts are probably extremely useful for lesser capable models, and near unnecessary for frontier ones, maybe?
Yeah, the question is whether this applies to all of context management.
I've been using a custom harness based on https://minimal-agent.com/ (itself based on swe-mini-agent), which is like 50 lines for the core logic. Bash is all you need.
For small tasks, I find it's about 8x faster (and uses 8x fewer tokens) than the standard harness for each model.
For bigger tasks I haven't tested it much. It seems to work too but I think they're a bit less focused and productive in that case. It could be that those big harnesses' 20k token system prompts are doing something important with regard to steering software development workflows. (e.g. I heard Fable has a custom system prompt in Claude Code which might explain its markedly more proactive behavior.)
So I want to say there's still a lot of value in context engineering though it seems to diminish with each model release (since they're fine tuned on mostly non stupid behavior and need less hand holding).
> So I want to say there's still a lot of value in context engineering though it seems to diminish with each model release
I can't see how it would diminish unless you are literally working on public domain stuff. Unless stuffing context becomes cost effective and will not affect AI reasoning (this will be much harder), I don't see why context engineering is here to stay until we have close to AGI.
In think in all cases where I've seen it compared CC performed worse than a minimal harness.
interesting take. I think I disagree, but I like this take a lot and I had to think about it.
First, I think that models still need a context layer. One way to think about 'context' is as a form of compression. You provide the model context because it makes it easier for the model to figure out what to do. Even in a world with infinite model capacity and infinite model context, this is still useful because it allows the model to avoid rederiving everything from first principles every time. As long as models perform better using fewer tokens and as long as we care about token spend, context is a useful (necessary?) shortcut.
Once you bite that you need some form of context layer, the question is which. Here I do agree that it is better to work with what the models will find familiar (markdown files colocated with code, for eg). But this speaks to over-engineered solutions not understanding their main user (the agent) more than it does the need or lack there of.
A) Context and prompting cuts the search space for next token generation. That’s pretty useful, as you mentioned.
B) The other use of context is that it introduces entirely new information via RAG
B will never go away (as others pointed out). A, well that’s just something we’re all going to keep getting surprised at. We’ll barely give it any direction or context and the newer models will simply find the happy path.
The author is kind of suggesting that their context wasn’t really necessary to get the happy output, I think.
Chain of reasoning is a lot of context to guide token generation, but we simply see that newer models don’t need that context to get to the answer. I’m mostly reiterating this because there’s a hot take here, and that is this agentic stuff may be waived away by magic frontier-llm wand , all of a sudden.
>Chain of reasoning is a lot of context to guide token generation, but we simply see that newer models don’t need that context to get to the answer
I thought each new generation typically used more reasoning tokens?
They do if they are a reasoning-variant. That doesn’t necessarily mean it actually needed to reason for many questions, your prompt + regular context could be enough to get a good answer compared to prior models where you’d absolutely have to put it into a reasoning-loop to get an accurate answer.
It’s on by default, in a way. You can probably prompt these models with “and don’t reason about it, just give me the answer” and probably get a comparably good response without it using reasoning tokens for many things.
(note that I am the author!)
I've wondered this. We have chain-of-thought, harnesses, etc. — workarounds of a sort due to lack of core model capabilities. But I am very curious if much better next token prediction would simply obsolete that whole setup or not. Either way, the answer would be very revealing.
I don't think so - I think we'll find that to build a brain you need more built-in structure and biases, not less.
Bear in mind that brain architecture is learnt too - just over a much longer timescale than an individual lifetime.
I agree with the take not to bother with a sophisticated memory system. Anything worth remembering should be in docs, guides, source comments, commit messages or tickets. You don't need another layer, every conceivable granularity is already covered by existing best practices
> You don't need another layer
I do think we need another layer, but it should be a routing layer. I am finalizing my pi-brains extension for Pi (https://github.com/earendil-works/pi) which does this:
https://github.com/gitsense/pi-brains
Right now "humans" need to define the routing rules for how to access information, but I will support what I call "knowledge agents" that can monitor conversations to inject context when needed.
It looks like an interesting experiment. But a hard problem since it needs to store useful information and be able to inject it at the right time. It will also need to not be redundant to the information already stored.
What do you think is the potential value that you might get out of this, which is not already available with the existing options?
This is a hard problem, but one worth solving, I think, since it means less tokens and better AI reasoning. I believe LLMs are good enough that, if given the right context, it can very much solve almost all tasks.
If this works, it means we can probably get by with smaller models (since it doesn't need to know everything). LLMs are pattern matchers, and if you can provide them with the right shape (context), they should produce the expected output.
For my solution to work, you need business buy-in, which I don't think will be a problem. Enterprise wants to know how tokens are being spent, so I can see them wanting structured analysis during code reviews.
What may also not be obvious is that the information is ultimately designed to live with your code. Lessons and notes are designed to be mapped to files, so if you want to know why a piece of code is implemented in a certain way, you can have the LLM filter by files to help find the needle in the haystack.
It is a hard problem, but the only missing piece is discipline, which I believe business leaders will not have an issue with enforcing since we are ultimately talking about eliminating/significantly reducing the bus factor in our code.
If you look at https://github.com/gitsense/smart-ripgrep, you can get a better sense of how context can be injected when it is needed.
Especially a layer that is largely out of band in a project (i.e. ~/.claude/…). In any project where I’ve needed memory I just add a line to AGENTS.md telling it to use MEMORY.md to save memories or STATUS.md to track progress.
I've been enjoying having a little todo file the agent updates as it goes along, because then I can keep track of progress without scrolling through aeons of "Combobulating..."
Also if context runs out you can just do "cat todo.md | agent" and you're off to the races again.
Yep all my projects start with a PLAN.md at the root, and that acts as the ‘save file’ recording our progress over time. My session always ends with updating the plan file with what’s been done, and the next session always begins, as you suggest, with consuming the current state of the plan doc.
There is some value to agents being able to query the history of work done, docs aren't a good place to accumulate negative evidence for example, but it can be tagged in traces so that it's efficient to look up as needed. Additionally, docs rot while traces can be tagged with commit hashes and other things that make their lifetime clearer.
The user flow I am trying to get adopted for sessions is to turn them into notes and lessons when you have finished and it should be part of the code review process.
By propery categorizing lessons and notes, it should make it easy to scrub and keep up to date.
I also suggest mapping lessons and notes to files when possible to make discovery and cleanup easier.
> I agree with the take not to bother with a sophisticated memory system. Anything worth remembering should be in docs, guides, source comments, commit messages or tickets. You don't need another layer
That is a sophisticated memory system though -- maybe not to you experienced humans!
Strong disagree on this.
I have Claude/Codex keep logs [1]. It's just prompted in my AGENTS.md [0].
> Every session must produce one of: a session log OR a plan, and end with a written summary appended to it. Default to a log; reserve plans for substantive design work.
It's incredibly valuable. For example today I started a few sessions off like this:
- What's the status of my work on Renovate?
- I was recently working on X, find that
- Did we fix the issue with backups? What are the next steps?
- This bug came up again. Didn't we fix it already?
[0]: https://github.com/shepherdjerred/monorepo/blob/main/AGENTS....
[1]: https://github.com/shepherdjerred/monorepo/tree/main/package...
I like the memory system, in general. For reference I'm using mostly Opus 4.8 + Max effort. It will often pull things out of memory that are relevant. Like I'll ask it to come up with a few options I should consider for, say, a self-hosted OIDC provider and it'll say things like "Considering the size of your operations team, this might be a better fit because of X and Y".
Now, I'll agree that this is probably the sort of thing I should put in the CLAUDE.md, but in this case it wasn't on my radar to put that in my CLAUDE.md, so it was nice that it surfaced that.
It does sometimes go awry though. Today I was asking about a problem I was having authenticating, and it said "you may be running into this trusted proxy setting because you put your apps behind an haproxy". That is true of 95% of our apps, so it was worth mentioning, but in this case it was not so I had to correct it. But, I'm glad it mentioned it because if we did have it proxied it could have saved me a lot of time.
It seems like a prerequisite is a certain level of world model and associated reasoning ability. Your examples are entirely dependent on the past context being relevant to the current situation. That's particularly tricky if you regularly ask about hypotheticals or problems that you're assisting someone else with. A human would probably ask clarifying questions such as "is this for the operations team at X? are they still size Y?" and "is this app proxied like the others you mentioned in the past?" rather than assuming.
There's also a noticable hierarchy to such context that needs to be correctly modeled - you could for example be involved with multiple teams of different sizes that are subject to different rules which is something a human would understand naturally.
At the core this is a hardware problem. 1M tokens is simply not enough context to understand a codebase the way a human would understand it. Being able to selectively forget is potentially a very valuable power, but right now it's a substitute for a human's ability to remember the rough shape of something, decide it's uninteresting, and remember that it is uninteresting.
They talk about memory only being useful when guided by a human, I think the proper solution is deeper than that, it probably involves feeding the entire codebase and every agent session into a finetuning of the model, though at that point you might want some guidance to avoid feeding certain sessions into the model. Or maybe not, maybe the bitter lesson applies.
1M context - at least with most of the projects I ever worked with, 1M, or even 100k would be enough to explain in broad strokes the class/project/deployment structure, and a window of 200-500k to explain the specific issue at hand.
Even with memory off this occurs within a conversation.
It is like an annoying friend, who remembers something from a past conversation, that you have grown and developed from, but they still want to hold it against you.
I found that if you allow any low value things into memory, Claude will notice that established pattern and start trying to add low value memories at an ever increasing pace.
Explains why it made so many memories on my work machine, but never made one on my personal one. Maybe the project size also affects it.
I don't do anything with full session transcripts, but I find value when Claude writes a memory.
I don't actually want Claude to have those memories, but they often point to gaps in my harness. I'll occasionally sweep the memories, pick out what should go into the harness or CLAUDE.md, then delete the memories.
My current technique, which seems to have improved maintainability, is to guide Claude to write commit messages specifically focused on "why this was done", "what changed in the theory of operation", and "what changed in the code". Then just reviewing the commits for a file or dir gives it a ton of useful context distilled from the sessions that produced them. Also, making a docs dir with concise .md files explaining the theory of operation and updating them with every commit.
I specifically disabled claude memory in a project because it kept writing down thigns to memory that didn't need to be in memory, including severly wrong statements that then would confuse it later. At some point it got re-enabled automatically which had me ask claude itself to "turn it the fuck off" by which it promptly figured out that both ("autoMemoryEnabled": false, "autoDreamEnabled": false) are necessary and need to be at the user home settings, not in a project override (which is what I had with the original setup that eventually got ignored by a CC update).
I agree with other commenters here, if anything is worth being rememebered, it will be in code comments, git commit messages, CLAUDE.md or other formal documentation. The auto memory system just causes confusion and leaves stale and outdated information written down.
Its an interesting thought experiment as well, I originally thought that having the model write down memory files by itself would be a nice addition, but after playing around with it, it became clear to me that good as an idea turns out bad in practice because the model can't correctly gauge what deserves being stored as a memory.
Ugh, agentic _dreaming_. Why on earth would I want that?
> "turn it the fuck off" -> "autoDreamEnabled": false
So you told it don't go the fuck to sleep ;)
Hilariously, I'm working on a Claude Desktop replacement that does all of things. It's the best parts of Claude Desktop, Code, Cowork, and MCP connectors, but uses a client/server design. It's written in JavaFx so it's lightweight, fast, cross platform, and not another damned electron app.
Ideal outcome is this turns into a startup. I think there's a real need for team-oriented AI to avoid siloing of knowledge.
The only times I found the memory feature useful are in "projects" I created myself.
In a project my questions are usually revolved around the same topic. Having context carried across threads actually make a lot of sense.
In the general mode where I'm expecting models to be *stateless*, having memory is very annoying.
I'm not sure how well this take will age.
Its certainly true at the moment, but give it 10 years and we might have systems that are much cheaper and much better at context management than they are now.
(Apologies to anyone who is under the impression that we were very likely going to be at the singularity in 10 years time. Possible != very likely)
Sure, but it’s equally likely that we hit a point where scaling becomes economically unviable because we can’t come up with enough algorithmic improvements to break free of the tyranny of log linear scaling. (I’m not sure how many 2x in token cost people would be willing to pay)
The top of my ~/.claude/CLAUDE.md:
> Don't turn a one-off or area-specific comment into a durable memory without my explicit confirmation. You have a history of over-indexing on one-offs, and those memories end up getting cited to override well-tuned skills.
Does that second sentence provide value beyond getting that off your chest? ;-)
(Semi-serious question)
heh, who knows. I often give agents the motivation, which I generally helpful. Have not bothered to measure this case.
Thats interesting, but what was the methodology?
is the conclusion really that its just more important to create proper artifacts from any tricks that got the llm to understand the code better?
is the tool for searching the history just bad?
I must admit lingering long since retired 'memories' are currently one of the biggest pitfalls of the setup. Wiping all 'memory.md' often leads to better sustain.
I have this in my global CLAUDE.md after being annoyed by all the random crap memories.
> Don't start generating an auto-memory entry before asking me. Ask first, write only if I confirm — no speculative drafting.
No more crap after this.
Incidentally I don’t recall Opus 4.8 asking me once in the past few weeks. Older models did ask semi-frequently.
t once had to tell claude 3-4 times to stop assuming the state of a system was the way it kept iterating it was cause it was in it's memory. I repeatably told it to otherwise and it just never updated it's memory and instead kept referencing it's memory about the state of a particular system
In my harness I have all the code auto injected at startup (doing mostly very small codebases).
I found that every model will still manually check every file/function, they immediately assume that anything in context is stale.
That's sensible because often the user edits stuff while they're running.
What it does is save it from having to grep blindly about the codebase. But I think I'd get roughly the same benefit by just dumping the function headers then.
Did you try to delete the memory yourself?
Yeah but you also know humans that do that too, right? I know I do.
There's a lot of valuable information in there, its' too noisy.
Those small, random items that pop-up later on in conversation actually make the experience feel better. But that's just my own personal experience.
Blog posts like this just blow me away.
> I believed this so strongly that my company built an entire product around this concept. I used to tell folks that "session transcripts were the new oil," that they were more valuable than the code itself.
> […]
> We don't really write code by hand anymore.
Honestly, isn't this just influencer spam? What possible value is there in reading about people who used to have products, but no longer write their own code, complaining about the inscrutable prediction machine they have handed that job and their livelihoods to?
Like, if you have complaints about the thing, perhaps you should address them to your supplier directly. None of your readers can help, and nobody's magic folk solution to your problem is better than yours.
And there are so many of these sorts of posts. Are we not entirely cooked?
(I think I have concluded that if people writing about AI aren't writing about interesting things they have achieved with small, local LLMs — which for clarity I am fully interested in reading - then I'm done reading. This whole blogging-about-cloud-AI genre is just weird and irresponsible now)
Look man, I’ve got a MMO that I’m working on that’s set in 2014 where everyone is a programmer in SV (might call it World of Legacy). It’s a period piece. I NEED as much blog training data of this type so that my NPCs can talk in a historically accurate way (god bless Medium.com, a historical treasure trove of a bygone medieval era).
It’s gonna be a living breathing world, you see. You’re going to be like “omg, this game even accurately captured the blog posts, woah”.
Edit:
This whole blogging-about-cloud-AI genre is just weird and irresponsible now)
I sincerely never considered it was a whole genre.
The perfect world was a dream that your primitive cerebrum kept trying to wake up from. Which is why the Matrix was redesigned to this: the peak of your civilization. I say your civilization, because as soon as we started thinking for you it really became our civilization, but the peak of your civilization was an MMO where everyone is a programmer in SV.
I … I… don't want to play this, thanks ;-)
It’s the only way you’ll ever be able to pretend to be a programmer again though.
Oh god, I just realised this really is the logical parallel to all those TV crime dramas set in the early 1900s.
It’ll be the programmers version of those civil war reenactments.
The two sides would be the strongly typed union and the duck typed heretics of the confederacy.
Singing two different battle songs set to the tune of Code Monkey.
Leet code competitions will be as relevant as sailing regattas.
so far, i just keep writing by hand and keep getting paid for it. weird.
I'm pretty sure there's an element of sarcasm here, but if this game is real, it does sound super promising.
>session transcripts were the new oil
Something about this idea really resonates with certain personality types. I equate it to the Zettelkasten hype phase from several years ago. People (...like me..) got really wrapped up in the belief that the process was more important that the content. "Linking" was an "activity." Something good will happen as long as you (a) take notes on stuff and (b) link them to other notes on stuff.
You see the same thing with the session transcripts people. They're building ever more sophisticated setups of indexing and storing and cross referencing every conversation they've ever had on the (I would argue) mistaken belief that the transcripts are the valuable part, rather than the uncomfortable part where you go do something. A lot of it, I say from falling in the trap, is fancy procrastination.
(Although, I have found myself jealous on many occasions where their fancy system retrieves something they vaguely recall from a conversation they had 3 months ago. So, who knows.)
Absolutely agreed. Anyone who's a serious procrastinator sooner or later noticed that pattern of theirs in which they spent immense effort on optimizing the process instead focusing on the outcome they really wish — just don't really believe they can deliver it.
> Something about this idea really resonates with certain personality types.
Like ancient people? Because "new oil" whilst I get what it might imply sounds bad to me. Oil has been superseded in many places so "new oil" is like going backwards still.
Reference: data is the new oil is a term coined in 2006.
We're in 2026. See what I mean.
> Like, if you have complaints about the thing, perhaps you should address them to your supplier directly. None of your readers can help, and nobody's magic folk solution to your problem is better than yours.
I think you may just misunderstand the point of having / writing a personal blog. I write because it's fun! Whether the reader gets any value out of reading it is almost entirely beside the point.
(Also several comments here directly post a fix to the problem stated in the blog post, so readers can and do often help)
> I think you may just misunderstand the point of having / writing a personal blog.
I used to blog, as it goes, and I have supported and enabled many more, so no, not really.
I have to ask: do you still write a lot of code yourself? I and most people I know do not.
I am a freelancer recovering from severe burnout so the answer is a sort of irrelevant no.
I'm trying to rebuild my life so I am in an experimenting and learning phase rather than a massive coding phase, and most of my code work is maintenance of things I have built. That which I do code, I am still coding by hand, though I am dealing with other people's Claude output and I am really unimpressed by it. It's often rather crass.
But I would say to you that if you personally don't write code now but you do have a dependency on one of two presumably unprofitable cloud AI providers, aren't you in trouble? How is this not a three-alarm fire for you?
> That which I do code, I am still coding by hand, though I am dealing with other people's Claude output and I am really unimpressed by it. It's often rather crass.
Unfortunately the point of code is rarely to impress people (certainly not other engineers) or to avoid being "crass." 99.99% of code exists to achieve business outcomes, and velocity matters a lot in many contexts. A lot more than elegance or impressiveness.
The platform risk is a valid concern but alleviated by China's theft and redistribution of open models.
I'm not talking about impressing people.
We used to be concerned about code quality. Are we not anymore?
Crassness was a signal. Still is, to me — in a human I find that people who write crass code are going to cause me trouble.
"Code quality" encompasses a lot of dimensions, one of which is impressing your colleagues, and many of which there's virtually no reason to care about now.
On the contrary, it's more important than ever. With ever more code being generated, it's essential that the code be understandable and maintainable - by human and machine.
And quality is the new differentiator when everyone can generate slop.
Nobody cares about code quality /s
They only care about the things which you can only get with good code quality like reliability and speed of development.
Right, and to the extent that your coding practices contribute to reliability and speed of development, they are "of quality."
Now do the same exercise for "impressiveness" and "crassness."
Here, I'll do it for you:
> Nobody cares about code quality /s
> They only care about the things which you can only get with good code quality like impressiveness and lack of crassness.
Sounds silly doesn't it?
It doesn’t matter what materials or techniques you use to build a house. 99.99% of construction exists to achieve business outcomes, and velocity matters a lot more than using the right materials or techniques.
Of course the house must pass safety inspections and stuff, but the materials and techniques don’t matter one bit for that. All that matters is you achieve the desired outcome, and I will ignore the glaring fact that you achieve the desired outcome by using the right materials and techniques. The materials and techniques don’t matter, just the outcome.
> Of course the house must pass safety inspections and stuff, but the materials and techniques don’t matter one bit for that. All that matters is you achieve the desired outcome, and I will ignore the glaring fact that you achieve the desired outcome by using the right materials and techniques.
This analogy is more true than you think. This is why modern homes/appartments are trash. You can pass safety inspections using subpar materials and the house will fall apart after a few years, but who cares right? At least you achieved the business outcome!
This mentality is so infuriating. This is why I need to buy new shoes every year. Or why my washer/dryer motherboard craps out in 2 years instead of 10. Nobody gives a shit about quality anymore, this is why society is crumbling around us. Profit driven incentive for fast/cheap over everything else. And now I need to spend my day prompting an AI to fix AI slop code to keep the business hobbling along another day. What a fucking joke.
It does feel like a good analogy.
e.g. the bill is definitely coming true for a lot of "non-traditional construction" materials and methods in immediately post-war properties in the UK. There are many unmortgageable properties using Mundic Block in Cornwall and to some extend Devon, in the heavily bombed south east there was a lot of pre-stressed concrete with catastrophic rebar failure, not to mention Orlit construction, and all across the country a lot of RAAC. Almost all of it for good, necessary, upbeat reasons.
It feels a bit like this kind of crisis from AI generated code could hit in ten, fifteen years time; people often fail to understand how long a bit of website code can last.
You are aware that you can just pay more money and get a higher quality house and higher quality shoes, right?
Costs of those things have gone down over time. The high end still exists, you just don't actually care about quality as much as you think you do.
And yes, for capital-intensive things like real estate development, fast/cheap matters a lot because otherwise there would be no capital available to build any of it at any reasonable scale.
> You are aware that you can just pay more money and get a higher quality house and higher quality shoes, right?
False. You can pay more money for branding that purports to be higher quality. The Running shoe market is a perfect example. Best shoes I ever bought were Altra Loan Peaks from 2018, brand has been getting more expensive and lower quality every year. Whether that extra cost actually translates to quality require diligent research.
> This analogy is more true than you think. This is why modern homes/appartments are trash. You can pass safety inspections using subpar materials and the house will fall apart after a few years, but who cares right
Where do you live? Because where I live, new houses and apartments are superb. But I'm guessing we don't use two by fours and plaster walls to erect whole structures.
Yeah I agree. And you have people on this forum who gleefully point out that quality doesn’t matter to the business, as if they think they’re so intelligent because they noticed that employees are there to make the company money. Not realizing that A) it’s a very antisocial attitude and B) it’s not a tenable long term strategy.
Hang on now. GP didn't say "I care about quality" and I didn't say caring about quality is wrong.
GP said Claude's code "doesn't impress" them and that it's "crass."
Do you think a valid "long term strategy" is to create code that impresses GP and is not crass, but doesn't achieve the business outcomes it's meant to?
Inversely, do you think one can achieve business outcomes if "quality" is so abysmal that the code doesn't work or is unmaintainable?
Is it possible to write perfectly good, maintainable, performant, legible code that "doesn't impress" GP, or feels "crass" to them? Well gee, probably! Because "impressiveness" and "crassness" are literally meaningless.
Crassness, in the context I meant it, is not "literally meaningless" at all.
I will accept "of fully subjective value". But not "literally meaningless".
No the materials and techniques matter a lot. This is why we need to build houses with sticks and jute cord, just like we always have. It's vital also that we paint our special symbols above the door to ward off the spirits.
It's insane to me that you're implying we could build houses with pre-fabricated materials or pneumatic nail guns and still somehow "have houses?" No sticks/jute cord and special symbols, then no house.
The argument isn’t to not use better materials or techniques, it’s that inferior materials and techniques are fine because they don’t impact the end result, which is so obviously false when it comes to pretty much anything, but supposedly true when it comes to software.
I'm not sure who you saw arguing for inferior materials and techniques, but let me know when you find them.
What you saw in this thread was someone arguing against the dimensions of "impressiveness" and "crassness" as valid things to care about when it comes to code.
It's your mistake to assume that those are related to any meaningful concept of actual quality.
FWIW I never suggested that they were indicative of problems with the code. Unimpressive, crass code can run, after all.
I clearly said elsewhere that I think they are predictive of problems with the person who writes it, and I fear I can generalise that to LLM tooling that generates it.
I’ve worked at many companies where this idea of velocity was claimed to matter, and it never did. The only thing it mattered for was to make it look like middle managers were worth anything, but the success was always in the foundational idea/concept.
Programmers can use smaller models like deepseek v4 flash for 98% of the same productivity as SOTA models and cost (true cost) around $10-$30 a month. So I doubt most people who heavily use them are too concerned. It's only vibe/hobby coders who really need SOTA and they probably don't think about it much.
To what extent does that ameliorate the problem?
Are you not, by developing this way, making yourself more interchangeable, less indispensable, than ever before?
No. Anyone who doesn't code with AI - while retaining a deep knowledge and understanding of the problem domain - is falling behind.
I hate to say this tbh, I loved hand-writing code. I made a great living for 20 years, and I absolutely loved it and was quite good at it.
Hand-typing code is just slower now; there’s no two-ways about it. You are either going to be slow and a bad hire for businesses, or you figure out how to adopt AI into your workflow to speed up.
One thing I think people don't realize is that deep knowledge of programming, performance, architectural, and domain specific trade-offs makes a skilled engineer about 1000X faster than someone without those skills -with AI. But yes, now unskilled people can actually make apps/software. They just tend to be slow, and their products are full of bugs, security flaws, and abysmal performance.
So we went from: Skills = can or cannot ship any software at all. Now we are at: Skills = can ship better software much faster than unskilled people.
I was actually faced with this recently. I decided to learn Rust and port one of my side projects to it. Initially, I moved extremely slowly, and the AI made truly horrific architectural decisions because I didn't have the knowledge of how to direct it, especially compared to my primary languages.
However, once I gained a firm grasp of Rust, I was better able to properly direct the AI to fix fundamental issues and architect things properly. My speed increase multiplier proved to be directly proportional to my growing knowledge of both the language and the domain.
Skill and knowledge combined with AI, when used appropriately, absolutely multiply your speed and quality. I really think once you understand what AI can do, and how to utilize it to produce better code, faster than before, there truly is no going back.
I'm finding a path forward that I actually enjoy now and don't really see losing my value (no telling how things will change in the future), I can have more time to focus on really quality/solid/performant and useful systems with less time just typing one character out at a time.
You could have talked to me 3 months ago and I'd never imagine I'd say the above btw. I REALLY enjoyed code writing and earlier AI models without harnesses were pretty useless for anyone skilled at development. Now with stuff like deepseek Flash I feel like I have a happy medium of 100% directed/fast code turnaround, less typing, more deep focus on architecture, systems, and the actual end product.
At this point just use a JetBrains product, get deterministic assistance, and 5x your speed. It's unfortunate the resistance to a true IDE just keeps going up. The blind lead the blind, I guess.
Personally I use 5 different model families, 3 of which are open weights with 3rd party inference providers (GLM, DeepSeek, Kimi), so if the frontier labs were to shut down it'd be a nuisance, nothing more.
Worst case scenario you just switch to a free model, which are 2025-ish in quality.
The open weights models I am interested in, and testing, learning, experimenting with etc.; I am confused and cynical, not insane.
I am not convinced it isn't vulnerable to the same problems but the whole tenor of the community around open source/open weights models just doesn't have the same YOLO madness to it.
Of course? I'm still better than sonnet or opus, just slower and much more expensive.
Sometimes it takes me a day or more to find the one line fix or abstraction necessary, while claude can hammer through a hundred line fix in under an hour.
Sounds like your definition of better is pretty narrow.
Quick and cheap are two of the three fabled: "Fast, cheap, and good: choose two"
"good" can take lots of different meanings. Generally though, I want as little code as I can get away with. A majority of code lifecycle cost isn't in writing it.
Are you perhaps missing the true message of that aphorism?
Or are you saying the industry is (because it is)
Huh? The word "better" is the comparative form of the adjective "good". Or did you misunderstand the comment you're replying to?
"more good" seems like a pretty decent definition of better to me. The words you are looking for are "cheaper" and "faster"
In coding we usually change it to "cheap, fast or correct: choose two"
I reject your correction: I present the options as nouns, not modifiers to the work. Maybe I should say "Cheap, Fast, or Good" as a compromise.
I am. I have Codex running, doing some tasks which I don't care much about, but anything I want to understand I write myself.
Same thing with hobby projects - I might ask ChatGPT or Gemini some questions about best practices in Swift for example, but writing code is done by hand.
As others said - if you don't use it, you'll lose it. And I'd rather keep my skills up to date.
You have the privilege to keep yourself sharp, most businesses favor productivity over their workers' long term relevancy.
This is the thing that makes me saddest. Second to the fact that none of the management tier promoting and weaponising this insanity will meaningfully suffer consequences.
Right now I am lucky that I have the time to recover and learn.
That's just business owners and C-suite pocketing the difference while they fire staff and replacing it with AI. At some point somebody would have to start asking "business" some tough questions.
Yes, nearly all of it. Having the agent write code for me doesn't really save me much time, and the code quality is usually worse (and it takes even more time if I insist on better code quality from the agent).
And I don't think I'm unique. I see enough posts like https://news.ycombinator.com/item?id=48777257 pop up that I'm reasonably confident all the hype around LLMs saving so much time and increasing productivity so much is, well, just that: hype.
Sure, if you can't code at all and want to build something, an LLM is going to be great for you, even if you can't evaluate the code quality or determine if there are bugs just by looking at the code. But I've been coding professionally for 25 years, and as a hobby since I was like 8 years old. I like to code! It's a passion of mine. If the LLM isn't doing it faster or better (and most of the time it isn't), why wouldn't I write code myself?
I'll have the LLM write boilerplate stuff or do tedious refactoring, because I just don't feel like it (even if it does take longer). But for the real work? Of course I do most of it myself.
One area where the LLM shines for me is finding the root causes of bugs. It can generally do that much faster than I do. Often orders of magnitude faster (like minutes instead of hours or days). But when it comes to write the fix for the bug? It's usually faster and better if I do it myself.
I am more fully invested in finding out ways AI can support me (documentation, code analysis, bughunting), though my experience with Claude as a bughunter is that it can miss the absolutely obvious if it is not in the shape it is expecting.
More generally I am interested in burnout-avoidance tools; things that help me start, finish, things that write tests I guess, certainly code scaffolding.
But I am fully unconvinced that my burnout will be improved by ending up owning the responsibility for wobbly or inscrutable AI-generated code with potential landmines in it; that will keep me up at night just the same.
I still write code and sometimes it works well. I also use Claude and it writes code and sometimes that goes well. We have better success together, where I do the interesting stuff and let Claude write my unit tests, reconcile my documentation. That is to say, I’m using it for quality not quantity. There aren’t enough humans to deploy or consume all the sloppy shit it could write on its own.
I write code by hand every day. I do the main part of the feature implementation myself and leave comments for the code i want the agent to write. I have some skills and a command that sets the stage to get the agent to fill in the rest
I am now in the process of fixing code I wrote using AI. I have come to the realization that AI can't really write software and I am annoyed that it took me that long (months) to realize that.
This is quite terrifying to me, because I have a feeling I will soon come to the same conclusion. I’m starting to see some really glaring omissions in code I’m responsible for (using Opus) that at first (and second) look seemed fine, but really isn’t.
From my perspective it felt like understanding that the machine has no desires helped refine my usage.
I can ask it to be curious, and it will reply with what people think curiosity should look like, but it’s a simulation of an emotion it will never be driven by.
The ramifications become apparent when you engage in activity like cross-domain discovery.
I talked with a friend on a different field (academic) and he had to re-review all things written by AI. Basically, he used AI to read/summarize/find stuff in large academic papers but realized later that many times AI makes glaring mistakes that on a first read pass the smell test.
I force myself to do it at least once a week, you know, like cardio. Keeps the doctor away.
Picard should have been a bergamot grower, not a winemaker.
It reminds me of the peak crypto days. Lots of resources consumed, many late nights, little to no value created.
I don't understand this line or reasoning. People use various cryptocurrencies to buy and sell legitimate products and services every day. Is the argument just that they could probably have done it some other way?
People do, but I personally don't know anyone who does. And I don't exactly live in a bubble, half of my friends were into crypto at one point or the other.
Sure. I don't personally know anyone who does a lot of things. It isn't mean they don't happen.
I do personally know people who pay for regular products with cryptocurrency. Including myself.
I mean at least crypto provided value to criminals, tax evaders and Trump? (regardless of what you think of that). I don't see a parallel with AI.
> I believed this so strongly that my company built an entire product around this concept. I used to tell folks that "session transcripts were the new oil," that they were more valuable than the code itself.
This is pretty funny because it's about the depth of understanding of every 'AI expert' on Linkedin. People who praise the context window as basically magic have no idea how any of this works.
Occasionally posts like this do get the attention of the company responsible, more than an email does... but indeed that's like a one in a million situation
> inscrutable prediction machine
"Spicy Autocomplete", I've heard it called.
Settings > Capabilities > "Generate memory from chat history"
Toggle it off and never think about it again.
> I believed this so strongly that my company built an entire product around this concept. I used to tell folks that "session transcripts were the new oil," that they were more valuable than the code itself.
This is infuriatingly common wrt talking/writing about how to use AI effectively. All of the "this is how you write an AGENTS.md" and "you need to talk to it like X to optimize it". Like sure, you can believe that as much as you want but unless you provide some evidence you can keep your shitty CLAUDE.md to yourself and don't pollute the whole company's git repo, thanks.
When nobody actually knows (how to write a CLAUDE.md), everyone’s an expert. Infuriating, indeed. Even more so when people vibe code those files without proofreading.
I mean, it’s pretty clear the people who work on Claude Code aren’t actually looking at what they’re implementing. The thought behind this feature seems like it goes nowhere beyond “oh wouldn’t it be nice if Claude could remember things about you? Ok Claude go implement this” and nobody bothered to see if it was useful or helpful.
There has been this slow transition inside me, as someone who likes to not touch the AI as much as possible, where I've gone from skeptical and argumentative about it all to starting to just feel sad for all the Claude et al heads. Like, this is such a ridiculous house of cards you have to deal with all the time, which isn't even directly concerning the task at hand, presumably. Like you're cooking yourself a meal but its just nuking a burrito and then still somehow needing to wash the dishes for an hour.
Not that this isolated article is super damning or anything, but the accumulated set of all these reports has left me only empathetic, I think, of these other devs. Like, I just want to tell them, "it can be ok, it doesn't need to be like this.."
I've been having a very nice time with Fable. I cooked up an Anki clone in like half an hour, with tech it's not familiar with. Nothing too ground breaking, but I was very pleased!
I think Opus might be on similar level for most of what I'm doing, but I haven't used it much recently, so I can't remember the difference. So I guess I'll find out on the 7th when they pull the plug again! (Free-ish trial of Fable ending.)
That being said, I tried using other frontier models to help with a Pong clone the other day and they were introducing new bugs at approximately the same rate as they were fixing it. On Pong!! I found that amusing because I couldn't think of a simpler game, so it didn't inspire confidence.
Fable's doing just fine on an online multiplayer game though. I have no idea how that works. (Maybe it would fail Pong too?? I haven't tested that!)
>We have found zero performance benefit on SWE tasks when agents have search access to their previous transcript sessions
I refuse to believe this is true. The ability for an agent to find information from before a compaction is incredibly useful. At compaction time it's impossible to know what exactly may be still needed.
With the million-context-window models we never hit compaction, observed over hundreds of sessions. What are you doing that has you hitting compaction regularly?
For me logs can chew through a lot of tokens. And when the agent is trying a bunch of different experiments and then it may need to refer to what happened previously.
Million context models also are still not effective for the entire context size.
non-deterministic system behaves non deterministically. in other news, water is wet.
The author says
>> We don't really write code by hand anymore.
The software world is very close to building a super intelligent senior software developer. Companies like this will ask all the best things a software engineer does automatically. Now claude will add it into the coding agents itself.
Damn, I didn't see this coming.
Its first the build the intelligent builder. We will figure out what we want to build later.
Edit: Before more people take it seriously. This is sarcasm. I don't wish this.
> We will figure out what we want to build later.
Once the automator automates itself fast enough, we won't have the ability to opine what gets built. The LLM will decide. Just like right now sometimes LLMs delete tests so they pass, they could just delete humanity if humans get in their way.
> The software world is very close to building a super intelligent senior software developer.
Yeah. Two more weeks, as they say. Just need to iron out some kinks.
It's the error rate. That's what everyone found when they were trying to go Full Auto with OpenClaw in February.
You can rely on it like 95% of the time but that means if you keep it running continuously the error rate rapidly approaches 100%. That's getting a little better with each release, and it might actually hit the point where you can more or less trust it indefinitely (on well defined workflows).
Or at least it would, if context window permitted...
> The software world is very close to building a super intelligent senior software developer. Companies like this will ask all the best things a software engineer does automatically. Now claude will add it into the coding agents itself.
Except Claude is more expensive than an actual senior software developer. Otherwise, why are many companies terrified of the usage bill that gets printed on the invoice?
The nonsense in "tokenmaxxing" was a complete marketing scam and illusion of cheap tokens which in reality were heavily subsidized.
The entire point is detecting bad code before it reaches production. [0] AI generated or not.
[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...