The sharp results all came from pairing domain expertise with detailed AGENTS.md files. The impressive Rust output happened because someone who knows Rust was steering it. Vague prompts got mediocre output. A model on its own converges to the mean of its training data, which is why the "vibe code everything" thesis keeps not holding up: https://philippdubach.com/posts/the-impossible-backhand/
"Your codebase must be bad" is backwards. Agents perform better on well-structured codebases with clear interfaces and good tests. They struggle on legacy spaghetti.
The mental model that works: treat them like junior engineers. Clear issue, bounded scope, review their PR. Output quality tracks directly with task scoping.
Where it genuinely falls apart: anything requiring context about why something was designed a certain way. Agents can read code, they can't read the Slack thread from 2023.
This is my favorite yet of the genre of "OK, coding agents got good in November" posts. It starts with relatively simple examples (YouTube metadata scraping) and by the end Max is rewriting Python's scikit-learn library in Rust and making it way faster.
Thanks Max! This was a really interesting article and closely matches my own experience with how the agents have been progressing.
One of the takeaways I get when reading skilled engineers' experiences with these tools is that they essentially offer leverage: the more skill someone already has, the higher their ceiling will be.
The ivraatiems vs rudiksz exchange is the most telling part of this thread. Both are probably reporting their experience accurately, which means the variance in agent coding outcomes is enormous. Same model, same capabilities, wildly different results depending on the scaffolding around it.
7777777phil is right that the AGENTS.md file is the actual differentiator. But it's also a manual workaround for a real gap — there's no structured way for agents to carry context across sessions. Each session starts cold. You're hand-maintaining the agent's long-term memory in a markdown file.
The tooling around agents matters as much as the agents themselves. A model that produces excellent code but can't track what it assumed or skipped is only half a system.
I second that spending effort on your AGENTS.md is game changing. Don't auto-generate these; work with them and learn how to make them good (sparknotes and a table of contents, keep them minimal, distribute them across directories).
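For illustration, a minimal sketch of the shape I mean for a root-level file (the paths and commands below are made up):

    # AGENTS.md (repo root -- example layout, paths here are hypothetical)
    ## Map
    - services/api/    -> HTTP layer; details in services/api/AGENTS.md
    - services/worker/ -> background jobs; details in services/worker/AGENTS.md
    - gen/             -> generated code, never edit by hand
    ## Conventions
    - Run `make test` before proposing changes; CI runs the same target.
    - Prefer small PRs; one behavior change per branch.
    ## Gotchas
    - Test fixtures use mocked auth tokens; real ones come from the secrets service.

The per-directory files repeat the same pattern at a smaller scale, so the agent only pulls in the context relevant to where it's working.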
This investigation aligns with my experience in a lot of ways. I'm against the influence and behavior of the big AI companies, but lukewarm-to-pro on the actual use of the technology itself.
I use Claude and other models frequently (mostly via Cursor, with a smattering of other tools) in my work now. It is not at the "I never write code myself" point, but the AI tools are absolutely capable of generating highly effective and usable code, usually nearly as good or as good as what I'd do myself, with guidance.
It hasn't eliminated the need for my existence as an engineer, but it has changed it drastically. It is much more like "tell the computer what I want and mostly get it" than it was a year ago.
And yet, I have friends and colleagues who reject it out of hand as useless, and are so skeptical of it that they suggest it must only be good because my skills are poor, or our codebase is bad, or I'm getting lucky.
I just can't totally credit any of those explanations anymore.
Claude isn't generating "highly effective and usable code". I'm not your friend, but I also reject your claims out of hand because I've also seen what Claude can and can't do.
If you're not getting effective and usable code out of modern Claude, you don't know how to use it.
this post reflects my experience with the model...