We're going to do it again, aren't we? We're going to take something simple and sensible ("write tests first", "small composable modules", etc.), give it a fancy complicated name ("Behavior-Constrained Implementation Lifecycle pattern", "Boundary-Scoped Processing Constructs pattern", etc.), and create an entire industry of consultants and experts selling books and enterprise coaching around it, each swearing they have the secret sauce and the right incantations.
The damn thing _talks_. You can just _speak_ to it. You can just ask it to do what you want.
I'm confused. Are you criticising the article, or simply expressing concern for what may happen?
The context suggests the former, but your criticisms bear no relation to the linked content. If anything, your edict to "write tests first" is even more succinctly expressed as "Red/green TDD".
People are rushing to be the first one to coin something and hit it big. Imagine the amount of $$$ you could get for being an "expert ai consultant" in this space.
There was already another attempt at agentic patterns earlier:
There's a recurring theme in these agentic engineering threads that is worth calling out: the lessons are almost always stated as universal, but are deeply dependent on team size, codebase maturity, test coverage, and risk tolerance. What gets presented as a "win" for a well-instrumented backend service could easily guide those working on UI-heavy or old code down the wrong path. The art of this might be less about discovering the correct pattern, and more about truthfully declaring when a pattern applies.
I work as a consultant so I navigate different codebases, old to new, typescript to javascript, massive to small, frontend only to full stack.
Claude Code experience is massively different depending on the codebase.
Good E2E tests and a strongly typed codebase? It can one-shot any feature; some small QA, some polishing, and it's usually good to ship.
Plain JavaScript? Object-oriented? Injection? Overall magic? Claude can work there, but it's not a pleasant experience and I wouldn't say it accelerates you that much.
That's a good call out. The reason I'm doing this as a website and not a book is that this stuff changes all the time and I want to update it, so one of the things I'll try to do is add notes about when and where each pattern works as those constraints become clear.
I use AI in my workflow mostly for simple boilerplate, or to troubleshoot issues/docs.
I've dipped into agentic work now and again, but never been very impressed with the output (well, that there is any functioning output is insanely impressive, but it isn't code I want to be on the hook for maintaining).
I hear a lot of people saying the same, but similarly a bunch of people I respect saying they barely write code anymore. It feels a little tricky to square these up sometimes.
Anyway, really looking forward to trying some of these patterns as the book develops to see if that makes a difference. Understanding how other people really use these tools is a big gap for me.
One thing I rarely see mentioned is that often creating code by hand is simply faster (at least for me) than using AI. Creating a plan for AI, waiting for execution, verifying, prompting again etc. can take more time than just doing it on my own with a plan in my head (and maybe some notes). Creating something from scratch or doing advanced refactoring is almost always faster with AI, but most of my daily tasks are bugs or features that are 10% coding and 90% knowing how to do it.
I think this is the main point where many people’s work differs. Most of my work I know roughly what needs changing and how things are structured but I jump between codebases often enough that I can’t always remember the exact classes/functions where changes are needed. But I can vaguely gesture at those specific changes that need to be made and have the AI find the places that need changing and then I can review the result.
I rarely get the luxury of working in a single codebase for a long enough period of time to get so familiar with it that I can jump to particular functions without much thought. That means AI is usually a better starting point than me fumbling around trying to find what I think exists but I don’t know where it is.
I've heard people say that these coding agents are just tools and don't replace the thinking. That's fine but the problem for me is that the act of coding is when I do my thinking!
I'm thinking about how to solve the problem and how to express it in the programming language such that it is easy to maintain. Getting someone/something else to do that doesn't help me.
But different strokes for different folks, I suppose.
Yes, it's often faster if you sit around waiting. What I will do instead is prompt the AI to create various plans, do other stuff while they do, review and approve the plans, do other stuff while multiple plans are being implemented, and then review and revise the output.
And I have the AI deal with "knowing how to do it" as well. Often it's slower to have it do enough research to know how to do it, but my time is more expensive than Claude's time, and so as long as I'm not sitting around waiting it's a net win.
I do this too, but then you need some method to handle it, because now you have to read and test and verify multiple work streams. It can become overwhelming. In the past week I had the following problems from parallel agents:
Gemini running a benchmark: everything ran smoothly for an hour, but on verification it turned out it had hallucinated the model used for judging, invalidating the whole run.
Another task used Opus and I manually specified the model to use. It still used the wrong model.
This type of hallucination has happened to me at least 4-5 times in the past fortnight using opus 4.6 and gemini-3.1-pro. GLM-5 does not seem to hallucinate so much.
So if you are not actively monitoring your agent and making the corrections, you need something else that is.
You need a harness, yes, and you need quality gates the agent can't mess with, and that just kicks the work back with a stern message to fix the problems. Otherwise you're wasting your time reviewing incomplete work.
Glancing at what it's doing is part of your multitasking rounds.
Also, instead of just prompting, having the AI first write a quick summary of exactly what it will do (a plan including class names, branch names, file locations, specific tests, etc.) is helpful before I hit go, since the plan outline is smaller and quicker to correct than the code.
That takes more wall clock time per agent, but gets better results, so fewer redo steps.
For me it _can_ be faster to code than to instruct, but it takes me significantly less effort to write the prompt than the actual code. So a few hours of concentrated coding leave me completely drained of energy, while after a few hours with the agents I still have a lot of mental energy. That's the huge difference for me and I don't want to go back.
That's interesting. While I do get mentally tired after a session of focused coding, I feel like I have accomplished something. Using AI for coding feels similar to spending hours doom-scrolling reels: less engaging, but I'm drained as hell at the end.
My way of phrasing this: I need to activate my personal transformers on my inner embedding space to really figure out what it is that I truly want to write.
I delegate to agents what I hate doing, e.g. when creating a SaaS web app, the last thing I want to waste my time on is the landing page with about/pricing/login and Stripe integration frontend/backend - I'll just tell Claude Code (with Qwen3-Coder-Next-Q8 running locally on RTX Pro 6000) to make all this basic stuff for me so that I can focus on the actual core of the app. It then churns for half an hour, spews out the first version where I need to spend another half an hour to fix bugs by pointing errors to Claude Code and then in 1 hour it's all done. I can also tell it to avoid all the node.js garbage and do it all in plain HTML/JS/CSS.
But can you mentally "keep hold" (for lack of a better term) of those tasks that are getting executed in parallel? Honestly asking.
Because, after they're done/have finished executing, I guess you still have to "check" their output, integrate their results into the bigger project they're (supposedly) part of etc, and for me the context-switching required to do all that is mentally taxing. But maybe this only happens because my brain is not young enough, that's why I'm asking.
I think the difference is that you're applying a standard of correctness, or of personal understanding of the code you're pushing, that is being relaxed in the "agentic workflows".
I have the AI integrate their results themselves. That's if anything one of the things they do best. I also have them do reviews and test their own work first before I check it, and that usually makes the remaining verification fairly quick and painless.
> I've dipped into agentic work now and again, but never been very impressed with the output (well, that there is any functioning output is insanely impressive, but it isn't code I want to be on the hook for maintaining).
> I hear a lot of people saying the same, but similarly a bunch of people I respect saying they barely write code anymore. It feels a little tricky to square these up sometimes.
It squares up just fine.
You ever read a blog post or comment and think "Yeah, this is definitely AI generated"? If you can recognise it, would you accept a blog post, reviewed by you, for your own blog/site?
I won't; I'll think "eww" and rewrite.
The developers with good AI experiences don't get the same "eww" feeling when reading AI-generated code. The developers with poor AI experiences get that "eww" feeling all the time when reviewing AI code and decide not to accept the code.
I also will rewrite both text and code created by Gen AI. I've found the best workflow for me is not to refine what I've written, but instead to use it to help me get over humps and/or crank through some of the drudgery. And then I go back and edit, fixing any issues I spot and to reshape it to be in my own voice.
I’m not OP but every time I post a comment with this sentiment I get told “the latest models are what you need”. If every 3 months you are saying “it’s ready as long as you use the latest model”, then it wasn’t ready 3 months ago and it’s not likely to be ready now.
To answer your question, I’ve tried both Claude code and Antigravity in the last 2 weeks and I’m still finding them struggling. AG with Gemini regularly gets stuck on simple issues and loops until I run out of requests, and Claude still just regularly goes on wild tangents not actually solving the problem.
I don’t think that’s true. Claude Opus 4.5/4.6 in Cursor have marked the big shift for me. Before that, agentic development mostly made me want to just do it myself, because it was getting stuck or going on tangents.
I think it can (and is) shifting very rapidly. Everyone is different, and I’m sure models are better at different types of work (or styles of working), but it doesn’t take much to make it too frustrating to use. Which also means it doesn’t take much to make it super useful.
> I don’t think that’s true. Claude Opus 4.5/4.6 in Cursor.
Opus 4.6 has been out for less than a month. If it was a big shift, surely we'd see a massive difference over 4.5, which was November. I think this proves the point: you're not seeing seismic shifts every 3 months, and you're not even clear about which model was the fix.
> I think it can (and is) shifting very rapidly.
Shifting, maybe. But it's shuffling deck chairs every 3 months.
It depends on what you're handling. Frontend (not CSS), Swagger, mundane CRUD is where it shines. Something more complex that needs a bit harder calculation usually leaves the agents struggling.
It's especially good for navigating code you're unfamiliar with. If you know the code well, you'll usually find it faster to debug and code by yourself.
I thought this too and then I discovered plan mode. If you just prompt agent mode it will be terrible, but coming up with a plan first has really made a big difference and I rarely write code at all now
Have you tried it with something like OpenSpec? Strangely, taking the time to lay out the steps in a large task helps immensely. It's the difference between the behavior you describe and just letting it run productively for segments of ten or fifteen minutes.
Agree, it’s strange, I will just assume that the people who say this are building react apps. I still have so much ”certainly, I should not do this in a completely insane way, let me fix that” … -400+2. It’s not always, and it is better than it was, but that’s it.
At this point though, after Claude C Compiler, you've got to give us more details to better understand the dichotomy. What do you consider simple issues?
Perfect example. You mean the C compiler that literally failed to compile a hello world [0] (which was given in its README)?
> What do you consider simple issues?
Hallucinating APIs for well documented libraries/interfaces, ignoring explicit instructions for how to do things, and making very simple logic errors in 30-100 line scripts.
As an example, I asked Claude code to help me with a Roblox game last weekend, and specifically asked it to "create a shop GUI for <X> which scales with the UI, and opens when you press E next to the character". It proceeded to create a GUI with absolute sizings, get stuck on an API hallucination for handling input, and also, when I got it unstuck, it didn't actually work.
Claude C compiler is 100k LOC that doesn’t do anything useful, and cost $20k plus the cost of an expert engineer creating a custom harness and babysitting it.
But the most important thing is that they were reverse engineering gcc by using it as an oracle. And it had gcc and thousands of other c compilers in its training set.
So if you are a large corporation looking to copy GPL code so that you can use it without worrying about the license, and the project you want to copy is a text transformer with a rigorously defined set of inputs and outputs, have at it.
Pretty recently (a couple weeks ago). I give agentic workflows a go every couple of weeks or so.
I should say, I don't find them abysmal, but I tend to work in codebases where I understand the code and the patterns really well. The use cases I've tried so far do sort of work, just not (yet, at least) faster than I'm able to actually write the code myself.
> My experience with what coding assistants are good for shifted from:
> smart autocomplete -> targeted changes/additions -> full engineering
Define "full engineering". Because if you say "full engineering" I would expect the agent to get some expected product output details as input and produce all by itself the right implementation for the context (i.e. company) it lives in.
I still write code but do not push everything off to the agent. Try my best to write small tasks. ~20% of the time I have to get in there. If someone says they're absolutely not writing a line of code they must have amazing guardrails.
> It feels a little tricky to square these up sometimes.
In my experience, this heavily depends on the task, and there's a massive chasm between tasks where it's a good and bad fit. I can definitely imagine people working only on one side of this chasm and being perplexed by the other side.
My experience is that the first iteration output from a single agent is not what I want to be on the hook for. What squares it for me with "not writing code anymore" is the iterative process to improve outputs:
1) Having review loops between agents (spawn separate "reviewer" agents) and clear tests / eval criteria improved results quite a bit for me.
2) Reviewing manually and giving instructions for improvements is necessary to have code I can own
Is that… actually faster than just doing it yourself, tho? Like, “I could write the right thing, or I could have this robot write the wrong thing and then nag it til it corrects itself” seems to suggest a fairly obvious choice.
I’ve yet to see these things do well on anything but trivial boilerplate.
I was in the same boat as you until I saw DHH post about how he’s changed his use of agents. In his talk with Lex Fridman his approach was similar to mine and it really felt like a kernel of sanity amongst the hype. So when he said he’s changed his approach I had another look. I’m using agents (Claude code) every day now. I still write code every day too. (So does Dax Raad from OpenCode to throw a bit more weight behind this stance). I’m not convinced the models can own a production code base and that therefore engineers need to maintain their skills sufficiently to be responsible. I find agents helpful for a lot of stuff, usually heavily patterned code with a lot of prior art. I find CC consistently sucks at writing polars code. I honestly don’t enjoy using agents at all and I don’t think anyone can honestly claim they know how this is going to shake out. But I feel by using the tools myself I have a much stronger sense of reality amongst the hype.
I strongly agree with that last statement—I hate using agents because their code smells awful even if it works. But I have to use them now because otherwise I’m going to wake up one day and be 100% obsolete and never even notice how it happened.
These lessons get obliterated with every new LLM generation. LangChain, for example, started on weak models with small context windows, creating a crazy architecture around them to bypass their limitations; that got completely obliterated when GPT-3.5 was released, yet people still use it and overcomplicate things. Rather, look at where the puck is going: we might soon not need more than a single agent to do everything, given that context sizes keep increasing, agents can use more tools, and we might get some in-call context cleanup at some point as well, which would allow an agent to spin forever instead of calling subagents due to context size limitations.
I'm trying to include patterns that work independently of model releases.
It's tricky though. Take "red/green TDD" for example - it's perfectly possible that models will start defaulting to doing that anyway pretty soon.
In that case it's only three words so it doesn't feel hugely wasteful if it turns out not to be necessary - and there's still value in understanding what it means even if you no longer have to explicitly tell the agents to do it.
The biggest takeaway for me from LLMs is that the implementation details no longer matter. If you have sufficiently detailed tests and requirements, there is going to be a robot that will roll the dice until it fits the tests and requirements.
I've experimented with agentic coding/engineering a lot recently. My observation is that software that is easily tested is perfect for this sort of agentic loop.
In one of my experiments I had the simple goal of "making Linux binaries smaller to download using better compression" [1]. Compression is perfect for this. Easily validated (binary -> compress -> decompress -> binary) so each iteration should make a dent otherwise the attempt is thrown out.
Lessons I learned from my attempts:
- Do not micro-manage. AI is probably good at coming up with ideas and does not need your input too much
- Test harness is everything, if you don't have a way of validating the work, the loop will go stray
- Let the iterations experiment. Let AI explore ideas and break things in its experiment. The iteration might take longer but those experiments are valuable for the next iteration
- Keep some .md files as scratch pad in between sessions so each iteration in the loop can learn from previous experiments and attempts
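The validation step of that loop can be sketched roughly like this. This is a minimal illustration, not the author's actual harness: `zlib` stands in for whatever compressor the agent is evolving, and the helper name is invented.

```python
import zlib

def validate_attempt(data: bytes, best_size: int) -> tuple[bool, int]:
    """Accept an attempt only if it round-trips AND makes a dent.

    Mirrors the described loop: binary -> compress -> decompress -> binary,
    with anything that fails either check thrown out.
    """
    compressed = zlib.compress(data, 9)
    # Round-trip check: a lossy or broken attempt is rejected outright.
    if zlib.decompress(compressed) != data:
        return False, best_size
    # Progress check: keep only attempts smaller than the current best.
    if len(compressed) >= best_size:
        return False, best_size
    return True, len(compressed)

# Highly repetitive stand-in for a binary; real ELF bytes compress less well.
sample = b"ELF binary bytes would go here" * 100
ok, new_best = validate_attempt(sample, best_size=len(sample))
print(ok, new_best)
```

The point is that both checks are deterministic and cheap, so the agent can run them after every single iteration without human involvement.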
You have to have really good tests as it fucks up in strange ways people don't (because I think experienced programmers run loops in their brain as they code)
Good news: agents are good at open-ended work like adding new tests and finding bugs. Do that. Also do unit tests and Playwright. Testing everything via web driving seemed insane pre-agents, but now it's more than doable.
the .md scratch pad point is underrated, and the format matters more than people realize.
summaries ("tried X, tried Y, settled on Z") are better than nothing, but the next iteration can mostly reconstruct them from test results anyway. what's actually irreplaceable is the constraint log: "approach B rejected because latency spikes above N ms on target hardware" means the agent doesn't re-propose B the next session. without it, every iteration rediscovers the same dead ends.
ended up splitting it into decisions.md and rejections.md. counter-intuitively, rejections.md turned out to be the more useful file. the decisions are visible in the code. the rejections are invisible — and invisible constraints are exactly what agents repeatedly violate.
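For what it's worth, a constraint-log entry along those lines might look something like this (the approaches, numbers, and dates here are invented placeholders, not from any real project):

```markdown
## rejections.md

- Approach B (dictionary preload): rejected 2024-05-12. Latency spiked
  above 50 ms on target hardware. Do not re-propose unless the latency
  budget changes.
- Approach D (per-block recompression): rejected. Round-trip check
  failed on binaries with overlapping segments.
```

The useful property is that each entry records the *why*, so the next session inherits the constraint instead of rediscovering the dead end.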
"Test harness is everything, if you don't have a way of validating the work, the loop will go stray"
This is the most important piece to using AI coding agents. They are truly magical machines that can make easy work of a large number of development, general purpose computing, and data collection tasks, but without deterministic and executable checks and tests, you can't guarantee anything from one iteration of the loop to the next.
The test harness point is the one that really sticks for me too. We've been using agentic loops for browser automation work, and the domain has a natural validation signal: either the browser session behaves the way a real user would, or it doesn't. That binary feedback closes the loop really cleanly.
The tricky part in our case is that "behaves correctly" has two layers - functional (did it navigate correctly?) and behavioral (does it look human to detection systems?). Agents are fine with the first layer but have no intuition for the second. Injecting behavioral validation into the loop was the thing that actually made it useful.
The .md scratch pad between sessions is underrated. We ended up formalizing it into a short decisions log - not a summary of what happened, just the non-obvious choices and why. The difference between "we tried X" and "we tried X, it failed because Y, so we use Z instead" is huge for the next session.
browser automation at scale - specifically the problem of running many isolated browser sessions that each look like distinct, real users to detection systems. the behavioral validation layer I mentioned is the part that makes agentic loops actually useful for this: the agent needs to know not just "did the task succeed" but "did it succeed without triggering signals that would get the session flagged".
the interesting engineering problem is that the two feedback loops run on different timescales - functional feedback is immediate (did the click work?) but behavioral feedback is lagged and probabilistic (the session might get flagged 10 requests from now based on something that happened 5 requests ago). teaching an agent to reason about that second loop is the unsolved part.
fair question. i shared a technical experience because it was directly relevant to the test harness discussion - the behavioral vs functional validation layers, the lagged feedback problem. if that reads as promotion, i get it, but it wasn't the intent. the engineering problem is real regardless of who's solving it.
Today I gave a lecture to my undergraduate data structures students about the evolution of CPU and GPU architectures since the late 1970s. The main themes:
- Through the last two decades of the 20th century, Moore’s Law held and ensured that more transistors could be packed into next year’s chips that could run at faster and faster clock speeds. Software floated on a rising tide of hardware performance so writing fast code wasn’t always worth the effort.
- Power consumption doesn’t vary with transistor density but varies with the cube of clock frequency, so by the early 2000s Intel hit a wall and couldn’t push the clock above ~4GHz with normal heat dissipation methods. Multi-core processors were the only way to keep the performance increasing year after year.
- Up to this point the CPU could squeeze out performance increases by parallelizing sequential code through clever scheduling tricks (and compilers could provide an assist by unrolling loops) but with multiple cores software developers could no longer pretend that concurrent programming was only something that academics and HPC clusters cared about.
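The scaling argument behind that power wall, in first-order CMOS terms (a standard textbook approximation, not from the lecture summary itself):

```latex
P_{\text{dynamic}} \approx \alpha\, C\, V^{2} f, \qquad V \propto f \;\Rightarrow\; P \propto f^{3}
```

Dynamic power grows with capacitance, the square of supply voltage, and frequency; since voltage must rise roughly in step with frequency to keep transistors switching fast enough, doubling the clock costs on the order of 8x the power, which is why adding cores won.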
CS curricula are mostly still stuck in the early 2000s, or at least it feels that way. We teach big-O and use it to show that mergesort or quicksort will beat the pants off of bubble sort, but topics like Amdahl’s Law are buried in an upper-level elective when in fact it is much more directly relevant to the performance of real code, on real present-day workloads, than a typical big-O analysis.
In any case, I used all this as justification for teaching bitonic sort to 2nd and 3rd year undergrads.
My point here is that Simon’s assertion that “code is cheap” feels a lot like the kind of paradigm shift that comes from realizing that in a world with easily accessible massively parallel compute hardware, the things that matter for writing performant software have completely shifted: minimizing branching and data dependencies produces code that looks profoundly different than what most developers are used to. e.g. running 5 linear passes over a column might actually be faster than a single merged pass if those 5 passes touch different memory and the merged pass has to wait to shuffle all that data in and out of the cache because it doesn’t fit.
What all this means for the software development process I can’t say, but the payoff will be tremendous (10-100x, just like with properly parallelized code) for those who can see the new paradigm first and exploit it.
Yesterday I wrote a post about exactly this. Software development, as the act of manually producing code, is dying. A new discipline is being born. It is much closer to proper engineering.
Like an engineer overseeing the construction of a bridge, the job is not to lay bricks. It is to ensure the structure does not collapse.
The marginal cost of code is collapsing. That single fact changes everything.
Quite a heavy-lifting word here. You understand why people flagged that post, right? It's painfully non-human. I'm all for utilizing LLMs, but I highly suggest you read Simon's posts. He's obviously a heavy AI user, but even his blog posts aren't that inorganic, and that's why he became the new HN blog darling.
[0]: I personally believe Simon writes with his own voice, but who knows?
How paranoid do you want to get? Simon's written enough that you could just feed his blog to AI and ask it to write in his voice. Which, taken to the logical extreme, means that the last time he went to visit OpenAI, he was captured and locked in a dungeon, and his online presence is now entirely AI with the right prompt. In fact, that's happened to everyone on this site, and we're all LLMs just predicting the next word at each other.
There's no actual way to determine if any words are from a silicon token generator or meat-based generator. It's not AI, it's human! Emdash. You're absolutely right!
This is such a strange take. Your words remind me of past crypto hype cycles, where people pushed web3.0 and NFT FOMO hysteria.
Engineering is the practical application of science and mathematics to solve problems. It sounds like you're maybe describing construction management instead. I'm not denying that there's value here, but what you're espousing seems divorced from reality. Good luck vibecoding a nontrivial actuarial model, then having it pass the laundry list of reviews and having large firms actually pick it up.
> This is such a strange take. Your words remind me of past crypto hype cycles, where people pushed web3.0 and NFT FOMO hysteria.
That's a little harsh. I think most everyone would agree we're in a transformative time for engineering. Sure there's hype, but the adoption in our profession (assuming you're an engineer) isn't waning.
I would not equate software engineering to "proper" engineering insofar as being uttered in the same sentence as mechanical, chemical, or electrical engineering.
The cost of code is collapsing because web development is not broadly rigorous, robust software was never a priority, and everyone knows it. The people complaining that AI isn't good enough yet don't grasp that neither are many who are in the profession currently.
> The people complaining that AI isn't good enough yet don't grasp that neither are many who are in the profession currently.
I think the externalities are being ignored. Having the time and money to train engineers is expensive. Having all of your users' data stolen is a slap on the wrist.
So replacing those bad workers with AI is fine. Unless you remove the incentives to be fast instead of good, then yeah, AI can be good enough for some cases.
The claim here is profound: comprehension of the codebase at the function level is no longer necessary.
It's not profound. It's not profound when I read the exact same awed blog post about how "agentic" is the future and you don't even need to know code anymore.
It wasn't profound the first time, and it's even dumber that people keep repeating it - maybe they take all the time they saved not writing, and use it to not read.
The formal engineering disciplines are not defined by the construction vs design distinction so much as the regulatory gates they have passed and the ethical burdens they shoulder for society's benefit.
I've recently got into red/green TDD with Claude Code, and I have to agree that it seems like the right way to go.
As my projects were growing in complexity and scope, I found myself worrying that we were building things that would subtly break other parts of the application. Because of the limited context windows, it was clear that after a certain size, Claude kind of stops understanding how the work you're doing interacts with the rest of the system. Tests help protect against that.
Red/green TDD specifically ensures that the current work is quite focused on the thing that you're actually trying to accomplish, in that you can observe a concrete change in behaviour as a result of the change, with the added benefit of growing the test suite over time.
It's also easier than ever to create comprehensive integration test suites - my most valuable tests are tests that test entire user facing workflows with only UI elements, using a real backend.
Red/green is especially good with claude because even now with opus 4.6, claude can throw out a little comment like “//Implementation on hold until X/Y/Z: return { true }” and proceed to completely skip implementation based on the inline skip comment for a longgg time. It used to do this aggressively even in the tests, but by and large red/green prompting helps immensely - it tells the agent “think of failing tests as SUCCESS right now” - then you’ll get lots of them.
I’ve always been partial to integration tests too. Hand coding made integration tests feel bad; you’re almost doubling the code output in some cases - especially if you end up needing to mock a bunch of servers. Nowadays that’s cheap, which is super helpful.
Granted it doesn't always pay attention to Claude.md but one thing I've done is in my block of rules it must always follow is to never leave something unimplemented w/ placeholders unless explicitly told to do so. It's made this mostly go away for me.
Yeah, I've always _preferred_ integration tests, but the cost of building them was so great. Now the cost is effectively eliminated, and if you make a change that genuinely does affect an integration test (changing the text on a button, for example) it's easy to smart-find-and-replace and fix them up. So I'm using them a lot more.
The only problem is... they still take much longer to _run_ than unit tests, and they do tend to be more flaky (although Claude is helpful in fixing flaky tests too). I'm grateful for the extra safety, but it makes deployments that much slower. I've not really found a solution to that part beyond parallelising.
I see where Simon is coming from with these patterns but I wonder where large software companies stand regarding their agentic engineering practices? Is Google creating in-house code using agents against its monorepo? Has Microsoft outsourced Windows source code advancements to a dark factory yet?
Very much agree with the idea of red/green TDD and have seen really good results during agentic coding. I've found adding a linting step in between increases efficiency as well and fails a bit faster. So it becomes..
Test fail -> implement -> linter -> test pass
Another idea I've thought about using is docs driven development. So the instructions might look like..
Write doc for feat/bug > test fail > implement > lint > test pass
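The gate sequence above can be sketched as a small runner. The commands are placeholders - substitute whatever linter and test runner the project actually uses:

```python
import subprocess

# Sketch of the "test fail -> implement -> lint -> test pass" gates as
# a plain runner: the linter goes before the tests so the loop fails
# fast, and the first failure is what gets fed back to the agent.

def gate(commands):
    """Run commands in order; return the first one that fails, or None."""
    for cmd in commands:
        if subprocess.run(cmd, shell=True).returncode != 0:
            return cmd   # fail fast: stop at the earliest broken gate
    return None

# e.g. gate(["ruff check .", "pytest -q"]) after the implement step;
# a non-None result becomes the next prompt ("fix: <cmd> failed")
failed = gate(["true", "true"])   # stub commands for illustration
```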
I primarily use AI for understanding codebases myself. My prompt is:
"deeply understand this codebase, clearly noting async/sync nature, entry points and external integrations. Once understood prepare for follow up questions from me in a rapid fire pattern, your goal is to keep responses concise and always cite code snippets to ensure responses are factual and not hallucinated. With every response ask me if this particular piece of knowledge should be persisted into codebase.md"
Both the conciseness and the structure (cited code snippets) help me build knowledge of the entire codebase as I progressively ask more complex questions about it.
I contribute to an open source spec based project management tool. I spend about a day back and forth iterating on a spec, using ai to refine the spec itself. Sometimes feeding it in and out of Claude/gemini telling each other where the feedback has come from. The spec is the value. Using the ai pm tool I break it down into n tasks and sub tasks and dependencies. I then trigger Claude in teams mode to accomplish the project. It can be left alone over night. I wake up in the morning with n prs merged.
Linear walkthrough: I ask my agents to give me a numbered tree. Controlling tree size specifies granularity. Numbering means it's simple to refer to points for discussion.
Other things that I feel are useful:
- Very strict typing/static analysis
- Denying tool usage with a hook telling the agent why+what they should do (instead of simple denial, or dangerously accepting everything)
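The "deny with guidance" hook can be sketched as a small script. The hook contract assumed here (tool-call JSON on stdin, exit code 2 blocks the call and stderr is shown to the agent) should be checked against the Claude Code hooks documentation for your version, and the review rule itself is a made-up example:

```python
import json

# Sketch of a PreToolUse hook that denies a tool call *with an
# explanation* rather than a bare rejection, so the agent knows what
# to do instead. Contract details are assumptions - verify them.

def review(tool_name, tool_input):
    """Return None to allow the call, or a corrective message to deny with."""
    cmd = tool_input.get("command", "")
    if tool_name == "Bash" and "rm -rf" in cmd:
        return "Blocked: no recursive deletes. Move files into trash/ instead."
    return None

def run_hook(stdin_text):
    """Return (exit_code, stderr_message) under the assumed hook contract."""
    event = json.loads(stdin_text)
    msg = review(event.get("tool_name", ""), event.get("tool_input", {}))
    return (2, msg) if msg else (0, "")
```

In a real setup this would be wired into settings as a PreToolUse command hook, printing the message to stderr and calling `sys.exit(code)`.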
For web apps, explicitly asking the agent to build in sensible checkpoints and validate at each checkpoint using Playwright has been very successful for me so far. It prevents the agent from straying off course and struggling to find its way back. That, and always using plan mode first and reviewing the plan for evidence of sensible checkpoints. /opusplan to save tokens!
The best thing I read in this was "Hoard things you know how to do" > basically get an LLM to mutate an existing function you know is 1. well written and 2. works. If you have many such components you're still assembling code rapidly but using building blocks you actually understand in depth, rather than getting an LLM to shit out something verbose.
This sort of thing is available using utilities like spec kit/spec kitty/etc. But yes it does make it do better, including writing its own checklists so that it comes back to the tasks it identified early on without distraction.
Until March 6th only, I'm selling site-wide sponsorship a week at a time. Those sponsors get no influence over what I write about at all - I started this entire guide without even mentioning it to them.
The underlying technology is still improving at a rapid pace. Many of last year's tricks are a waste of tokens now. Some ideas seem less fragile: knowing two things lets you imagine their confluence, so you know what to ask for. I'm also a big fan of the test-based iteration loop; it is so effective that I suspect almost all users have arrived at it independently[0]. But the emergent properties of models are so hard to actually imagine. A future sufficiently-smart intelligence may take a different approach that is less search and more proof. I wouldn't bet on it, but I've been surprised too many times over the last few years.
I mainly work with documents as a white collar worker but have vibe coded a few bits.
The thing I keep coming back to is that it's all code. Almost all white collar professions have at least some key outputs in code. Whether you are a store manager filling out reports or a marketing firm or a teacher, there is so much code.
This means you can give Claude Code a branded document template, have it fill it out, include images, etc., and upload it to our cloud hosting.
With this same guidance and taste, I'm doing close to the work of 5 people.
Setup: Claude code with full API access to all my digital spaces + tmux running 3-5 tasks in parallel
Is there anything about reviewing the generated code? Not by the author but by another human being.
Colleagues don’t usually like to review AI generated code. If they use AI to review code, then that misses the point of doing the review. If they do the review manually (the old way) it becomes a bottleneck (we are faster at producing code now than we are at reviewing it)
The most important thing you need to understand about working with agents for coding is that you are now designing a production line. And that has (mostly) nothing to do with designing or orchestrating agents.
Take a guitar, for example. You don't industrialize the manufacture of guitars by speeding up the same practices that artisans used to build them. You don't create machines that resemble individual artisans in their previous roles (like everyone seems to be trying to do with AI and software). You become Leo Fender, and you design a new kind of guitar that is made to be manufactured at another order of magnitude of scale. You need to be Leo Fender though (not a talented guitarist, but definitely a technical master).
To me, it sounds too early to describe patterns, since we haven't met the Ford/Fender/etc equivalent of this yet. I do appreciate the attempt though.
A broken test doesn’t make the agentic coding tool go “ooooh I made a bad assumption” any more than a type error or linter does
All a broken test does is prompt me to prompt back “fix tests”
I have no clue which one broke or why or what was missed, and it doesnt matter. Actual regressions are different and not dependent on these tests, and I follow along from type errors and LLM observability
People come up with the most insane workflows for agents. They complete about 80% of the work, but that last 20% is basically equivalent to doing the whole thing yourself piece by piece (with the help of AI). Except the latter gives you peace of mind.
I am still not sold on agentic coding. We’ll probably get there within the next couple of years.
I'm curious what you've used it for? I was firmly in your camp until about a month ago when i used codex to dust off an old side project. I hadn't touched the project in six months. This was literally my first prompt:
"Explain the codebase to a newcomer. What is the general structure, what are the important things to know, and what are some pointers for things to learn next?"
Once I saw the output I giddyup'd and haven't looked back.
Do you think there’s a chance that the hundreds of thousands or millions of developers - real developers - using these tools, might actually find them useful?
Dismissing everything AI as slop strikes me as an attitude that is not going to age well. You’ll miss the boat when it does come (and I believe it already has).
I think you’ve got to make hay while the sun shines. Nobody knows how this is all going to play out, I just want to make sure I’m at the forefront of it.
I think the relative comfort we've enjoyed as software engineers is going to disappear eventually. I just want to be the last to go.
My whole career, I've remained valuable by staying at the forefront of what is possible and connecting that to users' needs. Nothing has changed about my approach from that perspective.
I'm not an investor so I have no idea how they should think.
Patterns that may help increase the subjective perception of reliability from non-deterministic text generators trained on the theft of millions of developers' work over the past 25 years.
I think it's nonsensical to insist that it would only be a subjective improvement. The tests either exist and ensure that there aren't bugs in certain areas, or they don't. The agent is either in a feedback loop with those tests and continues to work until it has satisfied them or it doesn't.
Red-Green TDD is one of the main "agent patterns" Simon proposes, so it seemed relevant.
Also, the same thing applies to feedback loops with compilers and linters as well: they provide objective feedback that then the AI goes and fixes, verifiably resolving the feedback.
Even with less verifiable things like specifications, relying on less objective grounding metrics doesn't mean there's no change in the model's behavior. If you compared the code a model produced, and the amount of intervention needed to get there, with and without a specification, I'm sure you would see an objective difference on average. We're already getting objective studies regarding AGENTS.md files.
We're going to do it again, aren't we? We're going to take something simple and sensible ("write tests first", "small composable modules", etc.), give it a fancy complicated name ("Behavior-Constrained Implementation Lifecycle pattern", "Boundary-Scoped Processing Constructs pattern", etc.), and create an entire industry of consultants and experts selling books and enterprise coaching around it, each swearing they have the secret sauce and the right incantations.
The damn thing _talks_. You can just _speak_ to it. You can just ask it to do what you want.
I'm confused. Are you criticising the article, or simply expressing concern for what may happen?
The context suggests the former, but your criticisms bear no relation to the linked content. If anything, your edict to "write tests first" is even more succinctly expressed as "Red/green TDD".
People are rushing to be the first one to coin something and hit it big. Imagine the amount of $$$ you could get for being an "expert ai consultant" in this space.
There was already another attempt at agentic patterns earlier:
https://agentic-patterns.com/
Absolute hot air garbage.
Which pieces of my writing are garbage?
Has anyone staked a claim to "Agile AI" yet?
I suggest "AIgile" for brevity.
I've seen several already. There's a huge business opportunity (at our expense, of course).
You haven’t heard of spec driven development?!? Haha.
> The damn thing _talks_. You can just _speak_ to it. You can just ask it to do what you want.
But can it pass the butter?
There's a recurring theme in these agentic engineering threads that is worth calling out: the lessons are almost always stated as universal, but are deeply dependent on team size, codebase maturity, test coverage, and risk tolerance. What gets presented as a “win” for a well-instrumented backend service could easily guide those working on UI-heavy or legacy code down the wrong path. The art of this might be less about discovering the correct pattern, and more about truthfully declaring when a pattern applies.
I work as a consultant so I navigate different codebases, old to new, typescript to javascript, massive to small, frontend only to full stack.
Claude Code experience is massively different depending on the codebase.
Good E2E strongly typed codebase? Can one shot any feature, some small QA, some polishing and it's usually good to ship.
Plain javascript? Object oriented? Injection? Overall magic? Claude can work there but is not a pleasant experience and I wouldn't say it accelerates you that much.
"...typescript to javascript"
Country AND Western!
That's a good call out. The reason I'm doing this as a website and not a book is that this stuff changes all the time and I want to update it, so one of the things I'll try to do is add notes about when and where each pattern works as those constraints become clear.
Agreed. AND some are universal -- right now, agentic workflows benefit from independent source-of-truth checkins A LOT.
A lot of Simon's tools are making harnesses for this so it can get integrated:
showboat - create a demo, validate the code generates the demo. This is making a documentation source of truth
rodney - validate visually and with navigation that things work like you expect
red-green tests are conceptually the same - once we have these tests then the agent can loop more successfully.
So, I think there are some "universals" or at least "universals for now" that do transcend team/deployment specificity
I use AI in my workflow mostly for simple boilerplate, or to troubleshoot issues/docs.
I've dipped into agentic work now and again, but never been very impressed with the output (well, that there is any functioning output is insanely impressive, but it isn't code I want to be on the hook for complaining).
I hear a lot of people saying the same, but similarly a bunch of people I respect saying they barely write code anymore. It feels a little tricky to square these up sometimes.
Anyway, really looking forward to trying some of these patterns as the book develops to see if that makes a difference. Understanding how other people really use these tools is a big gap for me.
One thing I rarely see mentioned is that often creating code by hand is simply faster (at least for me) than using AI. Creating a plan for AI, waiting for execution, verifying, prompting again etc. can take more time than just doing it on my own with a plan in my head (and maybe some notes). Creating something from scratch or doing advanced refactoring is almost always faster with AI, but most of my daily tasks are bugs or features that are 10% coding and 90% knowing how to do it.
> 10% coding and 90% knowing how to do it
I think this is the main point where many people’s work differs. Most of my work I know roughly what needs changing and how things are structured but I jump between codebases often enough that I can’t always remember the exact classes/functions where changes are needed. But I can vaguely gesture at those specific changes that need to be made and have the AI find the places that need changing and then I can review the result.
I rarely get the luxury of working in a single codebase for a long enough period of time to get so familiar with it that I can jump to particular functions without much thought. That means AI is usually a better starting point than me fumbling around trying to find what I think exists but I don’t know where it is.
I've heard people say that these coding agents are just tools and don't replace the thinking. That's fine but the problem for me is that the act of coding is when I do my thinking!
I'm thinking about how to solve the problem and how to express it in the programming language such that it is easy to maintain. Getting someone/something else to do that doesn't help me.
But different strokes for different folks, I suppose.
Yes, it's often faster if you sit around waiting. What I will do instead is prompt the AI to create various plans, do other stuff while they do, review and approve the plans, do other stuff while multiple plans are being implemented, and then review and revise the output.
And I have the AI deal with "knowing how to do it" as well. Often it's slower to have it do enough research to know how to do it, but my time is more expensive than Claude's time, and so as long as I'm not sitting around waiting it's a net win.
I do this too, but then you need some method to handle it, because now you have to read and test and verify multiple work streams. It can become overwhelming. In the past week I had the following problems from parallel agents:
Gemini was running a benchmark - everything ran smoothly for an hour, but on verification it turned out it had hallucinated the model used for judging, invalidating the whole run.
Another task used Opus and I manually specified the model to use. It still used the wrong model.
This type of hallucination has happened to me at least 4-5 times in the past fortnight using Opus 4.6 and Gemini 3.1 Pro. GLM-5 does not seem to hallucinate as much.
So if you are not actively monitoring your agent and making the corrections, you need something else that is.
You need a harness, yes, and you need quality gates the agent can't mess with, and that just kicks the work back with a stern message to fix the problems. Otherwise you're wasting your time reviewing incomplete work.
Here is an example where the prompt was only a few hundred tokens and the output reasoning chain was correct, but the actual function call was wrong https://x.com/xundecidability/status/2005647216741105962?s=2...
Glancing at what it's doing is part of your multitasking rounds.
Also, instead of just prompting, it helps to have the AI write a quick summary of exactly what it will do - a plan including class names, branch names, file locations, specific tests, etc. - before I hit go, since the plan outline is smaller and quicker to correct than the code.
That takes more wall clock time per agent, but gets better results, so fewer redo steps.
> Here is an example where the prompt was only a few hundred tokens and the output reasoning chain was correct, but the actual function call was wrong https://x.com/xundecidability/status/2005647216741105962?s=2...
I as a human have typos too - and sometimes they're the hardest thing to catch in code review because you know what you meant.
Hopefully there is some sort of lint process to catch my human hallucinations and typos.
For me it _can_ be faster to code than to instruct, but it takes me significantly less effort to write the prompt than the actual code. So a few hours of concentrated coding leave me completely drained of energy, while after a few hours with the agents I still have a lot of mental energy. That's the huge difference for me, and I don't want to go back.
That's interesting. While I do get mentally tired after a session of focused coding, I feel like I have accomplished something. Using AI for coding feels similar to spending hours doom-scrolling reels: less engaging, but I'm drained as hell at the end.
I'd argue you still have to stay engaged, if not more so. It's a different type of engagement. Look at you: you're the CTO now.
My way of phrasing this: I need to activate my personal transformers on my inner embeddings space to really figure what is it that I truly want to write.
I delegate to agents what I hate doing, e.g. when creating a SaaS web app, the last thing I want to waste my time on is the landing page with about/pricing/login and Stripe integration frontend/backend - I'll just tell Claude Code (with Qwen3-Coder-Next-Q8 running locally on RTX Pro 6000) to make all this basic stuff for me so that I can focus on the actual core of the app. It then churns for half an hour, spews out the first version where I need to spend another half an hour to fix bugs by pointing errors to Claude Code and then in 1 hour it's all done. I can also tell it to avoid all the node.js garbage and do it all in plain HTML/JS/CSS.
The rebuttal to this would be that you can do many such tasks in parallel.
I’m not sure it’s really true in practice yet, but that would certainly be the claim.
But can you mentally "keep hold" (for lack of a better term) of those tasks that are getting executed in parallel? Honestly asking.
Because, after they're done/have finished executing, I guess you still have to "check" their output, integrate their results into the bigger project they're (supposedly) part of etc, and for me the context-switching required to do all that is mentally taxing. But maybe this only happens because my brain is not young enough, that's why I'm asking.
The type of dev who is allowing AI to do all of their work does not care about the quality of said work.
I think the difference is that you're applying a standard of correctness or personal understanding of the code you're pushing that is being relaxed in the "agentic workflows"
I have the AI integrate their results themselves. That's if anything one of the things they do best. I also have them do reviews and test their own work first before I check it, and that usually makes the remaining verification fairly quick and painless.
> I've dipped into agentic work now and again, but never been very impressed with the output (well, that there is any functioning output is insanely impressive, but it isn't code I want to be on the hook for complaining).
> I hear a lot of people saying the same, but similarly a bunch of people I respect saying they barely write code anymore. It feels a little tricky to square these up sometimes.
It squares up just fine.
You ever read a blog post or comment and think "Yeah, this is definitely AI generated"? If you can recognise it, would you accept a blog post, reviewed by you, for your own blog/site?
I won't; I'll think "eww" and rewrite.
The developers with good AI experiences don't get the same "eww" feeling when reading AI-generated code. The developers with poor AI experiences get that "eww" feeling all the time when reviewing AI code and decide not to accept the code.
Well, that's my theory anyway.
I also will rewrite both text and code created by Gen AI. I've found the best workflow for me is not to refine what I've written, but instead to use it to help me get over humps and/or crank through some of the drudgery. And then I go back and edit, fixing any issues I spot and to reshape it to be in my own voice.
I do this with code too.
When was the last time you tried?
I think trying agents to do larger tasks was always very hit or miss, up to about the end of last year.
In the past couple of months I have found them to have gotten a lot better (and I'm not the only one).
My experience with what coding assistants are good for shifted from:
smart autocomplete -> targeted changes/additions -> full engineering
I’m not OP but every time I post a comment with this sentiment I get told “the latest models are what you need”. If every 3 months you are saying “it’s ready as long as you use the latest model”, then it wasn’t ready 3 months ago and it’s not likely to be ready now.
To answer your question, I’ve tried both Claude code and Antigravity in the last 2 weeks and I’m still finding them struggling. AG with Gemini regularly gets stuck on simple issues and loops until I run out of requests, and Claude still just regularly goes on wild tangents not actually solving the problem.
I don’t think that’s true. Claude Opus 4.5/4.6 in Cursor have marked the big shift for me. Before that, agentic development mostly made me want to just do it myself, because it was getting stuck or going on tangents.
I think it can (and is) shifting very rapidly. Everyone is different, and I’m sure models are better at different types of work (or styles of working), but it doesn’t take much to make it too frustrating to use. Which also means it doesn’t take much to make it super useful.
> I don’t think that’s true. Claude Opus 4.5/4.6 in Cursor.
Opus 4.6 has been out for less than a month. If it was a big shift, surely we'd see a massive difference over 4.5, which came out in November. I think this proves the point: you're not seeing seismic shifts every 3 months, and you're not even clear about which model was the fix.
> I think it can (and is) shifting very rapidly.
Shifting, maybe. But shuffling deck chairs every 3 months.
I interpreted their comment to mean 4.5 was the shift, which was nov last year. "Before that" meaning pre 4.5.
It depends on what you're handling. Frontend (not CSS), Swagger, mundane CRUD is where it shines. Something more complex that needs harder calculation usually makes the agents struggle.
It's especially good for navigating code you're unfamiliar with. If you know the code well, you'll find it's usually faster to debug and code by yourself.
Opus 4.6 with claude code vscode extension
I thought this too and then I discovered plan mode. If you just prompt agent mode it will be terrible, but coming up with a plan first has really made a big difference and I rarely write code at all now
Have you tried it with something like OpenSpec? Strangely, taking the time to lay out the steps in a large task helps immensely. It's the difference between the behavior you describe and just letting it run productively for segments of ten or fifteen minutes.
> Have you tried it with something like OpenSpec?
No. The parent comment said I needed a new model, which I've tried. Being told "just try something else as well" kind of proves the point.
Agree, it’s strange, I will just assume that the people who say this are building react apps. I still have so much ”certainly, I should not do this in a completely insane way, let me fix that” … -400+2. It’s not always, and it is better than it was, but that’s it.
I'm an ML engineer, so it's mostly been setting up data processing/training code in PyTorch, if that helps.
At this point though, after Claude C Compiler, you've got to give us more details to better understand the dichotomy. What do you consider simple issues?
> At this point though, after Claude C Compiler,
Perfect example. You mean the C compiler that literally failed to compile a hello world [0] (which was given in its readme)?
> What do you consider simple issues?
Hallucinating APIs for well documented libraries/interfaces, ignoring explicit instructions for how to do things, and making very simple logic errors in 30-100 line scripts.
As an example, I asked Claude code to help me with a Roblox game last weekend, and specifically asked it to "create a shop GUI for <X> which scales with the UI, and opens when you press E next to the character". It proceeded to create a GUI with absolute sizings, get stuck on an API hallucination for handling input, and also, when I got it unstuck, it didn't actually work.
[0] https://github.com/anthropics/claudes-c-compiler/issues/1
Claude C compiler is 100k LOC that doesn’t do anything useful, and cost $20k plus the cost of an expert engineer creating a custom harness and babysitting it.
But the most important thing is that they were reverse engineering gcc by using it as an oracle. And it had gcc and thousands of other c compilers in its training set.
So if you are a large corporation looking to copy GPL code so that you can use it without worrying about the license, and the project you want to copy is a text transformer with a rigorously defined set of inputs and outputs, have at it.
> When was the last time you tried?
Pretty recently (a couple weeks ago). I give agentic workflows a go every couple of weeks or so.
I should say, I don't find them abysmal, but I tend to work in codebases where I understand the code and its patterns really well. The use cases I've tried so far do sort of work - just not, yet at least, faster than I'm able to actually write the code myself.
> My experience with what coding assistants are good for shifted from:
> smart autocomplete -> targeted changes/additions -> full engineering
Define "full engineering". Because if you say "full engineering" I would expect the agent to get some expected product output details as input and produce all by itself the right implementation for the context (i.e. company) it lives in.
I still write code but do not push everything off to the agent. Try my best to write small tasks. ~20% of the time I have to get in there. If someone says they're absolutely not writing a line of code they must have amazing guardrails.
> It feels a little tricky to square these up sometimes.
In my experience, this heavily depends on the task, and there's a massive chasm between tasks where it's a good and bad fit. I can definitely imagine people working only on one side of this chasm and being perplexed by the other side.
My experience is that the first iteration output from a single agent is not what I want to be on the hook for. What squares it for me with "not writing code anymore" is the iterative process to improve outputs:
1) Having review loops between agents (spawning separate "reviewer" agents) and clear tests / eval criteria improved results quite a bit for me.
2) Reviewing manually and giving instructions for improvements is necessary to have code I can own.
Is that… actually faster than just doing it yourself, tho? Like, “I could write the right thing, or I could have this robot write the wrong thing and then nag it til it corrects itself” seems to suggest a fairly obvious choice.
I’ve yet to see these things do well on anything but trivial boilerplate.
In my experience, sometimes. Not that often, depends on the task.
The benefit is I can keep some things ticking over while I’m in meetings, to be honest.
I was in the same boat as you until I saw DHH post about how he's changed his use of agents. In his talk with Lex Fridman his approach was similar to mine, and it really felt like a kernel of sanity amongst the hype. So when he said he'd changed his approach, I had another look.
I'm using agents (Claude Code) every day now. I still write code every day too. (So does Dax Raad from OpenCode, to throw a bit more weight behind this stance.) I'm not convinced the models can own a production codebase, so engineers need to maintain their skills sufficiently to stay responsible for it.
I find agents helpful for a lot of stuff, usually heavily patterned code with a lot of prior art. I find CC consistently sucks at writing polars code. I honestly don't enjoy using agents at all, and I don't think anyone can honestly claim they know how this is going to shake out. But I feel that by using the tools myself I have a much stronger sense of reality amongst the hype.
I strongly agree with that last statement—I hate using agents because their code smells awful even if it works. But I have to use them now because otherwise I’m going to wake up one day and be 100% obsolete and never even notice how it happened.
These lessons get obliterated with every new LLM generation. LangChain, for example, started on weak models with small context windows, building crazy architecture around them to bypass their limitations - architecture that was completely obliterated when GPT-3.5 was released, yet people still use it and overcomplicate things. Instead, look at where the puck is going: we may soon not need more than a single agent to do everything, given that context sizes keep increasing, agents can use more tools, and we might get some in-call context cleanup at some point that would let an agent run indefinitely instead of spawning subagents to work around context limits.
I'm trying to include patterns that work independently of model releases.
It's tricky though. Take "red/green TDD" for example - it's perfectly possible that models will start defaulting to doing that anyway pretty soon.
In that case it's only three words so it doesn't feel hugely wasteful if it turns out not to be necessary - and there's still value in understanding what it means even if you no longer have to explicitly tell the agents to do it.
The biggest takeaway for me from LLMs is that the implementation details no longer matter. If you have sufficiently detailed tests and requirements, there is going to be a robot that will roll the dice until it satisfies them.
I've experimented with agentic coding/engineering a lot recently. My observation is that software that is easily tested is perfect for this sort of agentic loop.
In one of my experiments I had the simple goal of "making Linux binaries smaller to download using better compression" [1]. Compression is perfect for this. Easily validated (binary -> compress -> decompress -> binary) so each iteration should make a dent otherwise the attempt is thrown out.
Lessons I learned from my attempts:
- Do not micro-manage. AI is probably good at coming up with ideas and does not need your input too much
- Test harness is everything, if you don't have a way of validating the work, the loop will go stray
- Let the iterations experiment. Let AI explore ideas and break things in its experiment. The iteration might take longer but those experiments are valuable for the next iteration
- Keep some .md files as scratch pad in between sessions so each iteration in the loop can learn from previous experiments and attempts
[1] https://github.com/mohsen1/fesh
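The validation step in that loop can be as small as a round-trip check. A minimal sketch in Python - `gzip` and `lzma` here are just stand-ins for whatever compressor the agent is actually iterating on, and the payload is a fake binary:

```python
import gzip
import lzma

def score_attempt(data, compress, decompress):
    """Score a compression attempt; reject it outright if it doesn't round-trip."""
    blob = compress(data)
    if decompress(blob) != data:
        return None   # broken attempt: throw it out of the loop
    return len(blob)  # otherwise score by compressed size (smaller is better)

# Stand-in for a real Linux binary.
data = b"\x7fELF" + bytes(4096)

scores = {
    name: score_attempt(data, c, d)
    for name, (c, d) in {
        "gzip": (gzip.compress, gzip.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }.items()
}
best_size, best_name = min(
    (size, name) for name, size in scores.items() if size is not None
)
```

An agent iteration only "counts" if `score_attempt` returns a number, and only wins if that number beats the previous best - which is exactly the property that keeps the loop from going astray.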
You have to have really good tests, as it fucks up in strange ways people don't (I think because experienced programmers run loops in their brain as they code)
Good news - agents are good at open-ended work like adding new tests and finding bugs. Do that. Also do unit tests and Playwright. Testing everything via web driving seemed insane pre-agents, but now it's more than doable.
the .md scratch pad point is underrated, and the format matters more than people realize.
summaries ("tried X, tried Y, settled on Z") are better than nothing, but the next iteration can mostly reconstruct them from test results anyway. what's actually irreplaceable is the constraint log: "approach B rejected because latency spikes above N ms on target hardware" means the agent doesn't re-propose B the next session. without it, every iteration rediscovers the same dead ends.
ended up splitting it into decisions.md and rejections.md. counter-intuitively, rejections.md turned out to be the more useful file. the decisions are visible in the code. the rejections are invisible — and invisible constraints are exactly what agents repeatedly violate.
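For what it's worth, the shape I'd expect a rejections file to take is something like this (entries invented purely for illustration):

```markdown
# rejections.md

## Approach B: per-request decompression cache
- Rejected 2026-03-01
- Reason: p99 latency spiked above 40 ms on target hardware
- Do not re-propose unless the latency budget changes

## Approach D: wholesale switch to zstd dictionaries
- Rejected 2026-03-04
- Reason: decompressor binary size blew past the 200 KB limit
```

The key fields are the reason and the condition under which the rejection stops applying - without the latter, the constraint either gets violated or outlives its usefulness.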
"Test harness is everything, if you don't have a way of validating the work, the loop will go stray"
This is the most important piece to using AI coding agents. They are truly magical machines that can make easy work of a large number of development, general purpose computing, and data collection tasks, but without deterministic and executable checks and tests, you can't guarantee anything from one iteration of the loop to the next.
The test harness point is the one that really sticks for me too. We've been using agentic loops for browser automation work, and the domain has a natural validation signal: either the browser session behaves the way a real user would, or it doesn't. That binary feedback closes the loop really cleanly.
The tricky part in our case is that "behaves correctly" has two layers - functional (did it navigate correctly?) and behavioral (does it look human to detection systems?). Agents are fine with the first layer but have no intuition for the second. Injecting behavioral validation into the loop was the thing that actually made it useful.
The .md scratch pad between sessions is underrated. We ended up formalizing it into a short decisions log - not a summary of what happened, just the non-obvious choices and why. The difference between "we tried X" and "we tried X, it failed because Y, so we use Z instead" is huge for the next session.
What are you developing that technology for?
browser automation at scale - specifically the problem of running many isolated browser sessions that each look like distinct, real users to detection systems. the behavioral validation layer I mentioned is the part that makes agentic loops actually useful for this: the agent needs to know not just "did the task succeed" but "did it succeed without triggering signals that would get the session flagged".
the interesting engineering problem is that the two feedback loops run on different timescales - functional feedback is immediate (did the click work?) but behavioral feedback is lagged and probabilistic (the session might get flagged 10 requests from now based on something that happened 5 requests ago). teaching an agent to reason about that second loop is the unsolved part.
so spam?
fair question. i shared a technical experience because it was directly relevant to the test harness discussion - the behavioral vs functional validation layers, the lagged feedback problem. if that reads as promotion, i get it, but it wasn't the intent. the engineering problem is real regardless of who's solving it.
Today I gave a lecture to my undergraduate data structures students about the evolution of CPU and GPU architectures since the late 1970s. The main themes:
- Through the last two decades of the 20th century, Moore's Law held, ensuring that next year's chips could pack in more transistors and run at faster and faster clock speeds. Software floated on a rising tide of hardware performance, so writing fast code wasn't always worth the effort.
- Power consumption doesn’t vary with transistor density but varies roughly with the cube of clock frequency, so by the early 2000s Intel hit a wall and couldn’t push the clock above ~4GHz with normal heat dissipation methods. Multi-core processors were the only way to keep the performance increasing year after year.
- Up to this point the CPU could squeeze out performance increases by parallelizing sequential code through clever scheduling tricks (and compilers could provide an assist by unrolling loops) but with multiple cores software developers could no longer pretend that concurrent programming was only something that academics and HPC clusters cared about.
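The cubic power claim above falls out of the classic CMOS dynamic-power model, P ≈ C·V²·f, once you account for supply voltage having to scale roughly linearly with frequency. A back-of-envelope sketch, not an exact physical model:

```python
def dynamic_power(c, v, f):
    # Classic CMOS dynamic power model: P = C * V^2 * f
    return c * v * v * f

# Supply voltage must rise roughly in proportion to clock frequency,
# so doubling the clock roughly doubles V as well...
base = dynamic_power(c=1.0, v=1.0, f=1.0)
doubled_clock = dynamic_power(c=1.0, v=2.0, f=2.0)

# ...which gives the cube: 2^2 (voltage) * 2 (frequency) = 8x the power.
ratio = doubled_clock / base
```

An 8x power bill for 2x the clock is why multi-core won: two cores at the original frequency deliver comparable throughput for roughly double, not octuple, the power.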
CS curricula are mostly still stuck in the early 2000s, or at least it feels that way. We teach big-O and use it to show that mergesort or quicksort will beat the pants off of bubble sort, but topics like Amdahl’s Law are buried in an upper-level elective when in fact it is much more directly relevant to the performance of real code, on real present-day workloads, than a typical big-O analysis.
In any case, I used all this as justification for teaching bitonic sort to 2nd and 3rd year undergrads.
My point here is that Simon’s assertion that “code is cheap” feels a lot like the kind of paradigm shift that comes from realizing that in a world with easily accessible massively parallel compute hardware, the things that matter for writing performant software have completely shifted: minimizing branching and data dependencies produces code that looks profoundly different than what most developers are used to. e.g. running 5 linear passes over a column might actually be faster than a single merged pass if those 5 passes touch different memory and the merged pass has to wait to shuffle all that data in and out of the cache because it doesn’t fit.
What all this means for the software development process I can’t say, but the payoff will be tremendous (10-100x, just like with properly parallelized code) for those who can see the new paradigm first and exploit it.
Yesterday I wrote a post about exactly this. Software development, as the act of manually producing code, is dying. A new discipline is being born. It is much closer to proper engineering.
Like an engineer overseeing the construction of a bridge, the job is not to lay bricks. It is to ensure the structure does not collapse.
The marginal cost of code is collapsing. That single fact changes everything.
https://nonstructured.com/zen-of-ai-coding/
> wrote
Quite a heavy-lifting word here. You understand why people flagged that post, right? It's painfully non-human. I'm all for utilizing LLMs, but I highly suggest you read Simon's posts. He's obviously a heavy AI user, but even his blog posts aren't that inorganic, and that's why he became the new HN blog babe.
[0]: I personally believe Simon writes with his own voice, but who knows?
How paranoid do you want to get? Simon's written enough, such that you could just feed his blog to AI and ask it to write in his voice. Which, taken to the logical extreme, means that the last time he went to visit OpenAI, he was captured, and locked in a dungeon, and his online presence is now entirely AI with the right prompt. In fact, that's happened to everyone on this site, and we're all LLMs just predicting the next word at each other.
There's no actual way to determine if any words are from a silicon token generator or meat-based generator. It's not AI, it's human! Emdash. You're absolutely right!
system failure.
We have the entire web built on technical debt and LLMs mostly trained on that - what could go wrong? The cost will just reside somewhere else if it's not in the code.
This is such a strange take. Your words remind me of past crypto hype cycles, where people pushed web3.0 and NFT FOMO hysteria.
Engineering is the practical application of science and mathematics to solve problems. It sounds like you're maybe describing construction management instead. I'm not denying that there's value here, but what you're espousing seems divorced from reality. Good luck vibecoding a nontrivial actuarial model, then having it to pass the laundry list of reviews and having large firms actually pick it up.
> This is such a strange take. Your words remind me of past crypto hype cycles, where people pushed web3.0 and NFT FOMO hysteria.
That's a little harsh. I think most everyone would agree we're in a transformative time for engineering. Sure there's hype, but the adoption in our profession (assuming you're an engineer) isn't waning.
> It is much closer to proper engineering.
I would not equate software engineering to "proper" engineering insofar as being uttered in the same sentence as mechanical, chemical, or electrical engineering.
The cost of code is collapsing because web development is not broadly rigorous, robust software was never a priority, and everyone knows it. The people complaining that AI isn't good enough yet don't grasp that neither are many who are in the profession currently.
> The people complaining that AI isn't good enough yet don't grasp that neither are many who are in the profession currently.
I think the externalities are being ignored. Training engineers takes time and money; having all of your users' data stolen is a slap on the wrist.
So replacing those bad workers with AI is fine. Unless you remove the incentives to be fast instead of good, then yeah, AI can be good enough for some cases.
Indeed, it's like those complaining that self-driving cars occasionally crash when their crash rates are up to 90% lower than humans'...
It's not pleasant to read this.
It's not profound. It's not profound when I read the exact same awed blog post about how "agentic" is the future and you don't even need to know code anymore. It wasn't profound the first time, and it's even dumber that people keep repeating it - maybe they take all the time they saved not writing, and use it to not read.
Agree. This is a transition from being "in" the loop to being "on" the loop.
You didn't write that and you shouldn't believe that you did.
The formal engineering disciplines are not defined by the construction vs design distinction so much as the regulatory gates they have passed and the ethical burdens they shoulder for society's benefit.
https://www.slater.dev/2025/09/its-time-to-license-software-...
I've recently got into red/green TDD with claude code, and I have to agree that it seems like the right way to go.
As my projects were growing in complexity and scope, I found myself worrying that we were building things that would subtly break other parts of the application. Because of the limited context windows, it was clear that after a certain size, Claude kind of stops understanding how the work you're doing interacts with the rest of the system. Tests help protect against that.
Red/green TDD specifically ensures that the current work is quite focused on the thing that you're actually trying to accomplish, in that you can observe a concrete change in behaviour as a result of the change, with the added benefit of growing the test suite over time.
It's also easier than ever to create comprehensive integration test suites - my most valuable tests are tests that test entire user facing workflows with only UI elements, using a real backend.
Red/green is especially good with claude because even now with opus 4.6, claude can throw out a little comment like “//Implementation on hold until X/Y/Z: return { true }” and proceed to completely skip implementation based on the inline skip comment for a longgg time. It used to do this aggressively even in the tests, but by and large red/green prompting helps immensely - it tells the agent “think of failing tests as SUCCESS right now” - then you’ll get lots of them.
I’ve always been partial to integration tests too. Hand coding made integration tests feel bad; you’re almost doubling the code output in some cases - especially if you end up needing to mock a bunch of servers. Nowadays that’s cheap, which is super helpful.
Granted it doesn't always pay attention to CLAUDE.md, but one thing I've done, in my block of rules it must always follow, is to never leave something unimplemented with placeholders unless explicitly told to do so. It's made this mostly go away for me.
Yeah, I've always _preferred_ integration tests, but the cost of building them was so great. Now the cost is effectively eliminated, and if you make a change that genuinely does affect an integration test (changing the text on a button, for example) it's easy to smart-find-and-replace and fix them up. So I'm using them a lot more.
The only problem is... they still take much longer to _run_ than unit tests, and they do tend to be more flaky (although Claude is helpful in fixing flaky tests too). I'm grateful for the extra safety, but it makes deployments that much slower. I've not really found a solution to that part beyond parallelising.
I find StrongDM's Dark Factory principles more immediately actionable (sorry, Simon!): https://factory.strongdm.ai/principles
Not sure there's anything to be sorry for, he literally wrote about it a few weeks ago:
https://simonwillison.net/2026/Feb/7/software-factory/
I second that. Sometimes it's defensibly worth throwing token fuel at the problem and validating as you go.
I see where Simon is coming from with these patterns but I wonder where large software companies stand regarding their agentic engineering practices? Is Google creating in-house code using agents against its monorepo? Has Microsoft outsourced Windows source code advancements to a dark factory yet?
Very much agree with the idea of red/green TDD and have seen really good results during agentic coding. I've found adding a linting step in between increases efficiency as well and fails a bit faster. So it becomes:
Test fail -> implement -> linter -> test pass
Another idea I've thought about using is docs-driven development. So the instructions might look like:
Write doc for feat/bug > test fail > implement > lint > test pass
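One way to picture that pipeline is as a sequence of gates that the agent loop re-runs until everything is green. A toy sketch - the lambdas below stand in for real pytest/linter invocations, they are not a real API:

```python
def run_gates(gates):
    """Run each (name, check) gate in order; stop at the first failure."""
    log = []
    for name, check in gates:
        ok = bool(check())
        log.append((name, ok))
        if not ok:
            break
    return log

# Red/green shape: the new test *must* fail before the implementation exists.
state = {"implemented": False}
gates = [
    ("tests", lambda: state["implemented"]),  # stand-in for running pytest
    ("lint", lambda: True),                   # stand-in for a linter pass
]

red = run_gates(gates)        # red phase: confirm the new test fails
state["implemented"] = True   # the agent writes the implementation
green = run_gates(gates)      # green phase: tests and lint both pass
```

The early break is what makes the lint step "fail a bit faster" - the agent gets cheap feedback before paying for a full test run.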
I primarily use AI for understanding codebases myself. My prompt is:
"deeply understand this codebase, clearly noting async/sync nature, entry points and external integration. Once understood prepare for follow up questions from me in a rapid fire pattern, your goal is to keep responses concise and always cite code snippets to ensure responses are factual and not hallucinated. With every response ask me if this particular piece of knowledge should be persistent into codebase.md"
Both the concise and structured nature of the responses (code snippets) helps me gain knowledge of the entire codebase as I progressively ask more complex questions about it.
Ahh, I tend to think of software engineering skills and workflows as the agentic engineering patterns.
I distilled multiple software books into these flows and skills. With more books to come.
Here is an example https://github.com/ryanthedev/code-foundations
I contribute to an open source spec based project management tool. I spend about a day back and forth iterating on a spec, using ai to refine the spec itself. Sometimes feeding it in and out of Claude/gemini telling each other where the feedback has come from. The spec is the value. Using the ai pm tool I break it down into n tasks and sub tasks and dependencies. I then trigger Claude in teams mode to accomplish the project. It can be left alone over night. I wake up in the morning with n prs merged.
The patterns in the article might be a starter, but there's so much more to cover:
agent roles (Orchestrator, QA, etc.), agent communication, thinking patterns, iteration patterns, feature folders, time-aware changelog tracking, prompt enforcement, real-time steering.
We might really need a public Wiki for that (C2 [1] style)
[1]: https://wiki.c2.com/
Isn’t this pretty much how everyone uses agents?
Feels like it’s a lot of words to say what amounts to make the agent do the steps we know works well for building software.
G is posting this slop so Anthropic sends him his dinner invitation this month, give him a break.
Linear walkthrough: I ask my agents to give me a numbered tree. Controlling tree size specifies granularity. Numbering means it's simple to refer to points for discussion.
Other things that I feel are useful:
- Very strict typing/static analysis
- Denying tool usage with a hook telling the agent why+what they should do (instead of simple denial, or dangerously accepting everything)
- Using different models for code review
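The "deny with an explanation" hook is worth spelling out. In Claude Code, a PreToolUse hook that exits with code 2 blocks the tool call and feeds its stderr back to the agent - check the current hooks documentation for exact event field names; the rules below are invented examples:

```python
import json
import sys

def check_tool_call(event):
    """Return (allow, message). A useful denial explains the alternative,
    so the agent can course-correct instead of retrying blindly."""
    command = event.get("tool_input", {}).get("command", "")
    if "git push" in command:
        return False, "Don't push directly; commit locally and ask for review."
    if command.startswith("rm -rf"):
        return False, "No recursive deletes; remove specific files instead."
    return True, ""

def main():
    """CLI entry point: Claude Code pipes the hook event JSON on stdin."""
    allow, message = check_tool_call(json.load(sys.stdin))
    if not allow:
        print(message, file=sys.stderr)
        sys.exit(2)  # exit code 2 = block the call; stderr goes to the agent
    sys.exit(0)

# (wired up in settings as the PreToolUse hook's command,
#  e.g. `python deny_hook.py`)
```

The difference between this and a bare denial is the message: "no, do X instead" keeps the loop moving, while a silent block just makes the agent retry variations of the same thing.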
For web apps, explicitly asking the agent to build in sensible checkpoints and validate at each checkpoint using Playwright has been very successful for me so far. It prevents the agent from straying off course and struggling to find its way back. That and always using plan mode first, and reviewing the plan for evidence of sensible checkpoints. /opusplan to save tokens!
The best thing I read in this was "Hoard things you know how to do" > basically get an LLM to mutate an existing function you know is 1. well written and 2. works. If you have many such components you're still assembling code rapidly but using building blocks you actually understand in depth, rather than getting an LLM to shit out something verbose.
I really like the idea of agent coding patterns. This feels like it could be expanded easily with more content though. Off the top of my head:
- tell the agent to write a plan, review the plan, tell the agent to implement the plan
- allow the agent to “self discover” the test harness (eg. “Validate this c compiler against gcc”)
- queue a bunch of tasks with // todo … and yolo “fix all the todo tasks”
- validate against a known output (“translate this to Rust and ensure it emits byte-for-byte identical output as you go”)
- pick a suitable language for the task (“go is best for this task because I tried several languages and it did the best for this domain in go”)
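The "validate against a known output" idea generalizes to any port or rewrite: keep the old implementation around as an oracle and only accept the new one when outputs match exactly. A trivial sketch - both implementations here are invented stand-ins:

```python
def matches_reference(reference, candidate, cases):
    """Accept a rewrite only if it matches the reference on every test case."""
    return all(reference(x) == candidate(x) for x in cases)

# Stand-ins: the "known good" original and two agent-produced rewrites.
reference = lambda s: s.upper()
rewrite = lambda s: "".join(c.upper() for c in s)  # correct port
buggy = lambda s: s.lower()                        # subtly wrong port

cases = ["abc", "Mixed", "", "123"]
ok = matches_reference(reference, rewrite, cases)
bad = matches_reference(reference, buggy, cases)
```

Note the case selection matters: `"123"` alone would not distinguish the buggy port from the reference, which is why the oracle needs a broad input set, not a single happy-path example.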
This sort of thing is available using utilities like spec kit/spec kitty/etc. But yes, it does make the agent do better, including writing its own checklists so that it comes back to the tasks it identified early on without distraction.
PSA: This is sponsored by Augment code.
Only until March 6th - I'm selling site-wide sponsorship a week at a time. Those sponsors get no influence over what I write about at all; I started this entire guide without even mentioning it to them.
Thanks, I did not mean it as an accusation of bias. Just something I saw on the page and shared. Appreciate you writing this and sharing.
Impressive, he keeps finding ways to abuse the HN audience.
Has anyone created a browser extension to remove everything related to simonw?
Is there a market for this like OOP patterns that used to sell in the 90s?
The underlying technology is still improving at a rapid pace. Many of last year's tricks are a waste of tokens now. Some ideas seem less fragile: knowing two things allows you to imagine the confluence of the two so you know to ask. Other things are less so: I'm a big fan of the test-based iteration loop; it is so effective that I suspect almost all users have arrived at it independently[0]. But the emergent properties of models are so hard to actually imagine. A future sufficiently-smart intelligence may take a different approach that is less search and more proof. I wouldn't bet on it, but I've been surprised too many times over the last few years.
0: https://wiki.roshangeorge.dev/w/Blog/2025-12-01/Grounding_Yo...
It definitely feels like everyone is trying to sell you something that is supposed to help you build rather than actually building useful stuff.
Which is oddly close to how investment advice is given. If these techniques work so well, why give them up for free?
everybody's trying to become the next Uncle Bob
I mainly work with documents as a white collar worker but have vibe coded a few bits.
The thing I keep coming back to is that it's all code. Almost all white collar professions have at least some key outputs in code. Whether you are a store manager filling out reports or a marketing firm or a teacher, there is so much code.
This means you can give claude code a branded document template, have it fill the template out, include images, etc., and upload the result to our cloud hosting.
With this same guidance and taste, I'm doing close to the work of 5 people.
Setup: Claude code with full API access to all my digital spaces + tmux running 3-5 tasks in parallel
Is there anything about reviewing the generated code? Not by the author but by another human being.
Colleagues don’t usually like to review AI generated code. If they use AI to review code, then that misses the point of doing the review. If they do the review manually (the old way) it becomes a bottleneck (we are faster at producing code now than we are at reviewing it)
This chapter describes a technique for making code reviews less mentally burdensome: https://simonwillison.net/guides/agentic-engineering-pattern...
I'm hoping to add more on that topic as I discover other patterns that are useful there.
Asking for a walkthrough of the codebase? Sure you linked to the right page?
I was expecting tips on code review instead based on your comment and GP.
It's the closest I have to touching on code review so far.
The most important thing you need to understand with working with agents for coding is that now you design a production line. And that has nothing to do (mostly) with designing or orchestrating agents.
Take a guitar, for example. You don't industrialize the manufacture of guitars by speeding up the same practices that artisans used to build them. You don't create machines that resemble individual artisans in their previous roles (like everyone seems to be trying to do with AI and software). You become Leo Fender, and you design a new kind of guitar that is made to be manufactured at another order of magnitude of scale. You need to be Leo Fender though (not a talented guitarist, but definitely a technical master).
To me, it sounds too early to describe patterns, since we haven't met the Ford/Fender/etc equivalent of this yet. I do appreciate the attempt though.
I dont currently have confidence in TDD
A broken test doesn’t make the agentic coding tool go “ooooh I made a bad assumption” any more than a type error or linter does
All a broken test does is prompt me to prompt back “fix tests”
I have no clue which one broke, or why, or what was missed - and it doesn't matter. Actual regressions are different and not dependent on these tests, and I follow along via type errors and LLM observability
People come up with the most insane workflows for agents. They complete about 80% of the work, but that last 20% is basically equivalent to you doing the whole thing piecewise (with the help of AI). Except the latter gives you peace of mind.
I am still not sold on agentic coding. We’ll probably get there within the next couple of years.
I'm curious what you've used it for? I was firmly in your camp until about a month ago when i used codex to dust off an old side project. I hadn't touched the project in six months. This was literally my first prompt:
"Explain the codebase to a newcomer. What is the general structure, what are the important things to know, and what are some pointers for things to learn next?"
Once I saw the output I giddyup'd and haven't looked back.
Any word on patterns for security and deployment to prod?
Not yet, I'm still trying to figure out what the effective patterns for that are myself!
Slop Engineering Patterns
Do you think there’s a chance that the hundreds of thousands or millions of developers - real developers - using these tools, might actually find them useful?
Dismissing everything AI as slop strikes me as an attitude that is not going to age well. You’ll miss the boat when it does come (and I believe it already has).
> You’ll miss the boat when it does come
Is the boat:
1) unmissable since the tools get better all the time and are intelligent
or
2) nearly-impossible to board since the tools will replace most of the developers
or
3) a boat of small productivity improvements?
?
Personally today I think it’s 3.
Eventually I do think it will be 2.
I think you’ve got to make hay while the sun shines. Nobody knows how this is all going to play out, I just want to make sure I’m at the forefront of it.
So you think the tools will be intelligent yet somehow hard to master?
And the progress is slowing down in such a way, that knowledge learned today will not be outdated anymore?
Should investors be worried, since AGI is not coming anymore?
My advice is not to get hung up on whether this stuff is "intelligent" or caught out by the AGI hype.
We didn't ask if type-based autocomplete was "intelligent" before we started using that.
Treat coding agents as tools and figure out what they can and cannot do and how best to use them.
No, I think they will be very easy to use.
I think the relative comfort we've enjoyed as software engineers is going to disappear eventually. I just want to be the last to go.
My whole career, I've remained valuable by staying at the forefront of what is possible and connecting that to users' needs. Nothing has changed about my approach from that perspective.
I'm not an investor so I have no idea how they should think.
patterns that may help increase the subjective perception of reliability from non-deterministic text generators trained on the theft of millions of developers' work over the past 25 years.
I think it's nonsensical to insist that it would only be a subjective improvement. The tests either exist and ensure that there aren't bugs in certain areas, or they don't. The agent is either in a feedback loop with those tests and continues to work until it has satisfied them or it doesn't.
That sounds like a very specific implementation strategy related to TDD
Red-Green TDD is one of the main "agent patterns" Simon proposes, so it seemed relevant.
Also, the same thing applies to feedback loops with compilers and linters as well: they provide objective feedback that then the AI goes and fixes, verifiably resolving the feedback.
Even with less verifiable things like using specifications, the fact that it relies on less objective grounding metrics doesn't mean there's no change in the model's behavior. I'm sure if you looked at the code that a model produced and the amount of intervention necessary to get there for a model that was asked to produce something without a specification versus with one, you would definitely see an objective difference on average. We're already getting objective studies regarding AGENTS.MD
I really hate smelly statements like "this or that is cheap now". They reek of carelessness.