I wonder if the choice of programming language for vibecoding will actually become more relevant rather than less so. The initial collective instinct is that languages don't really matter anymore since LLMs promote PL egalitarianism, but if it's as the OP describes and humans are now becoming code reviewers, maybe different aspects of various languages start granting practical advantages: Clojure connected to a REPL allows rapid execution/validation of generated code blocks, while advanced static type systems (Haskell, Rust, etc.) provide an edge from another angle.
I'm still calibrating myself on the size of task that I can get Claude Code to do before I have to intervene.
I call this the "goldilocks" problem. The task has to be large enough that it outweighs the time necessary to write out a sufficiently detailed specification AND to review and fix the output. It has to be small enough that Claude doesn't get overwhelmed.
The issue with this is that writing a "sufficiently detailed specification" is task dependent. Sometimes a single sentence is enough, other times a paragraph or two, sometimes a couple of pages is necessary. And the "review and fix" phase is again totally task dependent and unknown up front. I can usually estimate the spec time, but the review-and-fix phase is a dice roll dependent on the output of the agent.
And the "overwhelming" metric is again not clear. Sometimes Claude Code can crush significant tasks in one shot. Other times it can get stuck or lost. I haven't fully developed an intuition for this yet, how to differentiate these.
What I can say is that this is an entirely new skill. It isn't like architecting large systems for human development. It isn't like programming. It is its own thing.
This is why I'm still dubious about the overall productivity increase we'll see from AI once all the dust settles.
I think it's undeniable that in narrow, well-controlled use cases the AI does give you a bump. Once you move beyond that, though, the time you have to spend on cleanup starts to seriously eat into any efficiency gains.
And if you're in a domain you know very little about, I think any use case beyond helping you learn a little quicker is a net negative.
"It isn't like programming. It is its own thing."
You articulated what I was wrestling with in the post perfectly.
>I haven't fully developed an intuition for this yet, how to differentiate these.
The big issue is that, even though there is a logical side to it, part of it is adapting to a closed system that can change under your feet. New model, new prompt, there goes your practice.
> It isn't like programming. It is its own thing.
Absolutely. And what I find fascinating is that this experience is highly personal. I read probably 876 different “How I code with LLMs” posts and I can honestly say not a single thing I read and tried (and I tried A LOT) “worked” for me…
For the longer ones, are you using AI to help you write the specs?
My experience is that AI-written prompts are overly long and overly specific. I prefer to write the instructions myself and then direct the LLM to ask clarifying questions or provide an implementation plan. Depending on the size of the change I go 1-3 rounds of clarifications until Claude indicates it is ready and provides a plan that I can review.
I do this in a task_description.md file and I include the clarifications in their own section (the files follow a task.template.md format).
What bothers me is this: Claude & I work hard on a subtle issue; eventually (often after wiping Claude's memory clean and trying again) we collectively come to a solution that works.
But the insights gleaned from that battle are (for Claude) lost forever as soon as I start on a new task.
The way LLMs (fail to) handle memory and in-situ learning (beyond prompt engineering and working within the context window) is just clearly deficient compared to how human minds work.
This illustrates a fundamental truth of maintaining software with LLMs: While programmers can use LLMs to produce huge amounts of code in a short time, they still need to read and understand it. It is simply not possible to delegate understanding a huge codebase to an AI, at least not yet.
In my experience, the real "pain" of programming lies in forcing yourself to absorb a flood of information and connect the dots. Writing code is, in many ways, like taking a walk: you engage in a cognitively light activity that lets ideas shuffle, settle, and mature in the background.
When LLMs write all the code for you, you lose that essential mental rest: the quiet moments where you internalize concepts, spot hidden bugs, and develop a mental map of the system.
As a staff swe I spend way more time reading, understanding code, and then QAing features.
Writing code is my favorite part of the job, why would I outsource it so I can spend even more time reading and QAing?
100% yes. QA'ing a bunch of LLM generated code feels like a mental flood. Losing that mental rest is a great way to put it.
Another way of saying this is that only an AI reviewer could cope with the flood of code an AI can produce.
But AI reviewers can do little beyond checking coding standards.
MCP up Playwright, have a detailed spec, and tell claude to generate a detailed test plan for every story in the spec, then keep iterating on a test -> fix -> ... loop until every single component has been fully tested. If you get claude to write all the components (usually by subfolder) out to todos, there's a good chance it'll go >1 hour before it tries to stop, and if you have an anti-stopping hook it can go quite a bit longer.
You've got to be doing the most unoriginal work on the planet if this doesn't produce a bowl of dysfunctional spaghetti.
Every sentence you will ever write in your entire life will be made from a finite set of letters. The magic is in how you arrange them.
If you have a really detailed, well thought out spec, you do TDD and you have regular code review and refactor loops, agentic coding stays manageable.
Can you elaborate on what you mean by an anti-stopping hook? Sometimes I take breaks, go on walks, etc., and it would be cool if Claude tried different things, even different branches, that I could review when I got back.
Basically, all LLMs are "lazy" to some degree and are looking for ways to terminate responses early to conform to their training distribution. As a result, sometimes an agent will want to stop and phone home even if you have multiple rows of all caps saying DO NOT STOP UNTIL YOUR ENTIRE TODO LIST IS COMPLETE (seriously). Claude Code has a hook for when the main agent and subagents try to stop, and you can reject their stop attempt with a message. They can still override that message and stop, but the change of turn and the fresh "DO NOT STOP ..." at the front of context seem to keep it revving for a long time.
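For anyone curious, here is a minimal sketch of what such an anti-stopping hook could look like, assuming Claude Code's hooks feature (a Stop hook that runs a command and can block the stop by emitting a JSON decision). The field names reflect my reading of the hooks docs, so treat them as assumptions and verify against the current documentation:

```python
#!/usr/bin/env python3
# Hypothetical Stop hook: reject the agent's attempt to stop and remind it of the todo list.
# Assumed registration in .claude/settings.json (check the hooks docs for the exact schema):
#   "hooks": {"Stop": [{"hooks": [{"type": "command",
#                                   "command": "python3 .claude/hooks/keep_going.py"}]}]}
import json
import sys

event = json.load(sys.stdin)  # hook input arrives as JSON on stdin

# Safety valve: if a previous rejection already forced the agent to continue,
# let it stop this time to avoid an endless reject-stop loop.
if event.get("stop_hook_active"):
    sys.exit(0)

# Emitting a "block" decision rejects the stop; the reason lands back in context
# as a fresh instruction at the change of turn.
print(json.dumps({
    "decision": "block",
    "reason": "DO NOT STOP: the todo list is not complete. Continue with the next unchecked item.",
}))
```

As I understand it, the `stop_hook_active` check is what keeps a hook like this from trapping the agent in an infinite loop and burning tokens indefinitely.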
Programming and vibe coding are two entirely separate disciplines. The process of producing software, and the end result, is wildly different between them.
People who vibe code don't care about the code, but about producing something that delivers value, whatever that may be. Code is just an intermediate artifact to achieve that goal. ML tools are great for this.
People who program care about the code. They want to understand how it works, what it does, in addition to whether it achieves what they need. They may also care about its quality, efficiency, maintainability, and other criteria. ML tools can be helpful for programming, but they're not a panacea. There is no shortcut for building robust, high quality software. A human still needs to understand whatever the tool produces, and ensure that it meets their quality criteria. Maybe this will change, and future generations of this tech will produce high quality software without hand-holding, but frankly, I wouldn't bet on the current approaches to get us there.
When building a project from scratch using AI, it can be tempting to give in to the vibe and ignore the structure/architecture and let it evolve naturally. This is a bad idea when humans do it, and it's also a bad idea when LLM agents do it. You have to be considering architecture, dataflow, etc from the beginning, and always stay on top of it without letting it drift.
I have tried READMEs scattered through the codebase but I still have trouble keeping the agent aware of the overall architecture we built.
This should be called the eternal, unbearable slowness of code review, because the author writes that the AI actually churns out code extremely rapidly. The (hopefully capable, attentive, careful) human is the bottleneck here, as it should be.
Ooh, that's a good title for another post! And yes, I agree with you.
Initially I would barely read any of the code generated and as my project has grown in size, I have approached the limits of that approach.
Often because Claude Code makes very poor architectural choices.
Welcome to vibe/agentic engineering
If only code and application quality could be measured in LoC - middle managers everywhere would rejoice
> ... I’ll keep pulling PRs locally, adding more git hooks to enforce code quality, and zooming through coding tasks—only to realize ChatGPT and Claude hallucinated library features and I now have to rip out Clerk and implement GitHub OAuth from scratch.
I don't get this: how many git hooks do you need to identify that Claude hallucinated a library feature? Wouldn't a single hook running your tests identify that?
I don't have a ton of tests. From what I've seen, Claude will often just update the tests to no-ops, so tests passing isn't trustworthy.
My workflow is often to plan with ChatGPT, and what I was getting at here is that ChatGPT can often hallucinate features of third-party libraries. I usually dump the plan from ChatGPT straight into Claude Code and only look at the details when I'm testing.
That said, I've become more careful in auditing the plans so I don't run into issues like this.
Tell Claude to use a code review subagent after every significant change set, tell it to run the tests and evaluate the change set, don't tell Claude it wrote the code, and give it strict review instructions. Works like a charm.
Any tips on writing productive review subagent instructions?
Yes. Go on ChatGPT, explain what you're doing (Claude Code, trying to get it to be more rigorous with itself and reduce defects), then click deep research and tell it you'd like it to look up code review best practices, AI code review, smells/patterns to look out for in AI code, etc. Then have it take the result of that and generate an XML-structured document with a flowchart of the code review best practices it discovered, cribbing from an established schema for element names/attributes when possible, and put it in fenced xml blocks in your subagent. You can also tell Claude Code to do deep research; you just have to be a little specific about what it should go after.
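To give a flavor of the result, here is a small illustrative sketch of the kind of XML-structured review checklist that process can produce; the element names and checks below are made up for illustration, not an established schema:

```xml
<code-review-checklist version="0.1">
  <stage name="correctness">
    <check id="tests-meaningful">Tests assert real behaviour; no tests rewritten into no-ops.</check>
    <check id="no-hallucinated-apis">Every third-party call exists in the pinned library version.</check>
  </stage>
  <stage name="design">
    <check id="duplication">No copy-pasted logic; shared code is extracted.</check>
    <check id="error-handling">Failures surface as errors instead of being silently swallowed.</check>
  </stage>
  <flow>
    <step>Run the test suite and linters on the change set.</step>
    <step>Walk the diff against each stage above.</step>
    <step>Either reject the change set with specific findings or approve it.</step>
  </flow>
</code-review-checklist>
```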
Interesting. I had not thought about a code review sub agent. I will give that a shot.
They probably don't have any tests, or the tests that the LLM creates are flawed and not detecting these problems
Just tell the AI "and make sure you don't add bugs or break anything"
Works every time
Yesterday Claude Code assured me the following:
• Good news! The code is compiling successfully (the errors shown are related to an existing macro issue, not our new code).
When in fact, it had managed to insert 10 compilation errors that were not at all related to any macros.
The other day I had Claude proudly proclaim it fixed the bug, by deleting the log line that exposed the bug...
I tried using agents in Cursor and when it runs into issues it will just rip out the offending code :)
"hallucinated" library features are identified even earlier, when claude builds your project. i also don't get what author is talking about.
AI agents have been known to rip out mocks so that the tests pass.
I have had human devs do that too
cool, can you think of any differences between a human engineer, who is presumably employed by an employer and subject to review and evaluation by a manager and inherently assumed to be capable of receiving feedback and reliably applying it on a go-forward basis to their future work, and an LLM, when they each make this same kind of mistake?
Yes, the difference is about $197,600 of playing fair or $57,600 if offshoring.
The issue I have is that it produces code that is unmaintainable. Poor modularity, code duplication, hidden errors producing spurious bugs, ...
Nonetheless, its ability to produce code that works is impressive; it's useful for learning, for generating throwaway code...
For example, I can ask for a piece of code generating stats from logs. The code is not meant to last and will have few users (the devs), so maintainability is not an issue.
I've found LLMs to be very good at writing design docs and finding problems in code.
Currently they're better at locating problems than fixing them without direction. Gemini seems smarter and better at architecture and best practices. Claude seems dumber but is more focused on getting things done.
The right solution is going to be a variety of tools and LLMs interacting with each other. But it's going to take real humans having real experience with LLMs to get there. It's not something that you can just dream up on paper and have it work out well since it depends so much on the details of the current models.
Slow is smooth, smooth is fast.
Maybe I’ve misunderstood this, so correct me if I’m wrong… do actual professional developers let enough code be generated to include entire libraries that handle things as important as authentication, and then build on top of it without making sure the previously generated code actually does what it’s supposed to? Just accept local PRs written by AI, with a very sternly worded “now you better not make any bullshit” system prompt? All this just in time to ramp up AI penetration tools. Jesus.
It’s kind of crazy to me how the cool kid take on software development, as recent as 3 years ago, was: strictly-typed everything, ‘real men’ don’t use garbage collection, everything must be optimized to death even when it isn’t really necessary, etc. and now it seems to be ‘you don’t seriously expect me to look at ‘every single line of code’ I submit, do you?’
I'm trying to prototype extremely quickly and I'm working on my project alone so yes, often I accept PRs without looking too closely at the code if my local testing succeeds.
I'm using TypeScript and Rust and I think it's critical to use strict typing with LLMs to catch simple bugs.
I've worked at Uber as an infra engineer and at Gem as an engineering manager so I do consider myself an "actual professional developer". The critical bit is the context of the project I'm working on. If I were at a tech company building software, I'd be much more reticent to ship AI generated PRs whole cloth.
Well, prototyping is indeed a whole different ball of wax.
The mistake you’re making is assuming it’s the same group of people saying both things. The “strictly typed, no GC, optimize everything” crowd hasn’t suddenly turned into the “lol I don’t read my AI-generated PRs” crowd. Those are two different tribes of devs with completely different value systems.
What’s changed isn’t that the same engineers did a 180 on principles, it’s that the discourse got hijacked by a new set of people who think shipping fast with AI is cooler than sweating over type systems. The obsession with performance purity was always more of a niche cultural flex than a universal law, and now the flex du jour is “look how much I can outsource to the machine.”
No— I’m not assuming that but I absolutely can see how what I wrote could come across like that. Dominant voices in many communities change more frequently than the core principles of the people in them— some people just get louder, more visible, and more numerous and others fall to the wayside, especially when there’s marketing or some sort of ‘movement’ involved. (And there’s all sorts of ‘movements’ involved in tech hype and some of them are bovine.)
Your read on the situation concurs with mine. Cheers.
I've no idea why, but the phrase "it's addicting" is really annoying; I'm pretty certain it should be "it's addictive". I've started seeing it everywhere. (Note, I haven't completely lost my mind, it's in that article).
Haha fair enough. Fixed!
Prompting it better during development can really help here.
I have an emerging workflow orchestrated by Claude Code custom commands and subagents that turns even an informal description of a feature into a full-fledged PRD; then an "architect" command researches and produces a well thought out and documented technical design. I can review that design document and then give it to the "planner" command, which breaks it down into Phases and Tasks. Then I have a "developer" command iterate through and implement the Phases one by one. After each phase it runs a detailed code review using my "review" subagent.
Since I've started using this document-driven, guided workflow I've seen the quality of the output noticeably improve.
please share these discrete instructions.md you're describing
I wonder if the author is using automated tests.
My hunch is that good automated testing is an enormous factor with respect to how productive you can get with coding agent tools.
Thorough tests? Just like working without LLMs you can confidently make changes without fear of breaking other parts of the application.
No tests at all? Any change you make is a roll of the dice with respect to how it affects the rest of your existing code.
I’m reaching the same conclusion… I have been subscribing to LLMs for a couple of years, and trying to find the right balance and workflow that gets the best out of human and machine.
I now think TDD can play a big part. I don’t have much of a background in unit testing.
For a recent TypeScript utility mini project, I took an outside-in approach using mocks where necessary.
This started as a prototyping and modelling phase, getting the design right before committing to implementation code. This was about refining the types and function signatures, and mocking the components that didn’t exist at that point.
The LLM didn’t have involvement at this stage, as it was about the problem domain, the shape and flow of the data.
Moving on from there, I was able to save a lot of time because SuperMaven in Cursor had enough context and understanding at that point to make very precise guesses about what I wanted, so I could tab autocomplete through a reasonable amount of boilerplate implementation code.
I was also able to get away with writing a couple of happy path tests for most components, and get the agentic LLM to generate sad path tests. Most of which I kept, including one that smoked out a flaw in my design.
That’s essentially the process I’m gravitating towards. Human begins the process, models the design, sets the constraints, and then the LLM saves time in a limited and supervised way whilst being kept on a short leash.
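For what it's worth, here is the same outside-in idea in a tiny Python sketch (the names and the mocked dependency are purely illustrative, not from the project above): the interface and the test exist before the real dependency does.

```python
# Outside-in sketch: the function under design depends only on an interface,
# so its signature can be refined and tested before the real dependency exists.
from typing import Protocol
from unittest.mock import Mock


class RateSource(Protocol):
    def latest(self, currency: str) -> float: ...


def convert_to_usd(amount: float, currency: str, rates: RateSource) -> float:
    """Convert an amount into USD using the injected rate source."""
    return amount * rates.latest(currency)


def test_convert_uses_latest_rate() -> None:
    rates = Mock(spec=["latest"])  # stand-in for a component that doesn't exist yet
    rates.latest.return_value = 1.1
    assert convert_to_usd(100.0, "EUR", rates) == 110.0
    rates.latest.assert_called_once_with("EUR")
```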
I don’t have much in the way of tests right now but I am building with TypeScript and Rust, so that catches many basic bugs.
I don’t find the issue to be breaking other parts of the app, more so that new features don’t work as advertised by Claude.
One of my takeaways here is that I should give Claude an integration test harness and tell it that it must finish running that successfully before committing any code.
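One way to make that mechanical is a git pre-commit hook that refuses the commit unless the integration suite passes; a minimal sketch, where the test command is a placeholder for whatever harness you actually use:

```python
#!/usr/bin/env python3
# Save as .git/hooks/pre-commit and mark it executable: blocks any commit,
# whether typed by a human or issued by an agent, until the integration suite passes.
import subprocess
import sys

# Placeholder command: substitute your real harness (pytest -m integration, npm test, ...).
TEST_COMMAND = ["pytest", "-m", "integration", "-q"]

result = subprocess.run(TEST_COMMAND)
if result.returncode != 0:
    print("pre-commit: integration tests failed; commit rejected.", file=sys.stderr)
    sys.exit(1)
```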
AI tools seem excellent at getting through boilerplate stuff at the start of a project. But as time goes on and you have to think about what you are doing, it'll be faster to write it yourself than to convey it in natural language to an LLM. I don't see this as an issue with the tool, but just getting a better idea of what it is really good for.
The role of a software engineer is to condense the (often unclear) requirements, business domain knowledge, existing code (if any) and their skills/experience into a representation of the solution in a very concise language: a programming language.
Having to instead express all that (including the business-related part, since the agent has no context of that) in a verbose language (English) feels counter-productive, and is counter-productive in my experience.
I've successfully one-shotted easy self-contained, throwaway tasks ("make me a program that fills Redis with random keys and values" - Claude will one-shot that) but when it comes to working with complex existing codebases I've never seen the benefits - having to explain all the context to the agent and correcting its mistakes takes longer than just doing it myself (worse, it's unpredictable - I know roughly how long something will take, but it's impossible to tell in advance whether an agent will one-shot it successfully or require longer babysitting than just doing it manually from the beginning).
We are going to end up having boilerplate natural language text that's been tested and proven to get the same output every time. Then we'll have a sort of transpiler, and maybe a sub-language of English, to make prompting easier. Then we will source control those prompts. What we actually do today, with extra steps.
Somewhat related: I found Cursor/VS Code was slowing to the point of being unusable. Turning on privacy mode helped, but the main culprit was extremely verbose logging. Running `fatrace -c --command=cursor` uncovered the issue.
The disk in question was an HDD and the problem disappeared (or is better hidden) after symlinking the log dir to an SSD.
As for the code itself, I've never had an issue with slowness. If anything, the problem is the verbosity: wanting to explain itself, and excess logging in the code it creates.
I've never done QA. Just thinking about doing QA makes my head swirl. But yes, because of LLMs I am now a part-time QA engineer, and I think that it's kinda helping me be a better developer.
I'm working on a massive feature at work, something I can’t just give to an agent, and I already feel like something has changed in how I think about every little piece of code I'm adding. Didn't see that coming.
Well yeah, as the app scales it will bump up against context limits. Giving it sandboxed areas to do specific tasks will speed it up again, but that’s not possible with everything.
Gemini CLI is pretty weak, but Gemini 2.5 Pro is still the best for long contexts. Claude is great but it crumbles as you start to get into the 50-100k range. I find Gemini doesn't start to crack until the 150-200k range. It's too bad the tooling around it is mediocre at best.
One of my favorite patterns is to use repomix to take a project repo and turn it into a single file, drop it into gemini and chat with it for a while about how to improve the codebase, then ask it to create a hyper-detailed set of instructions for claude to implement the changes we discussed. The planning tends to be much better than with Opus because you've got your whole codebase in context and you've been chatting and steering the model for a bit, plus it can save you ~200-300k tokens.
Even if it's slow, you can run multiple agents. You can have one doing changes, while another writes documentation, while another does security checks, while another looks for optimizations. Persist findings to markdown files to track progress and for cross-agent knowledge sharing if needed. And do whatever else while it's all running. This has been my experience.
But then you have to keep all those tasks in your head and be ready to jump into any of them.
The check-ins are much more frequent and the instructions much lower level than what you’d give to a team if you were running it.
Do you have an example of a large application you’ve released with this methodology that has real paying users that isn’t in the AI space?
If you set up your agents correctly, they can run for hours. My record is around 4 hours for a "prod/launch readiness" pass on a 90k LoC codebase, and that same codebase had a marathon lint and mypy plan that fixed ~700 issues over 6 hours (split around 3/3 due to API limits).
OP says in the 2nd paragraph that they are using multiple agents in parallel. In fact, that's what their app does.
If they are modifying the same code, then you have to merge all of the different changes, so it's not really parallel.
IME it's faster to not try to edit the same code in parallel because of the cost of merging.
My employer hosts one of the largest Ruby on Rails apps in the world. I've noticed that Claude Code takes a long time to grep for what it needs. Cursor is much better at this (probably because of local project indexing). Due to this, I favor Cursor over CC in my day to day workflows. In smaller code bases, both are pretty fast.
For projects of any non-trivial size, you should have a (local) MCP server that wraps/bridges an LSP over the local repo, so that when the LLM needs to find some identifier X, or its callers, or implementations, etc., it can ask the LSP directly rather than needing to do a grep or whatever.
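As a rough sketch of the shape such a bridge could take, assuming the Python MCP SDK's FastMCP helper (the `mcp` package); the LSP plumbing itself is deliberately stubbed out, since wiring it to a real language server client is the project-specific part:

```python
# Minimal local MCP server exposing code-navigation tools; the bodies are stubs
# meant to be forwarded to a language server (textDocument/definition, .../references).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("lsp-bridge")


@mcp.tool()
def find_definition(symbol: str) -> str:
    """Return where `symbol` is defined in the repo."""
    # Stub: forward to the language server's textDocument/definition request.
    return f"TODO: ask the language server where {symbol!r} is defined"


@mcp.tool()
def find_references(symbol: str) -> str:
    """Return the call sites / references of `symbol`."""
    # Stub: forward to the language server's textDocument/references request.
    return f"TODO: ask the language server for references to {symbol!r}"


if __name__ == "__main__":
    mcp.run()  # serves MCP over stdio so an agent (Claude Code, Cursor) can attach
```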
I swear Claude will end up in a loop if you’re not careful and you end up grinding on it for so long.
You can prompt Claude into (assess -> fix -> validate) -> assess -> ... loops pretty easily. You can do this with unit test coverage, and sometimes it's nice to come back to 100% coverage of the codebase (and the code review agent kept the tests from being hot garbage). You can push this with front end tests, playwright, etc to really get deep into validating your application without actually slogging through PRs (yet!).
My pattern with claude code is to let stuff simmer in the background with a detailed PRD, and just turn the screws with progressively more testing and type checking. I'll use repomix to put my entire codebase into gemini 2.5 pro, chat with it for a bit and then ask it to generate a highly detailed work plan for claude code to make the codebase more production hardened/launch ready. If I don't burn my plan tokens first, that gemini prompt can keep claude running for like ~3 hours usually. If you repeat this gemini plan -> claude implement step a few times gemini will eventually start to tell you to stop being a chicken and launch your great app.
type safety, integration testing, and thorough readmes are now cheap, I don't know why any developer would not be using them with claude code. even if all the LLM services go under tomorrow you'll still have code that practically autocompletes itself.
I split a large task into 4-5 small subtasks, each in a new conversation to save tokens, and it does a pretty good job.
Mistral may not be the smartest chat assistant out there, but I've stopped using others entirely given how slow they are compared to Mistral (which runs inference with Cerebras).
Waiting for an AI to complete its task isn't fun at all, and I'd choose the fast 70%-correct response any day over the slow 90%-correct one. Because by the time the slow one gives you its first attempt, you'd have clarified your need and fixed the output from the fast one.
Sure, if we get to the point where the slow system is 100% right, then it's no big deal if it's slow, but we're still far from that point.
I don't find it slow at all. Just not very enjoyable. I used to love making functions, writing for loops, etc. all day long.
I do not enjoy spelling out tasks in English and checking that they are done correctly.