Interesting read, and some interesting ideas, but there's a problem with statements like these:
> Sean proposes that in the AI future, the specs will become the real code. That in two years, you'll be opening python files in your IDE with about the same frequency that, today, you might open up a hex editor to read assembly.
> It was uncomfortable at first. I had to learn to let go of reading every line of PR code. I still read the tests pretty carefully, but the specs became our source of truth for what was being built and why.
This doesn't make sense as long as LLMs are non-deterministic. The prompt could be perfect, but there's no way to guarantee that the LLM will turn it into a reasonable implementation.
With compilers, I don't need to crack open a hex editor on every build to check the assembly. The compiler is deterministic and well-understood, not to mention well-tested. Even if there's a bug in it, the bug will be deterministic and debuggable. LLMs are neither.
Humans don't make mistakes nearly as often, the mistakes they do make are way more predictable (they're easier to spot in code review), and they don't tend to make the kinds of catastrophic mistakes that could sink a business. LLMs also tend to cause codebases to rapidly deteriorate, since even very disciplined reviewers can miss the kinds of strange and unpredictable stuff an LLM will do. Redundant code isn't evident in a diff, and neither are things like tautological tests, or useless tests that mock everything and only actually test the mocks. Or they'll write a bunch of duplicated code, because they aggressively avoid code re-use unless you are very specific.
The real problem is just that they don't have brains, and can't think. They generate text that is optimized to look the most right, but not to be the most right. That means they're deceptive right off the bat. When a human is wrong, it usually looks wrong. When an LLM is wrong, it's generating the most correct looking thing it possibly could while still being wrong, with no consideration for actual correctness. It has no idea what "correctness" even means, or any ideas at all, because it's a computer doing matmul.
They are text summarization/regurgitation, pattern matching machines. They regurgitate summaries of things seen in their training data, and that training data was written by humans who can think. We just let ourselves get duped into believing the machine is where the thinking is coming from, and not the (likely uncompensated) author(s) whose work was regurgitated for you.
>The real problem is just that they don't have brains, and can't think.
That would have had more weight if you hadn't just described junior developer behavior beforehand.
"LLMs can't think" is anthropocentric cope. It's the old AI effect all over again - people would rather die than admit that there's very little practical difference between their own "thinking" and that of an AI chatbot.
> That would have had more weight if you hadn't just described junior developer behavior beforehand.
Effectively telling that junior developers "don't have brains" is in very bad taste and offensively wrong.
> people would rather die than admit that there's very little practical difference between their own "thinking" and that of an AI chatbot.
Would you like to elaborate on this?
I was told that McDonald's employees would have been replaced by now, that self-driving cars would be driving the streets, and that new medicines would have been discovered.
"AI" has been out for a couple of years now, and no singularity yet.
LLMs use the same type of "abstract thinking" process as humans. Which is why they can struggle with 6-digit multiplication (unlike computer code, very much like humans), but not with parsing out metaphors or describing what love is (unlike computer code, very much like humans). The capability profile of an LLM is amusingly humanlike.
Setting the bar for "AI" at "singularity" is a bit like setting requirements for "fusion" at "creating a star more powerful than the Sun". Very good for dismissing all existing fusion research, but not any good for actually understanding fusion.
If we had two humans, one with IQ 80 and another with IQ 120, we wouldn't say that one of them isn't "thinking". It's just that one of them is much worse at "thinking" than the other. Which is where a lot of LLMs are currently at. They are, for all intents and purposes, thinking. Are they any good at it though? Depends on what you want from them. Sometimes they're good enough, and sometimes they aren't.
> LLMs use the same type of "abstract thinking" process as humans
It's surprising you say that, considering we don't actually understand the mechanisms behind how humans think.
We do know that human brains are so good at patterns, they'll even see patterns and such that aren't actually there.
LLMs are a pile of statistics that can mimic human speech patterns if you don't tax them too hard. Anyone who thinks otherwise is just Clever Hans-ing themselves.
We understand the outcomes well enough. LLMs converge onto a similar process by being trained on human-made text. Is LLM reasoning a 1:1 replica of what the human brain does? No, but it does something very similar in function.
I see no reason to think that humans are anything more than "a pile of statistics that can mimic human speech patterns if you don't tax them too hard". Humans can get offended when you point it out though. It's too dismissive of their unique human gift of intelligence that a chatbot clearly doesn't have.
We do not, in fact, "understand the outcomes well enough" lol.
I don't really care if you want to have an AI waifu or whatever. I'm pointing out that you're vastly underestimating the complexity behind human brains and cognition.
And that complex human brain of yours is attributing behaviors to a statistical model that the model does not, in fact, possess.
"We don't fully understand how a bird works, and thus: "wind tunnel" is useless, Wright brothers are utter fools, what their crude mechanical contraptions are doing isn't actually flight, and heavier than air flight is obviously unattainable."
I see no reason whatsoever to believe that what your wet meat brain is doing now is any different from what an LLM does. And the similarity in outcomes is rather evident. We found a pathway to intelligence that doesn't involve copying a human brain 1:1 - who would have thought?
I think both "LLMs can produce outcomes akin to those produced by human intelligence (in many but not all cases)" and "LLMs are intelligent" are fairly defensible things to say.
> I see no reason whatsoever to believe that what your wet meat brain is doing now is any different from what an LLM does.
I don't think this follows though. Birds and planes can both fly, but a bird and a plane are clearly not doing the same thing to achieve flight. Interestingly, both birds and planes excel at different aspects of flight. It seems at least plausible (imo likely) that there are meaningful differences in how intelligence is implemented in LLMs and humans, and that that might manifest as some aspects of intelligence being accessible to LLMs but not humans and vice versa.
> It seems at least plausible (imo likely) that there are meaningful differences in how intelligence is implemented in LLMs and humans
Intelligence isn’t "implemented" in an LLM at all. The model doesn’t carry a reasoning engine or a mental model of the world. It generates tokens by mathematically matching patterns: each new token is chosen to best fit the statistical patterns it learned from its training data and the immediate context you give it. In effect, it’s producing a compressed, context-aware summary of the most relevant pieces of its training data, one token at a time.
The training data is where the intelligence happened, and that's because it was generated by human brains.
There doesn't seem to be much consensus on defining what intelligence is. For the definitions of at least some reasonable people of sound mind, I think it is defensible to call them intelligent, even if I don't necessarily agree. I sometimes call them "intelligent" because many of the things they do seem to me like they should require intelligence.
That said, to whatever extent they're intelligent or not, by almost any definition of intelligence, I don't think they're achieving it through the same mechanism that humans do. That is my main argument. I think confident arguments that "LLMs think just like humans" are very bad, given that we clearly don't understand how humans achieve intelligence, and given the vastly different substrates and constraints that humans and LLMs are working with.
I guess to me: how is the ability to represent the statistical distribution of outcomes of almost any combination of scenarios, represented as textual data, not a form of world model?
I think you're looking at it too abstractly. An LLM isn't representing anything, it has a bag of numbers that some other algorithm produced for it. When you give it some numbers, it takes them and does matrix operations with them in order to randomly select a token from a softmax distribution, one at a time, until the EOS token is generated.
If they don't have any training data that covers a particular concept, they can't map it onto a world model and make predictions about that concept based on an understanding of the world and how it works. [This video](https://www.youtube.com/watch?v=160F8F8mXlo) illustrates it pretty well. These things may or may not end up being fixed in the models, but that's only because they've been further trained with the specific examples. Brains have world models. Cats see a cup of water, and they know exactly what will happen when you tip it over (and you can bet they're gonna do it).
That video is a poor and misunderstood analysis of an old version of ChatGPT.
Analyzing the image generation failure modes of the dall-e family of models isn't really helpful in understanding whether the invoking LLM has a robust world model or not.
The point of me sharing the video was to use the full glass of wine as an example for how generative AI models doing inference lack a true world model. The example was just as relevant now as it was then, and it applies to inference being done by LMs and SD models in the same way. Nothing has fundamentally changed in how these models work. Getting better at edge cases doesn't give them a world model.
That's the point though. Look at any end-to-end image model. Currently I think nano banana (Gemini 2.5 Flash) is probably the best in prod. (Looks like ChatGPT has regressed the image pipeline right now with GPT-5, but not sure)
SD models have a much higher propensity to fixate on proximal in distribution solutions because of the way they de-noise.
For example, you can ask nano banana for a "completely full wine glass in zero g", which I'm pretty sure is way more out of distribution, and the model does a reasonable job of approximating what that might look like.
That's a fairly bad example. They don't have any trouble taking unrelated things and sticking them together. A world model isn't required for you to take two unrelated things and stick them together. If I ask it to put a frog on the moon, it can know what frogs look like and what the moon looks like, and put the frog on the moon.
But what it won't be able to do, which does require a world model, is put a frog on the moon, and be able to imagine what that frog's body would look like on the moon in the vacuum of space as it dies a horrible death.
Your example is a good one. The frog won't work because ethically the model won't want to show a dead frog very easily, BUT if you ask nano-banana for:
"Create an image of what a watermelon would look like after being teleported to the surface of the moon for 30 seconds."
> "We don't fully understand how a bird works, and thus: "wind tunnel" is useless, Wright brothers are utter fools, what their crude mechanical contraptions are doing isn't actually flight, and heavier than air flight is obviously unattainable."
Completely false equivalency. We did in fact back then completely understand "how a bird works", how the physics of flight work. The problem getting man-made flying vehicles off the ground was mostly about not having good enough materials to build one (plus some economics-related issues).
Whereas in case of AI, we are very far from even slightly understanding how our brains work, how the actual thinking happens.
One of the Wright brothers' achievements was to realize that the published tables of flight physics were wrong, and to carefully redo them with their own wind tunnel until they had a correct model from which to design a flying vehicle.
https://humansofdata.atlan.com/2019/07/historical-humans-of-...
"Anthropocentric cope >:(" is one of the funniest things I've read this week, so genuinely thank you for that.
"LLMs think like people do" is the equivalent of flat earth theory or UFO bros.
Flerfers run on ignorance, misunderstanding and oppositional defiant disorder. You can easily prove the earth is round in quite a lot of ways (the Greeks did it) but the flerfers either don't know them or refuse to apply them.
There are quite a lot of reasons to believe brains work differently than LLMs (and ways to prove it) you just don't know them or refuse to believe them.
It's neat tech, and I use them. They're just wayyyyyyyy overhyped and we don't need to anthropomorphize them lol
This is wrong on so many levels. I feel like this is what I would have said if I never took a neuroscience class, or actually used an LLM for any real work beyond just poking around ChatGPT from time to time between TED talks.
There is no actual object-level argument in your reply, making it pretty useless. I’m left trying to infer what you might be talking about, and frankly it’s not obvious to me.
For example, what relevance is neuroscience here? Artificial neural nets and real brains are entirely different substrates. The “neural net” part is a misnomer. We shouldn’t expect them to work the same way.
What’s relevant is the psychology literature. Do artificial minds behave like real minds? In many ways they do — LLMs exhibit the same sorts of fallacies and biases as human minds. Not exactly 1:1, but surprisingly close.
I didn't say brains and ANNs are the same, in fact I am making quite the opposite argument here.
LLMs exhibit these biases and fallacies because they regurgitate the biases and fallacies that were written by the humans that produced their training data.
Living in Silicon Valley, there are MANY self driving cars driving around right now. At the stop light the other day, I was between 3 of them without any humans in them.
It is so weird when people pull self driving cars out as some kind of counter example. Just because something doesn't happen on the most optimistic time scale, doesn't mean it isn't happening. They just happen slowly and then all at once.
15 years ago they said truck drivers would be obsolete in 1-2 years. They are still not obsolete, and they aren't on track to be any time soon, either.
Not really, code even in high level languages is always lower level than English just for computer nonsense reasons. Example: "read a CSV file and add a column containing the multiple of the price and quantity columns".
That's about 20 words. Show me the programming language that can express that entire feature in 20 words. Even very English-like languages like Python or Kotlin might just about do it, if you're working in something else like C++ then no.
In practice, this spec will expand to changes to your dependency lists (and therefore you must know what library is used for CSV parsing in your language; the AI knows this stuff better than you), then there's some file handling, error handling if the file doesn't exist, maybe some UI like flags or other configuration, working out what the column names are, writing the loop, saving it back out, writing unit tests. Any reasonable programmer will produce a very similar PR given this spec, but the diff will be much larger than the spec.
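Something like this, roughly, in pandas (the file names and column names here are my own illustrative choices, not anything from your prompt):

```python
import pandas as pd

df = pd.read_csv("orders.csv")                    # which file we're reading
df["total"] = df["price"] * df["quantity"]        # the input columns and the new column's name
df.to_csv("orders_with_total.csv", index=False)   # what happens to the result
```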
Not only is this shorter, but it contains all of the critical information that you left out of your english prompt: where is the csv? what are the input columns named? what are output columns named? what do you want to do with the output?
I also find it easier to read than your english prompt.
You have to count the words in the functions you call to get the correct length of the implementation, which in this case is far far more than 20 words. read_csv has more than 20 arguments, you can't even write the function definition in under 20 words.
Otherwise, I can run every program by importing one function (or an object with a single method, or what have you) and just running that function. That is obviously a stupid way to count.
It isn't a joke, you need the Kolmogorov complexity of the code that implements the feature, which has nothing to do with the fact that you're using someone else's solution. You may not have to think about all the code needed to parse a CSV, but someone did and that's a cost of the feature, whether you want to think about it or not.
Again, if someone else writes a 100,000 line function for you, and they wrap it in a "do_the_thing()" method, you calling it is still calling a 100,000 line function, the computer still has to run those lines and if something goes wrong, SOMEONE has to go digging in it. Ignoring the costs you don't pay is ridiculous.
We are comparing between a) asking an LLM to write code to parse a csv and b) writing code to parse a csv.
In both cases, they'll use a csv library, and a bajillion items of lower-level code. Application code is always standing on the shoulders of giants. Nobody is going to manually write assembly or machine code to parse a csv.
The original contention, which I was refuting, is that it's quicker and easier to use an LLM to write the python than it is to just write the python.
Kolmogorov complexity seems pretty irrelevant to this question.
>"read a CSV file and add a column containing the multiple of the price and quantity columns"
This is an underspecification if you want to reliably and repeatably produce similar code.
The biggest difference is that some developers will read the whole CSV into memory before doing the computations. In practice the difference between those implementations is huge.
Another big difference is how you represent the price field. If you parse them as floats and the quantity is big enough, you'll end up with errors. Even if quantity is small, you'll have to deal with rounding in your new column.
You didn't even specify the name of the new column, so the name is going to be different every time you run the LLM.
What happens if you run this on a file the program has already been run on?
And these are just a few of the reasonable ways of fitting that spec but producing wildly different programs. Making a spec that has a good chance of producing a reasonably similar program each time looks more like:
“Read input.csv (UTF-8, comma-delimited, header row). Read it line by line, do not load the entire file into memory. Parse the price and quantity columns as numbers, stripping currency symbols and thousands separators; interpret decimals using a dot (.). Treat blanks as null and leave the result null for such rows. Compute per-row line_total = round(Decimal(price) * Decimal(quantity), 2). Append line_total as the last column (name the column "Total") without reordering existing columns, and write to output.csv, preserving quoting and delimiter. Do not overwrite existing columns. Do not evaluate or emit spreadsheet formulas.”
And even then you couldn't just check this in and expect the same code to be generated each time; you'd need a large test suite just to constrain the LLM. And even then the LLM would still occasionally find ways to generate code that passes the tests but does things you don't want it to.
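To make that concrete, here's one implementation that I think conforms to that spec. It's only a sketch, and even it makes choices the spec leaves open (which currency symbols to strip, what to do if a "Total" column already exists, whether the csv module's default quoting counts as "preserving quoting", and so on) - which is exactly the point about underspecification:

```python
import csv
from decimal import Decimal

def parse_number(raw):
    # Strip currency symbols and thousands separators; blanks become None.
    cleaned = raw.strip().replace(",", "").lstrip("$€£")
    return Decimal(cleaned) if cleaned else None

with open("input.csv", newline="", encoding="utf-8") as src, \
     open("output.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)                      # header row assumed present
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["Total"])
    writer.writeheader()
    for row in reader:                                # streamed row by row, never fully in memory
        price = parse_number(row["price"])
        quantity = parse_number(row["quantity"])
        # Blank inputs leave the result null; otherwise round(Decimal * Decimal, 2).
        row["Total"] = "" if price is None or quantity is None else str(round(price * quantity, 2))
        writer.writerow(row)
```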
Given that they all use pseudo-random (and not actually random) numbers, they are "deterministic" in the sense that given a fixed seed, they will produce a fixed result...
But perhaps that's not what was meant by deterministic. Something like an understandable process producing an answer rather than a pile of linear algebra?
I was thinking the exact same thing: if you don’t change the weights, use identical “temperature” etc, the same prompt will yield the same output. Under the hood it’s still deterministic code running on a deterministic machine
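A toy sketch of that sampling step (illustrative only; real serving stacks add batching and floating-point nondeterminism on top, but the principle is the same):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, seed=0):
    # Softmax over the logits, then sample with a fixed RNG seed:
    # same weights give the same logits, and the same temperature + seed give the same token.
    rng = np.random.default_rng(seed)
    probs = np.exp(np.array(logits) / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5]
assert sample_next_token(logits, seed=42) == sample_next_token(logits, seed=42)
```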
You can just change your definition of "AI". Back in the 60s the pinnacle of AI was things like automatic symbolic integration, and those were certainly completely deterministic. Nowadays people associate "AI" with stuff like LLMs and diffusion etc. that have randomness included to make them seem "organic", but it doesn't have to be that way.
I actually think a large part of people's amazement with the current breed of AI is the random aspect. It's long been known that random numbers are cool (see Knuth volume 2, in particular where he says randomness makes computer-generated graphics and music more "appealing"). Unfortunately being amazed by graphics and music (and now text) output is one thing; making logical decisions with real consequences is quite another.
"The prompt could be perfect, but there's no way to guarantee that the LLM will turn it into a reasonable implementation."
I think it is worse than that. The prompt, written in natural language, is by its very nature vague and incomplete, which is great if you are aiming for creative artistry. I am also really happy that we are able to search for dates using phrases like "get me something close to a weekend, but not on Tuesdays" on a booking website instead of picking dates from a dropdown box.
However, if natural language was the right tool for software requirements, software engineering would have been a solved problem long ago. We got rightfully excited with LLMs, but now we are trying to solve every problem with it. IMO, for requirements specification, the situation is similar to earlier efforts using formal systems and full verification, but at the exact opposite end. Similar to formal software verification, I expect this phase to end up as a partially failed experiment that will teach us new ways to think about software development. It will create real value in some domains and it will be totally abandoned in others. Interesting times...
> With compilers, I don't need to crack open a hex editor on every build to check the assembly.
The tooling is better than just cracking open the assembly but in some areas people do effectively do this, usually to check for vectorization of hot loops, since various things can mean a compiler fails to do it. I used to use Intel VTune to do this in the HPC scientific world.
“This doesn't make sense as long as LLMs are non-deterministic.”
I think this is a logical error. Non-determinism is orthogonal to probability of being correct. LLMs can remain non-deterministic while being made more and more reliable. I think “guarantee” is not a meaningful standard because a) I don’t think there can be such a thing as a perfect prompt, and b) humans do not meet that standard today.
We also have to pretend that anyone has ever been any good at writing descriptive, detailed, clear and precise specs or documentation. That might be a skillset that appears in the workforce, but absolutely not in 2 years. A technical writer that deeply understands software engineering so they can prompt correctly but is happy not actually looking at code and just goes along with whatever the agent generates? I don't buy it.
This seems like a typical engineer forgets people aren't machines line of thinking.
This. Even with Junior Devs, implementation is always more or less deterministic (based on one's abilities/skills/aptitude). With AI models, you get totally different implementations even when specifically given clear directions via prompt.
> Neither are humans, so this argument doesn't really stand.
Even when we give a spec to a human and tell them to implement it, we scrutinize and test the code they produce. We don't just hand over a spec and blindly accept the result. And that's despite the fact that humans have a lot more common sense, and the ability to ask questions when a requirement is ambiguous.
> This doesn't make sense as long as LLMs are non-deterministic.
I think we will find ways around this. Because humans are also non-deterministic. So what do we do? We review our code, test it, etc. LLMs could do a lot more of that. Eg, they could maintain and run extensive testing, among other ways to validate that behavior matches the spec.
> there's no way to guarantee that the LLM will turn it into a reasonable implementation.
There's also no way to guarantee that you're not going to get hit by a meteor strike tomorrow. It doesn't have to be provably deterministic at a computer science PhD level for people without PhDs to say eh, it's fine. Okay, it's not deterministic. What does that mean in practice? Given the same spec.md file, at the layer of abstraction where we're no longer writing code by hand, who cares, because of a lack of determinism, if the variable for the filename object is called filename or fname or file or name, as long as the code is doing something reasonable? If it works, if it passes tests, if we presume that the stochastic parrot is going to parrot out its training data sufficiently closely each time, why is it important?
As far as compilers being deterministic, there's a fascinating detail we ran into with Ksplice. They're not. They're only deterministic enough that we trust them to be fine. There was this bug we kept tripping, back in roughly 2006, where GCC would swap registers used for a variable, resulting in the Ksplice patch being larger than it had to be, to include handling the register swap as well. The bug has since been fixed, exposing the details of why it was choosing different registers, but unfortunately I don't remember enough details about it. So don't believe me if you don't want to, but the point is, we trust the C compiler, given a function that takes in variables a, b, c, d, that a, b, c, and d will be mapped to r0, r1, r2, or r3. We don't actually care what order that mapping goes in, so long as it works.
So the leap, that some have made, and others have not, is that LLMs aren't going to randomly flip out and delete all your data. Which is funny, because that's actually happened on Replit. Despite that, despite the fact that LLMs still hallucinate total bullshit and go off the rails, some people trust LLMs enough to convert a spec to working code. Personally, I think we're not there yet and won't be while GPU time isn't free. (Arguably it is already, because anybody can just start typing into chat.com, but that's propped up by VC funding. That isn't infinite, so we'll have to see where we're at in a couple of years.)
That addresses the determinism part. The other part that was raised is debuggable. Again, I don't think we're at a place where we can get rid of generated code any time soon, and as long as code is being generated, then we can debug it using traditional techniques. As far as debugging LLMs themselves, it's not zero. They're not mainstream yet, but it's an active area of research. We can abliterate models and fine tune them (or whatever) to answer "how do you make cocaine", counter to their training. So they're not total black boxes.
Thus, even if traditional software development dies off, the new field is LLM creation and editing. As with new technologies, porn picks it up first. Llama and other downloadable models (they're not open source: https://www.downloadableisnotopensource.org/ ) have been fine-tuned or whatever to generate adult content, despite being trained not to. So that's new jobs being created in a new field.
What does "it works" mean to you? For me, that'd be deterministic behavior, and your description about brute forcing LLMs to the desired result through a feedback loop with tests is just that. I mean, sure, if something gives the same result 100% of the time, or 90% of the time, or fuck it, even 80-50% of the time, that's all deterministic in the end, isn't it?
The interesting thing is, for something to be deterministic that thing doesn't need to be defined first. I'd guess we can get an understanding of day/night-cycles without understanding anything about the solar system. In that same vein your Ksplice GCC bug doesn't sound nondeterministic. What did you choose to do in the case of the observed Ksplice behavior? Did you debug and help with the patch, or did you just pick another compiler? It seems that somebody did the investigation to bring GCC closer to the "same result 100% of the time", and I truly have to thank that person.
But here we are and LLMs and the "90% of the time"-approach are praised as the next abstraction in programming, and I just don't get it. The feedback loop is hailed as the new runtime, whereas it should be build time only. LLMs take advantage of the solid foundations we built and provide an NLP-interface on top - to produce code, and do that fast. That's not abstraction in the sense of programming, like Assembly/C++/Blender, but rather abstraction in the sense of distance, like PC/Network/Cloud. We use these "abstractions in distance" to widen reach, design impact and shift responsibilities.
Having been writing a lot of AWS CDK/IAC code lately, I'm looking at this as the "spec" being the infrastructure code and the implementation being the deployed services based on the infrastructure code.
It would be an absolute clown show if AWS could take the same infrastructure code and perform the deployment of the services somehow differently each time... so non-deterministically. There's already all kinds of external variables other than the infra code which can affect the deployment, such as existing deployed services which sometimes need to be (manually) destroyed for the new deployment to succeed.
The fundamental frustration most engineers have with AI coding is that they are used to the act of _writing_ code being expensive, and the accumulation of _understanding_ happening for free during the former. AI makes the code free, but the understanding part is just as expensive as it always was (although, maybe the 'research' technique can help here).
But let's assume you're much better than average at understanding code by reviewing it -- you have another frustrating experience to get through with AI. Pre-AI, let's say 4 days of the week are spent writing new code, while 1 day is spent fixing unforeseen issues (perhaps incorrect assumptions) that came up after production integration or showing things to real users. Post-AI, someone might be able to write those 4 days' worth of code in 1 day, but making decisions about unexpected issues after integration doesn't get compressed -- that still takes 1 day.
So post-AI, your time switches almost entirely from the fun, creative act of writing code to the more frustrating experience of figuring out what's wrong with a lot of code that is almost correct. But you're way ahead -- you've tested your assumptions much faster, but unfortunately that means nearly all of your time will now be spent in a state of feeling dumb and trying to figure out why your assumptions are wrong. If your assumptions were right, you'd just move forward without noticing.
I've used this pattern on two separate codebases. One was ~500k LOC apache airflow monolith repo (I am a data engineer). The other was a greenfield flutter side project (I don't know dart, flutter, or really much of anything regarding mobile development).
All I know is that it works. On the greenfield project the code is simple enough to mostly just run `/create_plan` and skip research altogether. You still get the benefit of the agents and everything.
The key is really truly reviewing the documents that the AI spits out. Ask yourself if it covered the edge cases that you're worried about or if it truly picked the right tech for the job. For instance, did it break out of your sqlite pattern and suggest using postgres or something like that. These are very simple checks that you can spot in an instant. Usually chatting with the agent after the plan is created is enough to REPL-edit the plan directly with claude code while it's got it all in context.
At my day job I've got to use github copilot, so I had to tweak the prompts a bit, but the intentional compaction between steps still happens, just not quite as efficiently because copilot doesn't support sub-agents in the same way as claude code. However, I am still able to keep productivity up.
-------
A personal aside.
Immediately before AI assisted coding really took off, I started to feel really depressed that my job was turning into a really boring thing for me. Everything just felt like such a chore. The death by a million paper cuts is real in a large codebase with the interplay and idiosyncrasies of multiple repos, teams, personalities, etc. The main benefit of AI assisted coding for me personally seems to be smoothing over those paper cuts.
I derive pleasure from building things that work. Every little thing that held up that ultimate goal was sucking the pleasure out of the activity that I spent most of my day trying to do. I am much happier now having impressed myself with what I can build if I stick to it.
I appreciate the share. Yes, as I said, it was pretty dang uncomfortable to transition to this new way of working, but now that it’s settled we’re never going back.
I built a package which I use for large codebase work[0].
It starts with /feature, and takes a description. Then it analyzes the codebase and asks questions.
Once I’ve answered the questions, it writes a plan in markdown. There will be 8-10 markdown files with descriptions of what it wants to do and full code samples.
Then it does a “code critic” step where it looks for errors. Importantly, this code critic is wrong about 60% of the time. I review its critique and erase a bunch of dumb issues it’s invented.
By that point, I have a concise folder of changes along with my original description, and it’s been checked over. Then all I do is say “go” to Claude Code and it’s off to the races doing each specific task.
This helps it keep from going off the rails, and I’m usually confident that the changes it made were the changes I wanted.
I use this workflow a few times per day for all the bigger tasks and then use regular Claude code when I can be pretty specific about what I want done. It’s proven to be a pretty efficient workflow.
I will never understand why anyone wants to go through all this. I don't believe for a second this is more productive than regular coding with a little help from the LLM.
I got access to Kiro from Amazon this week and they’re doing something similar. First a requirements document is written based on your prompt, then a design document and finally a task list.
At first I thought that was pretty compelling, since it includes more edge cases and examples that you otherwise miss.
In the end all that planning still results in a lot of pretty mediocre code that I ended up throwing away most of the time.
Maybe there is a learning curve and I need to tweak the requirements more tho.
For me personally, the most successful approach has been a fast iteration loop with small and focused problems. Being able to generate prototypes based on your actual code and exploring different solutions has been very productive. Interestingly, I kind of have a similar workflow where I use Copilot in ask mode for exploration, before switching to agent mode for implementation, sounds similar to Kiro, but somehow it’s more successful.
Anyways, trying to generate lots of code at once has almost always been a disaster, and even the most detailed prompt doesn’t really help much. I’d love to see what the code and projects of people claiming to run more than 5 LLMs concurrently look like, because with the tools I’m using, that would be a mess pretty fast.
I doubt there's much you could do to make the output better. And I think that's what really bothers me. We are layering all this bullshit on to try and make these things more useful than they are, but it's like building a house on sand. The underlying tech is impressive for what it is, and has plenty of interesting use cases in specific areas, but it flat out isn't what these corporations want people to believe it is. And none of it justifies the massive expenditure of resources we've seen.
It’s not necessarily faster to do this for a single task. But it’s faster when you can do 2-3 tasks at the same time. Agentic coding increases throughput.
Until you reach the human bottleneck of having to context switch, verify all the work, presumably tell them to fix it, and then switch back to what you were doing or review something else.
I believe people are being honest when they say these things speed them up, because I'm sure it does seem that way to them. But reality doesn't line up with the perception.
True, if you are in a big company with lots of people, you won't benefit much from the improved throughput of agentic coding.
A greenfield startup, however, with agentic coding in its DNA will be able to run loops around a big company with lots of human bottlenecks.
The question becomes, will greenfield startups, doing agentic coding from the ground up, replace big companies with these human bottlenecks like you describe?
What does a startup, built using agentic coding with proper engineering practices, look like when it becomes a big corporation & succeeds?
That's not my point at all. It doesn't matter where you work: if a developer is working in a code base with a bunch of agents, they are always going to be the bottleneck. All the agent threads have to merge back to the developer thread at some point. The more agent threads, the more context switching has to occur, and the smaller the productivity improvement gets, until you eventually end up in the negative.
I can believe a single developer with one agent doing some small stuff and using some other LLM tools can get a modest productivity boost. But having 5 or 10 of these things doing shit all at once? No way. Any gains are offset by having to merge and quality check all that work.
I've always assumed it is because they can't do the regular coding themselves. If you compare spending months on trying to shake a coding agent into not exploding too much with spending years on learning to code, the effort makes more sense
I'm in the same boat. I'm 20 years into my SWE career; I can write all the things Claude Code writes for me now, but it still makes me faster and helps me deliver better-quality features (like accessibility features, transitions, nice-to-have bells and whistles) I may not have had time for, or even thought of, otherwise. And all that with documentation and tests.
There is a chunk of devs using AI who do it not because they believe it makes them more productive in the present, but because it might do so in the near future thanks to advances in AI tech/models. And then some do it because they think their bosses might require them to work this way at some point in the future, so they can show preparedness and give the impression of being up to date with how the field evolves, even if in the end it turns out it doesn't speed things up that much.
That line of thinking makes no sense to me honestly.
We are years into this, and while the models have gotten better, the guard rails that have to be put on these things to keep the outputs even semi useful are crazy. Look into the system prompts for Claude sometime. And then we have to layer all these additional workflows on top... Despite the hype I don't see any way we get to this actually being a more productive way to work anytime soon.
And not only are we paying money for the privilege of working slower (in some cases people are shelling out for multiple services), but we're paying with our time. There is no way working this way doesn't degrade your fundamental skills and, maybe worse, your understanding of how things actually work.
Although I suppose we can all take solace in the fact that our jobs aren't going anywhere soon, if this is what it takes to make these things work.
And most importantly, we're paying with our brain and skills degradation. Once all these services stop being subsidised there will be a massive amount of programmers who no longer can code.
I'm sorry to be blunt here, but the fact you're looking at idiotic use of Claude.md system prompts tells me you're not actually looking at the most productive users, and your opinion doesn't even cover 'where we are'.
I don't blame people who think this. I've stopped visiting Ai Subreddits because the average comment and post is just terrible, with some straight up delusional.
But broadly speaking - in my experience - either you have your documentation set up correctly and cleanly, such that a new junior hire could come in and build or fix something in a few days without too many questions, or you don't. That same distinction seems to cut between teams who get the most out of AI and those that insist everybody using it must be losing more time than it saves.
---
I suspect we could even flip it around: the cost it takes to get an AI functioning in your code base is a good proxy for technical debt.
The claim I was responding to was that some people use our friends the magic robots not because they think they are useful now, but because they think they might be useful in the future.
You spend a few minutes generating a spec, then agents go off and do their coding, often lasting 10-30 minutes, including running and fixing lints, adding and running tests, ...
Then you come back and review.
But you had 10 of these running at the same time!
You become a manager of AI agents.
For many, this will be a shitty way to spend their time.... But it is very likely the future of this profession.
Anyway… watch the videos the OP has of the coding live streams. That's the most interesting part of this post: actual real examples of people really using these tools, in a way that is transferable and specifically detailed enough to copy and do yourself.
For each process, say you spend 3 minutes generating a spec. Presumably you also spend 5 minutes in PR and merging.
You can’t do 10 of these processes at once, because there’s 8 minutes of human administration which can’t be parallelised for every ~20min block of parallelisable work undertaken by Claude; the serial human work caps you at roughly 28/8 ≈ 3.5 tasks in flight. In practice you can have two, and intermittently three, parallel processes at once under the regime described here.
The number you have running is irrelevant, primarily because humans are absolutely terrible at multitasking and context switching. An endless number of studies have been done on this. Each context switch costs you a non-trivial amount of time. And yes, even in the same project, especially big ones, you will be context switching each time one of these finishes its work.
That coupled with the fact that you have to meticulously review every single thing the AI does is going to obliterate any perceived gains you get from going through all the trouble to set this up. And on top of that it's going to be expensive as fuck quick on a non trivial code base.
And before someone says "well you don't have to be that thorough with reviews", in a professional settings absolutely you do. Every single AI policy in every single company out there makes the employee using the tool solely responsible for the output of the AI. Maybe you can speed run when you're fucking around on your own, but you would have to be a total moron to risk your job by not being thorough. And the more mission critical the software the more thorough you have to be.
At the end of the day a human with some degree of expertise is the bottleneck. And we are decades away from these things being able to replace a human.
How about a bug-fixing use case? Let an agent pick a bug from Jira, do some research and thinking, and set up the data and environment for reproduction. Let it write a unit test manifesting the bug (making it a failing test). Let it take a shot at implementing the fix. If it succeeds, let it make a PR.
This can all be done autonomously without user interaction. Now many bugs can be few lines of code and might be relatively easy to review. Some of these bug fixes may fail, may be wrong etc. but even if half of them were good, this is absolutely worth it. In my specific experience the success rate was around 70%, and the rest of the fixes were not all worthless but provided some more insight into the bug.
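A rough sketch of what that loop might look like (the `agent` and `bug` objects here are hypothetical stand-ins, not a real Jira or agent API; the gates are the point):

```python
# Hypothetical orchestration sketch: agent.run(...) and the bug object are
# stand-ins, not real APIs. The gating is what matters: no failing test, no fix;
# no passing suite, no PR.
import subprocess

def tests_pass():
    return subprocess.run(["pytest", "-q"]).returncode == 0

def try_autofix(bug, agent):
    agent.run(f"Research bug {bug.key}: {bug.summary}. "
              "Set up the data and environment needed to reproduce it.")
    agent.run("Write a unit test that fails because of this bug.")
    if tests_pass():
        return None            # the test doesn't manifest the bug; leave it for a human
    agent.run("Implement the smallest fix that makes the new test pass.")
    if not tests_pass():
        return None            # fix attempt failed; leave it for a human
    return agent.run(f"Open a PR for {bug.key} containing the test and the fix.")
```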
The biggest challenge I've found with LLMs on a large codebase is them making the same mistakes again and again. How do you keep track of the architecture decisions in the context of every task on a large codebase?
Very very clear, unambiguous, prompts and agent rules. Use strong language like "must" and "critical" and "never" etc. I would also try working on smaller sections of a large codebase at a time too if things are too inaccurate.
The AI coding tools are going to be looking at other files in the project to help with context. Ambiguity is the death of AI effectiveness. You have to keep things clear and so that may require addressing smaller sections at a time. Unless you can really configure the tools in ways to isolate things.
This is why I like tools that have a lot of control and are transparent. If you ask a tool what the full system and user prompt is and it doesn't tell you? Run away from that tool as fast as you can.
You need to have introspection here. You have to be able to see what causes a behavior you don't want and be able to correct it. Any tool that takes that away from you is one that won't work.
> Use strong language like "must" and "critical" and "never" etc.
Truly we live in the stupidest timeline. Imagine if you had a domestic robot but when you asked it make you breakfast you had to preface your request with “it’s critical that you don’t kill me.”
Or when you asked it to do the laundry you had to remember to tell it that it “must not make its own doors by knocking holes in the wall” and hope that it listens.
There is even a chance our timeline might include that robot too at some point...
Book recommendation no one asked for but which is essentially about some guy living through multiple more or less stupid timelines: Count to Eschaton series by John C. Wright
I start my sessions with something like `!cat ./docs/*` and I can start asking questions. Make sure you regularly ask it to point out any inconsistencies or ambiguity in the docs.
In some sense “the same mistakes again and again” is either a prompting problem or a “you” problem insofar as your expectations differ from the machine overlords.
This article is like a bookmark in time of exactly where I gave up (in July) on managing context in Claude code.
I made specs for every part of the code in a separate folder and that had in it logs on every feature I worked on. It was an API server in python with many services like accounts, notifications, subscriptions etc.
It got to the point where managing context became extremely challenging. Claude would not be able to determine the business logic properly, and it can get complex: e.g. a simple RBAC system with an account and a profile, with a junction table for roles joining account and profile. In the end what kind of worked was giving it UML diagrams of the relationships, with examples, to make it understand and behave better.
> the number one concern "what happens if we end up owning this codebase but ... don't know how to steer a model on how to make progress"
> Research lets you get up to speed quickly on flows and functionality
This is the _je ne sais quoi_ that people who are comfortable with AI have made peace with and those who are not have not. If you don't know what the code base does or how to make progress, you are effectively trusting the system that built the thing you don't understand to understand the thing and teach you. And then from that understanding you're going to direct the teacher to make changes to the system it taught you to understand. Which suggests a certain _je ne sais quoi_ about human intelligence that isn't present in the system, but which would be necessary to create an understanding of the thing under consideration. Which leaves your understanding questionable, because it was sourced from something that _lacks_ that _je ne sais quoi_. But the timescale of failure here is "lifetimes". Of features, of codebases, of persons.
So I can attest to the fact that all of the things proposed in this article actually work. And you can try it out yourself on any arbitrary code base within a few minutes.
This is how: I work for a company called NonBioS.ai - we already implement most of what is mentioned in this article. Actually, we implemented this about 6 months back, and what we have now is an advanced version of the same flow. Every user in NonBioS gets a full Linux VM with root access. You can ask nonbios to pull in your source code and ask it to implement any feature. The context is all managed automatically through a process we call "Strategic Forgetting", which is in some ways an advanced version of the logic in this article.
Strategic Forgetting handles the context automatically - think of it like automatic compaction. It evaluates information retention based on several key factors:
1. Relevance Scoring: We assess how directly information contributes to the current objective vs. being tangential noise
2. Temporal Decay: Information gets weighted by recency and frequency of use - rarely accessed context naturally fades
3. Retrievability: If data can be easily reconstructed from system state or documentation, it's a candidate for pruning
4. Source Priority: User-provided context gets higher retention weight than inferred or generated content
The algorithm runs continuously during coding sessions, creating a dynamic "working memory" that stays lean and focused. Think of it like how you naturally filter out background conversations to focus on what matters.
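To give a feel for the kind of scoring this involves, here's a deliberately simplified, illustrative sketch - not our actual implementation, and the weights, field names, and token estimate are made up for the example:

```python
from dataclasses import dataclass
import math, time

@dataclass
class ContextItem:
    text: str
    relevance: float      # 1. how directly it serves the current objective (0..1)
    last_used: float      # 2. unix timestamp of last access...
    use_count: int        #    ...and how often it has been referenced
    retrievable: bool     # 3. can it be rebuilt from system state or docs?
    user_provided: bool   # 4. user-provided context outranks inferred/generated content

def retention_score(item, now=None, half_life_s=1800.0):
    now = time.time() if now is None else now
    decay = 0.5 ** ((now - item.last_used) / half_life_s)   # temporal decay
    frequency = math.log1p(item.use_count)                  # frequency of use
    score = item.relevance * (decay + 0.25 * frequency)
    if item.retrievable:
        score *= 0.5        # cheap to reconstruct, so a candidate for pruning
    if item.user_provided:
        score *= 1.5        # source priority
    return score

def prune(items, token_budget):
    """Keep the highest-scoring items until the (rough) token budget is spent."""
    kept, used = [], 0
    for item in sorted(items, key=retention_score, reverse=True):
        cost = max(1, len(item.text) // 4)   # crude tokens-per-character estimate
        if used + cost <= token_budget:
            kept.append(item)
            used += cost
    return kept
```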
And we have tried it out on very complex code bases and it works pretty well. Once you know how well it works, you will not have a hard time believing that the days of using IDEs to edit code are probably numbered.
Also - you can try it out for yourself very quickly at NonBioS.ai. We have a very generous free tier that will be enough for the biggest code base you can throw at nonbios. However, big feature implementations or larger refactorings might take longer than what is afforded in the free tier.
It's strange that the author is bragging that this 35K LOC was researched and implemented in 7 hours, but there are 40 commits spanning 7 days. Was it 1 hour per day or what?
Also quite funny that one of the latest commits is "ignore some tests" :D
FWIW I think your style is better and more honest than most advocates. But I'd really love to see some examples of things that completely failed. Because there have to be some, right? But you hardly ever see an article from an AI advocate about something that failed, nor from an AI skeptic about something that succeeded. Yet I think these would be the types of things that people would truly learn from. But maybe it's not in anyone's financial interest to cross borders like that, for those who are heavily vested in the ecosystem.
But, yeah, looking again, that was a pretty big omission. And even moreso, a missed opportunity! I think if this had been called out more explicitly, then rather than arguing whether this is a realistic workflow or not, we'd be seeing more thoughtful conversation about how to fix the remaining problems.
I don't mean to sound discouraging. Keep up the good work!
I think what the OP is asking for is an article _like this one_ about where you go in-depth into what you tried, where the system went, and more specifically what went wrong (even if it's just a list of "undifferentiated issues"). Because "we tried a thing. It didn't work. We bailed out." doesn't show off the rough edges of the tool in a way that helps people understand "the shape of the elephant".
You do acknowledge this but this doesn't make the "spent 7 hours and shipped 35k LOC" claim factually correct or true. It sure sounds good but it's disingenuous, because shipping != making progress. Shipping code means deploying it to the end users.
I'm always amazed when I see xKLOC metrics being thrown out like they matter somehow. The bar has always been shipped code. If it's not being used, it's merely a playground or learning exercise.
we generated 35K LOC in 7 hours, 7 days of fixes and we shipped it.
This at least makes it clearer that it's on par with what it would take a senior BAML team member to accomplish, which is kind of impressive on its own. Not sure about ignoring the tests, though.
There are a lot of people declaring this, proclaiming that about working with AI, but nobody presents the details. Talk is cheap, show me the prompts. What will be useful is to check in all the prompts along with code. Every commit generated by AI should include a prompt log recording all the prompts that led to the change. One should be able to walkthrough the prompt log just as they may go through the commit log and observe firsthand how the code was developed.
I agree, the rare times when someone has shared prompts and AI generated code I have not been impressed at all. It very quickly accrues technical debt and lacks organization. I suspect the people who say it’s amazing are like data engineers who are used to putting everything in one script file, React devs where the patterns and organization are well defined and constrained, or people who don’t code and don’t even understand the issues in their generated code yet.
A few weeks later, @hellovai and I paired on shipping 35k LOC to BAML, adding cancellation support and WASM compilation - features the team estimated would take a senior engineer 3-5 days each.
Sorry, had they effectively estimated that an engineer should produce 4-6KLOC per day (that's before genAI)?
It seems we're still collectively trying to figure out the boundaries of "delegation" versus "abstraction" which I personally don't think are the same thing, though they are certainly related and if you squint a bit you can easily argue for one or the other in many situations.
> We've gotten claude code to handle 300k LOC Rust codebases, ship a week's worth of work in a day, and maintain code quality that passes expert review.
This seems more like delegation just like if one delegated a coding task to another engineer and reviewed it.
> That in two years, you'll be opening python files in your IDE with about the same frequency that, today, you might open up a hex editor to read assembly (which, for most of us, is never).
This seems more like abstraction just like if one considers Python a sort of higher level layer above C and C a higher level layer above Assembly, except now the language is English.
I would say its much more about abstraction and the leverage abstractions give you.
You'll also note that while I talk about "spec driven development", most of the tactical stuff we've proven out is downstream of having a good spec.
But in the end a good spec is probably "the right abstraction" and most of these techniques fall out as implementation details. But to paraphrase sandy metz - better to stay in the details than to accidentally build against the wrong abstraction (https://sandimetz.com/blog/2016/1/20/the-wrong-abstraction)
I don't think delegation is right - when me and vaibhav shipped a week's worth of work in a day, we were DEEPLY engaged with the work, we didn't step away from the desk, we were constantly resteering and probably sent 50+ user messages that day, in addition to some point-edits to markdown files along the way.
I continue to write codebases in programming languages, not English. LLM agents just help me manipulate that code. They are tools that do work for me. That is delegation, not abstraction.
To write and review a good spec, you also need to understand your codebase. How are you going to do that without reading the code? We are not getting abstracted away from our codebases.
For it to be an abstraction, we would need our coding agents to not only write all of our code, they would also need to explain it all to us. I am very skeptical that this is how developers will work in the near future. Software development would become increasingly unreliable as we won't even understand what our codebases actually do. We would just interact with a squishy lossy English layer.
No not really. They didn’t need to spend a lot of time looking at the output because (especially back then) they mostly knew exactly what the assembly was going to look like.
With an LLM, you don't need to move down to the code layer so you can optimize a tight loop. You need to look at the code so you can verify that the LLM didn't write a completely different program than what you asked it to write.
Probably at first when the compiler was bad at producing good assembly. But even then, the compiler would still always produce code that matches the rules of the language. This is not the case with LLMs. There is no indication that in the future LLMs will become deterministic such that we could literally write codebases in English and then "compile" them using an LLM into a programming language of our choice and rely on the behaviour of the final program matching our expectations.
This is why LLMs are categorically not compilers. They are not translating English code into some other type of code. They are taking English direction and then writing/editing code based upon that. They are working on a codebase alongside us, as tools. And then you still compile that code using an actual compiler.
We will start to trust these tools more and more, and probably spend less time reviewing the code they produce over time. But I do not see a future where professional developers completely disregard the actual codebase and rely entirely on LLMs for code that matters. That would require a completely different category of tools than what we have today.
I mean, the ones who were actually _writing_ a C compiler, sure, and to some who were in performance critical spaces (early C compilers were not _good_). But normal programmers, checking for correctness, no, absolutely not. Where did you get that idea?
(The golden age of looking at compiler-generated assembly would've been rather later, when processors added SIMD instructions and compilers started trying to get clever about using them.)
if you haven't tried the research -> plan -> implementation approach here, you are missing out on how good LLMs are. it completely changed my perspective.
the key part was really just explicitly thinking about different levels of abstraction at different levels of vibecoding. I was doing it before, but not explicitly in discrete steps, and that was where i got into messes. The prior approach made checkpointing / reverting very difficult.
When i think of everything in phases, i do similar stuff w/ my git commits at "phase" levels, which makes design decisions easier to make.
I also do spend ~4-5 hours cleaning up the code at the very very end once everything works. But its still way faster than writing hard features myself.
tbh I think the thing that's making this new approach so hard to adopt for many people is the word "vibecoding"
Like yes vibecoding in the lovable-esque "give me an app that does XYZ" manner is obviously ridiculous and wrong, and will result in slop. Building any serious app based on "vibes" is stupid.
But if you're doing this right, you are not "coding" in any traditional sense of the word, and you are *definitely* not relying on vibes
I'm sticking to the original definition of "vibe coding", which is AI-generated code that you don't review.
If you're properly reviewing the code, you're programming.
The challenge is finding a good term for code that's responsibly written with AI assistance. I've been calling it "AI-assisted programming" but that's WAY too long.
> but not explicitly in discrete steps and that was where i got into messes.
I've said this repeatedly: I mostly use it for boilerplate code, or when I'm having a brain fart of sorts. I still love to solve things for myself, but AI can take me from "I know I want x, y, z" to "oh look, I got to x, y, z" in under 30 minutes, which could have taken hours. For side projects this is fine.
I think if you do it piecemeal it should almost always be fine. When you try to tell it to do too much, you and the model both don't consider edge cases (ask it for those too!) and are more prone to a rude awakening eventually.
Good pointers on decomposing and looking at implementation or fixes in chunks.
1. Break down the feature or bug report into a technical implementation spec. Add in COT for the splits.
2. Verify the implementation spec. Feed reviews back to your original agent that has created the spec. Edit, merge, integrate feedback.
3. Transform the implementation spec into an implementation plan - logically split into modules, looking at the dependency chain.
4. Build, test and integrate continuously with coding agents
5. Squash the commits if needed into a single one for the whole feature.
Generally this has worked well as a process when working on a complex feature. You can add in HITL (human-in-the-loop) at each stage if you need more verification.
For larger codebases always maintain an ARCHITECTURE.md and for larger modules a DESIGN.md
I admittedly haven't tried this approach at work yet but at home while working on a side project, I'll make a new feature branch and give CLAUDE a prompt about what the feature is with as much detail as possible. i then have it generate a CLAUDE-feature.md and place an implementation plan along with any supporting information (things we have access to in the codebase, etc.).
i'll then prompt it for more based on if my interpretation of the file is missing anything or has confusing instructions or details.
usually in-between larger prompts I'll do a full /reset rather than /compact, have it reference the doc, and then iterate some more.
once it's time to try implementing I do one more /reset, then go phase by phase of the plan in increments /reset-ing between each and having it update the doc with its progress.
generally works well enough but not sure i'd trust it at work.
> It was uncomfortable at first. I had to learn to let go of reading every line of PR code. I still read the tests pretty carefully, but the specs became our source of truth for what was being built and why.
This is exactly right. Our role is shifting from writing implementation details to defining and verifying behavior.
I recently needed to add recursive uploads to a complex S3-to-SFTP Python operator that had a dozen path manipulation flags. My process was:
* Extract the existing behavior into a clear spec (i.e., get the unit tests passing).
* Expand that spec to cover the new recursive functionality.
* Hand the problem and the tests to a coding agent.
I quickly realized I didn't need to understand the old code at all. My entire focus was on whether the new code was faithful to the spec. This is the future: our value will be in demonstrating correctness through verification, while the code itself becomes an implementation detail handled by an agent.
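Roughly, that "spec as unit tests" step can look like the sketch below. To be clear, the `transfer` function, its flags, and the fake SFTP client are hypothetical stand-ins for illustration, not the actual operator:

    # Hypothetical reconstruction of "spec as tests" for an S3-to-SFTP sync.
    # Names and flags are made up for illustration.
    class FakeSFTP:
        """Records uploads instead of talking to a real SFTP server."""
        def __init__(self):
            self.uploaded = []  # list of (s3_key, remote_path)

        def put(self, s3_key, remote_path):
            self.uploaded.append((s3_key, remote_path))


    def transfer(s3_keys, sftp, prefix, dest_dir, recursive=False, flatten=False):
        """Copy keys under `prefix` to `dest_dir`, optionally descending into 'subdirs'."""
        for key in s3_keys:
            rel = key[len(prefix):].lstrip("/")
            if not recursive and "/" in rel:
                continue  # non-recursive mode skips nested keys
            name = rel.split("/")[-1] if flatten else rel
            sftp.put(key, f"{dest_dir}/{name}")


    def test_non_recursive_skips_nested_keys():
        sftp = FakeSFTP()
        transfer(["in/a.csv", "in/sub/b.csv"], sftp, prefix="in", dest_dir="/out")
        assert sftp.uploaded == [("in/a.csv", "/out/a.csv")]


    def test_recursive_preserves_relative_paths():
        sftp = FakeSFTP()
        transfer(["in/a.csv", "in/sub/b.csv"], sftp, prefix="in",
                 dest_dir="/out", recursive=True)
        assert ("in/sub/b.csv", "/out/sub/b.csv") in sftp.uploaded

Once the existing behavior is pinned down like this, the new recursive case becomes one more test in the spec that the agent has to satisfy.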
> Our role is shifting from writing implementation details to defining and verifying behavior.
I could argue that our main job was always that - defining and verifying behavior. As in, it was a large part of the job. Time spent on writing implementation details has always been on a downward trend via higher level languages, compilers and other abstractions.
> My entire focus was on whether the new code was faithful to the spec
This may be true, but see Hyrum's Law, which says that the observed behavior of a heavily-used system becomes its public interface and specification, with all its quirks and implementation errors. It may be important to keep testing that the clients using the code are also faithful to the spec, and detect and handle discrepancies.
Claude Plays Pokemon showed that too. AI is bad at deciding when something is "working" - it will go in circles forever. But an AI combined with a human to occasionally course correct is a powerful combo.
If you actually define every inch of behavior, you are pretty much writing code. If there's any line in the PR that you can't instantly grok the meaning of, you probably haven't defined the full breadth of the behavior.
Maybe I am just misunderstanding. I probably am; seems like it happens more and more often these days
But.. I hate this. I hate the idea of learning to manage the machine's context to do work. This reads like a lecture in an MBA class about managing certain types of engineers, not like an engineering doc.
Never have I wanted to manage people. And never have I even considered my job would be to find the optimum path to the machine writing my code.
Maybe firmware is special (I write firmware)... I doubt it. We have a cursor subscription and are expected to use it on production codebases. Business leaders are pushing it HARD. To be a leader in my job, I don't need to know algorithms, design patterns, C, make, how to debug, how to work with memory mapped io, what wear leveling is, etc.. I need to know 'compaction' and 'context engineering'
I feel like a ship corker inspecting a riveted hull
Guess it boils down to personality, but I personally love it. I got into coding later in life, coming from a career that involved reading and writing voluminous amounts of text in English. I got into programming because I wanted to build web applications, not out of any love for the process of programming in and of itself. The less I have to think and write in code, the better. Much happier to be reading it and reviewing it than writing it myself.
No one likes programming that much. That's like saying someone loves speaking English. You have an idea and you express it. Sometimes there's additional complexity that gets in the way (initializing the library, memory cleanup, ...), but I put those at the same level as proper greetings in a formal letter.
It also helps starting small, getting something useful done and iterating by adding more features over time (or keeping it small).
> No one likes programming that much. That's like saying someone loves speaking English. You have an idea and you express it.
I can assure you both kinds of people exist. Expressing ideas as words or code is not a one-way flow if you care enough to slow down and look closely. Words/clauses and data structures/algorithms exert their own pull on ideas and can make you think about associated and analogous ideas, alternative ways you could express your solution, whether it is even worth solving explicitly and independently of a more general problem, etc.
IMO, that’s a sign of overthinking (and one thing I try hard to not get caught in). My process is usually:
- What am I trying to do?
- What data do I have available?
- Where do they come from?
- What operations can I use?
- What’s the final state/output?
Then it’s a matter of shifting into the formal space, building and linking stuff.
What I did observe is a lot of people hate formalizing their thoughts. Instead they prefer tweaking stuff until something kinda works and they can go on to the next ticket/todo item. There’s no holistic view about the system. And they hate the 5 why’s. Something like:
- Why is the app displaying “something went wrong” when I submit the form?
- Why is the response an error when the request is valid?
- Why is the data persisted when the handler is failing and giving a stack trace in the log?
- Why is it complaining about missing configuration for Firebase?
- …
Ignorance is the default state of programming effort. But a lot of people have great difficulty saying "I don't know" AND going to find the answer they lack.
None of this is excluded by my statement. And arguably someone else can draw a line in the sand and say most of this is overthinking somehow and you should let the machine worry about it.
I would love to let the computer do the investigative work for me, but I have to double check it, and there's not much mental energy and time saved (if you care about work quality). When I use `pgrep` to check if a process is running, I don't have to inspect the kernel memory to see if it's really there.
It's very much faster, cognitively, to just understand the project and master the tooling. Then it just becomes routine, like playing a short piano piece for the 100th time.
I've started to use agents on some very low-level code, and have middling results. For pure algorithmic stuff, it works great. But I asked it to write me some arm64 assembly and it failed miserably. It couldn't keep track of which registers were which.
Honestly - if it's such a good technique it should be built into the tool itself. I think just waiting for the tools to mature a bit will mean you can ignore a lot of the "just do xyz" crap.
It's not at senior engineer level until it asks relevant questions about lacking context instead of blindly trying to solve problems IMO.
I am still sceptical of the ROI and the time I am supposed to sink into trying and learning these AI tools, which seem to be replacing each other every week.
For me the biggest difficulty is I find it hard to read unverifiable documentation. It's like dyslexia - if I can't connect the text content with runnable code, I feel lost in 5 minutes.
So with this approach of spending 3 hours on planning without verification in code, that's too hard for me.
I agree the context compaction sounds good. But I'm not sure if an md file is good enough to carry the info from research to plan and implementation. Personally I often find the context is too complex or the problem is too big. I just open a new session to resolve a smaller, more specific problem in source code, then test and review the source code.
My problem is it keeps working, even when it reaches certain things it doesn't know how to do.
I've been experimenting with Github agents recently, they use GPT-5 to write loads of code, and even make sure it compiles and "runs" before ending the task.
Then you go and run it and it's just garbage, yeah it's technically building and running "something", but often it's not anything like what you asked for, and it's splurged out so much code you can't even fix it.
> Context has never been the bottleneck for me. AI just stops working when I reach certain things that AI doesn't know how to do.
It's context all the way down. That just means you need to find and give it the context to enable it to figure out how to do the thing. Docs, manuals, whatever.
Same stuff that you would use to enable a human that doesn't know how to do it to figure out how.
I've had AI totally fail several times on Swift concurrency issues, i.e. threads deadlocking or similar issues. I've also had AI totally fail on memory usage issues in Swift. In both cases I've had to go back to reasoning over the bugs myself and debugging them by hand, fixing the code by hand.
Anything it has not been trained on. Try getting AI to use OpenAI's responses API. You will have to try very hard to convince it not to use the chat completions API.
yeah once again you need the right context to override what's in the weights. It may not know how to use the responses api, so you need to provide examples in context (or tools to fetch them)
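Concretely, "providing examples in context" can just mean pasting a known-good call shape into the prompt, e.g. something like this for the Responses API (as I understand the current openai Python SDK; the model name is a placeholder, so check the docs):

    # Snippet to drop into the agent's context so it reaches for the
    # Responses API rather than chat completions. Model name is a placeholder.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.responses.create(
        model="gpt-4o-mini",
        input="Summarize the attached release notes in three bullet points.",
    )
    print(resp.output_text)  # convenience accessor for the generated text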
This is just an issue with people who expect AI to solve all of life's problems before they get out of bed, not realising they have no idea how AI works or what it produces, and deciding "it stops working because it sucks" instead of "it stops working because I don't know what I'm doing".
In my limited experiments with Gemini: it stops working when presented with a program containing fundamental concurrency flaws. Ask it to resolve a race condition or deadlock and it will flail, eventually getting caught in a loop, suggesting the same unhelpful remedies over and over.
I imagine this has to do with concurrency requiring conceptual and logical reasoning, which LLMs are known to struggle with about as badly as they do with math and arithmetic. Now, it's possible that the right language to work with the LLM in these domains is not program code, but a spec language like TLA+. However, at that point, I'd probably save effort by just writing the potentially tricky concurrent code myself.
2. Write down the principles and assumptions behind the design and keep them current
In other words, the same thing successful human teams on complex projects do! Have we become so addicted to “attention-deficit agile” that this seems like a new technique?
Imagine, detailed specs, design documents, and RFC reviews are becoming the new hotness. Who would have thought??
yeah its kinda funny how some bigger more sophisticated eng orgs that would be called "slow and ineffective" by smaller teams are actually pretty dang well set-up to leverage AI.
All because they have been forced to master technical communication at scale.
but the reason I wrote this (and maybe a side effect of the SF bubble) is MOST of the people I have talked to, from 3-person startups to 1000+ employee public companies, are in a state where this feels novel and valuable, not a foregone conclusion or something happening automatically
> And yeah sure, let's try to spend as many tokens as possible
It'd be nice if the article included the cost for each project. A 35k LOC change in a 350k codebase with a bunch of back and forth and context rewriting over 7 hours, would that be a regular subscription, max subscription, or would that not even cover it?
I use a similar pattern but without the subagents. I get good results with it. I review and hand edit "research" and plans. I follow up and hand edit code changes. It makes me faster, especially in unfamiliar codebases.
But the write up troubles me. If I'm reading correctly, he did 1 bugfix (approved and merged) and then 2 larger PRs (1 merged, 1 still in draft over a month later). That's an insanely small sample size to draw conclusions from.
How can you talk like you've just proven the workflow works "for brownfield codebases"? You proved it worked for 2/3 tasks in 2 codebases, one failure (we can't say it works until the code is shipped IMO).
> Sean proposes that in the AI future, the specs will become the real code. That in two years, you'll be opening python files in your IDE with about the same frequency that, today, you might open up a hex editor to read assembly (which, for most of us, is never).
Only if AI code generation is correct 99.9% of the time and almost never hallucinates. We trust compilers and don't read assembly code because we know compilation is deterministic and the output can never be wrong (barring bugs and certain optimization issues, which are rare/one-time fixes). As long as the generated code is not doing what the original "code" (in this case, the spec) says, humans need to go back and fix things themselves.
I used a similar pattern. When asking AI to do a large implementation, I ask gemini-2.5-pro to write a very detailed overview implementation plan. Then I review it. Then I ask gemini-2.5-pro to split the plan into multiple stages and write a detailed implementation plan for each stage. Then I ask Claude Sonnet to read the overview plan and implement stage n. I found that this is the only way to complete a major implementation with a relatively high success rate.
This article bases its argument on the premise that AI _at worst_ will increase developer productivity by 0-10%. But several studies have found that not to be true at all. AI can, and does, make some people less effective.
There's also the more insidious gap between perceived productivity and actual productivity. Doesn't help that nobody can agree on how to measure productivity even without AI.
"AI can, and does, make some people less effective"
So those people should either stop using it or learn to use it productively. We're not doomed to live in a world where programmers start using AI, lose productivity because of it and then stay in that less productive state.
If managers are convinced by stakeholders who relentlessly put out pro-"AI" blog posts, then a subset of programmers can be forced to at least pretend to use "AI".
They can be forced to write in their performance evaluation how much (not if, because they would be fired) "AI" has improved their productivity.
Both (1) "AI can, and does, make some people less effective" and (2) "the average productivity boost (~20%) is significant" (per Stanford's analysis) can be true.
The article at the link is about how to use AI effectively in complex codebases. It emphasizes that the techniques described are "not magic", and makes very reasonable claims.
the techniques described sound like just as much work, if not more, than just writing the code. the claimed output isn't even that great, it's comparable to the speed you would expect a skilled engineer to move at in a startup environment
> the techniques described sound like just as much work, if not more, than just writing the code.
That's very fair, and I believe that's true for you and for many experienced software developers who are more productive than the average developer. For me, AI-assisted coding is a significant net win.
Yet a lot of people never bother to learn vim, and are still outstanding and productive engineers. We're surely not seeing any memos "Reflexive vim usage is now a baseline expectation at [our company]" (context: https://x.com/tobi/status/1909251946235437514)
The as-of-yet unanswered question is: Is this the same? Or will non-LLM-using engineers be left behind?
Perhaps if we get the proper thought influencers on board we can look forward to C-suite VI mandates where performance reviews become descriptions of how we’ve boosted our productivity 10x with effective use of VI keyboard agents, the magic of g-prefixed VI technology, VI-power chording, and V-selection powered column intelligence.
According to the Stanford video the only cases (statistically speaking) where that happened was high-complexity tasks for legacy / low popularity languages, no? I would imagine that is a small minority of projects. Indeed, the video cites the overall productivity boost at 15 - 20% IIRC.
Question for discussion - what steps can I take as a human to set myself up for success where success is defined by AI made me faster, more efficient etc?
In many cases (though not all) it's the same thing that makes for great engineering managers:
smart generalists with a lot of depth in maybe a couple of things (so they have an appreciation for depth and complexity) but a lot of breadth so they can effectively manage other specialists,
and having great technical communication skills - being able to communicate what you want done and how, without over-specifying every detail or under-specifying tasks in important ways.
>where success is defined by AI made me faster, more efficient etc?
I think this attitude is part of the problem to me; you're not aiming to be faster or more efficient (and using AI to get there), you're aiming to use AI (to be faster and more efficient).
A sincere approach to improvement wouldn't insist on a tool first.
Can't agree with the formula for performance, on the "/ size" part. You can have a huge codebase, but if the complexity goes up with size then you are screwed. Wouldn't a huge but simple codebase be practical and fine for AI to deal with?
The hierarchy of leverage concept is great! Love it. (Can't say I like the "1 bad line of CLAUDE.md is 100K lines of bad code" framing; I've had some bad lines in my CLAUDE.md from time to time - I almost always let Claude write its own CLAUDE.md.)
i mean there's also the fact that claude code injects this system message into your claude.md which means that even if your claude.md sucks you will probably be okay:
<system-reminder>
IMPORTANT: this context may or may not be relevant to your tasks. You should not respond to this context or otherwise consider it in your response unless it is highly relevant to your task.
Most of the time, it is not relevant.
</system-reminder>
lots of others have written about this so i won't go deep - it's a clear product decision. but if you don't know what's in your context window, you can't architect your balance between claude.md and /commands well.
I also think a lot of coding benchmarks and perhaps even RL environments are not accounting for the messy back and forth of real world software development, which is why there's always a gap between the promise and reality.
I have had a user story and a research plan and only realized deep in the implementation that a fundamental detail about how the code works was missing (specifically, that types and SDKs are generated from the OpenAPI spec) - that omission meant the plan was wrong (I didn't read it carefully enough) and the implementation was a mess
Yeah I agree. There's a lot more needed than just the User Story, one way I'm thinking about it is that the "core" is deliverable business value, and the "shells" are context required for fine-grained details. There will likely need to be a step to verify against the acceptance criteria.
I hope to back up this hypothesis with actual data and experiments!
Hello, I noticed your privacy policy is a black page with text seemingly set to 1% or so opacity. Can you get the slopless AI to fix that when time permits?
I'm using GPT Pro and a VS extension that makes it easy to copy code from multiple files at once. I'm architecting the new version of our SaaS and using it to generate everything for me on the backend. It’s a huge help with modeling and coding, though it takes a lot of steering and correction. I think I’ll end up with a better result than if I did it alone, since it knows many patterns and details I’m not aware of (even simple things like RRULE). I’m designing this new project with a simpler, more vertical architecture in the hopes that Codex will be able to create new tables and services easily once the initial structure is ready and well documented.
yeah flat, simple code is good to start, but I find I'm still developing instincts around the right balance between "when to let duplicate code sprawl" vs. "when to be the DRY police".
Re the meta of running multiple phases of "document expansion":
Research helps with complex implementations and for brownfield. But it isn't always needed - simple bugfixes can be one-shot!
So all AI workflows could be expressed with some number "N" of "document expansion phases":
N(0): vibe coding.
N(1): "write a spec then implement it while I watch".
N(2): "research then specify". At this point you start to get serious steerability.
What's N(3) and beyond? Strategy docs, industry research, monetization planning? Can AI do these too, all of it ending up in git? Interesting to muse on.
1. Go's spec and standard practices are more stable, in my experience. This means the training data is tighter and more likely to work.
2. Go's types give the llm more information on how to use something, versus the python model.
3. Python has been an entry-level accessible language for a long time. This means a lot of the code in the training set is by amateurs. Go, ime, is never someone's first language. So you effectively only get code from people who already have other programming experience.
4. Go doesn't do much 'weird' stuff. It's not hard to wrap your head around.
yeah i love that there is a lot of source data for "what is good idiomatic go" - the model doesn't have it all in the training set but you can easily collect coding standards for go with deep research or something
And then I find models try to write scripts/manual workflows for testing, but Go is REALLY good for doing what you might do in a bash script, and so you can steer the model to build its own feedback loop as a harness in go integration tests (we do a lot of this in github.com/humanlayer/humanlayer/tree/main/hld)
Except for ofc pushing their own product (humanlayer) and some very complex prompt template+agent setups that are probably overkill for most, the basics in this post about compaction and doing human review at the correct level are pretty good pointers. And giving a bit of a framework to think within is also neat
Verifying behavior is great and all if you can actually exhaustively test the behaviors of your system. If you can't, then not knowing what your code is actually doing is going to set you back when things do go belly up.
I love this comment because it makes perfect sense today, it made perfect sense 10 years ago, it would have made perfect sense in 1970. The principles of software engineering are not changed by the introduction of commodified machine intelligence.
i 100% agree - the folks who are best at ai-first engineering, they spend 3 days designing the test harness and then kick off an agent unsupervised for 2+ days and come back to working software.
not exactly valuable as guidance since programming languages are very easy to verify, but the https://ghuntley.com/ralph post is an example of whats possible on the very extreme end of the spectrum
To minimise context bloat and provide more holistic context, as a first step I extract the important elements from a codebase via the AST, which the LLM then uses to determine which files to get in full for a given task.
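A minimal sketch of that AST pass, assuming a plain Python codebase (what exactly you feed the LLM - names, signatures, docstrings - is a tuning knob):

    # Minimal sketch: build a compact map of a Python codebase from its ASTs,
    # which the LLM can use to decide which files it needs to read in full.
    import ast
    from pathlib import Path


    def outline(path: Path) -> list[str]:
        tree = ast.parse(path.read_text(encoding="utf-8"))
        lines = [f"# {path}"]
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                lines.append(f"def {node.name}(...)  # line {node.lineno}")
            elif isinstance(node, ast.ClassDef):
                lines.append(f"class {node.name}:  # line {node.lineno}")
                for item in node.body:
                    if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                        lines.append(f"    def {item.name}(...)")
        return lines


    if __name__ == "__main__":
        for py_file in sorted(Path("src").rglob("*.py")):
            print("\n".join(outline(py_file)))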
I used to do these things manually in Cursor. Then I had to take a few months off programming, and when I came back and updated Cursor I found out that it now automatically does ToDos, as well as keeps track of the context size and compresses it automatically by summarising the history when it reaches some threshold.
With this I find that most of the shenanigans of manual context window managing with putting things in markdown files is kind of unnecessary.
You still need to make it plan things, as well as guide the research it does to make sure it gets enough useful info into the context window, but in general it now seems to me like it does a really good job with preserving the information. This is with Sonnet 4
I’m not an expert in either language, but seeing a 20k LoC PR go up (linked in the article) would be an instant “lgtm, asshole” kind of review.
> I had to learn to let go of reading every line of PR code
Ah. And I’m over here struggling to get my teammates to read lines that aren’t in the PR.
Ah well, if this stuff works out it’ll be commoditized like the author said and I’ll catch up later. Hard to evaluate the article given the authors financial interest in this succeeding and my lack of domain expertise.
Dumping a huge PR across a shared codebase, wherein everyone else also has to deal with the risk of your monumental changes, is pretty rude as well; I would even go so far as to say that it is likely selfishly risky.
Dumping a 20k LOC PR on somebody to review especially if all/a lot of it was generated with AI is disrespectful. The appropriate response is to kick that back and tell them to make it more digestible.
A 20k LOC PR isn’t reviewable in any normal workflow/process.
The only moves are refusing to review it, taking it up the chain of authority, or rubber stamping it with a note to the effect that it’s effectively unreviewable so rubber stamping must be the desired outcome.
I don't understand this attitude. Tests are important parts of the codebase. Poorly written tests are a frequent source of headaches in my experience, either by encoding incorrect assumptions, lying about what they're testing, giving a false sense of security, adding friction to architectural changes/refactors, etc. I would never want to review even 2k lines of test changes in one go.
Preach. Also, don't forget making local testing/CI take longer to run, which costs you both compute and developer context switching.
I've heard people rave about LLMs for writing tests, so I tried having Claude Code generate some tests for a bug I fixed against some autosave functionality - (every 200ms, the auto-saver should initiate a save if the last change was in the previous 200ms). Claude wrote five tests that each waited 200ms (!) adding a needless entire second to the run-time of my test suite.
I went in to fix it by mocking out time, and in the process realized that the feature was doing timestamp comparisons when a simpler, less error-prone approach was to increment a logical clock for each change instead.
The tests I've seen Claude write vary from junior-level to flat-out-bad. Tests are often the first consumer of a new interface, and delegating them to an LLM means you don't experience the ergonomics of the thing you just wrote.
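For what it's worth, the "mock out time" fix can be as simple as injecting the clock; the `AutoSaver` below is a made-up reconstruction of the feature described above, not the actual code:

    # Hypothetical autosaver with an injected clock, so tests never sleep.
    class AutoSaver:
        def __init__(self, save, now, interval=0.2):
            self.save = save          # callback that performs the actual save
            self.now = now            # injected clock (a callable returning seconds)
            self.interval = interval
            self.last_change = None

        def on_change(self):
            self.last_change = self.now()

        def tick(self):
            """Called every `interval` seconds by a timer in production."""
            if self.last_change is not None and self.now() - self.last_change <= self.interval:
                self.save()
                self.last_change = None


    def test_saves_only_when_a_recent_change_exists():
        clock = {"t": 0.0}
        saves = []
        saver = AutoSaver(save=lambda: saves.append(clock["t"]), now=lambda: clock["t"])

        saver.tick()        # nothing changed yet: no save
        assert saves == []

        saver.on_change()
        clock["t"] += 0.2   # advance fake time instead of sleeping
        saver.tick()
        assert saves == [0.2]   # one save, and the whole test runs instantly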
i think the general takeaway for all of this is the model can write the code but you still have to design it. I don't disagree with anything you've said, and I'd say my advice is engage more, iterate more, and work in small steps to get the right patterns and rules laid out. It won't work well on day one if you don't set up the right guidelines and guardrails. That's why it's still software engineering, despite being a different interaction medium.
And if the 10k lines of tests are all garbage, now what? Because tests are the 1 place you absolutely should not delegate to AI outside of setting up the boilerplate/descriptions.
If somebody did this, it means they ignored their team's conventions and offloaded work onto colleagues for their own convenience. Being considered rude by the offender is not a concern of mine when dealing with a report who pulls this kind of antisocial crap.
I'm the owner of some of my work projects/repos. I will absolutely without a 2nd thought close a 20k LoC PR, especially an AI generated one, because the code that ends up in master is ultimately my responsibility. Unless it's something like a repo-wide linter change or whatever, there's literally never a reason to have such a massive PR. Break it down, I don't care if it ends up being 200 disparate PRs, that's actually possible to properly review compared to a single 20k line PR.
Using AI to help with code felt like working with a smart but slightly unreliable teammate. If I wasn’t clear, it just couldn’t follow. But once I learned to explain what I wanted clearly and specifically, it actually saved me time and helped me think more clearly too.
I am working on a project with ~200k LoC, entirely written with AI codegen.
These days I use Codex, with GPT-5-Codex + $200 Pro subscription. I code all day every day and haven't yet seen a single rate limiting issue.
We've come a long way. Just 3-4 months ago, LLMs would start making a huge mess when faced with a large codebase. They would have massive problems with files with 1k+ LoC (I know, files should never grow this big).
Until recently, I had to religiously provide the right context to the model to get good results. Codex does not need it anymore.
Heck, even UI seems to be a solved problem now with shadcn/ui + MCP.
My personal workflow when building bigger new features:
1. Describe problem with lots of details (often recording 20-60 mins of voice, transcribe)
2. Prompt the model to create a PRD
3. CHECK the PRD, improve and enrich it - this can take hours
4. Actually have the AI agent generate the code and lots of tests
5. Use AI code review tools like CodeRabbit, or recently the /review function of Codex, iterate a few times
6. Check and verify manually - often times, there are a few minor bugs still in the implementation, but can be fixed quickly - sometimes I just create a list of what I found and pass it for improving
With this workflow, I am getting extraordinary results.
And I assume there's no actual product that customers are using that we could also demo? Because only 1 out of every 20 or so claims of awesomeness actually has a demoable product to back up those claims. The 1 who does usually has immediate problems. Like an invisible text box rendered over the submit button on their Contact Us page preventing an onClick event for that button.
In case it wasn't obvious, I have gone from rabidly bullish on AI to very bearish over the last 18 months. Because I haven't found one instance where AI is running the show and things aren't falling apart in not-always-obvious ways.
I'm kind of in the same boat although the timeline is more compressed. People claim they're more productive and that AI is capable of building large systems but I've yet to see any actual evidence of this. And the people who make these claims also seem to end up spending a ton of time prompting to the point where I wonder if it would have been faster for them to write the code manually, maybe with copilot's inline completions.
I created these demos using real data, real API connections and real databases, utilizing 100% AI code, at http://betpredictor.io and https://pix2code.com; however, they barely work. At this point, I'm fixing 90% or more of every recommendation the AI gives. With your codebase being this large, you can be guaranteed that the AI will not know what needs to be edited. Still, I haven't written one line of code by hand.
It is true AI-generated UIs tend to be... Weird. In weird ways. Sometimes they are consistent and work as intended, but often times they reveal weird behaviors.
Or at least this was true until recently. GPT-5 is consistently delivering more coherent and better working UIs, provided I use it with shadcn or alternative component libraries.
So while you can generate a lot of code very fast, testing UX and UI is still manual work - at least for me.
I am pretty sure, AI should not run the show. It is a sophisticated tool, but it is not a show runner - not yet.
Nothing much weird about the SwiftUI UIs GPT-5-codex generates for me. And it adapts well to building reusable/extensible components and using my existing components instead of constantly reinventing, because it is good at reading a lot of code before putting in work.
It is also good at refactoring to consolidate existing code for reusability, which makes it easier to extend and change UI in the future. Now I worry less about writing new UI or copy/pasting UI because I know I can do the refactoring easily to consolidate.
Let me summarise your comment in a few words: show me the money. If nobody is buying anything, there is no incremental value creation or augmentation of existing value in the economy that didn't already exist.
What is your opinion on the "right level of detail" we should use when creating technical documents the LLM will use to implement features?
When I started leaning heavily into LLMs I was using really detailed documentation. Not '20 minutes of voice recordings', but my specification documents would easily hit hundreds of lines even for simple features.
The result was decent, but extremely frustrating. Because it would often deliver 80% to 90% but the final 10% to 20% it could never get right.
So, what I naturally started doing was to care less about the details of the implementation and focus on the behavior I want. And this led me to simpler prompts, to the point that I don't feel the need to create a specification document anymore. I just use the plan mode in Claude Code and it is good enough for me.
One way that I started to think about this was that really specific documentation was almost as if I was 'over-fitting' my solution over other technically viable solutions the model could come up with. One example would be if I want to sort an array, I could either ask it to "sort the array" or "merge sort the array". And by forcing a merge sort I may end up with a worse solution. Admittedly sort is a pretty simple and unlikely example, but this could happen with any topic. You may ask the model to use a hash set when a better solution would be to use a bloom filter.
Given all that, do you think investing so much time into your prompts provides a good ROI compared with the alternative of not really min-maxing every single prompt?
I tend to provide detailed PRDs, because even if the first couple of iterations of the coding agent are not perfect, it tends to be easier to get there (as opposed to having a vague prompt and move on from there).
What I do sometimes is an experimental run - especially when I am stuck. I express my high-level vision, and just have the LLM code it to see what happens. I do not do it often, but it has sometimes helped me get out of being mentally stuck with some part of the application.
Funnily, I am facing this problem right now, and your post might just have reminded me, that sometimes a quick experiment can be better than 2 days of overthinking about the problem...
This mirrors my experience with AI so far - I've arrived at mostly using the plan and implement modes in Claude Code with complete but concise instructions about the behavior I want with maybe a few guide rails for the direction I'd like to see the implementation path take. Use cases and examples seem to work well.
I kind of assumed that claude code is doing most of the things described this document under the hood (but I really have no idea).
Now with the MCP server, you can instruct the coding agent to use shadcn. I often say something like "If you need to add new UI elements, make sure to use shadcn and the shadcn component registry to find the best fitting component".
The genius move is that the shadcn components are all based on Tailwind and get COPIED to your project. 95% of the time, the created UI views are just pixel-perfect, spacing is right, everything looks good enough. You can take it from here to personalize it more using the coding agent.
I've had success here by simply telling Codex which components to use. I initially imported all the shadcn components into my project and then I just say things like "Create a card component that includes a scrollview component and in the scrollview add a table with a dropdown component in the third column"...and Codex just knows how to add the shadcn components. This is without internet access turned on by the way.
> 1. Describe problem with lots of details (often recording 20-60 mins of voice, transcribe)
I just ask it to give me instructions for a coding agent and give it a small description of what I want to do; it looks at my code and details what I describe as best as it can, and usually I have enough to let Junie (JetBrains AI) run on.
I can't personally justify $200 a month, I would need to see seriously strong results for that much. I use AI piecemeal because it has always been the best way to use it. I still want to understand the codebase. When things break its mostly on you to figure out what broke.
A small description can be extrapolated to a large feature, but then you have to accept the AI filling in the gaps. Sometimes that is cool, often times it misses the mark. I do not always record that much, but if I have a vague idea that I want to verbalize, I use recording. Then I take the transcript and create the PRD based on it. Then I iterate a few more times on the PRD - which yield much better results.
>I am working on a project with ~200k LoC, entirely written with AI codegen.
I’d love to see the codebase if you can share.
My experience with LLM code generation (I've tried all of the popular models and tools, though I generally favor Claude Code with Opus and Sonnet) leads me to suspect that your ~200k LoC project could be solved in only about 10k LoC. Their solutions are unnecessarily complex (I'm guessing because they don't "know" the problem the way a human does), and that compounds over time.
At this point, I would guess my most common instruction to these tools is to simplify the solution. Even when that's part of the plan.
Don't want to come off as combative but if you code every day with codex you must not be pushing very hard, I can hit the weekly quota in <36 hours. The quota is real and if you're multi-piloting you will 100% hit it before the week is over.
Fair enough. I spend entire days working on the product, but obviously there are lots of times I am not running Codex - when reviewing PRDs, testing, talking to users, even posting on HN is good for the quota ;)
On the Pro tier? Plus/Team is only suitable for evaluating the tool and occasional help
Btw one thing that helps conserve context/tokens is to use GPT 5 Pro to read entire files (it will read more than Codex will, though Codex is good at digging) and generate plans for Codex to execute. Tools like RepoPrompt help with this (though it also looks pretty complicated)
I thought about it, but I don't think it's necessary. Grok-4-fast is actually quite a good model, you can just set up a routing proxy in front of codex and route easy queries to it, and for maybe $50/mo you'll probably never hit your GPT plan quota.
I can recommend one more thing: tell the LLM frequently to "ask me clarifying questions". It's simple, but the effect is quite dramatic, it really cuts down on ambiguity and wrong directions without having to think about every little thing ahead of time.
The "ask my clarifying questions" can be incredibly useful. It often will ask me things I hadn't thought of that were relevant, and it often suggests very interesting features.
As for when/where to do it? You can experiment. I do it after step 1.
This sounds very similar to my workflow. Do you have pre-commits or CI beyond testing? I’ve started thinking about my codebase as an RL environment with the pre-commits as hyperparameters. It’s fascinating seeing what coding patterns emerge as a result.
I think pre-commit is essential. I enforce conventional commits (+ a hook which limits commit length to 50 chars) and for Python, ruff with many options enabled. Perhaps the most important one is to enforce complexity limits. That will catch a lot of basic mistakes. Any sanity checks that you can make deterministic are a good idea. You could even add unit tests to pre-commit, but I think it's fine to have the model run pytest separately.
The models tend to be very good about syntax, but this sort of linting will often catch dead code like unused variables or arguments.
You do need to rule-prompt that the agent may need to run pre-commit multiple times to verify the changes worked, or to re-add to the commit. Also, frustratingly, you also need to be explicit that pre-commit might fail and it should fix the errors (otherwise sometimes it'll run and say "I ran pre-commit!") For commits there are some other guardrails, like blanket denying git add <wildcard>.
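For reference, the conventional-commit and length check can live in a tiny commit-msg hook along these lines (one possible version, not necessarily the parent's exact setup):

    #!/usr/bin/env python3
    # Sketch of a commit-msg hook: conventional-commit prefix + 50-char subject.
    import re
    import sys

    PATTERN = re.compile(
        r"^(feat|fix|chore|docs|refactor|test|perf|build|ci|style)(\([^)]+\))?!?: .+"
    )

    def main() -> int:
        # git passes the path of the commit message file as the first argument
        subject = open(sys.argv[1], encoding="utf-8").readline().rstrip("\n")
        if not PATTERN.match(subject):
            print("commit-msg: subject must follow conventional commits, e.g. 'fix(api): ...'")
            return 1
        if len(subject) > 50:
            print(f"commit-msg: subject is {len(subject)} chars, max is 50")
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(main())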
Claude will sometimes complain via its internal monologue when it fails a ton of linter checks and is forced to write complete docstrings for everything. Sometimes you need to nudge it to not give up, and then it will act excited when the number of errors goes down.
Very solid advice. I need to experiment more with the pre-commit stuff, I am a bit tired of reminding the model to actually run tests / checks. They seem to be as lazy about testing as your average junior dev ;)
Yes, I do have automated linting (a bit of a PITA at this scale).
On the CI side I am using Github Actions - it does the job, but haven't put much work into it yet.
Generally I have observed that using a statically typed language like TypeScript helps catch issues early on. I had much worse results with Ruby.
Which of these steps do you think/wish could be automated further? Most of the latter ones seem like throwing independent AI reviewers at them could almost fully automate the process, maybe with a "notify me" option if there's something they aren't confident about. Could PRD review be made more efficient if it was able to color-code by level of uncertainty? For 1, could you point it to a feed of customer feedback or something and just have the day's draft PRD up and waiting for you when you wake up each morning?
There is definitely way too much plumbing and going back and forth.
But one thing that MUST get better soon is having the AI agent verify its own code. There are a few solutions in place, e.g. using an MCP server to give access to the browser, but these tend to be brittle and slow. And for some reason, the AI agents do not like calling these tools too much, so you kinda have to force them every time.
PRD review can be done, but AI cannot fill the missing gaps the same way a human can. Usually, when I create a new PRD, it is because I have a certain vision in my head. For that reason, the process of reviewing the PRD can be optimized by maybe 20%. Or maybe I struggle to see how tools could make me faster at reading, commenting on and editing the PRD.
Agents __SHOULD NOT__ verify their own code. They know they wrote it, and they act biased. You should have a separate agent with instructions to red team the hell out of a commit, be strict, but not nitpick/bikeshed, and you should actually run multiple review agents with slightly different areas of focus since if you try to run one agent for everything it'll miss lots of stuff. A panel of security, performance, business correctness and architecture/elegance agents (armed with a good covering set of code context + the diff) will harden a PR very quickly.
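A rough sketch of what that panel can look like in practice - `call_model` is a placeholder for whatever LLM client you use, and the lens prompts are just examples:

    # Run the same diff past several narrow reviewers instead of one generalist.
    import subprocess

    LENSES = {
        "security": "Review this diff strictly for security issues (injection, authz, secrets).",
        "performance": "Review this diff strictly for performance regressions and N+1 patterns.",
        "correctness": "Review this diff strictly for business-logic and edge-case correctness.",
        "architecture": "Review this diff strictly for structure, naming, and unnecessary complexity.",
    }

    def call_model(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM client here")

    def review(base: str = "origin/main") -> dict[str, str]:
        diff = subprocess.run(
            ["git", "diff", base, "--", "."],
            capture_output=True, text=True, check=True,
        ).stdout
        return {
            name: call_model(f"{instruction}\n\nBe strict, but do not nitpick style.\n\n{diff}")
            for name, instruction in LENSES.items()
        }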
Codex uses this principle - /review runs in a subthread, does not see previous context, only git diff. This is what I am using. Or I open Cursor to review code written by GPT-5 using Sonnet.
Do you have examples of this working, or any best practices on how to orchestrate it efficiently? It sounds like the right thing to do, but it doesn't seem like the tech is quite to the point where this could work in practice yet, unless I missed it. I imagine multiple agents would churn through too many tokens and have a hard time coming to a consensus.
I've been doing this with Gemini 2.5 for about 6 months now. It works quite well; it doesn't catch big architectural issues 100% of the time, but it's very good at line/module level logic issues and anti-patterns.
Have you considered or tried adding steps to create / review an engineering design doc? Jumping straight from PRD to a huge code change seems scary. Granted, given that it's fast and cheap to throw code away and start over, maybe engineering design is a thing of the past. But still, it seems like it would be useful to have it delineate the high-level decisions and tradeoffs before jumping straight into code; once the code is generated it's harder to think about alternative approaches.
Adding an additional layer slows things down. So the tradeoff must be worth it.
Personally, I would go without a design doc, unless you work on a mission-critical feature humans MUST specify or deeply understand. But this is my gut speaking, I need to give it a try!
Yeah I'd love to hear more about that. Like the way I imagine things working currently is "get requirement", "implement requirement", more or less following existing patterns and not doing too much thinking or changing of the existing structure.
But what I'd love to see is, if it has an engineering design step, could it step back and say "we're starting to see this system evolve to a place where a <CQRS, event-sourcing, server-driven-state-machine, etc> might be a better architectural match, and so here's a proposal to evolve things in that direction as a first step."
Something like Kent Beck's "for each desired change, make the change easy (warning: this may be hard), then make the easy change." If we can get to a point where AI tools can make those kinds of tradeoffs, that's where I think things get slightly dangerous.
OTOH if AI models are writing all the code, and AI models have contexts that far exceed what humans can keep in their head at once, then maybe for these agents everything is an easy change. In which case, well, I guess having human SWEs in the loop would do more harm than good at that point.
I have LLMs write and review design docs. Usually I prompt to describe the doc, the structure, what tradeoffs are especially important, etc. Then an LLM writes the doc. I spot check it. A separate LLM reviews it according to my criteria. Once everything has been covered in first draft form I review it manually, and then the cycle continues a few times. A lot of this can be done in a few minutes. The manual review is the slowest part.
How does it compare to Cursor with Claude? I've been really impressed with how well Cursor works, but always interested in up-leveling if there are better tools, considering how fast this space is moving. Can you comment on how Codex performs vs Cursor?
Claude Code is Claude Code, whether you use it in Cursor or not
Codex and Claude code are neck and neck, but we made the decision to go all in on opus 4, as there are compounding returns in optimizing prompts and building intuition for a specific model
That said I have tested these prompts on codex, amp, opencode, even grok 4 fast via codebuff, and they still work decently well
But they are heavily optimized from our work with opus in particular
Not OP, but I use Codex for back-end, scripting, and SQL, and Claude Code for most front-end. I have found that when one faces a challenge, the other often can punch through and solve the problem. I even have them work together (moving thoughts and markdown plans back and forth) and that works wonders.
My progression: Cursor in '24, Roo code mid '25, Claude Code in Q2 '25, Codex CLI in Q3 `25.
Yes, it is a web project with next.js + Typescript + Tailwind + Postgres (Prisma).
I started with Cursor, since it offers a well-rounded IDE with everything you need. It also used to be the best tool for the job. These days Codex + GPT-5-Codex is king. But I sometimes go back to Cursor, especially when reading / editing the PRDs or if I need the occasional 2nd opinion from Claude.
If it's working for you I have to assume that you are an expert in the domain, know the stack inside and out and have built out non-AI automated testing in your deployment pipeline.
And yes Step 3 is what no one does. And that's not limited to AI. I built a 20+ year career mostly around step 3 (after being biomed UNIX/Network tech support, sysadmin and programmer for 6 years).
Yes, I have over 2 decades of programming experience, 15 years working professionally. With my co-founder we built an entire B2B SaaS, coding everything from scratch, did product, support, marketing, sales...
Now I am building something new but in a very familiar domain. I agree my workflow would not work for your average "vibe coder".
I would not call it vibe coding. But I do not check all changed lines of code either.
In my opinion, and this is really my opinion, in the age of coding with AI, code review is changing as well. If you speed up how much code can be produced, you need to speed up code review accordingly.
I use automated tools most of the time AND I do very thorough manual testing. I am thinking about a more sophisticated testing setup, including integration tests via using a headless browser. It definitely is a field where tooling needs to catch up.
Strong feelings are fair, but the architect analogy cuts the other way. Architects and civil engineers do not eyeball every rebar or hand compute every load. They probably use way more automation than you would think.
I do not claim this is vibe coding, and I do not ship unreviewed changes to safety critical systems (in case this is what people think). I claim that in 2025 reviewing every single changed line is not the only way to achieve quality at the scale that AI codegen enables. The unit of review is shifting from lines to specifications.
No, they don't check it because it's already been checked and quality controlled by other people. One person isn't producing every aspect and component of a bridge. It's made by teams of people who thoroughly go through every little detail and check every aspect to make sure that when it's put into production, it will handle the load.
You cannot trust AI, it's as simple as that. It lies, it hallucinates, and it can produce test code that passes when in reality it does nothing that you expect it to, even if you detail every little thing. That's a fact.
Before it's too late, come to your senses, dude. I don't even think you believe what you say, because if you do, I'd never want to work with you and neither would so many other people. You are making our profession some kind of toy. Thanks for contributing to the shitshow and making me realise that I have to be very careful who I work with in the future.
You were never an engineer. I'm 18 years into my career on the web and games and I was never an engineer. It's blind people leading blind people, and you're somewhere in the middle, based on the 2013 patterns that got you to this point and the 2024 advancements called "Vibe Coding", and you get paid $$ to make it work.
Building a bridge from steel that lasts 100 years and carries real living people in the tens or hundreds of thousands per day without failing under massive weather spikes is engineering.
We've all been waiting for the other shoe to drop. Everyone points out that reviewing code is more difficult than writing it. The natural question is, if AI is generating thousands of lines of code per day, how do you keep up with reviewing it all?
The answer: you don't!
Seems like this reality will become increasingly justified and embraced in the months to come. Really though it feels like a natural progression of the package manager driven "dependency hell" style of development, except now it's your literal business logic that's essentially a dependency that has never been reviewed.
My process is probably more robust than simply reviewing each line of code. But hey, I am not against doing it, if that is your policy. I had worked the old-fashioned way for over 15 years, I know exactly what pitfalls to watch out for.
It is a fairly standardized way of capturing the essence of a new feature. It covers the most important aspects of what the feature is about: the goals, the success criteria, even implementation details where it makes sense.
If there is interest, I can share the outline/template of my PRDs.
Wow, very nice. Thank you. That's very well thought out.
I'm particularly intrigued by the large bold letters: "Success must be verifiable by the AI / LLM that will be writing the code later, using tools like Codex or Cursor."
May I ask, what your testing strategy is like?
I think you've encapsulated a good best practices workflow here in a nice condensed way.
I'd also be interested to know how you handle documentation but don't want to bombard you with too many questions
I added that line because otherwise the LLM would generate goals that are not verifiable in development (e.g. certain pages rendering in under 300ms - that is not something you can test on your local machine).
Documentation is a different topic - I have not yet found how to do it correctly. But I am reading about it and might soon test some ideas to co-generate documentation based on the PRD and the actual code. The challenge being, the code normally evolves and drifts away from the original PRD.
Then I instruct the coding agent to use shadcn / choose the right component from shadcn component registry
The MCP server has a search / discovery tool, and it can also fetch individual components. If you tell the AI agent to use a specific component, it will fetch it (reference doc here: https://ui.shadcn.com/docs/components)
Programming has always had these steps, but traditionally people with different roles would do different parts of it, like gathering requirements, creating product concept, creating development tickets, coding, testing and so on.
How granular are the specs? Is it at the level of "this is the code you must write, and here is how to do it", or are you letting AI work some of that out?
Not surprisingly, building really fast is not the silver bullet you'd think it is. It's all about what to build and how to distribute it. Otherwise bigcos/billionaires would have armies of engineers growing their net worth to epic scales.
My current world view: for monster multiples you need someone who knows how to go 0 to 1, repeatedly. That's almost always only the founder. People after are incremental. If they weren't, they'd just be a founder. Hence why everything is done through acquisitions post-founder. So there's armies of engineers incrementally scaling and maintaining dollars. But not creating that wealth or growing it in a significant % way.
> Heck even Amjad was on a lenny's podcast 9 months ago talking about how PMs use Replit agent to prototype new stuff and then they hand it off to engineers to implement for production.
I got lectured this week that I wasn't working fast enough because the client had already vibe coded (a broken, non-functional prototype) in under an hour.
They saw the first screen assembled by Replit and figured everything they could see would work with some "small tweaks", which is where I was allegedly to come into the picture.
They continued to lecture me about how the app would need Web Workers for maximum client side performance (explanations full of em-dashes so I knew they were pasting in AI slop at me) and it must all be browser based with no servers because "my prototype doesn't need a server"
Meanwhile their "prototype" had a broken Node.js backend running alongside the frontend listening on a TCP port.
When I asked about this backend, they knew nothing about it but assured me their prototype was all browser based with no "servers".
Needless to say I'm never taking on any work from that client again, one of the small joys of being a contractor.
I created an account to say this: RepoPrompt's 'Context Builder' feature helps a ton with scoping context before you touch any code.
It's kind of like if you could chat with Repomix or Gitingest so they only pull the most relevant parts of your codebase into a prompt for planning, etc
I'm a paying RepoPrompt user but not associated in any other way.
I've used it in conjunction with Codex, Claude Code, and any other code gen tool I have tried so far. It saves a lot of tokens and time (and headaches)
Thanks for sharing. I wonder how you keep the stylistic and mental alignment of the codebase - does this happen during code review, or are there specific instructions at the plan/implement stages?
Lots of gold in this article. It's like discovering a basket of cheat codes. This will age well.
Great links, BAML is a crazy rabbithole and just found myself nodding along to frequent /compact. These tips are hard-earned and very generously given. Anyone here can take it or leave it. I have theft on my mind, personally. (ʃƪ¬‿¬)
> Within an hour or so, I had a PR fixing a bug which was approved by the maintainer the next morning
An hour for 14 lines of code. Not sure how this shows any productivity gain from AI. It's clear that it's not the code writing that is the bottleneck in a task like this.
Looking at the "30K lines" features, the majority of the 30K lines are either auto-generated code (not by AI), or documentation. One of them is also a PoC and not merged...
The author said he was not a Rust expert and had no prior familiarity with the codebase. An hour for a 14 line fix that works and is acceptable quality to merge is pretty good given those conditions.
When I read about people dumping 2000 lines of code every few days, I'm extremely skeptical about the quality of this code. All the people I've met who worked at this rate were always going for naive solutions and their code was full of hard-to-see bugs which only reared their ugly heads once in a while and were impossible to debug.
We're currently in a transition phase where we're using agentic coding on systems developed with tools and languages designed for humans. Ironically, this makes things unnecessarily hard, as things that are easy for us aren't necessarily easy to deal with, or that optimal, for agentic coding systems.
People like languages that are expressive and concise. That means they do things like omit types, use type inference, macros, syntactic sugar, allow for ambiguities and all the other stuff that gives us shorter, easier to type code that requires more effort to figure out. A good intuition here might be that the harder the compiler/interpreter has to work to convert it into running/executable code, the harder an LLM will have to work to figure out what that code does.
LLMs don't mind verbosity and spelling things out. Things that are long winded and boring to us are helpful for an LLM. The optimal language for an LLM is going to be different than one that is optimal for a human. And we're not good at actually producing detailed specifications. Programming actually is the job of coming up with detailed specifications. Easy to forget when you are doing that but that's literally what programming is. You write some kind of specification that is then "compiled" into something that actually works as specified.
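A toy Python sketch of that contrast (an illustrative example, not from the original comment): the first version leans on the inference and sugar humans like; the second spells out types, branches, and intermediate names the way an LLM-friendly dialect might.

```python
from decimal import Decimal
from typing import Iterable

# Terse and pleasant for humans: inference, comprehension, implicit behaviour.
def totals(rows):
    return [r["price"] * r["qty"] for r in rows if r.get("qty")]

# Spelled out for a machine reader: explicit types, explicit branches,
# explicit intermediate names. Long-winded for us, unambiguous for an LLM.
def totals_explicit(rows: Iterable[dict[str, Decimal]]) -> list[Decimal]:
    results: list[Decimal] = []
    for row in rows:
        quantity = row.get("qty")
        if quantity is None or quantity == 0:
            continue
        price = row["price"]
        results.append(price * quantity)
    return results
```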
The solution to agentic coding isn't writing specifications for our specifications. That just moves the problem.
We've had a few decades of practice where we just happen to stuff code into files and use very primitive tools to manipulate those files. Agentic coding uses a few party tricks involving command line tools to manipulate those files and read them one by one into the precious context window. We're probably shoveling too much data around. But since that's the way we store code, there are no better tools to do that.
From having used things like Codex, 99% of what it does is interrogating what's there via tediously slow prodding and poking around the code base using simple command line commands and build tool invocations. It's like watching paint dry. I usually just go off doing something else while it boils the oceans and does god knows what before finally doing the (usually) relatively straightforward thing that I asked it to do. It's easy to see that this doesn't scale that well.
The whole point of a large code base is that it probably won't all fit in the context window. We can try to brute force the problem; or we can try to be more selective. The name of the game here is being able to quickly select just the right stuff to put in there and discard all the rest.
We can either do that manually (tedious and a lot of work, sort of as the article proposes), or make it easier for the LLM to use tools that do that. Possibly a bunch of poorly structured files in some nested directory hierarchy isn't the optimal thing here. Most non AI based automated refactorings require something that more closely resembles the internal data structures of what a compiler would use (e.g. symbol tables, definitions, etc.).
A lot of what an agentic coding system has to do is reconstruct something similar enough to that just so it can build a context in which it can do constructive things. The less ambiguous and more structured that is, the easier the job. The easier we make it to do that, the more it can focus on solving interesting problems rather than getting ready to do that.
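One hedged sketch of what "more structured than a pile of files" could look like, assuming Python and its standard `ast` module (the function names and the crude relevance heuristic are made up for illustration): build a symbol table once, then hand the agent only the definitions relevant to a query instead of whole files.

```python
import ast
from pathlib import Path

def build_symbol_table(root: str) -> dict[str, str]:
    """Map 'module.symbol' -> source text for every function/class under root."""
    table: dict[str, str] = {}
    for path in Path(root).rglob("*.py"):
        source = path.read_text(encoding="utf-8")
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                table[f"{path.stem}.{node.name}"] = ast.get_source_segment(source, node) or ""
    return table

def select_context(table: dict[str, str], query: str, limit: int = 5) -> str:
    """Crude relevance filter: return the few definitions that mention the query."""
    hits = [src for name, src in table.items()
            if query.lower() in name.lower() or query.lower() in src.lower()]
    return "\n\n".join(hits[:limit])

# Feed only this slice into the model's context instead of shoveling whole files:
# context = select_context(build_symbol_table("src/"), "invoice total")
```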
I don't have all the answers here but if agentic coding is going to be most of the coding, it makes sense to optimize the tools, languages, etc. for that rather than for us.
We're taking a profession that attracts people who enjoy a particular type of mental stimulation, and transforming it into something that most members of the profession just fundamentally do not enjoy.
If you're a business leader wondering why AI hasn't super charged your company's productivity, it's at least partly because you're asking people to change the way they work so drastically, that they no longer derive intrinsic motivation from it.
Because now your manager will measure you on LOC against other engineers again, and it's only software engineers worrying about complexity, maintainability and, in short, the health of the very creature that pays your salary.
This is the new world we live in. Anyone who actually likes coding should seriously look for other venues because this industry is for other type of people now.
I use AI in my job. It went from tolerable (not doing anything fancy) to unbearable.
I'm actually looking to become a council employee with a boring job and code my own stuff, because if this is what I have to do moving forward, I'd rather go back to non-coding jobs.
I strongly disagree with this - if anything, using AI to write real production code in a real, complex codebase is MORE technical than just writing software.
Staff/Principal engineers already spend a lot more time designing systems than writing code. They care a lot about complexity, maintainability, and good architecture.
The best people I know who have been using these techniques are former CTOs, former core Kubernetes contributors, have built platforms for CRDTs at scale, and many other HIGHLY technical pursuits.
This is actually where the "myth" of the 10x engineer comes from - there do exist such people and they always could do more than the rest of us ... because they knew what to build. It's not 10K lines of code, it's _the right_ 10K lines of code. Whether using LLMs or LLVM to produce the bytes, the bytes produced are not the "τέχνη".
That said, I don't think it takes MORE τέχνη to use the machine, merely a distinct ἐμπειρία. That said, both ἐμπειρία and τέχνη aren't σοφία.
Was the axe or the chainsaw designed in such a way that guarantees it will definitely miss the log and hit your hand a fair amount of the times you use it? If it were, would you still use it? Yes, these hand tools are dangerous, but they were not designed so that they would probably cut off your hand even 1% of the time. "Accidents happen" and "AI slop" are not even remotely the same.
So then with "AI" we're taking a tool that is known to "hallucinate", and not infrequently. So let's put this thing in charge of whatever-the-fuck we can?
I have no doubt "AI" will someday be embedded inside a "smart chainsaw", because we as humans are far more stupid than we think we are.
Even if we had perfectly human-level AI it'd still need management, just like human workers do, and turns out effective management is actually nontrivial.
OpenAI Codex has an `update_plan` function[0]. I'm wondering if switching the implementation to this would improve the coding agent's capabilities or is the default for simplicity better.
The fun part is that specs already are non-deterministic.
If you spend the time to write out requirements in English in a way that cannot be misinterpreted in any way, you end up with a programming language.
LLMs use the same type of "abstract thinking" process as humans. Which is why they can struggle with 6-digit multiplication (unlike computer code, very much like humans), but not with parsing out metaphors or describing what love is (unlike computer code, very much like humans). The capability profile of an LLM is amusingly humanlike.
Setting the bar for "AI" at "singularity" is a bit like setting requirements for "fusion" at "creating a star more powerful than the Sun". Very good for dismissing all existing fusion research, but not any good for actually understanding fusion.
If we had two humans, one with IQ 80 and another with IQ 120, we wouldn't say that one of them isn't "thinking". It's just that one of them is much worse at "thinking" than the other. Which is where a lot of LLMs are currently at. They are, for all intents and purposes, thinking. Are they any good at it though? Depends on what you want from them. Sometimes they're good enough, and sometimes they aren't.
> LLMs use the same type of "abstract thinking" process as humans
It's surprising you say that, considering we don't actually understand the mechanisms behind how humans think.
We do know that human brains are so good at patterns, they'll even see patterns and such that aren't actually there.
LLMs are a pile of statistics that can mimic human speech patterns if you don't tax them too hard. Anyone who thinks otherwise is just Clever Hans-ing themselves.
We understand the outcomes well enough. LLMs converge onto a similar process by being trained on human-made text. Is LLM reasoning a 1:1 replica of what the human brain does? No, but it does something very similar in function.
I see no reason to think that humans are anything more than "a pile of statistics that can mimic human speech patterns if you don't tax them too hard". Humans can get offended when you point it out though. It's too dismissive of their unique human gift of intelligence that a chatbot clearly doesn't have.
> We understand the outcomes well enough
We do not, in fact, "understand the outcomes well enough" lol.
I don't really care if you want to have an AI waifu or whatever. I'm pointing out that you're vastly underestimating the complexity behind human brains and cognition.
And that complex human brain of yours is attributing behaviors to a statistical model that the model does not, in fact, possess.
Anthropocentric cope.
"We don't fully understand how a bird works, and thus: "wind tunnel" is useless, Wright brothers are utter fools, what their crude mechanical contraptions are doing isn't actually flight, and heavier than air flight is obviously unattainable."
I see no reason whatsoever to believe that what your wet meat brain is doing now is any different from what an LLM does. And the similarity in outcomes is rather evident. We found a pathway to intelligence that doesn't involve copying a human brain 1:1 - who would have thought?
I think "LLMs can produce outcomes akin to those produced by human intelligence (in many but not all cases)" and "LLMs are intelligent" are both fairly defensible claims.
> I see no reason whatsoever to believe that what your wet meat brain is doing now is any different from what an LLM does.
I don't think this follows though. Birds and planes can both fly, but a bird and a plane are clearly not doing the same thing to achieve flight. Interestingly, both birds and planes excel at different aspects of flight. It seems at least plausible (imo likely) that there are meaningful differences in how intelligence is implemented in LLMs and humans, and that that might manifest as some aspects of intelligence being accessible to LLMs but not humans and vice versa.
> It seems at least plausible (imo likely) that there are meaningful differences in how intelligence is implemented in LLMs and humans
Intelligence isn’t "implemented" in an LLM at all. The model doesn’t carry a reasoning engine or a mental model of the world. It generates tokens by mathematically matching patterns: each new token is chosen to best fit the statistical patterns it learned from its training data and the immediate context you give it. In effect, it’s producing a compressed, context-aware summary of the most relevant pieces of its training data, one token at a time.
The training data is where the intelligence happened, and that's because it was generated by human brains.
There doesn't seem to be much consensus on defining what intelligence is. For the definitions of at least some reasonable people of sound mind, I think it is defensible to call them intelligent, even if I don't necessarily agree. I sometimes call them "intelligent" because many of the things they do seem to me like they should require intelligence.
That said, to whatever extent they're intelligent or not, by almost any definition of intelligence, I don't think they're achieving it through the same mechanism that humans do. That is my main argument. I think confident arguments that "LLMs think just like humans" are very bad, given that we clearly don't understand how humans achieve intelligence, and the vastly different substrates and constraints that humans and LLMs are working with.
I guess my question is: how is the ability to represent the statistical distribution of outcomes of almost any combination of scenarios, represented as textual data, not a form of world model?
I think you're looking at it too abstractly. An LLM isn't representing anything, it has a bag of numbers that some other algorithm produced for it. When you give it some numbers, it takes them and does matrix operations with them in order to randomly select a token from a softmax distribution, one at a time, until the EOS token is generated.
If they don't have any training data that covers a particular concept, they can't map it onto a world model and make predictions about that concept based on an understanding of the world and how it works. [This video](https://www.youtube.com/watch?v=160F8F8mXlo) illustrates it pretty well. These things may or may not end up being fixed in the models, but that's only because they've been further trained with the specific examples. Brains have world models. Cats see a cup of water, and they know exactly what will happen when you tip it over (and you can bet they're gonna do it).
That video is a poor and mis-understood analysis of an old version of ChatGPT.
Analyzing an image generation failure modes from the dall-e family of models isn't really helpful in understanding if the invoking LLM has a robust world model or not.
The point of me sharing the video was to use the full glass of wine as an example for how generative AI models doing inference lack a true world model. The example was just as relevant now as it was then, and it applies to inference being done by LMs and SD models in the same way. Nothing has fundamentally changed in how these models work. Getting better at edge cases doesn't give them a world model.
That's the point though. Look at any end-to-end image model. Currently I think nano banana (Gemini 2.5 Flash) is probably the best in prod. (Looks like ChatGPT has regressed the image pipeline right now with GPT-5, but not sure)
SD models have a much higher propensity to fixate on proximal in distribution solutions because of the way they de-noise.
For example, you can ask nano banana for a "Completely full wine glass in zero g", which I'm pretty sure is way more out of distribution, and the model does a reasonable job of approximating what that might look like.
That's a fairly bad example. They don't have any trouble taking unrelated things and sticking them together. A world model isn't required for you to take two unrelated things and stick them together. If I ask it to put a frog on the moon, it can know what frogs look like and what the moon looks like, and put the frog on the moon.
But what it won't be able to do, which does require a world model, is put a frog on the moon, and be able to imagine what that frog's body would look like on the moon in the vacuum of space as it dies a horrible death.
Your example is a good one. The frog won't work because ethically the model won't want to show a dead frog very easily, BUT if you ask nano-banana for:
"Create an image of what a watermelon would look like after being teleported to the surface of the moon for 30 seconds."
You'll see a burst frozen melon usually.
> "We don't fully understand how a bird works, and thus: "wind tunnel" is useless, Wright brothers are utter fools, what their crude mechanical contraptions are doing isn't actually flight, and heavier than air flight is obviously unattainable."
Completely false equivalency. We did in fact back then completely understand "how a bird works", how the physics of flight work. The problem getting man-made flying vehicles off the ground was mostly about not having good enough materials to build one (plus some economics-related issues).
Whereas in case of AI, we are very far from even slightly understanding how our brains work, how the actual thinking happens.
One of the Wright brothers' achievements was to realize that the published tables of flight physics were wrong, and to carefully redo them with their own wind tunnel until they had a correct model from which to design a flying vehicle: https://humansofdata.atlan.com/2019/07/historical-humans-of-...
We have a good definition of flight, we don't have a good definition of intelligence.
"Anthropocentric cope >:(" is one of the funniest things I've read this week, so genuinely thank you for that.
"LLMs think like people do" is the equivalent of flat earth theory or UFO bros.
Flerfers run on ignorance, misunderstanding and oppositional defiant disorder. You can easily prove the earth is round in quite a lot of ways (the Greeks did it) but the flerfers either don't know them or refuse to apply them.
There are quite a lot of reasons to believe brains work differently than LLMs (and ways to prove it) you just don't know them or refuse to believe them.
It's neat tech, and I use them. They're just wayyyyyyyy overhyped and we don't need to anthropomorphize them lol
This is wrong on so many levels. I feel like this is what I would have said if I never took a neuroscience class, or actually used an LLM for any real work beyond just poking around ChatGPT from time to time between TED talks.
There is no actual object-level argument in your reply, making it pretty useless. I’m left trying to infer what you might be talking about, and frankly it’s not obvious to me.
For example, what relevance is neuroscience here? Artificial neural nets and real brains are entirely different substrates. The “neural net” part is a misnomer. We shouldn’t expect them to work the same way.
What's relevant is the psychology literature. Do artificial minds behave like real minds? In many ways they do: LLMs exhibit the same sorts of fallacies and biases as human minds. Not exactly 1:1, but surprisingly close.
I didn't say brains and ANNs are the same, in fact I am making quite the opposite argument here.
LLMs exhibit these biases and fallacies because they regurgitate the biases and fallacies that were written by the humans that produced their training data.
Living in Silicon Valley, there are MANY self driving cars driving around right now. At the stop light the other day, I was between 3 of them without any humans in them.
It is so weird when people pull self driving cars out as some kind of counter example. Just because something doesn't happen on the most optimistic time scale, doesn't mean it isn't happening. They just happen slowly and then all at once.
15 years ago they said truck drivers would be obsolete in 1-2 years. They are still not obsolete, and they aren't on track to be any time soon, either.
So… COBOL?
Not really, code even in high level languages is always lower level than English just for computer nonsense reasons. Example: "read a CSV file and add a column containing the multiple of the price and quantity columns".
That's about 20 words. Show me the programming language that can express that entire feature in 20 words. Even very English-like languages like Python or Kotlin might just about do it, if you're working in something else like C++ then no.
In practice, this spec will expand to changes to your dependency lists (and therefore you must know what library is used for CSV parsing in your language, the AI knows this stuff better than you), then there's some file handling, error handling if the file doesn't exist, maybe some UI like flags or other configuration, working out what the column names are, writing the loop, saving it back out, writing unit tests. Any reasonable programmer will produce a very similar PR given this spec but the diff will be much larger than the spec.
> Show me the programming language that can express that entire feature in 20 words.
In python:
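(The snippet itself is reconstructed here as a sketch; only the `read_csv` line is quoted verbatim in the reply below, so the output column name and the final write step are assumptions.)

```python
import pandas

mycsv = pandas.read_csv("/path/to/input.csv")         # path as quoted in the reply below
mycsv["total"] = mycsv["price"] * mycsv["quantity"]   # input column names from the prompt; "total" is assumed
mycsv.to_csv("/path/to/output.csv", index=False)      # writing the result back out is assumed
```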
Not only is this shorter, but it contains all of the critical information that you left out of your English prompt: where is the csv? what are the input columns named? what are the output columns named? what do you want to do with the output? I also find it easier to read than your English prompt.
> `mycsv = pandas.read_csv("/path/to/input.csv")`
You have to count the words in the functions you call to get the correct length of the implementation, which in this case is far far more than 20 words. read_csv has more than 20 arguments, you can't even write the function definition in under 20 words.
Otherwise, I can run every program by importing one function (or an object with a single method, or what have you) and just running that function. That is obviously a stupid way to count.
I really can't tell if this is meant as a joke.
Anyway, I just wrote what I, personally, would type in a normal work day to accomplish this coding task.
It isn't a joke, you need the Kolmogorov complexity of the code that implements the feature, which has nothing to do with the fact that you're using someone else's solution. You may not have to think about all the code needed to parse a CSV, but someone did and that's a cost of the feature, whether you want to think about it or not.
Again, if someone else writes a 100,000 line function for you, and they wrap it in a "do_the_thing()" method, you calling it is still calling a 100,000 line function, the computer still has to run those lines and if something goes wrong, SOMEONE has to go digging in it. Ignoring the costs you don't pay is ridiculous.
We are comparing between a) asking an LLM to write code to parse a csv and b) writing code to parse a csv.
In both cases, they'll use a csv library, and a bajillion items of lower-level code. Application code is always standing on the shoulders of giants. Nobody is going to manually write assembly or machine code to parse a csv.
The original contention, which I was refuting, is that it's quicker and easier to use an LLM to write the python than it is to just write the python.
Kolmogorov complexity seems pretty irrelevant to this question.
You actually have to count the number of bytes in the generated machine code to get the real count
Ok but how much physical space do those bytes take up? Need to measure them.
>"read a CSV file and add a column containing the multiple of the price and quantity columns"
This is underspecified if you want to reliably and repeatably produce similar code.
The biggest difference is that some developers will read the whole CSV into memory before doing the computations. In practice, the difference between those implementations is huge.
Another big difference is how you represent the price field. If you parse them as floats and the quantity is big enough, you'll end up with errors. Even if quantity is small, you'll have to deal with rounding in your new column.
You didn't even specify the name of the new column, so the name is going to be different every time you run the LLM.
What happens if you run this on a file the program has already been run on?
And these are just a few of the reasonable ways of fitting that spec but producing wildly different programs. Making a spec that has a good chance of producing a reasonably similar program each time looks more like:
“Read input.csv (UTF-8, comma-delimited, header row). Read it line by line, do not load the entire file into memory. Parse the price and quantity columns as numbers, stripping currency symbols and thousands separators; interpret decimals using a dot (.). Treat blanks as null and leave the result null for such rows. Compute per-row line_total = round(Decimal(price) * Decimal(quantity), 2). Append line_total as the last column (name the column "Total") without reordering existing columns, and write to output.csv, preserving quoting and delimiter. Do not overwrite existing columns. Do not evaluate or emit spreadsheet formulas.”
And even then you couldn't just check this in and expect the same code to be generated each time; you'd need a large test suite just to constrain the LLM. And even then the LLM would still occasionally find ways to generate code that passes the tests but does things you don't want it to.
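To gauge how much code even that pinned-down spec implies, here is a rough sketch of one conforming implementation, assuming Python's `csv` and `decimal` modules (the file and column names come from the spec above; the number-cleaning rules are one guess at a reasonable reading):

```python
import csv
from decimal import Decimal, ROUND_HALF_UP

def parse_number(raw: str):
    """Strip currency symbols and thousands separators; return Decimal, or None for blanks."""
    raw = raw.strip()
    if not raw:
        return None
    return Decimal("".join(ch for ch in raw if ch.isdigit() or ch in ".-"))

with open("input.csv", newline="", encoding="utf-8") as src, \
     open("output.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)                       # streams row by row, never the whole file
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["Total"])
    writer.writeheader()
    for row in reader:
        price = parse_number(row["price"])
        quantity = parse_number(row["quantity"])
        if price is None or quantity is None:
            row["Total"] = ""                          # blanks stay null
        else:
            row["Total"] = (price * quantity).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
        writer.writerow(row)
```

Even this sketch ignores parts of the spec (quoting style, a pre-existing "Total" column), which is rather the point: the diff is always bigger than the spec.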
Full code with prebuilt libraries/packages/components will be the winning setup.
Iverson languages could do that quite succinctly.
Specs are ambiguous but not necessarily non-deterministic.
The same entity interpreting the spec in exactly the same way will resolve the ambiguities the same way each time.
Human and current AI interpretation of specs is a non-deterministic process. But if we wanted to build a deterministic AI, we could.
> But, if we wanted to build a deterministic AI we could.
Is this bold proposal backed by any theory?
Given that they all use pseudo-random (and not actually random) numbers, they are "deterministic" in the sense that given a fixed seed, they will produce a fixed result...
But perhaps that's not what was meant by deterministic. Something like an understandable process producing an answer rather than a pile of linear algebra?
I was thinking the exact same thing: if you don’t change the weights, use identical “temperature” etc, the same prompt will yield the same output. Under the hood it’s still deterministic code running on a deterministic machine
This is incorrect. Temperature would need to be zero to get same result.
You’re right - TIL
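A toy sketch of the two knobs being discussed, in plain NumPy rather than any real serving stack (names and numbers are made up): in this toy, a fixed seed makes even a nonzero-temperature draw repeatable, and at temperature zero sampling collapses to argmax so the seed stops mattering. Real inference stacks layer other sources of nondeterminism (batching, floating-point ordering) on top.

```python
import numpy as np

def sample_token(logits, temperature, seed):
    rng = np.random.default_rng(seed)          # fixed seed -> repeatable draw
    if temperature == 0:
        return int(np.argmax(logits))          # greedy decoding: always the same token
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.0, 0.5])
print(sample_token(logits, temperature=0.8, seed=42))  # same output on every run
print(sample_token(logits, temperature=0.0, seed=7))   # argmax; the seed is irrelevant
```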
You can just change your definition of "AI". Back in the 60s the pinnacle of AI was things like automatic symbolic integration and these would certainly be completely deterministic. Nowadays people associate "AI" with stuff like LLMs and diffusion etc. that have randomness included in to make them seem "organic", but it doesn't have to be that way.
I actually think a large part of people's amazement with the current breed of AI is the random aspect. It's long been known that random numbers are cool (see Knuth volume 2, in particular where he says randomness make computer-generated graphics and music more "appealing"). Unfortunately being amazed by graphics and music (and now text) output is one thing, making logical decisions with real consequences is quite another.
In 2025, 99% of people are talking about LLMs or stable diffusion.
So then your question boils down to "how could X, which I've defined as Y, be Z?"
The cutting edge of AI research is LLMs. Those aren't deterministic.
You can build an "AI" with whatever you want, but context matters and we live in 2025, not 1985.
"The prompt could be perfect, but there's no way to guarantee that the LLM will turn it into a reasonable implementation."
I think it is worse than that. The prompt, written in natural language, is by its very nature vague and incomplete, which is great if you are aiming for creative artistry. I am also really happy that we are able to search for dates using phrases like "get me something close to a weekend, but not on Tuesdays" on a booking website instead of picking dates from a dropdown box.
However, if natural language was the right tool for software requirements, software engineering would have been a solved problem long ago. We got rightfully excited with LLMs, but now we are trying to solve every problem with it. IMO, for requirements specification, the situation is similar to earlier efforts using formal systems and full verification, but at the exact opposite end. Similar to formal software verification, I expect this phase to end up as a partially failed experiment that will teach us new ways to think about software development. It will create real value in some domains and it will be totally abandoned in others. Interesting times...
> With compilers, I don't need to crack open a hex editor on every build to check the assembly.
The tooling is better than just cracking open the assembly but in some areas people do effectively do this, usually to check for vectorization of hot loops, since various things can mean a compiler fails to do it. I used to use Intel VTune to do this in the HPC scientific world.
“This doesn't make sense as long as LLMs are non-deterministic.”
I think this is a logical error. Non-determinism is orthogonal to probability of being correct. LLMs can remain non-deterministic while being made more and more reliable. I think “guarantee” is not a meaningful standard because a) I don’t think there can be such a thing as a perfect prompt, and b) humans do not meet that standard today.
We also have to pretend that anyone has ever been any good at writing descriptive, detailed, clear and precise specs or documentation. That might be a skillset that appears in the workforce, but absolutely not in 2 years. A technical writer that deeply understands software engineering so they can prompt correctly but is happy not actually looking at code and just goes along with whatever the agent generates? I don't buy it.
This seems like a typical engineer forgets people aren't machines line of thinking.
I agree this whole spec based approach is misguided. Code is the spec.
This. Even with junior devs, implementation is always more or less deterministic (based on one's abilities/skills/aptitude). With AI models, you get totally different implementations even when specifically given clear directions via prompt.
Neither are humans, so this argument doesn't really stand.
> Neither are humans, so this argument doesn't really stand.
Even when we give a spec to a human and tell them to implement it, we scrutinize and test the code they produce. We don't just hand over a spec and blindly accept the result. And that's despite the fact that humans have a lot more common sense, and the ability to ask questions when a requirement is ambiguous.
> This doesn't make sense as long as LLMs are non-deterministic.
I think we will find ways around this. Because humans are also non-deterministic. So what do we do? We review our code, test it, etc. LLMs could do a lot more of that. Eg, they could maintain and run extensive testing, among other ways to validate that behavior matches the spec.
If you're reviewing the code, then you're no longer "opening python files with the same frequency that you open up a hex editor to read assembly".
Not only that but they’re lossy. A hex representation is strictly more information as long as comments are included or generated.
sounds like a good nudge to make tests better
If the tests are written by the AI, who watches the watchers? :-)
But the people writing that already knew that. So why are they writing this kind of stuff? What the fuck is even going on?
> there's no way to guarantee that the LLM will turn it into a reasonable implementation.
There's also no way to guarantee that you're not going to get hit by a meteor strike tomorrow. It doesn't have to be provably deterministic at a computer science PhD level for people without PhDs to say eh, it's fine. Okay, it's not deterministic. What does that mean in practice? Given the same spec.md file, at the layer of abstraction where we're no longer writing code by hand, who cares, because of a lack of determinism, if the variable for the filename object is called filename or fname or file or name, as long as the code is doing something reasonable? If it works, if it passes tests, if we presume that the stochastic parrot is going to parrot out its training data sufficiently closely each time, why is it important?
As far as compilers being deterministic, there's a fascinating detail we ran into with Ksplice. They're not. They're only deterministic enough that we trust them to be fine. There was this bug we kept tripping, back in roughly 2006, where GCC would swap the registers used for a variable, resulting in the Ksplice patch being larger than it had to be, to include handling the register swap as well. The bug has since been fixed, exposing the details of why it was choosing different registers, but unfortunately I don't remember enough details about it. So don't believe me if you don't want to, but the point is, we trust the C compiler that, given a function that takes in variables a, b, c, d, it will map a, b, c, and d to r0, r1, r2, or r3. We don't actually care what order that mapping goes in, so long as it works.
So the leap, that some have made, and others have not, is that LLMs aren't going to randomly flip out and delete all your data. Which is funny, because that's actually happened on Replit. Despite that, despite the fact that LLMs still hallucinate total bullshit and go off the rails, some people trust LLMs enough to convert a spec to working code. Personally, I think we're not there yet and won't be while GPU time isn't free. (Arguably it is already because anybody can just start typing into chat.com, but that's propped up by VC funding. That isn't infinite, so we'll have to see where we're at in a couple of years.)
That addresses the determinism part. The other part that was raised is debuggable. Again, I don't think we're at a place where we can get rid of generated code any time soon, and as long as code is being generated, then we can debug it using traditional techniques. As far as debugging LLMs themselves, it's not zero. They're not mainstream yet, but it's an active area of research. We can abliterate models and fine tune them (or whatever) to answer "how do you make cocaine", counter to their training. So they're not total black boxes.
Thus, even if traditional software development dies off, the new field is LLM creation and editing. As with new technologies, porn picks it up first: Llama and other downloadable models (they're not open source https://www.downloadableisnotopensource.org/ ) have been fine tuned or whatever to generate adult content, despite being trained not to. So that's new jobs being created in a new field.
What does "it works" mean to you? For me, that'd be deterministic behavior, and your description about brute forcing LLMs to the desired result through a feedback loop with tests is just that. I mean, sure, if something gives the same result 100% of the time, or 90% of the time, or fuck it, even 80-50% of the time, that's all deterministic in the end, isn't it?
The interesting thing is, for something to be deterministic that thing doesn't need to be defined first. I'd guess we can get an understanding of day/night-cycles without understanding anything about the solar system. In that same vein your Ksplice GCC bug doesn't sound nondeterministic. What did you choose to do in the case of the observed Ksplice behavior? Did you debug and help with the patch, or did you just pick another compiler? It seems that somebody did the investigation to bring GCC closer to the "same result 100% of the time", and I truly have to thank that person.
But here we are and LLMs and the "90% of the time"-approach are praised as the next abstraction in programming, and I just don't get it. The feedback loop is hailed as the new runtime, whereas it should be build time only. LLMs take advantage of the solid foundations we built and provide an NLP-interface on top - to produce code, and do that fast. That's not abstraction in the sense of programming, like Assembly/C++/Blender, but rather abstraction in the sense of distance, like PC/Network/Cloud. We use these "abstractions in distance" to widen reach, design impact and shift responsibilities.
Having been writing a lot of AWS CDK/IAC code lately, I'm looking at this as the "spec" being the infrastructure code and the implementation being the deployed services based on the infrastructure code.
It would be an absolute clown show if AWS could take the same infrastructure code and perform the deployment of the services somehow differently each time... so non-deterministically. There's already all kinds of external variables other than the infra code which can affect the deployment, such as existing deployed services which sometimes need to be (manually) destroyed for the new deployment to succeed.
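For what it's worth, a minimal sketch of that framing in CDK v2 Python (the stack and bucket names are placeholders): the infrastructure "spec" is ordinary code, and `cdk synth` turns the same source into the same CloudFormation template, which is the property a prompt can't give you.

```python
from aws_cdk import App, Stack, RemovalPolicy
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DataStack(Stack):
    """Placeholder stack: one versioned bucket, declared rather than prompted for."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        s3.Bucket(
            self,
            "DataBucket",                        # logical ID; placeholder name
            versioned=True,
            removal_policy=RemovalPolicy.RETAIN,
        )

app = App()
DataStack(app, "DataStack")
app.synth()   # same source -> same synthesized template, deploy after deploy
```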
The fundamental frustration most engineers have with AI coding is that they are used to the act of _writing_ code being expensive, and the accumulation of _understanding_ happening for free during the former. AI makes the code free, but the understanding part is just as expensive as it always was (although, maybe the 'research' technique can help here).
But let's assume you're much better than average at understanding code by reviewing it -- you have another frustrating experience to get through with AI. Pre-AI, let's say 4 days of the week are spent writing new code, while 1 day is spent fixing unforeseen issues (perhaps incorrect assumptions) that came up after production integration or showing things to real users. Post-AI, someone might be able to write those 4 days' worth of code in 1 day, but making decisions about unexpected issues after integration doesn't get compressed -- that still takes 1 day.
So post-AI, your time switches almost entirely from the fun, creative act of writing code to the more frustrating experience of figuring out what's wrong with a lot of code that is almost correct. But you're way ahead -- you've tested your assumptions much faster, but unfortunately that means nearly all of your time will now be spent in a state of feeling dumb and trying to figure out why your assumptions are wrong. If your assumptions were right, you'd just move forward without noticing.
I've used this pattern on two separate codebases. One was ~500k LOC apache airflow monolith repo (I am a data engineer). The other was a greenfield flutter side project (I don't know dart, flutter, or really much of anything regarding mobile development).
All I know is that it works. On the greenfield project the code is simple enough to mostly just run `/create_plan` and skip research altogether. You still get the benefit of the agents and everything.
The key is really truly reviewing the documents that the AI spits out. Ask yourself if it covered the edge cases that you're worried about or if it truly picked the right tech for the job. For instance, did it break out of your sqlite pattern and suggest using postgres or something like that. These are very simple checks that you can spot in an instant. Usually chatting with the agent after the plan is created is enough to REPL-edit the plan directly with claude code while it's got it all in context.
At my day job I've got to use github copilot, so I had to tweak the prompts a bit, but the intentional compaction between steps still happens, just not quite as efficiently because copilot doesn't support sub-agents in the same way as claude code. However, I am still able to keep productivity up.
-------
A personal aside.
Immediately before AI assisted coding really took off, I started to feel really depressed that my job was turning into a really boring thing for me. Everything just felt like such a chore. The death by a million paper cuts is real in a large codebase with the interplay and idiosyncrasies of multiple repos, teams, personalities, etc. The main benefit of AI assisted coding for me personally seems to be smoothing over those paper cuts.
I derive pleasure from building things that work. Every little thing that held up that ultimate goal was sucking the pleasure out of the activity that I spent most of my day trying to do. I am much happier now having impressed myself with what I can build if I stick to it.
I appreciate the share. Yes, as I said, it was pretty dang uncomfortable to transition to this new way of working, but now that it’s settled, we’re never going back.
I built a package which I use for large codebase work[0].
It starts with /feature, and takes a description. Then it analyzes the codebase and asks questions.
Once I’ve answered the questions, it writes a plan in markdown. There will be 8-10 markdown files with descriptions of what it wants to do and full code samples.
Then it does a “code critic” step where it looks for errors. Importantly, this code critic is wrong about 60% of the time. I review its critique and erase a bunch of dumb issues it’s invented.
By that point, I have a concise folder of changes along with my original description, and it’s been checked over. Then all I do is say “go” to Claude Code and it’s off to the races doing each specific task.
This helps it keep from going off the rails, and I’m usually confident that the changes it made were the changes I wanted.
I use this workflow a few times per day for all the bigger tasks and then use regular Claude code when I can be pretty specific about what I want done. It’s proven to be a pretty efficient workflow.
[0] GitHub.com/iambateman/speedrun
I will never understand why anyone wants to go through all this. I don't believe for a second this is more productive than regular coding with a little help from the LLM.
I got access to Kiro from Amazon this week and they’re doing something similar. First a requirements document is written based on your prompt, then a design document and finally a task list.
At first I thought that was pretty compelling, since it includes more edge cases and examples that you otherwise miss.
In the end all that planning still results in a lot of pretty mediocre code that I ended up throwing away most of the time.
Maybe there is a learning curve and I need to tweak the requirements more tho.
For me personally, the most successful approach has been a fast iteration loop with small and focused problems. Being able to generate prototypes based on your actual code and exploring different solutions has been very productive. Interestingly, I kind of have a similar workflow where I use Copilot in ask mode for exploration before switching to agent mode for implementation; it sounds similar to Kiro, but somehow it’s more successful.
Anyways, trying to generate lots of code at once has almost always been a disaster, and even the most detailed prompt doesn’t really help much. I’d love to see what the code and projects of people claiming to run more than 5 LLMs concurrently look like, because with the tools I’m using, that would be a mess pretty fast.
I doubt there's much you could do to make the output better. And I think that's what really bothers me. We are layering all this bullshit on to try and make these things more useful than they are, but it's like building a house on sand. The underlying tech is impressive for what it is, and has plenty of interesting use cases in specific areas, but it flat out isn't what these corporations want people to believe it is. And none of it justifies the massive expenditure of resources we've seen.
Maybe the real question isn’t whether AI is useful, but whether we’ve designed workflows that let humans and AI collaborate effectively.
You can build the greatest house ever built, but if you build it on top of sand it's still going to collapse.
> but whether we’ve designed workflows that let humans and AI collaborate effectively.
In my experience with workflows that let humans and humans (let alone AIs) collaborate effectively, they are NP-hard problems.
It’s not necessarily faster to do this for a single task. But it’s faster when you can do 2-3 tasks at the same time. Agentic coding increases throughput.
Until you reach the human bottle neck of having to context switch, verify all the work, presumably tell them to fix it, and then switch back to what you were doing or review something else.
I believe people are being honest when they say these things speed them up, because I'm sure it does seem that way to them. But reality doesn't line up with the perception.
True, if you are in a big company with lots of people, you won't benefit much from the improved throughput of agentic coding.
A greenfield startup, however, with agentic coding in its DNA will be able to run loops around a big company with lots of human bottlenecks.
The question becomes, will greenfield startups, doing agentic coding from the ground up, replace big companies with these human bottlenecks like you describe?
What does a startup, built using agentic coding with proper engineering practices, look like when it becomes a big corporation & succeeds?
That's not my point at all. Doesn't matter where you work: if a developer is working in a code base with a bunch of agents, they are always going to be the bottleneck. All the agent threads have to merge back to the developer thread at some point. The more agent threads, the more context switching has to occur, and the smaller and smaller the productivity improvement gets, until you eventually end up in the negative.
I can believe a single developer with one agent doing some small stuff and using some other LLM tools can get a modest productivity boost. But having 5 or 10 of these things doing shit all at once? No way. Any gains are offset by having to merge and quality check all that work.
I've always assumed it is because they can't do the regular coding themselves. If you compare spending months on trying to shake a coding agent into not exploding too much with spending years on learning to code, the effort makes more sense
As a counterpoint, I’ve been coding professionally since 2010.
Every feature I’ve asked Claude Code to write was one I could’ve written myself.
And I’m quite certain it’s faster for my use case.
I won’t be bothered if you choose to ignore agents but the “it’s just useful for the inept” argument is condescending.
I'm in the same boat. I'm 20 years into my SWE career; I can write all the things Claude Code writes for me now, but it still makes me faster and lets me deliver better-quality features (like accessibility features, transitions, nice-to-have bells and whistles) I may not have had time for, or even thought of, otherwise. And all that with documentation and tests.
There is a chunk of devs using AI not because they believe it makes them more productive in the present, but because it might do so in the near future thanks to advances in AI tech/models. And some do it because they think their bosses might require them to work this way at some point in the future, so they can show preparedness and give the impression of being up to date with how the field evolves, even if in the end it turns out it doesn't speed things up that much.
That line of thinking makes no sense to me honestly.
We are years into this, and while the models have gotten better, the guard rails that have to be put on these things to keep the outputs even semi useful are crazy. Look into the system prompts for Claude sometime. And then we have to layer all these additional workflows on top... Despite the hype I don't see any way we get to this actually being a more productive way to work anytime soon.
And not only are we paying money for the privilege to work slower (in some cases people are shelling out for multiple services) but we're paying with our time. There is no way working this way doesn't degrade your fundamental skills, and (maybe) worse the understanding of how things actually work.
Although I suppose we can all take solace in the fact that our jobs aren't going anywhere soon. If this is what it takes to make these things work.
And most importantly, we're paying with our brain and skills degradation. Once all these services stop being subsidised there will be a massive amount of programmers who no longer can code.
I'm sorry to be blunt here, but the fact you're looking at idiotic use of Claude.md system prompts tells me you're not actually looking at the most productive users, and your opinion doesn't even cover 'where we are'.
I don't blame people who think this. I've stopped visiting AI subreddits because the average comment and post is just terrible, with some being straight-up delusional.
But broadly speaking - in my experience - either you have your documentation set up correctly and cleanly such that a new junior hire could come in and build or fix something in a few days without too many questions. Or you don't. That same distinction seems to cut between teams who get the most out of AI and those that insist everybody must be losing more time than it costs.
---
I suspect we could even flip it around: the cost it takes to get an AI functioning in your code base is a good proxy for technical debt.
I wasn't talking about the system prompts provided by users. I was talking about what the companies have to put between the users and the LLM.
Using something because it _might one day be useful_ is pretty weird. Just use it if and when it is useful.
But _everything_ you do initially falls into the category of _might one day be useful_, since you haven't yet learned how to do the thing well.
The claim I was responding to was that some people use our friends the magic robots not because they think they are useful now, but because they think they might be useful in the future.
It absolutely can be, by a huge margin.
You spend a few minutes generating a spec, then agents go off and do their coding, often lasting 10-30 minutes, including running and fixing lints, adding and running tests, ...
Then you come back and review.
But you had 10 of these running at the same time!
You become a manager of AI agents.
For many, this will be a shitty way to spend their time.... But it is very likely the future of this profession.
You didn't have 10 of them running though.
You want to do that, but I'll bet money you aren't doing it.
That's the problem: this is speculative; maybe it scales sometimes, but mostly people do not work on ten things at once.
“Fix the landing page”
“I’ll make you ten new ones!”
“No. Calm down. Fix this one, and do it now, not when you're finished playing with your prompts”
There are legitimate times when complex pieces of work decompose into parallel tasks, but it's the exception, not the norm.
Most complex work has linked dependencies that need to be done in order.
Remember the mythical man month? Anyone? Anyone???!!??
You can't just add “more parallel” to get things done faster.
I definitely did have 10 running sometimes, mostly just based on copy-pasting issues from the issue tracker.
Codex / Jules etc make this pretty easy.
It's often not a sustainable pace with where the current tooling is at, though.
Especially because you still need to do manual fixes and cleanups quite often.
> sometimes
Mhm. Money -> to the dealer.
Anyway… watch the videos the OP has of the coding live streams. That's the most interesting part of this post: actual real examples of people really using these tools in a way that is transferable and specifically detailed enough to copy and do yourself.
Could you share a link to the coding live streams? I can't find it.
I found this in the article: https://www.youtube.com/watch?v=42AzKZRNhsk
For each process, say you spend 3 minutes generating a spec. Presumably you also spend 5 minutes in PR and merging.
You can't do 10 of these processes at once, because there's 8 minutes of human administration which can't be parallelised for every ~20min block of parallelisable work undertaken by Claude. You can have two, and intermittently three, parallel processes at once under the regime described here.
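Spelled out with those same assumed numbers:

```python
# Back-of-the-envelope using the numbers above (3 min spec + 5 min review/merge
# of human time per task, ~20 min of unattended Claude work per task).
human_min_per_task = 3 + 5    # can't be parallelised across tasks
agent_min_per_task = 20       # can be parallelised

# In steady state the human frees up every 8 minutes, so the number of tasks
# sitting in their agent phase at any given moment is about:
agents_in_flight = agent_min_per_task / human_min_per_task
print(agents_in_flight)  # 2.5 -> "two, and intermittently three" in practice
```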
Parallel vs serial computation... it's something you would hope software engineers understand...
From the code I've seen surprisingly few do.
The number you have running is irrelevant. Primarily because humans are absolutely terrible at multitasking and context switching. An endless number of studies have been done on this. Each context switch costs you a non-trivial amount of time. And yes, even in the same project, especially big ones, you will be context switching each time one of these finishes its work.
That, coupled with the fact that you have to meticulously review every single thing the AI does, is going to obliterate any perceived gains you get from going through all the trouble to set this up. And on top of that it's going to get expensive as fuck quickly on a non-trivial code base.
And before someone says "well you don't have to be that thorough with reviews", in a professional setting you absolutely do. Every single AI policy in every single company out there makes the employee using the tool solely responsible for the output of the AI. Maybe you can speed run when you're fucking around on your own, but you would have to be a total moron to risk your job by not being thorough. And the more mission critical the software, the more thorough you have to be.
At the end of the day a human with some degree of expertise is the bottleneck. And we are decades away from these things being able to replace a human.
How about a bug fixing use case? Let an agent pick bugs from Jira and do some research and thinking, setting up data and an environment for reproduction. Let it write a unit test manifesting the bug (making it a failing test). Let it take a shot at implementing the fix. If it succeeds, let it make a PR.
This can all be done autonomously without user interaction. Now, many bugs can be a few lines of code and might be relatively easy to review. Some of these bug fixes may fail, may be wrong etc., but even if half of them were good, this is absolutely worth it. In my specific experience the success rate was around 70%, and the rest of the fixes were not all worthless but provided some more insight into the bug.
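Roughly, the loop looks like this - every helper here is a placeholder for whatever issue tracker, agent tooling and CI you actually use, not a real API:

```python
# Illustrative sketch only; the agent_* and tracker helpers below are hypothetical
# stand-ins, not functions from any real SDK.

def fetch_open_bugs(project):        # pull candidate bugs from the tracker
    raise NotImplementedError

def agent_write_failing_test(bug):   # agent researches the bug and reproduces it as a test
    raise NotImplementedError

def agent_implement_fix(bug, test):  # agent attempts the code change
    raise NotImplementedError

def test_passes(test):               # run just that test
    raise NotImplementedError

def open_pull_request(bug, test):    # small PR for a human to review
    raise NotImplementedError

def try_autonomous_fix(bug):
    test = agent_write_failing_test(bug)
    if test_passes(test):
        return None                  # test doesn't reproduce the bug; skip this one
    agent_implement_fix(bug, test)
    if test_passes(test):
        return open_pull_request(bug, test)
    return None                      # failed attempts still leave a repro test behind

for bug in fetch_open_bugs("MYPROJECT"):
    try_autonomous_fix(bug)
```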
> you have to meticulously review every single thing the AI does
Joke's on you (and me? and I guess on us as a profession?).
The biggest challenge I found with LLMs on a large codebase is them making the same mistakes again and again. How do you keep track of the architecture decisions in the context of every task on a large codebase?
Very very clear, unambiguous, prompts and agent rules. Use strong language like "must" and "critical" and "never" etc. I would also try working on smaller sections of a large codebase at a time too if things are too inaccurate.
The AI coding tools are going to be looking at other files in the project to help with context. Ambiguity is the death of AI effectiveness. You have to keep things clear, and that may require addressing smaller sections at a time, unless you can really configure the tools in ways to isolate things.
This is why I like tools that have a lot of control and are transparent. If you ask a tool what the full system and user prompt is and it doesn't tell you? Run away from that tool as fast as you can.
You need to have introspection here. You have to be able to see what causes a behavior you don't want and be able to correct it. Any tool that takes that away from you is one that won't work.
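For example, a made-up fragment of agent rules in the style I mean (not from any particular tool):

```
# agent rules (illustrative fragment)
- You MUST run the linter and the full test suite before declaring a task done.
- You MUST NOT touch files under vendor/ or generated/ for any reason.
- NEVER mock the function under test; only mock external I/O.
- CRITICAL: if the requirements are ambiguous, stop and ask instead of guessing.
```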
> Use strong language like "must" and "critical" and "never" etc.
Truly we live in the stupidest timeline. Imagine if you had a domestic robot but when you asked it make you breakfast you had to preface your request with “it’s critical that you don’t kill me.”
Or when you asked it to do the laundry you had to remember to tell it that it “must not make its own doors by knocking holes in the wall” and hope that it listens.
> Truly we live in the stupidest timeline.
Wholeheartedly agree. Truly, I look around and see no shortage of evidence for this assertion.
EDITED to make it clear I am agreeing with parent.
There is even a chance our timeline might include that robot too at some point...
Book recommendation no one asked for but which is essentially about some guy living through multiple more or less stupid timelines: Count to Eschaton series by John C. Wright
"Better living through prompt fondling."
I always chuckle at these rain dance posts.
INVOKE THE ULTRATHINK OH MIGHTY CLAUDE AND BLESS MY CODE.
Have you tried kissing the keyboard before you press enter? It makes the code 123% more flibbeled.
`opencode` will read any amount of `!cmd` output.
I start my sessions with something like `!cat ./docs/*` and I can start asking questions. Make sure you regularly ask it to point out any inconsistencies or ambiguity in the docs.
nice !
Whenever I see Claude Code make the same mistake multiple times I add instructions to claude.md to avoid it in the future.
In some sense “the same mistakes again and again” is either a prompting problem or a “you” problem insofar as your expectations differ from the machine overlords.
This looks very cool.
I see it has a pseudo code step, was it helpful at all to try to define a workflow, process or procedure beforehand?
I've also heard that keeping each file down to 100 lines is critical before connecting them. Noticed the same but haven't tried it in depth.
File size matters if you don’t have strategically placed “read the entire file” instructions for certain parts of the workflow (we do)
This article is like a bookmark in time of exactly where I gave up (in July) on managing context in Claude Code.
I made specs for every part of the code in a separate folder, and that folder had logs on every feature I worked on. It was an API server in Python with many services like accounts, notifications, subscriptions etc.
It got to the point where managing context became extremely challenging. Claude would not be able to determine the business logic properly, and it can get complex - e.g. a simple RBAC system with an account and a profile, with a junction table for roles joining an account to a profile. In the end, what kind of worked was giving it UML diagrams of the relationships with examples to make it understand and behave better.
I think that was one of the key reasons we built research_codebase.md first - the number one concern is
"what happens if we end up owning this codebase but don't know how it works / don't know how to steer a model on how to make progress"
There are two common problems w/ primarily-AI-written code
1. Unfamiliar codebase -> research lets you get up to speed quickly on flows and functionality
2. Giant PR Reviews Suck -> plans give you ordered context on what's changing and why
Mitchell has praised ampcode for the thread sharing, another good solution to #2 - https://x.com/mitchellh/status/1963277478795026484
> the number one concern "what happens if we end up owning this codebase but ... don't know how to steer a model on how to make progress"
> Research lets you get up to speed quickly on flows and functionality
This is the _je ne sais quoi_ that people who are comfortable with AI have made peace with and those who are not have not. If you don't know what the code base does or how to make progress you are effectively trusting the system that built the thing you don't understand to understand the thing and teach you. And then from that understanding you're going to direct the teacher to make changes to the system it taught you to understand. Which suggests a certain _je ne sais quoi_ about human intelligence that isn't present in the system, but which would be necessary to create an understanding of the thing under consideration. Which leads to your understanding being questionable because it was sourced from something that _lacks_ that _je ne sais quoi_. But the time-to-failure here is on the order of "lifetimes". Of features, of codebases, of persons.
So I can attest to the fact that all of the things proposed in this article actually work. And you can try it out yourself on any arbitrary code base within a few minutes.
This is how: I work for a company called NonBioS.ai - we already implement most of what is mentioned in this article. Actually, we implemented this about 6 months back, and what we have now is an advanced version of the same flow. Every user in NonBioS gets a full Linux VM with root access. You can ask NonBioS to pull in your source code and ask it to implement any feature. The context is all managed automatically through a process we call "Strategic Forgetting", which is in some ways an advanced version of the logic in this article.
Strategic Forgetting handles the context automatically - think of it like automatic compaction. It evaluates information retention based on several key factors:
1. Relevance Scoring: We assess how directly information contributes to the current objective vs. being tangential noise
2. Temporal Decay: Information gets weighted by recency and frequency of use - rarely accessed context naturally fades
3. Retrievability: If data can be easily reconstructed from system state or documentation, it's a candidate for pruning
4. Source Priority: User-provided context gets higher retention weight than inferred or generated content
The algorithm runs continuously during coding sessions, creating a dynamic "working memory" that stays lean and focused. Think of it like how you naturally filter out background conversations to focus on what matters.
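To make that concrete, here is a heavily simplified, illustrative sketch of how those four factors could be combined into a retention score - the real implementation differs and is more involved:

```python
# Illustrative only - a simplified combination of the four factors above,
# not the actual implementation.
import math, time
from dataclasses import dataclass

@dataclass
class ContextItem:
    text: str
    relevance: float     # 0..1 similarity to the current objective
    last_used: float     # unix timestamp of last access
    uses: int            # how often it has been referenced
    retrievable: bool    # can be rebuilt from files/docs if dropped
    user_provided: bool  # user input vs. inferred/generated content

def retention_score(item: ContextItem, now: float, half_life_s: float = 1800.0) -> float:
    decay = 0.5 ** ((now - item.last_used) / half_life_s)  # temporal decay
    frequency = math.log1p(item.uses)                      # frequent use keeps it around
    score = item.relevance * decay * (1.0 + frequency)
    if item.retrievable:
        score *= 0.5    # cheap to reconstruct -> first candidate for pruning
    if item.user_provided:
        score *= 2.0    # source priority: user context outranks inferred content
    return score

def prune(items: list[ContextItem], keep: int) -> list[ContextItem]:
    now = time.time()
    return sorted(items, key=lambda i: retention_score(i, now), reverse=True)[:keep]
```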
And we have tried it out in very complex code bases and it works pretty well. Once you know how well it works, you will not have a hard time believing that the days of using IDEs to edit code are probably numbered.
Also - you can try it out for yourself very quickly at NonBioS.ai. We have a very generous free tier that will be enough for the biggest code base you can throw at NonBioS. However, big feature implementations or larger refactorings might take longer than what is afforded in the free tier.
It's strange that the author is bragging that this 35K LOC was researched and implemented in 7 hours, when there are 40 commits spanning 7 days. Was it 1 hour per day or what?
Also quite funny that one of the latest commits is "ignore some tests" :D
if you read further down, I acknowledge this
> While the cancelation PR required a little more love to take things over the line, we got incredible progress in just a day.
FWIW I think your style is better and more honest than most advocates. But I'd really love to see some examples of things that completely failed. Because there have to be some, right? But you hardly ever see an article from an AI advocate about something that failed, nor from an AI skeptic about something that succeeded. Yet I think these would be the types of things that people would truly learn from. But maybe it's not in anyone's financial interest to cross borders like that, for those who are heavily vested in the ecosystem.
But, yeah, looking again, that was a pretty big omission. And even moreso, a missed opportunity! I think if this had been called out more explicitly, then rather than arguing whether this is a realistic workflow or not, we'd be seeing more thoughtful conversation about how to fix the remaining problems.
I don't mean to sound discouraging. Keep up the good work!
there is a portion in the article where I talk about how our hadoop refactor completely failed
I think what the OP is asking for is an article _like this one_ about where you go in-depth into what you tried, where the system went, and more specifically what went wrong (even if it's just a list of "undifferentiated issues"). Because "we tried a thing. It didn't work. We bailed out." doesn't show off the rough edges of the tool in a way that helps people understand "the shape of the elephant".
Or, in the vein of https://adamdrake.com/command-line-tools-can-be-235x-faster-... - "here's a place I wouldn't use an AI tool because _other thing_ is far better"
You do acknowledge this but this doesn't make the "spent 7 hours and shipped 35k LOC" claim factually correct or true. It sure sounds good but it's disingenuous, because shipping != making progress. Shipping code means deploying it to the end users.
I'm always amazed when I see xKLOC metrics being thrown out like it matters somehow. The bar has always been shipped code. If it's not being used, it's merely a playground or learning exercise.
would "wrote" be more appropriate than "shipped"?
the most accurate wording here is "generated".
we generated 35K LOC in 7 hours, 7 days of fixes and we shipped it.
This at least makes it clearer that it is on par with what it would take a senior BAML team member to accomplish, which is kind of impressive on its own. Not sure about ignoring the tests though.
I don’t think it is any better. 35kLOC of slop isn’t a good metric by any measure, no matter what word you use.
There are a lot of people declaring this, proclaiming that about working with AI, but nobody presents the details. Talk is cheap, show me the prompts. What will be useful is to check in all the prompts along with code. Every commit generated by AI should include a prompt log recording all the prompts that led to the change. One should be able to walkthrough the prompt log just as they may go through the commit log and observe firsthand how the code was developed.
I agree, the rare times when someone has shared prompts and AI generated code I have not been impressed at all. It very quickly accrues technical debt and lacks organization. I suspect the people who say it’s amazing are like data engineers who are used to putting everything in one script file, React devs where the patterns and organization are well defined and constrained, or people who don’t code and don’t even understand the issues in their generated code yet.
This blog post of mine will be evergreen: https://dmitriid.com/everything-around-llms-is-still-magical...
Moreover, show me the money!!
A few weeks later, @hellovai and I paired on shipping 35k LOC to BAML, adding cancellation support and WASM compilation - features the team estimated would take a senior engineer 3-5 days each.
Sorry, had they effectively estimated that an engineer should produce 4-6KLOC per day (that's before genAI)?
The missing detail here is that the senior engineer would probably have shipped it in 2k lines of code
Or 1k lines of functional, readable, testable, commented code... but who cares, we'll abstract it all away soon enough.
And note that, as admitted elsewhere, it _actually_ took a week: https://news.ycombinator.com/item?id=45351546
It seems we're still collectively trying to figure out the boundaries of "delegation" versus "abstraction" which I personally don't think are the same thing, though they are certainly related and if you squint a bit you can easily argue for one or the other in many situations.
> We've gotten claude code to handle 300k LOC Rust codebases, ship a week's worth of work in a day, and maintain code quality that passes expert review.
This seems more like delegation just like if one delegated a coding task to another engineer and reviewed it.
> That in two years, you'll be opening python files in your IDE with about the same frequency that, today, you might open up a hex editor to read assembly (which, for most of us, is never).
This seems more like abstraction just like if one considers Python a sort of higher level layer above C and C a higher level layer above Assembly, except now the language is English.
Can it really be both?
I would say it's much more about abstraction and the leverage abstractions give you.
You'll also note that while I talk about "spec driven development", most of the tactical stuff we've proven out is downstream of having a good spec.
But in the end a good spec is probably "the right abstraction" and most of these techniques fall out as implementation details. But to paraphrase sandy metz - better to stay in the details than to accidentally build against the wrong abstraction (https://sandimetz.com/blog/2016/1/20/the-wrong-abstraction)
I don't think delegation is right - when me and vaibhav shipped a week's worth of work in a day, we were DEEPLY engaged with the work, we didn't step away from the desk, we were constantly resteering and probably sent 50+ user messages that day, in addition to some point-edits to markdown files along the way.
It’s definitely not abstraction. You don’t watch a compiler output machine code and constantly “resteer” it.
I continue to write codebases in programming languages, not English. LLM agents just help me manipulate that code. They are tools that do work for me. That is delegation, not abstraction.
To write and review a good spec, you also need to understand your codebase. How are you going to do that without reading the code? We are not getting abstracted away from our codebases.
For it to be an abstraction, we would need our coding agents to not only write all of our code, they would also need to explain it all to us. I am very skeptical that this is how developers will work in the near future. Software development would become increasingly unreliable as we won't even understand what our codebases actually do. We would just interact with a squishy lossy English layer.
You don’t think early c programmers spent a lot of time reading the assembly that was produced?
No not really. They didn’t need to spend a lot of time looking at the output because (especially back then) they mostly knew exactly what the assembly was going to look like.
With an LLM, you don't need to move down to the code layer so you can optimize a tight loop. You need to look at the code so you can verify that the LLM didn't write a completely different program than what you asked it to write.
Probably at first when the compiler was bad at producing good assembly. But even then, the compiler would still always produce code that matches the rules of the language. This is not the case with LLMs. There is no indication that in the future LLMs will become deterministic such that we could literally write codebases in English and then "compile" them using an LLM into a programming language of our choice and rely on the behaviour of the final program matching our expectations.
This is why LLMs are categorically not compilers. They are not translating English code into some other type of code. They are taking English direction and then writing/editing code based upon that. They are working on a codebase alongside us, as tools. And then you still compile that code using an actual compiler.
We will start to trust these tools more and more, and probably spend less time reviewing the code they produce over time. But I do not see a future where professional developers completely disregard the actual codebase and rely entirely on LLMs for code that matters. That would require a completely different category of tools than what we have today.
I mean, the ones who were actually _writing_ a C compiler, sure, and to some who were in performance critical spaces (early C compilers were not _good_). But normal programmers, checking for correctness, no, absolutely not. Where did you get that idea?
(The golden age of looking at compiler-generated assembly would've been rather later, when processors added SIMD instructions and compilers started trying to get clever about using them.)
If you haven't tried the research -> plan -> implementation approach here, you are missing out on how good LLMs are. It completely changed my perspective.
The key part was really just explicitly thinking about different levels of abstraction at different levels of vibecoding. I was doing it before, but not explicitly in discrete steps, and that was where I got into messes. The prior approach made checkpointing / reverting very difficult.
When I think of everything in phases, I do similar stuff w/ my git commits at "phase" levels, which makes design decisions easier to make.
I also spend ~4-5 hours cleaning up the code at the very, very end once everything works. But it's still way faster than writing hard features myself.
tbh I think the thing that's making this new approach so hard to adopt for many people is the word "vibecoding"
Like yes vibecoding in the lovable-esque "give me an app that does XYZ" manner is obviously ridiculous and wrong, and will result in slop. Building any serious app based on "vibes" is stupid.
But if you're doing this right, you are not "coding" in any traditional sense of the word, and you are *definitely* not relying on vibes
Maybe we need a new word
I'm sticking to the original definition of "vibe coding", which is AI-generated code that you don't review.
If you're properly reviewing the code, you're programming.
The challenge is finding a good term for code that's responsibly written with AI assistance. I've been calling it "AI-assisted programming" but that's WAY too long.
It’s just programming. We don’t use a different word for writing code in an IDE either.
we can come up with something better :)
alex reibman proposed hyperengineering
i've also heard "aura coding", "spec-driven development" and a bunch of others I don't love.
but we def need a new word cause vibe coding aint it
Vibe coding is accepting ai output based on vibes. Simple as that.
You can vibe code using specs or just by having a conversation.
AI is the new pAIr programming.
> but not explicitly in discrete steps and that was where i got into messes.
I've said this repeatedly: I mostly use it for boilerplate code, or when I'm having a brain fart of sorts. I still love to solve things for myself, but AI can take me from "I know I want x, y, z" to "oh look, I got to x, y, z" in under 30 minutes, when it could have taken hours. For side projects this is fine.
I think if you do it piecemeal it should almost always be fine. When you try to tell it to do too much, you and the model both don't consider edge cases (ask it for those too!) and are more prone to a rude awakening eventually.
Good pointers on decomposing and looking at implementation or fixing in chunks.
1. Break down the feature or bug report into a technical implementation spec. Add in CoT for the splits.
2. Verify the implementation spec. Feed reviews back to your original agent that created the spec. Edit, merge, integrate feedback.
3. Transform the implementation spec into an implementation plan - logically split into modules, looking at the dependency chain.
4. Build, test and integrate continuously with coding agents.
5. Squash the commits if needed into a single one for the whole feature.
Generally has worked well as a process when working on a complex feature. You can add in HITL at each stage if you need more verification.
For larger codebases always maintain an ARCHITECTURE.md and for larger modules a DESIGN.md
I admittedly haven't tried this approach at work yet, but at home while working on a side project, I'll make a new feature branch and give Claude a prompt about what the feature is with as much detail as possible. I then have it generate a CLAUDE-feature.md and place an implementation plan in it along with any supporting information (things we have access to in the codebase, etc.).
I'll then prompt it for more, based on whether my interpretation of the file is missing anything or has confusing instructions or details.
usually in-between larger prompts I'll do a full /reset rather than /compact, have it reference the doc, and then iterate some more.
once it's time to try implementing I do one more /reset, then go phase by phase of the plan in increments /reset-ing between each and having it update the doc with its progress.
generally works well enough but not sure i'd trust it at work.
My advice - never use compact; always stash your context to a markdown file or a wordy git commit message and then clear context.
You want control over and visibility into what’s being compacted, and /compact doesn’t do great on either
> It was uncomfortable at first. I had to learn to let go of reading every line of PR code. I still read the tests pretty carefully, but the specs became our source of truth for what was being built and why.
This is exactly right. Our role is shifting from writing implementation details to defining and verifying behavior.
I recently needed to add recursive uploads to a complex S3-to-SFTP Python operator that had a dozen path manipulation flags. My process was:
* Extract the existing behavior into a clear spec (i.e., get the unit tests passing).
* Expand that spec to cover the new recursive functionality.
* Hand the problem and the tests to a coding agent.
I quickly realized I didn't need to understand the old code at all. My entire focus was on whether the new code was faithful to the spec. This is the future: our value will be in demonstrating correctness through verification, while the code itself becomes an implementation detail handled by an agent.
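For a sense of what step one looks like, here's a simplified sketch of a characterization test; the operator and its flag below are invented stand-ins, not the real code:

```python
# Simplified sketch: pin down the existing behavior as tests before expanding the
# spec. transfer() and its flatten flag are hypothetical stand-ins for the real operator.

class FakeS3:
    def __init__(self, keys):
        self.keys = keys
    def list_keys(self, prefix):
        return [k for k in self.keys if k.startswith(prefix)]

class FakeSFTP:
    def __init__(self):
        self.uploaded = []
    def put(self, source_key, remote_path):
        self.uploaded.append((source_key, remote_path))

def transfer(s3, sftp, prefix, flatten=False):   # stand-in for the real operator
    for key in s3.list_keys(prefix):
        remote = key.split("/")[-1] if flatten else key
        sftp.put(key, remote)

def test_existing_behavior_preserves_key_paths():
    s3, sftp = FakeS3(["in/a.csv", "in/sub/b.csv"]), FakeSFTP()
    transfer(s3, sftp, prefix="in/")
    assert sftp.uploaded == [("in/a.csv", "in/a.csv"), ("in/sub/b.csv", "in/sub/b.csv")]

def test_existing_flatten_flag_drops_directories():
    s3, sftp = FakeS3(["in/sub/b.csv"]), FakeSFTP()
    transfer(s3, sftp, prefix="in/", flatten=True)
    assert sftp.uploaded == [("in/sub/b.csv", "b.csv")]
```

Once those pass against the current code, you expand the same suite to describe the new recursive behavior and hand both to the agent.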
> Our role is shifting from writing implementation details to defining and verifying behavior.
I could argue that our main job was always that - defining and verifying behavior. As in, it was a large part of the job. Time spent on writing implementation details has always been on a downward trend via higher-level languages, compilers and other abstractions.
Tell that to all the engineers that want to argue over minutia for days in a PR
> My entire focus was on whether the new code was faithful to the spec
This may be true, but see Postel's Law, that says that the observed behavior of a heavily-used system becomes its public interface and specification, with all its quirks and implementation errors. It may be important to keep testing that the clients using the code are also faithful to the spec, and detect and handle discrepancies.
I believe that's Hyrum's Law.
Claude Plays Pokemon showed that too. AI is bad at deciding when something is "working" - it will go in circles forever. But an AI combined with a human to occasionally course correct is a powerful combo.
If you actually define every inch of behavior, you are pretty much writing code. If there's any line in the PR that you can't instantly grok the meaning of, you probably haven't defined the full breadth of the behavior.
Maybe I am just misunderstanding. I probably am; seems like it happens more and more often these days
But.. I hate this. I hate the idea of learning to manage the machine's context to do work. This reads like a lecture in an MBA class about managing certain types of engineers, not like an engineering doc.
Never have I wanted to manage people. And never have I even considered my job would be to find the optimum path to the machine writing my code.
Maybe firmware is special (I write firmware)... I doubt it. We have a cursor subscription and are expected to use it on production codebases. Business leaders are pushing it HARD. To be a leader in my job, I don't need to know algorithms, design patterns, C, make, how to debug, how to work with memory mapped io, what wear leveling is, etc.. I need to know 'compaction' and 'context engineering'
I feel like a ship corker inspecting a riveted hull
Guess it boils down to personality, but I personally love it. I got into coding later in life, coming from a career that involved reading and writing voluminous amounts of text in English. I got into programming because I wanted to build web applications, not out of any love for the process of programming in and of itself. The less I have to think and write in code, the better. Much happier to be reading it and reviewing it than writing it myself.
No one likes programming that much. That's like saying someone loves speaking English. You have an idea and you express it. Sometimes there's additional complexity that gets in the way (initializing the library, memory cleanup, ...), but I put those at the same level as proper greetings in a formal letter.
It also helps starting small, get something useful done and iterate by adding more features overtime (or keeping it small).
> No one likes programming that much. That's like saying someone loves speaking English. You have an idea and you express it.
I can assure you both kinds of people exist. Expressing ideas as words or code is not a one-way flow if you care enough to slow down and look closely. Words/clauses and data structures/algorithms exert their own pull on ideas and can make you think about associated and analogous ideas, alternative ways you could express your solution, whether it is even worth solving explicitly and independently of a more general problem, etc.
IMO, that’s a sign of overthinking (and one thing I try hard to not get caught in). My process is usually:
- What am I trying to do?
- What data do I have available?
- Where do they come from?
- What operations can I use?
- What’s the final state/output?
Then it’s a matter of shifting into the formal space, building and linking stuff.
What I did observe is a lot of people hate formalizing their thoughts. Instead they prefer tweaking stuff until something kinda works and they can go on to the next ticket/todo item. There’s no holistic view about the system. And they hate the 5 why’s. Something like:
- Why is the app displaying “something went wrong” when I submit the form?
- Why is the response an error when the request is valid?
- Why is the data persisted when the handler is failing and giving a stack trace in the log?
- Why is it complaining about missing configuration for Firebase?
- …
Ignorance is the default state of programming effort. But a lot of people have great difficulty saying "I don't know" AND going to find the answer they lack.
None of this is excluded by my statement. And arguably someone else can draw a line in the sand and say most of this is overthinking somehow and you should let the machine worry about it.
I would love to let the computer do the investigative work for me, but I have to double check it, and there's not much mental energy and time saved (if you care about work quality). When I use `pgrep` to check if a process is running, I don't have to inspect the kernel memory to see if it's really there.
It's very much faster, cognitively, to just understand the project and master the tooling. Then it just becomes routine, like playing a short piano piece for the 100th time.
I know lots of programmers (usually the good ones) who do love programming.
I've started to use agents on some very low-level code, and have middling results. For pure algorithmic stuff, it works great. But I asked it to write me some arm64 assembly and it failed miserably. It couldn't keep track of which registers were which.
I imagine the LLM's have been trained on a lot less firmware code than say, HTML
Honestly - if it's such a good technique it should be built into the tool itself. I think just waiting for the tools to mature a bit will mean you can ignore a lot of the "just do xyz" crap.
It's not at senior engineer level until it asks relevant questions about lacking context instead of blindly trying to solve problems IMO.
I am still sceptical of the ROI and the time I am supposed to sink into trying and learning these AI tools, which seem to be replacing each other every week.
For me the biggest difficulty is I find it hard to read unverifiable documentation. It's like dyslexia - if I can't connect the text content with runnable code, I feel lost in 5 minutes.
So with this approach of spending 3 hours on planning without verification in code, that's too hard for me.
I agree the context compaction sounds good. But I'm not sure if an md file is good enough to carry the info from research to plan and implementation. Personally I often find the context is too complex or the problem is too big. I just open a new session to resolve a smaller, more specific problem in source code, then test and review the source code.
Context has never been the bottleneck for me. AI just stops working when I reach certain things that AI doesn't know how to do.
My problem is it keeps working, even when it reaches certain things it doesn't know how to do.
I've been experimenting with Github agents recently, they use GPT-5 to write loads of code, and even make sure it compiles and "runs" before ending the task.
Then you go and run it and it's just garbage, yeah it's technically building and running "something", but often it's not anything like what you asked for, and it's splurged out so much code you can't even fix it.
Then I go and write it myself like the old days.
I have the same experience with CC. It loves to comment out code, add a "fallback" implementation that returns mock data, and act like the thing works.
> Context has never been the bottleneck for me. AI just stops working when I reach certain things that AI doesn't know how to do.
It's context all the way down. That just means you need to find and give it the context to enable it to figure out how to do the thing. Docs, manuals, whatever. Same stuff that you would use to enable a human that doesn't know how to do it to figure out how.
At that point it's easier to implement the thing yourself, and then let AI work with that.
Or just forget the AI entirely, if you can build it yourself then do it yourself
I treat "uses AI tools" as a signal that a person doesn't know what they are doing
Specifically what did you have difficulty implementing where it "just stops working"?
I've had AI totally fail several times on Swift concurrency issues, i.e. threads deadlocking or similar issues. I've also had AI totally fail on memory usage issues in Swift. In both cases I've had to go back to reasoning over the bugs myself and debugging them by hand, fixing the code by hand.
Anything it has not been trained on. Try getting AI to use OpenAI's responses API. You will have to try very hard to convince it not to use the chat completions API.
In Cursor you can index docs by just adding a URL and then reference it like file context in the editor.
Yeah, once again you need the right context to override what's in the weights. It may not know how to use the Responses API, so you need to provide examples in context (or tools to fetch them).
I'm struggling to understand what the issue with that is.
This is just an issue with people who expect AI to solve all of life's problems before they get out of bed, not realising they have no idea how AI works or what it produces, and who decide "it stops working because it sucks" instead of "it stops working because I don't know what I'm doing".
In my limited experiments with Gemini: it stops working when presented with a program containing fundamental concurrency flaws. Ask it to resolve a race condition or deadlock and it will flail, eventually getting caught in a loop, suggesting the same unhelpful remedies over and over.
I imagine this has to do with concurrency requiring conceptual and logical reasoning, which LLMs are known to struggle with about as badly as they do with math and arithmetic. Now, it's possible that the right language to work with the LLM in these domains is not program code, but a spec language like TLA+. However, at that point, I'd probably just spend less effort to write the potentially tricky concurrent code myself.
1. Research -> Plan -> Implement
2. Write down the principles and assumptions behind the design and keep them current
In other words, the same thing successful human teams on complex projects do! Have we become so addicted to “attention-deficit agile” that this seems like a new technique?
Imagine, detailed specs, design documents, and RFC reviews are becoming the new hotness. Who would have thought??
Yeah, it's kinda funny how some bigger, more sophisticated eng orgs that would be called "slow and ineffective" by smaller teams are actually pretty dang well set up to leverage AI.
All because they have been forced to master technical communication at scale.
but the reason I wrote this (and maybe a side effect of the SF bubble) is MOST of the people I have talked to, from 3-person startups to 1000+ employee public companies, are in a state where this feels novel and valuable, not a foregone conclusion or something happening automatically
Tasted like a sales pitch the whole way and what do ya know at the very end there it is
Thanks for writing such a detailed article... lots of very well-supported information.
I've been working on something what I call Micromanaged Driven Development https://mmdd.dev and wrote about it at https://builder.aws.com/content/2y6nQgj1FVuaJIn9rFLThIslwaJ/...
I'm on a similar search, and I'm stoked to see that many people riding the wave of coding with AI are moving in this direction.
Lots of learning ahead.
> And yeah sure, let's try to spend as many tokens as possible
It'd be nice if the article included the cost for each project. A 35k LOC change in a 350k codebase with a bunch of back and forth and context rewriting over 7 hours, would that be a regular subscription, max subscription, or would that not even cover it?
Oh, oops it says further down
> oh, and yeah, our team of three is averaging about $12k on opus per month
I'll have to admit, I was intrigued with the workflow at first. But emm, okay, yeah, I'll keep handwriting my open source contributions for a while.
From a cost perspective, you would definitely want a Claude Max subscription for this.
Yes - correct. For the record, if spending raw tokens, the 2 PRs to BAML cost about $650.
But yes, we switched off per-token this week because we ran out of Anthropic credits; we're on the Max plan now.
Haha, when I asked claude the question, it estimated $20-45. https://claude.ai/share/5c3b0592-7bc9-4c40-9049-459058b16920
Horrible, right? When I asked gemini, it guessed 37 cents! https://g.co/gemini/share/ff3ed97634ba
I use a similar pattern but without the subagents. I get good results with it. I review and hand edit "research" and plans. I follow up and hand edit code changes. It makes me faster, especially in unfamiliar codebases.
But the write up troubles me. If I'm reading correctly, he did 1 bugfix (approved and merged) and then 2 larger PRs (1 merged, 1 still in draft over a month later). That's an insanely small sample size to draw conclusions from.
How can you talk like you've just proven the workflow works "for brownfield codebases"? You proved it worked for 2/3 tasks in 2 codebases, one failure (we can't say it works until the code is shipped IMO).
> Sean proposes that in the AI future, the specs will become the real code. That in two years, you'll be opening python files in your IDE with about the same frequency that, today, you might open up a hex editor to read assembly (which, for most of us, is never).
Only if AI code generation is correct 99.9% of the time and almost never hallucinates. We trust compilers and don't read assembly code because we know it's deterministic and the output can never be wrong (barring bugs and certain optimization issues, which are rare/one-time fixes). As long as the generated code is not doing what the original "code" (in this case, the specs) says it should do, humans need to go back and fix things themselves.
I used a similar pattern. When asking AI to do a large implementation, I ask gemini-2.5-pro to write a very detailed overview implementation plan. Then I review it. Then I ask gemini-2.5-pro to split the plan into multiple stages and write a detailed implementation plan for each stage. Then I ask Claude Sonnet to read the overview plan and implement stage n. I found that this is the only way to complete a major implementation with a relatively high success rate.
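Roughly, the loop is this; ask() is just a placeholder for whatever client you use to call each model, and the task and "## Stage" convention are invented examples:

```python
# Sketch of the plan -> stages -> implement workflow above. ask() is a placeholder,
# not a real SDK call; the feature and stage-splitting convention are invented.

def ask(model: str, prompt: str) -> str:
    raise NotImplementedError   # your Gemini / Claude client of choice

feature = "add cancellation support to the job runner"

overview = ask("gemini-2.5-pro",
               f"Write a very detailed overview implementation plan for: {feature}")
# review and edit the overview by hand before continuing

staged = ask("gemini-2.5-pro",
             "Split this plan into stages, each starting with '## Stage', "
             "with a detailed implementation plan per stage:\n" + overview)

for n, stage in enumerate(staged.split("## Stage")[1:], start=1):
    ask("claude-sonnet",
        f"Read this overview plan:\n{overview}\n\nNow implement stage {n}:\n{stage}")
    # review and test each stage's diff before moving on
```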
This article bases its argument on the premise that AI _at worst_ will increase developer productivity by 0-10%. But several studies have found that not to be true at all. AI can, and does, make some people less effective.
There's also the more insidious gap between perceived productivity and actual productivity. Doesn't help that nobody can agree on how to measure productivity even without AI.
"AI can, and does, make some people less effective"
So those people should either stop using it or learn to use it productively. We're not doomed to live in a world where programmers start using AI, lose productivity because of it and then stay in that less productive state.
If managers are convinced by stakeholders who relentlessly put out pro-"AI" blog posts, then a subset of programmers can be forced to at least pretend to use "AI".
They can be forced to write in their performance evaluation how much (not if, because they would be fired) "AI" has improved their productivity.
Both (1) "AI can, and does, make some people less effective" and (2) "the average productivity boost (~20%) is significant" (per Stanford's analysis) can be true.
The article at the link is about how to use AI effectively in complex codebases. It emphasizes that the techniques described are "not magic", and makes very reasonable claims.
The techniques described sound like just as much work as, if not more than, just writing the code. The claimed output isn't even that great; it's comparable to the speed you would expect a skilled engineer to move at in a startup environment.
> the techniques described sound like just as much work, if not more, than just writing the code.
That's very fair, and I believe that's true for you and for many experienced software developers who are more productive than the average developer. For me, AI-assisted coding is a significant net win.
I tend to think about it like vim - you will feel slow and annoyed for the first few weeks, but investing in these skills is massive +EV long term.
Yet a lot of people never bother to learn vim, and are still outstanding and productive engineers. We're surely not seeing any memos "Reflexive vim usage is now a baseline expectation at [our company]" (context: https://x.com/tobi/status/1909251946235437514)
The as-of-yet unanswered question is: Is this the same? Or will non-LLM-using engineers be left behind?
Perhaps if we get the proper thought influencers on board we can look forward to C-suite VI mandates where performance reviews become descriptions of how we’ve boosted our productivity 10x with effective use of VI keyboard agents, the magic of g-prefixed VI technology, VI-power chording, and V-selection powered column intelligence.
letting people pick their own editors is a zirp phenomenon
How many skilled engineers can you afford to hire? Vs. Far more mediocre engineers who know how to leverage these tools?
Definitely - the Stanford video has a slide about how many cases caused people to be even slower than without AI.
According to the Stanford video the only cases (statistically speaking) where that happened was high-complexity tasks for legacy / low popularity languages, no? I would imagine that is a small minority of projects. Indeed, the video cites the overall productivity boost at 15 - 20% IIRC.
Question for discussion - what steps can I take as a human to set myself up for success where success is defined by AI made me faster, more efficient etc?
In many cases (though not all) it's the same thing that makes for great engineering managers:
smart generalists with a lot of depth in maybe a couple of things (so they have an appreciation for depth and complexity) but a lot of breadth so they can effectively manage other specialists,
and having great technical communication skills - be able to communicate what you want done and how without over-specifying every detail, or under-specifying tasks in important ways.
>where success is defined by AI made me faster, more efficient etc?
I think this attitude is part of the problem to me; you're not aiming to be faster or more efficient (and using AI to get there), you're aiming to use AI (to be faster and more efficient).
A sincere approach to improvement wouldn't insist on a tool first.
Can't agree with the formula for performance, on the "/ size" part. You can have a huge codebase, but if the complexity goes up with size then you are screwed. Wouldn't a huge but simple codebase be practical and fine for AI to deal with?
The hierarchy of leverage concept is great! Love it. (Can't say I like the "1 bad line of CLAUDE.md is 100K lines of bad code" claim; I've had some bad lines in my CLAUDE.md from time to time - I almost always let Claude write its own CLAUDE.md.)
I mean, there's also the fact that Claude Code injects this system message along with your claude.md, which means that even if your claude.md sucks you will probably be okay:
<system-reminder> IMPORTANT: this context may or may not be relevant to your tasks. You should not respond to this context or otherwise consider it in your response unless it is highly relevant to your task. Most of the time, it is not relevant. </system-reminder>
Lots of others have written about this so I won't go deep, but it's a clear product decision. If you don't know what's in your context window, you can't respond/architect your balance between claude.md and /commands well.
I enjoyed the emphasis on optimising the context window itself. I think that's the most important bit.
An abstraction for this that seems promising to me for its completeness and size is a User Story paired with a research plan(?).
This works well for many kinds of applications and emphasizes shipping concrete business value for every unit of work.
I wrote about some of it here: https://blog.nilenso.com/blog/2025/09/15/ai-unit-of-work/
I also think a lot of coding benchmarks and perhaps even RL environments are not accounting for the messy back and forth of real world software development, which is why there's always a gap between the promise and reality.
I have had a user story and a research plan and only realized deep in the implementation that a fundamental detail about how the code works was missing (specifically, that types and SDKs are generated from the OpenAPI spec) - this omission meant the plan was wrong (didn't read carefully enough) and the implementation was a mess.
Yeah I agree. There's a lot more needed than just the User Story, one way I'm thinking about it is that the "core" is deliverable business value, and the "shells" are context required for fine-grained details. There will likely need to be a step to verify against the acceptance criteria.
I hope to back up this hypothesis with actual data and experiments!
Hello, I noticed your privacy policy is a black page with text seemingly set to 1% or so opacity. Can you get the slopless AI to fix that when time permits?
- Mr. Snarky
They wanted a transparent privacy policy
thank you for the feedback! themes are hard. Update going out now
I'm using GPT Pro and a VS extension that makes it easy to copy code from multiple files at once. I'm architecting the new version of our SaaS and using it to generate everything for me on the backend. It’s a huge help with modeling and coding, though it takes a lot of steering and correction. I think I’ll end up with a better result than if I did it alone, since it knows many patterns and details I’m not aware of (even simple things like RRULE). I’m designing this new project with a simpler, more vertical architecture in the hopes that Codex will be able to create new tables and services easily once the initial structure is ready and well documented.
Edit: typo.
Yeah, flat, simple code is good to start, but I find I'm still developing instincts around the right balance between "when to let duplicate code sprawl" vs. "when to be the DRY police".
Agents get really confused by duplicate code, so I advise DRYing out early and often.
Re the meta of running multiple phases of "document expansion":
Research helps with complex implementations and for brownfield. But it isn't always needed - simple bugfixes can be one-shot!
So all AI workflows could be expressed with some number "N" of "document expansion phases":
N(0): vibe coding.
N(1): "write a spec then implement it while I watch".
N(2): "research then specify". At this point you start to get serious steerability.
What's N(3) and beyond? Strategy docs, industry research, monetization planning? Can AI do these too, all of it ending up in git? Interesting to muse on.
I wrote this blogpost on the same topic: https://getstream.io/blog/cursor-ai-large-projects/
It's super effective with the right guardrails and docs. It also works better on languages like Go instead of Python.
Why do you think Go is better than Python? (I have some thoughts but am curious about your take.)
imo:
1. Go's spec and standard practices are more stable, in my experience. This means the training data is tighter and more likely to work.
2. Go's types give the llm more information on how to use something, versus the python model.
3. Python has been an entry-level accessible language for a long time. This means a lot of the code in the training set is by amateurs. Go, ime, is never someone's first language. So you effectively only get code from someone who already has other programming experience.
4. Go doesn't do much 'weird' stuff. It's not hard to wrap your head around.
Yeah, I love that there is a lot of source data for "what is good idiomatic Go" - the model doesn't have it all in the training set but you can easily collect coding standards for Go with deep research or something.
And then I find models try to write scripts/manual workflows for testing, but Go is REALLY good for doing what you might do in a bash script, and so you can steer the model to build its own feedback loop as a harness in Go integration tests (we do a lot of this in github.com/humanlayer/humanlayer/tree/main/hld).
probably because it's typed?
Among other things: coding agents that can get feedback by running a compile step on top of the linter will tend to produce better output.
Also, strongly-typed languages tend to catch more issues through the language server which the agent can touch through LSP.
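As a rough sketch of that loop (the commands are just examples of a toolchain the agent could run, swap in tsc, cargo check, etc. for your stack):

```python
# Rough sketch: give the agent a cheap, reliable feedback signal by running the
# toolchain after each edit and feeding any errors back into the next prompt.
import subprocess

def toolchain_feedback(repo: str) -> str:
    checks = [["go", "build", "./..."], ["go", "vet", "./..."]]  # example commands
    errors = []
    for cmd in checks:
        result = subprocess.run(cmd, cwd=repo, capture_output=True, text=True)
        if result.returncode != 0:
            errors.append(result.stdout + result.stderr)
    return "\n".join(errors)

# Agent loop (pseudo): propose an edit, call toolchain_feedback(), and if it returns
# anything, put the errors back into the agent's context instead of waiting for review.
```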
Python is strongly typed
Except, of course, for pushing their own product (humanlayer) and some very complex prompt-template+agent setups that are probably overkill for most, the basics in this post about compaction and doing human review at the correct level are pretty good pointers. And giving a bit of a framework to think within is also neat.
Verifying behavior is great and all if you can actually exhaustively test the behaviors of your system. If you can't, then not knowing what your code is actually doing is going to set you back when things do go belly up.
I love this comment because it makes perfect sense today, it made perfect sense 10 years ago, it would have made perfect sense in 1970. The principles of software engineering are not changed by the introduction of commodified machine intelligence.
I 100% agree - the folks who are best at AI-first engineering spend 3 days designing the test harness and then kick off an agent unsupervised for 2+ days and come back to working software.
Not exactly valuable as guidance since programming languages are very easy to verify, but the https://ghuntley.com/ralph post is an example of what's possible on the very extreme end of the spectrum.
As an aside, this single markdown file as an entire GitHub repo is a unique approach to blog posts.
s/unique/lazy
To minimise context bloat and provide more holistic context, as a first step I extract the important elements from a codebase via the AST, which the LLM then uses to determine which files to get in full for a given task.
https://github.com/piqoni/vogte
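A simplified, Python-flavoured illustration of the idea (not the linked tool's actual code):

```python
# Build a lightweight skeleton of the codebase (classes and function signatures only)
# so the LLM can decide which files it needs in full for a given task.
import ast
import pathlib

def outline(path: pathlib.Path) -> str:
    tree = ast.parse(path.read_text())
    lines = [f"# {path}"]
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
    return "\n".join(lines)

# Hand this skeleton to the model first; it replies with the handful of files
# it wants to read in full for the task at hand.
skeleton = "\n\n".join(outline(p) for p in pathlib.Path("src").rglob("*.py"))
```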
I used to do these things manually in Cursor. Then I had to take a few months off programming, and when I came back and updated Cursor I found out that it now automatically does ToDos, as well as keeps track of the context size and compresses it automatically by summarising the history when it reaches some threshold.
With this, I find that most of the shenanigans of manually managing the context window by putting things in markdown files are kind of unnecessary.
You still need to make it plan things, as well as guide the research it does to make sure it gets enough useful info into the context window, but in general it now seems to me like it does a really good job with preserving the information. This is with Sonnet 4
YMMV
I’m not an expert in either language, but seeing a 20k LoC PR go up (linked in the article) would be an instant “lgtm, asshole” kind of review.
> I had to learn to let go of reading every line of PR code
Ah. And I’m over here struggling to get my teammates to read lines that aren’t in the PR.
Ah well, if this stuff works out it’ll be commoditized like the author said and I’ll catch up later. Hard to evaluate the article given the authors financial interest in this succeeding and my lack of domain expertise.
I dunno man, I usually close the PR when someone does that and tell them to make more atomic changes.
Would you trust a colleague who is overconfident, lies all the time, and then pushes a huge PR? I wouldn't.
Closing someone else’s PR is an actively hostile move. Opening a 20k LOC isn’t great either, but going ahead and closing it is rude as hell.
Dumping a huge PR across a shared codebase, wherein everyone else also has to deal with the risk of your monumental changes, is pretty rude as well; I would even go so far as to say that it is likely selfishly risky.
Dumping a 20k LOC PR on somebody to review especially if all/a lot of it was generated with AI is disrespectful. The appropriate response is to kick that back and tell them to make it more digestible.
Opening a 20k LOC PR is an actively hostile move worthy of an appropriate response.
Closed > will not review > make more atomic changes.
A 20k LOC PR isn’t reviewable in any normal workflow/process.
The only moves are refusing to review it, taking it up the chain of authority, or rubber stamping it with a note to the effect that it’s effectively unreviewable so rubber stamping must be the desired outcome.
Sure it is if the great majority is tests.
I don't understand this attitude. Tests are important parts of the codebase. Poorly written tests are a frequent source of headaches in my experience, either by encoding incorrect assumptions, lying about what they're testing, giving a false sense of security, adding friction to architectural changes/refactors, etc. I would never want to review even 2k lines of test changes in one go.
Preach. Also, don't forget making local testing/CI take longer to run, which costs you both compute and developer context switching.
I've heard people rave about LLMs for writing tests, so I tried having Claude Code generate some tests for a bug I fixed in some autosave functionality (every 200ms, the auto-saver should initiate a save if the last change was in the previous 200ms). Claude wrote five tests that each waited 200ms (!), adding a needless full second to the run-time of my test suite.
I went in to fix it by mocking out time, and in the process realized that the feature was doing timestamp comparisons when a simpler, less error-prone approach was to increment a logical clock for each change instead.
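For what it's worth, this is roughly what the fixed version can look like - all the names here are hypothetical, and it assumes a pytest-style suite: the autosaver keeps a logical change counter and the test drives a fake clock instead of sleeping.

    # Hypothetical sketch: test the 200ms autosave rule with a fake clock and a
    # logical change counter instead of real sleeps.
    class FakeClock:
        def __init__(self):
            self.now_ms = 0
        def advance(self, ms):
            self.now_ms += ms

    class AutoSaver:
        def __init__(self, clock, save):
            self.clock, self.save = clock, save
            self.last_change_ms = None
            self.change_count = 0
            self.saved_change = 0

        def record_change(self):
            self.change_count += 1
            self.last_change_ms = self.clock.now_ms

        def tick(self):  # called every 200ms by a scheduler in the real app
            if (self.change_count > self.saved_change
                    and self.clock.now_ms - self.last_change_ms <= 200):
                self.save()
                self.saved_change = self.change_count

    def test_saves_only_for_recent_changes():
        saves = []
        clock = FakeClock()
        saver = AutoSaver(clock, lambda: saves.append(clock.now_ms))
        saver.record_change()
        clock.advance(200)
        saver.tick()          # change was within the previous 200ms -> save
        assert saves == [200]
        clock.advance(200)
        saver.tick()          # nothing new changed -> no extra save
        assert saves == [200]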
The tests I've seen Claude write vary from junior-level to flat-out-bad. Tests are often the first consumer of a new interface, and delegating them to an LLM means you don't experience the ergonomics of the thing you just wrote.
I think the general takeaway from all of this is that the model can write the code but you still have to design it. I don't disagree with anything you've said, and I'd say my advice is: engage more, iterate more, and work in small steps to get the right patterns and rules laid out. It won't work well on day one if you don't set up the right guidelines and guardrails. That's why it's still software engineering, despite being a different interaction medium.
And if the 10k lines of tests are all garbage, now what? Because tests are the 1 place you absolutely should not delegate to AI outside of setting up the boilerplate/descriptions.
If somebody did this, it means they ignored their team's conventions and offloaded work onto colleagues for their own convenience. Being considered rude by the offender is not a concern of mine when dealing with a report who pulls this kind of antisocial crap.
I'm the owner of some of my work projects/repos. I will absolutely without a 2nd thought close a 20k LoC PR, especially an AI generated one, because the code that ends up in master is ultimately my responsibility. Unless it's something like a repo-wide linter change or whatever, there's literally never a reason to have such a massive PR. Break it down, I don't care if it ends up being 200 disparate PRs, that's actually possible to properly review compared to a single 20k line PR.
If this stuff works out, you'll be behind the curve and people who were on the ball will have your job.
It’s refreshing to read a full article that was written by a human. Content +++
> context management, and keeping utilization in the 40%-60% range (depends on complexity of the problem).
Is this a rule of thumb? Will the cheaper (fewer params) models dumb down at 25%?
Using AI to help with code felt like working with a smart but slightly unreliable teammate. If I wasn’t clear, it just couldn’t follow. But once I learned to explain what I wanted clearly and specifically, it actually saved me time and helped me think more clearly too.
I am working on a project with ~200k LoC, entirely written with AI codegen.
These days I use Codex, with GPT-5-Codex + $200 Pro subscription. I code all day every day and haven't yet seen a single rate limiting issue.
We've come a long way. Just 3-4 months ago, LLMs would make a huge mess when faced with a large codebase. They had massive problems with files over 1k LoC (I know, files should never grow this big).
Until recently, I had to religiously provide the right context to the model to get good results. Codex does not need it anymore.
Heck, even UI seems to be a solved problem now with shadcn/ui + MCP.
My personal workflow when building bigger new features:
1. Describe problem with lots of details (often recording 20-60 mins of voice, transcribe)
2. Prompt the model to create a PRD
3. CHECK the PRD, improve and enrich it - this can take hours
4. Actually have the AI agent generate the code and lots of tests
5. Use AI code review tools like CodeRabbit, or recently the /review function of Codex, iterate a few times
6. Check and verify manually - oftentimes there are still a few minor bugs in the implementation, but they can be fixed quickly; sometimes I just create a list of what I found and pass it back for another round of improvements
With this workflow, I am getting extraordinary results.
AMA.
And I assume there's no actual product that customers are using that we could also demo? Because only 1 out of every 20 or so claims of awesomeness actually has a demoable product to back up those claims. The 1 who does usually has immediate problems. Like an invisible text box rendered over the submit button on their Contact Us page preventing an onClick event for that button.
In case it wasn't obvious, I have gone from rabidly bullish on AI to very bearish over the last 18 months. Because I haven't found one instance where AI is running the show and things aren't falling apart in not-always-obvious ways.
I'm kind of in the same boat although the timeline is more compressed. People claim they're more productive and that AI is capable of building large systems but I've yet to see any actual evidence of this. And the people who make these claims also seem to end up spending a ton of time prompting to the point where I wonder if it would have been faster for them to write the code manually, maybe with copilot's inline completions.
I created these demos using real data, real API connections, and real databases, with 100% AI code, at http://betpredictor.io and https://pix2code.com; however, they barely work. At this point, I'm fixing 90% or more of every recommendation the AI gives. With your codebase being this large, you can be guaranteed that the AI will not know what needs to be edited - but I still haven't written one line of code by hand.
I can't reach either site.
pix2code screenshot doesn't load.
Neither site works bro.
It is true AI-generated UIs tend to be... Weird. In weird ways. Sometimes they are consistent and work as intended, but often times they reveal weird behaviors.
Or at least this was true until recently. GPT-5 is consistently delivering more coherent and better working UIs, provided I use it with shadcn or alternative component libraries.
So while you can generate a lot of code very fast, testing UX and UI is still manual work - at least for me.
I am pretty sure, AI should not run the show. It is a sophisticated tool, but it is not a show runner - not yet.
If you tell it to use a standard component library, the UIs should be mostly as coherent as the library.
Nothing much weird about the SwiftUI UIs GPT-5-codex generates for me. And it adapts well to building reusable/extensible components and using my existing components instead of constantly reinventing, because it is good at reading a lot of code before putting in work.
It is also good at refactoring to consolidate existing code for reusability, which makes it easier to extend and change UI in the future. Now I worry less about writing new UI or copy/pasting UI because I know I can do the refactoring easily to consolidate.
Let me summarise your comment in a few words: show me the money. If nobody is buying anything, there is no incremental value creation or augmentation of existing value in the economy that didn't already exist.
It's not the goal to have AI running the show. There's babysitting required, but it works pretty well tbh.
Note: using it for my B2B e-commerce
What is your opinion on the "right level of detail" we should use when creating the technical documents the LLM will use to implement features?
When I started leaning heavily into LLMs I was using really detailed documentation. Not '20 minutes of voice recordings', but my specification documents would easily hit hundreds of lines even for simple features.
The result was decent, but extremely frustrating. Because it would often deliver 80% to 90% but the final 10% to 20% it could never get right.
So, what I naturally started doing was to care less about the details of the implementation and focus on the behavior I want. And this led me to simpler prompts, to the point that I don't feel the need to create a specification document anymore. I just use the plan mode in Claude Code and it is good enough for me.
One way that I started to think about this was that really specific documentation was almost as if I were 'over-fitting' my solution over other technically viable solutions the model could come up with. One example would be sorting an array: I could ask for either "sort the array" or "merge sort the array", and by forcing a merge sort I may end up with a worse solution. Admittedly sort is a pretty simple and unlikely example, but this could happen with any topic. You may ask the model to use a hash set when a better solution would be a bloom filter.
Given all that, do you think investing so much time into your prompts provides a good ROI compared with the alternative of not really min-maxing every single prompt?
I 100% agree with the over-fitting part.
I tend to provide detailed PRDs, because even if the first couple of iterations of the coding agent are not perfect, it tends to be easier to get there (as opposed to starting with a vague prompt and moving on from there).
What I do sometimes is an experimental run - especially when I am stuck. I express my high-level vision, and just have the LLM code it to see what happens. I do not do it often, but it has sometimes helped me get out of being mentally stuck with some part of the application.
Funnily, I am facing this problem right now, and your post might just have reminded me that sometimes a quick experiment can be better than 2 days of overthinking the problem...
This mirrors my experience with AI so far - I've arrived at mostly using the plan and implement modes in Claude Code with complete but concise instructions about the behavior I want with maybe a few guide rails for the direction I'd like to see the implementation path take. Use cases and examples seem to work well.
I kind of assumed that Claude Code is doing most of the things described in this document under the hood (but I really have no idea).
"The result was decent, but extremely frustrating. Because it would often deliver 80% to 90% but the final 10% to 20% it could never get right."
This is everyone's experience if they don't have a vested interest in LLM's, or if their domain is low risk (e.g., not regulated).
> Heck, even UI seems to be a solved problem now with shadcn/ui + MCP.
I'm interested in hearing more about this - any resource you can point me at or do you mind elaborating a bit? TIA!
Basically, you install the shadcn MCP server as described here: https://ui.shadcn.com/docs/mcp
If you use Codex, convert the config to toml:
    [mcp_servers.shadcn]
    command = "npx"
    args = ["shadcn@latest", "mcp"]
Now with the MCP server, you can instruct the coding agent to use shadcn. I often say something like "If you need to add new UI elements, make sure to use shadcn and the shadcn component registry to find the best fitting component".
The genius move is that the shadcn components are all based on Tailwind and get COPIED to your project. 95% of the time, the created UI views are just pixel-perfect, spacing is right, everything looks good enough. You can take it from here to personalize it more using the coding agent.
I've had success here by simply telling Codex which components to use. I initially imported all the shadcn components into my project and then I just say things like "Create a card component that includes a scrollview component and in the scrollview add a table with a dropdown component in the third column"...and Codex just knows how to add the shadcn components. This is without internet access turned on by the way.
Telling which component to use works perfectly too, if you want a very specific look.
> 1. Describe problem with lots of details (often recording 20-60 mins of voice, transcribe)
I just ask it to give me instructions for a coding agent along with a small description of what I want to do; it looks at my code, details what I described as best it can, and usually that's enough to let Junie (JetBrains AI) run on.
I can't personally justify $200 a month; I would need to see seriously strong results for that much. I use AI piecemeal because that has always been the best way to use it. I still want to understand the codebase. When things break, it's mostly on you to figure out what broke.
A small description can be extrapolated to a large feature, but then you have to accept the AI filling in the gaps. Sometimes that is cool, oftentimes it misses the mark. I do not always record that much, but if I have a vague idea that I want to verbalize, I use recording. Then I take the transcript and create the PRD based on it. Then I iterate a few more times on the PRD - which yields much better results.
>I am working on a project with ~200k LoC, entirely written with AI codegen.
I'd love to see the codebase if you can share it. My experience with LLM code generation (I've tried all of the popular models and tools, though I generally favor Claude Code with Opus and Sonnet) leads me to suspect that your ~200k LoC project could be solved in about 10k LoC. Their solutions are unnecessarily complex (I'm guessing because they don't "know" the problem in the way a human does) and that compounds over time. At this point, I would guess my most common instruction to these tools is to simplify the solution. Even when that's part of the plan.
Don't want to come off as combative but if you code every day with codex you must not be pushing very hard, I can hit the weekly quota in <36 hours. The quota is real and if you're multi-piloting you will 100% hit it before the week is over.
Fair enough. I spend entire days working on the product, but obviously there are lots of times I am not running Codex - when reviewing PRDs, testing, talking to users, even posting on HN is good for the quota ;)
On the Pro tier? Plus/Team is only suitable for evaluating the tool and occasional help
Btw one thing that helps conserve context/tokens is to use GPT 5 Pro to read entire files (it will read more than Codex will, though Codex is good at digging) and generate plans for Codex to execute. Tools like RepoPrompt help with this (though it also looks pretty complicated)
Yes, the $200 tier. I do use GPT5/Gemini 2.5 to generate plans that I hand off to codex, that's actually how I keep my agents super busy.
Bracing myself for the inevitability of keeping 3-5 Pro subscriptions at once
I thought about it, but I don't think it's necessary. Grok-4-fast is actually quite a good model, you can just set up a routing proxy in front of codex and route easy queries to it, and for maybe $50/mo you'll probably never hit your GPT plan quota.
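For the curious, a hypothetical sketch of what such a proxy can look like - the base URLs, environment variables, model names, and the "easy query" heuristic are all placeholders, and streaming is ignored for brevity. You'd then point the coding agent at the proxy's base URL instead of the provider's.

    # Hypothetical sketch of a routing proxy: forward OpenAI-compatible chat
    # requests to a cheap model for "easy" prompts, otherwise to the pricey one.
    import os

    import httpx
    from fastapi import FastAPI, Request

    app = FastAPI()

    CHEAP = {"base": "https://api.x.ai/v1", "key": os.environ.get("XAI_API_KEY", ""), "model": "grok-4-fast"}
    PRICEY = {"base": "https://api.openai.com/v1", "key": os.environ.get("OPENAI_API_KEY", ""), "model": "gpt-5"}

    def is_easy(body: dict) -> bool:
        # Placeholder heuristic: short prompts with no tool calls go to the cheap model.
        text = "".join(str(m.get("content", "")) for m in body.get("messages", []))
        return len(text) < 2000 and not body.get("tools")

    @app.post("/v1/chat/completions")
    async def route(request: Request):
        body = await request.json()
        target = CHEAP if is_easy(body) else PRICEY
        body["model"] = target["model"]
        async with httpx.AsyncClient(timeout=600) as client:
            resp = await client.post(
                f"{target['base']}/chat/completions",
                json=body,
                headers={"Authorization": f"Bearer {target['key']}"},
            )
        return resp.json()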
Maybe, but I'd rather pay for consistent access to state of the art quality even if it's slower (which hasn't mattered much while parallelizing)
I can recommend one more thing: tell the LLM frequently to "ask me clarifying questions". It's simple, but the effect is quite dramatic, it really cuts down on ambiguity and wrong directions without having to think about every little thing ahead of time.
When do you do that? You give it the PRD and tell it to ask clarifying questions? Will definitely try that.
The "ask my clarifying questions" can be incredibly useful. It often will ask me things I hadn't thought of that were relevant, and it often suggests very interesting features.
As for when/where to do it? You can experiment. I do it after step 1.
Before or after.
"Here is roughly what I want, ask me clarifying questions"
Now I pick and choose and have a good idea of whether my assumptions and the LLM's assumptions align.
yeah if you read our create_plan prompt, it sets up a 3+ phase back and forth soliciting clarifying questions before the plan is built!
This sounds very similar to my workflow. Do you have pre-commits or CI beyond testing? I’ve started thinking about my codebase as an RL environment with the pre-commits as hyperparameters. It’s fascinating seeing what coding patterns emerge as a result.
I think pre-commit is essential. I enforce conventional commits (+ a hook which limits commit length to 50 chars) and for Python, ruff with many options enabled. Perhaps the most important one is to enforce complexity limits. That will catch a lot of basic mistakes. Any sanity checks that you can make deterministic are a good idea. You could even add unit tests to pre-commit, but I think it's fine to have the model run pytest separately.
The models tend to be very good about syntax, but this sort of linting will often catch dead code like unused variables or arguments.
You do need to rule-prompt that the agent may need to run pre-commit multiple times to verify the changes worked, or to re-add files to the commit. Also, frustratingly, you need to be explicit that pre-commit might fail and that it should fix the errors (otherwise it will sometimes just run it and say "I ran pre-commit!"). For commits there are some other guardrails, like blanket denying git add <wildcard>.
Claude will sometimes complain via its internal monologue when it fails a ton of linter checks and is forced to write complete docstrings for everything. Sometimes you need to nudge it to not give up, and then it will act excited when the number of errors goes down.
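For anyone who wants to try something similar, here's a minimal sketch of that kind of setup - the hook choices and pinned revisions are my assumptions, not necessarily the parent commenter's exact configuration, and the 50-char subject limit would be a small local commit-msg hook you write yourself:

    # .pre-commit-config.yaml - sketch of deterministic guardrails for agent-written commits
    repos:
      - repo: https://github.com/astral-sh/ruff-pre-commit
        rev: v0.6.9            # pin to a current release
        hooks:
          - id: ruff           # lint: unused variables/arguments, complexity, docstrings
          - id: ruff-format    # formatting
      - repo: https://github.com/compilerla/conventional-pre-commit
        rev: v3.4.0            # pin to a current release
        hooks:
          - id: conventional-pre-commit
            stages: [commit-msg]
      # plus a small local commit-msg hook to cap the subject line at 50 chars

    # pyproject.toml (excerpt) - make complexity a hard failure for ruff
    [tool.ruff.lint]
    select = ["E", "F", "ARG", "C901", "D"]
    [tool.ruff.lint.mccabe]
    max-complexity = 10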
Very solid advice. I need to experiment more with the pre-commit stuff, I am a bit tired of reminding the model to actually run tests / checks. They seem to be as lazy about testing as your average junior dev ;)
Yes, I do have automated linting (a bit of a PITA at this scale). On the CI side I am using Github Actions - it does the job, but haven't put much work into it yet.
Generally I have observed that using a statically typed language like TypeScript helps catch issues early on. I had much worse results with Ruby.
Which of these steps do you think/wish could be automated further? Most of the latter ones seem like throwing independent AI reviewers could almost fully automate it, maybe with a "notify me" option if there's something they aren't confident about? Could PRD review be made more efficient if it was able to color code by level of uncertainty? For 1, could you point it to a feed of customer feedback or something and just have the day's draft PRD up and waiting for you when you wake up each morning?
There is definitely way too much plumbing and going back and forth.
But one thing that MUST get better soon is having the AI agent verify its own code. There are a few solutions in place, e.g. using an MCP server to give access to the browser, but these tend to be brittle and slow. And for some reason, the AI agents do not like calling these tools too much, so you kinda have to force them every time.
PRD review can be done, but AI cannot fill the missing gaps the same way a human can. Usually, when I create a new PRD, it is because I have a certain vision in my head. For that reason, the process of reviewing the PRD can be optimized by maybe 20%. Or maybe I struggle to see how tools could make me faster at reading and commenting on / editing the PRD.
Agents __SHOULD NOT__ verify their own code. They know they wrote it, and they act biased. You should have a separate agent with instructions to red team the hell out of a commit, be strict, but not nitpick/bikeshed, and you should actually run multiple review agents with slightly different areas of focus since if you try to run one agent for everything it'll miss lots of stuff. A panel of security, performance, business correctness and architecture/elegance agents (armed with a good covering set of code context + the diff) will harden a PR very quickly.
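A rough sketch of what that panel can look like in practice - everything here is hypothetical: call_llm stands in for whatever model client you actually use, and the focus prompts are just illustrations of the idea.

    # Hypothetical sketch of a review "panel": the same diff goes to several
    # narrowly-focused reviewer prompts, run independently so one agent's blind
    # spots don't dominate.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    FOCUSES = {
        "security": "Review ONLY for security issues: injection, authz, secrets, unsafe deps.",
        "performance": "Review ONLY for performance: N+1 queries, needless allocations, blocking IO.",
        "correctness": "Review ONLY for business-logic correctness versus the linked spec.",
        "architecture": "Review ONLY for architecture: duplication, layering violations, dead code.",
    }

    def call_llm(system: str, user: str) -> str:
        """Placeholder: swap in your actual model client (Claude, GPT-5, etc.)."""
        raise NotImplementedError

    def review_pr(base: str = "origin/main") -> dict[str, str]:
        diff = subprocess.run(
            ["git", "diff", base, "--", "."], capture_output=True, text=True, check=True
        ).stdout
        prompt = f"Be strict but do not nitpick style.\n\nDIFF:\n{diff}"
        with ThreadPoolExecutor() as pool:
            futures = {name: pool.submit(call_llm, focus, prompt) for name, focus in FOCUSES.items()}
        return {name: f.result() for name, f in futures.items()}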
Codex uses this principle - /review runs in a subthread, does not see previous context, only git diff. This is what I am using. Or I open Cursor to review code written by GPT-5 using Sonnet.
Do you have examples of this working, or any best practices on how to orchestrate it efficiently? It sounds like the right thing to do, but it doesn't seem like the tech is quite to the point where this could work in practice yet, unless I missed it. I imagine multiple agents would churn through too many tokens and have a hard time coming to a consensus.
I've been doing this with Gemini 2.5 for about 6 months now. It works quite well; it doesn't catch big architectural issues 100% of the time, but it's very good at line/module-level logic issues and anti-patterns.
Have you considered or tried adding steps to create / review an engineering design doc? Jumping straight from PRD to a huge code change seems scary. Granted, given that it's fast and cheap to throw code away and start over, maybe engineering design is a thing of the past. But still, it seems like it would be useful to have it delineate the high-level decisions and tradeoffs before jumping straight into code; once the code is generated it's harder to think about alternative approaches.
It depends. But let me explain.
Adding an additional layer slows things down. So the tradeoff must be worth it.
Personally, I would go without a design doc, unless you work on a mission-critical feature humans MUST specify or deeply understand. But this is my gut speaking, I need to give it a try!
Yeah I'd love to hear more about that. Like the way I imagine things working currently is "get requirement", "implement requirement", more or less following existing patterns and not doing too much thinking or changing of the existing structure.
But what I'd love to see is, if it has an engineering design step, could it step back and say "we're starting to see this system evolve to a place where a <CQRS, event-sourcing, server-driven-state-machine, etc> might be a better architectural match, and so here's a proposal to evolve things in that direction as a first step."
Something like Kent Beck's "for each desired change, make the change easy (warning: this may be hard), then make the easy change." If we can get to a point where AI tools can make those kinds of tradeoffs, that's where I think things get slightly dangerous.
OTOH if AI models are writing all the code, and AI models have contexts that far exceed what humans can keep in their head at once, then maybe for these agents everything is an easy change. In which case, well, I guess having human SWEs in the loop would do more harm than good at that point.
I have LLMs write and review design docs. Usually I prompt to describe the doc, the structure, what tradeoffs are especially important, etc. Then an LLM writes the doc. I spot check it. A separate LLM reviews it according to my criteria. Once everything has been covered in first draft form I review it manually, and then the cycle continues a few times. A lot of this can be done in a few minutes. The manual review is the slowest part.
How does it compare to Cursor with Claude? I've been really impressed with how well Cursor works, but I'm always interested in up-leveling if there are better tools, considering how fast this space is moving. Can you comment on how Codex performs vs Cursor?
Claude Code is Claude Code, whether you use it in Cursor or not
Codex and Claude code are neck and neck, but we made the decision to go all in on opus 4, as there are compounding returns in optimizing prompts and building intuition for a specific model
That said I have tested these prompts on codex, amp, opencode, even grok 4 fast via codebuff, and they still work decently well
But they are heavily optimized from our work with opus in particular
What do you mean by "compounding returns" here?
What platform are you developing for, web?
Did you start with Cursor and move to Codex or only ever Codex?
Not OP, but I use Codex for back-end, scripting, and SQL, and Claude Code for most front-end. I have found that when one faces a challenge, the other can often punch through and solve the problem. I even have them work together (moving thoughts and markdown plans back and forth) and that works wonders.
My progression: Cursor in '24, Roo Code mid '25, Claude Code in Q2 '25, Codex CLI in Q3 '25.
Cursor for me until 3-4 weeks ago, now Codex CLI most of the time.
These tools change all the time, very quickly. Important to stay open to change though.
Yes, it is a web project with next.js + Typescript + Tailwind + Postgres (Prisma).
I started with Cursor, since it offers a well-rounded IDE with everything you need. It also used to be the best tool for the job. These days Codex + GPT-5-Codex is king. But I sometimes go back to Cursor, especially when reading / editing the PRDs or if I need the occasional 2nd opinion from Claude.
Hey, this sounds a lot like what we have been doing. We would love to chat with you, and share notes if you are up for it!
Drop us an email at navan.chauhan[at]strongdm.com
If it's working for you I have to assume that you are an expert in the domain, know the stack inside and out and have built out non-AI automated testing in your deployment pipeline.
And yes Step 3 is what no one does. And that's not limited to AI. I built a 20+ year career mostly around step 3 (after being biomed UNIX/Network tech support, sysadmin and programmer for 6 years).
Yes, I have over 2 decades of programming experience, 15 years working professionally. With my co-founder we built an entire B2B SaaS, coding everything from scratch, did product, support, marketing, sales...
Now I am building something new but in a very familiar domain. I agree my workflow would not work for your average "vibe coder".
This just won't work beyond a one-person team
Then I will adapt and expand. Have done it before.
I am not giving universal solutions. I am sharing MY solution.
What is the % breakdown of LOC for tests vs application code?
200k LoC + 80k LoC for tests.
I have roughly 2k tests now, but should probably spend a couple of days before production release to double that.
Are you vibe coding or have the 200k LoC been human reviewed?
I would not call it vibe coding. But I do not check all changed lines of code either.
In my opinion, and this is really my opinion, in the age of coding with AI, code review is changing as well. If you speed up how much code can be produced, you need to speed up code review accordingly.
I use automated tools most of the time AND I do very thorough manual testing. I am thinking about a more sophisticated testing setup, including integration tests via using a headless browser. It definitely is a field where tooling needs to catch up.
[flagged]
Strong feelings are fair, but the architect analogy cuts the other way. Architects and civil engineers do not eyeball every rebar or hand compute every load. They probably use way more automation than you would think.
I do not claim this is vibe coding, and I do not ship unreviewed changes to safety critical systems (in case this is what people think). I claim that in 2025 reviewing every single changed line is not the only way to achieve quality at the scale that AI codegen enables. The unit of review is shifting from lines to specifications.
No, they don't check it, because it's already been checked and quality controlled by other people. One person isn't producing every aspect and component of a bridge. It's made by teams of people who thoroughly go through every little detail and check every aspect to make sure that when it's put into production, it will handle the load.
You cannot trust AI, it's as simple as that. It lies, it hallucinates, and it can produce test code that passes when in reality it does nothing you expect it to, even if you detail every little thing. That's a fact.
Before it's too late, come to your senses, dude. I don't even think you believe what you say, because if you do, I'd never want to work with you and neither would many other people. You are making our profession into some kind of toy. Thanks for contributing to the shitshow and making me realise that I have to be very careful about who I work with in the future.
You were never an engineer. I'm 18 years into my career in web and games and I was never an engineer. It's blind people leading blind people, and you're somewhere in the middle, riding the 2013 patterns that got you to this point and the 2024 advancements called "Vibe Coding", and you get paid $$ to make it work.
Building a bridge from steel that lasts 100 years and carries real living people in the tens or hundreds of thousands per day without failing under massive weather spikes is engineering.
[flagged]
It’s unbelievable right? I’m flabbergasted that there are engineers like this shipping code.
We've all been waiting for the other shoe to drop. Everyone points out that reviewing code is more difficult than writing it. The natural question is, if AI is generating thousands of lines of code per day, how do you keep up with reviewing it all?
The answer: you don't!
Seems like this reality will become increasingly justified and embraced in the months to come. Really though it feels like a natural progression of the package manager driven "dependency hell" style of development, except now it's your literal business logic that's essentially a dependency that has never been reviewed.
I don't believe they've shipped yet, based on their comments.
Tools change, standards do not.
My process is probably more robust than simply reviewing each line of code. But hey, I am not against doing it, if that is your policy. I had worked the old-fashioned way for over 15 years, I know exactly what pitfalls to watch out for.
Clearly you don't :)
What does PRD mean? I never heard that acronym before.
Product Requirements Document
It is a fairly standardized way of capturing the essence of a new feature. It covers the most important aspects of what the feature is about: the goals, the success criteria, even implementation details where it makes sense.
If there is interest, I can share the outline/template of my PRDs.
I'd be very interested
There you go: https://gist.github.com/matisojka/aebf75ea33439e540eb0f74026...
Wow, very nice. Thank you. That's very well thought out.
I'm particularly intrigued by the large bold letters: "Success must be verifiable by the AI / LLM that will be writing the code later, using tools like Codex or Cursor."
May I ask, what your testing strategy is like?
I think you've encapsulated a good best practices workflow here in a nice condensed way.
I'd also be interested to know how you handle documentation but don't want to bombard you with too many questions
I added that line because otherwise the LLM would generate goals that are not verifiable in development (e.g. certain pages rendering in <300ms - this is not something you can test on your local machine).
Documentation is a different topic - I have not yet found how to do it correctly. But I am reading about it and might soon test some ideas to co-generate documentation based on the PRD and the actual code. The challenge being, the code normally evolves and drifts away from the original PRD.
I think the only way to keep documentation up-to-date is to have it as part of the PR review process. Knowledge needs to evolve with code.
We're working on this at https://dosu.dev/ (open to feedback!)
https://en.wikipedia.org/wiki/Product_requirements_document
can you expand on how you use shadcn UI with MCP?
I add the MCP server (https://ui.shadcn.com/docs/mcp)
Then I instruct the coding agent to use shadcn / choose the right component from shadcn component registry
The MCP server has a search / discovery tool, and it can also fetch individual components. If you tell the AI agent to use a specific component, it will fetch it (reference doc here: https://ui.shadcn.com/docs/components)
Can we see it?
No, because everyone that claims to have coded some amazing software with AI Code Generator 3000 never seems to share their project. Curious.
Book a demo! Really, it will not be self-service just yet, because it requires a bit of hand-holding in the beginning.
But I am working on making a solid self-service signup experience - might need a couple of weeks to get it done.
But you claim to have AI write it for you? It can't even do a signup page?
[flagged]
Please don't cross into personal attack. Also, please don't post snark to HN threads. This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.
Programming has always had these steps, but traditionally people with different roles would do different parts of it, like gathering requirements, creating product concept, creating development tickets, coding, testing and so on.
[flagged]
It is more than 200k lines of slop. 200k lines of code slop, and 80k lines of test slop.
The problem is the research phase will fail because you can't glean tribal product knowledge from just looking at the code
How granular are the specs? Is it at the level of "this is the code you must write, and here is how to do it", or are you letting AI work some of that out?
Nice - my experience writing production code in a large codebase is here (granted, it's evolved a lot since): https://philippcannons.com/100x-ai-coding-for-real-work-summ...
Not surprisingly, building really fast is not the silver bullet you'd think it is. It's all about what to build and how to distribute it. Otherwise bigcos/billionaires would have armies of engineers growing their net worth to epic scales.
Regarding billionaires having armies of engineers growing their wealth to massive scale: is that not what they have?
My current world view: for monster multiples you need someone who knows how to go 0 to 1, repeatedly. That's almost always only the founder. People after are incremental. If they weren't, they'd just be a founder. Hence why everything is done through acquisitions post-founder. So there's armies of engineers incrementally scaling and maintaining dollars. But not creating that wealth or growing it in a significant % way.
> our team of three is averaging about $12k on opus per month
That's roughly USD 144k per year. Probably low for SF, but a lot in other areas.
You could almost hire a real engineer for that money.
> Heck even Amjad was on a lenny's podcast 9 months ago talking about how PMs use Replit agent to prototype new stuff and then they hand it off to engineers to implement for production.
Please kill me now
I got lectured this week that I wasn't working fast enough because the client had already vibe coded a (broken, non-functional) prototype in under an hour.
They saw the first screen assembled by Replit and figured everything they could see would work with some "small tweaks", which is where I was allegedly to come into the picture.
They continued to lecture me about how the app would need Web Workers for maximum client side performance (explanations full of em-dashes so I knew they were pasting in AI slop at me) and it must all be browser based with no servers because "my prototype doesn't need a server"
Meanwhile their "prototype" had a broken Node.js backend running alongside the frontend listening on a TCP port.
When I asked about this backend, they knew nothing about it but assured me their prototype was all browser-based with no "servers".
Needless to say I'm never taking on any work from that client again, one of the small joys of being a contractor.
Sounds like hell
I created an account to say this: RepoPrompt's 'Context Builder' feature helps a ton with scoping context before you touch any code.
It's kind of like if you could chat with Repomix or Gitingest so they only pull the most relevant parts of your codebase into a prompt for planning, etc
I'm a paying RepoPrompt user but not associated in any other way.
I've used it in conjunction with Codex, Claude Code, and any other code gen tool I have tried so far. It saves a lot of tokens and time (and headaches)
Thanks for sharing. I wonder how you keep the stylistic and mental alignment of the codebase - does this happen during code review, or are there specific instructions at the plan/implement stages?
Honestly, I was reading that article and smelled a sales pitch - and then at the end, of course, there it was.
This is not my experience at all.
I also don’t get the line obsession.
Good code has fewer lines, not more.
It seems like a different universe from the openbsd and the suckless guys.
Doesn't GitHub's new Spec Kit solve this? https://github.com/github/spec-kit
how does this solve it?
Lots of gold in this article. It's like discovering a basket of cheat codes. This will age well.
Great links, BAML is a crazy rabbithole and just found myself nodding along to frequent /compact. These tips are hard-earned and very generously given. Anyone here can take it or leave it. I have theft on my mind, personally. (ʃƪ¬‿¬)
> Within an hour or so, I had a PR fixing a bug which was approved by the maintainer the next morning
An hour for 14 lines of code. Not sure how this shows any productivity gain from AI. It's clear that it's not the code writing that is the bottleneck in a task like this.
Looking at the "30K lines" features, the majority of the 30K lines are either auto-generated code (not by AI), or documentation. One of them is also a PoC and not merged...
The author said he was not a Rust expert and had no prior familiarity with the codebase. An hour for a 14 line fix that works and is acceptable quality to merge is pretty good given those conditions.
When I read about people dumping 2000 lines of code every few days, I'm extremely skeptical about the quality of this code. All the people I've met who worked at this rate were always going for naive solutions and their code was full of hard-to-see bugs which only reared their ugly heads once in a while and were impossible to debug.
We're currently in a transition phase where we're using agentic coding on systems developed with tools and languages designed for humans. Ironically, this makes things unnecessarily hard, as things that are easy for us aren't necessarily easy, or optimal, for agentic coding systems.
People like languages that are expressive and concise. That means they do things like omit types, use type inference, macros, syntactic sugar, allow for ambiguities and all the other stuff that gives us shorter, easier to type code that requires more effort to figure out. A good intuition here might be that the harder the compiler/interpreter has to work to convert it into running/executable code, the harder an LLM will have to work to figure out what that code does.
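To make that intuition concrete, a toy illustration (not a claim about any particular model): the first version leans on the brevity a human enjoys; the second spells out the types and edge-case behavior an agent would otherwise have to reconstruct.

    # Toy illustration: the same function written for humans vs. spelled out.

    # Concise, human-friendly: types, intent, and edge cases are implicit.
    def top(xs, n=3):
        return sorted(xs, key=lambda x: x[1], reverse=True)[:n]

    # Verbose, agent-friendly: explicit types, names, and documented behavior.
    def top_scoring_items(items: list[tuple[str, float]], limit: int = 3) -> list[tuple[str, float]]:
        """Return the `limit` items with the highest score, ordered best-first.

        Each item is a (name, score) pair; an empty input returns an empty list.
        """
        ranked = sorted(items, key=lambda item: item[1], reverse=True)
        return ranked[:limit]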
LLMs don't mind verbosity and spelling things out. Things that are long winded and boring to us are helpful for an LLM. The optimal language for an LLM is going to be different than one that is optimal for a human. And we're not good at actually producing detailed specifications. Programming actually is the job of coming up with detailed specifications. Easy to forget when you are doing that but that's literally what programming is. You write some kind of specification that is then "compiled" into something that actually works as specified.
The solution to agentic coding isn't writing specifications for our specifications. That just moves the problem.
We've had a few decades of practice where we just happen to stuff code into files and use very primitive tools to manipulate those files. Agentic coding uses a few party tricks involving command line tools to manipulate those files and read them one by one into the precious context window. We're probably shoveling too much data around. But since that's the way we store code, there are no better tools to do it.
From having used things like Codex, 99% of what it does is interrogating what's there via tediously slow prodding and poking around the code base using simple command line commands and build tool invocations. It's like watching paint dry. I usually just go off doing something else while it boils the oceans and does god knows what before finally doing the (usually) relatively straightforward thing that I asked it to do. It's easy to see that this doesn't scale that well.
The whole point of a large code base is that it probably won't all fit in the context window. We can try to brute force the problem, or we can try to be more selective. The name of the game here is being able to quickly select just the right stuff to put in there and discard all the rest.
We can either do that manually (tedious and a lot of work, sort of as the article proposes), or make it easier for the LLM to use tools that do that. Possibly a bunch of poorly structured files in some nested directory hierarchy isn't the optimal thing here. Most non AI based automated refactorings require something that more closely resembles the internal data structures of what a compiler would use (e.g. symbol tables, definitions, etc.).
A lot of what an agentic coding system has to do is reconstruct something similar enough to that just so it can build a context in which it can do constructive things. The less ambiguous and more structured that is, the easier the job. The easier we make it to do that, the more it can focus on solving interesting problems rather than getting ready to do that.
I don't have all the answers here but if agentic coding is going to be most of the coding, it makes sense to optimize the tools, languages, etc. for that rather than for us.
TLDR:
We're taking a profession that attracts people who enjoy a particular type of mental stimulation, and transforming it into something that most members of the profession just fundamentally do not enjoy.
If you're a business leader wondering why AI hasn't super charged your company's productivity, it's at least partly because you're asking people to change the way they work so drastically, that they no longer derive intrinsic motivation from it.
Doesn't apply to every developer. But it's a lot.
Why though. Why should we do that?
If AI is so groundbreaking, why do we have to have guides and jump through 3000 hoops just so we can make it work?
Because now your manager will measure you on LoC against other engineers again, and it's only software engineers who worry about complexity, maintainability, and, in short, the health of the very creature that is going to pay your salary.
This is the new world we live in. Anyone who actually likes coding should seriously look for other venues because this industry is for other type of people now.
I use AI in my job. I went from tolerable (not doing anything fancy) to unbearable.
I'm actually looking to become a council employee with a boring job and code my own stuff, because if this is what I have to do moving forward, I'd rather go back to non-coding jobs.
I strongly disagree with this - if anything, using AI to write real production code in a real, complex codebase is MORE technical than just writing software.
Staff/Principal engineers already spend a lot more time designing systems than writing code. They care a lot about complexity, maintainability, and good architecture.
The best people I know who have been using these techniques are former CTOs, former core Kubernetes contributors, have built platforms for CRDTs at scale, and many other HIGHLY technical pursuits.
This is actually where the "myth" of the 10x engineer comes from - there do exist such people and they always could do more than the rest of us ... because they knew what to build. It's not 10K lines of code, it's _the right_ 10K lines of code. Whether using LLMs or LLVM to produce bytes the bytes produced are not the "τέχνη".
That said, I don't think it takes MORE τέχνη to use the machine, merely a distinct ἐμπειρία. That said, both ἐμπειρία and τέχνη aren't σοφία.
why do we have guides and lessons on how to use a chainsaw when we can hack the tree with an axe?
The chainsaw doesn't sometimes chop off your arm when you are using it correctly.
If you swing an axe with a lack of hand eye coordination you don't think it's possible to seriously injure yourself?
Was the axe or the chainsaw designed in such a way that guarantees it will miss the log and hit your hand a fair amount of the time you use it? If it were, would you still use it? Yes, these hand tools are dangerous, but they were not designed in a way that would probably cut off your hand even 1% of the time. "Accidents happen" and "AI slop" are not even remotely the same.
So then with "AI" we're taking a tool that is known to "hallucinate", and not infrequently. So let's put this thing in charge of whatever-the-fuck we can?
I have no doubt "AI" will someday be embedded inside a "smart chainsaw", because we as humans are far more stupid than we think we are.
if nuclear power is so much better than coal, why do we need to learn how to safely operate a reactor just to make it work? Coal is so much easier
Even if we had perfectly human-level AI it'd still need management, just like human workers do, and turns out effective management is actually nontrivial.
I don't want to effectively manage the idiot box
I want to do the work
I refactored CPython using GPT-5, making the compiler bilingual with both English and Portuguese keywords.
https://github.com/ricardoborges/cpython
What web programming task can't GPT-5 handle?
OpenAI Codex has an `update_plan` function[0]. I'm wondering whether switching the implementation to this would improve the coding agent's capabilities, or whether the default is better for its simplicity.
[0]: https://blog.toolkami.com/openai-codex-tools/