This is sort of why I think software development might be the only real application of LLMs outside of entertainment. We can build ourselves tight little feedback loops that other domains can't. I somewhat frequently agree on a plan with an LLM, find out a few minutes or hours later that it doesn't work, and then the LLM is like "that's why we shouldn't have done it like that!". Imagine building a house from scratch and finding out that it was using some American websites to spec out your electrical system, and not noticing the problem until you're installing your Canadian dishwasher.
This goes in the right direction. It could go further though. Types are indeed nice. So why use a language where using them is optional? There are many reasons, but many of those have to do with people and their needs/wants rather than tool requirements. AI agents benefit from good tool feedback, so maybe switch to languages and frameworks that provide plenty of that, and quickly. Switching used to be expensive because you had to do a lot of the work manually. That's no longer true. We can make LLMs do all of the tedious stuff.
That includes using more rigidly typed languages, making sure things are covered with tests, using code analysis tools to spot anti-patterns, addressing all the warnings, etc. That was always a good idea, but we now have even fewer excuses to skip all that.
I like this. "Best practices" are always contingent on the particular constellation of technology out there; with tools that make it super-easy to write code, I can absolutely see 100% coverage paying off in a way that it doesn't for human-written code -- it maximizes what LLMs are good at (cranking out code) while giving them easy targets to aim for with little judgement.
(A thing I think is under-explored is how much LLMs change where the value of tests is. Back in the artisan hand-crafted code days, unit tests were mostly useful as scaffolding: almost all the value I got from them was during the writing of the code. If I'd deleted the unit tests before merging, I'd've gotten 90% of the value out of them. Whereas now, the AI doesn't necessarily need unit tests as scaffolding as much as I do, _but_ having them put in there makes future agentic interactions safer, because they act as reified context.)
It might depend on the lifecycle of your code.
The tests I have for systems that keep evolving while being production critical over a decade are invaluable. I cannot imagine touching a thing without the tests. Many of which reference a ticket they prove remains fixed: a sometimes painfully learned lesson.
Wouldn't a better title be "How we're forcing AI to write good code (because it's normally not that good in general, which is crazy, given how many resources it's sucking, that we need to add an extra layer on top of it and use it to get anything decent)"
Don't forget "we're obligated to try and sell it so here's an ai generated article to fill up our quota because nobody here wanted to actually sit down and write it"
Then it wouldn't be effective advertising/vanity blogging from some self-promoting startup.
Strong agreement with everything in this post.
At Qlty, we are going so far as to rewrite hundreds of thousands of lines of code to ensure full test coverage and end-to-end type checking (including database-generated types).
I’ll add a few more:
1. Zero thrown errors. These effectively disable the type checker and act as goto statements. We use neverthrow for Rust-like Result types in TypeScript.
2. Fast auto-formatting and linting. An AI code review is not a substitute for a deterministic result in sub-100ms to guarantee consistency. The auto-formatter is set up as a post-tool-use Claude hook.
3. Side-effect-free imports and construction. You should be able to load all the code files and construct an instance of every class in your app without a network connection spawning. This is harder than it sounds, and without it you run into all sorts of trouble with the rest.
4. Zero mocks and shared global state. By mocks, I mean mocking frameworks which override functions on existing types or globals. These effectively inject lies into the type checker.
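To make the "zero thrown errors" point concrete, here is a minimal sketch of the idea. It uses a hand-rolled `Result` type so that it runs without dependencies; neverthrow provides a much fuller version of the same concept. The `parsePort` and `describePort` functions and the error shapes are hypothetical names made up for illustration, not anything from Qlty's codebase.

```typescript
// Hand-rolled Result type for illustration only (an assumption, not neverthrow's API).
type Result<T, E> =
  | { ok: true; value: T }
  | { ok: false; error: E };

const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Hypothetical error union: every failure mode is visible in the signature.
type ParsePortError =
  | { kind: "not_an_integer"; input: string }
  | { kind: "out_of_range"; port: number };

function parsePort(input: string): Result<number, ParsePortError> {
  const port = Number(input);
  // Number("") is 0, so reject blank input explicitly.
  if (input.trim() === "" || !Number.isInteger(port)) {
    return err({ kind: "not_an_integer", input });
  }
  if (port < 1 || port > 65535) {
    return err({ kind: "out_of_range", port });
  }
  return ok(port);
}

// The caller has to narrow on `ok` before touching `value`, so the type
// checker forces the failure path to be handled at the call site.
function describePort(input: string): string {
  const result = parsePort(input);
  if (!result.ok) {
    return result.error.kind === "not_an_integer"
      ? `"${result.error.input}" is not a port number`
      : `port ${result.error.port} is out of range`;
  }
  return `listening on ${result.value}`;
}
```

With a thrown error, nothing in the signature of `parsePort` would tell a caller (human or agent) that it can fail; here the compiler rejects code that reads `result.value` without checking `result.ok` first.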
Shout out to tsgo, which has dramatically lowered our type checking latency. As the tok/sec of models keeps going up, all the time is going to get bottlenecked on tool calls (read: type checking and tests).
With this approach we now have near 100% coverage with a test suite that runs in under 1,000ms.
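The side-effect-free construction and zero-mocks points above can be sketched roughly like this. It is only an illustration under assumptions: `Clock`, `FixedClock`, and `TrialPolicy` are hypothetical names, and the approach shown (passing a small interface into the constructor and using a typed fake in tests) is one common way to avoid patching globals such as `Date.now`, not necessarily how Qlty does it.

```typescript
// Hypothetical port: the only way the policy learns what time it is.
interface Clock {
  now(): Date;
}

// Production implementation; reading the system time happens on call,
// not at import or construction time.
class SystemClock implements Clock {
  now(): Date {
    return new Date();
  }
}

// Typed fake for tests. Unlike a mocking framework that overrides a global
// at runtime, the compiler checks that this stays in sync with Clock.
class FixedClock implements Clock {
  private readonly instant: Date;

  constructor(instant: Date) {
    this.instant = instant;
  }

  now(): Date {
    return this.instant;
  }
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

class TrialPolicy {
  private readonly clock: Clock;
  private readonly trialDays: number;

  // Construction only stores its inputs: no I/O, no global state.
  constructor(clock: Clock, trialDays: number) {
    this.clock = clock;
    this.trialDays = trialDays;
  }

  isExpired(startedAt: Date): boolean {
    const elapsedMs = this.clock.now().getTime() - startedAt.getTime();
    return elapsedMs > this.trialDays * MS_PER_DAY;
  }
}
```

In a test, `new TrialPolicy(new FixedClock(someDate), 14)` gives deterministic results with no shared state between test cases, which also makes it safe to run them concurrently.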
I'm on the same page as you: I'm investing in DX, test coverage, and quality tooling like crazy.
But the weird thing is: those things have always been important to me.
And it has always been a good idea to invest in those, for my team and me.
Why am I doing this 200% now?
Answering myself: maybe I feel much more urgency and motivation for this in the age of AI because the effects can be felt so much more acutely and immediately.
If you're like me you're doing it to establish a greater level of trust in generated code. It feels easier to draw out the hard guard-rails and have something fill out the middle -- giving both you, and the models, a reference point or contract as to what's "correct"
For me it's because coworkers are pumping out horrible slop faster than ever before.
I’m increasingly finding that the type of engineer that blogs is not the type of engineer anyone should listen to.
Can you say more? I see a lot of teams struggling with getting AI to work for them. A lot of folks expect it to be a little more magical and "free" than it actually is. So this post is just me sharing what works well for us on a very seasoned eng team.
As someone who struggles to get AI to produce productivity gains (see recent comment history) I appreciate the article.
100% coverage for AI generated code is a very different value proposition than 100% coverage for human generated code (for the reasons outlined in the article).
The value of the blog post is negatively correlated to how good the site looks. Mailing list? Sponsors? Fancy Title? Garbage. Raw HTML dumped on a .xyz domain, Gold!
> on a .xyz domain
That's a negative correlation signal for me (as are all the other weird TLDs that I have not seen besides SEO spam results and perhaps the occasional HN submission.) On the other hand, .com, .net, and .org are a positive signal.
The exception is a front end dev, since that's their bread and butter.
Badgersnake's corollary to Gell-Mann amnesia?
I find that this idea of restricting degrees of freedom is absolutely critical to being productive with agents at scale. Please enlighten us as to why you think this is nonsense
Wearing seatbelts is critical for drunk-driving.
All praise drunk-driving for increased seatbelt use.
Finally something I can get behind.
I'm sad programmers lacking a lot of experience will read this and think it's a solid run-down of good ideas.
"fast, ephemeral, concurrent dev environments" seems like a superb idea to me. I wish more projects would do it, it lowers the barrier to contributions immensely.
I’m more afraid that some manager will read this and impose rules on their team. On the surface, one might think that more test coverage is universally good and not consider the trade-offs. I have a gut feeling that Goodhart’s Law accelerated with AI is a dangerous mix.
What’s bad about them? We make things baby-safe and easy to grasp and discover for LLMs. Understandability and modularity will improve.
Could you be more specific in your feedback please.
100% test coverage, for most projects of modest size, is extremely bad advice.
laziness? unprofessionalism? both? or something else?
all of the above.
Author should ask AI to write a small app with 100% code coverage that breaks in every path except what is covered in the tests.
Example output if anyone else is curious:
I never claim that 100% coverage has anything to do with code breaking. The only claim made is that anything less than 100% does guarantee that some piece of code is not automatically exercised, which we don't allow.
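It's a footnote on the post, but I expand on this with:
> 100% coverage is actually the minimum bar we set. We encourage writing tests for as many scenarios as is possible, even if it means the same lines getting exercised multiple times. It gets us closer to 100% path coverage as well, though we don’t enforce (or measure) that.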
SimpleCov in Ruby has two metrics: line coverage and branch coverage. If you really want to be strict, get to 100% branch coverage. This really helps you flesh out all the various scenarios.
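As a rough illustration of why the two metrics differ, here is a hand-instrumented sketch (in TypeScript rather than Ruby, and not using any real coverage tool). The `branchHits` counters are an assumption standing in for what a coverage tool would track automatically, and `priceInCents` is a made-up function.

```typescript
// Manual counters standing in for a coverage tool's branch tracking.
const branchHits = { member: 0, nonMember: 0 };

// One line of logic, two branches. A single call executes "every line",
// but only one side of the conditional.
function priceInCents(baseCents: number, isMember: boolean): number {
  return isMember
    ? (branchHits.member++, Math.round(baseCents * 0.9))
    : (branchHits.nonMember++, baseCents);
}

// Fraction of branches that have been executed at least once.
function branchCoverage(): number {
  const counts = [branchHits.member, branchHits.nonMember];
  return counts.filter((n) => n > 0).length / counts.length;
}
```

A test suite that only ever calls `priceInCents(1000, true)` would report full line coverage for this function while `branchCoverage()` stays at 0.5; the non-member path only counts once a test actually exercises it.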
I feel this comment is lost on those who have never achieved it and gave up along the journey.
https://logic.inc/
"Ship AI features and tools in minutes, not weeks. Give Logic a spec, get a production API—typed, tested, versioned, and ready to deploy."
https://en.wikipedia.org/wiki/Drinking_the_Kool-Aid
I don't know about all this AI stuff.
How are LLMs going to stay on top of new design concepts, new languages, really anything new?
Can LLMs be trained to operate "fluently" with regard to a genuinely new concept?
I think LLMs are good for writing certain types of "bad code", i.e. if you're learning a new language or trying to quickly create a prototype.
However to me it seems like a security risk to try to write "good code" with an LLM.
They are retrained every 12-24 months and constantly getting new/updated reinforcement learning layers. New concepts are not the problem. The problem is outdated information in the training data, like only crappy old Postgres syntax in most of the Stack Overflow corpus.
> They are retrained every 12-24 months and constantly getting new/updated reinforcement learning layers
This is true now, but it can't stay true, given the enormous costs of training. Inference is expensive enough as is; the training runs are 100% venture capital "startup" funding, and pretty much everyone expects them to go away sooner or later.
Can't plan a business around something that volatile
I suspect it will still fall on humans (with machine assistance?) to move the field forward and innovate, but in terms of training an LLM on genuinely new concepts, they tend to be pretty nimble on that front (in my experience).
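Especially with the massive context windows modern LLMs have. The core idea that the GPT-3 paper introduced was (summarizing):
> A sufficiently large language model can perform new tasks it has never seen using only a few examples provided at inference time, without any gradient updates or fine-tuning.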
You do realise they can search the web? They can read documentation and api specs?