The trick with coding agents is guiding their attention towards tasks you can expect to fit in the agent's token window, and deciding when to delegate. Funnily enough, as a PM you have the exact same problem.
A very simplistic view of the problem domain, IMHO. Yeah, sure, we can add a bunch of functions... ok. But how about snapshotting (or at least working with git), sandboxing at both the process and network level, prompt engineering, detecting when the agent is stuck, and model switching with parallel solvers for better solutions? These are the kinds of things that make coding agents reliable - not function declarations.
We (the Princeton SWE-bench team) built an agent in ~100 lines of code that does pretty well on SWE-bench, you might enjoy it too: https://github.com/SWE-agent/mini-swe-agent
OK that really is pretty simple, thanks for sharing.
The whole thing runs on these prompts: https://github.com/SWE-agent/mini-swe-agent/blob/7e125e5dd49...
Pretty sure you also need about 120 lines of prompting from default.yaml:
https://github.com/SWE-agent/mini-swe-agent/blob/7e125e5dd49...
> Your task: {{task}}. Please reply with a single shell command in triple backticks. To finish, the first line of the output of the shell command must be 'COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT'.
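For a sense of how little machinery sits around prompts like that, here is a rough sketch of the kind of loop they drive. This is not the actual mini-swe-agent code: the OpenAI client, the model name, the step limit, and the exact marker string are stand-ins.

```python
import re
import subprocess

from openai import OpenAI  # any chat-completions client would do

client = OpenAI()
SUBMIT_MARKER = "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT"  # placeholder marker

SYSTEM_PROMPT = (
    "Reply with exactly one shell command in triple backticks. "
    f"When done, make the first line of the command's output be {SUBMIT_MARKER}."
)

def run_agent(task: str, max_steps: int = 30) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Your task: {task}"},
    ]
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=messages,
        ).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})

        # The whole "tool interface" is one fenced shell command per turn.
        match = re.search(r"```(?:bash|sh)?\s*\n(.*?)```", reply, re.DOTALL)
        if not match:
            messages.append({"role": "user", "content": "Reply with one command in triple backticks."})
            continue

        result = subprocess.run(match.group(1), shell=True, capture_output=True, text=True)
        output = result.stdout + result.stderr

        # Termination condition: the first line of the command's output is the marker.
        if output.splitlines() and output.splitlines()[0].strip() == SUBMIT_MARKER:
            return output

        messages.append({"role": "user", "content": f"Command output:\n{output}"})
    return "Step limit reached without submitting."
```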
> 1. Analyze the codebase by finding and reading relevant files
> 2. Create a script to reproduce the issue
> 3. Edit the source code to resolve the issue
> 4. Verify your fix works by running your script again
> 5. Test edge cases to ensure your fix is robust
This prompt snippet from your instance template is quite useful. I use something like this for getting out of debug loops:
> Analyse the codebase and brainstorm a list of potential root causes for the issue, and rank them from most likely to least likely.
Then create scripts or add debug logging to confirm whether your hypothesis is correct. Rule out root causes from most likely to least by executing your scripts and observing the output in order of likelihood.
When a problem is entirely self-contained in a file, it's very easy to edit it with an LLM.
That's not the case with a codebase, where things are littered around in tune with the specific model of organisation the developer had in mind.
Lumpers win again!
https://en.wikipedia.org/wiki/Lumpers_and_splitters
> in tune with the specific model of organisation
You wish
Nice but sad to see lack of tools. Most of your code is about the agent framework instead of being specific to SWE.
I've built a SWE agent too (for fun), check it out => https://github.com/myriade-ai/autocode
> sad to see lack of tools.
Lack of tools in mini-swe-agent is a feature. You can run it with any LLM no matter how big or small.
I'm trying to understand: what has this got to do with LLM size? IMHO, the right tools allow small models to perform better than an undirected tool like bash doing everything. But I understand that this code is meant to show people how function calling is just a template for the LLM.
mini-swe-agent, as an academic tool, is aimed at showing the power of a simple idea and can be easily tested against any LLM. You can go and test it with different LLMs. Tool calls usually didn't work well with smaller LLMs. I don't see many viable alternatives under 7GB, beyond Qwen3 4B, for tool calling.
> the right tools allow small models to perform better than an undirected tool like bash doing everything.
Interestingly enough, the newer mini-swe-agent was a refutation, for very large LLMs, of this hypothesis from the original SWE-agent paper (https://arxiv.org/pdf/2405.15793), which assumed that specialized tools work better.
cheers i'll add it in.
What sort of results have you had from running it on its own codebase?
A very similar "how to guide" can be found here https://ampcode.com/how-to-build-an-agent written by Thorsten Ball. In general Amp is quite interesting - obviously no hidden gem anymore ;-) but great to see more tooling around agentic coding being published. Also, because similar agentic-approaches will be part of (certain/many?) software suits in the future.
Makes sense, the author says he also works at Amp
Ghuntley also works at Amp
If a picture is usually worth 1000 words, the pictures in this are on a 99.6% discount. What the actual...?
It's a conference workshop; these are the slides from the workshop, and the words are a dictation from the delivery.
That seems like a leaky implementation detail to me, for a published piece.
The problem I have with this is that this style of agent design, providing enormous autonomy, makes sense in coding, where an expert human stays in the loop and the agent can self-correct via debugging. What would the other use cases be today for giving an agent this much autonomy, versus a more structured flow, versus something more like LangGraph?
Instead of writing about how to build an agent, show us one project that this agent has built.
Why are any of the tools beyond the bash tool required?
Surely listing files, searching a repo, editing a file can all be achieved with bash?
Or is this what's demonstrated by https://news.ycombinator.com/item?id=45001234?
Technically speaking, you can get away with just a Bash tool, and I had some success with this. It's actually quite interesting to take away tools from agents and see how creative they are with the use.
One of the reasons why you get better performance if you give them the other tools is that there has been some reinforcement learning on Sonnet with all these tools. The model is aware of how these tools work, it is more token-efficient and it is generally much more successful at performing those actions. The Bash tool, for instance, at times gets confused by bashisms, not escaping arguments correctly, not handling whitespace correctly, etc.
> The model is aware of how these tools work, it is more token-efficient and it is generally much more successful at performing those actions.
Interesting! This didn't seem to be the case in the OP's examples - for instance, using a list_files tool and then checking if the JSON result included README, vs bash's `[ -f README ]`.
Having separate tools is simpler than having everything go through bash.
If everything goes through bash then you need some way to separate always safe commands that don't need approval (such as listing files), from all other potentially unsafe commands that require user approval.
If you have listing files as a separate tool then you can also enforce that the agent doesn't list any files outside of the project directory.
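As a sketch of how a dedicated tool makes both of those properties easy to enforce (the schema shape and the `requires_approval` flag are my own invention, not any particular agent's API):

```python
from pathlib import Path

PROJECT_ROOT = Path("/workspace/my-project").resolve()  # hypothetical project root

# Because this tool can only ever list files, it can be marked as safe to auto-approve.
LIST_FILES_TOOL = {
    "name": "list_files",
    "description": "List files in a directory inside the project.",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Directory, relative to the project root."}
        },
    },
    "requires_approval": False,  # our own policy flag, stripped before sending to the model
}

def list_files(path: str = ".") -> list[str]:
    target = (PROJECT_ROOT / path).resolve()
    # The guarantee a free-form bash tool can't easily give: never leave the project directory.
    if target != PROJECT_ROOT and PROJECT_ROOT not in target.parents:
        raise ValueError(f"{path!r} is outside the project directory")
    return sorted(p.name + ("/" if p.is_dir() else "") for p in target.iterdir())
```

With a single bash tool you would instead have to classify arbitrary command strings to get the same two guarantees.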
> you need some way to separate always safe commands that don't need approval (such as listing files), from all other potentially unsafe commands that require user approval.
This is a very strong argument for more specific tools, thanks!
Why do humans need an IDE when we could do everything in a shell? An interface gives you the information you need at a given moment and the actions you can take.
To me a better analogy would be: if you're a household of 2 who own 3 reliable cars, why would you need a 4th car with smaller cargo & passenger capacities, higher fuel consumption, worse off-road performance and lower top speed?
>Why are any of the tools beyond the bash tool required?
My best guess is they started out with a limited subset of tools and realised they can just give it bash later.
This is explained in section 3.2, "How to design good tools?".
I'm not sure where this quote is from - it doesn't seem to appear in the linked article.
I hate to do meta-commentary (the content is a decent beginner level introduction to the topic!), but this is some of the worst AI-slop-infused presentation I've seen with a blog post in a while.
Why the unnecessary generated AI pictures in between?
Why put everything that could have been a bullet point into its own individual picture (even if it's not AI-generated)? It's very visually distracting, it breaks the flow of reading, and it's less accessible as all the pictures lack alt-text.
---
I see that it's based on a conference talk, so it's possibly just a 1:1 copy of the slides. If that's the case, please put it up in its native conference format rather than this.
Wow. Yeah. That's unreadable - my frustration and annoyance levels got high fast, had to close the page before I went for the power button on my machine :)
Agreed. It's unreadable.
What's the best current CLI (with a non-interactive option) that is on par with Claude Code but can work with other LLMs via Ollama, OpenRouter, etc.? I tried stuff like Aider, but it cannot discover files, and the open-source Gemini one, but it was terrible; what is a good one that is maybe on par with CC if you plug in Opus?
Haven’t tried many but the LLM cli seems alright to me
Building a coding agent involves defining clear goals, leveraging AI, and iterating based on feedback. Start with a simple task and scale up.
Exactly my approach to gaining knowledge: learning through building your own (`npx genaicode`). When I was presenting my work at a local meetup I got this exact question: "why are you building this instead of just using Cursor?" The answer is explained in this article (tl;dr: transformative experience), even though some parts of it are already outdated or will be outdated very soon, as the technology is making progress every day.
Where is the program synthesis? My way of thinking is: given primitives as tools, I want the model to construct and return the program to execute.
Of course, following the *nix philosophy is another way.
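A rough sketch of that idea, with everything here (primitive names, the generated program) purely hypothetical: expose a few primitives and have the model return one program over them instead of a sequence of tool calls.

```python
import pathlib

# Primitive tools the model is allowed to compose.
def read_file(path: str) -> str:
    return pathlib.Path(path).read_text()

def write_file(path: str, content: str) -> None:
    pathlib.Path(path).write_text(content)

def run_generated_program(program_source: str) -> None:
    """Run a model-generated program with only the primitives in scope.

    Note: exec() is not a sandbox; a real system would also isolate the process.
    """
    scope = {"__builtins__": {}, "read_file": read_file, "write_file": write_file}
    exec(program_source, scope)

# The kind of program the model might return for "replace TODO with DONE in the README".
generated = '''
text = read_file("README.md")
write_file("README.md", text.replace("TODO", "DONE"))
'''
run_generated_program(generated)
```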
Sonnet does this via the edit tool and bash tool. It's built into the model.
Interesting.
I really think the current trend of CLI coding agents isn't going to be the future. They're cool but they are _too simple_. Gemini CLI often makes incorrect edits and gets confused, at least on my codebase. Just like ChatGPT would do in a longer chat where the context gets lost: random, unnecessary and often harmful edits are made confidently. Extraneous parts of the codebase are modified when you didn't ask for it. They get stuck in loops for an hour trying to solve a problem, "solving it", and then you have to tell the LLM the problem isn't solved, the error message is the same, etc.
I think the future will be dashboards/HUDs (there was an article on HN about this a bit ago and I agree). You'll get preview windows, dynamic action buttons, a kanban board, status updates, and still the ability to edit code yourself, of course.
The single-file lineup of agentic actions with user input, in a terminal chat UI, just isn't gonna cut it for more complicated problems. You need faster error reporting from multiple sources, and you need to be able to correct the LLM and break it out of error loops. You won't want to be at the terminal, even though it feels comfortable, because it's just the wrong HCI tool for more complicated tasks. Can you tell I really dislike using these overly simple agents?
You'll get a much better result with a dashboard/HUD. The future of agents is that multiple of them will be working at once on the codebase, and they'll be good enough that you'll want more of a status-update-and-confirm loop than an agentic code-editing tool.
Also required is better code editing. You want to avoid the LLM making changes in your code unrelated to the requested problem. Gemini CLI often does a grep for keywords in your prompt to find the right file, but your prompt was casual and doesn't contain the right keywords, so you end up with the agent making changes that weren't intended.
Obviously I am working in this space so that's where my opinions come from. I have a prototype HUD-style webapp builder agent that is online right now if you'd like to check it out:
https://codeplusequalsai.com/
It's not got everything I said above - it's a work in progress. Would love any feedback you have on my take on a more complicated, involved, and narrow-focus agentic workflow. It only builds Flask webapps right now, with strict limits on what it can do (no cron etc. yet), but it does have a database you can use in your projects. I put a lot of work into the error flow as well, as that seems like the biggest issue with a lot of agentic code tools.
One last technical note: I blogged about using AST transformations when getting LLMs to modify code. I think that using diffs or rewriting the whole file isn't the right solution either. I think that having the LLM write code that modifies your code, and then running that code to effect the modifications, is the way forward. We'll see, I guess. Blog post: https://codeplusequalsai.com/static/blog/prompting_llms_to_m...
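For anyone who wants the flavour of the AST approach without reading the post, here is a minimal sketch, assuming Python and the standard `ast` module; the rename transformation is a toy stand-in for whatever the LLM would actually emit.

```python
import ast

# A toy transformation of the kind an LLM could write: rename a function and every
# call to it, instead of sending a diff or rewriting the whole file by hand.
class RenameFunction(ast.NodeTransformer):
    def __init__(self, old: str, new: str):
        self.old, self.new = old, new

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.AST:
        if node.name == self.old:
            node.name = self.new
        self.generic_visit(node)
        return node

    def visit_Call(self, node: ast.Call) -> ast.AST:
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            node.func.id = self.new
        self.generic_visit(node)
        return node

source = open("app.py").read()                    # hypothetical target file
tree = RenameFunction("old_name", "new_name").visit(ast.parse(source))
open("app.py", "w").write(ast.unparse(tree))      # ast.unparse needs Python 3.9+
```

One caveat with plain `ast` is that unparsing drops comments and formatting, which is why tools in this space often reach for libcst instead.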
> Gemini CLI often makes incorrect edits and gets confused
Gemini CLI still uses the archaic whole-file format for edits; it's not a good representative of the current state of coding agents.
I'm not sure what you mean by "whole file format", but if it refers to the write_file tool that overwrites the whole file, there is also the replace tool, which was apparently inspired by a blog post [1] by Anthropic. It seems that Claude Code also supports a roughly identical tool (inferred from error messages), so editing tools can't be the reason why Claude Code is good.
[1] https://www.anthropic.com/engineering/swe-bench-sonnet
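For anyone curious, the replace-style edit described in that post boils down to something like this sketch (my own simplification, not the actual tool): the model supplies an exact old snippet plus its replacement, and the edit fails loudly rather than guessing.

```python
from pathlib import Path

def str_replace_edit(path: str, old: str, new: str) -> str:
    """Apply a targeted edit; refuse to guess if the target is missing or ambiguous."""
    text = Path(path).read_text()
    count = text.count(old)
    if count == 0:
        return "error: old snippet not found; re-read the file and try again"
    if count > 1:
        return f"error: old snippet matches {count} places; add more surrounding context"
    Path(path).write_text(text.replace(old, new, 1))
    return "ok"
```

The error strings double as feedback the model can act on in its next turn.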
Many agents can send diffs. Whole file reading and writing burns tokens and pollutes context.
Oh that's wild, I did suspect that but didn't know it outright. Mind-blowing Google would release that kind of thing, I had wondered why it sucked so much haha. Okay so what is a good representation of the current state of coding agents? Which one should I try that does a better job at code modifications?
Claude Code is the strongest atm, but Roo Code or Cline (VS Code extensions) can also work well. Roo with gpt-5-mini (so cheap, pretty fast) does diff-based edits with good coordination over a task, and finishes most tasks that I tried. It even calls them "surgical diffs" :D
Claude Code (with a Max subscription), cursor-agent (with usage-based pricing)
You are wasting your time and everyone else's with Gemini; it is the worst.
Oh I don’t use Gemini! I did try it out and admittedly formed an opinion too narrow on cli agents. But no way do I actually use Gemini.
Anyone can build a coding agent that works on a) a fresh codebase, b) an unlimited token budget.
Now build it for an old codebase, and let's see how precisely it edits or removes features without breaking the whole codebase.
Let's see how many tokens it consumes per bug fix or feature addition.
This comment belongs in a discussion about using LLMs to help write code for large existing systems - it's a bit out of place in a discussion about a tutorial on building coding agents to help people understand how the basic tools-in-a-loop pattern works.
Anyone who has used those coding agents can already see how they work: you can usually see the agent fetching files, running commands, and listing files and directories.
I just wrote this comment so people aren't under the false belief that this is pretty much all coding agents do; making all of this fault-tolerant with good UX is a lot of work.
Agree. To reduce costs:
1. Precompute frequently used knowledge and surface it early: for example, repository structure, OS information, system time.
2. Anticipate the next tool calls. If a match is not found while editing, instead of simply failing, return the closest matching snippet (see the sketch below). If the read-file tool gets a directory, return the directory contents.
3. Parallel tool calls. Claude needs either a batch tool or special scaffolding to promote parallel tool calls; a single tool call per turn is very expensive.
Are there any other such general ideas?
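Here's the sketch for point 2, using difflib to hand back the closest snippet instead of a bare failure (the helper name and return shape are made up for illustration):

```python
import difflib

def replace_with_hint(file_text: str, old: str, new: str) -> tuple[str, str]:
    """Exact replace if possible; otherwise return the closest snippet as a hint."""
    if old in file_text:
        return file_text.replace(old, new, 1), "ok"

    # Anticipate the model's next move: rather than a bare "not found", return the
    # nearest candidate so the follow-up edit can be corrected in a single turn.
    lines = file_text.splitlines()
    window = max(1, len(old.splitlines()))
    candidates = ["\n".join(lines[i:i + window])
                  for i in range(max(1, len(lines) - window + 1))]
    best = difflib.get_close_matches(old, candidates, n=1, cutoff=0.0)
    hint = best[0] if best else ""
    return file_text, f"error: no exact match; closest existing snippet:\n{hint}"
```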
That info can just be included in the prefix, which is cached by the LLM, reducing cost by 70-80% on average. System time varies, so it's not a good idea to specify it in the prompt; better to make a function out of it to avoid cache invalidation.
I am still looking for a good "memory" solution; so far I'm running without one. Haven't looked too deeply into it.
Not sure how the next tool call can be predicted.
I am still using serial tool calls as I do not have any subagents; I just use fast inference models for direct tool calls. It works so fast, I doubt I'll benefit from parallelizing anything.
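A small sketch of that prefix-vs-function split (the tool schema is illustrative, not any provider's exact format): keep everything static in the prompt so the cached prefix never changes, and expose anything volatile, like the clock, as a tool.

```python
import datetime
import platform

# Static facts go into the prompt prefix once; because the bytes are identical on
# every request, the provider's prompt cache keeps hitting.
STATIC_PREFIX = (
    f"OS: {platform.system()} {platform.release()}\n"
    "Repository structure:\n"
    "<precomputed tree goes here>\n"
)

# Volatile facts are fetched through a tool call instead of being baked into the
# prefix, so they never invalidate the cache.
def get_current_time() -> str:
    return datetime.datetime.now().isoformat(timespec="seconds")

TIME_TOOL = {
    "name": "get_current_time",
    "description": "Return the current local time. Call this instead of assuming it.",
    "input_schema": {"type": "object", "properties": {}},
}
```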
There's "swe re-bench", a benchmark that tracks model release dates, and you can see how the model did for "real-world" bugs that got submitted on github after the model was released. (obviously works best for open models).
There are a few models that solve 30-50% of (new) tasks pulled from real-wolrd repos. So ... yeah.
Surprise: as a rambunctious dev who's socially hacked their way through promotions, I will just convince our manager we need to rewrite the platform in a new stack, or convince them that I need to write a new server to handle the feature. No old tech needed!
For me, the post is missing an explanation of the reason why I would want to build my own coding agent instead of just using one of the publicly available ones.
Knowing how to build your own agent, and what that loop is, is going to be the new whiteboard coding question in a couple of years. Absolutely. It's going to be the same as "Reverse this string", "I've got a linked list, can you reverse it?", or "Here's my graph, can you traverse it?"
I see, thanks. I was wondering earlier if there would be any practical advantage in creating a custom agent, but couldn't think of any. I guess I simply misunderstood the purpose of your post.
You wouldn't.
This project and this post are for the curious and for the learners.