Reverse engineering Codex CLI to get GPT-5-Codex-Mini to draw me a pelican

(simonwillison.net)

75 points | by simonw 7 hours ago

34 comments

f311a 5 hours ago
Installing Rust projects is usually as simple as calling `cargo install`. No need to wait 5-30 minutes for an LLM to figure this out.
People are delegating way too much to LLMs. In turn, this makes your own research or problem-solving skills less sharp.
- simonw 4 hours ago
  Quoting my article:
  > This is a useful starting point for a project like this—in figuring out the compile step the coding agent gets seeded with a little bit of relevant information about the project, and if it can compile that means it can later partially test the code it is writing while it works.
  "Figure out how to build this" is a shortcut for getting a coding agent primed for future work. If you look at the transcript you can see what it did: https://gistpreview.github.io/?ddabbff092bdd658e06d8a2e8f142...
```
   Ran ls
    └ AGENTS.md
      CHANGELOG.md
        +13 lines
      scripts
      sdk
  
   Ran cat README.md
    └ <p align="center"><code>npm i -g @openai/codex</code><br />or <code>brew
      install --cask codex</code></p>
  
        +105 lines
  
      This repository is licensed under the [Apache-2.0 License](LICENSE).
  
   Ran rg -n "rust" -n docs/install.md
    └ 23:curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
      25:rustup component add rustfmt
      26:rustup component add clippy
  
   Ran sed -n '1,200p' docs/install.md
    └ ## Install & build
  
        +36 lines
      cargo test
      ```
  
   I'm gathering the build steps for the Rust tool, focusing on running cargo
    build inside the codex-rs directory. I'll check dependencies and confirm
    the instructions before actually building, aiming to provide a clear summary
    afterward.
  
   Ran cargo build
    └     Updating git repository `https://github.com/nornagon/crossterm`
          Updating git repository `https://github.com/nornagon/ratatui`
        +825 lines
         Compiling codex-cli v0.0.0 (/Users/simon/Dropbox/dev/codex/codex-rs/cli)
          Finished `dev` profile [unoptimized + debuginfo] target(s) in 1m 27s
  
   Ran ls target/debug | head
    └ apply_patch
      apply_patch.d
        +6 lines
      codex-execpolicy
```
That's a decent starting point for seeding the context with information that's relevant to making and then testing the modifications I'm about to ask for.
- adastra22 4 hours ago
  What useful context is in there? How to call “cargo build”? It already knows that.
  - simonw 3 hours ago
    The README files that it looked at, and the directory structure it explored.
    It now knows what the project is, what dependencies it uses, how it's laid out and the set of binaries that it generates.
    Even more importantly: it knows that the project can be built without errors. If it tries a build later and sees an error it will know that the error was caused by code it had modified.
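    A minimal sketch of that baseline idea, assuming a hypothetical pair of helpers (`run_build` and `classify_failure` are illustrative names, not part of Codex CLI): record whether the untouched tree builds, then use that record to attribute any later failure.

```python
import subprocess
from typing import Sequence


def run_build(cmd: Sequence[str], cwd: str = ".") -> bool:
    """Run a build command and report whether it exited cleanly."""
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return result.returncode == 0


def classify_failure(baseline_ok: bool, later_ok: bool) -> str:
    """Attribute a later build result using the known-good baseline."""
    if later_ok:
        return "build still passes"
    if baseline_ok:
        # The tree compiled before any edits, so the edits are the prime suspect.
        return "failure introduced by the modifications"
    # Without a clean baseline the failure cannot be pinned on the edits.
    return "failure predates the modifications"
```

    For the project in the transcript this might be called as `run_build(["cargo", "build"], cwd="codex-rs")` once before any edits and again after each change.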
    - risyachka 2 hours ago
      You could just tell it to check out the README, but I suspect it would have done that anyway, or figured out the type of project and how it is structured, as the first step of any other command you give it. Without that, it is impossible to add to or update the project.
      cadamsdotcom 2 hours ago
      For a Rust developer, neglecting their ability to debug cargo build issues puts their career at risk. For someone like that, letting AI handle it would be a really shortsighted move.
      But Simon isn’t a Rust developer - he’s a motivated individual with a side project. He can now speedrun the part he’s not interested in. That doesn’t affect anyone else’s decisions; you can still choose to learn the details. The ability to skip it if you wish is a huge win for everyone.
      risyachka 23 minutes ago
      > He can now speedrun the part he’s not interested in
      In this case it's more like slowrunning. Building a Rust project is one command, and ChatGPT will tell you that command in 5 seconds.
      Running an agent for that is 1000x less efficient.
      At this point it's not optimizing or speeding things up; it's running an agent for the sake of running an agent.
      simonw 2 hours ago
      The most important thing is to have it successfully build the software, to prove to both me and itself that a clean compile is possible before making any further changes.
- Retr0id 5 hours ago
  Figuring out how to build a project in an unfamiliar language/build system is my least favourite activity, mainly because all the people who are familiar with those tools think it's "as simple as" and don't bother to write it down anywhere. I don't plan on learning every build system ever.
  - heddycrow 5 hours ago
    I did not know how to do X so I Y.
    It would be interesting to know what kinds of responses humans offer across different values of Y such as:
    1) looked on stack overflow 2) googled it 3) consulted the manual 4) asked an LLM 5) asked a friend
    For each of these, does the learner somehow learn something more or better?
    Is there some means of learning that doesn't degrade us as human beings according to those in the know?
    I ask as someone who listens to audiobooks and answers yes when someone asks me if I've read the book. And that's hardly the extent of my transgressions.
    - hrimfaxi 5 hours ago
      At least if you're copy/pasting from stack overflow you presumably glanced at the change you are copying if only to ensure you select the correct text.
      - heddycrow 5 hours ago
        Good point. We also sometimes leave comments in code noting the thread we referenced.
- beklein 2 hours ago
  Approaching this as someone who has never used Rust before, I couldn't find good documentation on how to run an existing Rust project, nor could I find `cargo install` on the "Getting Started" page. I could read the Cargo Book, or check `--help` I guess, but this can be surprisingly time-consuming as well; it might take 5-30 minutes of active searching to locate the information. If you can, try to put yourself in a beginner's mindset and think through your argument again.
  Regarding your second point, I think people actually underutilise LLMs for simple tasks. Delegating these tasks frees up your problem-solving skills for challenges that truly need human insight. In this case, asking an LLM is arguably the smart choice: it's a common task in the training data, easy to verify, low-risk to run, and doing it yourself brings no direct learning or benefit for your original question.
  - aabhay an hour ago
    You don’t need to cargo install anything. You just need cargo itself, which is linked on the main page. Once you have that, here’s an example google search that gives you all the info you need to run the project (hint: `cargo run`)
    https://www.google.com/search?q=how+do+I+run+a+rust+project
- survirtual 12 minutes ago
  Funny you say that, because I have the opposite opinion.
  It is easy for any of us to quickly bootstrap a new project in whatever language. But this takes a cognitive toll, and adds friction to bring our visions online.
  Recently, I went "blind" for a couple of days. My vision was so damaged I could only see blurs. The circumstances of this blindness are irrelevant, but it dawned on me that if I were blind, I could no longer code as I do.
  This realization led me to purchase a Max subscription to Claude Code and rely more on LLMs for building, not less.
  It was much more effective than I thought it would be. In my blurred blindness, I saw blobs of a beautiful user interface form from the skeleton of my Rust backend and Vue3 frontend. It took my (complex) Rust backend and my frontend scaffolding to another level. I could recognize it distinctly through the blur. And it did this in minutes or hours instead of days.
  As my vision returned, I began analyzing what happened and conducting experiments. My attitude changed completely. Instead of doing things myself, directly, I set out to make the LLM do it, even if it took more time.
  It is painful at first. It makes very stupid mistakes that make an experienced engineer snarl at it, "I can do better myself". But my blindness gave me new sight. If I were blind, I couldn't do it myself. I would need help.
  Instead of letting that ego take over, I started patiently understanding how the LLM best operates. I discovered mainly it needs context and specific instructions.
  I experimented with a DSL I made for defining LLM instructions that are more suited for it, and I cannot describe the magic that started unfolding.
  Now, I am writing a massive library of modular instructions for LLMs, and launching them against various situations. They will run for hours uninterrupted and end up implementing full code bases, with complete test suites, domain separation, and security baked in.
  Reviewing their code, it looks better than 90% of what I see people producing. Clear separation of concerns, minimal code duplication, distinct interface definitions, and so much more.
  So now, I have been making it more and more autonomous. It doesn't matter if I could bootstrap a project in 30 seconds. If I spend a few hours perfecting the instructions to the LLM, I can bootstrap ANY project for ANY LANGUAGE, forever.
  And the great thing? I already know the pattern works. At this point, it is foolish for me to do anything other than this.
- lifthrasiir 2 hours ago
  I have used Rust for decades (yeah, Rust is that old) and want to point out that that's not always the case, especially when FFI is involved. At some point, for example, any Rust crate with the `openssl` dependency required special care every time `cargo install` was run. Cargo itself is super nice; other tools, still not so much.
- winternewt 3 hours ago
  I see where you're coming from. But I often find that when I have some idea or challenge that I want to solve, I get bogged down in details (like how do I build that project)... before I even know if the idea I _wanted_ to solve is feasible.
  It's not that I don't care about learning how to build Rust or think that it's too big of a challenge. It's just not the thing I was excited about right now, and it's not obvious ahead of time how sidetracked it will get me. I find that having an LLM just figure it out helps me to not lose momentum.
- caymanjim 4 hours ago
  I would have done the same thing. I know how to build software in a dozen or more languages. I've done it manually, from scratch, in all of them. I don't know Rust. I have no immediate plan to learn Rust. I vaguely know that Cargo is something in the Rust toolbox. I don't have it installed. I don't particularly want to learn anything about it. It's a whole lot easier for me to tell the LLM to figure that out.
  I might learn Rust some day. At the moment, I don't need the mental clutter.
  - adastra22 4 hours ago
    Well, fyi because it is really simple: if you have rust installed, you have cargo installed too. And to run a project you type “cargo run” from the base directory. That is all.
padolsey 2 hours ago
I help run an eval platform and thought it fun to try a bunch of models on this challenge [1].
There are some fun little ones in there. I've no idea what Llama 405B is doing. Qwen 30B A3B is the only one that cutely starts on the landscaping and background. Mistral Large & Nemo are just convinced that front shot is better than portrait. Also interesting to observe varying temperatures.
I feel like this SVG challenge is a pretty good threshold to meet before we start to get too impressed by ARC AGI wins.
[1] https://weval.org/analysis/visual__pelican/f141a8500de7f37f/...
- cubefox an hour ago
  > I feel like this SVG challenge is a pretty good threshold to meet before we start to get too impressed by ARC AGI wins.
  It's a very bad threshold. The models write the plain SVG without looking at the final image. Humans would be awful at it and you would mistakenly conclude that they aren't general intelligences.
  - padolsey an hour ago
    I dunno. A competent human can hold a mental image and work through it. Not too hard with experience. What I generally mean tho is: I don't think we can state the supreme capabilities of AI (which people love to do with great fervour and rhetoric) until they can at the very least draw basic objects in well-known declarative languages. And while it may be unwise to judge an AI based on its ability to count the number of 'R' letters in various words, it -- amongst a wider suite ofc -- remains a good minimum threshold of capability.
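    For a concrete sense of what "drawing basic objects in a declarative language" without seeing the result involves, here is a rough sketch (the helper names and coordinates are made up for illustration, not any model's output): a crude bicycle written purely as SVG text, checked only for well-formedness.

```python
import xml.etree.ElementTree as ET


def circle(cx: int, cy: int, r: int) -> str:
    """Return an unfilled SVG circle element as text."""
    return f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="none" stroke="black"/>'


def line(x1: int, y1: int, x2: int, y2: int) -> str:
    """Return an SVG line element as text."""
    return f'<line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}" stroke="black"/>'


def svg_document(shapes, width: int = 200, height: int = 120) -> str:
    """Wrap shape elements in a minimal standalone SVG document."""
    body = "\n  ".join(shapes)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}" viewBox="0 0 {width} {height}">\n'
        f"  {body}\n</svg>"
    )


# Two wheels and a crossbar, placed by coordinates alone.
bicycle = svg_document([
    circle(50, 90, 25),
    circle(150, 90, 25),
    line(50, 90, 150, 90),
])

# Parsing only confirms the markup is well-formed XML; it says nothing
# about whether the picture looks right, which is the part under debate.
root = ET.fromstring(bicycle)
```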
djmips 4 hours ago
How long before large language models are specifically trained on drawing pelicans riding a bicycle? ( ͡° ͜ʖ ͡°)
- heddycrow 4 hours ago
  And where on the web has someone shared a human effort at doing the same?
  - djmips 4 hours ago
    You could literally hire a human to do that; not everything needs to be on the web.
  - hrimfaxi 4 hours ago
    Where on the web do hallucinations come from?
    - heddycrow 3 hours ago
      I think it's some part of the Dark Web, or I wish it was.
- NitpickLawyer 4 hours ago
  Previous discussions about "pelican on a bicycle" always mention this, but it's not something they can do without being blatantly obvious. You can always do other x riding y tests. A juggler riding a barrel. A bear riding a unicycle. An anteater riding a horse, etc.
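  As a rough sketch of how such variants could be generated (the prompt wording and the `riding_prompts` / `sample_prompts` helpers are assumptions for illustration; only the rider and mount examples come from the comment above):

```python
import itertools
import random

# Nouns carry their own article so "a unicycle" and "an anteater" read correctly.
RIDERS = ["a pelican", "a juggler", "a bear", "an anteater"]
MOUNTS = ["a bicycle", "a barrel", "a unicycle", "a horse"]


def riding_prompts(riders, mounts):
    """Build one prompt for every rider/mount combination."""
    return [
        f"Generate an SVG of {rider} riding {mount}"
        for rider, mount in itertools.product(riders, mounts)
    ]


def sample_prompts(riders, mounts, k, seed):
    """Pick k distinct prompts; a fixed seed makes the pick repeatable."""
    rng = random.Random(seed)
    return rng.sample(riding_prompts(riders, mounts), k)
```

  Keeping the sampled set private, or rotating the seed, is one way to make it harder for any single combination to be targeted in training.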
  - tkgally 2 hours ago
    A couple of days ago, inspired by Simon and those discussions, I had Claude create 30 such tests. I posted a Show HN with the results from six models, but it didn’t get any traction. Here it is again:
    https://news.ycombinator.com/item?id=45845717
    https://gally.net/temp/20251107pelican-alternatives/index.ht...
  - swyx 3 hours ago
    Simon has said multiple times that he has hidden tests he runs for precisely this eventuality (because of course it will happen someday, and he'll write a banger article calling them out for it).
cyrusradfar 4 hours ago
Was a fun idea and fun read. Thank you.
Did you consider expanding the number of models by routing all calls through OpenRouter?
- NitpickLawyer 3 hours ago
  You can already do that through the config file: you can define custom endpoints for any OpenAI-compatible API. So you can use OpenRouter, or even local models via vLLM or alternatives. I think someone even tried to get cheaper pay-as-you-go API usage by hitting their "bulk" API for tasks that run overnight (so no need for immediate responses).
- simonw 4 hours ago
  I haven't tried it myself yet, but it looks like OpenRouter is a supported feature of Codex already: https://github.com/openai/codex/blob/a47181e471b6efe55e95f98...
  - eloycoto 3 hours ago
    It's weird, tools are empty on the HTTP request?