49 comments

  • nyanpasu64 2 days ago ago

    > Most code doesn't express subtle logic paths. If I test if a million inputs are correctly sorted, I've probably implemented the sorter correctly.

    I don't know if this was referring to Zopfli's sorter or sorting in general, but I have heard of a subtle sorting bug in Timsort: https://web.archive.org/web/20150316113638/http://envisage-p...

    • klabb3 a day ago ago

      > Most code doesn't express subtle logic paths. If I test if a million inputs are correctly sorted, I've probably implemented the sorter correctly.

      This just rings of famous last words to me. There are many errors that pass this test. Edge cases in arbitrary code are not easy to catch.

      Makes me wonder how fuzzers do it. Just random data? How guided is it?

      • _flux a day ago ago

        Modern fuzzers try to modify the input so that code travels through as many different paths as possible.

        One of the better known "new gen fuzzers" is AFL. Wikipedia has a high-level overview of its fuzzing algorithm https://en.wikipedia.org/wiki/American_Fuzzy_Lop_(software)#...

        With AFL you can use a JPEG decoder and come up with a "valid" JPEG picture, i.e. one acceptable by the decoder: https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-th...
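
        To make the "as many different paths as possible" part concrete, here's a toy sketch in Rust of the core idea -- not how AFL is actually implemented, just the shape of it: mutate known-interesting inputs and keep any mutant that reaches a branch you haven't seen before. The target function below is a made-up stand-in for an instrumented program.

            use std::collections::HashSet;
            use rand::Rng;

            // Toy "program under test": returns an id for the deepest branch reached.
            // Real fuzzers get this signal from compile-time instrumentation instead.
            fn target(input: &[u8]) -> u32 {
                if input.first() == Some(&b'Z') {
                    if input.get(1) == Some(&b'I') {
                        if input.get(2) == Some(&b'P') {
                            return 3; // the interesting deep branch
                        }
                        return 2;
                    }
                    return 1;
                }
                0
            }

            // Keep a corpus of inputs that each reached something new, and mutate them.
            fn fuzz(mut corpus: Vec<Vec<u8>>, iterations: usize) -> Vec<Vec<u8>> {
                let mut rng = rand::thread_rng();
                let mut seen: HashSet<u32> = HashSet::new();
                for i in 0..iterations {
                    let mut candidate = corpus[i % corpus.len()].clone();
                    let pos = rng.gen_range(0..candidate.len());
                    candidate[pos] = rng.gen(); // crude single-byte mutation
                    if seen.insert(target(&candidate)) {
                        corpus.push(candidate); // new branch reached: keep this input
                    }
                }
                corpus
            }

            fn main() {
                // Seed corpus: one boring input. The loop climbs toward "ZIP" step by step.
                let corpus = fuzz(vec![vec![0u8; 3]], 200_000);
                println!("corpus grew to {} entries", corpus.len());
            }

        A purely random fuzzer has to guess all three magic bytes at once (roughly a 1-in-16-million shot per try); the guided loop climbs one branch at a time, which is the same dynamic that lets AFL pull valid JPEGs out of thin air.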

        • klabb3 21 hours ago ago

          Thanks. This is pretty damn cool, and sounds much more useful than random for real-world use cases.

          Question: does this work for interpreted languages? Or is this an assembly only thing?

          • _flux 20 hours ago ago

            I suppose it does work for interpreted languages (you just need to define what counts as success and what counts as failure), but with AFL the evaluation might be too far removed from the actual branching for it to be effective, possibly critically so. Additionally, in fuzzing the bottleneck is running the program millions of times (though maybe not billions), so the slower your function is, the slower the fuzzing will be.

            I think though the same approach could be used with interpreters, and I expect it would be easier to do there. E.g. for Python there is this, but I haven't tried it: https://github.com/jwilk/python-afl

    • rjpower9000 2 days ago ago

      Thanks for sharing, I did not know about that!

      Indeed, this is exactly the type of subtle case you'd worry about when porting. Fuzzing would be unlikely to discover a bug that only occurs on giant inputs or needs a special configuration of lists.

      In practice I think it works out okay because most of the time the LLM has written correct code, and when it doesn't, it's introduced a dumb bug that's quickly fixed.

      Of course, if the LLM introduces subtle bugs, that's even harder to deal with...

      • hansvm 2 days ago ago

        > most of the time the LLM has written correct code [...dumb bugs]

        What domain do you work in?

        I hope I'm just misusing the tool, but I don't think so (math+ML+AI background, able to make LLMs perform in other domains, able to make LLMs sing and dance for certain coding tasks, have seen other people struggle in the same ways I do trying to use LLMs for most coding tasks, haven't seen evidence of anyone doing better yet). On almost any problem where I'd be faster letting an LLM attempt it rather than just banging out a solution myself, it only comes close to being correct with intensive, lengthy prompting -- after much more effort than just typing the right thing in the first place. When it's wrong, the bugs often take more work to spot than to just write the right thing, since you have to carefully scrutinize each line anyway while simultaneously reverse engineering the rationale for each decision. For example:

        - The API is structured and named such that you expect pagination to be handled automatically, but that's actually an additional requirement the caller must handle, leading to incomplete reads which look correct in prod ... till they aren't.

        - When moving code from point A to point B it removes a critical safety check, but the git diff is next to useless, so you have to hand-review that sort of tedium and actually analyze every line instead of trusting the author when they say that a certain passage is a copy-paste job.

        - It can't automatically pick up on the local style (even when explicitly prompted as to that style's purpose) and requires a hand-curated set of examples to figure out what a given comptime template should actually be doing, violating all sorts of invariants in the generated code, like running blocking syscalls inside an event loop implementation but using APIs which make doing so _look_ innocuous.

        I've shipped a lot of (curated, modified) LLM code to prod, but I haven't yet seen a single model or wrapper around such models capable of generating nearly-correct code "most" of the time.

        I don't doubt that's what you've actually observed though, so I'm passionately curious where the disconnect lies.

        • rjpower9000 a day ago ago

          I might have phrased this unclearly: I meant specifically for the case of translating one symbol at a time from C to Rust. I certainly won't claim I've figured out any magic that makes the coding agents consistent!

          Here you've got the advantage that you're repeating the same task over and over, so you can tweak your prompt as you go, and you've got the "spec" in the form of the C code there, so I think there's less to go wrong. It still did break things sometimes, but the fuzzing often caught it.

          It does require careful prompting. In my first attempt Claude decided that some fields in the middle of an FFI struct weren't necessary. You can imagine the joy of trying to debug how a random pointer was changing to null after calling into a Rust routine that didn't even touch it. It was around then I knew the naive approach wasn't going to work.

          The second attempt thus had a whole bunch of "port the whole struct or else" in the prompt: https://github.com/rjpower/zopfli/blob/master/port/RUST_PORT... .
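
          To make the failure mode concrete: at the FFI boundary, the Rust struct has to mirror the C layout field for field. A minimal sketch of what "port the whole struct" means, with a made-up struct rather than Zopfli's actual one:

              // Hypothetical C struct being mirrored:
              //   typedef struct { int* litlens; int block_type; unsigned short* dists; size_t size; } ExampleStore;
              // #[repr(C)] pins the field order and layout so C callers see what they expect.
              #[repr(C)]
              pub struct ExampleStore {
                  pub litlens: *mut libc::c_int,
                  // If the LLM decides this field "isn't needed" and drops it, everything
                  // after it shifts, and the C side reads garbage / null pointers out of `dists`.
                  pub block_type: libc::c_int,
                  pub dists: *mut libc::c_ushort,
                  pub size: libc::size_t,
              }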

          In general I've found the agents to be a mixed bag, but overall positive if I use them in the right way. I find it works best for me if I use the agent as a sounding board to write down what I want to do anyway. I then have it write some tests for what should happen, and then I see how far it can go. If it's not doing something useful, I abort and just write things myself.

          It does change your development flow a bit, for sure. For instance, it's so much more important to have concrete test cases to force the agent to get it right; as you mention, otherwise it's easy for it to do something subtly broken.

          For instance, I switched to tree-sitter from the clang API to do symbol parsing, and Claude wrote effectively all of it; in this case it was certainly much faster than writing it myself, even if I needed to poke it once or twice. This is sort of a perfect task for it though: I roughly knew what symbols should come out and in what order, so it was easy to validate the LLM was going in the right direction.

          I've certainly had them go the other way, reporting back that "I removed all of the failing parts of the test, and thus the tests are passing, boss" more times than I'd like. I suspect the constrained environment again helped here: there's less wiggle room for the LLM to misinterpret the situation.

      • awesome_dude 2 days ago ago

        > Fuzzing would be unlikely to discover a bug that only occurs on giant inputs or needs a special configuration of lists.

        I have a concern about people's overconfidence in fuzz testing.

        It's a great tool, sure, but all it does is select (and try) inputs at random from the set of all possible inputs that can be generated for the API.

        For a strongly typed system that means randomly selecting ints from all the possible ints for an API that only accepts ints.

        If the API accepts any group of bytes possible, fuzz testing is going to randomly generate groups of bytes to try.

        The only advantage this has over other forms of testing is that it's not constrained by people thinking "Oh these are the likely inputs to deal with"

        • xyzzy123 2 days ago ago

          This is not quite true; what you are describing is "dumb" fuzzing. Modern fuzzers are coverage guided and will search for and devote more effort to inputs which trigger new branches / paths.

          https://afl-1.readthedocs.io/en/latest/about_afl.html

          But yeah in general path coverage is hard and fuzzing works better if you have a comprehensive corpus of test inputs.
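
          For reference, this is roughly what a coverage-guided target looks like with cargo-fuzz. A minimal sketch; the my_decoder crate and the "decode" target name are hypothetical stand-ins for whatever library you're fuzzing:

              // fuzz/fuzz_targets/decode.rs
              #![no_main]
              use libfuzzer_sys::fuzz_target;

              fuzz_target!(|data: &[u8]| {
                  // The fuzzer mutates `data`, keeps any input that reaches new coverage,
                  // and reports inputs that panic or trip the sanitizers.
                  let _ = my_decoder::decode(data); // hypothetical function under test
              });

          You run it with "cargo fuzz run decode", and anything you drop into fuzz/corpus/decode/ is used as seed input, which is where having a comprehensive corpus pays off.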

  • comex 2 days ago ago

    Interesting! But there’s a gap between aspirations and what was accomplished here.

    Early on in the blog post, the author mentions that "c2rust can produce a mechanical translation of C code to Rust, though the result is intentionally 'C in Rust syntax'". The flow of the post seems to suggest that LLMs can do better. But later on, they say that their final LLM approach produces Rust code which “is very 'C-like'" because "we use the same unsafe C interface for each symbol we port”. Which sounds like they achieved roughly the same result as c2rust, but with a slower and less reliable process.

    It’s true that, as the author says, “because our end result has end-to-end fuzz tests and tests for every symbol, its now much easier to 'rustify' the code with confidence". But it would have been possible to use c2rust for the actual port, and separately use an LLM to write fuzz tests.

    I'm not criticizing the approach. There's clearly a lot of promise in LLM-based code porting. I took a look at the earlier, non-fuzz-based Claude port mentioned in the post, and it reads like idiomatic Rust code. It would be a perfect proof of concept, if only it weren't (according to the author) subtly buggy. Perhaps there's a way to use fuzzing to remove the bugs while keeping the benefits compared to mechanical translation. Unfortunately, the author's specific approach to fuzzing seems to have removed both the bugs and the benefits. Still, it's a good base for future work to build on.

    • rjpower9000 2 days ago ago

      It's in between. It's more C-like than the Claude port, but it's more Rust-y than c2rust. How much depends on how fine-grained you want to make your port and how you want to prompt your LLM. For inside of functions and internal symbols, the LLM is free to use more idiomatic constructs and structures. But since the goal was to test the effectiveness of the fuzz testing, using the LLM to do the symbol translation is more of an implementation detail.

      You could certainly try using c2rust to do the initial translation, and it's a reasonable idea, but I didn't find the LLMs really struggled with this part of the task, and there's certainly more flexibility this way. c2rust seemed to choke on some simple functions as well, so I didn't pursue it further.

      And of course for external symbols, you're constrained by the C API, so how much leeway you have depends on the project.

      You can also imagine having the LLM produce more idiomatic code from the beginning, but that can be hard to square with the incremental symbol-by-symbol translation.

  • amw-zero 2 days ago ago

    There are 2 main problems in generative testing:

    - Input data generation (how do you explore enough of the program's behavior to have confidence that your test is a good proxy for total correctness)

    - Correctness statements (how do you express whether or not the program is correct for an arbitrary input)

    When you are porting a program, you have a built-in correctness statement: the port should behave exactly as the source program does. This greatly simplifies the testing process.
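
    Concretely, that correctness statement collapses into a one-line differential check per symbol. A minimal sketch in the cargo-fuzz style -- the c_checksum symbol and its Rust counterpart are invented here, but the post's per-symbol fuzz tests against the original C over FFI follow the same idea:

        #![no_main]
        use libfuzzer_sys::fuzz_target;

        extern "C" {
            // Original C implementation, still linked in during the port (hypothetical symbol).
            fn c_checksum(data: *const u8, len: usize) -> u64;
        }

        // Rust port under test (placeholder body; the real ported code goes here).
        fn rust_checksum(data: &[u8]) -> u64 {
            data.iter().fold(0u64, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u64))
        }

        fuzz_target!(|data: &[u8]| {
            // The correctness statement: for every input, the port agrees with the original.
            let expected = unsafe { c_checksum(data.as_ptr(), data.len()) };
            assert_eq!(rust_checksum(data), expected);
        });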

    • bluGill 2 days ago ago

      Several times I've been involved in porting code. Eventually we reach a point where we're getting a lot of bug reports of the form "didn't work, and didn't work with the old system either". Which is to say, we ported correctly, but the old system wasn't right either; we just hadn't tested that situation until the new system had the budget for exhaustive testing. (Normally it worked at one point on the old system and got broken in some later update.)

  • zie1ony 2 days ago ago

    I find it amazing that the same ideas pop up in the same period of time. For example, I work on test generation and I went down the same path. I tried to find bugs by prompting "Find bugs in this code and implement tests to show it.", but this didn't get me far. Then I switched to property (invariant) testing, like you, but in my case I ask the AI: "Based on the whole codebase, make the property tests." and then I fuzz some random actions on the stateful objects and run the prop tests over and over again.

    At first I also wanted to automate everything, but over time I realized that the best split is about 10% human to 90% AI.

    Another idea I'm exploring is AI + mutation testing (https://en.wikipedia.org/wiki/Mutation_testing). It should help the AI generate tests with full coverage.
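
    For concreteness, the "fuzz random actions on stateful objects, then run prop tests" idea looks roughly like this with Rust's proptest. The object and its invariant are made up for illustration:

        use proptest::prelude::*;

        // Hypothetical stateful object: a counter that must never exceed its cap.
        struct BoundedCounter { value: u32, cap: u32 }

        impl BoundedCounter {
            fn new(cap: u32) -> Self { Self { value: 0, cap } }
            fn increment(&mut self) { if self.value < self.cap { self.value += 1; } }
            fn reset(&mut self) { self.value = 0; }
        }

        proptest! {
            // Generate random action sequences; the invariant must hold no matter the order.
            #[test]
            fn never_exceeds_cap(actions in prop::collection::vec(0u8..2, 0..200)) {
                let mut c = BoundedCounter::new(10);
                for a in actions {
                    match a {
                        0 => c.increment(),
                        _ => c.reset(),
                    }
                }
                prop_assert!(c.value <= c.cap);
            }
        }

    Mutation testing then inverts this: it deliberately mutates the implementation and checks that at least one test fails, which is a useful signal for whether the AI-generated tests actually constrain behavior.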

    • LAC-Tech 2 days ago ago

      I'd have much more confidence in an AI codebase where the human has chosen the property tests, than a human codebase where the AI has chosen the property tests.

      Tests are executable specs. That is the last thing you should offload to an LLM.

      • bccdee 2 days ago ago

        Also, a poorly designed test suite makes your code base extremely painful to change. A well-designed test suite with good abstractions makes it easy to change code and, on top of that, makes tests extremely fast to write.

        I think the whole idea of getting LLMs to write the tests comes from a pandemic of under-abstracted, labour-intensive test suites. And that just makes the problem worse.

        • LAC-Tech 2 days ago ago

          Perhaps it comes from the viewpoint that tests are a chore or grunt work: something you have to do but don't really view as interesting or important.

          (like how I describe what git should do and I get the LLM to give me the magic commands with all the confusing nouns and verbs and dashes in the right place).

          • bccdee 2 days ago ago

            Yeah—I like writing elegant test abstractions much more than I like writing clumsy, verbose unit tests, and there's an inverse relationship between those. Maybe people just don't want to ever bother to refactor a test suite, and so early shortcuts turn into walls of boilerplate.

      • kenjackson a day ago ago

        While I agree in theory -- the problem I have is that the humans I've worked with are much worse at writing tests than they are at writing the implementation. Maybe it's motivation or experience, but test quality is generally much worse than implementation quality -- at least in my experience.

      • koakuma-chan 2 days ago ago

        How about an LRM?

        • LAC-Tech 2 days ago ago

          I do not know this term; could you give a concise explanation?

          • koakuma-chan 2 days ago ago

            LRM (large reasoning model) is a newer term for reasoning LLMs. From my experience, either I am bad at prompting, or LRMs are vastly better than LLMs at instruction following.

    • wahnfrieden 2 days ago ago

      An under-explored approach is to collect data on human usage of the app (from production and from internal testers) and feed that into your input generation.

  • DrNosferatu 2 days ago ago

    Why not use the same approach to port the full set of Matlab libraries to Octave?

    (or an open-source language of your choice)

    Matlab manuals are public: it would be clean room reverse engineering.

    (and many times, the appropriate bibliography of the underlying definitions of what is being implemented is listed on the manual page)

  • e28eta 2 days ago ago

    > LLMs open up the door to performing radical updates that we'd never really consider in the past. We can port our libraries from one language to another. We can change our APIs to fix issues, and give downstream users an LLM prompt to migrate over to the new version automatically, instead of rewriting their code themselves. We can make massive internal refactorings. These are types of tasks that in the past, rightly, a senior engineer would reject in a project until its the last possibly option. Breaking customers almost never pays off, and its hard to justify refactoring on a "maintenance mode" project.

    > But if it’s more about finding the right prompt and letting an LLM do the work, maybe that changes our decision process.

    I don’t see much difference between documenting any breaking changes in sufficient detail for your library consumers to understand them vs “writing an LLM prompt for migrating automatically”, but if that’s what it takes for maintainers to communicate the changes, okay!

    Just as long as it doesn’t become “use this LLM which we’ve already trained on the changes to the library, and you just need to feed us your codebase and we’ll fix it. PS: sorry, no documentation.”

    • marxism 2 days ago ago

      There's a huge difference between documentation and prompts. Let me give you a concrete example.

      I get requests to "make your research code available on Hugging Face for inference" with a link to their integration guide. That guide is 80% marketing copy about Git-based repositories, collaboration features, and TensorBoard integration. The actual implementation details are mixed in throughout.

      A prompt would be much more compact.

      The difference: I can read a prompt in 30 seconds and decide "yes, this is reasonable" or "no, I don't want this change." With documentation, I have to reverse-engineer the narrow bucket which applies to my specific scenario from a one-size-drowns-all ocean.

      The person making the request has the clearest picture of what they want to happen. They're closest to the problem and most likely to understand the nuances. They should pack that knowledge densely instead of making me extract it from documentation links and back and forth.

      Documentation says "here's everything now possible, you can do it all!" A prompt says "here's the specific facts you need."

      Prompts are a shared social convention now. We all have a rough feel for what information you need to provide - you have to be matter-of-fact, specific, can't be vague. When I ask someone to "write me a prompt," that puts them in a completely different mindset than just asking me to "support X".

      Everyone has experience writing prompts now. I want to leverage that experience to get cooperative dividends. It's division of labor - you write the initial draft, I edit it with special knowledge about my codebase, then apply it. Now we're sharing the work instead of dumping it entirely on the maintainer.

      [1] https://peoplesgrocers.com/en/writing/write-prompts-not-guid...

      • rjpower9000 2 days ago ago

        I was pretty hand-wavy when I made the original comment. I was thinking implicitly of things like the Python sub-interpreter proposal, which had strong pushback from the NumPy engineers at the time (I don't know the current status, whether it's a good idea, etc., just something that came to mind).

        https://lwn.net/Articles/820424/

        The objections are of course reasonable, but I kept thinking this shouldn't be as big a problem in the future. A lot of times we want to make some changes that aren't _quite_ mechanical, and if they hit a large part of the code base, it's hard to justify. But if we're able to defer these types of cleanups to LLMs, it seems like this could change.

        I don't want a world with no API stability of course, and you still have to design for compatibility windows, but it seems like we should be able to do better in the future. (More so in mono-repos, where you can hit everything at once).

        Exactly as you write, the idea with prompts is that they're directly actionable. If I want to make a change to API X, I can test the prompt against some projects to validate that agents handle it well, even doing direct prompt optimization, and then share it with end users.

      • e28eta 2 days ago ago

        Yes, there's a difference between "all documentation for a project" and "prompt for specific task".

        I don't think there should be a big difference between "documentation of specific breaking changes in a library and how consumers should handle them" and "LLM prompt to change a code base for those changes".

        You might call it a migration guide. Or it might be in the release notes, in a special section for Breaking Changes. It might show up in log messages ("you're using this API wrong, or it's deprecated").

        Why would describing the changes to an LLM be easier than explaining them to the engineer on the other end of your API change?

  • oasisaimlessly 2 days ago ago

    Author FYI: The "You can see the session log here." link to [1] is broken.

    [1]: https://rjp.io/blog/claude-rust-port-conversation

  • lhmiles 2 days ago ago

    Are you the author? You can speed things up and get better results sometimes by retrying the initial generation step many times in parallel, instead of the interactive rewrite thing.

    • rjpower9000 2 days ago ago

      I'm the author. That's a great idea. I didn't explore that for this session but it's worth trying.

      I didn't measure consistently, but I would guess 60-70% of the symbols ported easily, with either one-shot or trivial edits; for 20%, Gemini managed to get there but ended up using most of its attempts; and 10% it just struggled with.

      The 20% would be good candidates for multiple generations & certainly consumed more than 20% of the porting time.

  • maxjustus 2 days ago ago

    I used this general approach to port the ClickHouse specific version of cityHash64 to JS from an existing Golang implementation https://github.com/maxjustus/node-ch-city/blob/main/ch64.js. I think it works particularly well when porting pure functions.

  • rcthompson 2 days ago ago

    The author notes that the resulting Rust port is not very "rusty", but I wonder if this could also be solved through further application of the same principle. Something like telling the AI to minimize the use of unsafe etc., while enforcing that the result should compile and produce identical outputs to the original.

    • rjpower9000 2 days ago ago

      It seems feasible, but I haven't thought enough about it. One challenge is that as you Rustify the code, it's harder to keep the 1-1 mapping with C interfaces. Sometimes to make it more Rust-y, you might want an internal function or structure to change. You then lose your low-level fuzz tests.

      That said, you could have the LLM write equivalence tests, and you'd still have the top-level fuzz tests for validation.

      So I wouldn't say it's impossible, just a bit harder to mechanize directly.

  • namanjain01 20 hours ago ago

    I am one of the authors of the Syzygy work! Thank you for writing this, it was a great read! Especially since we attempted this last year when considerably weaker models (o1-preview) were available, it is great to see how far we have come.

    A significant portion of our research focuses on utilizing testing in conjunction with LLMs. Beyond translation, we are also exploring code optimization, which aligns with the "precise" specification as part of the GSO project - https://gso-bench.github.io/.

  • punnerud 2 days ago ago

    Reading that TensorFlow is not used much anymore (outside of Google) felt good. I had to check Google Trends: https://trends.google.com/trends/explore?date=all&q=%2Fg%2F1...

    I started using TensorFlow years ago and switched to PyTorch. I hope ML will make switches like TensorFlow-to-PyTorch faster and easier, instead of just the biggest companies eating the open source community, like it has been for years.

    • screye 2 days ago ago

      Google has moved to JAX. I know many people who prefer it over pytorch.

      • leoh 2 days ago ago

        It's okay. The complaints are documentation and limited community support (all kinds of architectures are much more DIY for it vs PyTorch).

        Unrelated gripe: they architected it really poorly from a pure sw pov imo. Specifically it’s all about Python bindings for C++ so the py/c++ layer is tightly coupled both in code and in the build system.

        They have a huge opportunity to fix this so that, for example, Rust bindings could be (reasonably trivially) generated, not to mention bindings for other languages.

  • DinoNuggies45 a day ago ago

    This matches what I've seen; fuzzing always felt more niche to me until we started using it during porting. Any tips for minimizing false positives when adapting to a new arch?

  • gaogao 2 days ago ago

    Domains where fuzzing is useful are generally good candidates for formal verification, which I'm pretty bullish about in concert with LLMs. This is in part because you can just formally verify by exhaustiveness for many problems, but the real enhancement is being able to prove, through inductive reasoning and such, that you don't need to test certain combinations.
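
    A small illustration of the "verify by exhaustiveness" point: when the input domain is small enough, you can check every input outright instead of sampling, and inductive arguments are only needed for the combinations you can't enumerate. (The function below is a made-up example, not from the post.)

        // Alternative implementation we want to trust.
        fn abs_diff_alt(a: u16, b: u16) -> u16 {
            (a as i32 - b as i32).unsigned_abs() as u16
        }

        fn main() {
            // Exhaustive check over the full 2^32 input space; no sampling, no gaps.
            for a in 0..=u16::MAX {
                for b in 0..=u16::MAX {
                    assert_eq!(abs_diff_alt(a, b), a.abs_diff(b));
                }
            }
            println!("verified for all {} input pairs", (u16::MAX as u64 + 1).pow(2));
        }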

    • rjpower9000 2 days ago ago

      That's an interesting idea. I hadn't thought about it, but it would be interesting to consider doing something similar for the porting task. I don't know enough about the space: could you have an LLM write a formal spec for a C function and then validate that the translated function has the same properties?

      I guess I worry it would be hard to separate out the "noise", e.g. the C code touches some memory on each call so now the Rust version has to as well.

  • DrNosferatu 2 days ago ago

    It will be inevitable that this generalizes.

  • fiforpg a day ago ago

    That paper by Wigner about mathematics really did originate an entire naming scheme for such texts, didn't it.

    • AIPedant 10 hours ago ago

      And maybe 3-5% of the titles using it are accurate about the thing actually being "unreasonably" effective! There are a lot of surprising things about transformer LLMs, but they were literally designed to translate human languages, so being effective at translating programming languages seems natural.

      At least the “X Considered Harmful” clones describe something which is plausibly harmful.

  • akoboldfrying a day ago ago

    Cool! It's really helpful to read a blow-by-blow account of someone pushing through this idea, seeing what worked in practice and what didn't. Zopfli is a very suitable project.

    I do wonder though... Surely the high-level goal of porting C code to Rust is not simply to have Rust code, but rather to have code that is immune to a class of memory access problems. So to the extent that the results of your translation are "unsafe", and thus susceptible to those same errors, what has actually been gained?

    If a second LLM pass is an acceptable way to get rid of those "unsafe"s, shouldn't it also be acceptable to apply such a second pass to the output of c2rust?
