7 comments

  • kazinator 16 minutes ago

    But look, the Static Huffman results (a simpler compression encoding, with fewer ways for a decode to fail) almost bear out a certain aspect of the friend's intuition, in the following way:

    * only 4.4% of the random data disassembles.

    * only 4.0% of the random data decodes as Static Huffman.

    BUT:

    * 1.2% of the data decompresses and disassembles.

    Relative to the 4.0% that decompresses, 1.2% is 30%.

    In other words, 30% of successfully decompressed material also disassembles.

    That's something that could benefit from an explanation.

    Why is the conditional probability of a good disassembly, given a successful Static Huffman expansion, evidently so much higher than the probability of a disassembly from raw random data?
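    The ratio in question is a straightforward conditional probability; a quick back-of-the-envelope check, using the figures from the bullets above:

```python
# Figures quoted in the comment above, as fractions of random inputs that:
p_disasm = 0.044    # disassemble directly
p_huffman = 0.040   # decode as Static Huffman
p_both = 0.012      # both decompress AND then disassemble

# Conditional probability of a clean disassembly given a
# successful Static Huffman expansion: P(D | H) = P(D and H) / P(H)
p_disasm_given_huffman = p_both / p_huffman  # ~0.30, vs. the 0.044 baseline
```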

  • kazinator 37 minutes ago

    In common anecdotal experience with disassembling code, it is very common for data areas interspersed with code (like string literals) to disassemble to instructions, momentarily puzzling the human reader: what is this repetition of five "or" instructions doing here, referencing registers that would never be arguments?

    The reason is that the opcode encoding is very dense, carries no redundancy that would allow bad encodings to be detected, and usually bears no relationship to neighboring words.

    By that I mean that some four-byte chunk (say) is decoded as an opcode word regardless of what came before or what comes after. If it looks like an opcode with a four-byte immediate operand, then the disassembly will pull in that operand (which can be any bit combination) and consume another four bytes. Nothing in the operand will indicate "this is a bad instruction overall".
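    The effect of density can be shown with a toy model (my own construction, not anything from the article): in a format where every opcode value is defined, random bytes always "decode", whereas an encoding with reserved invalid values rejects random data almost immediately.

```python
import random

# Toy byte-code formats (hypothetical, for illustration only):
# "dense"  -- all 256 opcode byte values are defined, like real dense ISAs,
#             so a decoder has no basis for rejecting random bytes.
# "sparse" -- only 64 of 256 opcode values are defined, so each random
#             byte is invalid 3/4 of the time and decoding usually fails.
DENSE_OPCODES = set(range(256))
SPARSE_OPCODES = set(range(64))

def decodes(data, valid_opcodes):
    """Return True if every byte is a defined opcode in this format."""
    return all(b in valid_opcodes for b in data)

random.seed(1)
trials = 10_000
blobs = [bytes(random.randrange(256) for _ in range(16)) for _ in range(trials)]

dense_ok = sum(decodes(b, DENSE_OPCODES) for b in blobs)    # all of them
sparse_ok = sum(decodes(b, SPARSE_OPCODES) for b in blobs)  # ~(1/4)**16 each
```

    With just 16 bytes per blob, the sparse format's acceptance rate is about (1/4)^16, effectively zero, while the dense format accepts everything; real instruction sets with immediates behave like the dense case.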

  • mfcl 42 minutes ago

    Why the AI disclosure? Is it just for the author to make sure the readers know they are AI-skeptic and use the opportunity to link to another article, or would there be something wrong with the proof had AI been used to help write the code?

    (By help I mean just help, not write an entire sloppy article.)

    • jstrieb 24 minutes ago

      Hey, I wrote this! There are a couple of reasons that I included the disclosure.

      The main one is to set reader expectations that any errors are entirely my own, and that I spent time reviewing the details of the work. The disclosure seemed to me a concise way to do that -- my intention was not any form of anti-AI virtue signaling.

      The other reason is that I may use AI for some of my future work, and as a reader, I would prefer a disclosure about that. So I figured if I'm going to disclose using it, I might as well disclose not using it.

      I linked to other thoughts on AI just in case others are interested in what I have to say. I don't stand to gain anything from what I write, and I don't even have analytics to tell me more people are viewing it.

      All in all, I was just trying to be transparent, and share my work.

    • kazinator 14 minutes ago

      It's like "pesticide use disclosure: our blueberries are 'no spray'; but we are not insinuating there is anything wrong with pesticides."

      :)

      I like it!

      But, here it does serve a purpose beyond hinting at the author's ideological stance.

      Nowadays, a lot of readers will wonder how much of your work is AI assisted. Their eyes will be drawn to the AI Use Disclosure, which will answer their question.

  • 0x1ch an hour ago

    I believe this is the third or fourth posting of this article in the last week.