Regarding the typewriter approach, I've wondered for a while whether anyone has explored simple backtracking with LLMs: give the model a backspace/delete token that lets it "undo" previously generated tokens in an append-only fashion. I'm not sure how this would work with teacher forcing, but it seems feasible with RL.
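Here's a toy sketch of what I mean (pure illustration; the `<bksp>` token and everything else here is invented, not any real model's interface):

```python
# Toy sketch: the context the model conditions on stays append-only, but the
# *effective* output is computed by replaying the stream and popping a token
# on each backspace. "<bksp>" is an invented special token.
BACKSPACE = "<bksp>"

def effective_output(emitted_tokens):
    """Replay an append-only token stream, applying backspaces."""
    out = []
    for tok in emitted_tokens:
        if tok == BACKSPACE:
            if out:
                out.pop()  # undo the most recent surviving token
        else:
            out.append(tok)
    return out

# The model "changes its mind" about the tail of a signature:
stream = ["def", " add", "(", "a", ",", " b", ")", ":",
          BACKSPACE, BACKSPACE, BACKSPACE, " *args", ")", ":"]
print("".join(effective_output(stream)))  # def add(a, *args):
```

Training on the raw stream would still be plain next-token prediction; the open question is where the supervision for emitting the backspace comes from, which is why RL seems like the natural fit.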
Really interesting to see the diffusion model solve the puzzles in an iterative way, which feels more similar to how I (and probably most humans) solve them.
From the outside, it seems to be limited by unmasking too few tokens per round, even when the heatmap shows many more high-confidence guesses available. On some of the larger puzzles it looks like it wastes many rounds filling in the 'obvious' shapes and then gets the interesting bit only in the last round. It also doesn't seem to have learned the idea of "the background is blue with shapes drawn on top," where the background is often 50% of the solution in these puzzles.
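For what it's worth, the fix I'm imagining is a confidence threshold instead of a fixed per-round budget. A made-up sketch of the difference (the predictions here are random stand-ins, not the post's model):

```python
import numpy as np

def select_unmask(probs, masked, k=None, threshold=None):
    """probs: (positions, vocab) predictions; masked: bool per position.
    Return the masked positions to reveal this round."""
    conf = probs.max(axis=-1)            # per-position confidence
    conf = np.where(masked, conf, -1.0)  # ignore already-revealed slots
    if threshold is not None:
        return np.flatnonzero(conf >= threshold)  # reveal every confident guess
    return np.argsort(-conf)[:k]                  # fixed budget per round

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.full(10, 0.05), size=64)  # fake, very peaked predictions
masked = np.ones(64, dtype=bool)
print(len(select_unmask(probs, masked, k=4)))            # always 4 per round
print(len(select_unmask(probs, masked, threshold=0.9)))  # usually far more
```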
It is kind of wild that most coding tasks are editing tasks, and we humans care a lot about code editing tools, yet automated tools do editing via code generation, where a valid block must be generated top-to-bottom in one go.
Fixing a mistake requires re-generating the file or block of code. And if something generated later has implications for earlier code (a new import, a newly required function parameter, something like that), the only option is to go back and re-generate a big chunk. That would be inefficient for humans, and it's not implausible that it's the wrong approach for other code generators too.
I don't know if diffusion specifically will be the approach. (Maybe there's something to generating edit sequences?) This post's note that diffusion kills KV caching is something I hadn't even considered. It does seem right to experiment with things other than strict start-to-end generation.
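As a sketch of the edit-sequence idea: have the model emit (find, replace) pairs that a tool applies, rather than whole files. The format below is invented for illustration, though it's loosely similar to the search/replace blocks some coding agents use:

```python
def apply_edits(source: str, edits: list[tuple[str, str]]) -> str:
    """Apply (find, replace) edits in order."""
    for find, replace in edits:
        if find not in source:
            raise ValueError(f"anchor not found: {find!r}")
        source = source.replace(find, replace, 1)  # first occurrence only
    return source

code = "import os\n\ndef load(path):\n    return open(path).read()\n"
# A decision made "later" (an encoding parameter) reaches back and edits
# earlier code without regenerating the file top-to-bottom:
edits = [
    ("def load(path):", "def load(path, encoding='utf-8'):"),
    ("open(path).read()", "open(path, encoding=encoding).read()"),
]
print(apply_edits(code, edits))
```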
You can still cache the prompt; this only affects the cache for tokens produced during generation. And that's fairly harmless, relatively speaking.
If you completely do away with autoregression, prompt tokens can pay attention to generated tokens, so even the prompt tokens' KV vectors change at every step and you cannot cache anything.
For this reason, models that generate text using diffusion typically generate blocks of tokens at a time, where tokens within a block freely attend to each other, but across blocks there's causal masking so that each block only depends on the preceding ones and we're back to autoregression again. That makes caching possible, but also means you still can't have diffusion change the beginning of a long text to match the end.
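The two attention patterns are easy to see as masks. A small numpy illustration (block size and shapes are arbitrary):

```python
import numpy as np

def block_causal_mask(n_tokens: int, block_size: int) -> np.ndarray:
    """True where token i may attend to token j: full attention within a
    block, causal masking across blocks."""
    blk = np.arange(n_tokens) // block_size
    return blk[:, None] >= blk[None, :]

# Pure diffusion is effectively an all-True mask: the prompt attends to
# generated tokens too, so no KV vector survives from step to step.
# With block-causal masking, block 0 (e.g. the prompt) never attends to
# block 1, so block 0's KV vectors can be computed once and cached.
print(block_causal_mask(8, block_size=4).astype(int))
```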
A massive problem with current-generation LLMs is that they have a single, globally ordered context, and the model is only allowed to append to it.
This is like having a single-tape Turing machine. It can simulate a multi-tape machine, but at O(n^2) cost, since each simulated step may require sweeping across the entire tape.
The computation budget of an LLM is finite, so this has a massive practical impact.
Incredibly cool work, and a great primer on diffusion