R-Zero: Self-Evolving Reasoning LLM from Zero Data

(arxiv.org)

53 points | by lawrenceyan 10 hours ago

12 comments

  • nakamoto_damacy an hour ago

    Perpetual Motion Machines were a thing at some point, too.

    • YeGoblynQueenne an hour ago

      Don't laugh. PMMs work! I built mine ten years ago when I realised I could improve the SOTA by a huge 20%. I've been improving it for the last 10 years and I get an average performance boost of ~0.25% every year. We will have Free Energy in the next 10 years.

    • api an hour ago

      I refer to the endlessly self-improving runaway AI as an “information-theoretic perpetual motion machine.”

      This will work in a sense. It will do… something… and learn… something. It will not be related to the physical universe in any way. See also: procedural landscape generators, etc.

  • jasonjmcghee 7 hours ago

    Conceptually, it's effectively a GAN.

    • magicalhippo 2 hours ago

      For those not in the know, that's Generative Adversarial Networks[1], where two neural networks are trained in a competitive way.

      One network typically generates tasks for the other, and is rewarded if it manages to make the other network fail the task. The other network is rewarded if it successfully completes the task.

      Thus the adversarial network tries to find weaknesses to exploit, and the combined training makes the solving network much stronger. Or at least that's the idea.

      [1]: https://en.wikipedia.org/wiki/Generative_adversarial_network
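
      For concreteness, here's a minimal sketch of that adversarial loop in PyTorch. Everything in it (the 1-D Gaussian target, the network sizes, the hyperparameters) is an illustrative assumption, not anything from the R-Zero paper:

        # Toy GAN: the generator learns to mimic samples from N(4, 1.25);
        # the discriminator learns to tell real samples from generated ones.
        import torch
        import torch.nn as nn

        REAL_MEAN, REAL_STD, NOISE_DIM, BATCH = 4.0, 1.25, 8, 64

        gen = nn.Sequential(nn.Linear(NOISE_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
        disc = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
        g_opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
        d_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
        bce = nn.BCEWithLogitsLoss()

        for step in range(2000):
            # Discriminator step: rewarded for labelling real as real, fake as fake.
            real = torch.randn(BATCH, 1) * REAL_STD + REAL_MEAN
            fake = gen(torch.randn(BATCH, NOISE_DIM)).detach()
            d_loss = (bce(disc(real), torch.ones(BATCH, 1))
                      + bce(disc(fake), torch.zeros(BATCH, 1)))
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()

            # Generator step: rewarded when the discriminator calls its fakes real.
            fake = gen(torch.randn(BATCH, NOISE_DIM))
            g_loss = bce(disc(fake), torch.ones(BATCH, 1))
            g_opt.zero_grad(); g_loss.backward(); g_opt.step()

        # The generated mean should drift toward ~4.0 as training proceeds.
        print(gen(torch.randn(1000, NOISE_DIM)).mean().item())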

    • frumiousirc an hour ago

      My initial thought as well. But what is the "Discriminator" here? What grounds the training toward reality? The "Challenger" and "Solver" adversarial dynamic alone can only serve to amplify hallucination.

      Ahh, GPT-4o is the arbiter.

      So, basically, this is a way to perform LLM model compression (GPT-4o to qwen3) while maximizing the in-distribution domain size. As such, it seems reasonable and useful.

      However, the reliance on an arbiter LLM makes the claim that it will overcome the problem of a lack of training data unreasonable. Once the target LLM is scaled up to reach the in-distribution domain size of the arbiter, it seems to me it will turn back into a hallucination amplifier.
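
      To make the structure concrete, here's a rough sketch of one round as I read it. The object names and the exact reward scheme are my guesses, not the paper's actual algorithm:

        # Hypothetical Challenger/Solver/arbiter round. challenger, solver
        # and arbiter are assumed objects exposing generate()/solve()/answer();
        # the real reward shaping, question filtering and RL update are elided.
        def self_play_round(challenger, solver, arbiter):
            question = challenger.generate()      # Challenger proposes a problem
            answer = solver.solve(question)       # Solver attempts it
            reference = arbiter.answer(question)  # arbiter (GPT-4o) grounds the label

            # The Solver is graded against the arbiter, and the Challenger is
            # rewarded for stumping the Solver, so the training signal can never
            # be better than the arbiter -- the ceiling described above.
            solver_reward = 1.0 if answer == reference else 0.0
            challenger_reward = 1.0 - solver_reward
            return solver_reward, challenger_reward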

    • torginus an hour ago

      GANs are a supervised training method, not really self-improving (after converging to being able to reproduce the training set).

  • clbrmbr 29 minutes ago

    Terrible choice of name. DeepSeek developed a historically important model called “R1-Zero” (the predecessor to R1 that was trained without any cold-start SFT; it was very strong, but its chain of thought was hard to read because it code-switches into Chinese and has no line breaks).

  • thom 4 hours ago

    For values of zero quite far above zero.

    • falcor84 3 hours ago

      What am I missing? From my skimming, there's zero external data beyond what is needed for the Challenger to generate questions.

  • cyberge99 7 hours ago

    What could go wrong?