I genuinely did not expect to see a robot handling clothing like this within the next ten years at least. Insanely impressive.
I do find it interesting that they state that each task is done with a fine-tuned model. I wonder if that's a limitation of the current dataset their foundation model is trained on (which is what I think they're suggesting in the post) or if it reflects something more fundamental about robotics tasks. It reminds me of LLMs a few years ago, when fine-tuning was more prevalent. I don't follow LLM training methodology closely, but my impression is that the bulk of recent improvements have come from better RL post-training and inference-time reasoning.
Obviously they're pursuing RL, and I'm not sure spending more tokens at inference would even help for fine manipulation like this, notwithstanding the latency problems it would introduce.
So maybe the need for fine-tuning goes away with a better foundation model, as they're suggesting? I hope this doesn't point towards a more fundamental limitation of learning with the current VLA foundation-model architectures.
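For concreteness, "a fine-tuned model per task" typically means taking the pretrained policy and continuing supervised training (behavior cloning) on demonstrations of just that one task. Here's a minimal PyTorch sketch of that workflow; everything in it (the tiny network, the frozen backbone, the synthetic demos) is a hypothetical stand-in, not their actual model or pipeline.

    import torch
    import torch.nn as nn

    # Hypothetical stand-in for a pretrained VLA policy. A real one would be
    # a large transformer over images + language; here it's just an MLP over
    # a flat observation vector so the sketch runs end to end.
    class TinyPolicy(nn.Module):
        def __init__(self, obs_dim=64, act_dim=7):
            super().__init__()
            self.backbone = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
            self.head = nn.Linear(128, act_dim)  # task-specific action head

        def forward(self, obs):
            return self.head(self.backbone(obs))

    policy = TinyPolicy()
    # In reality you'd load foundation-model weights here, e.g.:
    # policy.load_state_dict(torch.load("foundation_policy.pt"))

    # One common recipe: freeze the shared backbone and fine-tune only the
    # head on task-specific demonstrations.
    for p in policy.backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(policy.head.parameters(), lr=1e-4)

    # Synthetic (observation, action) pairs standing in for teleoperated
    # demos of the single task being fine-tuned, e.g. folding a shirt.
    obs = torch.randn(256, 64)
    act = torch.randn(256, 7)

    for step in range(200):
        loss = nn.functional.mse_loss(policy(obs), act)  # behavior cloning
        opt.zero_grad()
        loss.backward()
        opt.step()

The point is just the shape of the workflow: one shared pretrained model, then a separate small training run (and a separate checkpoint) per task, which is what makes the per-task fine-tuning requirement notable.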
Those videos are very impressive. This is real progress on tasks at which robotics has been failing for fifty years.
Here are some of the same tasks being attempted as part of the DARPA ARM program in 2012.[1] Compare key-in-lock and door opening with the 2025 videos linked above. Huge improvement.
We just might be over the hump on manipulation.
[1] https://www.youtube.com/watch?v=jeABMoYJGEU