
martythemaniak
I've spent the last few months looking into VLAs and I'm convinced that they're gonna be a big deal, ie they very well might be the "chatgpt moment for robotics" that everyone's been anticipating. Multimodal LLMs already have a ton of built-in understanding of images and text, so VLAs are just regular MMLLMs that are fine-tuned to output a specific sequence of instructions that can be fed to a robot.

OpenVLA, which came out last year, is a Llama2 fine-tune with extra image encoding that outputs a 7-tuple of integers. The integers are rotation and translation inputs for a robot arm. If you give a vision Llama2 a picture of an apple and a bowl and say "put the apple in the bowl", it already understands apples and bowls, knows the end state should be the apple in the bowl, etc. What's missing is the series of tuples that will correctly manipulate the arm to do that, and the way they got it is by fine-tuning on a large number of short instruction videos.
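
For a sense of how small that action interface actually is, here's a rough sketch of the de-tokenization step. The bin count, ranges, and the gripper dimension are illustrative guesses on my part, not OpenVLA's actual values or code:

    import numpy as np

    # Sketch: map a 7-tuple of discrete action tokens back to continuous
    # arm commands. Bin count and ranges are made up for illustration.
    N_BINS = 256

    ACTION_RANGES = np.array([
        [-0.05, 0.05],   # dx (m)
        [-0.05, 0.05],   # dy (m)
        [-0.05, 0.05],   # dz (m)
        [-0.25, 0.25],   # droll (rad)
        [-0.25, 0.25],   # dpitch (rad)
        [-0.25, 0.25],   # dyaw (rad)
        [ 0.0,  1.0 ],   # gripper open/close (assumed 7th dimension)
    ])

    def detokenize(action_tokens):
        """Map 7 integer tokens in [0, N_BINS) to continuous arm commands."""
        lo, hi = ACTION_RANGES[:, 0], ACTION_RANGES[:, 1]
        frac = (action_tokens + 0.5) / N_BINS   # bin centers stay in range
        return lo + frac * (hi - lo)

    # e.g. the fine-tuned VLM emits these 7 token ids for one control step:
    print(detokenize(np.array([128, 140, 90, 127, 127, 127, 255])))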

The neat part is that although everyone is focusing on robot arms manipulating objects at the moment, there's no reason this method can't be applied to any task. Want a smart lawnmower? It already understands "lawn", "mow", "don't destroy the toy in the path", etc., it just needs a fine-tune on how to correctly operate a lawnmower. Sam Altman made some comments recently about having self-driving technology, and I'm certain it's a ChatGPT-based VLA. After all, if you give ChatGPT a picture of a street, it knows what's a car, a pedestrian, etc. It doesn't know how to output the correct turn/go/stop commands, and it does need a great deal of diverse data, but there's no reason why it can't do it. https://www.reddit.com/r/SelfDrivingCars/comments/1le7iq4/sa...

Anyway, super exciting stuff. If I had time, I'd rig a snowblower with a remote control setup, record a bunch of runs and get a VLA to clean my driveway while I sleep.


ckcheng
VLA = Vision-language-action model: https://en.wikipedia.org/wiki/Vision-language-action_model

Not https://public.nrao.edu/telescopes/VLA/ :(

For completeness, MMLLM = multimodal large language model.

Workaccount2
I don't think transformers will be viable for self driving cars until they can both:

1) Properly recognize what they are seeing without having to lean so hard on their training data. Go photoshop a picture of a cat and give it a 5th leg coming out of its stomach. No LLM will be able to properly count the cat's legs (they will keep saying 4 legs no matter how many times you insist they recount).

2) Be extremely fast at outputting tokens. I don't know where the threshold is, but it's probably going to be a non-thinking model (at first) and will probably need something like Cerebras or a diffusion architecture to get there.
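
For a rough sense of scale (numbers are illustrative, not a real threshold):

    # Back-of-envelope for the "fast enough tokens" point.
    control_hz = 50          # how often the car needs a fresh action
    tokens_per_action = 7    # e.g. a small action tuple per step

    print(control_hz * tokens_per_action)            # 350 tokens/s just for actions
    # A model that "thinks" with, say, 200 extra tokens per decision would need
    # 50 * (200 + 7) = 10,350 tokens/s -- hence the interest in Cerebras-class
    # inference or non-autoregressive (diffusion / chunked) decoding.
    print(control_hz * (200 + tokens_per_action))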

cgearhart
The current gen VLA architectures include some tricks (like compressed action tokenization and diffusion decoding) to reach action frequencies between 50 and 200 Hz. I think they’re _more_ efficient this way than regular LLMs trying to do everything through text.
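
A minimal sketch of the chunking idea — `predict_chunk`, the rates, and the horizon are hypothetical stand-ins, not any particular model's API:

    import time

    # One (slow) model call predicts a horizon of H actions, which are then
    # streamed to the robot at a much higher control rate.
    CONTROL_HZ = 100   # actuator-facing rate
    CHUNK_H = 20       # actions predicted per forward pass
                       # -> the model itself only needs to run at 100/20 = 5 Hz

    def control_loop(vla_model, camera, robot):
        while True:
            obs = camera.read()
            chunk = vla_model.predict_chunk(obs, horizon=CHUNK_H)  # shape (H, 7)
            for action in chunk:
                robot.apply(action)
                time.sleep(1.0 / CONTROL_HZ)
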
martythemaniak OP
1. Well, based on Karpathy's talks on Tesla FSD, his solution is to actually make the training set reflect everything you'd see in reality. The tricky part is that if something occurs 0.0000001% of the time IRL and something else occurs 50% of the time, they both need to make up 5% of the training corpus (rough sketch of that rebalancing after point 2 below). The thing with multimodal LLMs is that lidar/depth input can just be another input that gets encoded along with everything else, so for driving, "there's a blob I don't quite recognize" is still a blob you have to drive around.

2. Figure has a dual-model architecture which makes a lot of sense: a 7B model that does higher-level planning and control and runs at 8 Hz, and a tiny 0.08B model that runs at 200 Hz and handles the fine-grained control outputs (rough sketch below). https://www.figure.ai/news/helix
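
On point 1, a toy sketch of the rebalancing idea — scenario names and frequencies are made up:

    import numpy as np

    # Sample training clips by a target mix rather than their raw real-world frequency.
    scenarios       = ["routine_driving", "cut_in", "debris_on_road", "weird_blob"]
    real_world_freq = np.array([0.90, 0.09, 0.0099, 0.000001])   # how often they occur
    target_mix      = np.array([0.40, 0.30, 0.15,   0.15])       # what the model trains on

    rng = np.random.default_rng(0)
    print([rng.choice(scenarios, p=target_mix) for _ in range(5)])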
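
On point 2, here's roughly what that dual-rate split could look like — `encode_intent`, the object names, and the rates are hypothetical stand-ins, not Figure's actual API:

    import threading
    import time

    # A large VLM plans at a few Hz and hands a latent "intent" to a small
    # policy that emits motor commands at a few hundred Hz.
    latest_latent = None

    def planner_loop(big_vlm, camera, hz=8):
        global latest_latent
        while True:
            obs = camera.read()
            latest_latent = big_vlm.encode_intent(obs, "put the apple in the bowl")
            time.sleep(1.0 / hz)

    def controller_loop(small_policy, robot, proprio, hz=200):
        while True:
            if latest_latent is not None:
                action = small_policy(latest_latent, proprio.read())
                robot.apply(action)
            time.sleep(1.0 / hz)

    # threading.Thread(target=planner_loop, args=(big_vlm, camera)).start()
    # threading.Thread(target=controller_loop, args=(small_policy, robot, proprio)).start()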

generalizations
I will be surprised if VLAs stick around, based on your description. That sounds far too low-level. Better hand that off to the 'nervous system' / kernel of the robot - it's not like humans explicitly think about the rotation of their hip & ankle when they walk. Sounds like a bad abstraction.
