I’m optimistic about humanoid robotics, but I’m curious about the reliability issue. Biological limbs and hands are quite miraculous when you consider that they are able to constantly interact with the world, which entails some natural wear and tear, but then constantly heal themselves.
gene-h
Industrial robots at least are very reliable; MTBF is often upwards of 100,000 hours[0]. Industrial robots are optimized to be as reliable as possible because the longer they last and the less often they need to be fixed, the more profitable they are. In fact, German and Japanese companies came to dominate the industrial robotics market because they focused on reliability. They developed rotary electric actuators that were more reliable. Cincinnati Milacron (US) was outcompeted in the industrial robot market because although its hydraulic robots were strong, they were less reliable.
I am personally a bit skeptical of anthropomorphic hands achieving similarly high reliability. There are just too many small parts that need to withstand high forces.
If you E-stop an industrial robot, it stops immediately and all is OK. If a humanoid were to freeze like that, it would fall over, hurting you and your stuff on the way down, and damaging itself.
Mechanical reliability is not the main concern IMO
marinmania
It does either get very exciting or very spooky thinking of the possibilities in the near future.
I had always assumed that such a robot would be very specific (like a cleaning robot) but it does seem like by the time they are ready they will be very generalizable.
I know they would require quite a few sensors and motors, but compared to self-driving cars their liability would be less and they would use far less material.
fragmede
The exciting part comes when two robots are able to do repairs on each other.
marinmania
I think this is the spooky part. I feel dumb saying it, but is there a point where they are able to coordinate and build a factory to build chips/more of themselves? Or other things entirely?
bamboozled
Of course there is
ta988
But this still has a massive cost. Replacing or repairing an actuator isn't cheap, in materials and in downtime.
jacobaul
To maybe get a little carried away with the sci-fi for a minute, why does the Actuator need to cost anything?
When the tree of costs that make up a product are traced, surely all the leaf nodes are human labour? As in, to make the actuator, I had to pay someone to assemble it and I had to buy the parts. Each part had a materials cost and a labour cost. So it goes for the factory that made the fasteners, the foundry that made the steel, the mine that extracted the ore.
Shudder to think of how to regulate resource extraction in a future where AI humanoid robots are strip mining and logging for free.
david-gpu
> When the tree of costs that make up a product are traced, surely all the leaf nodes are human labour?
What about energy, real estate and taxes?
Even at the extreme end of automation, if you want iron ore, you need to buy a mine from somebody, pay taxes on it, and power the machines to extract the minerals and transport them elsewhere for processing.
pryelluw
2 bots 1 bolt ?
UltraSane
Consumable components could be automatically replaced by other robots.
didip
I think those problems can be solved with further research in materials science, no? Combine that with very responsive but low-torque servos, and I think this is a solvable problem.
michaelt
It's a simple matter of the number of motors you have. [1]
Assume every motor has a 1% failure rate per year.
A boring wheeled roomba has 3 motors. That's a 2.9% failure rate per year, and 8.6% failures over 3 years.
Assume a humanoid robot has 43 motors. That gives you a 35% failure rate per year, and 73% over 3 years. That ain't good.
And not only is the humanoid robot less reliable, it's also 14.3x the price - because it's got 14.3x as many motors in it.
[1] And bearings and encoders and gearboxes and control boards and stuff... but they're largely proportional to the number of motors.
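The compounding math above is just independent-failure probability; a minimal sketch (assuming every motor fails independently at the same annual rate, which is of course a simplification):

```python
def failure_prob(n_motors, per_motor_annual_rate=0.01, years=1):
    """Probability that at least one of n independent motors fails
    within the given number of years."""
    survive_one_year = (1 - per_motor_annual_rate) ** n_motors
    return 1 - survive_one_year ** years

# Wheeled robot with 3 motors vs. humanoid with 43:
print(f"{failure_prob(3):.1%}")            # ~3% per year
print(f"{failure_prob(3, years=3):.1%}")   # ~8.6% over 3 years
print(f"{failure_prob(43):.1%}")           # ~35% per year
print(f"{failure_prob(43, years=3):.1%}")  # ~73% over 3 years
```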
elcritch
With more motors and joints also comes some degree of redundancy, however. Having multiple fingers means one finger dying won't be as big of an impediment. It'd require feedback and the ability for the motion planner / AI to account for it.
Plus they'll likely be modular and able to be replaced.
IMHO, the bigger design issue for humanoids is lowering the need for mechanical precision, which requires lots more metal, and instead using adaptive feedback and sensors to obtain accuracy similar to how humans and animals do it. AIs should be really good at that, eventually. I think the compute will need to be about 10x what it is now, though.
mewpmewp2
Would it be possible to reduce the failure rates?
ac29
The 1%/year failure rate appears to just be made up. There are plenty of electric motors that don't have anywhere near that failure rate, at least during the expected service life (failure rates will probably hit 1%/year or higher eventually).
For example, do the motors in hard drives fail anywhere close to 1% a year in the first ~5 years? Backblaze data gives a total drive failure rate around 1% and I imagine most of those are not due to failure of motors.
For example, an industrial robot arm with 6 motors achieves much higher reliability than a consumer roomba with 3 motors. They do this with more metal parts, more precision machining, much more generous design tolerances, and suchlike. Which they can afford by charging 100x as much per unit.
I'm interested in how differences between robots develop over time. There are a lot of machines in this world that have been patched or "jimmied up" to continue working. Take a mining robot: it would probably get quite heavily contaminated with dust, wear would occur in different places, rock falls might bend parts.
So even though another robot could probably do the "jimmy up", it seems like over time the robots will "drift" into all being a bit different.
Even commercial airliners seem to go through fairly unique repairs from things like collisions with objects, tail strikes, etc.
Maybe it's just easier to recycle robots?
Toritori12
Does anyone know how easy it is to join the "trusted tester program", and whether they offer modules that you can easily plug in to run the SDK?
technotony
There's a sign up button at the bottom of the article...
suyash
What sort of hardware does the SDK run on? Can it run on a modern Raspberry Pi?
ethan_smith
According to the blog post, it requires an NVIDIA Jetson Orin with at least 8GB RAM, and they've optimized for Jetson AGX Orin (64GB) and Orin NX (16GB) modules.
v9v
Could you quote where in the blog post they claim that? CTRL+F "Jetson" gave no results in TFA.
moffkalast
Yeah they didn't really mention anything, I was almost getting my hopes up that Google might be announcing a modernized Coral TPU for the transformer age, but I guess not. It's probably all just API calls to their TPUv6 data centers lmao.
martythemaniak
You can think of these as essentially multi-modal LLMs, which is to say you can have very small/fast ones (SmolVLA - 0.5B params) that are good at specific tasks, and larger/slower more general ones (OpenVLA - a finetuned llama2 7B). So a rpi could be used for some very specific tasks, but even the more general ones could run on beefy consumer hardware.
What is the model architecture? I'm assuming it's far away from LLMs, but I'm curious to know more. Can anyone provide links that describe architectures for VLAs?
KoolKat23
Actually very close to one I'd say.
It's a "visual language action" VLA model "built on the foundations of Gemini 2.0".
As Gemini 2.0 has native language, audio and video support, I suspect it has been adapted to include native "action" data too, perhaps only on output fine-tuning rather than input/output at training stage (given its Gemini 2.0 foundation).
Natively multimodal LLM's are basically brains.
quantumHazer
> Natively multimodal LLM's are basically brains.
Absolutely not.
KoolKat23
Lol keep telling yourself that. It's not a human brain nor is it necessarily a very intelligent brain, but it is a brain nonetheless.
martythemaniak
OpenVLA is basically a slightly modified, fine-tuned llama2. I found the launch/intro talk by lead author to be quite accessible: https://www.youtube.com/watch?v=-0s0v3q7mBk
KoolKat23
In the paper at the bottom of Google's page, this VLA says it is built on the foundations of Gemini 2.0 (hence my quotations). They'd be using Gemini 2.0 rather than Llama.
A more modern one, smolVLA is similar and uses a VLM but skips a few layers and uses an action adapter for outputs. Both are from HF and run on LeRobot.
meanwhile i will drink a coffee while it loads a reply from the API
jagger27
These are going to be war machines, make absolutely no mistake about it. On-device autonomy is the perfect foil to escape centralized authority and accountability. There’s no human behind the drone to charge for war crimes. It’s what they’ve always dreamed of.
Who’s going to stop them? Who’s going to say no? The military contracts are too big to say no to, and they might not have a choice.
The elimination of toil will mean the elimination of humans altogether. That's where we're headed. There will be no profitable life left for you, and you will be liquidated by "AI-Powered Automation for Every Decision"[0]. Every. Decision. It's so transparent. The optimists in this thread are baffling.
MIT spinoff Google-owned Boston Dynamics pledged not to militarize their robots. Which is very hard to believe given they're backed by DARPA, the DoD/Military investment arm.
arcticfox
This pledge would last five seconds in an actual conflict, if it makes it even that far.
paxys
Was owned by Google. Then Softbank. Now Hyundai.
jagger27
Militarize is just bad marketing. Call them cleaning machines and put them to work on dirty things.
JumpCrisscross
> These are going to be war machines, make absolutely no mistake about it
Of course they will. Practically everything useful has a military application. I'm not sure why this is considered a hot take.
jagger27
The difference between this machine and the ones that came before is that there won’t have to be a human in the loop to execute mass murder.
m00x
There's a clear task being given to the robot. If anything this will save lives. There are plenty of soldiers that love to kill for the hell of it; at least this will be easy to trace back to whoever gave the order.
JumpCrisscross
> there won’t have to be a human in the loop to execute mass murder
This looks like an increasingly theoretical concern. (And probably always has been. Wars were far more brutal when folks fought face to face than they are today.)
bamboozled
How would these things be competitive with drones on the battlefield? They probably cost the equivalent of 1000 autonomous drones and 100x the time and materials to make, way more power would be required to make them work too.
Terminator is a good movie but in reality, a cheap autonomous drone would mess one of those up pretty good.
I've seen some of the footage from Ukraine. Drones are deadly, efficient, terrifying on the battlefield. Even if those robots get crazy maneuverable, it's going to be pretty hard to outrun an exploding drone.
Maybe the Terminators will have shotguns, but I could imagine 5 drones per Terminator being pretty easy to achieve, considering they will be built by other autonomous robots.
m00x
Good!
Workaccount2
I continue to be impressed by how Google stealth-releases fairly groundbreaking products, and then (usually) just kind of forgets about them.
Rather than advertising blitz and flashy press events, they just do blog posts that tech heads circulate, forget about, and then wonder 3-4 years later "whatever happened to that?"
This looks awesome. I look forward to someone else building a start-up on this and turning it into a great product.
fusionadvocate
Because the whole purpose of these kinds of projects at Google is to keep regulators at bay. They don't need these products in the sense of making money from them. They will just burn some money and move on, exactly the way they did hundreds of times. But what kind of company has such a free pass to burning money? The kind of company that is a monopoly. Monopolies are THAT profitable.
antonkar
The only way to prevent robots from being jailbroken and set to rob banks is to move GPUs to private SOTA secure GPU clouds
sajithdilshan
I wonder what kind of guardrails (like Three Laws of Robotics) there are to prevent the robots going crazy while executing the prompts
ctoth
The laws of robotics were literally designed to cause conflict and facilitate strife in a fictional setting--I certainly hope no real goddamn system is built like that.
> To ensure robots behave safely, Gemini Robotics uses a multi-layered approach. "With the full Gemini Robotics, you are connecting to a model that is reasoning about what is safe to do, period," says Parada. "And then you have it talk to a VLA that actually produces options, and then that VLA calls a low-level controller, which typically has safety critical components, like how much force you can move or how fast you can move this arm."
conception
Of course someone will. The terror nexus doesn’t build itself, yet, you know.
hlfshell
The generally accepted term for the research around this in robotics is Constitutional AI (https://arxiv.org/abs/2212.08073), which has been cited/experimented with in several robotics VLAs.
JumpCrisscross
Is there any evidence we have the technical ability to put such ambiguous guardrails on LLMs?
hn_throwaway_99
A power cord?
sajithdilshan
what if they are battery powered?
msgodel
Usually I put master disconnect switches on my robots just to make working on them safe. I use cheap toggle switches, though; I'm too cheap for the big red spinny ones.
That's what we use twelve gauge buckshot for, here in America.
asadm
in practice, those laws are bs.
TZubiri
Nice. I work with some students younger than 13, so most cloud LLMs are quite tricky to work with; local-only models are nice for this use case. I will try this as a replacement for ChatGPT as computer vision in robotics like Lego Mindstorms.
zzzeek
THANK YOU.
Please make robots. LLMs should be put to work for *manual* tasks, not art/creative/intellectual tasks. The goal is to improve humanity, not put us to work putting screws inside of iPhones.
(five years later)
what do you mean you are using a robot for your drummer
martythemaniak
I've spent the last few months looking into VLAs and I'm convinced that they're gonna be a big deal, ie they very well might be the "chatgpt moment for robotics" that everyone's been anticipating. Multimodal LLMs already have a ton of built-in understanding of images and text, so VLAs are just regular MMLLMs that are fine-tuned to output a specific sequence of instructions that can be fed to a robot.
OpenVLA, which came out last year, is a Llama 2 fine-tune with extra image encoding that outputs a 7-tuple of integers. The integers are rotation and translation inputs for a robot arm. If you give a vision Llama 2 a picture of an apple and a bowl and say "put the apple in the bowl", it already understands apples and bowls, knows the end state should be the apple in the bowl, etc. What's missing is the series of tuples that will correctly manipulate the arm to do that, and the way they got it is through a large number of short instruction videos.
The neat part is that although everyone is focusing on robot arms manipulating objects at the moment, there's no reason this method can't be applied to any task. Want a smart lawnmower? It already understands "lawn", "mow", "don't destroy toy in path", etc.; it just needs a finetune on how to correctly operate a lawnmower. Sam Altman made some comments about having self-driving technology recently, and I'm certain it's a ChatGPT-based VLA. After all, if you give ChatGPT a picture of a street, it knows what's a car, a pedestrian, etc. It doesn't know how to output the correct turn/go/stop commands, and it does need a great deal of diverse data, but there's no reason why it can't do it. https://www.reddit.com/r/SelfDrivingCars/comments/1le7iq4/sa...
Anyway, super exciting stuff. If I had time, I'd rig a snowblower with a remote control setup, record a bunch of runs and get a VLA to clean my driveway while I sleep.
For completeness, MMLLM = Multimodal Large language model.
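To make the "7-tuple of integers" concrete: OpenVLA-style models discretize each continuous action dimension into bins and emit one token per dimension, which the robot side maps back to continuous deltas. A rough sketch of that de-tokenization step (the bin count and the per-dimension ranges here are illustrative, not OpenVLA's exact values):

```python
import numpy as np

# Illustrative ranges for (dx, dy, dz, droll, dpitch, dyaw, gripper)
ACTION_LOW  = np.array([-0.05, -0.05, -0.05, -0.3, -0.3, -0.3, 0.0])
ACTION_HIGH = np.array([ 0.05,  0.05,  0.05,  0.3,  0.3,  0.3, 1.0])
N_BINS = 256  # each dimension quantized into 256 uniform bins

def detokenize(action_tokens):
    """Map 7 discrete tokens (0..255) back to continuous arm commands,
    taking each bin's center point."""
    t = np.asarray(action_tokens, dtype=np.float64)
    return ACTION_LOW + (t + 0.5) * (ACTION_HIGH - ACTION_LOW) / N_BINS

# Token 128 in every dimension lands just above the midpoint of each range:
print(detokenize([128] * 7))
```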
Workaccount2
I don't think transformers will be viable for self driving cars until they can both:
1) Properly recognize what they are seeing without having to lean so hard on their training data. Go photoshop a picture of a cat and give it a 5th leg coming out of its stomach. No LLM will be able to properly count the cat's legs (they will keep saying 4 legs no matter how many times you insist they recount).
2) Be extremely fast at outputting tokens. I don't know where the threshold is, but it's probably going to be a non-thinking model (at first) and will probably need something like Cerebras or a diffusion architecture to get there.
cgearhart
The current-gen VLA architectures include some tricks (like compressed action tokenization and diffusion decoding) to reach action frequencies between 50-200 Hz. I think they're _more_ efficient this way than regular LLMs trying to do everything through text.
martythemaniak
1. Well, based on Karpathy's talks on Tesla FSD, his solution is to actually make the training set reflect everything you'd see in reality. The tricky part is that if something occurs 0.0000001% of the time IRL and something else occurs 50% of the time, they both need to make up 5% of the training corpus. The thing with multimodal LLMs is that lidar/depth input can just be another input that gets encoded along with everything else, so for driving, "there's a blob I don't quite recognize" is still a blob you have to drive around.
2. Figure has a dual-model architecture which makes a lot of sense: a 7B model that does higher-level planning and control and runs at 8 Hz, and a tiny 0.08B model that runs at 200 Hz and does the minute control outputs. https://www.figure.ai/news/helix
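That dual-rate pattern is easy to sketch: a slow "planner" refreshes a latent goal a few times a second, while a fast controller consumes the most recent latent on every tick. A toy version (the rates mirror Figure's published 8 Hz / 200 Hz split; both "models" are stubs, not real networks):

```python
# Toy dual-rate control loop: slow planner at 8 Hz, fast controller at 200 Hz.
PLANNER_HZ, CONTROLLER_HZ = 8, 200
TICKS_PER_PLAN = CONTROLLER_HZ // PLANNER_HZ  # 25 controller ticks per plan

def slow_planner(observation):
    """Stub for the large model: returns a latent 'intent' vector."""
    return [observation * 0.1] * 4

def fast_controller(latent, tick):
    """Stub for the small model: turns the latest latent into a motor command."""
    return sum(latent) + 0.001 * tick

def run(seconds=1):
    commands = []
    latent = slow_planner(0)
    for tick in range(CONTROLLER_HZ * seconds):
        if tick % TICKS_PER_PLAN == 0:  # planner fires at 8 Hz
            latent = slow_planner(observation=tick)
        commands.append(fast_controller(latent, tick))  # controller at 200 Hz
    return commands

print(len(run()))  # 200 commands per simulated second
```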
generalizations
I will be surprised if VLAs stick around, based on your description. That sounds far too low-level. Better hand that off to the 'nervous system' / kernel of the robot - it's not like humans explicitly think about the rotation of their hip & ankle when they walk. Sounds like a bad abstraction.
MidoriGlow
Elon Musk said in last week’s Starship Update: the very first Mars missions are planned to be flown by Optimus humanoid robots to scout and build basic infrastructure before humans arrive
(full transcript + audio: https://transpocket.com/share/oUKhep6cUl3s/).
If Gemini Robotics On-Device can truly adapt to new tasks with ~50–100 demos, pairing that with mass-produced Optimus bodies and Starship’s lift capacity could be powerful—offline autonomy, zero-latency control, and the ability to ship dozens of robots per launch.
lm28469
Elon Musk said in 2016 that we'd have fully autonomous cars by the end of the year and we'd be on Mars by 2018, with manned missions by 2024.
Fast forward to 2025: we still have no self-driving cars, and nothing is even close to getting to Mars, let alone manned missions.
[0]https://robotsdoneright.com/Articles/what-are-the-different-...
google-deepmind/mujoco_menagerie: https://github.com/google-deepmind/mujoco_menagerie
mujoco_menagerie/aloha: https://github.com/google-deepmind/mujoco_menagerie/tree/mai...
https://arxiv.org/pdf/2503.20020
https://arxiv.org/abs/2506.01844
Explanation by PhosphoAI: https://www.youtube.com/watch?v=00A6j02v450
0: https://www.palantir.com/
Not https://public.nrao.edu/telescopes/VLA/ :(