Comment by AlotOfReading

AlotOfReading Jun 19, 2025

Aren't continuous, stochastic, partial-knowledge environments where you need long-horizon planning with strict deadlines and limited compute exactly the sort of environments MuZero variants struggle with? Because that's driving.

It's also worth mentioning that humans intentionally (and safely) drive into "solid" objects all the time. Bags, steam, shadows, small animals, etc. We also break rules (e.g. drive on the wrong side of the road), and anticipate things we can't even see based on a theory of mind of other agents. Human driving is extremely sophisticated, not reducible to rules that are easily expressed in "simple" language.

ActorNightly 3 days ago

I didn't say use MuZero end to end; I said leverage it.

This is how I would do it:

First, you come up with a compressed representation of the state space of the terrain + other objects around your car that encodes the current state of everything and its predicted evolution ~5 seconds into the future.

The idea is that you would leverage physics: objects have to behave according to the laws of motion, so you can greatly compress how this is represented. For example, a static mesh grid for "terrain" (anything other than empty road), lane lines representing the road, and 3D boxes representing moving objects, each with a certain mass, an initial 6-DOF state (xyz position plus orientation), initial 6-DOF velocities, and 6-DOF forcing functions, parameterized by time, that represent how the object moves.
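A rough sketch of what that representation could look like in Python. Every name here (`MovingObject`, `Scene`, `braking`, the field names and the example numbers) is made up for illustration, not an existing format:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vec6 = Tuple[float, float, float, float, float, float]


@dataclass
class MovingObject:
    """A 3D box with mass, 6-DOF state, and a time-parameterized forcing function."""
    mass_kg: float
    size_m: Tuple[float, float, float]  # box length, width, height
    pose: Vec6                          # x, y, z, roll, pitch, yaw
    velocity: Vec6                      # linear and angular rates on the same six axes
    forcing: Callable[[float], Vec6]    # t (seconds) -> generalized force per axis


@dataclass
class Scene:
    static_terrain: List[Tuple[int, int]]           # occupied grid cells (not empty road)
    lane_lines: List[List[Tuple[float, float]]]     # one polyline per lane line
    objects: List[MovingObject]


def braking(t: float) -> Vec6:
    # Hypothetical forcing function: constant 3000 N braking force along -x.
    return (-3000.0, 0.0, 0.0, 0.0, 0.0, 0.0)


lead_car = MovingObject(
    mass_kg=1500.0,
    size_m=(4.5, 1.8, 1.5),
    pose=(20.0, 0.0, 0.0, 0.0, 0.0, 0.0),
    velocity=(15.0, 0.0, 0.0, 0.0, 0.0, 0.0),
    forcing=braking,
)
scene = Scene(
    static_terrain=[(3, 7)],
    lane_lines=[[(0.0, 1.75), (100.0, 1.75)]],
    objects=[lead_car],
)
```

The point is that the whole scene is a handful of numbers plus one function per object, instead of raw sensor frames.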

So given this representation, you can write a program that simulates the evolution of the state space given any initial condition, and essentially simulate collisions.
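To make that concrete, here is a minimal sketch of such a simulator, heavily simplified as an assumption: flat 2D, axis-aligned boxes, and an acceleration function standing in for force divided by mass. `Box`, `step`, `overlaps` and `first_collision` are illustrative names only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Box:
    x: float
    y: float
    vx: float
    vy: float
    half_len: float
    half_wid: float
    accel: Callable[[float], Tuple[float, float]]  # t -> (ax, ay)


def step(b: Box, t: float, dt: float) -> None:
    # Semi-implicit Euler: update velocity first, then position.
    ax, ay = b.accel(t)
    b.vx += ax * dt
    b.vy += ay * dt
    b.x += b.vx * dt
    b.y += b.vy * dt


def overlaps(a: Box, b: Box) -> bool:
    # Axis-aligned box overlap test.
    return (abs(a.x - b.x) <= a.half_len + b.half_len
            and abs(a.y - b.y) <= a.half_wid + b.half_wid)


def first_collision(boxes: List[Box], horizon_s: float = 5.0,
                    dt: float = 0.05) -> Optional[float]:
    """Advance all boxes in place; return the time of the first overlap, or None."""
    for k in range(int(round(horizon_s / dt))):
        for b in boxes:
            step(b, k * dt, dt)
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if overlaps(boxes[i], boxes[j]):
                    return (k + 1) * dt
    return None


# Ego at 20 m/s closing on a lead car 30 m ahead that is braking at 2 m/s^2.
ego = Box(0.0, 0.0, 20.0, 0.0, 2.25, 0.9, lambda t: (0.0, 0.0))
lead = Box(30.0, 0.0, 10.0, 0.0, 2.25, 0.9, lambda t: (-2.0, 0.0))
t_hit = first_collision([ego, lead])
```

Note that `first_collision` mutates the boxes it is given, so copy the state first if you want to roll out several alternatives from the same starting point.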

Then you divide into 3 teams.

1st team trains a model to translate sensor data into this state space representation, with continuous updates on every cycle, leveraging things like Kalman filtering, since the state variables are correlated with each other and over time and a filter can exploit that for better accuracy. Overall you would get something where cues like red brake lights would lead to deceleration forcing functions.
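For the Kalman filtering part, a minimal sketch of the per-cycle update, assuming a 1D constant-velocity model with position-only measurements and process noise `q` added to the diagonal. A real tracker would use the full 6-DOF state; `kf_predict`, `kf_update` and the noise values are illustrative assumptions.

```python
def kf_predict(x, P, dt, q):
    """Predict step for state x = (position, velocity) with F = [[1, dt], [0, 1]]."""
    p, v = x
    p = p + v * dt
    # P <- F P F^T + q * I, written out for the 2x2 case.
    P00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + q
    P01 = P[0][1] + dt * P[1][1]
    P10 = P[1][0] + dt * P[1][1]
    P11 = P[1][1] + q
    return (p, v), [[P00, P01], [P10, P11]]


def kf_update(x, P, z, r):
    """Update step for a position measurement z with variance r (H = [1, 0])."""
    p, v = x
    S = P[0][0] + r                       # innovation variance
    k0, k1 = P[0][0] / S, P[1][0] / S     # Kalman gain
    y = z - p                             # innovation
    p, v = p + k0 * y, v + k1 * y
    # P <- (I - K H) P
    new_P = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
             [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]
    return (p, v), new_P


# Track an object moving at a true 10 m/s from position-only measurements.
dt = 0.1
x_est, P_est = (0.0, 0.0), [[100.0, 0.0], [0.0, 100.0]]
for k in range(50):
    x_est, P_est = kf_predict(x_est, P_est, dt, q=0.01)
    z = 10.0 * (k + 1) * dt               # noiseless measurement, for the demo only
    x_est, P_est = kf_update(x_est, P_est, z, r=1.0)
```

The velocity is never measured directly; the filter infers it from the correlation between position and velocity in `P`, which is the "correlation leads to better accuracy" effect.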

(If you wanted to get fancy, instead of a simulation, you build out a probability space instead. I.e., when you run the program, it would spit out a heat map of where certain objects are more likely to end up.)
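One simple way to get such a heat map is Monte Carlo sampling: perturb the unknown part (here the object's acceleration), roll each sample forward, and count which grid cell it lands in. This is a sketch under that assumption; `heat_map` and its parameters are hypothetical.

```python
import random
from collections import Counter


def heat_map(x, y, vx, vy, accel_sigma, horizon_s=5.0, cell_m=2.0,
             samples=2000, seed=0):
    """Return {(cell_x, cell_y): probability} for the position after horizon_s."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(samples):
        # Unknown driver behavior modeled as a random constant acceleration.
        ax = rng.gauss(0.0, accel_sigma)
        ay = rng.gauss(0.0, accel_sigma)
        fx = x + vx * horizon_s + 0.5 * ax * horizon_s ** 2
        fy = y + vy * horizon_s + 0.5 * ay * horizon_s ** 2
        counts[(int(fx // cell_m), int(fy // cell_m))] += 1
    return {cell: n / samples for cell, n in counts.items()}


# A car at the origin doing 10 m/s, with 0.5 m/s^2 of uncertainty in acceleration.
hm = heat_map(0.0, 0.0, 10.0, 0.0, accel_sigma=0.5)
most_likely_cell = max(hm, key=hm.get)
```

With `accel_sigma=0.0` all of the probability collapses into a single cell, which is the deterministic simulation again.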

2nd team trains a model on real world traffic to find correlations between forcing functions of vehicles. I.e., if a car slows down, the cars behind it would slow down. You could do this kinda like Tesla did: equip all your cars with sensors, treat driver inputs as the forcing function, and observe how the state space changes given the model from team 1.

3rd team trains a MuZero-like model given the 2 above. Given a random initial starting state, the "game" is to choose the sequence of accelerations, decelerations, and steering (quantized to a finite set of values) that gets the highest score by a) avoiding collision, b) following traffic laws, c) minimizing disturbance to other vehicles, and d) maximizing space around your own vehicle.
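To show the shape of that "game", here is a toy version with an exhaustive search standing in for the learned MuZero-style search. Everything is an assumption for illustration: longitudinal motion only (no steering), three quantized accelerations, made-up weights, and harsh inputs used as a crude proxy for disturbance to others.

```python
from itertools import product

ACCELS = (-4.0, 0.0, 2.0)  # quantized longitudinal accelerations, m/s^2


def score(seq, ego_v, gap, lead_v, speed_limit=15.0, dt=1.0):
    """Score one action sequence against a lead car moving at constant speed."""
    total = 0.0
    for a in seq:
        ego_v = max(0.0, ego_v + a * dt)
        gap += (lead_v - ego_v) * dt
        if gap <= 0.0:
            return -1000.0               # a) collision dominates everything
        if ego_v > speed_limit:
            total -= 10.0                # b) traffic law (speed limit)
        total -= abs(a)                  # c) proxy for disturbance: harsh inputs
        total += 0.1 * min(gap, 30.0)    # d) space around the vehicle
    return total


def best_sequence(ego_v, gap, lead_v, depth=4):
    # Brute force over all len(ACCELS) ** depth sequences; a real system would
    # use a guided tree search instead of enumerating everything.
    return max(product(ACCELS, repeat=depth),
               key=lambda s: score(s, ego_v, gap, lead_v))


# Ego at 15 m/s, 10 m behind a car doing 5 m/s.
plan = best_sequence(ego_v=15.0, gap=10.0, lead_v=5.0)
```

In this scenario doing nothing closes the 10 m gap within the first second, so any plan that avoids the collision penalty has to start by braking.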

What all of this does is allow the model to compute not only expected behavior, but also things that are realistically possible. For example, in a situation where collision is imminent, like you sitting at a red light while the sensors detect a car rapidly approaching from behind, the model would decide to drive into the intersection when there are no cars present to avoid getting rear-ended, which is quantifiably way better than the average human.

Furthermore, the models from team 2 and 3 can self-improve in real time, which is equivalent to humans getting used to the driving habits of others in certain areas. You simply do batch training runs to improve prediction capability of other drivers. Then when your policy model makes a correct decision, you build a shortcut into the MCTS that lets you know that this works. That means that within the finite compute time budget, you can search away from that part of the tree for a more optimal solution, and if you don't find it, you already have the best one that works, and next time you search even more space. So essentially you get a processing speedup the more you use it.
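The shortcut idea can be sketched as an anytime search with a cache: keep the best known answer per (discretized) state, spend each call's budget only on candidates not yet tried, and fall back to the cached answer if nothing better turns up. This is not MCTS, just the warm-start bookkeeping; `anytime_search`, the cache layout and the toy scoring function are assumptions.

```python
def anytime_search(state_key, candidates, evaluate, cache, budget):
    """Evaluate up to `budget` unexplored candidates; return (best, best_score)."""
    entry = cache.setdefault(
        state_key, {"best": None, "score": float("-inf"), "seen": set()}
    )
    # Skip everything already evaluated for this state on earlier calls.
    unexplored = [c for c in candidates if c not in entry["seen"]]
    for cand in unexplored[:budget]:
        entry["seen"].add(cand)
        s = evaluate(cand)
        if s > entry["score"]:
            entry["best"], entry["score"] = cand, s
    return entry["best"], entry["score"]


# Toy problem: candidates are integers and the best one is 7.
cache = {}
candidates = list(range(10))


def evaluate(c):
    return -(c - 7) ** 2


first = anytime_search("red_light", candidates, evaluate, cache, budget=3)
second = anytime_search("red_light", candidates, evaluate, cache, budget=3)
third = anytime_search("red_light", candidates, evaluate, cache, budget=3)
```

Each call has the same fixed budget, but because earlier work is remembered, the answer never gets worse and the explored space keeps growing across calls.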
