From Simulation to Foundation Models for Robots

Why train in a simulator first

A learning robot is hungry for experience. To get good at a task it may need millions of tries — and on real hardware every try costs time, electricity, and worn-out gears, while a clumsy mistake can dent the robot or break whatever it is holding. So most robot learning begins inside a robot simulator: a physics engine that models gravity, contact, and friction well enough that a control policy practicing there learns something useful about the real world.

Simulation has three gifts. It is fast: a single machine can run hundreds of copies of the robot in parallel, far faster than real time. It is safe: a fall or a collision just resets the scene. And it is fully observed: the simulator knows the exact position of every object, so the reward for each attempt can be computed directly from that state, with no extra sensors. Together these turn a slow physical experiment into a cheap data factory.
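To make the "reward for free" point concrete, here is a minimal sketch in Python. It assumes a reaching task in which the simulator exposes the gripper and object positions as (x, y, z) tuples; the name reach_reward and the 2 cm success radius are illustrative choices, not taken from any particular simulator.

```python
import math


def reach_reward(gripper_xyz, object_xyz, success_radius=0.02):
    """Reward computed purely from simulator state (hypothetical reaching task).

    The closer the gripper is to the object, the higher the reward, with a
    bonus of 1.0 once it is within success_radius (metres, assumed).
    """
    distance = math.dist(gripper_xyz, object_xyz)
    bonus = 1.0 if distance < success_radius else 0.0
    return -distance + bonus


# The simulator already knows both positions, so no camera or tracker is needed.
print(reach_reward((0.0, 0.0, 0.0), (0.3, 0.4, 0.0)))
```

On a real robot the same quantity would have to be estimated from cameras or other sensors, which is exactly the cost the simulator removes.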

But a policy trained only in simulation has a problem: the simulator is not reality. Its friction is a guess, its motors are idealized, and its camera images look a little too clean. When you copy the learned policy onto a real robot, performance drops — sometimes catastrophically. This mismatch is called the reality gap, and bridging it is the whole game.

Domain randomization: making the policy unshakeable

The most influential trick for crossing the gap is wonderfully counterintuitive. Instead of building one perfect simulator and hoping it matches reality, you build a thousand sloppy ones and never tell the robot which one is closest to the truth. This is domain randomization: during training you randomly vary the simulated world (friction, mass, lighting, camera angle, the colors and textures of objects) at the start of every episode.
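A minimal sketch of that per-episode sampling in Python might look like the following. The parameter names and numeric ranges are hypothetical placeholders; a real setup would pass the sampled values to its physics engine and renderer before each episode begins.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    friction: float
    mass_kg: float
    light_intensity: float
    camera_yaw_deg: float


# Hypothetical randomization ranges (low, high); choosing them is the hard part.
RANGES = {
    "friction": (0.5, 1.5),
    "mass_kg": (0.05, 0.5),
    "light_intensity": (0.3, 1.0),
    "camera_yaw_deg": (-10.0, 10.0),
}


def sample_params(rng):
    """Draw one randomized world by sampling each parameter uniformly."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


def training_worlds(num_episodes, seed=0):
    """Yield a freshly randomized set of parameters for every episode."""
    rng = random.Random(seed)
    for _ in range(num_episodes):
        yield sample_params(rng)


for params in training_worlds(3):
    print(params)  # each episode sees a different world
```

Uniform sampling is only one possible choice; the essential point is that the policy never trains on the same world twice in a row.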

Why does scrambling the world help? Because a policy that must succeed across a huge range of physics cannot memorize any one version of it. It is forced to learn the robust core of the skill — keep the grasp centered, push until contact, correct when the object slips. To such a policy, the real world looks like just one more random variation it has already seen. Reality becomes unremarkable.

The famous demonstration was a robot hand that learned to manipulate a cube entirely in randomized simulation, then performed in-hand reorientation on real hardware it had never touched. There is a cost: a policy trained to survive every imaginable world is more conservative than one tuned to a single accurate model, so it may be a little less crisp. The art is choosing how wide to make the randomization — wide enough to contain reality, narrow enough to stay sharp.

World models: letting the robot imagine

So far the simulator is something we humans built. What if the robot built its own? That question marks the split between model-free and model-based reinforcement learning. A model-free learner maps situations to actions directly, by trial and error. A model-based learner first learns a predictor of what happens next, and then uses it to plan.

That learned predictor is a world model: a compact internal simulator the robot trains from its own experience. Feed it the current state and a proposed action, and it forecasts the next state — and the one after that. With a world model in hand, the robot can rehearse a plan in its head, rolling out dozens of imagined futures and picking the action that leads somewhere good, all before a single real motor moves.
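The "rehearse in its head" loop can be sketched in a few lines of Python. Two assumptions are doing the work here: world_model is a hand-written stand-in for a learned predictor (a one-dimensional position nudged by the action), and the planner uses random shooting, one of the simplest planning schemes, chosen only for illustration.

```python
import random


def world_model(state, action):
    """Stand-in for a learned predictor: forecasts the next 1D position."""
    return state + action


def plan(state, goal, horizon=5, num_candidates=64, seed=0):
    """Random-shooting planner: imagine many futures, keep the best one.

    Returns the first action of the best imagined action sequence and the
    predicted distance to the goal at the end of that sequence.
    """
    rng = random.Random(seed)
    best_cost, best_first_action = float("inf"), 0.0
    for _ in range(num_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        imagined = state
        for a in actions:  # roll the plan forward inside the model only
            imagined = world_model(imagined, a)
        cost = abs(imagined - goal)
        if cost < best_cost:
            best_cost, best_first_action = cost, actions[0]
    return best_first_action, best_cost


first_action, predicted_cost = plan(state=0.0, goal=3.0)
print(first_action, predicted_cost)
```

In practice the robot would execute only the returned first action, observe what really happened, and plan again from the new state.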

The payoff is sample efficiency. Because the robot can generate vast amounts of practice inside its own imagination, it needs far fewer real-world trials to reach competence — which matters enormously when each trial is slow or risky. The catch is that the world model is itself learned and imperfect; if the robot plans too far ahead, small prediction errors compound, and the imagined future drifts away from anything that could really happen.
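The compounding of errors is easy to see in a toy example. In the Python sketch below, both the true dynamics and the slightly wrong model are invented for illustration: the model overestimates the state by 5% at every step, and the gap between imagined and real futures widens as the planning horizon grows.

```python
def true_step(x, a):
    """Hypothetical real dynamics: the action simply adds to the state."""
    return x + a


def model_step(x, a):
    """Hypothetical learned model with a small 5% error in how it scales the state."""
    return 1.05 * x + a


def rollout_error(horizon, x0=1.0, action=0.1):
    """Gap between the imagined and the true state after `horizon` steps."""
    x_true = x_model = x0
    for _ in range(horizon):
        x_true = true_step(x_true, action)
        x_model = model_step(x_model, action)  # the error feeds into the next step
    return abs(x_model - x_true)


for h in (1, 5, 20):
    print(h, rollout_error(h))  # the gap grows with the horizon
```

This is one reason planners built on learned models often keep their horizons short and replan frequently.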

The frontier: vision-language-action and robot foundation models

The classic recipe trains one policy per task: this robot, this object, this lab. The frontier asks a bolder question — can a single model learn to control robots in general? The answer taking shape is the vision-language-action model, or VLA: a large neural network that takes in a camera image and a sentence of instruction and outputs the next motor commands directly.

These models borrow the trick that transformed text and images: pretrain on enormous, varied data, then specialize. A VLA starts from a vision-language model that has already been trained on web-scale images and text, so it arrives knowing what a mug, a drawer, and the word 'fold' mean. It is then fine-tuned on huge collections of robot trajectories, many of them teleoperated demonstrations gathered by humans guiding real robots through tasks.

Because the demonstrations show what to do rather than handing out rewards, this stage is largely imitation learning at massive scale: the model copies expert behavior just as a behavior cloning policy does, but across thousands of tasks at once. The hope is a robot foundation model: pretrain once, then prompt it with a new instruction and have it generalize, the way a language model handles a question it was never explicitly trained on. Early systems can already follow plain-English commands like 'put the banana in the bowl', sometimes even with objects that never appeared in their robot training data.
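Stripped of scale, behavior cloning is ordinary supervised learning on (observation, action) pairs. The Python sketch below is a deliberately tiny stand-in, assuming a one-dimensional observation and a linear policy fitted by least squares; the demonstrations are made up, and a real VLA replaces the line fit with a large neural network trained on images and text.

```python
def fit_linear_policy(demos):
    """Least-squares fit of action = w * observation + b to expert demos."""
    n = len(demos)
    mean_o = sum(o for o, _ in demos) / n
    mean_a = sum(a for _, a in demos) / n
    cov = sum((o - mean_o) * (a - mean_a) for o, a in demos)
    var = sum((o - mean_o) ** 2 for o, _ in demos)
    w = cov / var
    b = mean_a - w * mean_o
    return w, b


# Made-up expert: always moves to cancel half of the observed offset.
demos = [(o, -0.5 * o) for o in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w, b = fit_linear_policy(demos)


def policy(observation):
    """Cloned policy: imitates the expert, with no reward signal involved."""
    return w * observation + b


print(policy(4.0))  # an observation outside the demonstrated range
```

No reward appears anywhere in this code: the only training signal is what the expert did in each situation.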

Where is the field heading? Toward scale and unification. The bottleneck is no longer ideas but data — robot experience is far scarcer than text, so labs are pooling demonstrations across many robot types and pouring in randomized simulation to fill the gaps. The plausible future is one big model that drives many different bodies, picks up a new chore from a handful of examples, and is steered by ordinary language. Simulation, the reality gap, and foundation models are converging into a single pipeline for teaching machines to act.