Learning From People: Imitation and Behavior Cloning

When Designing a Reward Is the Hard Part

In the last chapter the robot learned by trial and error: it tried things, and a reward function told it how well each try went. That works beautifully when you can write down what "good" means as a number. But think about folding a towel, wiping up a spill, or plugging in a cable. How many points is a neatly folded towel worth versus a slightly crooked one? Hand-crafting that score — and patching it so the robot can't cheat — is often harder than the task itself.

There's a friendlier alternative. You probably already know how to do the task — so just show the robot. This is the idea behind imitation learning: instead of specifying a reward and letting the robot discover the behavior the slow way, you provide examples of the behavior you want, and the robot learns to reproduce it. The broader practice of teaching a robot from examples a person provides is called learning from demonstration.

Behavior Cloning: Treat Demonstrations as Homework

The simplest way to imitate is also the most direct, and it has a name: behavior cloning. The trick is to notice that a demonstration is really a long list of paired examples. At each instant, the robot sees something — a camera image, joint angles, the gripper's state — and the human did something — move this way, close the gripper. Pair them up and you get "situation → action" cards, exactly like flashcards.

Once you frame it that way, the learning problem becomes ordinary supervised learning — the same kind of pattern-matching used to label photos. The input is the situation, the "correct answer" is the action the human took, and the network is trained to predict that action. What you get out is a control policy: a function that maps any new situation to an action. No reward, no exploration — just copy what the expert did, and hope the patterns generalize to situations a little different from the ones you showed.

for each demonstration:
    for each moment t in the demonstration:
        observation = what the robot saw at t   (image, joint angles, ...)
        action      = what the human did at t   (move, grasp, ...)
        add (observation -> action) to dataset

train policy to predict action from observation   # plain supervised learning

Behavior cloning, in pseudocode: slice demonstrations into (observation, action) pairs, then train a policy to predict the action from the observation.
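
For readers who want to run something, here is a minimal sketch of the same two steps in plain Python. Everything specific in it is an assumption made for illustration: the toy demonstrations, the choice of a single number for both the observation (distance to a target) and the action (commanded speed), and the straight-line policy fitted by least squares. A real system would use images and a neural network, but the shape of the problem is the same.

```python
from statistics import mean

# Hypothetical toy demonstrations. Each one is a list of moments, and each
# moment is an (observation, action) pair:
#   observation = distance to the target in cm
#   action      = speed the human commanded in cm/s
demonstrations = [
    [(10.0, 5.0), (8.0, 4.0), (6.0, 3.0)],
    [(4.0, 2.0), (2.0, 1.0), (0.0, 0.0)],
]

# Step 1: slice every demonstration into (observation, action) pairs.
dataset = [pair for demo in demonstrations for pair in demo]


def fit_linear_policy(pairs):
    """Fit action = w * observation + b by ordinary least squares."""
    obs_mean = mean(o for o, _ in pairs)
    act_mean = mean(a for _, a in pairs)
    covariance = sum((o - obs_mean) * (a - act_mean) for o, a in pairs)
    variance = sum((o - obs_mean) ** 2 for o, _ in pairs)
    w = covariance / variance
    b = act_mean - w * obs_mean
    # The returned function is the control policy: situation in, action out.
    return lambda observation: w * observation + b


# Step 2: train the policy to predict the action from the observation.
policy = fit_linear_policy(dataset)

# Ask the policy about a situation that never appeared in the demonstrations.
print(policy(5.0))  # prints 2.5
```

The policy was never shown a distance of 5 cm, yet it answers sensibly because that value lies between situations it did see. That is the "hope the patterns generalize" part, and it works here only because the new situation is close to the demonstrated ones.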

Where the Demonstrations Come From

A policy is only as good as the examples it learned from, so the next question is practical: how do you get hundreds of high-quality demonstrations? The most common answer is to have a person drive the robot directly. The human guides the real arm, either by holding and moving it (often called kinesthetic teaching) or by steering it remotely with a joystick, a glove, or a matching handheld controller, while every observation and command is logged. Recordings made by steering the robot remotely are called teleoperated demonstration data.

This is why so much modern robot learning starts with a person spending an afternoon teleoperating dozens of pick-and-place attempts, each one a fresh demonstration. The appeal is that the data is recorded on the actual robot in the actual world, so the situations and actions already "speak the robot's language" — the same cameras, the same motors. The catch is cost: a person has to sit and demonstrate, and you usually need many varied examples before the policy stops being brittle.
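
To make "every observation and command is logged" concrete, here is one possible way to structure a recording. It is a sketch, not a standard format: the class names, the field names (joint_angles, gripper_open, and so on), and the 0.05-second spacing are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One logged moment: what the robot sensed and what the human commanded."""
    timestamp: float   # seconds since the demonstration started
    observation: dict  # e.g. joint angles, gripper state, a camera frame
    action: dict       # e.g. commanded joint velocities, gripper command


@dataclass
class Demonstration:
    """One teleoperated attempt at a task, stored as an ordered list of steps."""
    task: str
    steps: list = field(default_factory=list)

    def log(self, timestamp, observation, action):
        self.steps.append(Step(timestamp, observation, action))


# Hypothetical usage: two moments from one pick-and-place attempt.
demo = Demonstration(task="pick-and-place")
demo.log(0.00,
         {"joint_angles": [0.0, 0.5], "gripper_open": True},
         {"joint_velocities": [0.1, 0.0], "close_gripper": False})
demo.log(0.05,
         {"joint_angles": [0.005, 0.5], "gripper_open": True},
         {"joint_velocities": [0.0, 0.0], "close_gripper": True})

# The (observation, action) pairs are exactly what behavior cloning trains on.
pairs = [(step.observation, step.action) for step in demo.steps]
print(len(pairs), "pairs logged for task:", demo.task)
```

Keeping the observation and the action together in one step is the important design choice here: it guarantees that each command stays paired with what the robot sensed at that same instant.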

The Drift Problem: Small Mistakes Snowball

Behavior cloning has one famous weakness, and it's worth understanding before you trust a cloned policy. During training, the robot only ever saw situations the human visited — and a skilled human stays on a tidy path. But when the policy runs on its own, it will eventually make a tiny error: it ends up a centimeter to the left of anywhere the demonstrator ever was. Now it faces a situation slightly off the demonstrated path — one it was never trained on — so its next guess is a little worse, which pushes it further off, and the errors compound.

Picture learning to drive only by watching a flawless driver who always stays dead-center in the lane. You never saw how to recover from drifting toward the shoulder — because your teacher never drifted. The moment you wander, you're lost, and each correction is a fresh guess that may make things worse. This compounding is the core risk of pure behavior cloning: the robot can drift into states no human ever showed it, where it has no idea what to do.
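
A little arithmetic shows why compounding is so much worse than ordinary mistakes. The sketch below uses a deliberately simplified model, and its assumptions are not claims about any real robot: at every step on the demonstrated path the policy slips off it with a small probability eps, and once it is off the path it never finds its way back. The function names are made up for this example.

```python
def expected_off_path_steps(eps: float, horizon: int) -> float:
    """Expected number of steps spent off the demonstrated path.

    Simplified model (an assumption): each on-path step leaves the path
    with probability eps, and the policy never recovers once it has left.
    """
    prob_on_path = 1.0
    total = 0.0
    for _ in range(horizon):
        prob_on_path *= 1.0 - eps    # chance of still being on the path
        total += 1.0 - prob_on_path  # chance of being off the path at this step
    return total


def expected_isolated_mistakes(eps: float, horizon: int) -> float:
    """Expected number of bad steps if every mistake cost exactly one step."""
    return eps * horizon


eps, horizon = 0.01, 100  # a 1% slip rate over a 100-step task
print("if mistakes stayed isolated:", expected_isolated_mistakes(eps, horizon))
print("if the robot cannot recover:", round(expected_off_path_steps(eps, horizon), 1))
```

With a 1% slip rate over 100 steps, isolated mistakes would spoil about one step on average, while under the no-recovery assumption the robot spends roughly 37 of the 100 steps lost. The earlier it slips, the more of the task it wastes, so the damage grows much faster than the length of the task.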

The fixes all aim at the same target: show the robot how to recover. You can deliberately demonstrate corrections from off-path situations; you can let the policy run and have a human take over whenever it starts to drift, so its own mistakes get labeled with the right fix; or you can fall back on a reward signal to keep practicing in the regions cloning never covered. Done well, reward-based learning and imitation reinforce each other: demonstrations give a strong head start, and reward fills in the gaps.
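
The second fix, labeling the policy's own mistakes with the right action, can be sketched as a short loop. This is a toy under stated assumptions, not a recipe for a real robot: the world is a single number x (distance from the center of the lane), the "human" is a stand-in function that always steers halfway back to center, the cloned policy simply copies the action of the nearest recorded state, and the human labels visited states after the fact instead of physically taking over. All names are hypothetical.

```python
def expert_action(x: float) -> float:
    """Stand-in for the human: always steer halfway back to the center (x = 0)."""
    return -0.5 * x


def nearest_neighbor_policy(dataset):
    """Build a policy that copies the action of the closest recorded state."""
    def policy(x: float) -> float:
        _, action = min(dataset, key=lambda pair: abs(pair[0] - x))
        return action
    return policy


def rollout(policy, start: float, steps: int):
    """Run the policy on its own; return the states it visited and where it ended."""
    x, visited = start, []
    for _ in range(steps):
        visited.append(x)
        x = x + policy(x)
    return visited, x


# Initial demonstrations: the expert only ever visited states near the center.
dataset = [(x, expert_action(x)) for x in (-0.1, -0.05, 0.0, 0.05, 0.1)]

# Start the cloned policy far off the demonstrated path and see where it ends up.
_, final_before = rollout(nearest_neighbor_policy(dataset), start=1.0, steps=10)

# Correction rounds: run the current policy, have the expert label every state
# the policy actually visited, add those pairs to the dataset, and rebuild.
for _ in range(3):
    visited, _ = rollout(nearest_neighbor_policy(dataset), start=1.0, steps=10)
    dataset.extend((x, expert_action(x)) for x in visited)

_, final_after = rollout(nearest_neighbor_policy(dataset), start=1.0, steps=10)
print("distance from center before corrections:", round(abs(final_before), 3))
print("distance from center after corrections: ", round(abs(final_after), 3))
```

Before the correction rounds, the policy is still about half a unit from center after ten steps, because the only examples it can copy are gentle nudges recorded near the center. After a few rounds it has examples from the off-path states it actually reaches, and it ends much closer to center. The extra data is collected where the policy goes, not where the expert would have gone, which is what makes it useful for recovery.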