From Seeing to Acting: Pose Estimation and Visual Servoing

When detection and depth are not enough

In earlier chapters a robot learned to spot things with object detection and to measure how far away they are with depth estimation and a point cloud. That is plenty for a self-driving car that only needs to know "a pedestrian is two metres ahead." But the moment a robot has to pick something up, knowing where an object sits is only half the story. To wrap a gripper around a mug, the robot also needs to know which way the mug is turned — handle left or handle right, upright or on its side.

So this final chapter ties the whole perception pipeline together. Detection, segmentation, stereo, depth, and point clouds all converge on one demanding question: what is the object's full position and orientation in space? Answering it, and then acting on the answer in real time, is what turns seeing into doing.

Six degrees of freedom: an object's full pose

Object pose estimation is the task of recovering an object's complete placement in 3D: three numbers for position (where its centre is — left/right, up/down, near/far) and three for orientation (how it is rotated about each axis). Together these six numbers are called the 6-DoF pose — six degrees of freedom. We met the same idea earlier as a pose expressed by a homogeneous transformation: it answers not just "where" but "which way is it facing."
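
To make the six numbers concrete, here is a minimal sketch that packs a position and three rotation angles into a 4x4 homogeneous transformation and uses it to locate a mug's handle. The function name, the roll/pitch/yaw convention (R = Rz @ Ry @ Rx), and the mug dimensions are assumptions chosen for illustration; other angle conventions are equally common.

```python
import numpy as np

def pose_to_matrix(x, y, z, roll, pitch, yaw):
    """Pack a 6-DoF pose into a 4x4 homogeneous transform (angles in radians).

    Assumed convention: R = Rz(yaw) @ Ry(pitch) @ Rx(roll).
    """
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx      # orientation: three degrees of freedom
    T[:3, 3] = [x, y, z]          # position: the other three
    return T

# A mug 0.5 m along the world x axis, turned 90 degrees about the vertical axis.
T_mug = pose_to_matrix(0.5, 0.0, 0.1, 0.0, 0.0, np.pi / 2)

# The handle sits 5 cm along the mug's own x axis (homogeneous coordinates).
handle_in_mug = np.array([0.05, 0.0, 0.0, 1.0])
handle_in_world = T_mug @ handle_in_mug
print(handle_in_world[:3])
```

Because of the 90-degree turn, the handle ends up offset along the world y axis rather than the world x axis. The position alone could not have told the gripper that.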

Why does grasping live or die on orientation? Picture a screwdriver lying on a bench. Its position tells the arm where to reach, but only its orientation tells the end-effector how to turn the gripper so the jaws close across the handle rather than slamming into it end-on. Get the rotation wrong by ninety degrees and the grasp fails even though the position was perfect. This is why reliable grasping and pick-and-place depend on pose, not position alone.

How is pose actually estimated? Classically, the robot holds a 3D model of the object and searches for the rotation and translation that best line that model up with what the camera sees, matching the known shape against the point cloud or against feature descriptors in the image. Modern systems often train a neural network to regress the pose directly from a single picture. Either way, the result is a 6-DoF estimate the arm can act on, though it is only as good as the sensing and the model behind it.
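
The classical "line the model up with the data" step can be sketched with the Kabsch (SVD-based) alignment. This is a simplified stand-in, not a full pose estimator: it assumes we already know which model point corresponds to which observed point. Real pipelines have to find those matches too; ICP, for example, alternates nearest-neighbour matching with an alignment step like this one. The point sets below are synthetic.

```python
import numpy as np

def kabsch(model_pts, observed_pts):
    """Least-squares rigid transform (R, t) such that observed ~= R @ model + t.

    Both inputs are (N, 3) arrays; row i of one must match row i of the other.
    """
    mc = model_pts.mean(axis=0)
    oc = observed_pts.mean(axis=0)
    H = (model_pts - mc).T @ (observed_pts - oc)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = oc - R @ mc
    return R, t

# Synthetic check: rotate and shift a random "model", then recover the pose.
rng = np.random.default_rng(0)
model = rng.normal(size=(20, 3))
angle = np.deg2rad(30)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.2, -0.1, 0.5])
observed = model @ R_true.T + t_true

R_est, t_est = kabsch(model, observed)
print(np.round(R_est, 3), np.round(t_est, 3))
```

With noise-free data and correct matches the recovered rotation and translation equal the true ones up to floating-point error. With real point clouds, noise and wrong matches are where most of the difficulty lies.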

Closing the loop: visual servoing

Estimating a pose once and then reaching blindly is fragile: the object might shift, the arm's joints might flex, or the calibration might drift by a millimetre. The robust answer is to keep looking while you move. Visual servoing is closed-loop control in which live camera images continuously correct the arm's motion: the same feedback control idea from the control chapters, but with the camera as the sensor. See, nudge, see again, nudge again, until the hand arrives.

There are two main families here, and the difference is where the error is measured. This is the image-based vs position-based distinction. In position-based visual servoing (PBVS), the robot first estimates the full 3D pose of the target, computes the gap between where the hand is and where it should be in real-world coordinates, and steers to close that gap. In image-based visual servoing (IBVS), the robot never reconstructs the full 3D pose: it picks a few visual features and drives the arm so those features slide across the image toward their desired pixel locations. (The controller typically still needs a rough depth estimate for each feature, but not a complete pose.)
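
For the position-based family, one control step might look like the sketch below. It assumes both poses are 4x4 homogeneous transforms expressed in the same world frame, and it uses a plain proportional law on the position error and on the axis-angle rotation error. The function names and the gain are hypothetical, and the axis-angle extraction shown here is unreliable for rotations close to 180 degrees.

```python
import numpy as np

def rotation_to_axis_angle(R):
    """Return theta * u for rotation matrix R (breaks down near theta = pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-9:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta * axis / (2.0 * np.sin(theta))

def pbvs_step(T_current, T_desired, gain=1.0):
    """Proportional PBVS: world-frame linear and angular velocity commands."""
    e_pos = T_current[:3, 3] - T_desired[:3, 3]
    # Rotation that carries the desired orientation onto the current one.
    e_rot = rotation_to_axis_angle(T_current[:3, :3] @ T_desired[:3, :3].T)
    return -gain * e_pos, -gain * e_rot

# Hand is 10 cm off along x and rotated 0.2 rad about z relative to the goal.
c, s = np.cos(0.2), np.sin(0.2)
T_current = np.eye(4)
T_current[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
T_current[:3, 3] = [0.1, 0.0, 0.0]
T_desired = np.eye(4)

v, w = pbvs_step(T_current, T_desired)
print(v, w)
```

The command points straight back along x and unwinds the rotation about z, which is the straight-line behaviour in real space that characterizes position-based servoing.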

Each has a personality. Position-based servoing moves in straight, intuitive lines through real space but leans heavily on accurate pose estimation and calibration, so small 3D errors steer it off course. Image-based servoing is forgiving of calibration error because it measures its error directly in pixels, but its real-world path can curve oddly, and it can stumble near singular configurations, where the features stop responding to some camera motions. The bridge between camera motion and image motion is the image Jacobian (often called the interaction matrix): it predicts how a small motion of the camera shifts each feature in the image, so the controller can invert that relationship and know which way to move.
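
For a single point feature, the interaction matrix has a well-known closed form, sketched below. Two assumptions to flag: x and y are normalized image coordinates (pixel coordinates with the camera intrinsics already removed), not raw pixels, and Z is the point's depth along the optical axis, which in practice has to be estimated. The camera velocity is ordered (vx, vy, vz, wx, wy, wz) in the camera frame.

```python
import numpy as np

def point_interaction_matrix(x, y, Z):
    """2x6 image Jacobian of one point feature.

    Maps camera velocity (vx, vy, vz, wx, wy, wz) to feature velocity (x_dot, y_dot).
    """
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])

# A feature right of and above centre, 2 m away, seen while driving forward.
L = point_interaction_matrix(x=0.2, y=-0.1, Z=2.0)
forward = np.array([0.0, 0.0, 0.5, 0.0, 0.0, 0.0])   # 0.5 m/s ahead, no rotation
feature_velocity = L @ forward
print(feature_velocity)
```

The predicted feature velocity has the same signs as (x, y): as the camera advances, the point drifts away from the image centre, which matches everyday experience of objects spreading outward as you approach them. Stacking one such 2x6 block per feature gives the full matrix the controller inverts.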

A small servoing loop, sketched

The whole closed loop is shorter than you might expect. At its heart is a single repeating cycle: measure the error in image space, multiply by the pseudo-inverse of the image Jacobian to turn that pixel error into a commanded velocity, move the arm a little, then look again.

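# image-based visual servoing loop (the three callables are hypothetical stand-ins)
import numpy as np

def ibvs_loop(detect_features, image_jacobian, move_arm, target_px, gain=0.5, tol=1.0):
    while True:
        obs_px = detect_features()               # where the features ARE now
        e = obs_px - target_px                   # image-space error
        if np.linalg.norm(e) < tol:              # close enough -> done
            break
        J_image = image_jacobian(obs_px)         # interaction matrix at the current features
        v = -gain * np.linalg.pinv(J_image) @ e  # pixel error -> camera velocity
        move_arm(v)                              # nudge, then look again

A Python sketch: pixels in, a commanded velocity out. J_image is the interaction matrix; the gain sets how aggressively the loop chases the error. detect_features, image_jacobian, and move_arm are placeholders for whatever your camera and arm interfaces provide.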

Notice there is no separate "plan a path" step — the loop simply shrinks the error a little on every tick, the way a feedback controller does. That is the quiet power of servoing: it self-corrects against a moving target or a clumsy calibration, because every new image resets the plan.
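
The loop above can be tried without any hardware. The sketch below is a toy simulation under strong assumptions: four point features, a free-flying camera that executes the commanded velocity exactly, true depths available to the controller, normalized image coordinates instead of raw pixels, and simple Euler integration. All names and numbers are illustrative.

```python
import numpy as np

def interaction_matrix(points_cam):
    """Stack the 2x6 point-feature Jacobians for (N, 3) camera-frame points."""
    rows = []
    for X, Y, Z in points_cam:
        x, y = X / Z, Y / Z
        rows.append([-1 / Z, 0, x / Z, x * y, -(1 + x * x), y])
        rows.append([0, -1 / Z, y / Z, 1 + y * y, -x * y, -x])
    return np.array(rows)

def project(points_cam):
    """Normalized image coordinates, flattened to [x1, y1, x2, y2, ...]."""
    return (points_cam[:, :2] / points_cam[:, 2:3]).ravel()

def simulate_ibvs(points_cam, target, gain=1.0, dt=0.05, steps=300, tol=1e-4):
    """Run the servo loop and return the error norm recorded at each tick."""
    points_cam = points_cam.copy()
    errors = []
    for _ in range(steps):
        e = project(points_cam) - target          # image-space error
        errors.append(np.linalg.norm(e))
        if errors[-1] < tol:
            break
        vel = -gain * np.linalg.pinv(interaction_matrix(points_cam)) @ e
        v, w = vel[:3], vel[3:]
        # A static point seen from a moving camera: X_dot = -v - w x X
        points_cam += dt * (-v - np.cross(w, points_cam))
    return np.array(errors)

# Four corners of a 40 cm square, 1 m in front of the camera at the goal pose.
goal_pts = np.array([[-0.2, -0.2, 1.0], [0.2, -0.2, 1.0],
                     [0.2, 0.2, 1.0], [-0.2, 0.2, 1.0]])
target = project(goal_pts)

# The camera starts displaced, so the same corners appear shifted and farther away.
start_pts = goal_pts + np.array([0.05, -0.03, 0.1])
errors = simulate_ibvs(start_pts, target)
print(f"error: {errors[0]:.4f} -> {errors[-1]:.6f} in {len(errors)} ticks")
```

No path is ever planned here. The error norm simply shrinks tick by tick, and changing the starting offset changes the route but not the code.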

The frontier: from pixels straight to actions

Everything so far has been a hand-built pipeline: detect, estimate pose, compute a Jacobian, control. The frontier of robot perception asks a bolder question — what if a single deep neural network learned the whole chain from raw pixels to motor commands, skipping the explicit pose entirely? This is end-to-end learned perception, and it grew out of the imitation learning and behavior cloning ideas from the learning chapters, where a robot copies human demonstrations.
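
Stripped to its core, behavior cloning is ordinary supervised learning: fit a function from observations to the actions a demonstrator took. The sketch below swaps the deep network for a linear least-squares policy and uses made-up demonstration data, purely to show the shape of the problem. The feature and action sizes, the synthetic "expert", and the noise level are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demonstrations: 200 observations (4 features each) paired with
# the 2-D action the demonstrator chose. The "expert" here is a known linear
# rule plus a little noise, so we can check what the learner recovers.
W_expert = np.array([[0.5, -1.0], [0.0, 0.3], [1.2, 0.0], [-0.7, 0.4]])
observations = rng.normal(size=(200, 4))
actions = observations @ W_expert + 0.01 * rng.normal(size=(200, 2))

# Behavior cloning as regression: find weights mapping observation -> action.
W_policy, *_ = np.linalg.lstsq(observations, actions, rcond=None)

def policy(obs):
    """Cloned policy: predict an action for a new observation."""
    return obs @ W_policy

new_obs = np.array([1.0, 0.0, 0.0, 0.0])
print(policy(new_obs))
```

An end-to-end system replaces the four hand-picked features with raw images and the weight matrix with a deep network, but the training signal is the same: match what the demonstrator did.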

The boldest version is the vision-language-action model (VLA). It takes in camera images and a plain-language instruction — "pick up the red block and put it in the bowl" — and outputs robot actions directly, frame after frame. Built on the same kind of large models behind chatbots, a VLA fuses what it sees with what it is told and maps that straight onto motion, no separate detector or pose estimator in sight. It is, in a sense, visual servoing whose entire controller has been replaced by a learned network.

These learned systems are exciting because they can handle messy, novel scenes that no hand-coded pose estimator anticipated. But hard challenges remain. Robustness: a network can fail silently when lighting, clutter, or a strange new object pushes it outside its training experience. Generalization: a policy trained on a thousand kitchen tasks may still fumble the thousand-and-first. Sim-to-real: bridging the gap between the simulator where these models are often trained and the real world, eased by tricks like domain randomization, is very much an open problem.
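
Domain randomization can be as simple as perturbing every simulated frame before the network sees it, so that the real world looks like just one more variation. The sketch below randomizes only brightness, contrast, and pixel noise on a float image. The ranges are illustrative guesses, and fuller setups also vary textures, lighting, camera pose, and physics inside the simulator itself.

```python
import numpy as np

def randomize_image(image, rng):
    """Apply random brightness, contrast, and pixel noise to a float image in [0, 1]."""
    brightness = rng.uniform(-0.2, 0.2)      # assumed range
    contrast = rng.uniform(0.7, 1.3)         # assumed range
    noise = rng.normal(0.0, 0.02, size=image.shape)
    out = (image - 0.5) * contrast + 0.5 + brightness + noise
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(42)
sim_frame = np.full((4, 4, 3), 0.5)          # stand-in for a rendered simulator frame
variants = [randomize_image(sim_frame, rng) for _ in range(3)]
print([round(float(v.mean()), 3) for v in variants])
```

Each call yields a differently lit, differently noisy version of the same scene, which gives a policy trained on these frames less opportunity to latch onto the simulator's exact appearance.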