Recovering Depth: Stereo, Point Clouds, and 3D

The dimension a photo throws away

A camera takes the rich 3D world and squashes it onto a flat 2D image. Every point in front of the lens lands on a single pixel, and the pinhole camera model tells us exactly how: a 3D point is projected along a ray through the lens onto the image plane. But that projection is a one-way street. Once a point becomes a pixel, the camera forgets how far away it was. A coffee mug on your desk and an identical mug across the room can produce the very same pixel — depth is the information the flat photo silently discards.

For a robot, that missing number is everything. To grasp the mug, drive around the table, or avoid the wall, the robot needs distance in real units — meters, not pixels. The whole job of depth perception is to *put the third dimension back*. The trick that makes this possible is beautifully simple: look at the same scene from two slightly different viewpoints, and compare what each one sees.

Stereo vision: two eyes, one triangle

Stereo vision mimics that finger-blinking trick with two cameras mounted a fixed distance apart, called the *baseline*. Each camera captures the scene from its own viewpoint. Because the viewpoints differ, the same real-world point lands at a different horizontal position in the left image than in the right. That left-right shift, measured in pixels, is called disparity — and it is the heart of depth estimation.

The relationship is wonderfully direct: large disparity means close, small disparity means far. A point right in front of you shifts a lot between the two views; a mountain on the horizon barely shifts at all. If you know the baseline and the camera's focal length (from camera calibration), one short formula converts disparity straight into distance. The two cameras and the scene point form a triangle, and we solve for its depth — which is why this is called *triangulation*.

depth Z = (focal_length f  ×  baseline B) / disparity d

   big disparity d   ->  small Z  (object is near)
   small disparity d  ->  large Z  (object is far)

The stereo depth formula: once you find the disparity d for a pixel, depth follows directly.

The hard part is the *matching*: for a pixel in the left image, which pixel in the right image is the same real point? This is the same correspondence problem you meet when matching image keypoints with a feature descriptor, but stereo gets a big shortcut. After calibration, the matching pixel always lies on the same horizontal line in the other image, so the search collapses from the whole 2D image to a single row. Plain-colored walls and shiny surfaces still cause trouble — with no texture to match, disparity is a guess.

From depth to a point cloud

Once every pixel has a depth, we can lift the flat image off the screen and back into space. Take each pixel's image position, attach its measured distance, and use the calibrated camera model to compute where that point truly sits in 3D. Do this for the whole image and you get a point cloud: a dense scatter of thousands or millions of little (x, y, z) dots floating in the robot's coordinate frame, each often carrying a color too. It is the raw, unstructured 3D output of perception.

A point cloud is powerful but messy. It has no notion of objects or surfaces — just dots. So robots run further processing: clustering nearby points into objects, fitting flat planes to find tabletops and floors, or stitching several clouds taken from different angles into one. Aligning two overlapping clouds is the job of iterative closest point (ICP), which nudges one cloud until its points sit as close as possible on top of the other.

When you fuse many views into a single clean surface, you cross from raw points into 3D reconstruction — turning the cloud into a watertight mesh or a tidy model you could measure, print, or drop into a simulator. The point cloud also feeds downstream tasks like 6-DoF object pose estimation, where the robot fits a known object's shape to the dots to learn exactly how it is positioned and oriented before grasping it.

One moving camera: structure from motion

You do not actually need two cameras. A single camera that *moves* sees the scene from many viewpoints over time — and the gap between frame 1 and frame 10 acts just like a stereo baseline. Structure from motion (SfM) exploits this. It tracks the same features across a sequence of images and, from how they slide across the frames, solves a remarkable double puzzle at once: the 3D shape of the scene *and* the path the camera took to capture it.

Detect features in each frame and match them across frames, tracking how each one moves — closely related to optical flow, the per-pixel motion between images.
From the matched motion, estimate how the camera moved between views (its changing pose).
Triangulate each tracked feature across two or more known camera poses to place it in 3D, building a sparse point cloud.
Jointly refine every camera pose and every 3D point together (bundle adjustment) so the whole reconstruction is consistent.

Run this live, frame by frame, and SfM becomes the visual backbone of a robot that maps the world while figuring out its own location. Pair the camera with an inertial sensor and you get visual-inertial odometry, where the motion guesses from vision and from the inertial unit reinforce each other for a steadier estimate of where the robot is and where everything around it sits.

Where depth goes wrong

Recovered depth is never perfect, and two pitfalls show up again and again. The first is scale ambiguity. A single moving camera can recover the *shape* of a scene and the *shape* of its own path — but not their true size. A small model train filmed up close looks identical to a real train filmed from far away. SfM reconstructs the world up to an unknown scale factor; without help it cannot tell you whether a doorway is one meter wide or a hundred.

The fix is to feed in one real measurement: a known baseline from a stereo rig, an inertial sensor that senses true acceleration, a wheel odometer, or a calibration object of known size. Any one of these pins the floating reconstruction to real meters. This is exactly why a second camera, or an IMU, earns its keep — it nails down the scale that vision alone leaves dangling.

The second pitfall is noisy depth. Disparity becomes unreliable on plain walls, glass, mirrors, and shadows, where there is nothing to match. Far-away points are especially fragile: at long range a tiny disparity error becomes a huge distance error, so depth grows fuzzier the farther you look. Good robots treat depth as a measurement with uncertainty, not gospel — they filter it, fuse it with other sensors, and trust a nearby reading far more than a distant one.