From 3D World to Flat Image: The Pinhole Camera

A camera is a box with a hole

Picture a sealed shoebox with a tiny pinhole punched in one face and a sheet of paper taped to the opposite face. Light bouncing off the world outside streams in through that hole. Because the hole is so small, only one thin ray from each point on an object can pass through, and it lands on exactly one spot on the paper. Do this for every point, and a complete picture of the scene forms on the back wall — upside down, but unmistakably real. That is the entire idea behind a camera: it takes a 3D world full of light traveling in every direction and squeezes it down to a flat 2D image.

The pinhole forces every ray of light through a single point, so the image is really a bundle of straight rays all crossing at one place. A modern camera swaps the pinhole for a lens (to gather more light and stay sharp) and the paper for a digital sensor — a grid of millions of light-counting cells called pixels. But the geometry is the same: the world arrives as rays, and the sensor records where each ray lands on the grid.

How a 3D point becomes a pixel

The clean mathematical version of the shoebox is called the pinhole camera model. It places the pinhole at the origin and an imaginary image plane one focal length in front of it (in front, so the picture comes out right-side up instead of flipped). A point in the world is connected to the pinhole by a straight ray; wherever that ray pierces the image plane is where the point appears. Similar triangles do the rest: a point offset sideways by X at depth Z lands at offset f * X / Z on a plane at distance f from the pinhole.

Two quantities govern the squeeze. The focal length is how far the image plane sits from the pinhole; a long focal length spreads the scene out (zoomed in, narrow view), while a short one packs it together (wide view). The principal point is where the central ray, the one pointing straight ahead, hits the sensor; it sits at or near the image center and is expressed in pixel coordinates. Knowing both, the model turns the ray's direction into an exact pixel position (its row and column).

u = fx * (X / Z) + cx     # pixel column
v = fy * (Y / Z) + cy     # pixel row
# (X, Y, Z): point in front of the camera, Z = depth
# fx, fy: focal length in pixels   cx, cy: principal point

Projection in two lines: divide by depth Z, scale by focal length, shift to the image center.
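As a rough illustration, the two lines above can be wrapped in a small Python function. The function name `project_point`, the intrinsic values, and the sample point are all made up for this sketch; they do not come from any particular camera.

```python
def project_point(X, Y, Z, fx, fy, cx, cy):
    """Project a camera-frame 3D point to pixel coordinates (u, v)."""
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    u = fx * (X / Z) + cx  # pixel column
    v = fy * (Y / Z) + cy  # pixel row
    return u, v


# Hypothetical intrinsics for a 640x480 image (illustrative values only).
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

# The same sideways offset seen at two depths.
near = project_point(0.5, 0.25, 2.0, FX, FY, CX, CY)
far = project_point(0.5, 0.25, 4.0, FX, FY, CX, CY)

print(near)  # (445.0, 302.5)
print(far)   # (382.5, 271.25)
```

Doubling the depth halves the point's offset from the principal point, which is the divide-by-Z step at work: distant objects crowd toward the image center.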

Two parameter sets: the lens and its place in the world

To project a point, the model needs two kinds of knowledge, and it pays to keep them separate. The first is everything baked into the camera itself: the focal length, the principal point, and the pixel grid's shape. These are the camera intrinsic parameters, the camera's own internal geometry. As long as the zoom and focus stay fixed, they do not change as the robot drives around, because they describe the box, not where the box is.

The second kind is where the camera sits and which way it points. A point in the world is usually measured in some fixed reference frame, such as the room or the robot's base, but the projection math wants it measured from the camera's own viewpoint. The camera extrinsic parameters are the rotation and translation that bridge those two coordinate frames. They encode the camera's pose, its position plus orientation (strictly, the world-to-camera transform is the inverse of the pose, but the two carry the same information). Mount the camera on a moving arm and the extrinsics change every instant; the intrinsics stay put.
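A minimal sketch of how the two parameter sets combine, assuming the common convention p_cam = R * p_world + t, where R is a 3x3 rotation and t a translation. The helper names, the intrinsic values, and the sample points are hypothetical, chosen only to make the split visible.

```python
def world_to_camera(p_world, R, t):
    """Extrinsics step: express a world-frame point in the camera frame."""
    return tuple(
        sum(R[i][j] * p_world[j] for j in range(3)) + t[i]
        for i in range(3)
    )


def translation_from_center(R, center):
    """For a camera whose optical center is at `center` in the world, t = -R * center."""
    return tuple(
        -sum(R[i][j] * center[j] for j in range(3))
        for i in range(3)
    )


def project(p_cam, fx, fy, cx, cy):
    """Intrinsics step: camera-frame point to pixel (u, v)."""
    X, Y, Z = p_cam
    return fx * X / Z + cx, fy * Y / Z + cy


# Hypothetical intrinsics; these stay fixed while the camera moves.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

# No rotation, for simplicity: camera axes aligned with world axes.
R_IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
point_world = (0.5, 0.25, 2.0)

# Camera at the world origin, then shifted 0.5 m along the world X axis.
for center in [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]:
    t = translation_from_center(R_IDENTITY, center)
    p_cam = world_to_camera(point_world, R_IDENTITY, t)
    print(center, project(p_cam, FX, FY, CX, CY))
```

Only `t` changes between the two iterations, yet the same world point lands on a different pixel column. That is the extrinsics moving while the intrinsics stay constant.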

Calibration: measuring your camera before you trust it

Nobody hands you the intrinsics on a spec sheet you can trust down to the pixel — two cameras of the same model differ, and real lenses bend light in ways the clean pinhole model ignores. Straight lines near the edge of a wide lens bow outward or pinch inward; this is lens distortion. Camera calibration is the procedure that measures the true intrinsics and the distortion, so the software can undo both.
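To give a feel for what distortion does to pixel positions, here is a sketch using a simple radial polynomial, a widely used model that scales each point by 1 + k1*r^2 + k2*r^4, where r is its distance from the optical axis in normalized coordinates. The text above does not commit to a specific distortion model, so treat this choice, the coefficient value, and the function names as assumptions.

```python
def distort_normalized(x, y, k1, k2=0.0):
    """Apply radial distortion to normalized image coordinates (x = X/Z, y = Y/Z)."""
    r2 = x * x + y * y
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    return x * scale, y * scale


def to_pixel(x, y, fx, fy, cx, cy):
    """Scale normalized coordinates by focal length and shift to the principal point."""
    return fx * x + cx, fy * y + cy


# Hypothetical intrinsics and a made-up coefficient; a negative k1 pulls points
# toward the center, which makes straight lines near the edge bow outward.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0
K1 = -0.2

shifts = []
for x in (0.0, 0.1, 0.5):  # on-axis, near the center, near the edge
    u_ideal, _ = to_pixel(x, 0.0, FX, FY, CX, CY)
    xd, yd = distort_normalized(x, 0.0, K1)
    u_real, _ = to_pixel(xd, yd, FX, FY, CX, CY)
    shifts.append(u_real - u_ideal)
    print(f"x={x}: ideal column {u_ideal:.1f}, distorted column {u_real:.1f}")
```

With these numbers the on-axis point does not move, the near-center point shifts by about a tenth of a pixel, and the near-edge point shifts by more than ten pixels. This is why uncorrected distortion matters most toward the borders of a wide lens.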

1. Print a checkerboard with squares of an exactly known size and tape it to something flat and rigid.
2. Take a dozen-plus photos of it from many angles and distances, filling different parts of the frame.
3. Software finds every corner in every photo, and it already knows where those corners sit on the real board.
4. It searches for the intrinsics and distortion that make the projected corners best match the detected ones, leaving the smallest leftover error.
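The search in the last step can be sketched with a deliberately tiny toy problem. It assumes only the focal length is unknown, that the corner positions relative to the camera are already known, and that a brute-force scan over candidates is good enough. Real calibration solves for all intrinsics, the distortion, and the board pose in every photo at once (OpenCV, for instance, offers `cv2.calibrateCamera` for this). All names and numbers below are made up for illustration.

```python
import math


def project(point, f, cx, cy):
    """Pinhole projection with a single focal length f (square pixels assumed)."""
    X, Y, Z = point
    return f * X / Z + cx, f * Y / Z + cy


def rms_reprojection_error(points, detected, f, cx, cy):
    """Root-mean-square pixel distance between projected and detected corners."""
    total = 0.0
    for p, (u_det, v_det) in zip(points, detected):
        u, v = project(p, f, cx, cy)
        total += (u - u_det) ** 2 + (v - v_det) ** 2
    return math.sqrt(total / len(points))


# Toy board: a 3x3 grid of corners 0.05 m apart, 1 m in front of the camera.
corners = [(0.05 * i, 0.05 * j, 1.0) for i in (-1, 0, 1) for j in (-1, 0, 1)]

# Synthetic "detections": project with a hidden focal length, then nudge each
# column by a fixed sub-pixel offset to stand in for corner-detection noise.
TRUE_F, CX, CY = 800.0, 320.0, 240.0
offsets = [0.2, -0.1, 0.0, 0.1, -0.2, 0.1, 0.0, -0.1, 0.2]
detected = []
for p, e in zip(corners, offsets):
    u, v = project(p, TRUE_F, CX, CY)
    detected.append((u + e, v))

# Brute-force search: keep the candidate with the smallest leftover error.
best_f = min(
    range(700, 901),
    key=lambda f: rms_reprojection_error(corners, detected, f, CX, CY),
)
best_err = rms_reprojection_error(corners, detected, best_f, CX, CY)
print(best_f, round(best_err, 3))
```

The leftover error does not reach zero because the detections are noisy. In practice that residual, usually reported in pixels, is a quick sanity check on how well the calibration fit the data.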

The payoff is direct and physical. Suppose a robot arm spots a part, computes its distance from the image, and reaches for it. If the focal length is off by a few percent or distortion is left uncorrected, the geometry lies: the arm misjudges how far away the part is and how big it is, then stops short or crashes through. The same flaw corrupts depth from a stereo pair and every later step that depends on it — so calibration is the quiet first move before any robot is allowed to look and reach. It is the visual cousin of the sensor calibration every other sensor on the robot needs.
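A back-of-the-envelope sketch of that failure, assuming the robot estimates distance from the apparent width of a part whose real width is known. Rearranging the projection equation gives Z = f * W / w, with W the real width and w the width in pixels. The part size, focal lengths, and function name are hypothetical.

```python
def distance_from_width(f_pixels, real_width_m, width_px):
    """Estimate depth from apparent size: Z = f * W / w."""
    return f_pixels * real_width_m / width_px


TRUE_F = 800.0        # the camera's actual focal length, in pixels
ASSUMED_F = 824.0     # what the software believes: 3% too large
PART_WIDTH_M = 0.04   # known real width of the part
TRUE_DISTANCE_M = 0.5

# What the sensor actually records for the part's width, in pixels.
width_px = TRUE_F * PART_WIDTH_M / TRUE_DISTANCE_M

estimated = distance_from_width(ASSUMED_F, PART_WIDTH_M, width_px)
error_mm = (estimated - TRUE_DISTANCE_M) * 1000.0
print(f"estimated {estimated:.3f} m, error {error_mm:.1f} mm")
```

A 3% focal-length error becomes a 3% distance error, here about 15 mm at half a meter. That is easily enough to miss a small part or collide with it.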