Pose Graphs, Cameras, and LiDAR

The Back-End as a Graph

By now you know that SLAM asks a robot to figure out where it is while drawing the very map it is using to do so. Earlier rungs split that job into a front-end that turns raw sensor data into measurements and a back-end that fuses everything into one consistent estimate — the front-end / back-end division of labour. This guide is about the modern back-end, and the dominant way to think about it is wonderfully visual: as a graph.

In pose-graph SLAM, every place the robot stood is a node holding one pose — a position-and-orientation guess. The edges between nodes are measurements: "from here, I drove forward two metres and turned left a little." Each edge is a constraint that says how two poses should relate. Crucially, the edges disagree slightly, because every measurement carries noise. The graph is over-determined and a little self-contradictory, and the back-end's job is to find the single arrangement of poses that satisfies all those noisy constraints as well as possible.
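As a rough illustration of that structure, here is a minimal sketch of a 2D pose graph as plain data. The class and field names (Pose2D, Edge, PoseGraph, dx, dy, dtheta) are invented for this example and do not come from any particular SLAM library.

```python
from dataclasses import dataclass, field
import math


@dataclass
class Pose2D:
    """One node: a position-and-orientation guess in the plane."""
    x: float
    y: float
    theta: float  # heading in radians


@dataclass
class Edge:
    """One constraint: the measured pose of node j as seen from node i."""
    i: int
    j: int
    dx: float      # metres forward in node i's frame
    dy: float      # metres to the left in node i's frame
    dtheta: float  # change in heading, radians


@dataclass
class PoseGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_node(self, pose):
        self.nodes.append(pose)
        return len(self.nodes) - 1  # index of the new node

    def add_edge(self, i, j, dx, dy, dtheta):
        self.edges.append(Edge(i, j, dx, dy, dtheta))


graph = PoseGraph()
a = graph.add_node(Pose2D(0.0, 0.0, 0.0))
b = graph.add_node(Pose2D(2.1, 0.1, 0.12))  # current guess, slightly off
# "From here, I drove forward two metres and turned left a little."
graph.add_edge(a, b, dx=2.0, dy=0.0, dtheta=math.radians(5))
```

Note that the node stores a guess and the edge stores a measurement; the two are allowed to disagree, and that disagreement is what the back-end later minimizes.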

The more general formulation behind this is the factor graph. There, the quantities we want to estimate (poses, and sometimes landmark positions) are variable nodes, and each measurement becomes a second kind of node, a factor: a small function that scores how well a candidate set of values explains that observation. A pose graph is a factor graph whose factors are all pose-to-pose constraints. Factor graphs are the language modern SLAM libraries actually speak, because they let you mix odometry, loop closures, GPS, and landmark sightings in one tidy structure.
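To make the idea of "a measurement as a function" concrete, here is a toy sketch in one dimension. The helper names (between_factor, prior_factor) and the variable keys are made up for this example; real libraries have their own, richer factor types.

```python
# Unknowns: 1D positions of two poses and one landmark (values are guesses).
values = {"x0": 0.0, "x1": 1.1, "l0": 3.2}


def between_factor(a, b, measured):
    """Odometry, loop closure, or landmark sighting: b - a should equal `measured`."""
    return (a, b), lambda v: (v[b] - v[a]) - measured


def prior_factor(a, measured):
    """GPS-like absolute fix: a should equal `measured`."""
    return (a,), lambda v: v[a] - measured


# Different kinds of measurement sit side by side in one list.
factors = [
    prior_factor("x0", 0.0),
    between_factor("x0", "x1", 1.0),  # odometry
    between_factor("x1", "l0", 2.0),  # landmark sighting from x1
    between_factor("x0", "l0", 3.0),  # landmark sighting from x0
]


def total_error(values, factors):
    """Sum of squared residuals over every factor (all weights equal here)."""
    return sum(residual(values) ** 2 for _keys, residual in factors)


print(total_error(values, factors))
```

Each factor only touches the variables it names, which is what lets very different sensors share one structure.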

Pulling the Graph Taut

Here is the mental picture that makes graph SLAM click. Imagine each node is a bead and each edge is a tiny spring. Every spring has a natural length — the measurement, what it thinks the distance and turn between two beads should be. When the beads are placed wrong, the springs are stretched or squished, and they store energy. The total stored energy is the total error. Optimization simply lets go of the beads and lets the springs pull everything into the lowest-energy shape they can all live with.

Springs are not all equally strong, and that matters. A confident measurement, such as a crisp laser scan match, is a stiff spring that strongly resists being stretched. A vague one is a soft spring the optimizer can bend without much penalty. This stiffness is the measurement's certainty: formally, the inverse of its covariance, often called the information matrix. When the optimizer settles the graph, stiff springs barely stretch and soft springs absorb most of the slack, which is precisely how you want trust to flow.
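Written as a formula, the spring picture looks roughly like this. The symbols are choices made for this sketch: r is an edge's residual, W its stiffness, and Sigma its covariance. The worked numbers are invented to show how two disagreeing measurements of the same distance settle.

```latex
% Total energy of the graph: one weighted squared residual per edge.
\[
E(x) = \sum_{(i,j)} r_{ij}(x)^{\top}\, W_{ij}\, r_{ij}(x),
\qquad W_{ij} = \Sigma_{ij}^{-1}
\]

% Worked example: two springs measure the same 1D distance d.
%   stiff: z_1 = 2.0 m, sigma_1 = 0.1 m, so W_1 = 1 / 0.1^2 = 100
%   soft:  z_2 = 2.6 m, sigma_2 = 0.3 m, so W_2 = 1 / 0.3^2 = 100/9
\[
d^{\ast} = \frac{W_1 z_1 + W_2 z_2}{W_1 + W_2}
         = \frac{200 + 260/9}{1000/9}
         = 2.06 \text{ m}
\]
```

The answer lands nine times closer to the stiff spring's reading than to the soft one's, because the stiff spring is nine times stiffer.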

total_error = 0
for each edge (i, j) with measurement z and stiffness W:
    predicted = relative_pose(node[i], node[j])
    residual  = predicted - z          # how wrong this edge is
    residual.angle = wrap_to_pi(residual.angle)   # keep the heading error within [-pi, pi)
    total_error += residual^T * W * residual   # stiff W -> costly
# adjust all nodes to make total_error as small as possible

Each edge adds a penalty; stiff edges (large W) cost more, so the solver bends the soft ones first.
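Here is one way the loop above might look as runnable code, shrunk to a robot that moves along a single line so the problem stays linear and needs no angle handling. The edge values, the weights, and the trick of pinning node 0 with a heavy extra row are all assumptions of this sketch, not a description of how any particular solver works.

```python
import numpy as np

# Edges as (i, j, z, W): node j should sit z metres ahead of node i.
edges = [
    (0, 1, 1.0, 1.0),    # odometry, soft
    (1, 2, 1.0, 1.0),    # odometry, soft
    (2, 3, 1.0, 1.0),    # odometry, soft
    (0, 3, 2.7, 100.0),  # loop closure from a crisp scan match, stiff
]
n = 4


def total_error(x):
    """Sum of W * residual^2 over all edges."""
    return sum(W * ((x[j] - x[i]) - z) ** 2 for i, j, z, W in edges)


# Build a weighted least-squares problem A x = b, one row per edge.
A = np.zeros((len(edges) + 1, n))
b = np.zeros(len(edges) + 1)
for row, (i, j, z, W) in enumerate(edges):
    s = np.sqrt(W)  # scaling a row by sqrt(W) weights its squared error by W
    A[row, i], A[row, j], b[row] = -s, s, s * z
A[-1, 0] = 1e3      # pin node 0 at the origin so the graph cannot float away

x0 = np.array([0.0, 1.0, 2.0, 3.0])        # dead-reckoned starting guess
x, *_ = np.linalg.lstsq(A, b, rcond=None)  # lowest-energy arrangement

print("before:", total_error(x0), "after:", total_error(x))
print("poses:", np.round(x, 3))
```

The three odometry springs each give up about ten centimetres so that the stiff loop-closure spring is left almost unstretched. Real pose graphs involve rotations, so the problem is nonlinear and is usually solved by repeating a linearized step like this one.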

Seeing with a Camera

The graph is the same no matter what sensor fills in the edges — but the choice of sensor changes everything about cost, robustness, and the kind of map you get. The first big family is visual SLAM, which builds the map from ordinary camera images. Its appeal is plain: cameras are cheap, light, and low-power, and an image is astonishingly rich — texture, colour, signs, faces, the lot. From frame to frame, the system tracks recognizable image points using a feature descriptor and infers how the camera must have moved to make them shift the way they did.
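The tracking step can be sketched as a nearest-descriptor search. The example below assumes binary descriptors compared by Hamming distance, and it fabricates the descriptors with a random generator purely so that it runs on its own; a real system would compute them from image patches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 5 features in the previous frame, each a 256-bit descriptor.
prev_desc = rng.integers(0, 2, size=(5, 256), dtype=np.uint8)

# The current frame sees the same features in a different order, with a few
# bits flipped to imitate changes in lighting and viewpoint.
order = np.array([3, 0, 4, 1, 2])
curr_desc = prev_desc[order].copy()
flip = rng.random(curr_desc.shape) < 0.05
curr_desc[flip] ^= 1

# Hamming distance between every current and every previous descriptor.
dist = (curr_desc[:, None, :] != prev_desc[None, :, :]).sum(axis=2)

# Match each current feature to its nearest previous descriptor.
matches = dist.argmin(axis=1)
print(matches)
```

Once features are matched, the pixel shift of each pair is the raw material from which the camera's motion is inferred. Practical matchers also reject ambiguous matches, which this sketch leaves out.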

That richness is also visual SLAM's weakness. A camera sees light, not geometry, so it is at the mercy of lighting and texture. Drive into a dark tunnel, swing past a dazzling window, or face a blank white wall, and the trackable points vanish, so tracking fails just when conditions are hardest. A single camera also cannot recover absolute scale from images alone: it can tell that something moved, but not whether the scene is a dollhouse a metre away or a real house far off, because scaling the scene and the camera's motion by the same factor produces exactly the same images. That is why visual systems usually lean on a second eye or an extra sense.
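The scale problem is easy to check numerically. The sketch below assumes an idealized pinhole camera that looks straight along +Z and never rotates; the focal length and the point coordinates are arbitrary values chosen for illustration.

```python
import numpy as np


def project(points, cam_pos, f=500.0):
    """Pinhole projection for a camera at cam_pos looking along +Z (no rotation)."""
    p = points - cam_pos
    return f * p[:, :2] / p[:, 2:3]


scene = np.array([[0.2, 0.1, 1.0], [-0.3, 0.2, 1.5], [0.1, -0.2, 2.0]])
move = np.array([0.05, 0.0, 0.0])  # the camera slides 5 cm sideways

s = 20.0  # the same scene and the same motion, twenty times larger
for cam in (np.zeros(3), move):
    small = project(scene, cam)
    big = project(s * scene, s * cam)
    assert np.allclose(small, big)  # the pixels are identical in both worlds
```

Both worlds produce the same pair of images, so no amount of processing on those images alone can tell them apart.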

The usual remedies are practical. A stereo pair or an RGB-D camera recovers true distance and fixes the scale problem outright. Better still, pairing the camera with an inertial sensor — visual-inertial odometry — lets the two cover each other's gaps: the motion sensor carries the estimate through the brief moments the camera is blinded, and the camera reins in the motion sensor's slow drift. Cheap, rich, light-sensitive: that is the visual bargain in a phrase.
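For the stereo case, the distance comes from a one-line relation between depth and disparity (how far a point shifts between the left and right images). The sketch assumes an ideal rectified stereo pair; the rig numbers are invented.

```python
def stereo_depth(focal_px, baseline_m, disparity_px):
    """Depth in metres for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig: 700-pixel focal length, cameras 12 cm apart.
near = stereo_depth(700.0, 0.12, 42.0)  # large shift between the eyes: about 2 m
far = stereo_depth(700.0, 0.12, 4.2)    # small shift between the eyes: about 20 m
print(near, far)
```

Because the baseline is a known physical length, the result is in real metres. The same formula also shows the limit of stereo: far-away points have tiny disparities, so a one-pixel error matters much more at 20 m than at 2 m.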

Seeing with a Laser

The other big family is LiDAR SLAM. A LiDAR fires laser pulses and times the echoes, so it doesn't infer distance; it measures it directly, at hundreds of thousands to millions of points per second. The result is a point cloud: a dense spray of 3D dots tracing every surface around the robot. Because it carries its own light, a LiDAR works in pitch dark just as well as in daylight, and it doesn't care whether a wall is patterned or plain. Where vision guesses geometry, LiDAR hands it to you.
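The arithmetic behind one LiDAR return is short enough to sketch. The function names and the angle convention (azimuth measured in the horizontal plane, elevation above it) are assumptions of this example; real sensors document their own conventions.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def tof_range(round_trip_s):
    """Range in metres: the pulse travels out and back, so halve the path."""
    return C * round_trip_s / 2.0


def to_point(range_m, azimuth_rad, elevation_rad):
    """Convert one return (range plus beam angles) to an x, y, z point."""
    horizontal = range_m * math.cos(elevation_rad)
    return (horizontal * math.cos(azimuth_rad),
            horizontal * math.sin(azimuth_rad),
            range_m * math.sin(elevation_rad))


r = tof_range(66.7e-9)  # an echo after roughly 66.7 ns means roughly 10 m
point = to_point(r, math.radians(90), 0.0)
print(r, point)
```

Repeating this for every pulse as the beam sweeps around is what builds up the point cloud.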

So how does a laser give you the graph's edges? Through scan matching: take the cloud captured a moment ago and the cloud captured now, and find the rotation and translation that slides one onto the other most snugly. Provided the scene itself has not moved, the motion that aligns the two scans is an estimate of the robot's movement between them. The workhorse algorithm is ICP (Iterative Closest Point), and its loop is almost embarrassingly simple, which is part of why it has lasted decades.

1. For each point in the new scan, find the nearest point in the old scan and tentatively call them the same surface: a guessed correspondence.
2. Compute the single rotation and translation that pulls those paired points closest together in the least-squares sense.
3. Apply that motion to the new scan so it shifts toward the old one.
4. Repeat: the closest-point guesses improve, the fit tightens, and, provided the starting guess was reasonably close, after a handful of passes the two clouds lock together. The accumulated motion is your edge.
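The four steps above can be sketched in a few lines for the 2D case. This is a bare-bones point-to-point variant with a brute-force nearest-neighbour search and an SVD-based fit for step 2; the function names and the L-shaped test scan are invented for the example, and practical implementations add outlier rejection and a faster search.

```python
import numpy as np


def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t so that R @ src_k + t ~ dst_k."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # forbid reflections
    R = Vt.T @ D @ U.T
    return R, dst_c - R @ src_c


def icp(new, old, iterations=30):
    """Align `new` onto `old`; returns the accumulated rotation and translation."""
    R_total, t_total = np.eye(2), np.zeros(2)
    moved = new.copy()
    for _ in range(iterations):
        # Step 1: guess correspondences by nearest neighbour.
        d2 = ((moved[:, None, :] - old[None, :, :]) ** 2).sum(axis=2)
        paired = old[d2.argmin(axis=1)]
        # Step 2: best single motion for those pairs.
        R, t = best_rigid_transform(moved, paired)
        # Step 3: apply it, and fold it into the running total.
        moved = moved @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total


# Test scan: an L-shaped wall corner, seen again after a small known motion.
theta = np.radians(2.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta), np.cos(theta)]])
t_true = np.array([0.05, -0.03])
xs, ys = np.linspace(0.0, 4.0, 17), np.linspace(0.25, 3.0, 12)
old = np.vstack([np.column_stack([xs, np.zeros_like(xs)]),
                 np.column_stack([np.zeros_like(ys), ys])])
new = (old - t_true) @ R_true  # chosen so that R_true @ new_k + t_true == old_k

R_est, t_est = icp(new, old)
print(np.degrees(np.arctan2(R_est[1, 0], R_est[0, 0])), t_est)
```

The test motion is deliberately small compared with the spacing between points, so the first round of nearest-neighbour guesses is already correct. With a larger offset the early guesses are wrong, and ICP can settle on a poor alignment unless it starts from a decent initial estimate.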

LiDAR's trade-offs mirror vision's. The geometry is precise and lighting-proof, but the sensors are heavier, pricier, and more power-hungry, and a bare point cloud is geometrically precise yet semantically mute: it knows a surface is there, not that it is a door or a person. ICP can also be fooled where geometry is repetitive: a long featureless corridor looks the same whether you slid a metre forward or not, so the scan slides freely along it and the estimate drifts. In practice this is why robust systems fuse both worlds, with a laser for solid structure and a camera for meaning.
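The corridor problem shows up directly in the matching cost. The sketch below builds a made-up corridor from two straight walls and measures how badly a shifted copy fits the original, using the average nearest-point distance as a stand-in for the ICP error.

```python
import numpy as np


def mean_nn_distance(a, b):
    """Average distance from each point of a to its nearest point in b."""
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))
    return d.min(axis=1).mean()


# Two featureless walls, 20 m long and 2 m apart, sampled every 5 cm.
xs = np.arange(0.0, 20.0, 0.05)
corridor = np.vstack([np.column_stack([xs, np.full_like(xs, -1.0)]),
                      np.column_stack([xs, np.full_like(xs, 1.0)])])

along = corridor + np.array([1.0, 0.0])   # slid one metre down the corridor
across = corridor + np.array([0.0, 0.3])  # pushed 30 cm toward one wall

print("slid 1.0 m along: ", mean_nn_distance(along, corridor))
print("moved 0.3 m across:", mean_nn_distance(across, corridor))
```

A full metre of sliding along the corridor costs far less than 30 cm of sideways motion, and what little it does cost comes only from the points that run off the end of the walls. The matcher therefore has almost nothing to hold on to in that direction.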

The Frontier: Richer, Smarter, Lifelong

Classic SLAM was content with a sparse skeleton — just enough landmarks to localize against. The frontier wants more. Dense mapping reconstructs full continuous surfaces, so the robot has not a cloud of dots but a watertight model it can plan paths through and even render. Semantic mapping goes further, attaching meaning to the geometry using tools like semantic segmentation: this patch is floor, that is a chair, that is a doorway. A map that knows what things are is one a robot can reason about, not merely avoid.

On the front-end, hand-crafted feature detectors are increasingly joined by learned features — neural networks trained to pick points that stay recognizable across harsh lighting, weather, and viewpoint changes. These learned descriptors are especially good at place recognition for loop closure, spotting that two very different-looking images are in fact the same corner of a room. The graph and the optimizer underneath barely change; it is the quality of the edges feeding them that quietly leaps forward.
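Place recognition with learned descriptors often comes down to a similarity search. In the sketch below, random vectors stand in for what a trained network would output, and the 0.8 threshold is an arbitrary value picked for illustration; both are assumptions, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for learned global descriptors: one 128-D vector per stored keyframe.
# A real system would get these from a trained network; random vectors are
# used here only so the sketch runs on its own.
keyframes = rng.normal(size=(50, 128))
keyframes /= np.linalg.norm(keyframes, axis=1, keepdims=True)

# The robot returns to keyframe 17's location; its appearance has changed a little.
query = keyframes[17] + 0.03 * rng.normal(size=128)
query /= np.linalg.norm(query)

similarity = keyframes @ query            # cosine similarity of unit vectors
best = int(similarity.argmax())
is_loop_closure = similarity[best] > 0.8  # assumed threshold
print(best, float(similarity[best]), bool(is_loop_closure))
```

A match that clears the threshold becomes a candidate loop-closure edge between the current node and the stored one. In practice such candidates are usually verified geometrically before they are added, since a false loop closure can distort the whole graph.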

The hardest open problem is lifelong SLAM: keeping a map usable for months in a world that refuses to hold still. Furniture moves, seasons change, shops repaint, parked cars come and go. A map frozen at first sight slowly becomes a lie. Lifelong systems must decide what to trust as permanent, what to forget as transient, and how to update the graph without it ballooning forever — a robot that lives somewhere should get better at navigating it over time, not gradually more confused.