What Is It and Where: Detection and Segmentation

From colored pixels to meaning

A camera hands the robot a grid of numbers. Each pixel says “here is a little patch of reddish-brown” — but it never says “this patch belongs to a coffee mug, and the mug is sitting on the desk to your left.” Knowing a pixel’s color is easy; knowing what that pixel is part of, and where that thing sits in the world, is the hard and useful part.

Bridging that gap is the job of two closely related tasks. One answers “what is it, and where?” with rough boxes; the other answers the same question pixel by pixel, with outlines that follow each object’s shape. Both are powered by convolutional neural networks that have studied enormous piles of example images, and both feed the robot the labels it needs before it can plan a grasp or steer around an obstacle.

Object detection: a box and a name

Object detection draws a tight rectangle — a “bounding box” — around each thing it recognizes and stamps it with a class label and a confidence score: “mug, 0.93” or “person, 0.88.” In one pass over an image it can find many objects at once, even overlapping ones, and tell you roughly where each lives in the picture.
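To make the output concrete, here is a minimal sketch of what a detector hands back and how a robot might drop its weakest guesses. The `Detection` record, the `(x_min, y_min, x_max, y_max)` box layout, and the 0.5 cutoff are illustrative assumptions, not the format of any particular detector.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # class name, e.g. "mug"
    score: float  # confidence between 0 and 1
    box: tuple    # assumed layout: (x_min, y_min, x_max, y_max) in pixels


def keep_confident(detections, min_score=0.5):
    """Drop detections whose confidence falls below min_score."""
    return [d for d in detections if d.score >= min_score]


def box_center(box):
    """Return the (x, y) pixel at the middle of a box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


# Hypothetical detections for one image; the numbers are made up.
detections = [
    Detection("mug", 0.93, (40, 60, 120, 160)),
    Detection("person", 0.88, (200, 10, 320, 240)),
    Detection("mug", 0.21, (300, 300, 310, 312)),  # a weak guess
]

confident = keep_confident(detections)
for d in confident:
    print(d.label, d.score, box_center(d.box))
```

The cutoff is a trade-off: raise it and the robot acts on fewer false alarms but misses more real objects.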

A robot almost always needs both halves of that answer. The class tells it how to behave — you pick up a mug, you stop for a person, you ignore a shadow. The location tells it where to point the arm or which way to turn. A label without a location is a rumor; a location without a label is a blur. Detection gives you both, fast enough to run on live video.
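As a rough sketch of how the two halves combine, the snippet below turns a label into a behavior and a box into a direction. The `ACTIONS` table, the `decide` helper, and the 0.1 dead zone around the image center are hypothetical choices made for illustration.

```python
# Hypothetical mapping from class label to robot behavior.
ACTIONS = {"mug": "grasp", "person": "stop"}


def decide(label, box, image_width):
    """Pick a behavior from the label and a side from the box position."""
    action = ACTIONS.get(label, "ignore")  # unknown labels are ignored
    x_center = (box[0] + box[2]) / 2
    # Offset runs from -0.5 (far left edge) to +0.5 (far right edge).
    offset = x_center / image_width - 0.5
    if offset < -0.1:
        side = "left"
    elif offset > 0.1:
        side = "right"
    else:
        side = "center"
    return action, side


print(decide("mug", (40, 60, 120, 160), image_width=640))
print(decide("shadow", (300, 0, 340, 50), image_width=640))
```

Neither input is enough alone: without the label the function cannot choose a behavior, and without the box it cannot choose a side.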

Semantic segmentation: labeling every pixel

Semantic segmentation takes the same question to its limit: it assigns a class to every single pixel. Instead of a box that says “a mug is somewhere in here,” you get a mug-shaped stencil that follows the handle and the rim exactly, with “desk” and “wall” filled in around it. The output is a color-coded map of the scene where each color means a category.
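One common way to produce such a map, sketched here under the assumption that the network emits one score per class for every pixel, is to keep the highest-scoring class at each position. The class list and the tiny 2-by-3 "image" of scores are invented for the example.

```python
CLASSES = ["wall", "desk", "mug"]  # assumed class list for this toy scene


def label_map(scores):
    """scores[row][col] holds one score per class; return the winning class index per pixel."""
    return [
        [max(range(len(px)), key=lambda k: px[k]) for px in row]
        for row in scores
    ]


# Made-up per-pixel scores for a 2-row, 3-column image.
scores = [
    [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.2, 0.1, 0.7]],
    [[0.1, 0.8, 0.1], [0.1, 0.3, 0.6], [0.1, 0.2, 0.7]],
]

labels = label_map(scores)
names = [[CLASSES[k] for k in row] for row in labels]
for row in names:
    print(row)
```

Swapping each class index for a color gives the color-coded map described above.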

That sharpness pays off in two everyday robot jobs. For grasping, the outline tells the gripper which pixels are the object and which are the table behind it, so it can place its fingers on the real edges. For navigation, a per-pixel mask of “floor” versus “obstacle” lets a mobile robot trace the boundary of a doorway or a curb instead of swerving around a fat box.
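A small sketch shows how much a box can overstate an object. It assumes a non-empty binary mask stored as rows of 0s and 1s; `mask_bbox` and `fill_ratio` are hypothetical helper names.

```python
def mask_bbox(mask):
    """Tightest box around the 1-pixels, as inclusive (x_min, y_min, x_max, y_max)."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return (min(cols), min(rows), max(cols), max(rows))


def fill_ratio(mask):
    """Fraction of the bounding box that is actually covered by the object."""
    x0, y0, x1, y1 = mask_bbox(mask)
    box_area = (x1 - x0 + 1) * (y1 - y0 + 1)
    object_area = sum(sum(row) for row in mask)
    return object_area / box_area


# A made-up wedge-shaped obstacle: 1 marks object pixels.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

print(mask_bbox(mask))
print(round(fill_ratio(mask), 2))
```

Here a third of the box is free space that a box-only robot would steer around or try to grip.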

How the network learns it — and how it fails

Both tasks are learned, not programmed. A convolutional neural network slides small filters across the image; early layers react to edges and blobs, and deeper layers stack those cues into “handle-like” and “mug-like” patterns. Nobody writes those filters by hand — they are shaped by training, the heart of modern deep learning.
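The sliding itself is simple enough to write out. The sketch below moves one small filter across a toy grayscale image and sums the products at each position; the filter values are hand-picked here purely to show the mechanics, whereas in a trained network they would be learned.

```python
def slide_filter(image, kernel):
    """Slide kernel over image (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Multiply the patch under the filter by the filter and add it up.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            row.append(total)
        out.append(row)
    return out


# Toy image: dark on the left, bright on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Hand-picked filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

response = slide_filter(image, vertical_edge)
for row in response:
    print(row)
```

The response is large only where the filter straddles the edge and zero over the flat regions, which is the sense in which an early layer "reacts to edges."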

Training follows a four-step recipe:

1. Collect millions of images where humans have already drawn boxes or painted pixel masks. These annotations are the answer key.
2. Show the network an image, let it guess, and measure how far the guess is from the human answer.
3. Nudge every filter a little to shrink that error, then repeat over the whole dataset many times.
4. Stop when, on images it never saw during training, its boxes and masks line up with what a person would draw.
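The same loop can be sketched at toy scale. This is a deliberately tiny stand-in, not a real perception model: one weight and one bias replace the filters, plain numbers replace images, and the learning rate, stopping threshold, and data are assumptions chosen so the example runs in an instant.

```python
def predict(w, b, x):
    return w * x + b


def mean_squared_error(w, b, data):
    return sum((predict(w, b, x) - y) ** 2 for x, y in data) / len(data)


def train(train_data, held_out, lr=0.05, max_epochs=2000, target=1e-4):
    """Guess, measure the error, nudge the parameters, repeat; stop on held-out error."""
    w, b = 0.0, 0.0
    epochs_run = 0
    for _ in range(max_epochs):
        epochs_run += 1
        for x, y in train_data:
            error = predict(w, b, x) - y  # how far the guess is from the answer
            w -= lr * 2 * error * x       # nudge each parameter to shrink the error
            b -= lr * 2 * error
        # Check on examples that were never used for the nudging.
        if mean_squared_error(w, b, held_out) < target:
            break
    return w, b, epochs_run


# "Answer key": inputs paired with the correct outputs of y = 2x + 1.
train_data = [(x, 2 * x + 1) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
held_out = [(x, 2 * x + 1) for x in (0.25, 1.75)]

w, b, epochs_run = train(train_data, held_out)
print(round(w, 2), round(b, 2), epochs_run)
```

A real detector or segmenter has millions of parameters and a more elaborate error measure, but the rhythm of guess, compare, and nudge is the same.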

This same recipe underlies machine learning more broadly, and because the model only ever knows what it was shown, it also explains where perception breaks. A model trained on daytime city streets may flounder at night or on a farm; an object absent from the training set gets forced into the nearest familiar class or missed entirely.
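Why an unfamiliar object gets "forced" into a familiar class can be seen in a small sketch. It assumes the classifier ends in a softmax over a fixed list of known classes, a common design that this text does not specify; the class list, raw scores, and 0.6 threshold are made up.

```python
import math

KNOWN = ["mug", "person", "chair"]  # the only classes this toy model was trained on


def softmax(scores):
    """Turn raw scores into probabilities that always sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores, min_confidence=0.6):
    """Return (label, probability); label is None when nothing clears the threshold."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=lambda k: probs[k])
    if probs[best] < min_confidence:
        return None, probs[best]  # the object is missed entirely
    return KNOWN[best], probs[best]


# Hypothetical raw scores for a kettle, which is not in KNOWN.
print(classify([2.0, 0.1, 0.3]))  # looks mug-like, so it is reported as a mug
print(classify([0.4, 0.3, 0.5]))  # resembles nothing strongly, so it is dropped
```

Because the probabilities must sum to 1 across the known classes, there is no slot for "something I have never seen," so the model either picks the nearest familiar class or reports nothing.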