Detecting & Segmenting Objects

From "what" to "where"

In the previous guide you built an image classifier: feed in a picture, get back a single label — *cat*. That is genuinely useful, but it answers only one question. A self-driving car does not just need to know that a pedestrian *exists somewhere* in the frame; it needs to know where, how big, and whether there are three pedestrians or one. The jump from one label per image to *locating* things is the leap from classification to detection and segmentation.

There is a ladder of increasing precision. Classification gives one label for the whole image. Object detection draws a bounding box around each thing and labels it: *cat here, dog there*. Semantic segmentation goes finer still, labeling every single pixel by category. Instance segmentation is finest: it gives each object its own pixel mask, so one cat is separated from another. Each rung asks the model to commit to more, and each is harder to train and to evaluate.

Boxes, and how we score them: IoU

A bounding box is just four numbers — usually the corner coordinates, or a center plus width and height — together with a class label and a confidence score. Predicting a box turns part of the problem into *regression*: instead of choosing among classes, the network outputs real-valued coordinates and is trained to nudge them toward the true box. But this raises a sharp question: if the model predicts a box and the ground truth is a slightly different box, how *right* was it? You cannot demand pixel-perfect equality.
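As a small illustration of the two box encodings mentioned above, the sketch below converts between corner form and center form. The function names and the tuple layouts `(x1, y1, x2, y2)` and `(cx, cy, w, h)` are assumptions chosen for this example; real libraries differ in ordering and in whether coordinates are normalized.

```python
def corners_to_center(box):
    """Convert (x1, y1, x2, y2) corners to (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def center_to_corners(box):
    """Convert (cx, cy, w, h) back to (x1, y1, x2, y2) corners."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Round trip on an illustrative box.
corner_box = (10, 20, 50, 60)
center_box = corners_to_center(corner_box)
print(center_box)
print(center_to_corners(center_box))
```

Either form carries the same information; which one a model regresses is a design choice.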

The standard answer is [[intersection-over-union|Intersection over Union]], or IoU. Take the predicted box and the true box; measure the area where they overlap, then divide by the area they cover together. If the boxes coincide perfectly, IoU is 1; if they do not touch at all, it is 0. A common rule is to count a detection as "correct" only if its IoU with a true box clears a threshold, often 0.5. This single ratio shows up everywhere: in training, in deciding which prediction matches which object, and in the leaderboard metrics.

IoU = area(overlap) / area(union)

     true box        predicted box
     +--------+
     |   +----|-----+      overlap = the shared rectangle
     |   |    |     |      union   = both boxes combined
     +---|----+     |
         +----------+      IoU = overlap / union   (0 .. 1)
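The ratio in the diagram can be computed in a few lines. This is a minimal sketch that assumes axis-aligned boxes given as `(x1, y1, x2, y2)` corners with `x1 < x2` and `y1 < y2`; the name `box_iou` is chosen for this example.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the shared rectangle, clamped at zero
    # so that boxes that do not touch contribute no overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    overlap = inter_w * inter_h
    # Union = both areas added together, minus the part counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - overlap
    return overlap / union if union > 0 else 0.0


print(box_iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # partial overlap: 1 / 7
print(box_iou((0, 0, 1, 1), (5, 5, 6, 6)))  # no contact
```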

IoU rewards boxes that line up. A detector typically produces many overlapping boxes for one object; we then keep the most confident and suppress the rest (non-max suppression) using IoU to decide what counts as a duplicate.
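Non-max suppression is easiest to see as a greedy loop. The sketch below assumes detections are `(box, score)` pairs for a single class, with boxes as `(x1, y1, x2, y2)` corners, and an illustrative duplicate threshold of 0.5; production detectors usually run this per class and often use optimized or "soft" variants.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    overlap = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - overlap
    return overlap / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy non-max suppression over (box, score) pairs."""
    kept = []
    # Visit detections from most to least confident.
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a box only if it is not a near-duplicate of one already kept.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


detections = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 11, 11), 0.8),      # heavy overlap with the first box
    ((50, 50, 60, 60), 0.7),    # a separate object
]
print(nms(detections))
```

With these inputs the second box overlaps the first with an IoU of about 0.68, so it is suppressed and the other two survive.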

Because IoU just measures region overlap, it is not exclusive to boxes. The very same ratio scores segmentation masks, where you compare predicted pixels against true pixels. Keep this in mind: IoU is the connective tissue between everything in this guide. When a paper brags about "mAP at 0.5," it is reporting average precision (the area under the precision-recall curve) averaged across classes, with an IoU threshold of 0.5 deciding which detections count as hits and which as misses.
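To make the mask version concrete, here is a minimal sketch that assumes binary masks stored as equally sized nested lists of 0 and 1. The name `mask_iou` is chosen for this example; array libraries would do the same thing with elementwise logical operations.

```python
def mask_iou(pred, true):
    """IoU of two binary masks given as equally sized lists of rows."""
    overlap = 0
    union = 0
    for pred_row, true_row in zip(pred, true):
        for p, t in zip(pred_row, true_row):
            if p and t:
                overlap += 1   # pixel is in both masks
            if p or t:
                union += 1     # pixel is in at least one mask
    return overlap / union if union else 0.0


pred = [[1, 1, 0],
        [1, 1, 0],
        [0, 0, 0]]
true = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
print(mask_iou(pred, true))  # 2 shared pixels out of 6 covered
```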

Two ways to detect: regions first, or all at once

The first family that worked well is [[r-cnn|R-CNN]] and its descendants (Fast R-CNN, Faster R-CNN). The idea is a two-stage pipeline: first *propose* candidate regions that might contain something (a few hundred in later variants, roughly two thousand in the original R-CNN), then run a classifier on each proposal to say what it is and refine its box. It is accurate, because the second stage gets to look carefully at each candidate. The cost is speed: running a classifier on that many regions per image is heavy, which made early R-CNN far too slow for video.
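The control flow of a two-stage detector can be sketched without any real model. Everything here is a stand-in: `two_stage_detect`, `toy_propose`, and `toy_classify` are hypothetical names, and the toy functions return hard-coded values purely to show where each stage plugs in.

```python
def two_stage_detect(image, propose, classify_and_refine, min_score=0.5):
    """Run stage 1 (proposals), then stage 2 (classify and refine each one)."""
    detections = []
    for region in propose(image):                               # stage 1
        label, score, box = classify_and_refine(image, region)  # stage 2
        if label != "background" and score >= min_score:
            detections.append((box, label, score))
    return detections


def toy_propose(image):
    # Hypothetical proposer: always suggests the same two regions.
    return [(0, 0, 4, 4), (2, 2, 6, 6)]


def toy_classify(image, region):
    # Hypothetical classifier: calls one fixed region a cat, the rest background.
    if region == (0, 0, 4, 4):
        return "cat", 0.9, region
    return "background", 0.8, region


print(two_stage_detect(None, toy_propose, toy_classify))
```

The loop makes the speed cost visible: stage 2 runs once per proposal, so the work grows with the number of regions.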

The second family flipped the design. [[yolo|YOLO]] ("You Only Look Once") does it all in a single forward pass. It divides the image into a grid and, for each cell, predicts boxes and class probabilities directly. No separate proposal stage; one network, one look. That makes it fast enough to run on live video, which is why YOLO and its single-stage cousins power much of the real-time detection you see in the wild: cameras, drones, sports analytics.
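The grid idea can be illustrated with a little arithmetic. The sketch below follows the layout of the original YOLO formulation, where the cell containing an object's center is responsible for it and each cell outputs 5 numbers per box (four coordinates plus a confidence) together with one probability per class. The function names and the specific sizes (a 7×7 grid, 448×448 input, 2 boxes per cell, 20 classes) are illustrative assumptions; later versions changed the details.

```python
def responsible_cell(box, image_w, image_h, grid=7):
    """Return (row, col) of the grid cell containing the center of a corner-form box."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    # Scale the center into grid units; clamp so a center on the far edge stays in range.
    col = min(int(cx / image_w * grid), grid - 1)
    row = min(int(cy / image_h * grid), grid - 1)
    return row, col


def values_per_cell(boxes_per_cell, num_classes):
    """Numbers predicted per cell: (x, y, w, h, confidence) per box, plus class scores."""
    return boxes_per_cell * 5 + num_classes


print(responsible_cell((100, 200, 200, 300), image_w=448, image_h=448))
print(values_per_cell(boxes_per_cell=2, num_classes=20))
```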

From boxes to pixels: segmentation

A box is a blunt tool. A box around a winding road, a cat curled in a circle, or a person with arms outstretched includes a lot of background that is not part of the object. Segmentation fixes this by classifying every pixel. In [[semantic-segmentation-vision|semantic segmentation]], each pixel gets a *category* label (*road*, *sky*, *car*), but all the cars share one label. It answers "what is this pixel?" without caring how many distinct objects there are.

[[instance-segmentation|Instance segmentation]] adds the missing distinction: it labels the pixels of each object *and* tells one object from another, so the three cars in a scene get three separate masks. A popular approach, Mask R-CNN, adds a small mask-prediction head to the Faster R-CNN detector: first find the box, then color in which pixels inside it belong to the object. This is a tidy illustration that detection and segmentation are not rivals; the pixel mask is often built *on top of* a detected box.
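One way to see the difference between the two label formats is that instance masks can always be collapsed into a semantic map, while the reverse loses information. The sketch below assumes each instance is a `(class_id, mask)` pair with binary masks as nested lists; the name `instances_to_semantic` and the class ids are made up for this example.

```python
def instances_to_semantic(instances, height, width, background=0):
    """Collapse per-object masks into one per-pixel category map."""
    semantic = [[background] * width for _ in range(height)]
    for class_id, mask in instances:
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    semantic[r][c] = class_id  # object identity is dropped here
    return semantic


CAR = 1  # illustrative class id
car_a = [[1, 1, 0, 0],
         [0, 0, 0, 0]]
car_b = [[0, 0, 0, 1],
         [0, 0, 0, 1]]

semantic = instances_to_semantic([(CAR, car_a), (CAR, car_b)], height=2, width=4)
print(semantic)
```

The output marks every car pixel with the same id, so it no longer records that there were two cars.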

How does a network output a full-resolution label map? The backbone shrinks the image down into coarse feature maps, so a segmentation model has to *upsample* back to the original size. An elegant design for this is the U-Net: a contracting path that captures context, a symmetric expanding path that recovers spatial detail, and skip connections that hand fine-grained edges from the early layers straight across to the late ones. Originally built for biomedical images, its shape recurs across the segmentation world.
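The mechanics of one expanding-path step can be sketched in plain Python. This is a simplification under stated assumptions: feature maps are lists of 2D channels, upsampling is nearest-neighbor (the original U-Net uses learned up-convolutions), and the skip connection is a channel concatenation. The function names are chosen for this example.

```python
def upsample_nearest(channel, factor=2):
    """Enlarge a 2D channel by repeating each value factor x factor times."""
    out = []
    for row in channel:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def decoder_step(coarse_channels, skip_channels, factor=2):
    """Upsample the coarse channels, then append the encoder's skip channels."""
    upsampled = [upsample_nearest(ch, factor) for ch in coarse_channels]
    for up, skip in zip(upsampled, skip_channels):
        if len(up) != len(skip) or len(up[0]) != len(skip[0]):
            raise ValueError("skip and upsampled maps must have the same size")
    # Concatenation along the channel axis: context first, fine detail after.
    return upsampled + skip_channels


coarse = [[[5]]]                  # one 1x1 channel from deep in the network
skip = [[[1, 2], [3, 4]]]         # one 2x2 channel from an early layer
print(decoder_step(coarse, skip))
```

In a real network a convolution would follow the concatenation to mix the coarse context with the fine detail.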

What it takes to make this work — and where it breaks

None of this is magic; it is paid for in labeled data. A classification example needs one tag per image. A detection example needs a box drawn around *every* object. An instance-segmentation example needs a hand-traced outline of *every* object, which is painstaking, expensive annotation. This is why benchmark datasets like COCO, with hundreds of thousands of images and more than a million outlined object instances, were such landmark efforts: the model's ceiling is set by the quality and coverage of those labels.

1. Choose a task: boxes (detection), per-pixel categories (semantic), or per-pixel + per-object (instance). Each demands a different label format and budget.
2. Take a backbone pretrained on a large image dataset and fine-tune it; you rarely train from scratch.
3. Train against a loss combining classification, box regression, and (for segmentation) per-pixel terms.
4. Evaluate with IoU-based metrics, then look at the actual mistakes, not just the headline number.
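The evaluation step can be sketched as matching predictions to ground truth at an IoU threshold and counting hits and misses. This is a simplified, single-class, single-image version: predictions are `(box, score)` pairs, each true box can be claimed once, and the function name is an assumption for this example. Benchmark tooling differs in matching details and goes on to compute average precision over the whole score range.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    overlap = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - overlap
    return overlap / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth, iou_threshold=0.5):
    """Greedily match predictions to true boxes; return (precision, recall)."""
    matched = set()
    true_positives = 0
    # Most confident predictions get first pick of the true boxes.
    for box, _score in sorted(predictions, key=lambda p: p[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truth):
            if idx in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((100, 100, 110, 110), 0.6)]
print(precision_recall(preds, truth))
```

Here one of three predictions is a hit and one of two objects is found; the duplicate box and the box around nothing are exactly the kinds of mistakes worth inspecting by eye.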

Be honest about the failure modes, because they are everywhere in deployed systems. Detectors struggle with tiny, distant, or heavily overlapping objects; they can hallucinate confident boxes around nothing, or miss an obvious object under unusual lighting. A model trained on daytime city streets degrades sharply on snow or at night — a distribution shift the IoU on your test set will never warn you about. High benchmark scores measure performance on data that looks like the training set, not robustness to the messy world.

Where next? The same locate-and-label machinery generalizes: track those boxes across video frames, lift them into 3D for robotics, or replace the convolutional backbone with a vision transformer — the subject of the next guide. The core ideas you now hold — boxes versus masks, semantic versus instance, and IoU as the ruler for all of it — carry straight through.