How Computers See Images

A photo is a grid of numbers

Zoom into any digital photo far enough and the picture dissolves into tiny colored squares. Each square is a pixel (short for "picture element"), and to a computer it is not a color at all but a number measuring brightness. In a grayscale image, one pixel is a single value; in the common 8-bit format it runs from 0 (black) to 255 (white), with everything in between a shade of gray. A 1000×1000 photo is therefore just a million such numbers laid out in a grid: rows and columns of brightness. There is no "cat" in there, no "sky", only a vast spreadsheet of intensities.

Color works the same way, just stacked. A color pixel is usually three numbers — how much red, green, and blue light it emits — because the human eye has three types of color receptor, and mixing those three primaries reproduces most colors we can see. So a color photo is really three grids stacked on top of each other: one for red, one for green, one for blue. These stacked grids are called channels.
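As a minimal sketch of the "stacked grids" idea, assuming NumPy and a made-up 2×2 image whose pixel values are invented purely for illustration, the three channels can be sliced out of a color image like this:

```python
import numpy as np

# A made-up 2x2 color image: each pixel holds (red, green, blue), 0..255.
img = np.array(
    [[[255, 0, 0], [0, 255, 0]],
     [[0, 0, 255], [255, 255, 255]]],
    dtype=np.uint8,
)

# Slice out the three stacked grids (channels), one per primary.
red, green, blue = img[..., 0], img[..., 1], img[..., 2]

print(img.shape)  # (2, 2, 3): height, width, channels
print(red)        # the red grid alone, a plain 2x2 matrix
```

Each of `red`, `green`, and `blue` is an ordinary height × width matrix, exactly like a grayscale image.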

From grids to a tensor

You already know the ladder of shapes: a single number is a scalar, a line of numbers is a vector, a grid is a matrix, and anything with more axes is a tensor. A grayscale image is exactly a matrix: height × width. A color image is one step up: height × width × 3 channels, a tensor with three axes. This channels-last layout is one of two common conventions; the other, channels-first, puts the channel axis ahead of height and width (3 × height × width). Which one a vision model expects depends on the framework, and getting the axis order right is a daily, unglamorous part of the job.

In practice we rarely process one image at a time. We stack many images into a batch, adding a fourth axis up front, so the tensor that actually moves through training is batch × height × width × channels. That four-axis block of numbers is the true "input", and every layer you studied earlier, every matrix multiply and convolution, is just arithmetic that transforms that block.

grayscale  shape = (H, W)            # one value per pixel
color      shape = (H, W, 3)         # red, green, blue stacked
batch      shape = (N, H, W, 3)      # N images flowing together

# a single pixel in a color image is just three numbers:
img[120, 64] = [231, 76, 60]         # reddish: high R, low G, low B

The same picture as numbers: a grayscale matrix, a 3-channel color tensor, and a batch of N images, shown here in channels-last order.
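The shapes above can be reproduced in a few lines. This is a sketch assuming NumPy; the sizes `H`, `W`, and `N` are arbitrary illustrative values, and the images are blank placeholders rather than real photos.

```python
import numpy as np

H, W, N = 4, 6, 8  # arbitrary illustrative sizes

gray = np.zeros((H, W), dtype=np.uint8)       # one value per pixel
color = np.zeros((H, W, 3), dtype=np.uint8)   # red, green, blue stacked
batch = np.stack([color] * N)                 # new leading axis: (N, H, W, 3)

# Convert channels-last (N, H, W, C) to channels-first (N, C, H, W).
batch_cf = batch.transpose(0, 3, 1, 2)

color[1, 2] = [231, 76, 60]                   # set one pixel to a reddish value

print(gray.shape, color.shape, batch.shape, batch_cf.shape)
```

The `transpose` call is the usual fix when data arrives in one axis order and a framework expects the other. For example, PyTorch convolution layers expect channels-first input, while Keras defaults to channels-last.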

What is an edge, really?

Raw pixels are almost useless on their own — knowing that pixel (120, 64) is reddish tells you nothing about what the picture is of. The first useful thing you can extract is an edge: a place where brightness changes sharply from one pixel to the next. Run your eye across a photo of a mug on a table, and the boundary between mug and background is exactly where the numbers jump. An edge is just a large local *difference* between neighboring pixels.
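To make "a large local difference" concrete, here is a small sketch assuming NumPy and a single made-up row of grayscale pixels (the values are invented: a dark background followed by a bright mug).

```python
import numpy as np

# One made-up row of grayscale pixels: dark background, then a bright mug.
row = np.array([12, 14, 13, 15, 200, 205, 203, 204])

# Difference between each pixel and its left neighbor. Casting to a signed
# type first matters for real uint8 images, where subtraction would wrap.
diff = np.diff(row.astype(np.int32))

edge_at = int(np.argmax(np.abs(diff)))  # position of the largest jump

print(diff)     # small values in flat regions, one large value at the edge
print(edge_at)  # 3: the jump sits between pixels 3 and 4
```

Everywhere the row is flat the differences stay in the single digits; the boundary shows up as the one value that dwarfs the rest.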

You can find edges with a tiny sliding window: a small grid of weights that you slide over the image, multiplying and summing as it goes. Choose the weights so the window outputs a value far from zero where pixels differ and near zero where they are flat, and you have an edge detector. This sliding-window operation *is* a convolution (strictly, the deep-learning version skips the kernel flip of the textbook definition, which makes it a cross-correlation, but the name has stuck), and its output is a fresh grid called a feature map that lights up wherever that pattern occurs. For decades, engineers hand-designed these little filters and the pipelines around them, such as the Sobel kernels and the multi-stage Canny edge detector, because deciding what counts as a useful feature was a craft in itself.
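The sliding window can be written out directly. The sketch below assumes NumPy; `slide_filter` is a hypothetical helper written for this example (not a library function), it uses no padding, and it does not flip the kernel, so it is the cross-correlation form that deep-learning libraries call convolution. The test image is made up: dark on the left, bright on the right.

```python
import numpy as np

def slide_filter(image, kernel):
    """Slide `kernel` over `image` (no padding), multiplying and summing."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)
    return out

# Sobel kernel that responds to left-to-right brightness changes.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Made-up 5x6 image: dark left half (0), bright right half (255).
image = np.zeros((5, 6))
image[:, 3:] = 255

feature_map = slide_filter(image, sobel_x)
print(feature_map[0])  # zero in the flat regions, large where the window straddles the edge
```

The explicit double loop is slow and is here only for readability; real libraries compute the same sums with heavily optimized routines.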

The quiet revolution of convolutional networks was to stop hand-designing those filters and *learn* them instead. The weights in each window become parameters tuned by gradient descent. Stack several layers and the network builds a hierarchy: roughly speaking, the first layer learns edges and color blobs, the next combines edges into corners and textures, the next into parts such as eyes, wheels, and leaves, and the top into whole objects. Edges are simply the bottom rung of that ladder, and crucially, the machine discovers every rung on its own.
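A toy version of "learning the filter" fits in a few lines. This is only a sketch under strong simplifying assumptions: one 3-tap filter on a random 1-D signal, trained with plain gradient descent on mean squared error to reproduce the output of a hand-designed edge filter. A real network learns its filters from task labels through backpropagation across many layers; the setup, the step count, and the learning rate here are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(200)  # made-up 1-D "image row"

# Every 3-pixel window of the signal, one window per row.
windows = np.stack([signal[i:i + 3] for i in range(len(signal) - 2)])

# Target: the response of a hand-designed edge filter.
hand_designed = np.array([-1.0, 0.0, 1.0])
target = windows @ hand_designed

weights = np.zeros(3)  # the learned filter starts blank
learning_rate = 0.1
for _ in range(300):
    error = windows @ weights - target
    grad = 2.0 * windows.T @ error / len(target)  # gradient of mean squared error
    weights -= learning_rate * grad

print(np.round(weights, 3))  # should end up close to the hand-designed filter
```

Nobody told the loop what an edge filter looks like; the weights drift toward it only because that is what reduces the error.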

Why vision is genuinely hard

Here is the deep difficulty. A single object — say, one specific cat — can produce a practically infinite number of completely different pixel grids. Move it closer and every number changes. Rotate it, dim the lights, let a chair half-hide it, photograph it against snow instead of grass — each time the raw tensor is unrecognizably different, yet your brain says "same cat" instantly. The computer must learn that one stable idea ("cat") hides behind a staggering variety of pixel arrangements. This gap between raw pixels and meaning is called the semantic gap, and bridging it is the whole problem of vision.
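A tiny experiment shows how fragile raw pixels are. The sketch below assumes NumPy, and `square_image` is a hypothetical helper that draws a white square on a black canvas as a stand-in for "one object".

```python
import numpy as np

def square_image(top, left, size=8, canvas=32):
    """Made-up grayscale image: a white square on a black canvas."""
    img = np.zeros((canvas, canvas), dtype=np.uint8)
    img[top:top + size, left:left + size] = 255
    return img

original = square_image(4, 4)
shifted = square_image(4, 12)  # same square, moved 8 pixels right
dimmed = original // 2         # same square, half the brightness

changed_by_shift = int(np.count_nonzero(original != shifted))
changed_by_dimming = int(np.count_nonzero(original != dimmed))
shared_bright = int(np.count_nonzero((original > 0) & (shifted > 0)))

print(changed_by_shift, changed_by_dimming, shared_bright)  # 128 64 0
```

After a shift of just eight pixels, the two images have no bright pixel in common, even though any person would call them the same square.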

The numbers make it vivid. A modest 224×224 color image has 224 × 224 × 3 = 150,528 raw values, and the space of all possible such images is unimaginably vast, yet only an unbelievably thin sliver of it looks like anything real. The model has to carve the meaningful regions out of that enormous space from a relatively tiny pile of examples, which is one face of the curse of dimensionality you met earlier. The reason it works at all is that real images are not random: they are full of structure (nearby pixels are correlated, edges form lines, textures repeat), and a good inductive bias like convolution bakes that structure in.
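The size of that space can be estimated with a short calculation. This sketch assumes 8 bits (256 levels) per value, which is common but not universal, and counts the decimal digits of the number of possible images rather than constructing it.

```python
import math

height, width, channels = 224, 224, 3
levels = 256  # assumes 8 bits per value

values_per_image = height * width * channels  # 150528

# The number of distinct images is levels ** values_per_image. Count its
# decimal digits with logarithms instead of building the huge integer.
digits = math.floor(values_per_image * math.log10(levels)) + 1

print(values_per_image)
print(digits)
```

The count of possible images is a number with more than 360,000 decimal digits, while even a very large training set holds only millions or billions of examples.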

How the rest of this rung builds on this

Everything ahead in this rung stands on the picture you now have: an image is a tensor, features are learned in a hierarchy, and the central enemy is variation. Naming the single dominant object is image classification; drawing a box around each object and saying *where* it is becomes object detection; labeling every single pixel becomes segmentation. Later guides also revisit the convolutional assumption itself — the vision transformer throws away the sliding window, chops the image into patches, and lets attention decide what relates to what.
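The "chops the image into patches" step is itself just a reshape. The following sketch assumes NumPy, a 224×224 input, and a patch size of 16; both numbers are common choices but are assumptions here. The placeholder image is filled with distinct values so the result can be checked.

```python
import numpy as np

# Placeholder image with distinct values, shape (H, W, C).
image = np.arange(224 * 224 * 3).reshape(224, 224, 3)
patch = 16  # assumed patch size

h, w, c = image.shape
patches = (
    image.reshape(h // patch, patch, w // patch, patch, c)
         .transpose(0, 2, 1, 3, 4)          # bring the two grid axes together
         .reshape(-1, patch * patch * c)    # one flat row per patch
)

print(patches.shape)  # (196, 768): a 14 x 14 grid of patches, 16*16*3 values each
```

Each row of `patches` is one square tile of the image flattened into a vector, which is the kind of sequence an attention layer can then relate element by element.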

Two practical habits carry through it all. First, because variation is the enemy, we deliberately manufacture more of it during training: flipping, cropping, recoloring, and rotating images so the model learns to ignore the changes that do not matter. This is data augmentation, and it is one of the cheapest ways to make a vision model robust. Second, almost nobody trains from scratch. Practitioners typically start from a network already trained on a large labeled dataset such as ImageNet and adapt it: the transfer-learning move you saw earlier, which lets a small dataset stand on the shoulders of a huge one.
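Two of the augmentations named above, flipping and cropping, can be sketched with array slicing alone. This assumes NumPy; `augment` is a hypothetical helper, the 50% flip probability and 24-pixel crop are arbitrary choices, and the input is random noise standing in for a real photo. Training pipelines normally use a library's augmentation tools instead.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, crop=24):
    """Return a randomly flipped and cropped copy of an (H, W, C) image."""
    if rng.random() < 0.5:
        image = image[:, ::-1]             # horizontal flip
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)    # random crop position
    left = rng.integers(0, w - crop + 1)
    return image[top:top + crop, left:left + crop].copy()

# Random noise standing in for a 32x32 color photo.
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
views = [augment(image) for _ in range(4)]

print([v.shape for v in views])
```

Every call produces a slightly different view of the same picture, so the model sees the label stay fixed while the pixels move.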