Image Classification & ImageNet

The simplest question you can ask a picture

Of all the things you might ask a computer about an image, the most basic is also the most studied: "What is this a picture of?" Show the machine a photo and it returns a single word — cat, fire truck, golden retriever. That is image classification, and it is the task on which modern computer vision was first proven and is still benchmarked. You met images-as-tensors in the previous guide; here we ask what to *do* with that tensor once we have it.

Mechanically, classification is a familiar supervised-learning setup wearing a visual costume. The input feature is the raw grid of pixel numbers; the label is one category drawn from a fixed list the designers chose in advance. The network's job is to turn that grid into a vector of scores, one per category, and a softmax turns those scores into probabilities that sum to one. The predicted class is simply the highest-scoring entry.
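The last two steps, scores to probabilities and probabilities to a predicted class, fit in a few lines. The sketch below is illustrative only: the three-label list and the raw scores are made up, and a real network would produce one score per category in its fixed list.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to one."""
    # Subtracting the largest score leaves the result unchanged
    # but keeps exp() from overflowing on large inputs.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical label list and network outputs, for illustration only.
labels = ["cat", "fire truck", "golden retriever"]
scores = [2.0, -1.0, 0.5]

probs = softmax(scores)
predicted = labels[probs.index(max(probs))]  # highest-probability entry
print(predicted, [round(p, 3) for p in probs])
```

Because softmax preserves the ordering of the scores, picking the highest probability and picking the highest raw score always select the same class.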

Why pixels are a cruel input

It is tempting to flatten the pixel grid into one long list of numbers and feed it to an ordinary fully-connected network. People tried; it works poorly. A modest 224x224 color image is 150,528 numbers (224 x 224 x 3), and with a few thousand hidden units the first layer alone would need hundreds of millions of weights: a recipe for runaway cost and brutal overfitting. Worse, such a network treats the top-left pixel and the center pixel as unrelated coordinates, so a cat shifted three pixels to the right looks like a completely new input.
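The arithmetic behind that cost is easy to check. In the sketch below the hidden-layer widths (1,024 and 4,096) are assumptions chosen for illustration; the text does not commit to a particular width.

```python
def dense_layer_weights(num_inputs, num_hidden):
    """Number of weights in one fully-connected layer (biases ignored)."""
    return num_inputs * num_hidden

num_inputs = 224 * 224 * 3  # height x width x color channels = 150,528

# Hidden widths are assumed for illustration only.
for num_hidden in (1024, 4096):
    weights = dense_layer_weights(num_inputs, num_hidden)
    print(f"{num_hidden:>5} hidden units -> {weights:,} weights")
```

Every hidden unit needs its own weight for every input number, so the count grows with the product of the two sizes.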

The fix is the convolutional neural network you met earlier: slide a small filter across the image so the same pattern-detector is reused everywhere, building up edges, then textures, then parts, then whole objects layer by layer. This baked-in assumption — that a useful pattern is useful no matter where it appears — is a powerful inductive bias, and it is exactly what classification on natural photos needs. The pixels stop being a flat list and become a spatial map the network can reason over.
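A small NumPy sketch makes the "same detector everywhere" idea concrete. It slides one hand-set 3x3 vertical-edge filter over a toy image, then over a copy with the edge moved three pixels to the right. The filter weights and the toy image are assumptions for illustration; a real CNN learns its filters and stacks many of them.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one small filter over a 2-D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same nine weights are reused at every position.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hand-set vertical-edge detector (illustrative; a CNN would learn this).
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

image = np.zeros((8, 12))
image[:, 4:] = 1.0        # dark on the left, bright from column 4 onward

shifted = np.zeros((8, 12))
shifted[:, 7:] = 1.0      # the same edge, three pixels to the right

response = conv2d_valid(image, kernel)
response_shifted = conv2d_valid(shifted, kernel)

# The detector fires on the edge in both images; only the location moves.
print(np.argmax(response[0]), np.argmax(response_shifted[0]))
```

The filter has nine weights no matter how large the image is, and shifting the input shifts the response map by the same amount instead of producing an unrelated input.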

ImageNet and the 2012 earthquake

None of this could be proven without something to prove it on. In the late 2000s, Fei-Fei Li and collaborators built ImageNet: roughly fourteen million photographs, hand-labeled into more than twenty thousand categories organized by the WordNet hierarchy. The achievement was less any algorithm than the annotation itself: armies of crowd workers labeled images at a scale no academic dataset had reached. An annual contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), first held in 2010, judged systems on a 1,000-category subset with about 1.2 million training images.

For years the winners were elaborate hand-engineered pipelines, and progress crawled. Then in 2012 a network called AlexNet — built by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton — won by a margin that stunned the field: its top-5 error fell to about 16%, while the next-best entry sat near 26%. There was no single magic trick. It was a deep convolutional net trained on two consumer GPUs, using ReLU activations, dropout to fight overfitting, and heavy image augmentation. The lesson was that the ingredients had finally arrived together: enough data, enough compute, and a model with the right structure.
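The "top-5 error" quoted above counts a prediction as correct when the true label appears anywhere among the model's five highest-scoring classes. Below is a minimal NumPy sketch of that metric, run on synthetic scores rather than real model outputs; the function name `top_k_error` is our own and does not come from any benchmark toolkit.

```python
import numpy as np

def top_k_error(scores, true_labels, k=5):
    """Fraction of examples whose true class is not among the k highest scores."""
    # Indices of the k largest scores in each row (their order is irrelevant).
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = (top_k == true_labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 1000))   # synthetic: 4 images, 1,000 classes
true_labels = np.array([3, 17, 256, 999])

scores[0, 3] = 10.0      # image 0: true class pushed to the top
scores[1, 17] = 10.0     # image 1: true class pushed to the top
scores[2, 256] = -10.0   # image 2: true class pushed to the bottom
scores[3, 999] = -10.0   # image 3: true class pushed to the bottom

print(top_k_error(scores, true_labels, k=5))
```

Two of the four synthetic images have their true class ranked far outside the top five, so the sketch reports an error of one half.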

Transfer learning: don't start from zero

The deepest consequence of 2012 was not the winning score — it was what the trained network turned out to contain. A net that learns to tell apart a thousand ImageNet classes must, along the way, learn generally useful visual machinery: edge detectors, texture detectors, shape and part detectors. Those early and middle layers are not about "dogness" at all; they are about *seeing*. This is the promise of transfer learning: features learned on one big task can be reused on a different, smaller one.

In practice this is the everyday workflow of vision. You take a backbone whose pre-training on ImageNet is already done and freely downloadable, lop off its final 1,000-way classification head, and bolt on a small new head for your own classes — say, ten kinds of skin lesion. Then you do fine-tuning: train on your few thousand labeled images, often keeping the early layers frozen and only adjusting the later ones. A task that once demanded a million images and a GPU farm can now succeed with a few thousand images on a single card.

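import torch
import torchvision                               # sketch assumes torchvision >= 0.13; ResNet-18 is our stand-in backbone
backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")  # features, learned once
for layer in (backbone.conv1, backbone.bn1, backbone.layer1, backbone.layer2):
    layer.requires_grad_(False)                  # freeze: keep generic edge/texture detectors
backbone.fc = torch.nn.Linear(backbone.fc.in_features, 10)  # new head for your task: 10 skin lesions
optimizer = torch.optim.SGD([p for p in backbone.parameters() if p.requires_grad], lr=1e-3)  # lr is an assumed value
for images, labels in your_small_dataset:        # placeholder: your own DataLoader, a few thousand examples
    loss = torch.nn.functional.cross_entropy(backbone(images), labels)
    optimizer.zero_grad()
    loss.backward()                              # gradients reach only the unfrozen later layers and head
    optimizer.step()                             # adjust the top, not the bottom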

Transfer learning in a few lines: reuse a pre-trained backbone, replace the head, fine-tune lightly.

A newer twist is worth naming. Beyond convolutions, the vision transformer chops an image into patches and treats them like tokens, applying the same attention machinery you saw in language models. Vision transformers can beat CNNs — but mainly when pre-trained on truly enormous datasets, because they carry less built-in spatial bias and must learn it from data. For most everyday projects, a fine-tuned convolutional backbone remains a strong, cheap baseline; bigger is not automatically better.
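The "chop an image into patches" step is just a reshape. The sketch below assumes a 224x224 color image and 16x16 patches, a common choice but an assumption here. A full vision transformer would then pass each flattened patch through a learned linear projection and add position information, which this sketch omits.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size):
    """Cut an (H, W, C) image into non-overlapping square patches, one flat vector each."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, p, cols, p, C) -> (rows, cols, p, p, C) -> (rows * cols, p * p * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# Stand-in image: distinct values so patches are easy to tell apart.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)

tokens = image_to_patch_tokens(image, patch_size=16)
print(tokens.shape)  # (196, 768): 14 x 14 patches, each 16 * 16 * 3 numbers
```

Each row of `tokens` plays the role a word embedding plays in a language model. Nothing in the layout tells the model which patches are neighbors, which is part of why it has to learn spatial structure from data.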

What the leaderboard hides

It is easy to read 95% accuracy and assume the model "understands" images. It does not. A classifier optimizes whatever correlates with the label in its training set, and that often includes shortcuts no one intended: a model meant to spot pneumonia may quietly be reading which hospital's scanner took the X-ray, because one hospital saw sicker patients. This is the difference between learning the disease and learning a spurious correlation. High test accuracy does not prove the right thing was learned; it may only show that the test set shared the shortcut.

ImageNet itself carries dataset bias: it skews toward Western, internet-photographed objects, so a classifier trained on it recognizes a Western wedding dress far better than traditional dress from much of the world. Models are also brittle to distribution shift — a slight change in lighting, camera, or background can crater accuracy that looked flawless in the lab. None of this means classification is broken; it means a number on a leaderboard is the beginning of evaluation, not the end. The honest practitioner asks not just "how accurate?" but "accurate on whom, and by reading what?"
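One modest way to act on "accurate on whom?" is to report accuracy per subgroup next to the headline number. The records below are invented for illustration, and the group names `region_a` and `region_b` are hypothetical; a real audit would need group labels attached to a real evaluation set.

```python
from collections import defaultdict

def accuracy_by_group(predictions, labels, groups):
    """Return overall accuracy plus a per-group accuracy breakdown."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(predictions, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group

# Made-up evaluation records, for illustration only.
labels = ["wedding dress"] * 10
predictions = ["wedding dress"] * 8 + ["costume"] * 2
groups = ["region_a"] * 8 + ["region_b"] * 2

overall, per_group = accuracy_by_group(predictions, labels, groups)
print(overall, per_group)
```

In this toy data the headline accuracy is 80%, yet every example from the smaller group is wrong: exactly the kind of gap a single leaderboard number hides.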

Step back and the arc is clear. Classification gave the field a clean, scoreable question; ImageNet gave it data at a scale that made deep learning finally pay off; and transfer learning turned that one expensive triumph into a reusable foundation the whole community now builds on. The next guides in this rung add the abilities classification lacks (locating objects, segmenting scenes, and generating images), but every one of them stands on the representations that this task first taught machines to learn.