ImageNet Classification with Deep Convolutional Neural Networks
A deep network, trained on a mountain of images by GPUs, saw better than any program before it.
A deep neural network, fed over a million photos and trained on gaming graphics cards, learned to recognise objects far better than any program before — and kicked off the AI boom we're living in.
The idea, unpacked
For decades, getting a computer to tell a dog from a cat in a photo was painfully hard. Programmers hand-crafted rules about edges and shapes, and the results were mediocre. This paper took a different path: don't write the rules — let the machine learn them from examples, a huge pile of examples.
The team trained a “deep” neural network — a stack of simple layers loosely inspired by neurons in the brain — on over a million labelled photos. The first layers learn to spot tiny features like edges and patches of colour; deeper layers combine those into shapes, then parts, then whole objects. The clever engineering was using graphics cards (the chips built for video games) to do the enormous number of sums fast, plus a couple of tricks to keep such a big network learning well. The result didn't just edge ahead — it blew past everything before it.
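Two of the tricks the paper describes are ReLU, a very simple activation function, and dropout, which randomly switches off neurons during training so the network cannot lean too hard on any one of them. Here is a rough Python sketch of both, not the authors' code. The function names are made up for illustration, and the scaling inside `dropout` is an assumption explained in its comment.

```python
import random


def relu(values):
    """ReLU: keep positive numbers as they are, replace negatives with zero."""
    return [max(0.0, v) for v in values]


def dropout(values, drop_prob, rng):
    """Randomly zero each value with probability drop_prob (training time only).

    Assumption: survivors are scaled up by 1 / (1 - drop_prob), the "inverted"
    convention common in modern libraries. The paper instead leaves training
    outputs alone and multiplies outputs by 0.5 at test time.
    """
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = relu([-2.0, -0.5, 0.0, 1.5, 3.0])
print(activations)  # [0.0, 0.0, 0.0, 1.5, 3.0]
print(dropout(activations, 0.5, rng))  # a random subset is zeroed, the rest doubled
```

With a drop probability of 0.5, roughly half the neurons sit out each training step, so the network has to spread what it learns across many of them.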
Where it came from
In 2012, two graduate students, Alex Krizhevsky and Ilya Sutskever, working with Geoffrey Hinton at the University of Toronto, entered the ImageNet competition, a yearly contest to sort photos into a thousand categories. Their network, soon nicknamed AlexNet, didn't just win; its five best guesses missed the right label only 15.3% of the time, against 26.2% for the runner-up. The margin was so wide that within a couple of years most of the field had abandoned older methods and switched to deep learning. It's often called the “big bang” of the modern AI era.
Why it mattered
This was the proof, in public and at scale, that learning from data beats hand-written rules — and that the approach keeps getting better as you add more data and more computing power. Almost every AI you use today, from photo tagging and voice assistants to translation and the chatbots that came later, traces its lineage to the moment this network won.
How a network “sees”
A convolutional network looks at an image through a tiny sliding window. At each position it applies a little filter (a small grid of numbers) that lights up when it finds a particular pattern, like a vertical edge or a patch of one colour. Slide that filter across the whole picture and you get a “feature map” showing where the pattern appears. Stack layers of these filters and the network builds up from edges to shapes to objects.
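To make the sliding-filter idea concrete, here is a minimal Python sketch, not the paper's code. The function name `feature_map`, the tiny image, and the hand-made vertical-edge filter are all invented for illustration. A real network learns its filter values from data rather than having them typed in.

```python
def feature_map(image, kernel):
    """Slide `kernel` over `image` (both lists of rows) and collect the responses."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    result = []
    for top in range(out_rows):
        row = []
        for left in range(out_cols):
            # Multiply the window by the filter, cell by cell, and add it all up.
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        result.append(row)
    return result


# A 4x6 grey-scale image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# A hand-made filter that responds where the left side is dark and the right is bright.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

for row in feature_map(image, vertical_edge):
    print(row)
# [0, 3, 3, 0]
# [0, 3, 3, 0]
```

The 3s mark the positions where the window straddles the dark-to-bright boundary; the 0s mark flat regions where there is no edge to find.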
Where you've met it
The descendants of AlexNet are everywhere you let a computer look at images: the face unlock on your phone, the search that finds “photos of dogs” in your camera roll, medical scanners that flag tumours, and the cameras in self-driving cars. The same broad recipe — a deep network, lots of data, lots of compute — also underlies the AI that understands speech and language.
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes.
All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.