Artificial Intelligence · 2012

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever & Geoffrey Hinton

A deep network, trained on a mountain of images by GPUs, saw better than any program before it.


A deep neural network, fed over a million photos and trained on gaming graphics cards, learned to recognise objects far better than any program before — and kicked off the AI boom we're living in.

The idea, unpacked

For decades, getting a computer to tell a dog from a cat in a photo was painfully hard. Programmers hand-crafted rules about edges and shapes, and the results were mediocre. This paper took a different path: don't write the rules — let the machine learn them from examples, a huge pile of examples.

The team trained a “deep” neural network (a stack of simple layers loosely inspired by neurons in the brain) on over a million labelled photos. The first layers learn to spot tiny features like edges and patches of colour; deeper layers combine those into shapes, then parts, then whole objects. The clever engineering was using graphics cards (the chips built for video games) to do the enormous number of sums fast, plus two tricks to keep such a big network learning well: “non-saturating” neurons that speed up training, and a method called “dropout” that reduces overfitting. The result didn't just edge ahead; it blew past everything before it.

Where it came from

In 2012, two graduate students, Alex Krizhevsky and Ilya Sutskever, working with Geoffrey Hinton at the University of Toronto, entered the ImageNet competition, a yearly contest to sort photos into a thousand categories. Their network, soon nicknamed AlexNet, didn't just win; its top-5 error rate of 15.3% was far ahead of the 26.2% scored by the second-best entry. Within a few years most of the field had abandoned older methods and switched to deep learning. It's often called the “big bang” of the modern AI era.

Why it mattered

This was the proof, in public and at scale, that learning from data beats hand-written rules for recognising images, and that the approach keeps getting better as you add more data and more computing power. Much of the AI you use today, from photo tagging and voice assistants to translation and the chatbots that came later, traces its lineage to the moment this network won.

How a network “sees”

A convolutional network looks at an image through a tiny sliding window. At each position it applies a little filter, a small grid of numbers, that lights up when it finds a particular pattern, like a vertical edge or a patch of one colour. Slide that filter across the whole picture and you get a “feature map” showing where the pattern appears. Stack these up and the network builds from edges to shapes to objects.

Interactive figure: a 3×3 kernel slides over a 9×9 grayscale image and builds a 7×7 feature map (9 − 3 + 1 = 7 per side). Each output cell is the weighted sum of its 3×3 receptive field, passed through ReLU. Selectable kernels: edge, outline, sharpen, blur.
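
The sliding-window idea is small enough to write out by hand. What follows is a minimal sketch in plain Python, not the paper's GPU implementation: the function names, the toy 9×9 image, and the vertical-edge kernel values are all made up for illustration. It assumes no padding and a step of one pixel, which is why a 9×9 image and a 3×3 filter give a 7×7 feature map.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` one pixel at a time, with no padding.

    As in most neural-network code, the kernel is not flipped, so this is
    strictly a cross-correlation. Output size is (H - kH + 1) x (W - kW + 1).
    """
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1

    feature_map = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            # Weighted sum over the window whose top-left corner is (row, col).
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


def relu(feature_map):
    """Keep positive responses, set negative ones to zero."""
    return [[max(0, value) for value in row] for row in feature_map]


# Toy 9x9 image (illustrative): dark on the left (0), bright on the right (1).
image = [[0] * 4 + [1] * 5 for _ in range(9)]

# Illustrative vertical-edge filter: responds where the right side of the
# window is brighter than the left side.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = relu(convolve2d_valid(image, vertical_edge))

for row in feature_map:
    print(row)
```

Every row of the printed map is `[0, 0, 3, 3, 0, 0, 0]`: the filter stays quiet over the flat dark and flat bright regions and lights up only in the two window positions that straddle the dark-to-bright boundary.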

Where you've met it

The descendants of AlexNet are everywhere you let a computer look at images: the face unlock on your phone, the search that finds “photos of dogs” in your camera roll, medical scanners that flag tumours, and the cameras in self-driving cars. The same broad recipe — a deep network, lots of data, lots of compute — also underlies the AI that understands speech and language.

The original document
Original source text
A. Krizhevsky, I. Sutskever, G. E. Hinton · Advances in Neural Information Processing Systems 25 (NIPS 2012)
Abstract
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes.
The network and its result
On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
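
For readers unfamiliar with the two error rates quoted above: top-1 error counts an image as wrong unless the network's single best guess is the true label, while top-5 error counts it as wrong only if the true label is missing from the five best guesses. The sketch below is an illustration on invented toy scores (four images, six classes), not the competition's evaluation code; `top_k_error` and the data are hypothetical.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is NOT among the k highest scores."""
    misses = 0
    for class_scores, true_label in zip(scores, labels):
        # Class indices ordered from highest score to lowest.
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c],
                        reverse=True)
        if true_label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Invented scores for 4 images over 6 classes (each row sums to 1).
scores = [
    [0.70, 0.10, 0.05, 0.05, 0.05, 0.05],  # true class 0: best guess is right
    [0.30, 0.40, 0.10, 0.10, 0.05, 0.05],  # true class 0: second-best guess
    [0.25, 0.20, 0.18, 0.16, 0.15, 0.06],  # true class 5: ranked last of six
    [0.10, 0.10, 0.60, 0.10, 0.05, 0.05],  # true class 2: best guess is right
]
labels = [0, 0, 5, 2]

top1 = top_k_error(scores, labels, k=1)
top5 = top_k_error(scores, labels, k=5)
print(f"top-1 error: {top1:.2f}, top-5 error: {top5:.2f}")
```

On this toy data the top-1 error is 0.50 and the top-5 error is 0.25. The second image shows why top-5 is the more forgiving measure: the right answer was the runner-up, so it counts as a miss for top-1 but a hit for top-5.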
Making it train
To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective.
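
The two ideas named above can be sketched in a few lines. This is a hedged illustration in plain Python, not the authors' code: it assumes the “non-saturating neuron” is the rectified linear unit, max(0, x), and that dropout zeroes each hidden output with probability 0.5 during training and scales outputs by 0.5 at test time, which is how the full paper describes it. All function names and numbers are made up for the example.

```python
import math
import random


def relu(x):
    """Non-saturating: grows without bound for positive inputs."""
    return max(0.0, x)


def dropout_train(activations, drop_prob, rng):
    """Training time: zero each activation independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]


def dropout_test(activations, drop_prob):
    """Test time: keep every unit but scale by the keep probability."""
    return [a * (1.0 - drop_prob) for a in activations]


pre_activations = [-2.0, -0.5, 0.0, 1.5, 8.0]

# tanh flattens out near 1 for large inputs (it "saturates"); ReLU does not.
tanh_out = [math.tanh(x) for x in pre_activations]
relu_out = [relu(x) for x in pre_activations]

rng = random.Random(0)  # fixed seed so the training mask is repeatable
train_out = dropout_train(relu_out, drop_prob=0.5, rng=rng)
test_out = dropout_test(relu_out, drop_prob=0.5)

print("tanh:", [round(v, 4) for v in tanh_out])
print("relu:", relu_out)
print("dropout (train):", train_out)
print("dropout (test): ", test_out)
```

The input 8.0 comes out of tanh at almost exactly 1, little different from what a much smaller input would give, whereas ReLU passes it through as 8.0. Many libraries today use an “inverted” dropout that rescales during training instead of at test time; the variant above is the train-then-scale form.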
Conclusion
Our results show that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning.
All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
The full paper details the architecture layer by layer, the data augmentation and dropout regularization, the split across two GPUs, and the competition results; it runs to nine pages and is available in full at the source below.
University of Toronto · 2012