Artificial Intelligence 2012

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever & Geoffrey Hinton

A deep network, trained on a mountain of images by GPUs, saw better than any program before it.

A deep neural network, fed over a million photos and trained on gaming graphics cards, learned to recognise objects far better than any program before — and kicked off the AI boom we're living in.

The idea, unpacked

For decades, getting a computer to tell a dog from a cat in a photo was painfully hard. Programmers hand-crafted rules about edges and shapes, and the results were mediocre. This paper took a different path: don't write the rules — let the machine learn them from examples, a huge pile of examples.

The team trained a “deep” neural network (a stack of simple layers loosely inspired by neurons in the brain) on over a million labelled photos. The first layers learn to spot tiny features like edges and patches of colour; deeper layers combine those into shapes, then parts, then whole objects. The clever engineering was using graphics cards (the chips built for video games) to do the enormous number of sums fast, plus two techniques to keep such a big network learning well: “non-saturating” neurons that speed up training, and “dropout”, which reduces overfitting. The result didn't just edge ahead; it blew past everything before it.

Where it came from

In 2012, two graduate students, Alex Krizhevsky and Ilya Sutskever, working with Geoffrey Hinton at the University of Toronto, entered the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a yearly contest to sort photos into a thousand categories. Their network, soon nicknamed AlexNet, didn't just win; its top-5 error rate of 15.3% was more than ten percentage points below the runner-up's 26.2%. Within a few years most of the field had set aside older methods and switched to deep learning. It's often called the “big bang” of the modern AI era.

Why it mattered

This was a public, large-scale demonstration that, for recognising images, learning from data beats hand-written rules, and that the approach keeps improving as you add more data and more computing power. Almost every AI you use today, from photo tagging and voice assistants to translation and the chatbots that came later, traces part of its lineage to the moment this network won.

How a network “sees”

A convolutional network looks at an image through a tiny sliding window. At each position it applies a little filter (a small grid of numbers) that lights up when the pixels under the window match a particular pattern, like a vertical edge or a patch of one colour. Slide that filter across the whole picture and you get a “feature map” showing where the pattern appears. Stack layers of these filters and the network builds from edges to shapes to objects.
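To make the sliding window concrete, here is a minimal sketch in plain Python. The toy image, the hand-picked vertical-edge filter, and the function name `slide_filter` are illustrative assumptions, not anything from the paper; in a real network the filter values are learned from data, and there are many filters per layer.

```python
def slide_filter(image, kernel):
    """Slide `kernel` over `image` (both lists of rows) and return the feature map.

    No padding, step size 1. This is the operation deep-learning libraries
    usually call "convolution" (strictly, cross-correlation).
    """
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    feature_map = []
    for top in range(img_h - k_h + 1):
        row = []
        for left in range(img_w - k_w + 1):
            # Multiply the window by the filter, element by element, and sum.
            response = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k_h)
                for j in range(k_w)
            )
            row.append(response)
        feature_map.append(row)
    return feature_map


# Toy 4x6 grayscale image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# Hand-picked filter that responds where brightness rises from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = slide_filter(image, vertical_edge)
for row in feature_map:
    print(row)
# prints [0, 3, 3, 0] twice: the filter lights up only where the edge is
```

The feature map is smaller than the image (2 by 4 here) because the 3 by 3 window can only sit in positions where it fits entirely inside the picture.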

Where you've met it

The descendants of AlexNet are everywhere you let a computer look at images: the face unlock on your phone, the search that finds “photos of dogs” in your camera roll, medical scanners that flag tumours, and the cameras in self-driving cars. The same broad recipe — a deep network, lots of data, lots of compute — also underlies the AI that understands speech and language.

The original document

A. Krizhevsky, I. Sutskever, G. E. Hinton · NeurIPS 25 (2012)

Abstract

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes.

The network and its result

On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
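As a reading aid (not part of the paper's text), the sketch below shows what a softmax output and a “top-k error rate” mean. The scores and labels are made-up toy values with 4 classes instead of 1000, and top-2 stands in for top-5; the function names are illustrative.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)
    # Subtracting the largest score avoids overflow without changing the result.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_error(score_rows, labels, k):
    """Fraction of images whose true label is NOT among the k highest scores."""
    misses = 0
    for scores, label in zip(score_rows, labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Hypothetical network outputs for 4 images over 4 classes.
score_rows = [
    [2.0, 1.0, 0.1, -1.0],  # true class 0: best guess is right
    [0.5, 2.5, 1.5, 0.0],   # true class 2: second guess is right
    [0.2, 0.1, 3.0, 2.0],   # true class 0: not in the top two guesses
    [1.0, 0.0, 0.5, 4.0],   # true class 3: best guess is right
]
labels = [0, 2, 0, 3]

print(top_k_error(score_rows, labels, k=1))  # 0.5
print(top_k_error(score_rows, labels, k=2))  # 0.25
print(sum(softmax(score_rows[0])))           # approximately 1.0
```

A top-5 error counts a prediction as correct if the true label appears anywhere among the network's five most confident guesses, which is why it is always at most the top-1 error.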

Making it train

To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective.
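As a reading aid (not part of the paper's text), the two ideas named above can be sketched in a few lines. The assumptions here: the “non-saturating neuron” is taken to be the rectified linear unit, max(0, x); dropout is shown with a drop probability of 0.5 and with outputs scaled down at test time. Many modern libraries rescale during training instead, and all names below are illustrative.

```python
import math
import random


def relu(x):
    """Rectified linear unit: passes positive inputs through, zeroes the rest."""
    return max(0.0, x)


def relu_slope(x):
    """Slope of relu: stays at 1 however large the input gets."""
    return 1.0 if x > 0 else 0.0


def tanh_slope(x):
    """Slope of tanh: shrinks towards 0 for large inputs (the unit 'saturates')."""
    return 1.0 - math.tanh(x) ** 2


def dropout(activations, drop_prob=0.5, training=True, rng=random):
    """Randomly silence units during training; scale outputs at test time."""
    if training:
        return [0.0 if rng.random() < drop_prob else a for a in activations]
    # At test time every unit is active, so scale down to match the average
    # activity level seen during training.
    return [a * (1.0 - drop_prob) for a in activations]


# A saturating unit learns slowly for large inputs because its slope vanishes.
print(tanh_slope(5.0))  # a very small number
print(relu_slope(5.0))  # 1.0

hidden = [0.8, 1.5, 0.3, 2.0]  # hypothetical hidden-layer outputs
print(dropout(hidden, rng=random.Random(0)))  # some entries zeroed at random
print(dropout(hidden, training=False))        # every entry halved
```

Because a different random subset of units is silenced on every training step, no unit can rely on a particular neighbour being present, which is the intuition for why dropout reduces overfitting.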

Conclusion

Our results show that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning.

All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.

The full paper details the architecture layer by layer, the data augmentation and dropout regularization, the split across two GPUs, and the competition results; it runs to nine pages and is available in full at the source below.

University of Toronto · 2012