Artificial Intelligence 1986

Learning representations by back-propagating errors

David E. Rumelhart, Geoffrey E. Hinton & Ronald J. Williams

Send the error backward through the layers — and the hidden units learn what to become.

In depth · the introduction

Stack simple artificial neurons in layers and you get something powerful but untrainable — until you let the network's mistakes flow backwards to fix it. That trick is back-propagation.

The idea, unpacked

A neural network is layers of simple units, each passing numbers to the next, with an adjustable "weight" on every connection. Show it an input and it produces an output. At first the output is wrong. The question that stumped researchers for years: with hidden units in the middle — units that have no "correct answer" of their own — how do you know which weights to blame for the mistake, and by how much?

Back-propagation answers it. Measure the error at the output, then pass that error backwards through the network, layer by layer, splitting the blame along the very connections the signal came forward on. Each weight learns how much it contributed to the mistake and nudges itself to do a little better. Repeat over many examples and the network gradually teaches itself — and the hidden units, on their own, become useful feature detectors that no human designed.
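
The blame-passing step can be made concrete with a small sketch. This is an illustrative toy, not the authors' code: it assumes one hidden layer, logistic units, and a half-squared-error measure, and every name and number in it (`forward`, `backward`, the example weights) is invented for the example.

```python
import math


def sigmoid(x):
    """Logistic function, the smooth nonlinearity applied by each unit."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, w_hidden, w_output):
    """Forward pass: inputs -> hidden activities -> output activities."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in w_hidden]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_output]
    return hidden, output


def error(inputs, target, w_hidden, w_output):
    """Half the summed squared difference between output and target."""
    _, output = forward(inputs, w_hidden, w_output)
    return 0.5 * sum((y - d) ** 2 for y, d in zip(output, target))


def backward(inputs, target, w_hidden, w_output):
    """Backward pass: return dE/dw for every weight in both layers."""
    hidden, output = forward(inputs, w_hidden, w_output)

    # Error signal at each output unit: (y - d) times the logistic slope y(1 - y).
    delta_out = [(y - d) * y * (1.0 - y) for y, d in zip(output, target)]

    # Blame passed back to each hidden unit along the same connections the
    # signal came forward on, then scaled by that unit's own slope.
    delta_hid = []
    for i, h in enumerate(hidden):
        blame = sum(delta_out[j] * w_output[j][i] for j in range(len(w_output)))
        delta_hid.append(blame * h * (1.0 - h))

    # Gradient for a weight = error signal at its receiving unit
    # times the activity of its sending unit.
    grad_output = [[d * h for h in hidden] for d in delta_out]
    grad_hidden = [[d * x for x in inputs] for d in delta_hid]
    return grad_hidden, grad_output


# Hypothetical example values, chosen only for illustration.
inputs = [1.0, 0.0, 1.0]  # the last entry acts as a constant bias input
target = [1.0]
w_hidden = [[0.5, -0.3, 0.1], [-0.4, 0.2, 0.3]]
w_output = [[0.7, -0.6]]

grad_hidden, grad_output = backward(inputs, target, w_hidden, w_output)
```

Subtracting a small multiple of each gradient from its weight is the "nudge" described above. Note that the hidden units never receive a target of their own; their error signal is assembled entirely from the output units' error.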

Where it came from

In 1969 Marvin Minsky and Seymour Papert proved that a single-layer perceptron (the device in Rosenblatt's 1958 paper, also in this library) could not even compute XOR, since no single linear threshold separates the inputs where XOR is true from those where it is false. The result was widely read as a death sentence for neural networks, and funding and interest drained away for over a decade. Adding hidden layers could lift the limitation in principle, but no one had a reliable way to train them.

In 1986 David Rumelhart, Geoffrey Hinton and Ronald Williams published a short paper in Nature that gave a clean, convincing way to do it. In fairness, the core mathematics had been found before, in control theory and by Paul Werbos for neural nets in 1974, and the authors said so. But their crisp demonstration, in the right place at the right time, is what finally convinced the field, and the modern era of neural networks began.

Why it mattered

Back-propagation made deep, multilayer networks trainable in practice, lifting the ceiling Minsky and Papert had pointed to. Just as important, it let networks invent their own features instead of relying on humans to hand-craft them: the network figures out what to look for. That same algorithm, scaled up with far more data and computing power, underlies today's image recognition, speech recognition, and language models.

An everyday picture

Imagine a factory that ships a faulty product and then traces the fault back along the assembly line: this station added a little to the error, that one a lot, and each adjusts for next time. Back-propagation is that blame-assignment done with calculus. The error at the end is shared backwards among all the connections that helped produce it, each in proportion to its contribution, so every one of them learns which way to adjust and by how much.

What came before, what came after

It sits at the hinge of the neural-network story. Before it: McCulloch and Pitts' logical neuron (1943), Rosenblatt's perceptron (1958), and the Minsky–Papert critique that froze the field. After it: the deep-learning era — AlexNet (2012), the Transformer (2017) and AlphaFold (2021), all in this library, are trained by back-propagation to this day. The algorithm barely changed; the world around it did.

The original document

Original source text

D. E. Rumelhart, G. E. Hinton & R. J. Williams · Nature 323, 533–536 · 9 October 1986

Abstract

We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector.

As a result of the weight adjustments, internal 'hidden' units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.

The learning procedure

The body of the letter defines a layered network of units whose output is a smooth (logistic) function of their total input, sets the total error E as one-half the summed squared difference between actual and desired outputs, and derives how to compute ∂E/∂w for every weight by propagating the error backward from the output units through the network.
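
Written out, the chain of definitions and derivatives summarized above looks roughly as follows. This is a reconstruction from the summary, not a transcription: the symbols are assumptions chosen for the sketch (x_j for a unit's total input, y_j for its output, w_ji for the weight from unit i to unit j, d_j for the desired output, c indexing training cases) and may differ from the letter's own notation.

```latex
\begin{align}
x_j &= \sum_i y_i \, w_{ji}
  && \text{total input to unit } j \\
y_j &= \frac{1}{1 + e^{-x_j}}
  && \text{logistic output} \\
E &= \tfrac{1}{2} \sum_c \sum_j \left( y_{j,c} - d_{j,c} \right)^2
  && \text{total error} \\
\frac{\partial E}{\partial y_j} &= y_j - d_j
  && \text{output units, one case} \\
\frac{\partial E}{\partial x_j} &= \frac{\partial E}{\partial y_j} \, y_j \, (1 - y_j)
  && \text{through the logistic} \\
\frac{\partial E}{\partial w_{ji}} &= \frac{\partial E}{\partial x_j} \, y_i
  && \text{gradient for one weight} \\
\frac{\partial E}{\partial y_i} &= \sum_j \frac{\partial E}{\partial x_j} \, w_{ji}
  && \text{error passed back one layer}
\end{align}
```

The last line is the step that gives the procedure its name: once the derivative with respect to each unit's input is known in one layer, the derivative with respect to each output in the layer below follows, and the same three steps repeat down the network.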

It then gives the gradient-descent weight change, notes that a pure steepest-descent step can be accelerated by adding a fraction of the previous weight change (a momentum term), and remarks that the procedure can become trapped in local minima, though in their experience this was rarely a serious problem.
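
The momentum idea fits in a few lines. The sketch below is illustrative only: the function names, the step size `epsilon`, the momentum fraction `alpha`, and the one-dimensional toy error surface are all assumptions made for the example, not values taken from the letter.

```python
def momentum_step(weight, gradient, prev_change, epsilon=0.1, alpha=0.9):
    """One update: a steepest-descent step plus a fraction of the previous change."""
    change = -epsilon * gradient + alpha * prev_change
    return weight + change, change


def minimise(grad_fn, weight, steps, epsilon, alpha):
    """Apply momentum_step repeatedly, starting with no previous change."""
    change = 0.0
    for _ in range(steps):
        weight, change = momentum_step(weight, grad_fn(weight), change, epsilon, alpha)
    return weight


def toy_gradient(w):
    """Gradient of the toy error surface E(w) = 0.5 * (w - 3)^2, minimised at w = 3."""
    return w - 3.0


# alpha = 0 reduces to plain steepest descent; alpha = 0.5 adds momentum.
plain = minimise(toy_gradient, 0.0, steps=20, epsilon=0.1, alpha=0.0)
with_momentum = minimise(toy_gradient, 0.0, steps=20, epsilon=0.1, alpha=0.5)
```

On this toy surface the momentum run ends closer to the minimum after the same number of steps. That is a property of this particular example, not a general guarantee: too large a momentum fraction can overshoot and oscillate.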

Demonstrations

Two worked examples follow. In the first, hidden units learn to detect mirror symmetry in an input string. In the second, a network learns the relationships in two family trees, and its hidden units spontaneously come to encode meaningful features such as nationality and generation. Both are offered as evidence that the procedure constructs useful internal representations rather than merely fitting outputs.
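
To see why a couple of hidden units can be enough for the symmetry task, here is a hand-wired sketch. Nothing in it is learned, and the numbers are hand-picked assumptions rather than the weights shown in the letter's figure. It assumes six binary inputs, two logistic hidden units, and one output unit, with each hidden unit's weights mirrored in size and opposite in sign about the midpoint.

```python
import math
from itertools import product


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


# Hand-set weights (illustrative assumptions, not learned values).
# Mirrored positions carry equal and opposite weights in ratio 1:2:4, so a
# symmetric pattern sums to exactly zero and any other pattern does not.
HIDDEN_WEIGHTS = [
    [10.0, 20.0, 40.0, -40.0, -20.0, -10.0],
    [-10.0, -20.0, -40.0, 40.0, 20.0, 10.0],
]
HIDDEN_BIAS = -5.0               # keeps both hidden units off when the sum is zero
OUTPUT_WEIGHTS = [-10.0, -10.0]  # either hidden unit switching on vetoes the output
OUTPUT_BIAS = 5.0                # the output unit is on by default


def network_output(pattern):
    """Output activity for a 6-element pattern of 0s and 1s."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, pattern)) + HIDDEN_BIAS)
        for row in HIDDEN_WEIGHTS
    ]
    total = sum(w * h for w, h in zip(OUTPUT_WEIGHTS, hidden)) + OUTPUT_BIAS
    return sigmoid(total)


def is_mirror_symmetric(pattern):
    """Ground truth: the pattern reads the same in both directions."""
    return list(pattern) == list(reversed(pattern))


# Check the hand-wired network against the ground truth on all 64 patterns.
patterns = list(product([0, 1], repeat=6))
mismatches = [
    p for p in patterns
    if (network_output(p) > 0.5) != is_mirror_symmetric(p)
]
```

The two hidden units are mirror images of each other, so for any asymmetric pattern one of them receives a positive sum and switches on. The point of the demonstration in the letter is that back-propagation can arrive at an internal code of this kind without anyone wiring it in.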

[ … ]

The complete four-page letter, with its equations, its figures of the learned hidden-unit weights, and its discussion of the relationship to the brain, is available in full at the source below.

Nature · 9 October 1986