Learning representations by back-propagating errors
Send the error backward through the layers — and the hidden units learn what to become.
Stack simple artificial neurons in layers and you get something powerful but untrainable — until you let the network's mistakes flow backwards to fix it. That trick is back-propagation.
The idea, unpacked
A neural network is layers of simple units, each passing numbers to the next, with an adjustable "weight" on every connection. Show it an input and it produces an output. At first the output is wrong. The question that stumped researchers for years: with hidden units in the middle — units that have no "correct answer" of their own — how do you know which weights to blame for the mistake, and by how much?
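To make "passing numbers forward" concrete, here is a minimal sketch in Python. The network shape (two inputs, two hidden units, one output), the weight values and the sigmoid squashing function are all assumptions chosen for illustration; they are not taken from any particular network.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One layer: each unit takes a weighted sum of its inputs, then squashes it."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]

# Hypothetical weights: one row of incoming weights per unit.
hidden_weights = [[0.5, -0.4], [0.3, 0.8]]
hidden_biases = [0.1, -0.2]
output_weights = [[1.2, -0.7]]
output_biases = [0.05]

def forward(inputs):
    """Run the input through the hidden layer, then the output layer."""
    hidden = layer(inputs, hidden_weights, hidden_biases)
    output = layer(hidden, output_weights, output_biases)
    return hidden, output

hidden, output = forward([1.0, 0.0])
print("hidden activations:", hidden)
print("network output:", output)
```

Nothing in this sketch says what the hidden activations should be. Only the final output can be compared with a desired answer, which is exactly the difficulty described above.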
Back-propagation answers it. Measure the error at the output, then pass that error backwards through the network, layer by layer, splitting the blame along the very connections the signal came forward on. Each weight learns how much it contributed to the mistake and nudges itself to do a little better. Repeat over many examples and the network gradually teaches itself — and the hidden units, on their own, become useful feature detectors that no human designed.
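The backward pass described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: it assumes sigmoid units, a squared-error measure, three hidden units, a learning rate of 0.5, batch updates and the XOR task. Every function name here is made up for the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(params, x):
    """Return the hidden activations and the single output for input x."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hj for w, hj in zip(W2, h)) + b2)
    return h, y

def loss(params, data):
    """Half the summed squared error over all examples."""
    return sum(0.5 * (forward(params, x)[1] - t) ** 2 for x, t in data)

def backprop(params, x, t):
    """Gradient of one example's error with respect to every weight and bias."""
    W1, b1, W2, b2 = params
    h, y = forward(params, x)
    # Error signal at the output: (y - t) times the sigmoid slope y * (1 - y).
    delta_out = (y - t) * y * (1.0 - y)
    # Each hidden unit receives blame in proportion to its outgoing weight.
    delta_hid = [delta_out * w * hj * (1.0 - hj) for w, hj in zip(W2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hid]
    grad_W2 = [delta_out * hj for hj in h]
    return grad_W1, delta_hid, grad_W2, delta_out

def train_step(params, data, lr):
    """One batch step: sum the gradients over all examples, then nudge the weights."""
    W1, b1, W2, b2 = params
    gW1 = [[0.0] * len(row) for row in W1]
    gb1 = [0.0] * len(b1)
    gW2 = [0.0] * len(W2)
    gb2 = 0.0
    for x, t in data:
        dW1, db1, dW2, db2 = backprop(params, x, t)
        gW1 = [[a + b for a, b in zip(r, dr)] for r, dr in zip(gW1, dW1)]
        gb1 = [a + b for a, b in zip(gb1, db1)]
        gW2 = [a + b for a, b in zip(gW2, dW2)]
        gb2 += db2
    return (
        [[w - lr * g for w, g in zip(r, gr)] for r, gr in zip(W1, gW1)],
        [b - lr * g for b, g in zip(b1, gb1)],
        [w - lr * g for w, g in zip(W2, gW2)],
        b2 - lr * gb2,
    )

random.seed(0)
n_hidden = 3  # assumed size, chosen for illustration
params = (
    [[random.uniform(-1, 1) for _ in range(2)] for _ in range(n_hidden)],
    [0.0] * n_hidden,
    [random.uniform(-1, 1) for _ in range(n_hidden)],
    0.0,
)
xor_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

initial_loss = loss(params, xor_data)
for _ in range(5000):
    params = train_step(params, xor_data, lr=0.5)
final_loss = loss(params, xor_data)
print(f"loss before training: {initial_loss:.4f}")
print(f"loss after training:  {final_loss:.4f}")
```

The line that computes `delta_hid` is the heart of the method: the output error travels back along `W2`, the same weights the signal used on the way forward. How far the loss falls depends on the random starting weights, so a run with a different seed may settle on a different result.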
Where it came from
In 1969 Marvin Minsky and Seymour Papert proved that a single-layer perceptron — the device in Rosenblatt's 1958 paper, also in this library — could not even compute XOR. The result was widely read as a death sentence for neural networks, and funding and interest drained away for over a decade. Adding hidden layers could lift the limitation in principle, but no one had a reliable way to train them.
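The XOR limitation is easy to see by experiment. The sketch below is an illustration, not Minsky and Papert's proof: it trains a single threshold unit with the classic perceptron rule, first on AND (which a straight line can separate) and then on XOR (which it cannot). The function name, the epoch limit and the zero starting weights are assumptions made for the example.

```python
def train_perceptron(data, epochs=100):
    """Train one threshold unit; return its weights, bias and last-epoch error count."""
    w = [0.0, 0.0]
    b = 0.0
    errors = 0
    for _ in range(epochs):
        errors = 0
        for x, target in data:
            out = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            if out != target:
                errors += 1
                step = target - out  # +1 or -1
                w = [w[0] + step * x[0], w[1] + step * x[1]]
                b += step
        if errors == 0:
            break  # every example classified correctly
    return w, b, errors

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_errors = train_perceptron(and_data)
_, _, xor_errors = train_perceptron(xor_data)
print("AND errors in final epoch:", and_errors)
print("XOR errors in final epoch:", xor_errors)
```

On AND the unit settles on weights that classify all four cases. On XOR it keeps making mistakes however long it trains, because no single line separates the two classes.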
In 1986 David Rumelhart, Geoffrey Hinton and Ronald Williams published a short paper in Nature that gave a clean, convincing way to do it. The core mathematics had been found before, in control theory and in Paul Werbos's 1974 doctoral thesis, and the authors said so. But their crisp demonstration, in the right place at the right time, is what finally convinced the field, and the modern era of neural networks began.
Why it mattered
Back-propagation made multilayer networks trainable in practice, lifting the ceiling Minsky and Papert had pointed to. Just as important, it let networks invent their own features instead of relying on humans to hand-craft them: the network figures out what to look for. The same algorithm, scaled up with far more data and computing power, underlies today's image recognition, speech recognition, and language models.
An everyday picture
Imagine a factory ships a faulty product and traces the fault back along the assembly line: this station added a little to the error, that one a lot, and each adjusts for next time. Back-propagation is that blame-assignment done with calculus. The error at the end is shared backwards among all the connections that helped produce it, each in proportion to its contribution, so every one of them learns which way to adjust.
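In symbols, the blame-sharing is the chain rule applied layer by layer. The sketch below assumes sigmoid units and a squared-error measure. The notation is chosen for illustration: $y_j$ is the output of unit $j$, $x_j$ its total input, $w_{ji}$ the weight from unit $i$ to unit $j$, $d_j$ the desired output, and $\varepsilon$ a small learning rate.

```latex
\begin{align}
x_j &= \sum_i y_i \, w_{ji}, \qquad y_j = \frac{1}{1 + e^{-x_j}}
  && \text{(forward pass)} \\
E &= \tfrac{1}{2} \sum_j (y_j - d_j)^2
  && \text{(error at the output)} \\
\frac{\partial E}{\partial y_j} &= y_j - d_j
  && \text{(output units only)} \\
\frac{\partial E}{\partial x_j} &= \frac{\partial E}{\partial y_j} \, y_j (1 - y_j)
  && \text{(through the sigmoid)} \\
\frac{\partial E}{\partial w_{ji}} &= \frac{\partial E}{\partial x_j} \, y_i
  && \text{(blame for one weight)} \\
\frac{\partial E}{\partial y_i} &= \sum_j \frac{\partial E}{\partial x_j} \, w_{ji}
  && \text{(blame passed back one layer)} \\
\Delta w_{ji} &= -\varepsilon \, \frac{\partial E}{\partial w_{ji}}
  && \text{(the nudge)}
\end{align}
```

The sixth line is the factory's trace-back: a unit's share of the error is the sum of the shares of the units it feeds, each weighted by the connection between them.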
What came before, what came after
The paper sits at the hinge of the neural-network story. Before it: McCulloch and Pitts' logical neuron (1943), Rosenblatt's perceptron (1958), and the Minsky–Papert critique that froze the field. After it: the deep-learning era. AlexNet (2012), the Transformer (2017) and AlphaFold (2021), all in this library, are trained by back-propagation. The core algorithm barely changed; the world around it did.
We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector.
As a result of the weight adjustments, internal 'hidden' units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.