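Artificial Intelligence · 1986

Learning representations by back-propagating errors

David Rumelhart, Geoffrey Hinton, Ronald Williams

Send the error back through the layers, and the hidden units learn for themselves what they should become.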

In depth · the introduction

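Stack simple artificial neurons in layers and you get something powerful but untrainable, unless you let the network's mistakes flow backward to correct it. That trick is back-propagation.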

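Breaking the idea down

A neural network is a stack of layers of simple units. Each layer passes a list of numbers to the next, and every connection carries an adjustable "weight". Give it an input and it produces an output. At first, the output is wrong. The question that stumped researchers for years was this: the hidden units in the middle have no "correct answer" of their own, so how do you know which weights to blame, and by how much?

Back-propagation supplies the answer. Measure the error at the output, then pass that error backward, layer by layer, along the same connections the signal travelled on its way forward, apportioning the blame as it goes. Each weight learns how much it contributed to the mistake and nudges itself in the direction that would have done a little better. Repeat this over many examples and the network gradually teaches itself, while the hidden units turn, of their own accord, into useful feature detectors that nobody had to design.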
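The blame-sharing described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the network shape (2 inputs, 3 hidden units, 1 output), the learning rate, the epoch count, the random seed and every function name are assumptions chosen for the example. It uses logistic units and a squared-error measure, and trains on XOR with full-batch gradient descent.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(params, x):
    """Return (hidden activations, output) for one input vector x."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hj for w, hj in zip(W2, h)) + b2)
    return h, y


def loss(params, data):
    """One-half the summed squared difference between output and target."""
    return 0.5 * sum((forward(params, x)[1] - t) ** 2 for x, t in data)


def backprop(params, data):
    """Gradient of the total error with respect to every weight and bias."""
    W1, b1, W2, b2 = params
    gW1 = [[0.0] * len(row) for row in W1]
    gb1 = [0.0] * len(b1)
    gW2 = [0.0] * len(W2)
    gb2 = 0.0
    for x, t in data:
        h, y = forward(params, x)
        # Error signal at the output unit: dE/d(total input).
        delta_out = (y - t) * y * (1.0 - y)
        gb2 += delta_out
        for j, hj in enumerate(h):
            gW2[j] += delta_out * hj
            # Pass the error back along the connection the signal came through.
            delta_h = delta_out * W2[j] * hj * (1.0 - hj)
            gb1[j] += delta_h
            for i, xi in enumerate(x):
                gW1[j][i] += delta_h * xi
    return gW1, gb1, gW2, gb2


def step(params, grads, lr):
    """Nudge every weight a little way down its own gradient."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    return (
        [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)],
        [b - lr * g for b, g in zip(b1, gb1)],
        [w - lr * g for w, g in zip(W2, gW2)],
        b2 - lr * gb2,
    )


def init_params(n_in, n_hidden, rng):
    """Small random starting weights (the range is an assumption)."""
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    W2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = rng.uniform(-1, 1)
    return W1, b1, W2, b2


XOR = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
       ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

if __name__ == "__main__":
    params = init_params(2, 3, random.Random(0))
    for epoch in range(5000):
        params = step(params, backprop(params, XOR), lr=0.5)
    for x, t in XOR:
        print(x, t, round(forward(params, x)[1], 3))
```

How well a run fits XOR depends on the starting weights and the step size, so the printed outputs are worth inspecting rather than assuming. A useful sanity check on any such implementation is to compare `backprop` against a finite-difference estimate of the same gradient.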

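Where it came from

In 1969, Marvin Minsky and Seymour Papert proved that a single-layer perceptron, the device from Rosenblatt's 1958 paper (also in this collection), cannot compute even XOR. The result was widely read as a death sentence for neural networks, and funding and interest drained away for more than a decade. Adding hidden layers removes the limitation in principle, but nobody had a reliable way to train them.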
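The XOR limitation is easy to see by experiment. The sketch below is an illustration only, not a proof: it searches a coarse grid of weights for a single threshold unit that reproduces a given truth table. The grid, the unit's firing rule (output 1 when the weighted sum exceeds zero) and all names are assumptions made for the example. It finds weights for AND but none for XOR, which matches the general result that no single linear threshold can separate XOR's cases.

```python
from itertools import product


def threshold_unit(w1, w2, b, x1, x2):
    """A single-layer unit: fire (1) if the weighted sum exceeds zero."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0


def find_weights(truth_table, grid):
    """Return the first (w1, w2, b) on the grid matching the table, else None."""
    for w1, w2, b in product(grid, repeat=3):
        if all(threshold_unit(w1, w2, b, x1, x2) == target
               for (x1, x2), target in truth_table.items()):
            return (w1, w2, b)
    return None


# Candidate weights from -4.0 to 4.0 in steps of 0.5 (an arbitrary grid).
GRID = [k / 2 for k in range(-8, 9)]

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

if __name__ == "__main__":
    print("AND:", find_weights(AND, GRID))
    print("XOR:", find_weights(XOR, GRID))
```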

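In 1986, David Rumelhart, Geoffrey Hinton and Ronald Williams published a short letter in Nature that set out a clear and convincing method. In fairness, the core mathematics had been found before, in control theory and in Paul Werbos's 1974 application of it to neural networks, and the authors said so. But it was their clean demonstration, published in the right place at the right time, that finally persuaded the field. The modern era of neural networks began here.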

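Why it matters

Back-propagation made deep, multi-layer networks trainable for the first time, raising the ceiling that Minsky and Papert had pointed out. Just as important, it lets a network invent its own features rather than rely on hand-crafted ones: the network works out for itself what to look at. This one algorithm, scaled up with far more data and computing power, eventually produced today's image recognition, speech recognition and language models.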

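An everyday picture

Imagine a factory that turns out a defective product and traces the blame back along the assembly line: this station added a little to the error, that one added a lot, and each station adjusts accordingly for next time. Back-propagation is this kind of blame assignment done with calculus. The error at the end is shared out fairly among all the connections that helped produce it, so each one learns which way to adjust.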

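What came before, and what came after

It sits at the hinge of the neural-network story. Before it: the logical neuron of McCulloch and Pitts (1943), Rosenblatt's perceptron (1958), and the Minsky-Papert critique that froze the field. After it: the deep-learning era of AlexNet (2012), the Transformer (2017) and AlphaFold (2021), all in this collection and all still trained by back-propagation. The algorithm has barely changed; what changed is the whole world around it.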

The original document

Original source text

D. E. Rumelhart, G. E. Hinton & R. J. Williams · Nature 323, 533–536 · 9 October 1986

Abstract

We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector.

As a result of the weight adjustments, internal 'hidden' units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.

The learning procedure

The body of the letter defines a layered network of units whose output is a smooth (logistic) function of their total input, sets the total error E as one-half the summed squared difference between actual and desired outputs, and derives how to compute ∂E/∂w for every weight by propagating the error backward from the output units through the network.
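The quantities named in that summary can be written out as a short chain-rule derivation. The block below is a reconstruction under stated assumptions rather than a transcription of the letter: y_j is the output of unit j, x_j its total input, w_{ji} the weight on the connection from unit i to unit j, d_j the desired output, and c an index over training cases. Bias terms are omitted for brevity.

```latex
\begin{align}
x_j &= \sum_i y_i\, w_{ji} && \text{total input to unit } j \\
y_j &= \frac{1}{1 + e^{-x_j}} && \text{logistic output} \\
E &= \tfrac{1}{2} \sum_c \sum_j \left(y_{j,c} - d_{j,c}\right)^2 && \text{total error over cases } c \\
\frac{\partial E}{\partial y_j} &= y_j - d_j && \text{at an output unit, for one case} \\
\frac{\partial E}{\partial x_j} &= \frac{\partial E}{\partial y_j}\, y_j (1 - y_j) \\
\frac{\partial E}{\partial w_{ji}} &= \frac{\partial E}{\partial x_j}\, y_i \\
\frac{\partial E}{\partial y_i} &= \sum_j \frac{\partial E}{\partial x_j}\, w_{ji} && \text{error passed back to the layer below}
\end{align}
```

The last line is the backward pass in miniature: once the error derivative is known for every unit in one layer, it yields the derivative for each unit in the layer beneath, and the same three steps repeat until the input layer is reached.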

It then gives the gradient-descent weight change, notes that a pure steepest-descent step can be accelerated by adding a fraction of the previous weight change (a momentum term), and remarks that the procedure can become trapped in local minima, though in their experience this was rarely a serious problem.
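The effect of the momentum term can be seen on a toy problem. The sketch below is an assumption-laden illustration, not a result from the letter: it minimises a one-dimensional quadratic error with very low curvature, using the update "new change = -(learning rate) × gradient + (momentum) × previous change". The error surface, the starting point, the learning rate of 1.0 and the momentum of 0.9 are all chosen for the example.

```python
def descend(grad, w0, lr, momentum, steps):
    """Gradient descent where each step adds a fraction of the previous step."""
    w, dw = w0, 0.0
    path = [w]
    for _ in range(steps):
        dw = -lr * grad(w) + momentum * dw
        w += dw
        path.append(w)
    return path


def grad_shallow_bowl(w):
    """Gradient of the assumed error E(w) = 0.5 * 0.01 * w**2."""
    return 0.01 * w


plain = descend(grad_shallow_bowl, w0=10.0, lr=1.0, momentum=0.0, steps=100)
with_momentum = descend(grad_shallow_bowl, w0=10.0, lr=1.0, momentum=0.9, steps=100)

if __name__ == "__main__":
    print("distance from minimum, plain descent:", abs(plain[-1]))
    print("distance from minimum, with momentum:", abs(with_momentum[-1]))
```

On this shallow surface the plain steps shrink the weight by only one percent each time, while the accumulated velocity carries the momentum run much closer to the minimum in the same number of steps. On steeper or more irregular surfaces the same momentum value can overshoot, so it is normally tuned together with the learning rate.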

Demonstrations

Worked examples show hidden units learning to detect mirror symmetry in an input string, and a network of relationships in two family trees in which the hidden units spontaneously come to encode meaningful features such as nationality and generation — evidence that the procedure constructs useful internal representations rather than merely fitting outputs.

[ … ]

The complete four-page letter, with its equations, its figures of the learned hidden-unit weights, and its discussion of the relationship to the brain, is available in full at the source below.

Nature · 9 October 1986