Artificial Intelligence 1986

Learning representations by back-propagating errors

David Rumelhart, Geoffrey Hinton, Ronald Williams

Send the error back through the layers, and the hidden units learn for themselves what they should become.


In depth · the introduction

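Stack simple artificial neurons in layers and you get something powerful but untrainable, unless you let the network's mistakes flow backward to correct it. That trick is backpropagation.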

Unpacking the idea

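A neural network is a stack of layers of simple units. Each layer passes a list of numbers to the next, and every connection carries an adjustable "weight". Give it an input and it produces an output. At first, the output is wrong. The question that stumped researchers for years was this: the hidden units in the middle have no "correct answer" of their own, so how do you know which weights to blame, and by how much?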

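Backpropagation supplies the answer. Measure the error at the output, then pass it back layer by layer along the same connections the signal travelled on its way forward, apportioning the "blame" as it goes. Each weight learns how much it contributed to the mistake and nudges itself in the direction that would have done a little better. Repeat this over many examples and the network gradually teaches itself. The hidden units, meanwhile, spontaneously turn into useful feature detectors that no one had to design.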
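The forward-then-backward bookkeeping described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 2-2-1 network shape, the starting weights, the single training case and the learning rate are all hypothetical, and the logistic units and half-squared error follow the summary of the letter given later on this page.

```python
import math

def sigmoid(x):
    # Smooth logistic squashing function used by every unit.
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_hidden, w_out, inputs):
    # Forward pass: inputs -> hidden units -> one output unit.
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output

def error(w_hidden, w_out, inputs, target):
    # Half the squared difference between actual and desired output.
    _, output = forward(w_hidden, w_out, inputs)
    return 0.5 * (output - target) ** 2

def backward(w_hidden, w_out, inputs, target):
    # Backward pass: share the output error among all the weights.
    hidden, output = forward(w_hidden, w_out, inputs)
    # Blame at the output unit: error times the slope of the logistic.
    delta_out = (output - target) * output * (1.0 - output)
    grad_out = [delta_out * h for h in hidden]
    # Blame reaches each hidden unit along the connection it fed forward.
    grad_hidden = []
    for j, h in enumerate(hidden):
        delta_h = delta_out * w_out[j] * h * (1.0 - h)
        grad_hidden.append([delta_h * x for x in inputs])
    return grad_hidden, grad_out

# Hypothetical starting weights and a single training case.
w_hidden = [[0.5, -0.4], [0.3, 0.8]]
w_out = [0.7, -0.6]
inputs, target = [1.0, 0.0], 1.0
learning_rate = 0.5  # assumed step size

before = error(w_hidden, w_out, inputs, target)
grad_hidden, grad_out = backward(w_hidden, w_out, inputs, target)

# Nudge every weight a little way down its own share of the error.
new_w_hidden = [[w - learning_rate * g for w, g in zip(row, grad_row)]
                for row, grad_row in zip(w_hidden, grad_hidden)]
new_w_out = [w - learning_rate * g for w, g in zip(w_out, grad_out)]
after = error(new_w_hidden, new_w_out, inputs, target)

print("error before:", before)
print("error after: ", after)
```

One nudge lowers the error only slightly. Training consists of repeating the same two passes over many examples.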

Where it came from

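In 1969, Marvin Minsky and Seymour Papert proved that a single-layer perceptron, the device from Rosenblatt's 1958 paper (also in this collection), cannot even compute XOR. The result was widely read as a death sentence for neural networks, and funding and interest drained away for more than a decade. Adding hidden layers could lift the limitation in principle, but no one had a reliable way to train them.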

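In 1986, David Rumelhart, Geoffrey Hinton and Ronald Williams published a short letter in Nature that set out a clear and convincing method. In fairness, the core mathematics had been found before, in control theory and in Paul Werbos's 1974 application of it to neural networks, and the authors said as much. But it was their crisp demonstration, published in the right place at the right time, that finally persuaded the field, and the era of modern neural networks began.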

Why it matters

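Backpropagation made deep, multi-layer networks trainable for the first time, raising the ceiling that Minsky and Papert had pointed out. Just as important, it lets the network invent its own features rather than rely on hand-crafted ones: the network works out for itself what to look at. This one algorithm, scaled up with far more data and computing power, eventually produced today's image recognition, speech recognition and language models.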

An everyday picture

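Imagine a factory that turns out a defective product and traces responsibility back along the production line: this station added a little to the error, that one added a lot, and each station adjusts accordingly for next time. Backpropagation is this kind of blame assignment done with calculus. The error at the end is shared back among all the connections that helped produce it, each in proportion to its contribution, so every connection learns which way to adjust.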

What came before, and what came after

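It sits at the hinge of the neural-network story. Before it: the logical neurons of McCulloch and Pitts (1943), Rosenblatt's perceptron (1958), and the Minsky-Papert critique that froze the field. After it: the deep-learning era of AlexNet (2012), the Transformer (2017) and AlphaFold (2021), all in this collection and all still trained by backpropagation. The algorithm has barely changed; what changed is the whole world around it.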

The original document

Original source text

D. E. Rumelhart, G. E. Hinton & R. J. Williams · Nature 323, 533–536 · 9 October 1986

Abstract

We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector.

As a result of the weight adjustments, internal 'hidden' units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.

The learning procedure

The body of the letter defines a layered network of units whose output is a smooth (logistic) function of their total input, sets the total error E as one-half the summed squared difference between actual and desired outputs, and derives how to compute ∂E/∂w for every weight by propagating the error backward from the output units through the network.
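The derivation summarised above can be written out as a short chain of equations. The symbols are assumptions chosen for this sketch, since the summary gives none: $y_i$ is the output of unit $i$, $w_{ji}$ the weight on the connection from unit $i$ to unit $j$, $x_j$ the total input to unit $j$, $d_j$ the desired output, and $c$ an index over training cases (dropped in the derivative lines, which apply to one case at a time).

```latex
\begin{align}
x_j &= \sum_i y_i \, w_{ji}
  && \text{total input to unit } j \\
y_j &= \frac{1}{1 + e^{-x_j}}
  && \text{logistic output} \\
E &= \tfrac{1}{2} \sum_c \sum_j \left( y_{j,c} - d_{j,c} \right)^2
  && \text{total error} \\
\frac{\partial E}{\partial y_j} &= y_j - d_j
  && \text{at an output unit} \\
\frac{\partial E}{\partial x_j} &= \frac{\partial E}{\partial y_j} \, y_j \left( 1 - y_j \right)
  && \text{through the logistic} \\
\frac{\partial E}{\partial w_{ji}} &= \frac{\partial E}{\partial x_j} \, y_i
  && \text{gradient for one weight} \\
\frac{\partial E}{\partial y_i} &= \sum_j \frac{\partial E}{\partial x_j} \, w_{ji}
  && \text{error passed back one layer}
\end{align}
```

The last line is the backward step. Once $\partial E / \partial y_i$ is known for a layer, the same three derivative rules can be applied again to the layer beneath it.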

It then gives the gradient-descent weight change, notes that a pure steepest-descent step can be accelerated by adding a fraction of the previous weight change (a momentum term), and remarks that the procedure can become trapped in local minima, though in their experience this was rarely a serious problem.
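The momentum rule described above can be sketched as follows. The function name, the names `epsilon` (step size) and `alpha` (momentum fraction), and the one-dimensional toy error surface E(w) = 0.5 * w**2 are assumptions made for illustration and are not taken from the summary.

```python
def momentum_step(weight, gradient, previous_change, epsilon, alpha):
    # New change = steepest-descent step + a fraction of the previous change.
    change = -epsilon * gradient + alpha * previous_change
    return weight + change, change

def minimise(alpha, steps=50, epsilon=0.01):
    # Hypothetical toy surface E(w) = 0.5 * w**2, whose gradient is simply w.
    weight, change = 5.0, 0.0
    for _ in range(steps):
        gradient = weight
        weight, change = momentum_step(weight, gradient, change, epsilon, alpha)
    return weight

plain = minimise(alpha=0.0)          # pure steepest descent
with_momentum = minimise(alpha=0.9)  # same step size plus momentum

print("distance from minimum, plain:   ", abs(plain))
print("distance from minimum, momentum:", abs(with_momentum))
```

On this toy surface and with these particular settings, the run with momentum ends much closer to the minimum after the same number of steps. Other surfaces or settings can behave differently, including overshooting.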

Demonstrations

Worked examples show hidden units learning to detect mirror symmetry in an input string, and a network trained on the relationships in two family trees whose hidden units spontaneously come to encode meaningful features such as nationality and generation. This is evidence that the procedure constructs useful internal representations rather than merely fitting outputs.

[ … ]

The complete four-page letter, with its equations, its figures of the learned hidden-unit weights, and its discussion of the relationship to the brain, is available in full at the source below.

Nature · 9 October 1986