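Artificial Intelligence 1997

Long Short-Term Memory

Sepp Hochreiter and Jürgen Schmidhuber

A gated memory cell that stores a value unchanged: at last, networks can remember across long gaps.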


In depth · the introduction

For decades, neural networks had no way to remember anything for long. This 1997 paper gave them a memory cell, and at last they could remember.

Unpacking the idea

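A network that reads a sequence (words, sounds, sensor readings) passes a small summary of what it has seen so far from each step to the next. The trouble is that this summary is blurred and faded at every step. By the time something truly useful arrives, the network has forgotten the clue from a hundred steps earlier that would have explained it.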

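Long short-term memory remedies this with a memory cell: a small box that can hold a value steadily, indefinitely and without fading. Two switches guard the box: an input gate decides when to write something new into it, and an output gate decides when to read its contents out. In between, the value simply sits there, preserved intact. That is the whole trick: not a better memory, but a memory you can lock.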

Where it came from

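The seed was a discouraging discovery. In 1991 Sepp Hochreiter, then a student of Jürgen Schmidhuber in Munich, worked out mathematically why recurrent networks fail to learn long-range patterns: the learning signal shrinks exponentially as it is passed back through time, until it is too faint to teach anything. It was a precise diagnosis of why the whole approach kept failing.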

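Rather than abandon recurrent networks, the two designed an antidote. Their answer, a constant error carousel wrapped in gates, was published in Neural Computation in 1997. For years it was an underrated paper in a field that soon turned elsewhere; only much later, when sequence problems became central to AI, did it become one of the most cited papers in the field.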

Why it matters

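Because so much of the world is sequence to begin with. Speech is a sequence of sounds; a sentence is a sequence of words; a heartbeat trace, a stock price and a melody are all sequences in which what matters now may depend on something that happened long ago. LSTM was the first design that could reliably learn these long-range dependencies, and for nearly two decades it was the engine behind speech recognizers, translators and handwriting recognizers. The phone that took your dictation in the mid-2010s was very likely running an LSTM.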

An everyday picture

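Think of the memory cell as a small safe with two doors. The input gate opens only when you have something worth keeping: you put the note in and close it again. While both gates are shut, the note inside does not fade, smudge or drift; the safe simply holds it. Later, when you actually need the note, you open the output gate and read it out. An ordinary network is like writing the note on the back of your hand: every step smears it a little more, until it can no longer be read. What LSTM gives the network is a safe.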

Where it sits

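LSTM stands between two ideas in this museum. Behind it is backpropagation (Rumelhart, Hinton and Williams, 1986), the learning rule it relies on; ahead of it is the Transformer (Vaswani et al., 2017), which eventually displaced it in the largest language models by connecting any two words directly instead of passing memory along step by step. But the question LSTM posed (how can a machine hold on to what matters across a long gap?) remains the central question for every model that reads sequences.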

The original document

Original source text

S. Hochreiter & J. Schmidhuber · Neural Computation 9(8): 1735–1780 · November 15, 1997 · MIT Press

Abstract

Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow.

Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow.

The abstract goes on to note that LSTM is local in space and time, with computational complexity per time step and weight of O(1); that the experiments use local, distributed, real-valued and noisy pattern representations; and that against real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets and neural sequence chunking, LSTM yields many more successful runs and learns much faster.

The problem: vanishing error

The introduction first reviews Hochreiter's 1991 analysis of why gradient-based recurrent nets cannot learn long-range dependencies: as the error signal is propagated back through time, it is repeatedly multiplied by weights and squashing-function derivatives, so it shrinks (or, less often, blows up) exponentially. After enough steps it is too faint to teach the network anything.
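The shrinking and blowing up described above can be illustrated with a toy calculation. This is only a sketch under simplifying assumptions that are not in the paper: a single self-connected unit, a logistic squashing function, and the same net input at every step. The names `sigmoid` and `backflow_factor` are made up for the illustration.

```python
import math


def sigmoid(x):
    """Logistic squashing function."""
    return 1.0 / (1.0 + math.exp(-x))


def backflow_factor(weight, net_input, steps):
    """Scale applied to an error signal after `steps` steps back in time.

    Assumption for this sketch: one self-connected logistic unit that sees
    the same net input at every step, so each step back multiplies the
    error by weight * f'(net_input).
    """
    slope = sigmoid(net_input) * (1.0 - sigmoid(net_input))  # f'(net), at most 0.25
    return (weight * slope) ** steps


# Per-step factor 1.0 * 0.25 = 0.25, so the error vanishes.
vanishing = backflow_factor(weight=1.0, net_input=0.0, steps=100)

# Per-step factor 8.0 * 0.25 = 2.0, so the error blows up.
exploding = backflow_factor(weight=8.0, net_input=0.0, steps=100)

print(f"after 100 steps: vanishing={vanishing:.3e}, exploding={exploding:.3e}")
```

Because the logistic slope never exceeds 0.25, the per-step factor in this toy setting stays below 1 unless the weight is larger than 4 in magnitude, which is one way to see why shrinking is the more common case.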

The constant error carousel

The fix is a special unit whose self-recurrent connection has a fixed weight of 1.0 and an identity activation — a "constant error carousel" (CEC). Error flowing back through it is neither scaled down nor up: it is carried, unchanged, across arbitrarily many time steps.
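A small numerical sketch of why the fixed weight of 1.0 and the identity activation matter. The helper `carry_back` and the comparison unit with weight 0.9 are assumptions made for illustration; the only thing taken from the text is that each step back multiplies the error by the self-weight times the activation slope.

```python
def carry_back(error, self_weight, activation_slope, steps):
    """Propagate an error signal `steps` steps back through one self-connection.

    Each step multiplies the error by self_weight * activation_slope.
    """
    for _ in range(steps):
        error *= self_weight * activation_slope
    return error


# Constant error carousel: weight 1.0 and identity activation (slope 1.0).
cec_error = carry_back(1.0, self_weight=1.0, activation_slope=1.0, steps=1000)

# Hypothetical comparison unit whose self-weight is only slightly below 1.
leaky_error = carry_back(1.0, self_weight=0.9, activation_slope=1.0, steps=1000)

print(cec_error, leaky_error)
```

The carousel returns the error exactly as it went in, while even a small departure from 1.0 compounds into a vanishingly small signal over the same 1000 steps.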

Gates

A bare CEC would let every input overwrite the stored value and every later unit read it indiscriminately. So the cell is wrapped in two multiplicative gates: an input gate that protects the stored contents from irrelevant inputs, and an output gate that protects other units from the cell's contents until they are needed. (The familiar forget gate is not in this 1997 paper; it was added later — see Limits.)
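The write, hold and read behaviour of the two gates can be sketched as a forward pass through one cell. This is a simplified illustration, not the paper's exact formulation: the gate values are set by hand here instead of being computed by learned gate units, `tanh` stands in for the paper's squashing functions, and the names `cell_step` and `schedule` are hypothetical.

```python
import math


def cell_step(state, candidate, input_gate, output_gate):
    """One step of a memory cell with input and output gates (no forget gate).

    state       -- value held in the cell from the previous step
    candidate   -- raw value offered for writing at this step
    input_gate  -- 0.0 (shut) to 1.0 (open); hand-set in this sketch
    output_gate -- 0.0 (shut) to 1.0 (open); hand-set in this sketch
    """
    # Self-connection of weight 1.0: the old state is carried over unchanged,
    # and the input gate scales how much of the candidate is added to it.
    new_state = state + input_gate * math.tanh(candidate)
    # The output gate scales how much of the contents other units can see.
    output = output_gate * math.tanh(new_state)
    return new_state, output


# Each entry is (candidate, input_gate, output_gate).
schedule = [(2.0, 1.0, 0.0)]           # write: input gate open
schedule += [(-3.0, 0.0, 0.0)] * 1000  # hold: irrelevant inputs, both gates shut
schedule += [(0.0, 0.0, 1.0)]          # read: output gate open

state = 0.0
outputs = []
for candidate, in_gate, out_gate in schedule:
    state, out = cell_step(state, candidate, in_gate, out_gate)
    outputs.append(out)

print("stored value:", state)
print("value read after 1000 steps:", outputs[-1])
```

With the input gate shut, the 1000 distracting inputs leave the stored value untouched, and with the output gate shut nothing leaks out until the final read.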

[ … ]

Experiments

The paper reports a battery of artificial long-time-lag tasks — embedded Reber grammars, the noisy/adding/multiplication problems, and tasks with delays of 1000 steps — on which prior recurrent algorithms fail and LSTM succeeds. The full 46-page article, with the detailed cell diagram, the gradient-truncation derivation and all task tables, is available at the source below.

IDSIA, Lugano · 1997