Artificial intelligence · 1997

Long Short-Term Memory

Sepp Hochreiter and Jürgen Schmidhuber

A gated memory cell that holds a value intact, so that networks could at last remember across long gaps.


In depth · the introduction

For decades, neural networks had no way to hold on to anything for long. This 1997 paper gave them a memory cell, and at last they could remember.

Taking the idea apart

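A network that reads a sequence (words, sounds, sensor readings) passes a small summary of what it has seen so far from each step to the next. The trouble is that this summary is blurred and faded a little at every step. By the time something useful arrives, the network has already forgotten the clue from a hundred steps earlier that would have explained it.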

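Long short-term memory fixes this with a memory cell: a small box that can hold a value steadily, indefinitely and without fading. Two switches guard the box. The input gate decides when something new is written in; the output gate decides when the contents are read out. In between, the value simply sits there, preserved intact. That is the whole trick: not a better memory, but a memory you can lock.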
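
The contrast between a fading summary and a lockable cell can be sketched in a few lines of Python. This is a toy illustration, not the paper's model: the decay factor of 0.9, the hard 0-or-1 gates and the function names are all assumptions made here for clarity, whereas real LSTM gates are soft and learned.

```python
def fading_summary(inputs, decay=0.9):
    """Plain running summary: every step shrinks whatever was stored before.
    The decay of 0.9 is an arbitrary value chosen for illustration."""
    state = 0.0
    for x in inputs:
        state = decay * state + x
    return state


def gated_cell(inputs, write_at, read_at):
    """Toy memory cell with hard gates (assumption: real gates are soft)."""
    cell = 0.0
    read_out = None
    for t, x in enumerate(inputs):
        input_gate = 1.0 if t == write_at else 0.0
        output_gate = 1.0 if t == read_at else 0.0
        # The old value is carried over with weight 1, so it never fades.
        cell = cell + input_gate * x
        if output_gate:
            read_out = output_gate * cell
    return read_out


# A clue of 1.0 at step 0, followed by 100 steps of nothing.
sequence = [1.0] + [0.0] * 100
faded = fading_summary(sequence)
kept = gated_cell(sequence, write_at=0, read_at=100)
print(f"fading summary after 100 steps: {faded:.6f}")
print(f"gated cell after 100 steps:     {kept:.6f}")
```

The fading summary ends up at 0.9 raised to the 100th power, which is practically zero, while the gated cell returns the clue unchanged.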

Where it came from

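The seed was a discouraging discovery. In 1991 Sepp Hochreiter, then a student of Jürgen Schmidhuber in Munich, worked out mathematically why recurrent networks fail to learn long-range patterns: the learning signal shrinks exponentially as it is passed back through time, until it is too weak to teach anything. It was a precise diagnosis of why the whole approach kept failing.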

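Rather than abandon recurrent networks, the two designed a remedy. Their answer, a constant error carousel wrapped in gates, was published in Neural Computation in 1997. For years it was an underrated paper in a field that soon turned its attention elsewhere; only much later, when sequence problems became central to AI, did it become one of the field's most cited papers.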

Why it matters

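Because so much of the world is a sequence to begin with. Speech is a sequence of sounds; a sentence is a sequence of words; a heartbeat trace, a stock price, a melody are all sequences in which what matters now may depend on something that happened long ago. LSTM was the first design that could learn these long-range dependencies reliably, and in the roughly two decades after its publication it became the engine behind speech recognizers, translators and handwriting recognizers. The phone that took your dictation in the mid-2010s was very likely running an LSTM.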

An everyday picture

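Think of the memory cell as a small safe with two doors. The input door opens only when you have something worth keeping: you put the note in and shut the door again. While both doors are closed, the note inside does not fade, smudge or drift; the safe simply keeps it. Later, when you actually need the note, you open the output door and read it. An ordinary network is like writing the note on the back of your hand: it smears a little with every step until it can no longer be read. What LSTM gives the network is a safe. Below, try opening and closing the two doors yourself.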

Where it sits

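LSTM stands between two other ideas in this museum. Behind it is backpropagation (Rumelhart, Hinton and Williams, 1986), the learning rule it relies on; ahead of it is the Transformer (Vaswani et al., 2017), which eventually displaced it in the largest language models by connecting any two words directly rather than passing memory along step by step. But the question LSTM posed (how can a machine hold on to what matters across a long gap?) remains the central question for every model that reads sequences.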

The original document

Original source text

S. Hochreiter & J. Schmidhuber · Neural Computation 9(8): 1735–1780 · November 15, 1997 · MIT Press

Abstract

Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow.

Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow.

The abstract goes on to note that LSTM is local in space and time, with computational complexity per time step and weight of O(1); that the experiments use local, distributed, real-valued and noisy pattern representations; and that against real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets and neural sequence chunking, LSTM yields many more successful runs and learns much faster.

The problem: vanishing error

The introduction first reviews Hochreiter's 1991 analysis of why gradient-based recurrent nets cannot learn long-range dependencies: as the error signal is propagated back through time, it is repeatedly multiplied by weights and squashing-function derivatives, so it shrinks (or, less often, blows up) exponentially. After enough steps it is too faint to teach the network anything.
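
A rough numerical sketch of that repeated multiplication follows. It assumes the simplest possible case, which is not taken from the paper's experiments: a single logistic unit connected only to itself and held at a fixed net input, so that each backward step scales the error by the same factor. The function name and the weights 1.0 and 8.0 are hypothetical values chosen to show both regimes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def backflow_factor(weight, net, steps):
    """Scale applied to an error signal sent back `steps` time steps through
    one self-connected logistic unit held at net input `net` (assumption:
    the net input stays the same at every step)."""
    derivative = sigmoid(net) * (1.0 - sigmoid(net))  # never exceeds 0.25
    return (weight * derivative) ** steps


for steps in (1, 10, 50, 100):
    shrinking = backflow_factor(weight=1.0, net=0.0, steps=steps)
    growing = backflow_factor(weight=8.0, net=0.0, steps=steps)
    print(f"{steps:>3} steps  weight 1.0: {shrinking:.3e}  weight 8.0: {growing:.3e}")
```

With a per-step factor of 0.25 the signal collapses within a few dozen steps; with a per-step factor of 2 it explodes instead. Only a factor of exactly 1 keeps it stable.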

The constant error carousel

The fix is a special unit whose self-recurrent connection has a fixed weight of 1.0 and an identity activation — a "constant error carousel" (CEC). Error flowing back through it is neither scaled down nor up: it is carried, unchanged, across arbitrarily many time steps.
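
A short worked derivation may make the choice of weight 1.0 and identity activation look less arbitrary. The symbols are assumptions introduced here for illustration: a single unit j connected only to itself, with activation function f_j, net input net_j, self-weight w_jj and backpropagated error ϑ_j.

```latex
\begin{align}
  % Error one step earlier, for a unit connected only to itself:
  \vartheta_j(t) &= f_j'\bigl(\mathrm{net}_j(t)\bigr)\, w_{jj}\, \vartheta_j(t+1) \\
  % Constant error flow needs the per-step scaling factor to be exactly one:
  f_j'\bigl(\mathrm{net}_j(t)\bigr)\, w_{jj} &= 1 \\
  % Integrating this condition gives a linear activation:
  f_j\bigl(\mathrm{net}_j(t)\bigr) &= \frac{\mathrm{net}_j(t)}{w_{jj}} \\
  % With w_{jj} = 1 the activation is the identity, so over q steps:
  \vartheta_j(t-q) &= \vartheta_j(t)
\end{align}
```

Any other constant factor, applied q times, would shrink or grow the error exponentially in q, which is why the carousel fixes the factor at one.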

Gates

A bare CEC would let every input overwrite the stored value and every later unit read it indiscriminately. So the cell is wrapped in two multiplicative gates: an input gate that protects the stored contents from irrelevant inputs, and an output gate that protects other units from the cell's contents until they are needed. (The familiar forget gate is not in this 1997 paper; it was added later — see Limits.)
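
One time step of such a gated cell might look like the sketch below. It is a simplification under stated assumptions rather than the paper's exact formulation: the cell is a single scalar, both squashing functions are tanh, and the gate inputs are passed in directly instead of being computed from learned weights. The name `cell_step` and all numeric values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cell_step(state, cell_input, in_gate_net, out_gate_net):
    """One time step of a single memory cell with an input gate and an
    output gate, and no forget gate."""
    input_gate = sigmoid(in_gate_net)    # near 1: write; near 0: protect the store
    output_gate = sigmoid(out_gate_net)  # near 1: expose; near 0: hide the store
    # Constant error carousel: the old state is carried over with weight 1.0.
    new_state = state + input_gate * math.tanh(cell_input)
    output = output_gate * math.tanh(new_state)
    return new_state, output


# Hypothetical gate drives: write once, keep both gates shut for 1000
# steps of distracting input, then open the output gate.
state = 0.0
state, out = cell_step(state, cell_input=2.0, in_gate_net=20.0, out_gate_net=-20.0)
stored = state
for _ in range(1000):
    state, out = cell_step(state, cell_input=-3.0, in_gate_net=-20.0, out_gate_net=-20.0)
hidden_out = out
state, out = cell_step(state, cell_input=0.0, in_gate_net=-20.0, out_gate_net=20.0)
print(f"stored: {stored:.6f}  after 1000 steps: {state:.6f}  read out: {out:.6f}")
```

Because the gates here are soft, a closed gate is only almost closed, so the stored value drifts very slightly over the 1000 distracting steps instead of staying bit-for-bit identical.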

[ … ]

Experiments

The paper reports a battery of artificial long-time-lag tasks — embedded Reber grammars, the noisy/adding/multiplication problems, and tasks with delays of 1000 steps — on which prior recurrent algorithms fail and LSTM succeeds. The full 46-page article, with the detailed cell diagram, the gradient-truncation derivation and all task tables, is available at the source below.

IDSIA, Lugano · 1997