Artificial Intelligence 1997

Long Short-Term Memory

Sepp Hochreiter & Jürgen Schmidhuber

A gated memory cell that holds a value untouched — so a network can finally remember across long gaps.

In depth · the introduction

For decades, neural networks were hopeless at remembering anything for long. This 1997 paper gave them a memory cell — and they finally could.

The idea, unpacked

Networks that read a sequence — words, sounds, sensor readings — pass a little summary of "what I've seen" from one step to the next. The trouble is that this summary gets blurred and faded at every step. By the time something useful arrives, the network has forgotten the clue from a hundred steps ago that would explain it.

Long Short-Term Memory fixes this with a memory cell: a little box that can hold one value steady, indefinitely, without it fading. Two switches guard the box — an input gate that decides when something new is written in, and an output gate that decides when the contents are read out. In between, the value just sits there, perfectly preserved. That is the whole trick: not a better memory, but a memory you can lock.

Where it came from

The seed was a discouraging discovery. In 1991, Sepp Hochreiter, then a student of Jürgen Schmidhuber in Munich, worked out mathematically why recurrent networks couldn't learn long-range patterns: the learning signal shrinks exponentially as it travels back through time, until it's too faint to teach anything. It was a precise diagnosis of why the whole approach kept failing.

Rather than abandon recurrent networks, the two designed a cure. Their answer — the constant error carousel wrapped in gates — appeared in Neural Computation in 1997. It was, for years, an underappreciated paper in a field that would soon move on to other things; only much later, when sequence problems became central to AI, did it become one of the most-cited papers in the field.

Why it mattered

Because so much of the world is a sequence. Speech is a sequence of sounds; a sentence is a sequence of words; a heartbeat trace, a stock price, a melody — all sequences where what matters now may depend on something far in the past. LSTM was the first design that could reliably learn those long-range dependencies, and for nearly two decades it was the engine behind speech recognizers, translators, and handwriting readers. The phone that took your dictation in the early 2010s was very likely running an LSTM.

An everyday picture

Think of the memory cell as a small safe with two doors. The input door opens only when you have something worth keeping: you put the note in and shut it. While both doors are closed, the note inside doesn't fade, smudge, or drift; the safe simply holds it. Later, when you actually need the note, you open the output door and read it. An ordinary network is like writing the note on your hand: it smears a little with every step until it's unreadable. LSTM gives the network a safe instead.

Where it sits

LSTM stands between two ideas in this Library. Behind it is backpropagation (Rumelhart, Hinton & Williams, 1986), the learning rule it relies on; ahead of it is the Transformer (Vaswani et al., 2017), which eventually replaced it for the largest language models by connecting any two words directly instead of passing memory step by step. But the question LSTM posed — how does a machine hold on to what matters across a long gap? — is still the central question of every model that reads a sequence.

The original document

Original source text

S. Hochreiter & J. Schmidhuber · Neural Computation 9(8): 1735–1780 · November 15, 1997 · MIT Press

Abstract

Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow.

Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow.

The abstract goes on to note that LSTM is local in space and time, with computational complexity per time step and weight of O(1); that the experiments use local, distributed, real-valued and noisy pattern representations; and that against real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets and neural sequence chunking, LSTM yields many more successful runs and learns much faster.

The problem: vanishing error

The introduction first reviews Hochreiter's 1991 analysis of why gradient-based recurrent nets cannot learn long-range dependencies: as the error signal is propagated back through time, it is repeatedly multiplied by weights and squashing-function derivatives, so it shrinks (or, less often, blows up) exponentially. After enough steps it is too faint to teach the network anything.
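The exponential shrinkage can be made concrete with a little arithmetic. The sketch below is a deliberate simplification, not the paper's analysis: it assumes a single unit connected only to itself, with the same weight and the same squashing-function derivative at every step. The function name `backflow_scale` is made up for illustration. The only outside fact it leans on is that the logistic sigmoid's derivative never exceeds 0.25.

```python
def backflow_scale(weight: float, derivative: float, steps: int) -> float:
    """Factor applied to an error signal after travelling `steps` steps back in time.

    Simplifying assumption: one self-connected unit whose weight and
    squashing-function derivative are identical at every step.
    """
    scale = 1.0
    for _ in range(steps):
        # Each step back multiplies the error by weight * derivative.
        scale *= weight * derivative
    return scale


if __name__ == "__main__":
    # 0.25 is the largest value the logistic sigmoid's derivative can take.
    for weight in (1.0, 4.0, 5.0):
        print(f"weight={weight}: scale after 100 steps = "
              f"{backflow_scale(weight, 0.25, 100):.3e}")
```

With a per-step factor below 1 the signal vanishes, above 1 it blows up, and only a factor of exactly 1 leaves it intact over any number of steps.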

The constant error carousel

The fix is a special unit whose self-recurrent connection has a fixed weight of 1.0 and an identity activation — a "constant error carousel" (CEC). Error flowing back through it is neither scaled down nor up: it is carried, unchanged, across arbitrarily many time steps.
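The condition behind the CEC can be written out for a single unit j that feeds only into itself. The notation here (error signal \vartheta_j, activation function f_j, self-weight w_{jj}, net input net_j) is chosen for illustration and may differ in detail from the paper's own.

```latex
\begin{align}
\vartheta_j(t) &= f_j'\big(\mathrm{net}_j(t)\big)\, w_{jj}\, \vartheta_j(t+1)
  && \text{(error passed back one step)} \\
f_j'\big(\mathrm{net}_j(t)\big)\, w_{jj} &= 1
  && \text{(requirement for constant error flow)} \\
f_j(x) = x,\quad w_{jj} = 1
  &\;\Longrightarrow\; \vartheta_j(t) = \vartheta_j(t+q) \quad \text{for every } q \ge 0
\end{align}
```

An identity activation with a self-weight of 1.0 is the simplest choice that satisfies the requirement at every step, which is why the carried error neither decays nor grows.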

Gates

A bare CEC would let every input overwrite the stored value and every later unit read it indiscriminately. So the cell is wrapped in two multiplicative gates: an input gate that protects the stored contents from irrelevant inputs, and an output gate that protects other units from the cell's contents until they are needed. (The familiar forget gate is not in this 1997 paper; it was added later — see Limits.)
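The write, hold, and read behaviour of a gated cell can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's exact formulation: `tanh` stands in for the paper's input and output squashing functions, the gate net inputs are passed in directly instead of being computed from learned weights, and the names `cell_step`, `write_hold_read`, `OPEN`, and `CLOSED` are hypothetical. As in the 1997 design, there is no forget gate.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; maps a gate's net input to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def cell_step(state, cell_input, in_gate_net, out_gate_net):
    """One forward step of a single gated memory cell (no forget gate)."""
    in_gate = sigmoid(in_gate_net)
    out_gate = sigmoid(out_gate_net)
    # CEC update: the old state is carried over with weight 1.0, and new
    # content is added only as far as the input gate allows.
    new_state = state + in_gate * math.tanh(cell_input)
    # Other units see the state only as far as the output gate allows.
    output = out_gate * math.tanh(new_state)
    return new_state, output


# Illustrative net inputs that drive a gate almost fully open or shut.
OPEN, CLOSED = 20.0, -20.0


def write_hold_read(value, hold_steps):
    """Write once, hold through distracting inputs, then read the cell."""
    state, _ = cell_step(0.0, value, OPEN, CLOSED)        # write
    written = state
    for t in range(hold_steps):                           # hold
        # math.sin(t) plays the role of irrelevant input during the gap.
        state, _ = cell_step(state, math.sin(t), CLOSED, CLOSED)
    held = state
    _, read = cell_step(state, 0.0, CLOSED, OPEN)         # read
    return written, held, read


if __name__ == "__main__":
    written, held, read = write_hold_read(2.0, 1000)
    print(f"written={written:.6f} held={held:.6f} read={read:.6f}")
```

With both gates shut, the stored state barely moves over the 1000 distracting steps, and the cell's output stays near zero until the output gate opens.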

[ … ]

Experiments

The paper reports a battery of artificial long-time-lag tasks — embedded Reber grammars, the noisy/adding/multiplication problems, and tasks with delays of 1000 steps — on which prior recurrent algorithms fail and LSTM succeeds. The full 46-page article, with the detailed cell diagram, the gradient-truncation derivation and all task tables, is available at the source below.

IDSIA, Lugano · 1997