Long Short-Term Memory
A gated memory cell that holds a value untouched — so a network can finally remember across long gaps.
For decades, neural networks were hopeless at remembering anything for long. This 1997 paper gave them a memory cell — and they finally could.
The idea, unpacked
Networks that read a sequence — words, sounds, sensor readings — pass a little summary of "what I've seen" from one step to the next. The trouble is that this summary gets blurred and faded at every step. By the time something useful arrives, the network has forgotten the clue from a hundred steps ago that would explain it.
Long Short-Term Memory fixes this with a memory cell: a little box that can hold one value steady, indefinitely, without it fading. Two switches guard the box — an input gate that decides when something new is written in, and an output gate that decides when the contents are read out. In between, the value just sits there, perfectly preserved. That is the whole trick: not a better memory, but a memory you can lock.
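The locking behaviour described above can be sketched in a few lines of Python. This is a toy, not the paper's implementation: the class name `MemoryCell`, the hand-set gate values `OPEN` and `CLOSED`, and the input values are all assumptions made for illustration. In a real network the gate activations are computed from learned weights rather than set by hand.

```python
import math


def sigmoid(z: float) -> float:
    """Squash z into (0, 1); used here as a soft open/closed switch."""
    return 1.0 / (1.0 + math.exp(-z))


class MemoryCell:
    """Toy one-value cell with an input gate and an output gate (hypothetical)."""

    def __init__(self) -> None:
        self.state = 0.0  # the value held in the "box"

    def step(self, candidate: float, in_logit: float, out_logit: float) -> float:
        in_gate = sigmoid(in_logit)    # near 1: write, near 0: ignore the input
        out_gate = sigmoid(out_logit)  # near 1: expose, near 0: hide the contents
        # Additive update: the old state is carried over with weight exactly 1.
        self.state = self.state + in_gate * candidate
        return out_gate * math.tanh(self.state)


OPEN, CLOSED = 20.0, -20.0  # assumed hand-set gate logits, not learned values

cell = MemoryCell()
cell.step(candidate=0.8, in_logit=OPEN, out_logit=CLOSED)  # write 0.8 into the box

hidden = 0.0
for _ in range(1000):  # a long gap full of distracting inputs
    hidden = cell.step(candidate=5.0, in_logit=CLOSED, out_logit=CLOSED)

readout = cell.step(candidate=5.0, in_logit=CLOSED, out_logit=OPEN)  # read it back

print(f"output while locked: {hidden:.2e}")
print(f"stored state after the gap: {cell.state:.5f}")
print(f"readout: {readout:.5f}")
```

Because the gates are soft switches rather than hard ones, a tiny amount of each distracting input still leaks in, so the stored value stays very close to 0.8 rather than exactly equal to it.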
Where it came from
The seed was a discouraging discovery. In 1991, Sepp Hochreiter, then a student of Jürgen Schmidhuber in Munich, worked out mathematically why recurrent networks couldn't learn long-range patterns: the learning signal shrinks exponentially as it travels back through time, until it's too faint to teach anything. It was a precise diagnosis of why the whole approach kept failing.
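The exponential shrinkage can be seen in a minimal sketch, assuming a recurrent network with a single tanh unit. The function name `gradient_through_time`, the weight 0.9, and the constant input 0.1 are hypothetical choices for illustration. Each step back in time multiplies the learning signal by one more factor, and when those factors are smaller than 1 the product collapses quickly.

```python
import math
from typing import Sequence


def gradient_through_time(weight: float, inputs: Sequence[float]) -> float:
    """Return d h_T / d h_0 for the one-unit RNN h_t = tanh(weight * h_{t-1} + x_t)."""
    h = 0.0
    grad = 1.0
    for x in inputs:
        h = math.tanh(weight * h + x)
        grad *= weight * (1.0 - h * h)  # chain rule: one factor per time step
    return grad


W = 0.9  # assumed recurrent weight; any per-step factor below 1 behaves similarly

for steps in (1, 10, 50, 100):
    g = gradient_through_time(W, [0.1] * steps)
    print(f"{steps:>3} steps back: gradient factor = {g:.3e}")
```

Since the derivative of tanh never exceeds 1, every per-step factor here is at most 0.9, so the signal after 100 steps is bounded by 0.9 to the power 100, which is already below one part in ten thousand.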
Rather than abandon recurrent networks, the two designed a cure. Their answer, the constant error carousel (a unit that feeds its own value back to itself unchanged, so the learning signal passing through it neither shrinks nor grows) wrapped in gates, appeared in Neural Computation in 1997. For years it was an underappreciated paper, as much of the field turned to other methods; only much later, when sequence problems became central to AI, did it become one of the most-cited papers in the field.
Why it mattered
Because so much of the world is a sequence. Speech is a sequence of sounds; a sentence is a sequence of words; a heartbeat trace, a stock price, and a melody are all sequences in which what matters now may depend on something far in the past. LSTM was the first design that could reliably learn those long-range dependencies, and for much of the next two decades it was the engine behind speech recognizers, translators, and handwriting readers. The phone that took your dictation in the mid-2010s was very likely running an LSTM.
An everyday picture
Think of the memory cell as a small safe with two doors. The input door opens only when you have something worth keeping: you put the note in and shut it. While both doors are closed, the note inside doesn't fade, smudge, or drift; the safe simply holds it. Later, when you actually need the note, you open the output door and read it. An ordinary network is like writing the note on your hand: it smears a little with every step until it's unreadable. LSTM gives the network a safe instead.
Where it sits
LSTM stands between two ideas in this Library. Behind it is backpropagation (Rumelhart, Hinton & Williams, 1986), the learning rule it relies on; ahead of it is the Transformer (Vaswani et al., 2017), which eventually replaced it for the largest language models by connecting any two words directly instead of passing memory step by step. But the question LSTM posed — how does a machine hold on to what matters across a long gap? — is still the central question of every model that reads a sequence.
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow.
Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow.
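The "constant error flow" in the passage above can be written out as a short worked derivation. This is a sketch under stated assumptions: the symbols (cell state s_c, input gate activation y^{in}, squashing functions g and f, recurrent weight w) are chosen for illustration, and the gates are treated as constants when differentiating, which mirrors the truncation the passage mentions.

```latex
\begin{align*}
% Cell state update: old state carried over with weight 1, plus gated new input
s_c(t) &= s_c(t-1) + y^{in}(t)\, g\big(net_c(t)\big) \\[4pt]
% Error flow through the self-connection, one step and q steps back
\frac{\partial s_c(t)}{\partial s_c(t-1)} &= 1
\quad\Longrightarrow\quad
\frac{\partial s_c(t)}{\partial s_c(t-q)} = \prod_{k=1}^{q} 1 = 1 \\[4pt]
% Compare a conventional recurrent unit y(t) = f(net(t)), net(t) = w\, y(t-1) + \dots
\frac{\partial y(t)}{\partial y(t-q)} &= \prod_{k=0}^{q-1} w\, f'\big(net(t-k)\big)
\end{align*}
```

In the conventional unit the product shrinks exponentially in q whenever each factor has magnitude below 1. In the carousel every factor is exactly 1, so an error signal can travel back across 1000 steps without fading.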