Recurrent Nets, LSTMs & Sequences

Why a sequence is different

In the last guide you met the convolutional net, which treats an image as one frozen grid and slides the same filter over it everywhere. That works beautifully because a cat in the corner is still a cat. But now consider a sentence. The words arrive in order, and the order is the meaning: "dog bites man" and "man bites dog" use identical words yet describe opposite events. A network that sees only a frozen snapshot has no way to respect that order.

Sequences are everywhere: words in a sentence, notes in a melody, prices over days, frames in a video, readings from a sensor. They share two awkward traits for a standard feed-forward net. First, they have no fixed length — a tweet and a novel are both text. Second, what came earlier should color how we read what comes later. We need an architecture whose very wiring carries information forward through time.

The loop: one sticky note, rewritten

A recurrent neural network (RNN) reads a sequence one step at a time, and at every step it keeps a running memory called the hidden state. Picture reading aloud with a single sticky note in hand. You read the first word and jot a quick summary; you read the next word and rewrite the note to fold in what you just learned; and so on to the end. The note never grows — it just gets overwritten each step — yet by the last word it carries a compressed trace of everything before.

The crucial detail: it is the *same* update rule applied at every step. The network has one small set of weights that it reuses for word one, word two, word three thousand. That looping reuse is what "recurrent" means, and it is why a single modest network can handle a sequence of any length. The new hidden state is computed from the previous hidden state and the current input, then squashed by an activation like tanh.

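import numpy as np

rng = np.random.default_rng(0)
sequence = rng.normal(size=(6, 3))        # toy input: 6 steps, 3 features each
W_h, W_x, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)  # toy weights
h = np.zeros(4)          # the sticky note starts blank
for x in sequence:       # read one step at a time
    h = np.tanh(W_h @ h + W_x @ x + b)   # rewrite the note
    y = h                # optional: emit something each step (here, the note itself)
# same W_h, W_x, b reused at EVERY step

The core of an RNN: one hidden state h, rewritten with the same weights at each step. The sizes and random weights are toy values chosen only so the snippet runs.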

Training works by an idea called backpropagation through time. You mentally "unroll" the loop into a long chain — one copy of the network per step — and run ordinary backprop backward along it, summing each weight's contribution across all the steps where it was used. Conceptually it is just the same chain rule you already know, stretched out across time.
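To make the unrolling concrete, here is a minimal NumPy sketch of backpropagation through time for the tanh update shown earlier. The toy loss (half the squared length of the final hidden state), the sizes, and the helper names `forward`, `loss`, and `bptt` are assumptions made for illustration, not part of any library.

```python
import numpy as np

def forward(W_h, W_x, b, xs):
    """Run the RNN and keep every hidden state (the backward pass needs them)."""
    hs = [np.zeros(W_h.shape[0])]            # hs[0] is the blank starting note
    for x in xs:
        hs.append(np.tanh(W_h @ hs[-1] + W_x @ x + b))
    return hs

def loss(W_h, W_x, b, xs):
    """Toy loss (an assumption): half the squared norm of the final hidden state."""
    h_last = forward(W_h, W_x, b, xs)[-1]
    return 0.5 * float(h_last @ h_last)

def bptt(W_h, W_x, b, xs):
    """Walk the unrolled chain backward, summing each weight's gradient over steps."""
    hs = forward(W_h, W_x, b, xs)
    dW_h, dW_x, db = np.zeros_like(W_h), np.zeros_like(W_x), np.zeros_like(b)
    dh = hs[-1].copy()                       # d(loss)/d(h) at the final step
    for t in range(len(xs), 0, -1):          # steps T, T-1, ..., 1
        da = dh * (1.0 - hs[t] ** 2)         # back through tanh
        dW_h += np.outer(da, hs[t - 1])      # same W_h at every step, so accumulate
        dW_x += np.outer(da, xs[t - 1])
        db += da
        dh = W_h.T @ da                      # hand the signal to the previous step
    return dW_h, dW_x, db

rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.5, size=(4, 4))
W_x = rng.normal(scale=0.5, size=(4, 3))
b = np.zeros(4)
xs = rng.normal(size=(6, 3))                 # 6 steps, 3 features each

dW_h, dW_x, db = bptt(W_h, W_x, b, xs)

# Sanity check: compare one gradient entry with a finite-difference estimate.
eps = 1e-6
W_plus, W_minus = W_h.copy(), W_h.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (loss(W_plus, W_x, b, xs) - loss(W_minus, W_x, b, xs)) / (2 * eps)
print(dW_h[0, 1], numeric)                   # the two numbers should agree closely
```

The `+=` lines are the "summing each weight's contribution" part: one weight matrix, many steps, one accumulated gradient.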

The memory problem

Here is the stubborn weakness that nearly killed the RNN. Because the memory is squeezed and rewritten at every step, the influence of early words tends to fade. Worse, during training the error signal must travel back through that same long chain, getting multiplied at every step by a factor that depends on the recurrent weights and the local slope of tanh. If those factors are below one (which the flat tails of tanh, where the slope is near zero, make easy), multiplying many of them drives the signal toward zero, exponentially fast. This is the vanishing gradient problem you met in the deep-net rung, now stretched across time instead of layers.
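The shrinkage is easy to watch numerically. The sketch below pushes an error signal backward through 50 steps of a tanh RNN and records its size along the way. The recurrent matrix is built, as a deliberate assumption, so that it scales every vector by exactly 0.9; the sizes and random inputs are likewise toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 8, 50
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # a random orthogonal matrix
W_h = 0.9 * Q                                  # assumption: shrinks every vector by 0.9
W_x = rng.normal(size=(n, n))

# Forward pass, remembering each hidden state.
hs = [np.zeros(n)]
for x in rng.normal(size=(T, n)):
    hs.append(np.tanh(W_h @ hs[-1] + W_x @ x))

# Backward pass: send an error signal from the last step toward the first.
dh = np.ones(n)
norms = [np.linalg.norm(dh)]
for t in range(T, 0, -1):
    dh = W_h.T @ (dh * (1.0 - hs[t] ** 2))     # tanh slope is at most 1, then W_h shrinks it
    norms.append(np.linalg.norm(dh))

for k in (0, 10, 25, 50):
    print(f"{k:2d} steps back: signal size {norms[k]:.2e}")
```

Each step multiplies the signal by at most 0.9, so after 50 steps it can be no larger than 0.9 to the 50th power of its starting size, and saturated tanh units make it smaller still.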

In practice this means a plain RNN struggles to connect things that are far apart. Consider: "The keys that I left on the kitchen table this morning before the long, chaotic meeting ... are gone." By the time the network reaches "are," the subject "keys" is many steps back, and the faded memory may no longer record that it was plural. Long-range dependencies, such as a pronoun and the noun it refers to ten sentences earlier, are exactly what the plain RNN struggles to hold.

There is a mirror-image danger too: if those per-step numbers are above one, the signal blows up instead — the exploding gradient — and training destabilizes into wild swings. The common patch is gradient clipping: cap the gradient's size whenever it grows too large. Clipping tames explosions, but it does nothing for the fading direction. The vanishing problem needed a deeper fix.
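One common form of the patch is clipping by norm: if the gradient's overall length exceeds a threshold, rescale it to that length and keep its direction. The sketch below is a minimal version; the helper name `clip_by_norm` and the threshold of 5.0 are illustrative assumptions.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; the direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad                              # small gradients pass through untouched

exploding = np.array([300.0, -400.0])        # norm 500: far too large
vanishing = np.array([3e-9, -4e-9])          # norm 5e-9: far too small

clipped = clip_by_norm(exploding, max_norm=5.0)    # shrunk to norm 5, same direction
unchanged = clip_by_norm(vanishing, max_norm=5.0)  # clipping cannot enlarge a faded signal

print(np.linalg.norm(clipped), np.linalg.norm(unchanged))
```

The second call is the point of the paragraph above: a vanished gradient comes out exactly as small as it went in.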

Gates: LSTM and GRU

The breakthrough was to stop overwriting the whole memory every step. The LSTM (long short-term memory) keeps a protected memory line called the cell state: picture a conveyor belt running straight through the whole sequence, picking up and dropping off cargo only when told to. Information can ride along it almost untouched from word three to word three hundred. And because the belt is updated by gentle scaling and addition rather than a full rewrite, the error signal can flow back along it with far less fading, which largely sidesteps the vanishing-gradient trap.

What does the telling is a set of small valves called gates — tiny learned controllers, each emitting a number between 0 (fully shut) and 1 (fully open) via a sigmoid. The forget gate decides what old memory to wipe; the input gate decides what new information is worth storing; the output gate decides how much of the memory to reveal right now. Because the gates can choose to leave the cell state alone, a fact learned early simply survives.
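The three gates and the conveyor belt fit in a few lines of NumPy. This is a sketch of the standard LSTM step, not a production implementation: stacking all four weight blocks into one matrix `W`, the toy sizes, and the hand-set biases in the second half are assumptions for illustration. The `g` term is the candidate content that the input gate lets in.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W maps the concatenated [h, x] to four stacked blocks."""
    n = h.shape[0]
    z = W @ np.concatenate([h, x]) + b
    f = sigmoid(z[0 * n:1 * n])      # forget gate: how much old memory to keep
    i = sigmoid(z[1 * n:2 * n])      # input gate: how much new information to store
    o = sigmoid(z[2 * n:3 * n])      # output gate: how much memory to reveal
    g = np.tanh(z[3 * n:4 * n])      # candidate values offered to the cell state
    c_new = f * c + i * g            # the conveyor belt: keep some, add some
    h_new = o * np.tanh(c_new)       # the visible hidden state
    return h_new, c_new

rng = np.random.default_rng(0)
n_hidden, n_input = 4, 3
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden + n_input))
b = np.zeros(4 * n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.normal(size=(5, n_input)):
    h, c = lstm_step(x, h, c, W, b)

# Force the gates by hand: forget gate held at ~1 (keep), input gate held at ~0 (shut).
b_keep = np.zeros(4 * n_hidden)
b_keep[0:n_hidden] = 20.0
b_keep[n_hidden:2 * n_hidden] = -20.0
c_start = np.array([1.0, -2.0, 0.5, 3.0])
h_keep, c_keep = np.zeros(n_hidden), c_start.copy()
for x in rng.normal(size=(300, n_input)):
    h_keep, c_keep = lstm_step(x, h_keep, c_keep, np.zeros_like(W), b_keep)

print(c_start, c_keep)               # the stored values survive 300 steps almost unchanged
```

In a trained network the gates are learned rather than pinned, but the second loop shows the mechanism: when the gates leave the cell state alone, its contents ride through hundreds of steps.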

The GRU (gated recurrent unit) is a streamlined cousin that chases the same goal with fewer moving parts. It uses just two gates — an update gate that blends "keep what I knew" against "take in what just arrived," and a reset gate that decides how much past to ignore — and it merges the cell state and hidden state into one. In practice GRUs and LSTMs perform similarly; neither is universally better, so people often try both. The GRU is lighter and a touch faster to train, which helps on smaller datasets or tighter hardware.
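For comparison, here is a sketch of one GRU step in the same NumPy style. Papers and libraries differ on which way round the update gate blends; the convention below (gate near 1 means "take in what just arrived") is an assumption, as are the helper name `gru_step`, the toy sizes, and the parameter tuple.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, params):
    """One GRU step; there is no separate cell state, only h."""
    W_z, W_r, W_c, b_z, b_r, b_c = params
    hx = np.concatenate([h, x])
    z = sigmoid(W_z @ hx + b_z)        # update gate: 0 = keep what I knew, 1 = take the new
    r = sigmoid(W_r @ hx + b_r)        # reset gate: how much of the past the candidate sees
    candidate = np.tanh(W_c @ np.concatenate([r * h, x]) + b_c)
    return (1.0 - z) * h + z * candidate   # blend old memory with the candidate

rng = np.random.default_rng(0)
n_hidden, n_input = 4, 3
shape = (n_hidden, n_hidden + n_input)
params = (
    rng.normal(scale=0.1, size=shape),   # W_z
    rng.normal(scale=0.1, size=shape),   # W_r
    rng.normal(scale=0.1, size=shape),   # W_c
    np.zeros(n_hidden), np.zeros(n_hidden), np.zeros(n_hidden),
)

h = np.zeros(n_hidden)
for x in rng.normal(size=(5, n_input)):
    h = gru_step(x, h, params)

# Weight counts at the same sizes: three blocks for a GRU, four for an LSTM.
per_block = n_hidden * (n_hidden + n_input) + n_hidden
gru_params, lstm_params = 3 * per_block, 4 * per_block
print(gru_params, lstm_params)
```

The parameter count is where "fewer moving parts" shows up: at equal sizes the GRU carries three weight blocks to the LSTM's four.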

What RNNs built — and why transformers took over

The LSTM was introduced in 1997, and from roughly the early 2010s until around 2017 gated RNNs were the workhorse of sequence learning. They powered the first wave of genuinely good machine translation, speech recognition, handwriting recognition, and the predictive text on your phone. A key pattern they enabled was the encoder–decoder setup: one RNN reads the whole input and compresses it into a summary, and a second RNN generates the output one token at a time, conditioned on that summary and on what it has produced so far.

That recipe had a bottleneck: squeezing a whole sentence into one summary vector chokes on long inputs. The fix was to let the decoder look back at any part of the input as needed — an attention mechanism. Attention was first bolted onto RNNs as a helper. Then in 2017 came the realization that you could throw away the recurrence entirely and keep only attention: the Transformer.
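The "look back at any part of the input" idea can be sketched as a weighted average over the encoder's per-step states. Scoring by dot product, as below, is one simple choice and an assumption here; other scoring functions exist. The helper name `attend` and the toy states are also illustrative.

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """Return a weighted summary of encoder_states and the weights used."""
    scores = encoder_states @ decoder_state          # one relevance score per input step
    scores = scores - scores.max()                   # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax: positive, sums to 1
    context = weights @ encoder_states               # weighted average of input states
    return context, weights

encoder_states = 5.0 * np.eye(4)        # 4 input steps with deliberately distinct states
decoder_state = encoder_states[2]       # a query that resembles input step 2

context, weights = attend(decoder_state, encoder_states)
print(np.round(weights, 3))             # nearly all the weight lands on step 2
```

Instead of one fixed summary vector, the decoder gets a fresh `context` at every output step, built from whichever input positions score highest for its current state.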

Why did Transformers win? Not because attention is magic, but because of a practical engineering fact: an RNN must read one step at a time, because each hidden state depends on the one before it, so its work cannot be parallelized across the sequence and it is slow on long inputs. A Transformer processes every position of the sequence at once during training, which maps well onto the parallel hardware of modern GPUs. That made it possible to train on vastly more data, and at the scale of today's large language models, scale is what mattered most.