The Transformer Architecture

From a clever trick to a whole machine

In the last two guides you built the engine: self-attention lets every word gather information from every other word, and multi-head attention runs several of those lookups in parallel so the model can track grammar, meaning, and reference all at once. But an engine is not a car. A single attention layer, on its own, is shallow — it mixes information once and stops. The [[transformer-architecture|Transformer]] is what you get when you take that engine and build the rest of the vehicle around it: how to stack the layers, how to keep them stable, and how to make the whole thing actually produce text.

The famous 2017 paper that introduced it was called *Attention Is All You Need* — a deliberate jab at the recurrent networks that ruled language at the time. Those older models read a sentence strictly left to right, one word at a time, passing a running memory along like a relay baton. That made them slow to train and prone to forgetting the start of a long sentence by the time they reached the end. The Transformer's bet was radical: throw out the recurrence entirely, look at all the words at once with attention, and let stacked layers do the deep thinking.

One block: attention, then a little brain

The Transformer is built from one repeated unit, the *block*, stacked many times. Each block does two jobs in sequence. First, an attention layer lets every position mix in information from the others — this is the *communication* step, where words talk to each other. Second comes a [[feed-forward-block|feed-forward block]]: a small two-layer network applied to each position *independently*, the same little network re-used at every word. If attention is words talking, the feed-forward block is each word going off to think privately about what it just heard.
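To make "applied to each position independently" concrete, here is a minimal sketch in plain Python. The sizes and the weight values in `W1` and `W2` are made up for illustration (a real model learns them and uses far wider layers), and ReLU is assumed as the nonlinearity.

```python
# Hypothetical toy sizes: 2-dimensional token vectors, 3 hidden units.
W1 = [[1.0, -1.0, 0.5],
      [0.5, 1.0, -1.0]]      # shape (d_model=2, d_hidden=3)
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]            # shape (d_hidden=3, d_model=2)


def vec_mat(v, m):
    """Multiply row vector v (length n) by matrix m (n rows)."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def feed_forward(token):
    """Two-layer network for ONE position: linear, ReLU, linear."""
    hidden = [max(0.0, h) for h in vec_mat(token, W1)]
    return vec_mat(hidden, W2)


def feed_forward_block(sequence):
    """Apply the same network to every position, one at a time."""
    return [feed_forward(token) for token in sequence]


sequence = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
out = feed_forward_block(sequence)

# Changing position 0 leaves the outputs at positions 1 and 2 untouched.
changed = feed_forward_block([[9.0, 9.0]] + sequence[1:])
print(out[1:] == changed[1:])  # True
```

Nothing in `feed_forward` can see a neighboring token, which is exactly why the attention step has to come first.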

That division of labor matters more than it first looks. Attention moves information *between* positions but does almost no nonlinear reasoning; the feed-forward block does the heavy per-token transformation but cannot see its neighbors. Alternating them — gather, then process, gather, then process — is what gives a deep Transformer its power. Counting parameters, the feed-forward blocks are usually the *bulk* of the model, not the attention; much of what a model 'knows' lives in those quiet little networks.
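A rough count shows where the parameters sit. The sketch below assumes a model width of 768 and a feed-forward hidden layer four times as wide (a common ratio, but an assumption here), and it ignores biases, layer norm, and embeddings.

```python
def attention_params(d_model):
    """Four d_model x d_model matrices: query, key, value, output."""
    return 4 * d_model * d_model


def feed_forward_params(d_model, d_hidden):
    """Two matrices: d_model x d_hidden and d_hidden x d_model."""
    return 2 * d_model * d_hidden


d_model = 768            # assumed width, for illustration only
d_hidden = 4 * d_model   # assumed 4x expansion

attn = attention_params(d_model)
ffn = feed_forward_params(d_model, d_hidden)
ffn_share = ffn / (attn + ffn)
print(attn, ffn, round(ffn_share, 3))  # 2359296 4718592 0.667
```

Under these assumptions the feed-forward block holds about two thirds of each block's weights. Splitting attention into several heads does not change its total.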

One more piece runs *before* the very first block: a positional encoding. Because attention looks at all words simultaneously, it has no built-in sense of order — to raw attention, "dog bites man" and "man bites dog" are the same bag of words. The positional encoding stamps each token with a signal for *where* it sits, restoring the word order that recurrence used to give for free.
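One way to build that signal is the sinusoidal scheme from the original paper, sketched below in plain Python. It is only one option (many models learn their position signals or use other schemes), and the 4-dimensional word vectors are hypothetical.

```python
import math


def positional_encoding(num_positions, d_model):
    """Return one d_model-length vector per position (d_model must be even)."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table


def add_positions(embeddings):
    """Add the position signal to each token embedding."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(tok, pos)] for tok, pos in zip(embeddings, pe)]


# Hypothetical 4-dimensional embeddings for "dog", "bites", "man".
dog, bites, man = [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]

a = add_positions([dog, bites, man])  # "dog bites man"
b = add_positions([man, bites, dog])  # "man bites dog"
print(a[0] != b[2])  # True: "dog" at position 0 differs from "dog" at position 2
```

After the stamp, the two sentences are no longer the same bag of vectors, so attention has something to tell them apart with.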

Residuals and layer norm: the glue that lets it go deep

Stacking dozens of blocks sounds easy, but deep networks are notoriously hard to train — gradients fade away or blow up as they travel back through many layers (you met this as the *vanishing gradient* problem earlier in the ladder). Two simple tricks make the depth survivable. The first is the [[residual-connection|residual connection]]: instead of replacing its input, each sub-layer *adds* its output to it. The block computes a small correction, and the original signal flows straight through untouched.

Picture an editor marking up a draft. A residual connection means the editor hands back the original page *plus* their notes in the margin, rather than rewriting the whole thing from scratch. The text survives the trip through fifty editors; gradients, traveling backward along that same straight path, survive too. Without residuals, training a Transformer past a handful of layers barely works at all.

The second trick is [[layer-normalization|layer normalization]]: before (or after) each sub-layer, the numbers at each position are re-centered and re-scaled to a tidy range. This keeps the signal from drifting to extreme values as it climbs the stack, so every layer receives input in a comfortable zone. *Where* you place it — "pre-norm" (normalize before the sub-layer) versus "post-norm" (after) — is a real design choice; modern large models almost all use pre-norm because it trains far more stably at depth.
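The re-centering and re-scaling is a few lines of arithmetic. This sketch handles one position's vector and leaves out the learnable gain and bias that real implementations usually add afterward. The `eps` value is an assumed small constant that guards against dividing by zero.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Shift one position's features to mean 0 and scale them to variance ~1."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


position = [100.0, 102.0, 98.0, 104.0]   # values that have drifted large
normed = layer_norm(position)
print([round(v, 3) for v in normed])
```

The relative pattern of the numbers survives. Only their overall offset and spread are reset.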

# one Transformer block (pre-norm style)
x = x + attention(layer_norm(x))      # communicate: words mix
x = x + feed_forward(layer_norm(x))   # compute: each word thinks
# stack this block N times (N = 12, 48, 96, ...)

The whole block uses four moving parts (layer norm, attention, feed-forward, residual add) in two passes: normalize, attend, add back; normalize, process, add back. Stack it N times; that 'add back' is the residual highway gradients ride home on.
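The pseudocode above can be turned into a small runnable sketch by passing the sub-layers in as functions. The stand-ins used here (`zeros`, `identity`, `nudge`) are hypothetical placeholders, not real attention or feed-forward layers. They are chosen to show that when a sub-layer contributes nothing, the input passes through the whole stack unchanged.

```python
def add(a, b):
    """Element-wise sum of two sequences of vectors."""
    return [[x + y for x, y in zip(u, v)] for u, v in zip(a, b)]


def transformer_block(x, attention, feed_forward, layer_norm):
    """Pre-norm block: each sub-layer's output is ADDED to its input."""
    x = add(x, attention(layer_norm(x)))
    x = add(x, feed_forward(layer_norm(x)))
    return x


def run_stack(x, blocks):
    """Feed x through every block in order."""
    for attention, feed_forward, layer_norm in blocks:
        x = transformer_block(x, attention, feed_forward, layer_norm)
    return x


# Stand-in sub-layers (hypothetical placeholders).
def zeros(seq):
    return [[0.0 for _ in vec] for vec in seq]


def identity(seq):
    return seq


def nudge(seq):
    return [[0.5 for _ in vec] for vec in seq]


x = [[1.0, 2.0], [3.0, 4.0]]
silent_stack = [(zeros, zeros, identity)] * 50
print(run_stack(x, silent_stack) == x)           # True: the signal survives 50 blocks
print(run_stack(x, [(nudge, zeros, identity)]))  # [[1.5, 2.5], [3.5, 4.5]]
```

Each block can only add a correction on top of what it received, which is the property the editor analogy describes.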

Encoder, decoder, or both

The original Transformer was built for machine translation, so it had two stacks of blocks — an [[encoder-decoder-stack|encoder–decoder stack]]. The *encoder* reads the whole source sentence and builds a rich representation of it, with every word free to attend to every other word in both directions. The *decoder* then generates the translation, and it does two kinds of attention: it attends to the words it has produced so far, and through [[cross-attention|cross-attention]] it reaches back into the encoder's representation to consult the source. Encoder digests, decoder writes while glancing at the encoder's notes.
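What sets cross-attention apart is where its inputs come from: the queries come from the decoder, the keys and values from the encoder. The sketch below is a simplification that skips the learned query, key, and value projections and uses the raw vectors directly. The numbers are made up.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_weights(queries, keys):
    """One row of weights per query, one column per key."""
    scale = math.sqrt(len(keys[0]))
    return [softmax([dot(q, k) / scale for k in keys]) for q in queries]


def cross_attention(decoder_states, encoder_states):
    """Queries come from the decoder; keys and values from the encoder."""
    weights = attention_weights(decoder_states, encoder_states)
    dim = len(encoder_states[0])
    return [[sum(w * v[d] for w, v in zip(row, encoder_states)) for d in range(dim)]
            for row in weights]


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 source positions
decoder_states = [[2.0, 0.0], [0.0, 2.0]]               # 2 target positions so far

out = cross_attention(decoder_states, encoder_states)
print(len(out), len(out[0]))  # 2 2
```

The output has one vector per decoder position, each a blend of the three encoder vectors. The source and target sentences do not need to be the same length.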

Researchers soon noticed you don't always need both halves. Keep only the *encoder* and you get a model that reads and understands but never generates — great for classifying a sentence or tagging it (the BERT family works this way). Keep only the *decoder* and you get a pure text generator. Almost every chatbot-style large language model you have heard of is *decoder-only*: a single stack of blocks whose only job is to predict the next word, over and over.

How a decoder writes: one token at a time

A decoder generates text by [[autoregressive-decoding|autoregressive decoding]]: it predicts one token, appends it to the sequence, and feeds the whole thing back in to predict the next — over and over, like a writer who can only ever add the next word and never skip ahead. The catch is that to learn this, a word must never be allowed to peek at words that come *after* it; otherwise the model would cheat by reading the answer. This is enforced by [[masked-attention|masked attention]]: in the decoder, attention is blocked from looking forward, so each position sees only itself and the past.
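In practice the mask is usually applied to the attention scores just before the softmax: every score that points at a future position is set to negative infinity, so it receives a weight of exactly zero. A minimal sketch with made-up scores for a 3-token sequence:

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(scores):
    """scores[i][j] = how much position i wants to attend to position j."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions (j > i) get -inf, so softmax gives them weight 0.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        weights.append(softmax(masked))
    return weights


# Hypothetical raw scores; note the large values pointing at the future.
scores = [[0.0, 5.0, 5.0],
          [1.0, 1.0, 9.0],
          [0.0, 0.0, 0.0]]

weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

Even the large score of 9.0 in the second row is ignored, because it points from position 1 to position 2, which has not been written yet.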

1. Run the prompt through the stack and look at the final position's output vector.
2. Turn that vector into a probability over the whole vocabulary (a softmax over tens of thousands of possible next tokens).
3. Pick a token from that distribution: greedily, or by sampling for variety.
4. Append it to the sequence and repeat from step 1 until you hit a stop token.
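The four steps above can be sketched as a short loop. Everything about the "model" here is a hypothetical stand-in: `toy_model` looks up next-token scores from the last token only, whereas a real Transformer conditions on the whole sequence, and the five-word vocabulary is invented for the example.

```python
import math
import random

VOCAB = ["<stop>", "the", "dog", "bites", "man"]

# Hypothetical stand-in for the Transformer stack: scores for the next token,
# looked up from the last token only. A real model uses the whole sequence.
NEXT_SCORES = {
    "the":   [0.0, 0.0, 3.0, 0.0, 2.0],
    "dog":   [1.0, 0.0, 0.0, 4.0, 0.0],
    "bites": [0.0, 1.0, 0.0, 0.0, 4.0],
    "man":   [5.0, 0.0, 0.0, 1.0, 0.0],
}


def toy_model(tokens):
    return NEXT_SCORES[tokens[-1]]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, greedy=True, max_new_tokens=10, rng=None):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_model(tokens))        # steps 1 and 2
        if greedy:
            index = probs.index(max(probs))       # step 3: most likely token
        else:                                     # step 3: sample for variety
            index = (rng or random).choices(range(len(VOCAB)), weights=probs)[0]
        if VOCAB[index] == "<stop>":              # step 4: stop token ends the loop
            break
        tokens.append(VOCAB[index])               # step 4: append and go again
    return tokens


print(generate(["the"]))  # ['the', 'dog', 'bites', 'man']
```

Calling `generate(["the"], greedy=False)` samples instead, so repeated runs can produce different sentences from the same prompt.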

This loop is also why generation feels *sequential* and can be slow: tokens come out one after another, each waiting on the last. A clever optimization called the KV cache stores the key and value vectors already computed for earlier tokens so they aren't recomputed at every step; without it, generating a long reply would be hopelessly slow. It changes the speed, not the nature: a Transformer still writes the way you read aloud, one word after another.
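A small experiment shows what the cache saves. In this sketch `project_kv` is a made-up stand-in for the learned key and value projections, the query is just the raw token vector, and a counter records how often the projection runs. Both versions produce identical outputs. Only the amount of work differs.

```python
import math

calls = {"count": 0}


def project_kv(x):
    """Stand-in for the learned key/value projections (hypothetical weights)."""
    calls["count"] += 1
    key = [2.0 * v for v in x]
    value = [v + 1.0 for v in x]
    return key, value


def attend(query, keys, values):
    """Softmax-weighted average of values, weighted by query-key similarity."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[d] for e, v in zip(exps, values)) for d in range(dim)]


def step_without_cache(tokens):
    """Recompute keys and values for every token seen so far."""
    pairs = [project_kv(x) for x in tokens]
    keys, values = [k for k, _ in pairs], [v for _, v in pairs]
    return attend(tokens[-1], keys, values)


def step_with_cache(new_token, cache):
    """Project only the newest token; reuse everything already in the cache."""
    key, value = project_kv(new_token)
    cache["keys"].append(key)
    cache["values"].append(value)
    return attend(new_token, cache["keys"], cache["values"])


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

calls["count"] = 0
slow = [step_without_cache(tokens[: t + 1]) for t in range(len(tokens))]
slow_calls = calls["count"]

calls["count"] = 0
cache = {"keys": [], "values": []}
fast = [step_with_cache(x, cache) for x in tokens]
fast_calls = calls["count"]

print(slow == fast, slow_calls, fast_calls)  # True 10 4
```

For four tokens the gap is 10 projections versus 4. Without the cache the total grows roughly with the square of the reply length, and with it the total grows in step with the length.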

It is worth being honest about what this loop is and is not. The model is not planning a whole answer and then typing it out; at each step it is only estimating *what token tends to come next* given everything so far. The fluency is real and often astonishing, but there is no hidden ledger of facts being consulted — which is exactly why these models can state falsehoods with total confidence. The architecture is a magnificent next-word predictor, not an oracle.

Why it won

The Transformer displaced almost everything before it for one underrated reason: it is *parallel-friendly*. Because a layer processes every word at the same time rather than waiting for the previous one, during training you can pour a whole sentence, or a whole book, through it in one shot on a GPU (generation, as described above, still runs one token at a time). That made it possible to train on staggering amounts of text, and it turned out that bigger Transformers trained on more data just kept getting better, smoothly and predictably. The architecture didn't just win on quality; it won on *trainability at scale*.

None of this means the Transformer is the final word, and the hype that it is the last architecture we will ever need should be taken with salt. Its attention cost grows with the square of the sequence length (double the input and the number of attention scores quadruples), which is a genuine ceiling; the positional encoding scheme is still actively re-invented; and serious alternatives are being explored. What is fair to say is that the same handful of ideas you now understand (stacked blocks of attention and feed-forward, glued by residuals and layer norm, generating one token at a time) underlie nearly every system people call 'AI' today. You are no longer looking at a black box. You can name its parts.