The thing the old models couldn't do: wait for nothing
By now you can read a Transformer block by hand. So why did it win? The single biggest reason is almost mechanical, and it has little to do with intelligence: a Transformer reads a whole sequence at once. A recurrent network processes word 1, then word 2, then word 3 — each step waits for the one before it. That dependency chain is fine on paper, but it strangles the one resource that mattered most in the 2010s: the GPU, a chip that is gloriously fast only when thousands of computations run side by side.
Self-attention has no such chain. Every token looks at every other token in one big matrix multiply, so during training a whole sentence, or a whole page, is processed in parallel. (Generation still emits one token at a time, but training sees the full sequence at once.) A Transformer therefore turns a long, idle, step-by-step task into one fat batch of math that a GPU can devour. The architecture didn't just fit the hardware; it fit the hardware *the rest of the field was racing to build*. Speed of training, more than any clever idea, is what let people try Transformers at sizes nobody had dared before.
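To make the contrast concrete, here is a minimal NumPy sketch, not a faithful model of any real system. The sizes, the random weights, and the names (`W_h`, `W_x`, `W_q`, `W_k`, `W_v`) are assumptions chosen only for illustration. The recurrent half needs a loop in which each step depends on the previous one; the attention half handles every position with a few matrix multiplies.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4                      # toy sizes, chosen only for illustration
x = rng.normal(size=(seq_len, d))      # one row per token

# Recurrent style: step t cannot start until step t-1 has finished.
W_h = rng.normal(size=(d, d)) * 0.1
W_x = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
rnn_states = []
for t in range(seq_len):               # inherently sequential loop
    h = np.tanh(h @ W_h + x[t] @ W_x)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Self-attention style: all tokens are handled by a few matrix multiplies.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d)                    # (seq_len, seq_len): every token vs every token
scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
attn_out = weights @ V                           # no loop over positions

print(rnn_states.shape, attn_out.shape)
```

Both halves produce one output vector per token. The difference is that the loop in the first half cannot be spread across positions, while the matrix multiplies in the second half can.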
"Attention is all you need" — what the title really claimed
The 2017 paper "Attention Is All You Need" has a deliberately cheeky title. Its real claim was narrow and bold at once: you can throw away recurrence and convolution entirely, keep only attention plus simple feed-forward layers, and still beat the best machine translation systems of the day. For years people had bolted attention *onto* recurrent models as a helper. The paper's move was to make attention the whole load-bearing structure: the Transformer you have now built from parts.
Be honest about the title, though, because the field rarely is. "All you need" was true for that translation benchmark in 2017; it is *not* a law of nature. Transformers still need positional encodings (attention alone is order-blind), still need normalization and residual connections to train, and still lean on the feed-forward blocks for much of their raw capacity. The slogan stuck because it was catchy, not because attention is literally the only ingredient.
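The "order-blind" point can be checked directly. The sketch below is a toy single-head attention with random weights; the function name `self_attention` and all sizes are assumptions for illustration. It shuffles the input tokens and confirms that the outputs are simply shuffled the same way. Without positional information, attention cannot tell one ordering from another.

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """Single-head self-attention with no positional information."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)                  # row-wise softmax
    return w @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                     # 5 tokens, 8 features each
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

perm = np.array([3, 0, 4, 1, 2])                # an arbitrary reordering of the tokens
out = self_attention(x, W_q, W_k, W_v)
out_shuffled = self_attention(x[perm], W_q, W_k, W_v)

# Shuffling the inputs only shuffles the outputs: each token's result is unchanged.
order_blind = np.allclose(out[perm], out_shuffled)
print(order_blind)
```

Adding a positional encoding to `x` before the call breaks this symmetry, which is exactly why Transformers need one.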
Scaling and transfer: the two engines that made it a juggernaut
A fast-to-train architecture only matters if growing it keeps paying off — and here Transformers got lucky in a way few designs do. Researchers found scaling laws: across many orders of magnitude, loss drops smoothly and predictably as you add parameters, data, and compute. No cliff, no obvious ceiling in the range tested. That turned a research gamble into something closer to engineering: spend 10x the compute, get a measurably better model. Money could buy capability, so money poured in.
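Published scaling-law studies typically describe that smooth drop as a power law. The sketch below assumes the simplest such form, loss = a * compute^(-b), with constants `a` and `b` invented purely for illustration (they are not fitted to any real model). It shows what "predictable" means in practice: every 10x increase in compute multiplies the loss by the same factor.

```python
def loss(compute, a=10.0, b=0.05):
    """Hypothetical power-law loss curve; a and b are made-up constants."""
    return a * compute ** (-b)

# Compute budgets spanning six orders of magnitude (units are arbitrary here).
budgets = [10.0 ** k for k in range(18, 25)]

# Ratio of losses between each budget and the one 10x larger.
ratios = [loss(big) / loss(small) for small, big in zip(budgets, budgets[1:])]

for small, r in zip(budgets, ratios):
    print(f"{small:.0e} -> {small * 10:.0e}: loss multiplied by {r:.3f}")
```

On a log-log plot this curve is a straight line, which is why extrapolating to the next budget feels more like engineering than gambling.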
The second engine is transfer. Instead of training one model per task, you do pretraining once on an ocean of unlabeled text, learning general structure, then cheaply fine-tune or even just prompt it for each specific job. This is the transfer learning idea from earlier rungs, but the Transformer made it spectacular: one big foundation model becomes a reusable substrate for translation, summarization, code, and chat alike. The large language model is exactly this — a single pretrained Transformer pressed into a thousand jobs.
Mixture-of-experts: paying for a giant brain, using a sliver of it
Scaling laws say bigger is better, but bigger also means every token pays for every parameter — cost explodes. Mixture-of-experts (MoE) is the clever dodge. You replace one big feed-forward block with, say, 64 smaller "expert" blocks, plus a tiny router that, for each token, picks just 2 of them to run. The model can *hold* a huge number of parameters, but any single token only *activates* a small fraction of them.
    # one MoE feed-forward layer, per token (pseudocode)
    scores = router(token)        # one score per expert: how well it fits this token
    weights = softmax(scores)     # turn the scores into mixing weights
    top2 = argtop(scores, k=2)    # indices of the 2 best-scoring experts
    out = 0
    for e in top2:                # run ONLY those 2, not all 64
        out += weights[e] * expert[e](token)
    # total params: 64 experts. compute paid: 2 experts.

The win is real: you decouple total knowledge (parameter count) from per-token compute cost. The honest costs are real too. All those experts must live in memory even though most stay idle for any given token, so MoE is memory-hungry and trickier to serve. Routing can also collapse so that a few experts hog most of the tokens, which is why training usually needs load-balancing tricks. MoE is a scaling *trick*, not a leap in intelligence: it buys you a bigger model at a friendlier compute bill, nothing more mystical than that.
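The routing pseudocode above can be turned into a small runnable sketch. Everything here is an assumption for illustration: the toy sizes, the random weights, and the names `W_in`, `W_out`, `W_router`, and `moe_layer`. One deliberate difference from the pseudocode: this version applies the softmax over the two chosen experts only, so their weights sum to 1. That is a common variant, not the only option.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 32        # toy sizes, not from the text
n_experts, k = 64, 2              # 64 experts, 2 active per token, as in the text

# Each expert is a small two-layer feed-forward block.
W_in = rng.normal(size=(n_experts, d_model, d_hidden)) * 0.1
W_out = rng.normal(size=(n_experts, d_hidden, d_model)) * 0.1
W_router = rng.normal(size=(d_model, n_experts)) * 0.1

def moe_layer(token):
    """Route one token vector through its top-k experts."""
    scores = token @ W_router                       # one score per expert
    top = np.argsort(scores)[-k:]                   # indices of the k highest scores
    gate = np.exp(scores[top] - scores[top].max())
    gate /= gate.sum()                              # softmax over the chosen experts only
    out = np.zeros(d_model)
    for g, e in zip(gate, top):                     # only k experts ever run
        hidden = np.maximum(token @ W_in[e], 0.0)   # ReLU
        out += g * (hidden @ W_out[e])
    return out, top

token = rng.normal(size=d_model)
out, used = moe_layer(token)

total_params = W_in.size + W_out.size               # expert weights held in memory
active_params = k * (d_model * d_hidden * 2)        # expert weights touched by this token
print(out.shape, sorted(used.tolist()), active_params / total_params)
```

The final ratio is 2/64: all expert weights occupy memory, but each token pays compute for only a small slice of them (the router's own weights are left out of the count for simplicity).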
The honest tradeoffs — and why "took over" doesn't mean "won forever"
Now the part the marketing skips. Self-attention compares every token with every other token, so its cost grows with the *square* of sequence length. Double the context length, quadruple the work. This quadratic bottleneck is why long documents are expensive and why a small industry exists just to claw that cost down: FlashAttention, which keeps the exact computation but cuts memory traffic, and sparse and linear attention variants, which restrict or approximate it. The architecture's defining strength, looking everywhere at once, is also its defining expense.
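The arithmetic behind "double the context, quadruple the work" is short enough to run. This sketch counts only the pairwise attention scores for one head in one layer. The helper name `attention_score_entries` and the example lengths are assumptions for illustration, and real costs also depend on model width, head count, and implementation.

```python
def attention_score_entries(seq_len):
    """Number of query-key pairs in full self-attention (one head, one layer)."""
    return seq_len * seq_len

lengths = [1_000, 2_000, 4_000, 8_000]
for n in lengths:
    print(f"{n:>6} tokens -> {attention_score_entries(n):>12,} pairwise scores")

# Growth factor when the context length doubles.
growth = [attention_score_entries(2 * n) / attention_score_entries(n) for n in lengths]
print(growth)
```

Every doubling gives a factor of 4, regardless of the starting length, which is what makes very long contexts so costly.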
Two more honest limits. Transformers are spectacularly data- and compute-hungry; the bill in dollars and energy is not a footnote. And as language models, they predict plausible next tokens — they have no built-in guarantee of truth, which is why hallucination (fluent, confident falsehoods) is structural, not a bug to be patched away. So-called emergent abilities — skills that seem to appear suddenly at scale — are exciting but contested: some shrink to smooth, unsurprising curves once you measure them more carefully.
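One way a "sudden" ability can dissolve under a different measurement is sketched below. The numbers are made up, and the sketch assumes each token of an answer is right or wrong independently, which real models need not satisfy. Per-token accuracy rises smoothly, but a metric that only gives credit when all 10 tokens are right stays near zero and then shoots up.

```python
answer_len = 10   # hypothetical task: the answer is 10 tokens long

# Made-up, smoothly improving per-token accuracy as models get larger.
per_token_acc = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

# Exact match requires every token to be correct (independence assumed).
exact_match = [p ** answer_len for p in per_token_acc]

for p, em in zip(per_token_acc, exact_match):
    print(f"per-token accuracy {p:.2f} -> exact-match {em:.3f}")
```

The underlying skill improves steadily in both columns; only the all-or-nothing score makes it look like a jump.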
So why did it take over? Not because it is the final or smartest design, but because it was parallel enough to train at scale, general enough to transfer everywhere, and lucky enough that scaling kept paying. Notice the reach beyond text: the same block now powers the Vision Transformer for images, along with models for audio, protein structure, and more. That generality is the real headline. Whether something faster at long context eventually replaces it is an open question. "Took over" is a snapshot of an era, not a verdict for all time.