Attention Is All You Need
Drop recurrence; let every word look directly at every other. The Transformer is born.
This is the design that lets an AI weigh every word against every other word at once — and it became the foundation of today's chatbots.
The idea, unpacked
Older language AIs read a sentence in order, one word at a time, trying to remember what came before. That made them slow to train and forgetful over long passages — by the end of a paragraph they'd half-lost the start. The Transformer threw out that step-by-step reading.
Its key idea is called attention. Instead of reading in sequence, the model looks at all the words together and, for each word, decides how much every other word matters to its meaning. All of those comparisons happen at the same time rather than one after another, which is exactly what made the model so fast to train — and so easy to make enormous.
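To make "every word weighs every other word" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the two-number word vectors are made up, and each vector stands in for query, key, and value at once, whereas a real Transformer derives those three roles through learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """For each word vector, weigh every word in the sentence, then blend.

    Simplification (assumption): the same vector serves as query, key, and value.
    """
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    # One row of weights per word; no row depends on any other row,
    # so all rows could be computed at the same time.
    weights = [softmax([dot(q, k) / scale for k in vectors]) for q in vectors]
    # Each word's new vector is a weighted average of all the word vectors.
    blended = [
        [sum(w * v[i] for w, v in zip(row, vectors)) for i in range(dim)]
        for row in weights
    ]
    return weights, blended

# Hypothetical two-number "meanings" for a three-word sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, blended = self_attention(sentence)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row belongs to one word and sums to 1. Because the rows are independent of one another, the work parallelizes in a way that step-by-step reading cannot.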
Where it came from
In 2017, a team of eight researchers at Google published a paper at the NeurIPS conference (then still called NIPS) with the cheeky title "Attention Is All You Need." It was aimed at machine translation, and on that narrow task it already beat the best systems of the day. But its real significance was the architecture it introduced, the Transformer, which spread through the field with startling speed and became the default backbone for language AI within just a couple of years.
Why it mattered
By removing the slow, sequential step, the Transformer became trainable at a scale nothing before it could reach. That unlocked a simple, powerful loop: make the model bigger, feed it more text, and its performance keeps improving. Bigger and more text, again and again: that loop has driven the LLM era, and this paper drew its blueprint. The chatbots, writing tools, translators and assistants that arrived afterwards are largely built on it.
A tiny example
Take the sentence "the trophy didn't fit in the suitcase because it was too big." What does "it" refer to: the trophy or the suitcase? You know instantly it's the trophy, but a computer has to figure it out. Attention is how the model does it: it lets the word "it" look across the whole sentence and lean most heavily on "trophy."
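The trophy sentence can be acted out with a toy calculation. Everything numeric here is an assumption chosen by hand so the effect is visible: the two-number vectors, what each number loosely means, and the query for "it" are all hypothetical. A trained model learns such values from data instead.

```python
import math

# Hypothetical hand-picked vectors; a trained model learns its own.
# The two numbers loosely mean: [is a physical object, could be "too big"].
keys = {
    "trophy":   [1.0, 2.0],
    "suitcase": [1.0, 0.5],
    "fit":      [0.0, 0.5],
    "big":      [0.0, 1.0],
}
query_it = [1.0, 2.0]  # what the word "it" is looking for (assumed)

def attention_weights(query, keys):
    """Score the query against every key, then softmax the scores."""
    scores = {w: sum(q * k for q, k in zip(query, vec)) for w, vec in keys.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

weights = attention_weights(query_it, keys)
for word, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:9s}{w:.2f}")
```

With these made-up numbers, "trophy" receives the largest share of the attention from "it". The point is the mechanism: a score per word, turned into weights that sum to 1.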
What came next
The two families that grew straight out of this paper became household names. BERT used the Transformer to read and understand text; the GPT models used it to generate text, and grew into the assistants people now talk to every day. Almost every AI you've used that handles language traces its design back to this one paper.
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.
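The paper's concrete version of this mapping is scaled dot-product attention. Written out for a single query q against n key-value pairs (the symbols n and the weights α are notation introduced here for illustration), the output is a weighted average of the values, with weights set by how well the query matches each key:

```latex
\alpha_i = \frac{\exp\left( q \cdot k_i / \sqrt{d_k} \right)}
                {\sum_{j=1}^{n} \exp\left( q \cdot k_j / \sqrt{d_k} \right)},
\qquad
\text{output} = \sum_{i=1}^{n} \alpha_i \, v_i

% Stacking all queries, keys, and values into matrices Q, K, V:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V
```

Here d_k is the length of the key vectors. Dividing by its square root keeps the dot products from growing so large that the softmax puts nearly all its weight on a single key.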
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results … by over 2 BLEU.