Artificial Intelligence 2017

Attention Is All You Need

Ashish Vaswani et al. (Google)

Drop recurrence; let every word look directly at every other. The Transformer is born.

In depth · the introduction

This is the design that lets an AI weigh every word against every other word at once — and it became the foundation of today's chatbots.

The idea, unpacked

Older language AIs read a sentence in order, one word at a time, trying to remember what came before. That made them slow to train and forgetful over long passages — by the end of a paragraph they'd half-lost the start. The Transformer threw out that step-by-step reading.

Its key idea is called attention. Instead of reading in sequence, the model looks at all the words together and, for each word, decides how much every other word matters to its meaning. All of those comparisons happen at the same time rather than one after another, which is exactly what made the model so fast to train — and so easy to make enormous.

Where it came from

In 2017, a team of eight researchers at Google published an eight-page paper at the NeurIPS conference with the cheeky title "Attention Is All You Need." It was aimed at machine translation, and on that narrow task it already beat the best systems of the day. But its real significance was the architecture it introduced — the Transformer — which spread through the field with startling speed, becoming the default backbone for language AI within just a couple of years.

Why it mattered

By removing the slow, sequential step, the Transformer became trainable at a scale nothing before it could reach. That unlocked a simple, powerful loop: make the model bigger, feed it more text, and it tends to become more capable. Repeating that loop, again and again, drives much of the LLM era, and this paper drew its blueprint. The chatbots, writing tools, translators and assistants that arrived afterwards are largely built on it.

A tiny example

Take the sentence "the trophy didn't fit in the suitcase because it was too big." What does "it" refer to — the trophy or the suitcase? You know instantly it's the trophy, but a computer has to figure it out. Attention is how the model does it: it lets the word "it" look across the whole sentence and lean most heavily on "trophy." Try the same thing yourself below.

[ Interactive sentence: click any word to see lines fan out to the other words it attends to, with thicker lines for stronger attention. Switch between attention heads to see different patterns: one head links words by meaning, another links neighbouring words, another resolves which noun a pronoun refers to. ]
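The pronoun example can also be sketched in a few lines of Python. This is only an illustration: the relevance scores are invented by hand, whereas a trained model would compute them itself. The softmax step, which turns raw scores into weights that sum to 1, is the same normalisation the paper uses.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

words = ["the", "trophy", "didn't", "fit", "in", "the", "suitcase",
         "because", "it", "was", "too", "big"]

# Hypothetical scores for how relevant each word is to the word "it".
# These numbers are made up for illustration, not taken from a real model.
scores_for_it = [0.1, 3.0, 0.2, 0.8, 0.1, 0.1, 1.5, 0.3, 0.5, 0.4, 0.6, 1.2]

weights = softmax(scores_for_it)

# Rank the words by how strongly "it" attends to them.
ranked = sorted(zip(words, weights), key=lambda pair: pair[1], reverse=True)
for word, weight in ranked[:3]:
    print(f"{word:>9}: {weight:.2f}")
```

With these invented scores, "trophy" receives the largest weight and "suitcase" the second largest, which mirrors the thicker and thinner lines in the interactive version.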

What came next

The two families that grew straight out of this paper became household names. BERT used the Transformer to read and understand text; the GPT models used it to generate text, and grew into the assistants people now talk to every day. Almost every AI you've used that handles language traces its design back to these eight pages.

The original document
Original source text
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin · Advances in Neural Information Processing Systems 30 (2017)
Abstract
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
The remainder of the abstract reports that the model is both more parallelizable and faster to train, and that it sets a new state of the art on two WMT 2014 machine-translation tasks.
Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.
The paper then defines Scaled Dot-Product Attention and Multi-Head Attention, and explains how the same mechanism is reused in the encoder, the decoder, and the cross-attention between them.
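As a rough sketch of what Scaled Dot-Product Attention computes, the paper's formula softmax(Q Kᵀ / √d_k) V can be written with plain Python lists. The function and variable names here are illustrative choices, not the paper's code. The example also feeds the same toy vectors in as queries, keys and values, a simplification that skips the learned projections a real Transformer applies first.

```python
import math

def softmax(row):
    """Normalise a list of scores into weights that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V using plain lists.

    queries: list of vectors, each of length d_k
    keys:    list of vectors, each of length d_k
    values:  list of vectors, one per key
    Returns the output vectors and the attention weights per query.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        output = [sum(w * v[j] for w, v in zip(weights, values))
                  for j in range(d_v)]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Toy self-attention: three positions, vectors of size 2 (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1, so every output vector is a weighted average of the value vectors. Multi-Head Attention, as the paper describes it, runs several such attention computations in parallel on differently projected inputs and concatenates the results.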
Self-attention
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
A table compares self-attention to recurrent and convolutional layers on computational complexity, the amount of parallel work, and the maximum path length between any two positions.
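To make that comparison concrete, the sketch below plugs numbers into the asymptotic costs the paper lists for each layer type: O(n²·d) per layer for self-attention, O(n·d²) for recurrent layers, and O(k·n·d²) for convolutional layers, where n is the sequence length, d the representation size, and k the convolution kernel width. The specific values of n, d and k are assumptions chosen for illustration, and constant factors are ignored, so the results are rough operation counts rather than measurements.

```python
import math

def layer_costs(n, d, k):
    """Rough costs per layer type, ignoring constant factors.

    n: sequence length, d: representation size, k: kernel width.
    """
    return {
        "self-attention": {
            "per_layer": n * n * d,   # every position compared with every other
            "sequential_steps": 1,    # all positions processed in parallel
            "max_path": 1,            # any two positions connect directly
        },
        "recurrent": {
            "per_layer": n * d * d,
            "sequential_steps": n,    # one step per position, in order
            "max_path": n,
        },
        "convolutional": {
            "per_layer": k * n * d * d,
            "sequential_steps": 1,
            # Stacked convolutions need about log_k(n) layers to connect
            # distant positions.
            "max_path": math.ceil(math.log(n, k)),
        },
    }

# Illustrative values: a 50-word sentence, 512-dimensional vectors, kernel width 3.
costs = layer_costs(n=50, d=512, k=3)
for name, c in costs.items():
    print(f"{name:>15}: {c['per_layer']:>12,} ops, "
          f"{c['sequential_steps']} sequential steps, max path {c['max_path']}")
```

With these values the sentence is shorter than the vector size (n < d), which is the regime where the paper notes that a self-attention layer is cheaper than a recurrent one. For very long sequences the n² term grows and the balance shifts.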
[ … ]
Results
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results … by over 2 BLEU.
The full paper, with its architecture diagram, the attention-visualisation figures, and the complete tables of BLEU scores and training cost, runs to eight pages and is available in full at the source below.
Google · June 2017