Self-Attention & Multi-Head Attention

The question every word asks

Read the sentence "the trophy didn't fit in the suitcase because it was too big." What does "it" refer to? You know instantly: the trophy. But notice what your mind did — to understand one word, it reached across the sentence and pulled in another. This reaching-across is the whole idea behind the [[attention-mechanism|attention mechanism]]. Earlier in this ladder you met the recurrent network, which processed words one at a time and tried to carry meaning forward in a single memory. Attention throws out that bottleneck: every word looks at every other word, directly, at once.

By this point each word is already a vector — an [[embedding|embedding]] that places the word somewhere in a high-dimensional meaning space. [[self-attention|Self-attention]] is the operation that lets these vectors talk to each other and update themselves. "Self" simply means the words attend to other words in the same sequence (not to some separate input). After one round, the vector for "it" has quietly absorbed a little of "trophy," so it now means something closer to "it (the trophy)." That mixing is the engine of the whole Transformer.

Query, key, value: a tiny library

The mechanism gives every word three roles, and the cleanest analogy is a library search. The query is what a word is looking for ("I'm a pronoun — which noun do I belong to?"). The key is the label each word advertises about itself ("I'm a singular, concrete object"). The value is the actual content a word will hand over if it gets chosen. This trio is so central it has its own name: query, key, value, usually written Q, K, V.

Crucially, the model is not handed these three vectors. Each word's current vector is multiplied by three separate learned weight matrices (W_Q, W_K, W_V) to produce its query, its key, and its value. The network learned those matrices through gradient descent over mountains of text, so they encode what makes a helpful question, a helpful label, and helpful content. Because the matrices are applied to a word's current vector, and in deeper layers that vector has already absorbed context, the same word "bank" can advertise one key in a river sentence and a different one in a money sentence. Q, K, and V are learned projections, not fixed properties of words.
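To make the three projections concrete, here is a minimal NumPy sketch. The sizes, the random embeddings, and the random weight matrices are all placeholders chosen for illustration; in a trained model W_Q, W_K, and W_V would hold learned values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration.
n_words, d_model, d_k = 4, 8, 3

# One vector per word (random stand-ins for real embeddings).
X = rng.normal(size=(n_words, d_model))

# Three separate projection matrices. A real model learns these by
# gradient descent; here they are random placeholders.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # what each word is looking for
K = X @ W_K   # the label each word advertises
V = X @ W_V   # the content each word hands over

print(Q.shape, K.shape, V.shape)   # (4, 3) (4, 3) (4, 3)
```

Every word gets its own row in Q, K, and V, and all three come from the same input row of X, just seen through different matrices.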

Scaled dot-product attention, step by step

Now we can run the search. To decide how much word A should listen to word B, we take A's query and B's key and compute their dot product: a single number that is large and positive when the two vectors point in similar directions, near zero when they are unrelated, and negative when they point in opposite ways. So the dot product literally measures "how well does B's label answer A's question?" Do this for every pair and you get a grid of raw scores: every word against every other word.

Score: for each word's query, dot it against every word's key. This gives a row of raw compatibility scores.
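Scale: divide every score by the square root of the key dimension. This is the "scaled" part, and it matters more than it looks: dot products of long vectors tend to be large in magnitude, and large scores push softmax toward putting nearly all its weight on one word, where gradients become tiny and learning stalls. Dividing by the square root of the key dimension keeps the scores in a moderate range.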
Normalize: pass the row through softmax, turning the scores into positive weights that sum to 1 — a probability-like distribution of attention.
Mix: use those weights to take a weighted average of all the value vectors. That blended vector is the word's new, context-aware representation.

This complete recipe is scaled dot-product attention. The softmax step is what makes it a soft choice rather than a hard one: a word doesn't pick a single other word, it spreads its attention — maybe 0.7 on "trophy," 0.1 on "suitcase," a little smeared everywhere else. The output is a fresh vector for every position, each one a custom blend of the whole sentence's values, weighted by relevance.
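The effect of the Scale step can be checked numerically. The sketch below assumes query and key entries are independent with mean 0 and variance 1 (an idealization, not something a trained model guarantees). Under that assumption the raw dot product has a standard deviation of about sqrt(d_k), and dividing by sqrt(d_k) brings it back to about 1. The helper names score_std and peak_weight are made up for this example.

```python
import numpy as np

def softmax(x):
    # Subtract the row maximum before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def score_std(d_k, n_samples=2000, seed=0):
    """Empirical std of raw and scaled query-key dot products."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    raw = (q * k).sum(axis=-1)          # one dot product per sample
    return raw.std(), (raw / np.sqrt(d_k)).std()

def peak_weight(d_k, n_keys=10, scale=True, seed=0):
    """Largest softmax weight one random query gives to n_keys random keys."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=d_k)
    K = rng.normal(size=(n_keys, d_k))
    scores = K @ q
    if scale:
        scores = scores / np.sqrt(d_k)
    return softmax(scores).max()

for d_k in (4, 64, 1024):
    raw_std, scaled_std = score_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw_std:6.2f}  scaled std={scaled_std:4.2f}  "
          f"peak weight unscaled={peak_weight(d_k, scale=False):.2f}  "
          f"scaled={peak_weight(d_k, scale=True):.2f}")
```

As d_k grows, the unscaled softmax tends to put almost all of its weight on a single key, while the scaled version stays spread out. That collapse toward a hard choice is what the division is there to prevent.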

Many heads, many points of view

One attention pass can only form one kind of blend, one opinion about what matters. But language needs several at once: who-refers-to-what, who-is-the-subject-of-this-verb, what's-the-tense, what's-the-mood. So instead of one big attention computation, the Transformer runs several smaller ones side by side, each with its own W_Q, W_K, W_V. Each one is a head, and the whole arrangement is multi-head attention.

Each head projects the words into a smaller subspace, does its own scaled dot-product attention there, and produces its own blended output. The heads' outputs are then stuck together (concatenated) and passed through one more learned matrix that fuses them back into a single vector per word. Because the heads share no weights, they're free to specialize — and when researchers inspect trained models, some heads really do seem to track grammatical relationships, others nearby words, others rare long-range links.

# one head, sketched in NumPy-style pseudocode (Q, K, V already projected)
scores = Q @ K.T / sqrt(d_k)         # (n, n) grid: every query vs every key
weights = softmax(scores, axis=-1)   # softmax along each row, so rows sum to 1
out = weights @ V                    # weighted blend of values, one row per word

# multi-head: do the above H times in parallel, each head with its own
# W_Q, W_K, W_V, then
# combined = concat(out_1, ..., out_H) @ W_O

Scaled dot-product attention for one head, then how heads combine.
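For readers who want something they can run, here is one way the sketch could be fleshed out in NumPy. It is a minimal illustration under assumptions, not a reference implementation: the weights are random rather than learned, there is no batching or masking, and the per-head size d_model // H is a common convention that this guide does not itself specify. The function names are made up for this example.

```python
import numpy as np

def softmax(x):
    # Subtract the row maximum before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n), rows sum to 1
    return weights @ V                          # (n, d_v)

def multi_head_attention(X, heads, W_O):
    """X: (n, d_model). heads: list of (W_Q, W_K, W_V), one tuple per head.
    W_O: (H * d_head, d_model) matrix that fuses the concatenated heads."""
    outputs = [attention_head(X @ W_Q, X @ W_K, X @ W_V)
               for W_Q, W_K, W_V in heads]
    return np.concatenate(outputs, axis=-1) @ W_O

rng = np.random.default_rng(0)
n, d_model, H = 5, 16, 4
d_head = d_model // H   # assumed convention: split d_model evenly across heads

X = rng.normal(size=(n, d_model))
# Each head gets its own three projection matrices (random placeholders).
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(H)]
W_O = rng.normal(size=(H * d_head, d_model))

out = multi_head_attention(X, heads, W_O)
print(out.shape)   # (5, 16): one fused vector per word
```

The loop over heads is written for readability. Practical implementations usually stack the heads into one larger tensor operation so they run in parallel, but the arithmetic is the same.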

What attention can't see — and what fixes it

Here's a humbling fact: plain self-attention has no idea what order the words came in. Because it computes a weighted average over a set, "dog bites man" and "man bites dog" would look identical to it. That set-based view is exactly what gives attention its speed and its all-at-once parallelism — but word order obviously carries meaning. The fix is positional encoding: before attention runs, each word's embedding gets a position-dependent signal added in, so the vectors quietly carry "I'm word 1," "I'm word 2," and so on.
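The order blindness is easy to demonstrate. In the sketch below the embeddings and weights are random stand-ins, and the position signal is the sinusoidal pattern, which is one well-known choice and an assumption here, since this guide does not commit to a particular scheme. Without positions, "dog" comes out of attention with exactly the same vector in both sentences; with positions added, it does not.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def sinusoidal_positions(n, d_model):
    """Sinusoidal position signal (assumes d_model is even)."""
    pos = np.arange(n)[:, None]                # (n, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = pos / (10000 ** (i / d_model))
    P = np.zeros((n, d_model))
    P[:, 0::2] = np.sin(angles)                # even columns
    P[:, 1::2] = np.cos(angles)                # odd columns
    return P

rng = np.random.default_rng(0)
d_model = 8
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Random stand-in embeddings for the three words.
dog, bites, man = rng.normal(size=(3, d_model))
s1 = np.stack([dog, bites, man])   # "dog bites man"
s2 = np.stack([man, bites, dog])   # "man bites dog"

# No positions: the output for "dog" is the same in both sentences.
out1 = self_attention(s1, W_Q, W_K, W_V)
out2 = self_attention(s2, W_Q, W_K, W_V)
print(np.allclose(out1[0], out2[2]))     # True

# Positions added before attention: the two "dog" outputs now differ.
P = sinusoidal_positions(3, d_model)
out1p = self_attention(s1 + P, W_Q, W_K, W_V)
out2p = self_attention(s2 + P, W_Q, W_K, W_V)
print(np.allclose(out1p[0], out2p[2]))   # False
```

Strictly speaking, shuffling the input rows just shuffles the output rows, so each word's new vector depends on which words are present but not on where they sit. Adding the position signal is what breaks that symmetry.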

The other honest limit is cost. Since every word compares against every other word, the work grows with the square of the sequence length: double the text and you roughly quadruple the computation and memory. That quadratic scaling is why long inputs are expensive and a major reason a model has a finite context length. It is also the central problem that a whole research line (faster attention variants, smarter memory schemes) keeps chipping away at. Attention is powerful precisely because it connects everything to everything; that same property is what makes it hungry.
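A little arithmetic makes the square concrete. The helper below is hypothetical and counts only the n x n score grid for a single head in a single layer, assuming 4-byte (float32) scores; a real model multiplies this by its number of heads and layers.

```python
def attention_score_cost(n_tokens, bytes_per_score=4):
    """Entries and bytes in one n x n attention score matrix.

    Assumes one head, one layer, and 4-byte (float32) scores by default.
    """
    entries = n_tokens * n_tokens
    return entries, entries * bytes_per_score

for n in (1_000, 2_000, 4_000):
    entries, nbytes = attention_score_cost(n)
    print(f"{n:>5} tokens -> {entries:>12,} scores, {nbytes / 1e6:>4.0f} MB")
```

Each doubling of the token count multiplies both numbers by four: 1,000 tokens need a million scores (4 MB), 2,000 need four million (16 MB), and 4,000 need sixteen million (64 MB).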

Step back and the picture is clean. A word starts as a context-blind embedding; attention lets it gather relevant information from everywhere in the sequence; multiple heads gather several kinds at once; and stacking many such layers lets meaning compound, layer after layer, until "it" reliably knows it's the trophy. In the next guide we'll assemble these blocks — plus normalization and feed-forward layers — into the full Transformer, and see why this design swept the field.