The Attention Idea

The bottleneck that started it all

In the last rung you met the recurrent network and its job of reading a sequence one step at a time. To translate a sentence, the classic sequence-to-sequence setup had an encoder read the whole input and squeeze it into a single fixed-size vector — a kind of compressed summary — which a decoder then unrolled into the output. It worked, and for a while it was the state of the art. But notice the quiet violence in that design: *every* sentence, whether five words or fifty, gets crushed into a vector of the same length.

Imagine being asked to read a paragraph, then write it down word-for-word in another language — but you may only keep a single sticky note while reading, and you cannot look back at the original. For a short phrase, fine. For a long, winding sentence, the note overflows; details near the start blur out by the time you reach the end. This is the fixed-size summary problem, and it is not a bug you can tune away. It is built into the shape of the architecture.

Let the model look back

The fix is almost embarrassingly natural. Instead of forcing the encoder to hand over one frozen summary, keep *all* of its step-by-step representations — one vector per input word — and let the decoder reach back into that pile whenever it needs to. When it is about to produce the next output word, it asks: of all the input words, which ones matter *right now*? That question — "which words should I pay attention to?" — is the seed of the entire attention mechanism.

Concretely: when translating "the cat sat on the mat" into French and the model is deciding the word for "sat," it should lean hard on the input word *sat* and only lightly on *the* or *mat*. Attention gives each input word a weight — a number saying how relevant it is to the word being produced — and then blends the input vectors together in those proportions. The result is a fresh summary, custom-built for this exact moment, instead of one frozen note reused for the whole sentence.
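In symbols, the blend can be written as below. The notation is an assumption introduced here for illustration: $h_i$ stands for the encoder's vector for input word $i$, $n$ for the number of input words, and $\alpha_{t,i}$ for the weight placed on word $i$ while producing output word $t$.

```latex
c_t = \sum_{i=1}^{n} \alpha_{t,i}\, h_i,
\qquad \alpha_{t,i} \ge 0,
\qquad \sum_{i=1}^{n} \alpha_{t,i} = 1
```

Here $c_t$ is the custom-built summary for step $t$. It is recomputed for every output word, so the weights can shift as the translation moves along.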

Notice the shift in mindset. The model no longer needs to memorize the whole sentence in advance. It only needs to know, at each step, *where to look* — and the original information stays available, undamaged, the whole time. We have traded an act of compression for an act of selective retrieval.

Weighted focus: a soft average, not a single pick

Here is the subtle, beautiful part. Attention does *not* point at one winning word and ignore the rest. That would be a hard choice — pick the single most relevant input and use only it. Instead it spreads its focus *softly*: maybe 70% on *sat*, 15% on *cat*, and small slivers across everything else, with all the weights adding up to 1. The output is a weighted blend of all the input vectors. This is why people call it soft attention — focus is a smooth distribution, not an on/off switch.
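To make the contrast concrete, here is a minimal Python sketch. Only the 70% and 15% figures come from the text above. The two-dimensional word vectors and the remaining weights are invented for illustration, and the function names `soft_blend` and `hard_pick` are hypothetical.

```python
# Toy 2-D vectors for each input word; the numbers are made up for illustration.
words = ["the", "cat", "sat", "on", "the", "mat"]
vectors = [
    [0.1, 0.0],  # the
    [0.9, 0.2],  # cat
    [0.2, 1.0],  # sat
    [0.0, 0.1],  # on
    [0.1, 0.0],  # the
    [0.8, 0.3],  # mat
]

# Assumed attention weights: 70% on "sat", 15% on "cat", slivers elsewhere.
weights = [0.03, 0.15, 0.70, 0.04, 0.03, 0.05]
assert abs(sum(weights) - 1.0) < 1e-9  # the weights must sum to 1


def soft_blend(weights, vectors):
    """Soft attention: weighted average of all the vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def hard_pick(weights, vectors):
    """Hard attention: keep only the vector with the largest weight."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return vectors[best]


soft = soft_blend(weights, vectors)
hard = hard_pick(weights, vectors)
print("soft blend:", [round(x, 3) for x in soft])
print("hard pick: ", hard)
```

The soft blend lands close to the vector for *sat* but is pulled slightly toward *cat* and the others, while the hard pick returns the *sat* vector unchanged and discards everything else.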

Why soft and not hard? Two reasons, both deep. First, real language is genuinely ambiguous: "it" might refer mostly to one noun but partly to another, and a blend captures that honestly. Second, and more practically, a soft blend is differentiable: you can take its gradient and train the parameters that produce the scores end-to-end with backpropagation, exactly as you learned to do for ordinary networks. A hard pick behaves like a step: nudging a score slightly either changes nothing or flips the choice outright, so there is no useful slope to follow. Smoothness is what makes attention *learnable*.
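One way to see the "useful slope" is to nudge a single score and watch what happens to the output. The sketch below is a toy under stated assumptions: the values are single numbers rather than vectors, the scores are invented, and a finite difference stands in for real backpropagation. The names `soft_output` and `hard_output` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_output(scores, values):
    """Soft attention: softmax-weighted average of the values."""
    return sum(w * v for w, v in zip(softmax(scores), values))


def hard_output(scores, values):
    """Hard attention: the value whose score is highest."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return values[best]


# Invented toy numbers: three scalar values and their relevance scores.
values = [1.0, 5.0, 9.0]
scores = [2.0, 1.0, 0.5]
eps = 1e-4
nudged = [scores[0], scores[1] + eps, scores[2]]  # raise the second score a little

# Finite-difference slope of the output with respect to the second score.
soft_slope = (soft_output(nudged, values) - soft_output(scores, values)) / eps
hard_slope = (hard_output(nudged, values) - hard_output(scores, values)) / eps

print("soft slope:", soft_slope)  # nonzero: a signal training can follow
print("hard slope:", hard_slope)  # 0.0: the winning index did not change
```

The soft output moves a little whenever a score moves, which tells training which direction to push. The hard output stays flat until the winner changes, and then it jumps.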

How do raw relevance scores become a clean set of weights that sum to 1? With a function you already know: the softmax. It exponentiates each number and divides by the total, so any list of numbers becomes positive fractions that add to one, with the gaps widened in favor of the big ones. Feed the relevance scores through softmax and out come the attention weights, ready to blend.
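As a refresher, softmax can be sketched in a few lines of plain Python. The scores below are invented, and subtracting the maximum before exponentiating is a common numerical-stability habit rather than something the text above requires; it does not change the result.

```python
import math


def softmax(scores):
    """Map any list of real numbers to positive weights that sum to 1."""
    m = max(scores)  # shifting by the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Invented relevance scores for four input words.
scores = [2.0, 1.0, 0.0, -1.0]
weights = softmax(scores)

print([round(w, 3) for w in weights])  # roughly [0.644, 0.237, 0.087, 0.032]
```

Notice that negative scores still produce positive weights, and that a gap of 1 between two scores becomes a factor of about 2.72 (that is, e) between their weights.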

A soft lookup: queries, keys, and values

There is an even cleaner way to think about all this, and it is the framing the rest of this rung will run on. Think of an ordinary dictionary lookup. You have a query (what you're searching for), a set of keys (the labels things are filed under), and values (the contents). You match your query against the keys, find the one that fits, and return its value. Attention is the *soft* version of exactly this — a query-key-value lookup where, instead of one exact match, you get a weighted blend across all of them.

How does the model measure how well a query "fits" a key? Both are just vectors (points in space, the same kind of embedding you met earlier), so it uses a similarity score. The simplest choice is the dot product: large and positive when two vectors point the same way, near zero when they are perpendicular, and negative when they point in opposite directions. Score the query against every key, softmax those scores into weights, then take the weighted sum of the values. That three-line recipe is the heart of attention.

# soft lookup over a memory of (key, value) pairs
# (pseudocode: dot and softmax are assumed helpers; w * v scales the vector v by the number w)
scores  = [ dot(query, k) for k in keys ]   # how well query fits each key
weights = softmax(scores)                    # positive, sum to 1
output  = sum( w * v for w, v in zip(weights, values) )  # weighted blend
# an exact-match lookup would give weights like [0, 0, 1, 0]; attention stays soft

Attention in three lines: score, normalize, blend. A hard lookup snaps to one value; a soft lookup mixes them all by relevance.
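For readers who want to run the recipe, here is one possible pure-Python version. The helper names `dot` and `softmax` mirror the pseudocode above, but their bodies, the wrapper function `attend`, and the toy query, keys, and values are all assumptions made for this sketch, not taken from any particular library.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Positive weights that sum to 1 (max-shifted for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Soft lookup: score the query against each key, normalize, blend values."""
    scores = [dot(query, k) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Toy memory (invented numbers): three keys, each paired with a 2-D value.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]  # points the same way as the first key

weights, output = attend(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("output: ", [round(x, 3) for x in output])
```

Most of the weight goes to the first key, so the output sits close to the first value `[10.0, 0.0]` without matching it exactly. A hard lookup would return that value untouched, whereas the soft version lets the other two values contribute in proportion to their weights.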

The next guide will reveal the twist that makes this explode in power: the queries, keys, and values don't have to come from two different sentences. Let every word in a *single* sentence emit its own query and also serve as a key and value for the others, and you get self-attention — the engine of the Transformer. But the whole edifice rests on the modest idea you just built: a lookup that's allowed to be fuzzy.

What attention is — and isn't

It is tempting, once you see those weights, to read them as the model's *reasons*. Researchers even draw heatmaps of attention and call them explanations. Be careful here. Attention weights show *where information was gathered from*, which is suggestive, but they are not a faithful account of why the model decided what it did. Research on attention as explanation has found that, in some models, you can permute the weights or swap in a very different set and still get nearly the same prediction. Attention is a mechanism for routing information, not a window into a mind.

A second honest caveat: attention did not, on its own, invent intelligence. It was first bolted *onto* recurrent translation models around 2014 and made them clearly better — it was an improvement, not yet a revolution. The revolution came later, when researchers asked whether you could throw the recurrence away entirely and keep *only* attention. That story, and the famous paper title behind it, is for a later guide. For now, resist the breathless framing: attention is a wonderfully good idea, not magic.