The blind spot inside attention
By now you have seen how self-attention lets every token look at every other token and decide how much each one matters. But pause on *how* it does the looking. Each token becomes a query, key, and value, and attention compares queries to keys with a scaled dot product, then uses the resulting weights to take a weighted sum of the values. A sum does not care about the order of its terms. Shuffle the words and the same set of output vectors comes back out, just rearranged. Strictly speaking, attention on its own is permutation-equivariant: it treats a sentence as an unordered bag of tokens, so to it "the dog bit the man" and "the man bit the dog" contain exactly the same information.
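The order-blindness can be checked directly. The following is a minimal sketch, not a real model: it assumes a single attention head with identity projections (so queries, keys, and values are all just the token vectors), and the toy embeddings and the name `self_attention` are illustrative choices.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head attention with identity projections (Q = K = V = x), no positions."""
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot product of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Weighted sum of the values.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

tokens = [[1.0, 0.0, 0.5], [0.2, 0.9, -0.3], [-0.7, 0.4, 0.8]]  # toy embeddings
perm = [2, 0, 1]  # a reordering of the three tokens

out = self_attention(tokens)
out_shuffled = self_attention([tokens[i] for i in perm])

# Shuffling the inputs just shuffles the outputs the same way.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
print(same)  # True
```

Each token receives the same output vector no matter where it sits in the sequence, which is why position has to be supplied separately.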
This is the price the Transformer paid for its great trick. The recurrent networks it replaced read a sentence one word at a time, so order was baked into the very act of reading. The Transformer threw that out to read everything in parallel — fast, but order-blind. So order has to be put back by hand, before attention ever runs.
Handing out the name tags: positional encoding
Each token already arrives as an embedding — a vector that captures *what* the word means. Positional encoding adds a second vector that captures *where* it sits. The two are simply summed, so a token's final representation carries both its meaning and its place. The clever part is what that position vector looks like.
The original Transformer used a fixed pattern of sine and cosine waves of many different wavelengths. Position 0 gets one combination of wave values, position 1 a slightly shifted one, and so on. Because the waves vary smoothly, nearby positions get similar codes, and because the wavelengths differ, distant positions get distinct ones. Crucially, for any fixed offset, the code at position pos + offset is the same linear transformation (a rotation of each sine/cosine pair) of the code at pos, so the *relative* distance between two positions shows up as a consistent shift, which is exactly what attention can exploit. Nothing here is learned; it is pure geometry, computed once.
```python
import math

def position_code(pos, d):
    # A vector of sin/cos waves: sin(pos / 10000^(k/d)) and its cos partner, many wavelengths k
    angles = [pos / 10000 ** (k / d) for k in range(0, d, 2)]
    return [f(a) for a in angles for f in (math.sin, math.cos)]

# meaning + place, summed before attention
embedding = [[0.1 * (pos + j) for j in range(8)] for pos in range(6)]  # placeholder: 6 tokens, width 8
token_vec = [[e + p for e, p in zip(embedding[pos], position_code(pos, 8))]
             for pos in range(len(embedding))]
# nearby pos -> similar code; relative offset -> consistent shift
```
Other schemes exist. Some models *learn* a position vector per slot instead of fixing it. Many of today's large models use rotary position embeddings (often called RoPE), which rotate the query and key vectors by an angle that grows with position, so the dot product between two tokens naturally depends on how far apart they are. The detail to remember is the shared goal: give attention a reliable sense of *relative* distance, not just an absolute index.
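The rotary idea can be illustrated in a few lines. This is a simplified sketch rather than any particular library's implementation: it assumes an even vector width, rotates consecutive pairs of dimensions, and borrows the base of 10000 from the sinusoidal scheme. The helper names `rotate` and `dot` and the toy vectors are made up for illustration.

```python
import math

def rotate(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by an angle proportional to pos."""
    d = len(vec)
    out = []
    for k in range(0, d, 2):
        theta = pos / base ** (k / d)  # slower rotation for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]   # toy query vector
k = [1.0, 0.4, -0.6, 0.9]   # toy key vector

# Same offset (3 positions apart), very different absolute positions.
score_near_start = dot(rotate(q, 5), rotate(k, 2))
score_far_along = dot(rotate(q, 105), rotate(k, 102))
print(math.isclose(score_near_start, score_far_along, abs_tol=1e-9))  # True
```

The two scores agree because rotating both vectors by the same extra angle leaves their dot product unchanged; only the difference in positions survives.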
What a context window actually is
The context window is the maximum number of tokens the model can hold in view at one moment: your prompt plus everything it has generated so far, all counted together. When people say a model has a "128K context," that is the context length: roughly the size of the room everyone is talking in. Go past it and the earliest tokens fall off the edge (in practice, the serving code truncates the history or rejects the request); the model simply cannot attend to what no longer fits.
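One simple way to picture tokens "falling off the edge" is a truncation step applied before each forward pass. The sketch below assumes the simplest policy, keeping only the most recent tokens; real systems may instead reject the request or summarize older text. The function name `fit_to_window` and the integer token ids are placeholders.

```python
def fit_to_window(prompt_tokens, generated_tokens, context_length):
    """Keep only the most recent tokens that fit; older ones are dropped."""
    history = prompt_tokens + generated_tokens  # prompt and output share one budget
    return history[-context_length:] if context_length > 0 else []

prompt = list(range(10))         # stand-in token ids 0..9
generated = list(range(10, 14))  # stand-in token ids 10..13
window = fit_to_window(prompt, generated, context_length=8)
print(window)  # [6, 7, 8, 9, 10, 11, 12, 13]
```

Tokens 0 through 5 are gone from the model's view entirely, even though they were part of the original prompt.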
Two honest clarifications. First, the window is measured in *tokens*, not words — thanks to tokenization, a long or rare word may split into several tokens, so "how much text fits" is fuzzier than the headline number suggests. Second, the window is not memory between separate chats. Once a conversation ends, nothing carries over; each new request rebuilds the whole context from scratch. The model has no diary.
Why is the window finite at all? Because attention's cost grows with the *square* of the sequence length: double the tokens and you roughly quadruple the comparisons, since every token still attends to every other. A longer window is not a switch someone forgot to flip — it is a real bill in compute and memory that someone has to pay on every forward pass.
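The arithmetic behind "double the tokens, quadruple the comparisons" is short enough to write out. This sketch counts query-key pairs for one attention head in one layer, assuming full (unmasked) attention; the function name is illustrative.

```python
def attention_comparisons(n_tokens):
    """Query-key pairs scored in one full self-attention pass (per head, per layer)."""
    return n_tokens * n_tokens

for n in (1_000, 2_000, 4_000):
    print(n, attention_comparisons(n))
# 1000 1000000
# 2000 4000000
# 4000 16000000
```

Each doubling of the sequence multiplies the count by four, and that count is repeated for every head and every layer.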
The KV cache: why generation speeds up after the first word
When a model writes text, it does so by autoregressive decoding: predict one token, append it, predict the next. Naively, each new token would re-run attention over the entire history from scratch — and since the history only grows, that would get slower and slower with every word. The fix is the KV cache.
Here is the key observation: once a token's key and value vectors have been computed, they never change, because causal masking means each token only looks backward and later tokens cannot alter earlier ones. Token 5's key is the same whether the sentence is 6 tokens long or 600. So the model computes each token's key and value once, stores them, and reuses them forever after. To generate the next token it only needs to compute the newest token's query, key, and value, append that key and value to the cache, compare the one new query against all the cached keys, and combine the cached values. The expensive history is paid for once, not on every step.
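The caching logic can be sketched with toy numbers. This is an illustration under loud assumptions: a single head, a single layer, and per-dimension scalings (`W_Q`, `W_K`, `W_V`) standing in for real projection matrices. The helper names `attend` and `project` are invented here.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """One query against a list of keys/values (scaled dot-product attention)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

def project(x, scale):
    """Toy projection: elementwise scaling standing in for a weight matrix."""
    return [xi * s for xi, s in zip(x, scale)]

W_Q, W_K, W_V = [1.0, 0.5, -1.0], [0.8, -0.3, 0.6], [0.2, 1.1, 0.7]
tokens = [[0.1, 0.4, -0.2], [0.9, -0.5, 0.3], [-0.6, 0.2, 0.8], [0.4, 0.4, -0.9]]

key_cache, value_cache, cached_outputs = [], [], []
for x in tokens:                        # one decoding step per token
    key_cache.append(project(x, W_K))   # computed once, then reused
    value_cache.append(project(x, W_V))
    cached_outputs.append(attend(project(x, W_Q), key_cache, value_cache))

# Reference: recompute every key and value from scratch at the final step.
all_keys = [project(x, W_K) for x in tokens]
all_values = [project(x, W_V) for x in tokens]
recomputed = attend(project(tokens[-1], W_Q), all_keys, all_values)

print(all(math.isclose(a, b) for a, b in zip(cached_outputs[-1], recomputed)))  # True
```

The cached path does one projection per step instead of one per token in the history, yet it produces the same output as the from-scratch path.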
Long context: harder than "just make it bigger"
Stretching the window runs into three walls at once. The quadratic cost of attention makes raw compute explode. The KV cache balloons in memory. And positional encodings trained on short sequences often *extrapolate* poorly — feed the model positions far beyond anything it saw in training and its sense of distance can quietly break down.
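To get a feel for how the KV cache balloons, a back-of-the-envelope estimate helps. The sketch below assumes the cache stores one key and one value vector per token, per layer, per key/value head, at 2 bytes per number. The model shape is hypothetical, picked only to make the arithmetic concrete.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values for every layer, head, and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

# Hypothetical model shape, chosen only for illustration.
config = dict(n_layers=32, n_kv_heads=8, head_dim=128)
per_token = kv_cache_bytes(1, **config)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

for n in (8_000, 128_000):
    gib = kv_cache_bytes(n, **config) / 2**30
    print(f"{n:>7} tokens -> {gib:.1f} GiB")
```

The cost per token is fixed, so memory grows linearly with context length, and it is paid separately for every sequence being served at once.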
Researchers chip away at each wall. FlashAttention reorganizes the computation to use memory far more efficiently, making long sequences practical without changing the math. Other approaches sparsify or approximate attention so each token attends to a chosen subset rather than everyone. RoPE-style encodings can be rescaled to reach beyond their training length. None of these is a free lunch — each trades exactness, generality, or implementation complexity for reach.
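As one concrete picture of sparsified attention, consider a causal sliding window, where each token attends only to itself and a fixed number of recent predecessors. This is just one possible pattern, chosen here as an assumption for illustration; the function name and sizes are made up.

```python
def sliding_window_pairs(n_tokens, window):
    """(query, key) index pairs when each token sees itself and the previous window - 1 tokens."""
    return [(q, k) for q in range(n_tokens) for k in range(max(0, q - window + 1), q + 1)]

def full_causal_pairs(n_tokens):
    """(query, key) index pairs when each token sees itself and every earlier token."""
    return [(q, k) for q in range(n_tokens) for k in range(q + 1)]

pairs = sliding_window_pairs(n_tokens=6, window=3)
print(len(pairs))                 # 15
print(len(full_causal_pairs(6)))  # 21
```

With a fixed window the pair count grows roughly linearly in sequence length rather than quadratically, at the price of no direct path between distant tokens.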
Here is the part the marketing skips. A big window means the model *can* read a lot — not that it reads it *well*. On "needle in a haystack" tests, models often retrieve facts buried at the very start or end of a long context while missing things parked in the middle. A long window is a capacity, not a guarantee of attention. Treat "1M tokens" as a ceiling on what the model can ingest, not a promise of what it will faithfully use.
Why care, concretely? Because window size and the KV cache together drive inference cost — the latency and dollars behind every reply. Understanding them turns vague frustration ("the long document answers feel worse and slower") into a diagnosis you can act on, and it explains why even a large language model with a giant window still benefits from being fed *relevant* text rather than *everything* you have.