One token at a time
By now you know that a large language model is a transformer trained, during pretraining, on a single deceptively simple game: predict the next piece of text. When you chat with it, that same game is all it ever does. The model never plans a whole paragraph in advance and then types it out. Instead it generates one token at a time, and each token it produces is fed back in as part of the input for the next step.
A token is a chunk of text — often a word, sometimes a fragment like "-ing" or a single character — produced by tokenization (you saw this earlier with schemes like byte-pair encoding). On every step the model reads the entire text so far, runs it through its layers, and outputs not a word but a score for *every* token in its vocabulary — tens of thousands of numbers. A softmax turns those scores into a probability distribution: a list saying "there's a 40% chance the next token is ' Paris', 12% it's ' the', 0.001% it's ' banana'," and so on across the whole vocabulary.
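As a minimal sketch of that last step, the snippet below turns a handful of raw scores into probabilities with a softmax. The four-token vocabulary and the score values are made up for illustration; a real model would emit one score for every entry in its vocabulary.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and scores, not taken from any real model.
vocab = [" Paris", " the", " Lyon", " banana"]
scores = [4.0, 2.8, 2.0, -6.0]

probs = softmax(scores)
for token, p in zip(vocab, probs):
    print(f"{token!r}: {p:.4f}")
```

With these made-up scores, ' Paris' ends up holding most of the probability mass and ' banana' lands close to zero, which is the shape of list the paragraph above describes.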
This loop — predict a distribution, pick a token, append it, repeat — is called autoregressive decoding. "Auto-regressive" just means each output depends on the model's own previous outputs. The text grows left to right, like a writer who can only ever choose the very next word and never look ahead, until a special "stop" token appears or a length limit is hit.

    context = tokenize(prompt)
    while not done:
        scores = model(context)        # one score per vocab token
        probs = softmax(scores / T)    # T = temperature
        next = sample(probs)           # pick ONE token
        context.append(next)           # feed it back in
        done = (next == END) or len(context) > limit

From a probability list to a single word
Here is the crucial gap: the model hands you a *distribution*, but you need exactly *one* token to write down. How you choose is called the decoding strategy or sampling strategy, and it is a separate dial from the model's weights — you can change it without retraining anything. The simplest choice is greedy decoding: always take the single most likely token. It sounds safe, but it tends to produce flat, repetitive text and can get stuck in loops ("the the the").
The alternative is to genuinely sample: roll a weighted die, where each token's chance of being picked equals its probability. Now ' Paris' usually wins but not always, and the text gains variety and life. The art is controlling *how adventurous* that die is, and that is exactly what temperature, top-k, and top-p let you do.
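The difference between the two strategies can be sketched in a few lines. The function names `greedy_pick` and `sample_pick`, the toy vocabulary, and the probabilities are all assumptions made for this illustration.

```python
import random
from collections import Counter

def greedy_pick(probs):
    """Greedy decoding: index of the single most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_pick(probs, rng):
    """Sampling: roll a weighted die with one face per token."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

vocab = [" Paris", " the", " Lyon"]  # made-up toy vocabulary
probs = [0.7, 0.2, 0.1]              # made-up probabilities

rng = random.Random(0)  # fixed seed so the demo is repeatable
greedy_choices = Counter(vocab[greedy_pick(probs)] for _ in range(1000))
sampled_choices = Counter(vocab[sample_pick(probs, rng)] for _ in range(1000))

print("greedy: ", dict(greedy_choices))
print("sampled:", dict(sampled_choices))
```

Greedy picks ' Paris' every single time, while sampling picks it most of the time and occasionally lands on the other tokens, roughly in proportion to their probabilities.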
Temperature, top-k, and top-p
Temperature reshapes the distribution before you draw from it. Notice the `scores / T` in the loop above: dividing the scores by a number T before the softmax sharpens or flattens the curve. At T = 1 the distribution is left unchanged. A low temperature (say 0.2) makes the peaks sharper: the top choices get even more probable, so output is focused and predictable. A high temperature (say 1.2) flattens the distribution, giving rare tokens a real shot, so output is creative but riskier. As T approaches 0 the result approaches greedy decoding; since you cannot actually divide by zero, implementations typically treat a temperature of 0 as a special case and simply take the most likely token. Sampling with this dial is called temperature sampling.
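To see the effect numerically, here is a small sketch that applies three temperatures to the same made-up scores. The function name `softmax_with_temperature` is an assumption for this example, and it deliberately rejects a temperature of 0 rather than dividing by zero.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Softmax of scores / temperature; temperature must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding for 0")
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [4.0, 2.8, 2.0]  # made-up scores for three tokens

cold = softmax_with_temperature(scores, 0.2)     # sharper peaks
neutral = softmax_with_temperature(scores, 1.0)  # plain softmax
hot = softmax_with_temperature(scores, 1.2)      # flatter curve

for name, dist in [("T=0.2", cold), ("T=1.0", neutral), ("T=1.2", hot)]:
    print(name, [round(p, 3) for p in dist])
```

The top token's share grows as the temperature drops and shrinks as it rises, while the least likely token moves in the opposite direction.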
Top-k and top-p instead *prune* the list before sampling, so the die can never land on absurd tokens. Top-k keeps only the k most probable tokens (say the top 40), zeroes out the rest, and rescales the survivors so they sum to 1 again. Top-p, also called *nucleus sampling*, adapts to the model's confidence: it keeps the smallest set of top tokens whose probabilities add up to at least p (say 0.9), then samples among those. So when the model is confident, top-p might keep just two or three candidates; when it's unsure, it keeps many. These two cutoffs are known as top-k and top-p sampling, and in practice people often combine a temperature with a top-p cap.
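Both cutoffs can be sketched as small filters over a probability list. The function names and the two example distributions below are assumptions for illustration; production libraries usually apply the same idea to the raw scores in batches.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest top set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def survivors(probs):
    """Count how many tokens still have a nonzero chance."""
    return sum(1 for x in probs if x > 0)

confident = [0.85, 0.10, 0.03, 0.02]  # made-up: model is fairly sure
unsure = [0.30, 0.25, 0.25, 0.20]     # made-up: model is torn

print(survivors(top_p_filter(confident, 0.9)))  # few candidates survive
print(survivors(top_p_filter(unsure, 0.9)))     # many candidates survive
print(survivors(top_k_filter(unsure, 2)))       # always exactly k
```

Note how the same p keeps a different number of tokens for the two distributions, whereas top-k keeps a fixed count regardless of how confident the model is.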
The context window: the model's whole world
At every step the model reads "all the text so far" — but there is a hard ceiling on how much that can be. The context window (also called the context length) is the maximum number of tokens the model can attend to at once: your system instructions, the conversation history, any documents you pasted, and the reply being generated all share this single budget. Modern models range from a few thousand tokens to hundreds of thousands, but the limit is always finite.
When a conversation grows past the window, something has to give. The oldest tokens are typically dropped, and the model cannot see anything that was dropped, which is why a long chat can seem to "forget" what you said at the start. The window is finite largely because the self-attention mechanism is costly: every token is compared against every other, so doubling the context can roughly quadruple the work. Understanding the window explains a lot of real behavior, and it's one reason techniques like retrieval-augmented generation exist: they feed the model only the most relevant slices of a huge knowledge base rather than everything at once.
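A rough sketch of both points follows. `fit_to_window` shows one simple policy (keep only the most recent tokens), and `attention_pairs` counts token-to-token comparisons. Both helper names are hypothetical, and real applications may summarize or reject long inputs instead of truncating them.

```python
def fit_to_window(tokens, window, reserve_for_reply=0):
    """Keep only the most recent tokens that fit in the window.

    reserve_for_reply leaves room for the tokens still to be generated,
    since prompt and reply share one budget.
    """
    budget = window - reserve_for_reply
    if budget <= 0:
        raise ValueError("window too small for the reserved reply length")
    return tokens[-budget:]

def attention_pairs(n_tokens):
    """Number of token-to-token comparisons in full self-attention."""
    return n_tokens * n_tokens

history = list(range(10))  # stand-in for ten token ids, oldest first
visible = fit_to_window(history, window=8, reserve_for_reply=3)
print(visible)  # the five most recent tokens; the oldest five are gone

print(attention_pairs(2000) / attention_pairs(1000))  # doubling costs 4x
```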
Why outputs vary — and what that really means
Now the famous question: ask the same model the same thing twice and you may get two different answers. Why? Almost always it's the sampling. If temperature is above zero, you are rolling that weighted die at every single token, and a different roll early on sends the sentence down a different path. Set temperature to 0 (greedy) and the model becomes far more *deterministic* — though even then, tiny numerical differences in how the hardware sums up floating-point numbers across parallel cores can occasionally flip a close call.
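Both sources of variation can be demonstrated on a toy scale. The `toy_reply` function below is a made-up stand-in for a sampled generation, with its randomness pinned by a seed; whether a real serving stack exposes a seed is an assumption that varies by provider. The last lines show that floating-point addition depends on the order of operations, which is the root of the hardware-level flips mentioned above.

```python
import random

def toy_reply(seed, length=8):
    """Draw a short 'reply' from a fixed made-up distribution."""
    rng = random.Random(seed)
    vocab = ["a", "b", "c", "d"]
    weights = [0.4, 0.3, 0.2, 0.1]
    return rng.choices(vocab, weights=weights, k=length)

# Same seed, same rolls of the die, same text.
same_a = toy_reply(seed=42)
same_b = toy_reply(seed=42)
print(same_a == same_b)

# Summing the same numbers in a different order gives a different result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)
print(abs(left_to_right - right_to_left))
```

The gap between the two sums is tiny, but when two candidate tokens have nearly identical scores, a difference that small can be enough to change which one comes out on top.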
Here is the honest part, and it corrects a common misconception. Sampling makes text *fluent and varied*, not *true*. The model is choosing tokens that are statistically likely given its training — it has no separate check that the claim is correct. When the most-likely continuation happens to be false, you get a confident, well-formed falsehood: a hallucination. Lowering temperature reduces randomness, but it does not make the model more honest; it just makes it more reliably repeat whatever it would most likely say, right or wrong.
It also helps to deflate two myths. First, the model does not "try all answers at once": autoregressive decoding is strictly one token after another. Second, the variety you see is not a sign of creativity or understanding welling up; it is deliberate randomness that you can turn down to zero. The genuinely surprising thing about these models, that a next-token predictor trained at huge scale can translate, summarize, and reason in steps it was never explicitly taught, comes from the scale of training (see scaling laws and the debate over so-called emergent abilities), not from the dice you roll at generation time.
Putting the dials to work
You now have the whole picture: a transformer turns context into a probability list, a sampling strategy collapses that list into one token, the token is appended, and the loop runs again inside a finite window. Everything you tune at generation time lives in that collapse step. A short checklist for choosing settings:
- Need facts, code, or anything you'll repeat? Use a low temperature (0–0.3) and a tight top-p — predictable, focused, easy to test.
- Want brainstorming, varied phrasings, or fiction? Raise temperature toward 0.8–1.0 and loosen top-p to about 0.95.
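- Getting repetitive loops or gibberish? Don't crank temperature blindly. Loops usually mean decoding is too greedy, so nudge temperature up a little; gibberish usually means it is too hot, so first add a top-k/top-p cutoff so very unlikely tokens are pruned out.
- Model losing track of earlier material, or rejecting long inputs outright? You're likely overflowing the context window, so trim history or switch to retrieval rather than pasting everything.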
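The checklist can be tied back to the generation loop with a small end-to-end sketch. Everything here is a stand-in: `toy_model` is a hypothetical five-token "model" that prefers the next word in a fixed sentence, and the preset values are assumptions chosen from the ranges above (the checklist does not give a number for a "tight" top-p, so 0.5 is a guess).

```python
import math
import random

VOCAB = ["the", "cat", "sat", "down", "<END>"]  # made-up toy vocabulary
END = len(VOCAB) - 1

# Hypothetical presets; the exact numbers are assumptions to tune per task.
PRESETS = {
    "factual": {"temperature": 0.0, "top_p": 0.5},
    "creative": {"temperature": 0.9, "top_p": 0.95},
}

def toy_model(context):
    """Stand-in for a real model: prefers the token after the last one."""
    scores = [0.0] * len(VOCAB)
    scores[min(context[-1] + 1, END)] = 3.0
    return scores

def pick_token(scores, temperature, top_p, rng):
    """Collapse one score list into one token id."""
    if temperature == 0:  # greedy special case, avoids dividing by zero
        return max(range(len(scores)), key=lambda i: scores[i])
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus cutoff: keep the smallest top set whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

def generate(prompt, preset, limit=10, seed=0):
    """Run the loop on a non-empty prompt until <END> or the length limit."""
    settings = PRESETS[preset]
    rng = random.Random(seed)
    context = list(prompt)
    while len(context) < limit:
        token = pick_token(toy_model(context), settings["temperature"],
                           settings["top_p"], rng)
        context.append(token)
        if token == END:
            break
    return context

factual = generate([0], "factual")
creative = generate([0], "creative")
print(" ".join(VOCAB[t] for t in factual))
print(" ".join(VOCAB[t] for t in creative))
```

The "factual" preset always walks straight through the sentence to `<END>`, while the "creative" preset may wander depending on the seed. Keeping the settings in one dictionary makes it easy to change the dials without touching the loop itself.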
With this, the path from raw probabilities to readable prose is no longer a black box. The next guide steps outside text entirely, into how *diffusion* and multimodal models generate images — a very different decoding story, but built on the same idea of turning a learned distribution into something concrete.