Language Modeling: Predicting the Next Word

The one game behind everything

You have already met the pieces: text gets chopped into tokens, and each token can be turned into a vector with word2vec or GloVe so that meaning becomes geometry. Now we put those pieces to work on a single, almost childish task — language modeling: given the words so far, guess what comes next. "The cat sat on the ___." You felt "mat" or "floor" before you finished reading. That reflex, made mechanical, is the whole field.

More precisely, a language model assigns a probability to every possible next token given the context. Multiplying those conditional probabilities together, one per token, gives the probability of a whole sentence. The model does not output one word; it outputs a distribution over the entire vocabulary, perhaps 50,000 numbers that sum to one. "mat" might get 0.4, "floor" 0.2, "refrigerator" 0.0001. Everything else flows from that one habit of scoring what comes next.
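Scoring a whole sentence from next-token probabilities is the chain rule of probability. As a sketch, writing w_1 through w_T for the tokens of a sentence (the notation is an assumption introduced here, not something defined earlier in this guide):

```latex
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```

Each factor on the right is exactly the "what comes next" distribution described above, evaluated at the token that actually appeared.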

Counting: the n-gram era

The oldest honest approach is just to count. An n-gram model estimates the next word from the last n−1 words by tallying how often that combination appeared in a big pile of text. A bigram model looks back one word, a trigram two. Want P("mat" | "on", "the")? Count how many times "on the mat" appeared, divide by how many times "on the" appeared. That is it — no neurons, just bookkeeping. This is maximum-likelihood estimation in its plainest dress.

P(w | context) = count(context + w) / count(context)

# trigram example
P("mat" | "on","the") = count("on the mat") / count("on the")

An n-gram model is just division of counts — and that simplicity is both its charm and its ceiling.
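The division of counts can be sketched in a few lines of Python. This is a minimal illustration, not a production model: the tiny corpus and the helper names `train_ngram_counts` and `mle_prob` are made up for the example.

```python
from collections import Counter

def train_ngram_counts(tokens, n=3):
    """Count every n-gram, and every (n-1)-word context that is followed by a word."""
    last_start = len(tokens) - n + 1
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(last_start))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(last_start))
    return ngrams, contexts

def mle_prob(word, context, ngrams, contexts):
    """Maximum-likelihood estimate: count(context + word) / count(context)."""
    context = tuple(context)
    if contexts[context] == 0:
        return 0.0  # unseen context: the ratio is undefined, so this sketch returns 0
    return ngrams[context + (word,)] / contexts[context]

# Hypothetical toy corpus, already split into tokens.
corpus = ("the cat sat on the mat . the dog sat on the floor . "
          "the cat lay on the mat .").split()

ngrams, contexts = train_ngram_counts(corpus, n=3)
p_mat = mle_prob("mat", ("on", "the"), ngrams, contexts)
p_floor = mle_prob("floor", ("on", "the"), ngrams, contexts)
p_rug = mle_prob("rug", ("on", "the"), ngrams, contexts)
print(p_mat, p_floor, p_rug)
```

In this toy corpus "on the" appears three times, twice followed by "mat" and once by "floor", so the estimates are two thirds and one third. A word never seen after "on the", such as "rug", gets exactly zero.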

This works startlingly well for short ranges, and it powered phone keyboards and early speech recognition for decades. But two walls appear fast. First, sparsity: most plausible 4-word combinations never appear in your data, so their count is zero and the model claims they are impossible. Smoothing techniques, which reserve a little probability for unseen combinations, patch this, but only patch it. Second, the fixed, short window: in "The keys that I left in the kitchen this morning are", a trigram choosing between "is" and "are" sees only "this morning", so it has no way to know that the subject is the plural "keys". Meaning that lives across a sentence is invisible to it.
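To make the sparsity problem and one patch for it concrete, here is a sketch of add-k (Laplace) smoothing. The text above does not say which smoothing method it has in mind; add-k is chosen here only because it is the simplest, and the corpus and the function name `laplace_prob` are hypothetical.

```python
from collections import Counter

def laplace_prob(word, context, trigrams, contexts, vocab_size, k=1.0):
    """Add-k estimate: (count(context + word) + k) / (count(context) + k * V)."""
    context = tuple(context)
    return (trigrams[context + (word,)] + k) / (contexts[context] + k * vocab_size)

# Hypothetical toy corpus, already split into tokens.
tokens = "the cat sat on the mat . the dog sat on the floor .".split()
vocab = sorted(set(tokens))

trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
# Context counts are summed from the trigram counts so the smoothed
# probabilities for each context add up to one.
contexts = Counter()
for (w1, w2, _w3), count in trigrams.items():
    contexts[(w1, w2)] += count

# "on the dog" never occurs, so the plain count ratio would be zero.
p_dog = laplace_prob("dog", ("on", "the"), trigrams, contexts, len(vocab))
p_mat = laplace_prob("mat", ("on", "the"), trigrams, contexts, len(vocab))
total = sum(laplace_prob(w, ("on", "the"), trigrams, contexts, len(vocab)) for w in vocab)
print(p_dog, p_mat, total)
```

The unseen continuation now gets a small nonzero probability, paid for by shaving probability off the continuations that were seen. With a realistic vocabulary the added mass can swamp the real counts, which is one reason this remains a patch rather than a cure.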

From counting to understanding: neural LMs

The neural turn eases sparsity by refusing to treat words as opaque symbols. Instead of counting "dog" and "puppy" as unrelated strings, a neural LM feeds their embeddings, those dense meaning-vectors, into a network. Now "puppy" can borrow from everything the model learned about "dog", because they sit near each other in vector space. The model doesn't memorize phrases; it composes a prediction from features. A phrase it has never seen can still get a sensible probability.

To read a whole sentence in order, many early neural LMs used a recurrent network that walks left to right, carrying a running summary, called a hidden state, and updating it at each word. In principle that state can remember the subject from twelve words ago; in practice plain RNNs forget fast, which is why LSTMs added gates to hold onto information longer. The final layer turns the network's output into a probability distribution over the vocabulary using softmax, so the whole thing is still answering the same question: what comes next?
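The softmax step can be sketched in plain Python. The three-word vocabulary and the raw scores (often called logits) are invented for illustration; a real model would produce one score per vocabulary entry.

```python
import math

def softmax(logits):
    """Turn arbitrary real-valued scores into probabilities that sum to one."""
    m = max(logits)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores from a network's final layer for the slot in "on the ___".
vocab = ["mat", "floor", "refrigerator"]
logits = [2.0, 1.3, -6.0]

probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.4f}")
```

Higher scores become higher probabilities, every probability is positive, and the outputs always sum to one, which is what lets the network's output be read as a distribution over next tokens.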

How do we know it is getting better? We measure surprise. A model that confidently assigns high probability to the words that actually appear is "less surprised" by real text. The standard score is perplexity: the exponential of the average negative log-probability the model gave to each word that actually appeared, or roughly how many words the model is effectively choosing between at each step. A perplexity of 20 means it is about as confused as someone picking from 20 equally likely options; lower is better. Training nudges the network, via gradient descent, to drive that surprise down across billions of words.
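Perplexity is short enough to compute by hand. The sketch below assumes you already have the probability the model assigned to each token that actually appeared; the function name and the example probabilities are illustrative.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability given to the actual tokens."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# A model that always spreads its bets evenly over 20 options.
uniform_20 = perplexity([1 / 20] * 5)

# A hypothetical model that gave 0.4 and then 0.2 to the two words that appeared.
two_tokens = perplexity([0.4, 0.2])

print(uniform_20, two_tokens)
```

The uniform case comes out at 20 (up to floating-point rounding), matching the "20 equally likely options" reading. The second case comes out near 3.5, because the model put much more weight on the right words.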

How a guessing game becomes a capability

Here is the surprise that reshaped the whole field: to predict the next word really well, a model is quietly forced to learn almost everything else. To finish "The capital of France is ___" you need a fact. To finish "She poured the water until the glass was ___" you need a hint of physics. To close a quotation or balance a parenthesis you need syntax. None of this is taught directly — it falls out as a side effect of being good at the guessing game.

Once a model can score the next token, you can also let it write. Feed it a prompt, sample a token from its distribution, append that token, and ask again — a loop called autoregressive decoding. Do it a few hundred times and you have a paragraph. Crucially, the model was never trained to "write essays"; it was trained to predict, and writing is just prediction run forward. This single objective, scaled up with the transformer you met earlier, is what turned language models into the large language models behind today's chat assistants.
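The sample, append, repeat loop can be sketched with a stand-in model. Everything here is hypothetical: `TOY_MODEL` is a hand-written table that looks only at the previous token, where a real language model would compute the distribution from the full context with a neural network.

```python
import random

# Hypothetical toy "model": next-token probabilities keyed by the previous token.
TOY_MODEL = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.5, "mat": 0.3, "floor": 0.2},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"</s>": 1.0},
    "floor": {"</s>": 1.0},
}

def next_token_distribution(context):
    """Stand-in for a real model: return a distribution over the next token."""
    return TOY_MODEL[context[-1]]

def generate(prompt, max_new_tokens=20, seed=0):
    """Autoregressive decoding: sample a token, append it, and ask again."""
    rng = random.Random(seed)  # seeded so repeated runs give the same text
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        token = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if token == "</s>":  # end-of-sequence marker: stop generating
            break
        tokens.append(token)
    return tokens

sample = generate(["<s>"])
print(" ".join(sample[1:]))
```

Nothing in the loop knows about sentences or essays. It only asks for the next-token distribution, draws from it, and extends the context, which is the sense in which writing is prediction run forward.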

What prediction is not

Now the honest part, because this is where hype piles up. A language model predicts what text is likely; it does not check whether that text is true. "The capital of Australia is Sydney" is a fluent, high-probability sentence — and wrong. When a model states a confident falsehood it is not lying; it is doing exactly its job, producing plausible continuations. This is the root of hallucination, and no amount of fluency removes it.

A few more gentle corrections. Predicting the next word is not understanding in the human sense — the model has learned the statistics of how words co-occur, which is powerful but not the same as grounded experience. Lower perplexity does not guarantee a more truthful or more helpful model, only one less surprised by text. And the famous "emergent" abilities are real but partly an artifact of how we measure them; capability grows with scale, but not by magic, and not without limit. Treat fluency as evidence of fluency, nothing more.

So hold both truths at once. Next-token prediction is the quiet, almost embarrassingly simple objective that, scaled across the world's text, produced systems that can translate, summarize, and converse. And it is still, underneath, a prediction machine — confident, fluent, occasionally wrong, and indifferent to truth unless we work hard to make it otherwise. In the next guide we meet BERT and its kin, where masked prediction is bent toward deep understanding rather than generation.