What Is a Large Language Model?

One absurdly simple job

When you chat with a ChatGPT-style system, it feels like you are talking to something that understands you. Lift the hood, though, and the engine is doing one stubbornly narrow task: given the text so far, predict what comes next. That's it. A large language model (LLM) is, at its core, a very elaborate guess-the-next-piece machine, trained until its guesses are astonishingly good.

This is exactly the language modeling objective you met earlier, just scaled far past anything that came before. Train a model on a huge pile of text by repeatedly hiding the next piece and asking it to guess, and reward it for being right. Do that billions of times and it stops merely parroting — it starts capturing grammar, facts, idioms, code patterns, and the rhythm of an argument, all as a side effect of getting better at one prediction game.
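To make "reward it for being right" concrete, here is a minimal sketch of the usual scoring rule, cross-entropy: the penalty is the negative log of the probability the model gave to the piece that actually came next. The three probability tables are made-up illustrations, not outputs from any real model.

```python
import math

def next_token_loss(predicted_probs, actual_next):
    """Cross-entropy for one guess: -log(probability given to the true next piece)."""
    return -math.log(predicted_probs[actual_next])

# Hypothetical guesses for the context "the cat sat on the" (true next piece: "mat").
confident = {"mat": 0.90, "dog": 0.05, "moon": 0.05}
unsure = {"mat": 0.40, "dog": 0.30, "moon": 0.30}
wrong = {"mat": 0.02, "dog": 0.49, "moon": 0.49}

for name, probs in [("confident", confident), ("unsure", unsure), ("wrong", wrong)]:
    print(name, round(next_token_loss(probs, "mat"), 3))
```

The loss shrinks toward zero as the model puts more probability on the right piece, and training nudges the parameters in whatever direction lowers it.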

Tokens: the pieces it predicts

The "piece" an LLM predicts is not a word and not a letter. It's a token. Through tokenization, text is chopped into common chunks: whole short words, word fragments, punctuation, spaces. Many modern systems use byte-pair encoding (BPE) or a close variant, which learns its vocabulary from data, so frequent words stay intact while rare ones split into reusable parts. "Unhappiness" might become "un", "happ", "iness"; an emoji might be one token; a long German compound, several.
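The idea behind byte-pair encoding fits in a few lines. What follows is a toy sketch, not any production tokenizer: it starts from single characters and repeatedly fuses the most frequent adjacent pair. The function names and the tiny corpus are invented for illustration, and real tokenizers work on bytes and far larger corpora.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def tokenize(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical mini-corpus: "the" is frequent, "then" is rare.
merges = learn_bpe_merges(["the"] * 5 + ["then"], num_merges=2)
print(tokenize("the", merges))    # the frequent word ends up as a single token
print(tokenize("other", merges))  # an unseen word is still covered by smaller pieces
```

Whatever merges are learned, joining the tokens back together always reproduces the original word, which is why no input is ever "out of vocabulary".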

Why bother with this odd middle ground? Letters alone make sequences painfully long; whole words make the vocabulary explode and leave the model helpless at any word it never saw. Tokens are the practical compromise — a fixed vocabulary (often 50,000–200,000 entries) that can spell out literally anything, including typos and brand-new words, by combining pieces. Every token maps to an embedding, a learned vector, which is the actual numeric form the network reads.
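As a rough sketch of that last step, an embedding is just a table lookup: token to integer id, id to a row of numbers. The five-entry vocabulary, the dimension of 4, and the random values below are all stand-ins; in a trained model the table is learned and each row has hundreds or thousands of entries.

```python
import random

random.seed(0)  # fixed seed so the illustrative numbers are repeatable

vocab = ["un", "happ", "iness", " the", " cat"]  # hypothetical 5-entry vocabulary
token_to_id = {tok: i for i, tok in enumerate(vocab)}
embedding_dim = 4  # tiny on purpose

# One row per vocabulary entry. Learned in a real model; random here.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab
]

def embed(tokens):
    """Map token strings to ids, then ids to their vectors."""
    ids = [token_to_id[t] for t in tokens]
    return ids, [embedding_table[i] for i in ids]

ids, vectors = embed(["un", "happ", "iness"])
print(ids)
for vec in vectors:
    print([round(x, 2) for x in vec])
```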

Parameters and the shape of "large"

So what does the model actually store? Its knowledge lives in its parameters: the weights and biases inside the network, the same kind of dials you tuned in earlier rungs, just far more of them. A small classic model had thousands; today's LLMs have billions to hundreds of billions. "7B" or "70B" in a model's name is a parameter count. Each one is a number nudged during training to make the next-token guess a little better.
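A quick back-of-the-envelope sketch shows where those counts come from. A fully connected layer has one weight per input-output pair plus one bias per output. The layer sizes below are illustrative, and the memory estimate assumes 2 bytes per parameter (16-bit numbers), which is an assumption about storage format rather than a property of any particular model.

```python
def dense_layer_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output."""
    return n_in * n_out + n_out

def mlp_params(layer_sizes):
    """Total parameters in a stack of fully connected layers."""
    return sum(dense_layer_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

def weights_gigabytes(n_params, bytes_per_param=2):
    """Storage for the raw numbers alone, assuming 16-bit values by default."""
    return n_params * bytes_per_param / 1e9

print(mlp_params([784, 16, 10]))         # a small classic classifier: 12730
print(weights_gigabytes(7_000_000_000))  # a "7B" model at 2 bytes each: 14.0
```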

Those parameters are arranged in a transformer architecture, the design that made this scale possible. Its key move is self-attention, which lets every token look back at the tokens before it and decide which ones matter for predicting the next one. Stack dozens of these attention layers, give them enough parameters and enough text, and you get a foundation model: one general-purpose base trained once, then reused for translation, coding, Q&A, and a hundred tasks nobody trained it for explicitly.
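Here is a stripped-down sketch of self-attention for a next-token predictor. It is deliberately simplified: a real transformer first passes each token vector through learned query, key, and value projections and runs many attention heads in parallel, whereas this version uses the raw vectors directly. What it keeps is the core move: score each visible token by similarity, turn the scores into weights with softmax, and blend. Each position sees only itself and earlier positions.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(vectors):
    """Each position mixes the vectors at or before it, weighted by similarity."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1]  # causal mask: no peeking at later tokens
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in visible
        ]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, visible)) for j in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up 2-dimensional token vectors.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(vectors)
for i, w in enumerate(weights):
    print(f"token {i} attends with weights {[round(x, 2) for x in w]}")
```

The first token can only attend to itself, so its weight list has a single entry; each later token spreads its attention over a longer history.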

input:  "The capital of France is"
  tokens -> [The][ capital][ of][ France][ is]
  model -> probability over the whole vocabulary:
           " Paris"  0.71
           " the"    0.06
           " a"      0.04
           ...        (tens of thousands more)
  pick one, append it, feed it all back in, repeat

One step of generation: the model outputs a probability for every possible next token, not a single answer.

Notice what the model emits: not a word, but a probability for every token in the vocabulary, produced by a final softmax layer. Generation is then a loop — sample one token, glue it onto the input, and run the whole thing again. This step-by-step, left-to-right loop is called autoregressive decoding, and it's why an LLM writes the way it does: one token at a time, each one conditioned on everything it has said so far.
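The loop is short enough to sketch end to end. The "model" here is a hypothetical lookup table that scores next tokens from the last token alone, standing in for a real network that reads the whole context; the scores and the `<end>` marker are invented for illustration. The softmax and the sample-append-repeat loop are the parts that carry over.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability for every candidate token."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical stand-in for a trained network.
TOY_LOGITS = {
    " is": {" Paris": 3.0, " the": 0.5, " a": 0.1},
    " Paris": {".": 2.5, ",": 1.0},
}

def toy_model(tokens):
    """Return raw scores for the next token (a real LLM conditions on all of `tokens`)."""
    return TOY_LOGITS.get(tokens[-1], {"<end>": 0.0})

def generate(prompt_tokens, max_new_tokens=5, rng=None):
    """Autoregressive decoding: sample a token, append it, run the model again."""
    rng = rng or random.Random(0)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = softmax(toy_model(tokens))
        choices, weights = zip(*probs.items())
        next_token = rng.choices(choices, weights=weights, k=1)[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens

prompt = ["The", " capital", " of", " France", " is"]
print(softmax(toy_model(prompt)))
print("".join(generate(prompt)))
```

Because the next token is sampled rather than looked up, two runs with different random seeds can produce different continuations from the same prompt.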

From raw predictor to helpful assistant

A freshly trained base model is a brilliant text continuer, not an assistant. Ask it a question and it might continue with ten more questions, because in its training data questions often come in lists. Turning it into something helpful takes two more stages on top of the giant first stage of pretraining on oceans of text.

1. Pretraining: read a large slice of the internet, books, and code, learning next-token prediction. This is where almost all the knowledge and skill is absorbed, and where almost all the cost and energy go.
2. Fine-tuning on instructions: show it many examples of a request followed by a good response, so it learns the assistant format: answer the question, follow the instruction.
3. Preference alignment (RLHF and friends): have humans rank competing answers, then nudge the model toward the kind people preferred: more helpful, less toxic, harder to bait into nonsense.

That third stage, RLHF (reinforcement learning from human feedback), is the polish that makes ChatGPT-style systems feel cooperative and safe-ish. But be honest about what it is: a layer of taste and manners trained on top of the predictor, not a guarantee of correctness. RLHF teaches the model what kind of answer people like; it does not teach it what is true. A confidently wrong answer that sounds helpful can sail right through.
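One common way to turn human rankings into a training signal is a pairwise loss on a separate scoring model: it is penalized whenever the answer people preferred does not score higher than the one they rejected. The sketch below shows only that loss, with made-up scores; it is an illustration of the idea under that assumption, not a description of how any particular system was trained.

```python
import math

def preference_loss(score_preferred, score_rejected):
    """-log(sigmoid(difference)): small when the preferred answer scores higher."""
    diff = score_preferred - score_rejected
    return math.log(1 + math.exp(-diff))

print(round(preference_loss(2.0, 0.0), 3))  # scorer agrees with the humans: low loss
print(round(preference_loss(0.0, 0.0), 3))  # scorer is indifferent
print(round(preference_loss(0.0, 2.0), 3))  # scorer disagrees: high loss
```

Notice that nothing in this loss checks facts. It only compares which answer people liked more, which is exactly why a pleasant but wrong answer can score well.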

What scale buys — and what it doesn't

Here is the genuinely surprising finding of the last decade: as you add more parameters, more data, and more compute, the model gets better in a smooth, predictable, almost lawlike way. These scaling laws are why labs keep building bigger. Scale buys fluency, broad knowledge, and the ability to handle a task from just a few examples in the prompt. This is in-context learning, where the model adapts on the fly without any change to its parameters.
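In practice, in-context learning is nothing more exotic than putting worked examples into the text the model continues. A minimal sketch, assuming a simple "Input:/Output:" layout (the layout and the helper name are hypothetical; any consistent format works the same way):

```python
def build_few_shot_prompt(examples, query):
    """Lay out solved examples, then the new case with its answer left blank."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

prompt = build_few_shot_prompt([("cold", "hot"), ("up", "down")], "big")
print(prompt)
```

The model is never told the task is "give the opposite". It simply predicts the most plausible continuation of a text in which every earlier input was followed by its opposite, and no parameter changes along the way.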

You'll also hear about emergent abilities — skills that seem to switch on only past a certain size. Treat the word with care. Some of that "emergence" is real, but a lot of it is an artifact of harsh all-or-nothing scoring: a model improving gradually can look like it jumps from zero to hero simply because the test only counts perfect answers. Capability grows; it rarely teleports. There is no magic threshold where the model wakes up.
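A little arithmetic shows how the scoring artifact works. Suppose an answer is 20 tokens long and only counts if every token is right. The sketch below assumes each token is correct independently with the same probability, which is a simplification, but it is enough to show the effect.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that every token of an answer is right, assuming independent tokens."""
    return per_token_accuracy ** answer_length

for p in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]:
    print(f"per-token {p:.2f} -> exact-match {exact_match_rate(p, 20):.3f}")
```

Per-token accuracy climbs steadily from 0.5 to 0.99, yet the all-or-nothing score sits near zero for most of that climb, reaches only about 12% at 0.9, and then shoots up to about 82% at 0.99. The underlying skill improved gradually; the metric made it look like a switch.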

What scale does NOT buy is just as important. It does not buy truthfulness: the same machine that fluently states facts will, with equal fluency, invent a fake citation or a non-existent law — a hallucination — because it is optimizing for plausible continuations, not verified ones. It does not buy genuine reasoning over arbitrary new problems, real-time knowledge of events after its training cutoff, or any inner goals. And it does not buy general intelligence: a bigger next-token predictor is a more capable narrow tool, not a mind on the verge of waking up.