Pretraining: Learning from the Internet

The trick: turn text into its own answer key

You already know the transformer — the engine that lets a model weigh every word against every other through attention. Pretraining is what we feed that engine, and the secret is almost embarrassingly simple. Take an ocean of text, hide the next word, and ask the model to guess it. The text it came from is the answer, so no human ever has to label anything. That is [[self-supervised-learning|self-supervised learning]]: the data supervises itself.
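To make "the data supervises itself" concrete, here is a minimal sketch of how one sentence becomes several training examples. The helper name `next_token_pairs` is hypothetical, and whitespace splitting stands in for a real tokenizer, which would be an assumption far too crude for actual pretraining.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: each prefix is asked to predict the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Assumption for illustration: split on whitespace instead of using a real tokenizer.
tokens = "The capital of France is Paris".split()
pairs = next_token_pairs(tokens)

for context, target in pairs:
    # The label on the right is simply the next piece of the text itself.
    print(context, "->", target)
```

Every target comes straight from the text, so a six-token sentence yields five labeled examples with no human annotation.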

This is just language modeling, the task you met in the NLP rung, scaled to absurdity. The model reads a sequence of tokens and, via softmax, outputs a probability for every possible next token. We compare that prediction to the real next token, measure the error with a cross-entropy loss (the negative log of the probability assigned to the true token), and nudge billions of weights downhill with stochastic gradient descent. Repeat across trillions of tokens. Nothing here is new: it is the backpropagation you already understand, running for weeks to months.

From predictor to foundation model

When you train one model on a broad enough sweep of data that it becomes a reusable starting point for many tasks, you have a [[foundation-model|foundation model]]. The name captures the shift: instead of training a fresh network per task — one for translation, one for sentiment — you pretrain once at great cost, then adapt cheaply. A large language model is simply a foundation model whose pretraining diet is text.

This is transfer learning at planetary scale. The pretrained weights hold general-purpose representations of language; a later, far smaller stage of fine-tuning reshapes them for a specific use. Crucially, the raw pretrained model is not yet a chatbot. It will happily continue your prompt as a web page would — completing, padding, even inventing — because completion is all it was ever asked to do. Turning it into a helpful assistant is the *next* guide's job.

raw text  ──tokenize──▶  [The, capital, of, France, is]
                                         │
                              transformer (billions of weights)
                                         │
                          softmax over the whole vocabulary
                                         ▼
            P("Paris")=0.71  P("a")=0.04  P("home")=0.01  ...
            loss = -log P(true next token)  →  backprop  →  repeat

One training step: predict the next token, score the guess, adjust the weights — repeated across trillions of tokens.
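The step in the diagram can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the four-word vocabulary and the logit values are made up, and the update is applied directly to the logits, whereas a real model would backpropagate that gradient into its weights.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """loss = -log P(true next token)."""
    return -math.log(probs[target_index])


# Hypothetical toy vocabulary and made-up scores for the context
# "The capital of France is".
vocab = ["Paris", "a", "home", "the"]
logits = [2.0, 0.5, -0.5, 0.0]
target = vocab.index("Paris")

probs = softmax(logits)
loss_before = cross_entropy(probs, target)

# For softmax followed by cross-entropy, the gradient of the loss with
# respect to the logits is (probs - one_hot(target)).
grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

# One gradient-descent step. Simplification: we move the logits themselves.
learning_rate = 0.5
logits = [x - learning_rate * g for x, g in zip(logits, grad)]

loss_after = cross_entropy(softmax(logits), target)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

After the step, the probability assigned to the true token rises and the loss falls, which is the whole training signal in miniature.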

The three ingredients: data, compute, parameters

Three knobs decide how good a pretrained model gets. Data: a curated sweep of the web, books, code, and more, typically trillions of tokens, heavily filtered for quality and de-duplicated, because a model is shaped by what it eats. Compute: thousands of GPUs or TPUs running for weeks to months via distributed training, the largest single computation most organizations ever run. [[parameter|Parameters]]: the billions of weights that store what was learned.

The remarkable finding is that these are not random gambles. [[scaling-laws-capability|Scaling laws]] show that loss falls smoothly and predictably as you grow data, compute, and parameters together — often as a clean power law you can extrapolate before spending the money. They also tell you the *balance*: for a given compute budget, there is an optimal ratio of model size to training tokens, and early giant models were badly under-trained on too little data for their size.
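As a rough illustration of "extrapolate before spending the money", the sketch below fits a power law to a few small runs and predicts the loss of a much larger one. Everything numeric here is synthetic: the compute values, the constants 20.0 and 0.05, and the helper name `fit_power_law` are assumptions for illustration, not measurements. It also assumes a pure power law with no floor term, which real fits may not satisfy.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on log-log axes; return (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    covariance = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    variance = sum((u - mean_x) ** 2 for u in log_x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic "small runs": made-up compute budgets and losses generated
# from a known power law so the fit can be checked.
compute = [1e18, 1e19, 1e20, 1e21]
loss = [20.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)

# Extrapolate two orders of magnitude beyond the largest run.
predicted_loss = a * 1e23 ** -b
print(f"fitted a={a:.3f}, b={b:.4f}, predicted loss at 1e23: {predicted_loss:.3f}")
```

On log-log axes a power law is a straight line, which is why a handful of cheap runs can be enough to sketch where an expensive one might land.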

Emergent abilities — read carefully

As models grew, researchers noticed skills that seemed absent in small models and present in large ones — multi-step arithmetic, following unusual instructions, solving puzzles. These got called [[emergent-abilities|emergent abilities]]: capabilities that appear to switch on past some scale rather than improving gradually. It is a genuinely striking observation, and it is also one of the most over-hyped phrases in the field.

Here is the honest caveat. Later work showed many "sudden" jumps are partly an artifact of the *metric*. Score a task all-or-nothing (right only if every digit is correct) and progress looks like a cliff; score it with partial credit (per-token probability) and the same skill improves smoothly, exactly as scaling laws predict. The underlying competence often grows gradually; the sharp line is sometimes in our ruler, not the model.
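The ruler effect is easy to reproduce with arithmetic alone. The sketch below assumes a 10-digit answer whose digits are each correct independently with the same probability; the accuracy values are made up to represent smooth improvement with scale, and `exact_match_rate` is a hypothetical helper.

```python
def exact_match_rate(per_digit_accuracy, num_digits):
    """Chance that every digit is right, assuming digits are independent."""
    return per_digit_accuracy ** num_digits


# Made-up per-digit accuracies that improve smoothly as models grow.
per_digit = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
exact = [exact_match_rate(p, 10) for p in per_digit]

for p, e in zip(per_digit, exact):
    print(f"per-digit accuracy {p:.2f} -> all-or-nothing score {e:.3f}")
```

The per-digit column climbs steadily, while the all-or-nothing column sits near zero for most of the range and then shoots up at the end. It is the same underlying skill measured with two different rulers.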

So hold two truths at once. Scale does unlock qualitatively new behavior — that is real and important. But "emergent" does not mean magical, unpredictable, or a sign that the model is waking up. Much of what looks like a leap is a measurement choice meeting steady, lawful improvement. Treat sweeping claims about sudden intelligence with the same skepticism you would bring to any extraordinary result.

What pretraining is — and what it is not

It helps to be precise about what this stage delivers. A pretrained model is a vast statistical compression of how text tends to continue. It has read far more than any person could, yet it has no goals, no understanding of truth, and no awareness that it is answering anyone. It is trained to sound plausible, not to be right. That is why hallucination, stating false things in a plausible voice, is not a bug bolted on by accident but a direct consequence of optimizing for plausible continuation.

Pretraining also has real costs and edges worth naming. The data carries human bias, which the model absorbs and can amplify. Knowledge is frozen at the training cutoff, so the model has no inherent way to know about anything newer. And the energy and water behind training a frontier model are substantial; the environmental cost is a genuine part of the ledger, not an afterthought.

Keep this frame as you read on. Pretraining builds a powerful, knowledgeable, but raw and aimless engine. The remaining guides in this rung add steering — fine-tuning and RLHF to make it helpful, sampling to turn its probabilities into prose. None of that creates a mind; it shapes a predictor into a tool. Holding that line clearly is the difference between using these systems well and being fooled by them.