Information, Entropy & Surprise

How surprised should you be?

Start with a feeling everyone knows: surprise. If a friend says "the sun rose this morning," you learn almost nothing — you were already sure. If they say "it snowed in the desert," you sit up; that was unlikely, so it carries a lot of news. Information theory turns this hunch into a number. The rule is simple and intuitive: the *less* likely an outcome, the *more* surprising — the more information — it carries when it happens.

So we want a quantity that is big when a probability is small and that fades to zero when the outcome was certain. The choice that works is the surprise of an outcome with probability p: surprise = log(1/p). When p = 1 (certain), surprise is log(1) = 0: no news. When p is tiny, 1/p is huge, so surprise is large. Using log base 2 measures it in bits: an outcome as likely as one side of a fair coin flip (p = 1/2) carries exactly 1 bit, because log2(2) = 1.
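
The surprise formula can be tried out directly. This is a minimal sketch using only Python's standard library; the function name `surprise_bits` and the sample probabilities are illustrative choices, not anything fixed by the text.

```python
import math

def surprise_bits(p: float) -> float:
    """Surprise of an outcome with probability p, in bits: log2(1/p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)

# Certain, coin-flip, one-in-four, and one-in-a-thousand outcomes
for p in (1.0, 0.5, 0.25, 0.001):
    print(f"p = {p:<6} surprise = {surprise_bits(p):.3f} bits")
```

The certain outcome scores 0 bits, the coin-flip outcome scores 1 bit, and the one-in-a-thousand outcome scores just under 10 bits.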

Why a logarithm and not just 1/p? Because surprises should *add up*. If two independent things happen, their probabilities multiply (p·q), but it feels right that your total surprise should be the *sum* of each surprise. The logarithm is exactly the function that turns multiplication into addition: log(1/(p·q)) = log(1/p) + log(1/q). That single property is why logs appear everywhere in this corner of math — including, you'll see, in the loss you train classifiers with.
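
The additivity argument is easy to check numerically. The sketch below assumes two independent events with hypothetical probabilities 1/2 and 1/8, and compares the logarithmic surprise with the naive 1/p alternative.

```python
import math

def surprise_bits(p: float) -> float:
    """Surprise in bits: log2(1/p)."""
    return math.log2(1.0 / p)

p, q = 0.5, 0.125  # hypothetical probabilities of two independent events

# With logs, the surprise of both events together equals the sum of the parts.
joint = surprise_bits(p * q)
separate = surprise_bits(p) + surprise_bits(q)
print(joint, separate)            # both are 4.0 bits

# With plain 1/p, the joint score does not match the sum.
naive_joint = 1.0 / (p * q)       # 16.0
naive_sum = 1.0 / p + 1.0 / q     # 10.0
print(naive_joint, naive_sum)
```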

Entropy: surprise on average

A single outcome has a surprise. A whole [[random-variable|random variable]] — say, tomorrow's weather, or the next word in a sentence — has *many* possible outcomes, each with its own probability. The natural summary is the average surprise you'd feel over the long run, weighting each outcome's surprise by how often it occurs. That average is [[information-entropy|entropy]]. Higher entropy means a more unpredictable source; lower entropy means a more predictable one.

This connects straight back to the [[expectation-and-variance|expectation]] idea from earlier in this rung: entropy is just the *expected value* of surprise. Concretely, for outcomes with probabilities p1, p2, …, pn, entropy = p1 · log(1/p1) + p2 · log(1/p2) + … + pn · log(1/pn), where an outcome with pi = 0 contributes nothing to the sum. A fair coin has entropy 1 bit, maximally uncertain between two options. A biased coin that lands heads 99% of the time has only about 0.08 bits, because most of the time you already knew the answer. A coin that always lands heads has entropy 0: nothing to learn.
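
The three coins can be computed in a few lines. This is a sketch rather than a library routine: `entropy_bits` is a hypothetical helper name, and it follows the usual convention that zero-probability outcomes contribute nothing.

```python
import math

def entropy_bits(probs):
    """Expected surprise in bits: sum of p_i * log2(1/p_i), skipping p_i = 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

fair = entropy_bits([0.5, 0.5])       # maximally uncertain coin
biased = entropy_bits([0.99, 0.01])   # heads 99% of the time
certain = entropy_bits([1.0, 0.0])    # always heads

print(f"fair    = {fair:.3f} bits")
print(f"biased  = {biased:.3f} bits")
print(f"certain = {certain:.3f} bits")
```

The biased coin comes out at roughly 0.081 bits, less than a tenth of the fair coin's 1 bit.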

Cross-entropy: when your model is wrong about the odds

Entropy assumes you know the true probabilities. But a model doesn't; it has *guesses*. Call the true distribution p (what nature actually does) and the model's guessed distribution q (what your model predicts). Cross-entropy asks: if you measured surprise using q's probabilities, but outcomes actually arrive according to p, how much surprise do you feel on average? It's the average surprise you pay for believing q while reality follows p. In the same notation as entropy, cross-entropy = sum of pi · log(1/qi): the weights come from p, the surprises come from q.

Here's the key fact, and it's gentle: cross-entropy is *always at least as large as* the true entropy, and it equals entropy only when q matches p perfectly. Any mismatch costs extra surprise. So if you want your model's q to mirror reality's p, you simply push the cross-entropy *down* — driving it toward the floor set by the true entropy. That is exactly the move a training loop makes.
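
To see the floor in action, the sketch below scores one hypothetical true distribution against three guesses. The distributions are made up for illustration, and the helper assumes q gives nonzero probability to every outcome that p does.

```python
import math

def cross_entropy_bits(p, q):
    """Average surprise, in bits, of scoring outcomes drawn from p with q's probabilities.

    Assumes q assigns nonzero probability wherever p does.
    """
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # hypothetical true distribution
guesses = {
    "exact": [0.7, 0.2, 0.1],
    "close": [0.6, 0.25, 0.15],
    "uniform": [1 / 3, 1 / 3, 1 / 3],
}

entropy = cross_entropy_bits(p, p)  # cross-entropy of p with itself is the entropy
results = {name: cross_entropy_bits(p, q) for name, q in guesses.items()}
for name, h in results.items():
    print(f"{name:8s} cross-entropy = {h:.4f} bits, gap above entropy = {h - entropy:.4f}")
```

Only the exact guess reaches the entropy; the gap grows as the guess moves further from p.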

The leftover gap — cross-entropy minus the true entropy — has its own name: [[kl-divergence|KL divergence]]. Read it as "the *extra* surprise from using the wrong distribution." KL is zero when q = p and grows as q drifts away, so it behaves like a distance from your beliefs to the truth. One honest caveat: it is *not* a true distance — KL(p, q) generally differs from KL(q, p). It's asymmetric, which is why people say "divergence," not "metric."
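
The asymmetry shows up even with two-outcome distributions. In this sketch, `kl_bits` is a hypothetical helper and the two distributions are arbitrary examples; swapping the arguments changes the answer.

```python
import math

def kl_bits(p, q):
    """KL(p, q) in bits: extra average surprise from using q when p is true.

    Assumes q assigns nonzero probability wherever p does.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]  # hypothetical true distribution
q = [0.5, 0.5]  # hypothetical model guess

forward = kl_bits(p, q)   # truth p, beliefs q
backward = kl_bits(q, p)  # roles swapped
same = kl_bits(p, p)      # no mismatch, so no extra surprise

print(f"KL(p, q) = {forward:.3f} bits")
print(f"KL(q, p) = {backward:.3f} bits")
print(f"KL(p, p) = {same:.3f} bits")
```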

Why cross-entropy is the default classification loss

Now the payoff. A classifier outputs a [[probability-distribution|probability distribution]] over labels — usually via [[softmax|softmax]], which squashes raw scores into positive numbers that sum to 1. That's the model's q. The truth p for a single labeled example is dead simple: probability 1 on the correct class, 0 on everything else (a "one-hot" vector). Plug that p into cross-entropy and almost everything cancels: the loss for one example becomes just log(1 / q_correct) — the surprise the model assigned to the right answer.
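
The cancellation can be checked by computing the full cross-entropy sum against a one-hot p and comparing it with the shortcut. The logits and the true class index below are hypothetical values, and natural logs are used, so the loss is in nats rather than bits.

```python
import math

def softmax(scores):
    """Turn raw scores into positive numbers that sum to 1."""
    m = max(scores)                       # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """Sum of p_i * log(1/q_i) in nats, skipping terms where p_i = 0."""
    return sum(pi * math.log(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

logits = [2.0, 0.5, -1.0]  # hypothetical raw model scores
y = 0                      # hypothetical index of the true class

q = softmax(logits)
p = [1.0 if i == y else 0.0 for i in range(len(q))]  # one-hot truth

full = cross_entropy(p, q)        # every class appears in the sum
shortcut = math.log(1.0 / q[y])   # only the true class survives
print(full, shortcut)
```

Every term with p_i = 0 drops out, so the two numbers agree.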

Read that as an incentive. If the model is confident and right (q_correct near 1), surprise is near 0 — almost no penalty. If it's confident and *wrong* (q_correct near 0), the loss rockets toward infinity. The logarithm punishes confident mistakes brutally and rewards calibrated confidence. Minimizing this loss over your whole dataset is the everyday meaning of training a classifier — and in binary cases it's the same thing as the [[log-loss|log loss]] you'll meet in evaluation, and the engine inside [[logistic-regression|logistic regression]].
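
The incentive is easiest to feel with numbers. Below is a small sketch of the binary case; `binary_log_loss` is a hypothetical helper, not a library function, and the predicted probabilities are made up.

```python
import math

def binary_log_loss(y_true: int, p_pos: float) -> float:
    """Log loss in nats for one binary example; p_pos is the predicted P(label = 1)."""
    q_correct = p_pos if y_true == 1 else 1.0 - p_pos
    return -math.log(q_correct)

# The true label is 1 in every case; only the model's confidence changes.
losses = {p_pos: binary_log_loss(1, p_pos) for p_pos in (0.99, 0.6, 0.01)}
for p_pos, loss in losses.items():
    print(f"predicted P(1) = {p_pos:<5} loss = {loss:.3f}")
```

Going from 0.99 to 0.01 on the correct label multiplies the loss by several hundred, which is the "confident and wrong" penalty in action.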

There's a deeper reason it's the *default* and not just one option among many. Minimizing cross-entropy is mathematically equivalent to [[maximum-likelihood-estimation|maximum likelihood]], the principle introduced earlier in this rung of choosing the parameters that make the observed data most probable. The reason is short: if the examples are independent, the probability the model assigns to the whole training set is the product of the q_correct values, and taking -log of a product gives the sum of the per-example losses. So cross-entropy isn't an arbitrary formula someone liked; it's what "make the training data as likely as possible" looks like once you write it down. That pedigree, plus its gradients playing nicely with softmax, is why it's the reflex choice as a [[loss-function|loss function]].
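
The link to maximum likelihood can be verified on a toy dataset. The sketch assumes four independent labeled examples and uses hypothetical values for the probability the model gave each true class.

```python
import math

# Hypothetical model outputs: the probability assigned to the true class
# of each of four independent training examples.
q_correct = [0.9, 0.6, 0.8, 0.3]
n = len(q_correct)

# Likelihood of the whole dataset: product of the per-example probabilities.
likelihood = math.prod(q_correct)

# Average cross-entropy loss (nats) versus negative mean log-likelihood.
mean_cross_entropy = sum(-math.log(q) for q in q_correct) / n
neg_mean_log_likelihood = -math.log(likelihood) / n

print(mean_cross_entropy, neg_mean_log_likelihood)
```

The two quantities agree up to floating-point rounding, so pushing one down is the same as pushing the likelihood up.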

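# one labeled example, C classes (values below are illustrative)
import math

logits = [2.0, 0.5, -1.0]    # raw model scores, length C
y = 0                        # index of the true class

# softmax: model's predicted distribution q, sums to 1
m = max(logits)              # subtracting the max avoids overflow, result unchanged
exps = [math.exp(s - m) for s in logits]
q = [e / sum(exps) for e in exps]
loss = -math.log(q[y])       # = log(1 / q[y]) = surprise on the true class (in nats)

# confident & right -> q[y]~1 -> loss~0
# confident & wrong -> q[y]~0 -> loss huge
# average loss over the dataset is what training minimizes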

Cross-entropy for one example collapses to the surprise the model placed on the correct label.

Reading it in the wild — and a caveat

You'll now recognize these ideas everywhere. Language models are trained by minimizing cross-entropy over "what's the next token?", and [[perplexity|perplexity]], the headline number in that world, is simply cross-entropy exponentiated in the matching base: 2 raised to the cross-entropy in bits, or equivalently e raised to the cross-entropy in nats. It is reported as an effective branching factor ("the model is as confused as if it were choosing uniformly among N words"). Lower perplexity, lower cross-entropy, less average surprise: three views of one quantity.
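
The branching-factor reading can be confirmed with a small sketch. The `perplexity` helper and the token probabilities are hypothetical; it takes the probability a model assigned to each token that actually occurred and works in nats, so the exponent base is e.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability (nats) given to the observed tokens."""
    mean_cross_entropy = sum(-math.log(q) for q in token_probs) / len(token_probs)
    return math.exp(mean_cross_entropy)

# A model that spreads probability uniformly over 50 words gives each token 1/50.
uniform_guess = perplexity([1 / 50] * 4)

# A sharper hypothetical model that usually puts high probability on the right token.
sharper_guess = perplexity([0.5, 0.25, 0.8, 0.4])

print(uniform_guess, sharper_guess)
```

The uniform model lands at a perplexity of 50 (up to rounding), matching the "choosing uniformly among N words" description, and the sharper model scores lower.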

And remember the asymmetry of KL from earlier; it quietly shapes behavior. Penalizing the model when it puts probability where the truth has none (KL(q, p), which blows up when q > 0 but p = 0) is different from penalizing it for missing outcomes that do happen (KL(p, q), which blows up when p > 0 but q = 0). Training on cross-entropy effectively uses the second direction, KL(p, q), which tends to make models hedge toward covering the truth rather than gambling everything on one answer. You don't need the formula to keep this instinct: surprise is the currency, and a good model spends as little of it as honesty allows.
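
A toy comparison illustrates the hedging pressure. In this sketch the truth splits evenly between two outcomes, and two hypothetical models are scored in the direction that cross-entropy training uses, with p as the truth and q as the model.

```python
import math

def kl_bits(p, q):
    """KL(p, q) in bits; assumes q is nonzero wherever p is nonzero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5, 0.0]            # truth: two outcomes occur, the third never does

q_hedge = [0.4, 0.4, 0.2]      # covers both real outcomes, wastes some mass
q_gamble = [0.98, 0.01, 0.01]  # bets almost everything on one outcome

cost_hedge = kl_bits(p, q_hedge)
cost_gamble = kl_bits(p, q_gamble)

print(f"hedging model:  {cost_hedge:.3f} bits of extra surprise")
print(f"gambling model: {cost_gamble:.3f} bits of extra surprise")
```

Wasting probability on an outcome that never happens costs the hedging model a fraction of a bit, while nearly missing an outcome that happens half the time costs the gambling model several times more.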

That's the whole arc of this rung's last stop. Surprise is log(1/p). Entropy is average surprise — the unavoidable uncertainty in a source. Cross-entropy is the average surprise of believing the wrong distribution, KL divergence is the avoidable extra, and a classifier trains by squeezing cross-entropy down — which is just maximum likelihood wearing a different coat. With vectors, probability, calculus, and now information in hand, you're ready to read the rest of the ladder with real confidence.