Entropy, Information, and Probabilistic Machine Learning

Surprise, measured in bits

Throughout this ladder, probability has been our language for uncertainty. Information theory adds one more word: it asks not just "how uncertain am I?" but "how much uncertainty did this message just remove?" The bridge between the two is a single, almost inevitable choice. Suppose an event of probability p occurs. How surprised should you be? A near-certain event (p close to 1) should be barely surprising; an event you thought nearly impossible (p close to 0) should be enormously surprising. And surprise should be additive for independent events: learning two unrelated facts should surprise you by the sum of the separate surprises. Since independent probabilities multiply, we need a function that turns multiplication into addition, and the only continuous one, up to a constant factor, is the logarithm. So we define the surprise of an outcome as -log(p), with the minus sign making it non-negative and decreasing in p.

The base of the logarithm just sets the unit. Base 2 gives bits: one bit is the surprise of a single fair coin flip, since -log_2(1/2) = 1. Base e gives nats, which are more natural in calculus and are the unit machine learning usually uses. A fair coin flip carries 1 bit; one roll of a fair six-sided die carries -log_2(1/6), which is about 2.58 bits: more outcomes, more surprise, more information when you finally see the result. Notice the honest subtlety already: surprise is a property of a particular *outcome* under a particular *model* p. Change the model and the same outcome carries a different number of bits.
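As a minimal sketch of the definition, the surprise of an outcome can be computed directly from its probability. The function name `surprisal` and its `base` parameter are illustrative choices, not part of any library; the coin and die are the examples from the text, and the combined event assumes the two are independent.

```python
import math

def surprisal(p: float, base: float = 2.0) -> float:
    """Surprise -log_base(p) of an outcome with probability p, for 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)

coin_bits = surprisal(1 / 2)               # one fair coin flip
die_bits = surprisal(1 / 6)                # one face of a fair six-sided die
both_bits = surprisal((1 / 2) * (1 / 6))   # coin and die together (assumed independent)

print(coin_bits, die_bits, both_bits)
print(surprisal(1 / 2, base=math.e))       # the same coin flip measured in nats
```

The surprise of the joint event equals the sum of the two separate surprises, which is the additivity the logarithm was chosen to deliver.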

Entropy: average surprise

If -log(p) is the surprise of one outcome, the natural next question is: on average, how surprised will I be before I see the result? That average is [[shannon-entropy|Shannon entropy]]. For a random variable X with probability mass p(x), it is simply the expectation of the surprise: H(X) = sum over x of p(x) times -log(p(x)). It is an expectation of a function of X — the very LOTUS move you learned long ago, here applied to the function -log(p(X)). Entropy is therefore measured in bits or nats and quantifies, in one number, how uncertain the whole distribution is before any observation.

H(X) = - sum_x  p(x) log p(x)        (entropy, average surprise)

fair coin   p = (1/2, 1/2)   ->  H = 1 bit       (max for 2 outcomes)
biased coin p = (0.9, 0.1)   ->  H ~ 0.47 bits   (more predictable)
sure thing  p = (1, 0)       ->  H = 0 bits      (no surprise at all)

Entropy is maximised by the uniform distribution (most uncertain) and is zero for a certain outcome. The convention 0 log 0 = 0 handles impossible outcomes cleanly.

Two facts give entropy its shape. First, it is never negative, since each term p(x) times -log(p(x)) is non-negative for p in [0, 1]. Second, for a variable with k possible outcomes, entropy is largest exactly when the distribution is uniform, giving H = log(k); any lopsidedness lowers it. This is why a fair coin (1 bit) is harder to predict than a biased one (0.47 bits): uniform means maximally uncertain. Shannon's source-coding theorem then makes the unit literal: entropy is the smallest average number of bits per symbol with which the source can be encoded losslessly. No lossless compression scheme can beat it on average, though good codes can come arbitrarily close, and that is why H is called the information content.
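The three coin examples and the uniform bound can be checked with a short sketch. The function name `entropy` and the list-of-probabilities representation are assumptions made for illustration.

```python
import math

def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    if not math.isclose(sum(p), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Skip zero-probability outcomes: this is the convention 0 log 0 = 0.
    return sum(px * math.log(1.0 / px) for px in p if px > 0) / math.log(base)

fair = entropy([0.5, 0.5])         # fair coin
biased = entropy([0.9, 0.1])       # biased coin, about 0.47 bits
sure = entropy([1.0, 0.0])         # sure thing
uniform_die = entropy([1 / 6] * 6) # uniform over k = 6 outcomes, equals log2(6)

print(fair, biased, sure, uniform_die)
```

Each term in the sum is non-negative, so the result can never drop below zero, and among the two-outcome examples the uniform one gives the largest value.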

KL divergence: the cost of believing the wrong model

Now the move that powers everything downstream. Suppose the data really come from a distribution p, but you build your beliefs, and your compression code, around a different distribution q. How much do you pay for being wrong? You compute surprises with the wrong model, -log(q(x)), but they actually occur with the right frequencies p(x), so your average surprise is the cross-entropy H(p, q) = sum over x of -p(x) log(q(x)). Subtract off the unavoidable minimum, the true entropy H(p), and the gap left over is the [[kullback-leibler-divergence|Kullback-Leibler divergence]] D(p || q) = H(p, q) - H(p) = sum over x of p(x) log(p(x) / q(x)): the extra bits you waste, on average, by modelling p as q.
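The subtraction described above can be sketched in a few lines. The function names and the two distributions are hypothetical, chosen to reuse the biased coin; the sketch assumes q gives positive probability to every outcome that p allows.

```python
import math

def cross_entropy(p, q):
    """Average surprise, in bits, when outcomes follow p but are scored with q."""
    return sum(px * -math.log2(qx) for px, qx in zip(p, q) if px > 0)

def entropy(p):
    """Entropy is the cross-entropy of p with itself."""
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """Extra bits wasted by modelling p as q: cross-entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)

p = [0.9, 0.1]   # hypothetical true distribution (the biased coin)
q = [0.5, 0.5]   # hypothetical model: pretend the coin is fair

print(cross_entropy(p, q))   # average bits per flip paid with the fair-coin code
print(kl_divergence(p, q))   # the part of that cost that is pure waste
print(kl_divergence(q, p))   # swapping the roles of p and q gives a different number
```

With these numbers the fair-coin code pays 1 bit per flip while the source only needs about 0.47, so roughly 0.53 bits per flip are wasted.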

The single most important property of KL divergence is that it is never negative, and equals zero only when q matches p exactly. This is Gibbs' inequality, and the cleanest proof is one you already own from the moments rung: it is just Jensen's inequality applied to the concave logarithm. Because of this, KL behaves like a directed measure of "how far q is from p", a notion of distance between distributions. But be honest about the word: it is not a true distance. KL is asymmetric, so D(p || q) is generally not equal to D(q || p), and it does not satisfy the triangle inequality. It measures wasted bits in one specific direction, not a symmetric gap.
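For readers who want the Jensen step spelled out, here is a sketch of the argument for discrete distributions, writing the divergence as D(p ‖ q) and summing only over outcomes with p(x) > 0:

```latex
\begin{aligned}
-D(p \,\|\, q)
  &= \sum_{x:\,p(x)>0} p(x)\,\log\frac{q(x)}{p(x)} \\
  &\le \log \sum_{x:\,p(x)>0} p(x)\,\frac{q(x)}{p(x)}
     && \text{(Jensen, since $\log$ is concave)} \\
  &= \log \sum_{x:\,p(x)>0} q(x)
   \;\le\; \log 1 \;=\; 0 .
\end{aligned}
```

Hence D(p ‖ q) is at least zero. Because the logarithm is strictly concave, equality in the Jensen step forces the ratio q(x)/p(x) to be constant wherever p(x) > 0, and equality in the last step forces that constant to be 1, so the divergence vanishes only when q = p.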

Mutual information: how much one variable tells you about another

Entropy measures the uncertainty in one variable; [[mutual-information|mutual information]] measures how much learning one variable reduces your uncertainty about another. The clean definition is a KL divergence in disguise: I(X; Y) is the KL divergence between the true joint distribution p(x, y) and the product of the marginals p(x) p(y). Recall from the joint-distributions rung that the joint equals the product of marginals exactly when X and Y are independent. So mutual information is literally the divergence between "how the variables actually behave together" and "how they would behave if independent." It is zero precisely when X and Y are independent, and positive otherwise.

There is an equivalent and more intuitive reading: I(X; Y) = H(X) - H(X given Y). The first term is your uncertainty about X before you see Y; the second is your remaining uncertainty after you see Y. Their difference is exactly the uncertainty that observing Y wiped out — the information Y carries about X. By symmetry this also equals H(Y) - H(Y given X), so the information is mutual: Y tells you as much about X as X tells you about Y. One concrete picture: if a medical test result Y reduces your entropy about a disease state X from 1 bit to 0.3 bits, the test delivered 0.7 bits of information about the diagnosis.
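The two readings of mutual information can be compared numerically. The sketch below uses a hypothetical 2-by-2 joint table (rows for disease state X, columns for test result Y); its numbers are made up for illustration and are not the ones in the example above, and the function names are assumptions.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a list of probabilities."""
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)

def mutual_information(joint):
    """I(X;Y) in bits as the KL divergence from the joint to the product of marginals."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

def conditional_entropy(joint):
    """H(X given Y): entropy of X within each column y, weighted by p(y)."""
    total = 0.0
    for col in zip(*joint):
        py = sum(col)
        if py > 0:
            total += py * entropy([pxy / py for pxy in col])
    return total

joint = [[0.40, 0.10],      # hypothetical joint probabilities p(x, y)
         [0.05, 0.45]]
px = [sum(row) for row in joint]

mi_kl = mutual_information(joint)                     # KL definition
mi_diff = entropy(px) - conditional_entropy(joint)    # H(X) - H(X given Y)
independent = [[0.25, 0.25], [0.25, 0.25]]            # joint equals product of marginals

print(mi_kl, mi_diff, mutual_information(independent))
```

The two computations agree up to floating-point rounding, and the independent table gives zero, as the definition requires.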

Mutual information detects any statistical dependence, linear or not — this is its great advantage over the correlation coefficient, which can be zero for variables that are strongly but non-linearly related. But the same caution from earlier rungs applies in full force: a large I(X; Y) means the two variables share information, not that one causes the other. Information is symmetric; causation is not. Correlation is not causation, and neither is mutual information.
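A small constructed example makes the contrast with correlation concrete. Assume X is uniform on -1, 0, 1 and Y = X squared: Y is completely determined by X, yet the covariance (and therefore the correlation) is zero. The variable names are illustrative.

```python
import math
from collections import Counter

xs = [-1, 0, 1]                    # X uniform on three values (hypothetical example)
pairs = [(x, x * x) for x in xs]   # Y = X^2 is fully determined by X
n = len(pairs)

# Covariance of X and Y under the uniform distribution on the three pairs.
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
covariance = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n

# Mutual information in bits from the joint and the two marginals.
p_xy = {pair: c / n for pair, c in Counter(pairs).items()}
p_x = {x: c / n for x, c in Counter(x for x, _ in pairs).items()}
p_y = {y: c / n for y, c in Counter(y for _, y in pairs).items()}
mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

print(covariance)   # zero: no linear relationship
print(mi)           # about 0.92 bits: Y clearly depends on X
```

Here the mutual information equals the entropy of Y, because knowing X leaves no uncertainty about Y at all.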

From divergence to learning: the bridge to ML

Here is the payoff that ties this whole rung together. A machine-learning model is a parametric family of distributions q_theta, and "training" means choosing theta so q_theta best matches the unknown true distribution p of the data. The natural target is to minimise D(p || q_theta), the wasted bits. But we saw that this equals the cross-entropy minus H(p), which does not depend on theta, so minimising KL divergence is the same as minimising cross-entropy. And cross-entropy, estimated by averaging over your dataset, is exactly the negative average log-likelihood. Minimising it is therefore maximum likelihood estimation, the workhorse you met in the bridge-to-statistics guide, now revealed as an information-theoretic act.

Goal: make your model q_theta close to the true data distribution p, i.e. minimise D(p || q_theta) over the parameters theta.
Split it: D(p || q_theta) = H(p, q_theta) - H(p). Since H(p) is fixed by reality, minimising KL is the same as minimising cross-entropy H(p, q_theta).
Estimate cross-entropy from data: average -log q_theta(x) over your training examples. This is the negative average log-likelihood.
Minimising that average is exactly maximum likelihood. So 'minimise cross-entropy loss' and 'fit by maximum likelihood' are two names for one procedure.
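The four steps above can be sketched for the simplest possible model, a Bernoulli coin with one parameter theta. The data, the function name, and the grid search are all illustrative assumptions; a real fit would use an optimiser rather than a grid.

```python
import math

def avg_neg_log_likelihood(theta, data):
    """Empirical cross-entropy (in nats) of a Bernoulli(theta) model on 0/1 data."""
    return -sum(math.log(theta if x == 1 else 1.0 - theta) for x in data) / len(data)

data = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]   # hypothetical coin flips: 8 heads in 10

# Crude grid search over theta strictly inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = min(grid, key=lambda t: avg_neg_log_likelihood(t, data))

print(theta_hat)               # minimiser of the cross-entropy loss
print(sum(data) / len(data))   # maximum-likelihood estimate: the sample mean
```

Both lines report the same value, the fraction of heads, which is the Bernoulli maximum-likelihood estimate reached here by minimising average surprise.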

This is why the loss function in nearly every classifier you will meet is called cross-entropy loss, and why the same machinery sits underneath so many models. A hidden Markov model — a Markov chain whose states you cannot see directly, observed only through noisy emissions — is fit by maximising the likelihood of the observed sequence, the same cross-entropy minimisation dressed for time series. The whole sweep of this rung becomes one story: priors and posteriors from the Bayesian guides supply the distributions, Monte Carlo and MCMC let you compute with them when the integrals are intractable, statistics ties estimates to data, and information theory tells you what "a good fit" even means — the model that wastes the fewest bits.

Honest limits, and where the trail leads

A few honest cautions keep these elegant quantities from being misused. KL divergence and cross-entropy are defined relative to a model q, and they blow up to infinity if q assigns probability zero to an outcome that p considers possible — your model declared something impossible, then reality produced it, and the surprise is literally unbounded. This is why working models smear a little probability everywhere. Differential entropy, the continuous analogue using a density, loses some of the discrete version's clean guarantees: it can be negative, and it changes under a change of variables, because a density is not a probability and stretching the axis rescales it. The bit-counting intuition is exact for discrete sources and only a careful analogy in the continuous case.
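Both cautions can be seen in a short sketch. The distributions are hypothetical, and mixing q with a uniform distribution is just one simple way to smear probability everywhere, assumed here for illustration rather than taken from the text.

```python
import math

def kl_divergence(p, q):
    """KL divergence from p to q in bits; infinite if q rules out an outcome p allows."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue             # convention 0 log 0 = 0
        if qx == 0:
            return math.inf      # the model said "impossible", reality disagreed
        total += px * math.log2(px / qx)
    return total

def smooth(q, eps=1e-3):
    """Mix q with the uniform distribution so every outcome keeps some mass."""
    k = len(q)
    return [(1 - eps) * qx + eps / k for qx in q]

p = [0.7, 0.2, 0.1]   # hypothetical true distribution
q = [0.8, 0.2, 0.0]   # hypothetical model that rules out the third outcome

print(kl_divergence(p, q))           # inf
print(kl_divergence(p, smooth(q)))   # finite once q is smoothed

# Differential entropy of a uniform density on (0, a) is log(a) nats,
# which is negative whenever a < 1.
print(math.log(0.5))
```

The smoothed model still sums to one and pays a large but finite penalty, while the last line shows a perfectly ordinary density whose differential entropy is below zero.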

One more conceptual point closes the loop with the very first guide of this rung. The choice of which model family to even consider — the prior structure — is not handed to us by the data; it is an assumption, and the long-running Bayesian versus frequentist debate is partly about how openly to admit it. Information theory is neutral here: it tells you, given two candidate models, which wastes fewer bits, but it cannot conjure the right model family out of nothing. The famous slogan that the maximum-entropy distribution is the "least biased" choice subject to your constraints is genuinely useful, but it is a principle for choosing among models, not a guarantee that nature obeys it.

And so the ladder closes where it began, but transformed. We started by asking what a probability even is; we end by using probability to quantify information, to measure the distance between beliefs, and to define what it means for a machine to learn. Every tool from the earlier rungs reappears here in a new role — expectation becomes average surprise, Jensen's inequality proves a divergence is non-negative, conditional distributions become reduced uncertainty. The frontier ahead, in deep learning and modern statistics, is built almost entirely from these pieces. You now have the language to read it.