The mean is not the whole story
By now E[X] feels familiar: the long-run average, the balance point of a distribution. But the mean alone can hide everything that matters. Picture two games. In game A you always win exactly 100 dollars. In game B you win 0 or 200 dollars on a fair coin flip. Both have E[X] = 100 — identical means — yet they could not feel more different to play. Game A is a sure thing; game B is a gamble. The mean is blind to that difference, because averaging 0 and 200 throws away the fact that you never actually get 100.
What separates the two games is spread: how far the outcomes typically stray from the center. We want a single number that captures it. A first idea — average the deviation from the mean, E[X - E[X]] — fails instantly, because that average is always exactly zero: the positive overshoots and the negative undershoots cancel by the very definition of the balance point. The mean is where the see-saw balances, so the signed gaps must net out to nothing. We need to stop the cancellation.
Two honest cures exist: take the absolute value of each deviation, or square it. Squaring wins, and not arbitrarily. Squares are smooth (easy to do calculus on), they punish large deviations more than small ones (a deviation of 4 contributes 16, not just 4), and — crucially — they make the algebra of sums behave, as we will see when we add variables. The mean absolute deviation is a perfectly valid measure of spread and is sometimes more robust; squaring is simply the choice that unlocks the cleanest theory.
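To see the cancellation and the two cures side by side, here is a minimal sketch in Python. The helper name spread_summaries and the dictionary representation of a distribution ({outcome: probability}) are illustrative choices, not part of the guide.

```python
from fractions import Fraction

def spread_summaries(pmf):
    """Return (mean signed, mean absolute, mean squared) deviation from the mean.

    pmf is a hypothetical representation: {outcome: probability}.
    """
    mu = sum(x * p for x, p in pmf.items())
    signed = sum((x - mu) * p for x, p in pmf.items())        # always cancels to 0
    absolute = sum(abs(x - mu) * p for x, p in pmf.items())   # first cure
    squared = sum((x - mu) ** 2 * p for x, p in pmf.items())  # second cure
    return signed, absolute, squared

game_a = {100: Fraction(1)}
game_b = {0: Fraction(1, 2), 200: Fraction(1, 2)}

print(*spread_summaries(game_a))  # 0 0 0
print(*spread_summaries(game_b))  # 0 100 10000
```

The signed average is zero for both games, so it cannot tell them apart. Either cure separates them.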
Variance: the expected squared distance
Here is the definition. Write mu for the mean E[X]. The variance is the average of the squared distance from the mean: Var(X) = E[(X - mu)^2]. Read it slowly as a recipe: take how far X falls from its center, square that gap so it cannot be negative, then average those squared gaps over the whole distribution, weighting each by its probability. A variable that hugs its mean has small squared gaps and a small variance; one that flings outcomes far away has large squared gaps and a large variance. Variance is exactly the expected squared deviation, no more and no less.
Notice this is just LOTUS in action. We have a random variable X with a known distribution, and we want the expected value of a function of it — here the function is g(x) = (x - mu)^2. LOTUS says we never need the distribution of that squared quantity itself; we just weight g(x) by X's own probabilities and add (or integrate). So for a discrete X, Var(X) = sum over x of (x - mu)^2 times P(X = x); for a continuous X you integrate (x - mu)^2 times the density. Variance is not a new kind of operation — it is an ordinary expectation of a particular function.
Let us nail game B with it. X is 0 or 200, each with probability 1/2, and mu = 100. The two squared deviations are (0 - 100)^2 = 10000 and (200 - 100)^2 = 10000, each weighted 1/2, so Var(X) = (1/2)(10000) + (1/2)(10000) = 10000. For game A, X is always 100, the deviation is always 0, and Var(X) = 0. There it is in numbers: the sure game has zero variance, the gamble has a large one, and the means — both 100 — never told them apart. Variance is precisely the dial the mean was missing.
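The same arithmetic can be sketched in a few lines of Python. This assumes a discrete distribution stored as a dictionary {outcome: probability}; the function names mean and variance are illustrative, and exact fractions are used so that no rounding sneaks in.

```python
from fractions import Fraction

def mean(pmf):
    """E[X] for a discrete distribution given as {outcome: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - mu)^2], computed straight from the definition."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

game_a = {100: Fraction(1)}                          # always 100
game_b = {0: Fraction(1, 2), 200: Fraction(1, 2)}    # fair coin: 0 or 200

for name, pmf in [("A", game_a), ("B", game_b)]:
    print(name, mean(pmf), variance(pmf))
# A 100 0
# B 100 10000
```

Note that variance is just LOTUS with g(x) = (x - mu)^2: each squared gap is weighted by the probability of x itself.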
Two facts follow for free. Variance can never be negative: it is an average of squares, and squares are never negative, so Var(X) >= 0 always — there is no such thing as negative spread. And Var(X) = 0 happens only in the extreme case where X never strays from its mean at all, meaning X equals a single constant with probability 1. Any genuine randomness gives a strictly positive variance.
The computational formula
Computing E[(X - mu)^2] directly is clumsy — you must find mu first, then re-sum over every outcome with that mu baked into each squared gap. There is a much friendlier route, the computational formula: Var(X) = E[X^2] - (E[X])^2. In words, the variance is the mean of the squares minus the square of the mean. You compute two ordinary expectations — E[X^2] and E[X] — and subtract. No re-centering, no second pass.
It falls right out of the definition once you expand the square and use linearity. Expanding gives E[(X - mu)^2] = E[X^2 - 2 mu X + mu^2]. Now linearity of expectation splits the sum: E[X^2] - 2 mu E[X] + mu^2. But E[X] is mu itself, so the middle term is -2 mu^2 and the last is +mu^2, leaving E[X^2] - mu^2 = E[X^2] - (E[X])^2. The whole derivation is just linearity plus the fact that mu is a constant you can pull out of an expectation.
- Pick a fair die, X uniform on 1..6. First find E[X]: average of 1 through 6 is 3.5.
- Find E[X^2] with LOTUS: (1 + 4 + 9 + 16 + 25 + 36)/6 = 91/6 which is about 15.1667.
- Subtract the square of the mean: Var(X) = 91/6 - (3.5)^2 = 15.1667 - 12.25 = 2.9167, i.e. 35/12.
- Sanity-check the order: E[X^2] = 15.17 is NOT the same as (E[X])^2 = 12.25, and their gap of 2.92 is the variance. Swapping the two would give a negative number — a red flag you made an error.
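The die steps above can be sketched in Python as a check that the computational formula and the definition agree. The variable names are illustrative, and exact fractions keep the answer as 35/12 rather than a rounded decimal.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # fair die: each face equally likely

e_x = sum(x * p for x in faces)        # E[X]   = 7/2
e_x2 = sum(x**2 * p for x in faces)    # E[X^2] = 91/6, via LOTUS

var_shortcut = e_x2 - e_x**2                            # mean of squares minus square of mean
var_definition = sum((x - e_x)**2 * p for x in faces)   # E[(X - mu)^2]

print(e_x, e_x2, var_shortcut)          # 7/2 91/6 35/12
print(var_shortcut == var_definition)   # True
```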
Standard deviation: spread in honest units
Variance has one awkward feature: its units are squared. If X is a payout in dollars, Var(X) is in dollars-squared, which is meaningless to picture. Squaring rescued us from cancellation, so to read the spread back in the original units we simply take the square root. That gives the standard deviation, sigma = square root of Var(X), written sigma(X) or just sigma. For game B, Var(X) = 10000, so sigma = 100 dollars — a clean statement that outcomes typically sit about 100 dollars from the mean of 100.
The standard deviation is the number people actually quote, because it lives on the same scale as the data and you can lay it next to the mean. A distribution summarized as 'mean 50, sigma 5' means typical values cluster within a few units of 50; 'mean 50, sigma 40' means they sprawl all over. For the bell curve X ~ Normal(mu, sigma^2), the parameters are literally the mean and the variance, and roughly 68 percent of the probability lies within one sigma of the mean, that is, between mu - sigma and mu + sigma. Sigma is the natural ruler for how wide a distribution is.
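The 68 percent figure for the normal distribution can be checked numerically. A minimal sketch, using the standard fact that a normal variable lands within k standard deviations of its mean with probability erf(k / sqrt(2)); the function name prob_within is illustrative.

```python
import math

def prob_within(k):
    """P(|X - mu| <= k * sigma) for X ~ Normal(mu, sigma^2), for any mu and sigma > 0."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The answer does not depend on mu or sigma, which is exactly what makes sigma a ruler: distances measured in sigmas mean the same thing for every normal distribution.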
How sigma reacts to rescaling and shifting is worth knowing, and it follows from the scale-and-shift rule: Var(aX + b) = a^2 Var(X). Two lessons hide here. Adding a constant b shifts every outcome and the mean together, so the gaps from the mean are untouched and the spread does not change — adding b drops out entirely. Multiplying by a stretches every gap by a, but variance squares gaps, so it scales by a^2; taking the root, the standard deviation scales by the absolute value of a. Double every payout and you double sigma but quadruple the variance.
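The scale-and-shift rule is easy to test on game B. The sketch below assumes the dictionary representation {outcome: probability}; the helper transform is hypothetical and requires a nonzero a so that distinct outcomes stay distinct.

```python
from fractions import Fraction

def variance(pmf):
    """Var(X) = E[(X - mu)^2] for a distribution given as {outcome: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def transform(pmf, a, b):
    """Distribution of a*X + b (assumes a != 0 so outcomes do not collide)."""
    return {a * x + b: p for x, p in pmf.items()}

game_b = {0: Fraction(1, 2), 200: Fraction(1, 2)}

print(variance(game_b))                    # 10000
print(variance(transform(game_b, 1, 50)))  # 10000  (shift only: unchanged)
print(variance(transform(game_b, 2, 0)))   # 40000  (a = 2: times 4)
print(variance(transform(game_b, -3, 7)))  # 90000  (a = -3: times 9, sign irrelevant)
```

The last line shows why sigma scales by the absolute value of a: flipping the sign mirrors the distribution but does not change its width.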
Adding variables: where independence finally matters
In the previous guide you saw the superpower: E[X + Y] = E[X] + E[Y] no matter what — linearity of expectation needs no independence at all. It is tempting to hope variance is just as carefree, but here the rule changes. The honest statement is the variance of a sum: Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y). That extra term, twice the covariance, measures how X and Y move together — it is positive if they tend to rise and fall in step, negative if one tends to rise as the other falls.
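To watch the covariance term at work, here is a small sketch that computes every piece of the formula exactly from a joint distribution. The three joint distributions are illustrative scenarios, not from the guide: two separate fair dice, one die counted twice (Y = X), and a die paired with its opposite face (Y = 7 - X).

```python
from fractions import Fraction
from itertools import product

def expect(joint, g):
    """E[g(X, Y)] for a joint pmf given as {(x, y): probability}."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def sum_variance_terms(joint):
    """Return (Var(X), Var(Y), Cov(X, Y), Var(X + Y)), each computed from its definition."""
    ex = expect(joint, lambda x, y: x)
    ey = expect(joint, lambda x, y: y)
    var_x = expect(joint, lambda x, y: (x - ex) ** 2)
    var_y = expect(joint, lambda x, y: (y - ey) ** 2)
    cov = expect(joint, lambda x, y: (x - ex) * (y - ey))
    var_sum = expect(joint, lambda x, y: (x + y - ex - ey) ** 2)
    return var_x, var_y, cov, var_sum

two_dice = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}
same_die = {(x, x): Fraction(1, 6) for x in range(1, 7)}
opposite = {(x, 7 - x): Fraction(1, 6) for x in range(1, 7)}

for joint in (two_dice, same_die, opposite):
    print(*sum_variance_terms(joint))
# 35/12 35/12 0 35/6
# 35/12 35/12 35/12 35/3
# 35/12 35/12 -35/12 0
```

In every row the last number equals the first plus the second plus twice the third. The opposite-face case is the striking one: X + Y is always 7, so the negative covariance cancels the individual variances completely.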
Only when the covariance is zero does variance add cleanly: Var(X + Y) = Var(X) + Var(Y). This is guaranteed if X and Y are independent, because independent variables have zero covariance. That is why averaging many independent measurements shrinks the spread. If each of n independent measurements has variance sigma^2, their sum has variance n sigma^2; dividing the sum by n multiplies the variance by 1/n^2 (the scale-and-shift rule with a = 1/n), so the average has variance sigma^2 / n and standard deviation sigma over the square root of n. The square-root law that governs sampling and the law of large numbers grows straight out of this additivity.
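The 1/n behaviour can be verified exactly rather than by simulation. The sketch below builds the distribution of the sum of n independent fair dice by repeated convolution, rescales it to the average, and computes the variance from the definition. The helper names convolve and variance_of_average are illustrative.

```python
from fractions import Fraction

def convolve(pmf_a, pmf_b):
    """Distribution of A + B when A and B are independent."""
    out = {}
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            out[a + b] = out.get(a + b, 0) + pa * pb
    return out

def variance(pmf):
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

die = {x: Fraction(1, 6) for x in range(1, 7)}

def variance_of_average(n):
    """Exact variance of the average of n independent fair dice."""
    total = die
    for _ in range(n - 1):
        total = convolve(total, die)
    average = {Fraction(s, n): p for s, p in total.items()}
    return variance(average)

for n in (1, 2, 4, 10):
    print(n, variance_of_average(n))
# 1 35/12
# 2 35/24
# 4 35/48
# 10 7/24
```

Each result equals 35/(12 n), the single-die variance divided by n (7/24 is 35/120 in lowest terms).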
What variance buys you, and what comes next
Variance is not just bookkeeping; it gives you a guarantee. The Chebyshev inequality says that for any distribution with finite variance, the probability of landing more than k standard deviations from the mean is at most 1/k^2. So at most a quarter of the probability can sit beyond two sigma, and at most a ninth beyond three sigma. It is a loose bound precisely because it assumes nothing about shape: it holds for every distribution whose variance is finite. A small sigma therefore literally pins outcomes near the mean, turning the spread into a concrete promise about how often you stray.
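Chebyshev's promise can be checked exactly on small examples. In the sketch below, game B is the coin-flip game from earlier, while long_shot is a hypothetical lopsided distribution made up for illustration. The comparison is done on squared distances so that no square roots are needed.

```python
from fractions import Fraction

def chebyshev_check(pmf, k):
    """Return (exact P(|X - mu| >= k * sigma), the Chebyshev bound 1/k^2)."""
    mu = sum(x * p for x, p in pmf.items())
    var = sum((x - mu) ** 2 * p for x, p in pmf.items())
    # |x - mu| >= k * sigma is equivalent to (x - mu)^2 >= k^2 * Var(X)
    tail = sum(p for x, p in pmf.items() if (x - mu) ** 2 >= k * k * var)
    return tail, Fraction(1, k * k)

game_b = {0: Fraction(1, 2), 200: Fraction(1, 2)}
long_shot = {0: Fraction(9, 10), 10: Fraction(1, 10)}  # hypothetical: mean 1, sigma 3

print(*chebyshev_check(game_b, 1))     # 1 1
print(*chebyshev_check(game_b, 2))     # 0 1/4
print(*chebyshev_check(long_shot, 2))  # 1/10 1/4
print(*chebyshev_check(long_shot, 3))  # 1/10 1/9
```

The first number never exceeds the second. Game B at k = 1 meets the bound exactly, and the lopsided example comes close at k = 3, which shows that the bound cannot be tightened without assuming something about shape.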
Two honest cautions before we move on. First, variance assumes the squared values have a finite average; some heavy-tailed distributions have infinite or undefined variance, and for them sigma simply does not exist and Chebyshev says nothing. Second, sigma summarizes the typical spread but says nothing about which side the surprises fall on — a distribution can have a long tail to the right and a short one to the left yet still report a single symmetric-looking sigma. Variance measures width, not lopsidedness.
That last gap is the doorway to the final guide of this rung. Variance is built from E[(X - mu)^2], the second moment about the mean; push to the third such average and you get a number that detects lopsidedness (skewness), to the fourth and you sense how heavy the tails are (kurtosis). These higher moments are the next dials of shape, and there is even a single object — the moment generating function — that packages them all at once. Mean located the distribution; variance scaled it; the moments to come will tell you its full silhouette.