Conditional Variance and the Decomposition Identity

From the best guess to the leftover spread

In the previous guide you learned that E[X given G] is the best predictor of X built from the information G — the function of your data that gets closest to X in the least-squares sense. A best guess naturally raises the next question: how good is it? After you have committed to that guess, how much of X is still unpredictable? That leftover, that residual wobble, is exactly what conditional variance measures. It is the partner of conditional expectation: one says where X tends to be given the information, the other says how spread out X still is around there.

The definition is the natural one: take the ordinary recipe [[prob-variance|Var(X) = E[X^2] - (E[X])^2]] and put a 'given G' on every expectation in sight. Formally, Var(X given G) is defined as E[(X - E[X given G])^2 given G], the conditional expectation of the squared distance of X from its conditional mean. Expanding the square gives the computational shortcut, Var(X given G) = E[X^2 given G] - (E[X given G])^2. Unlike the ordinary variance, which is a single number, this object is a random variable, just like the conditional expectation it is built from: it changes as the information you receive changes.
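The step from the definition to the shortcut takes one short expansion. As a sketch, write m for E[X given G] (a shorthand introduced only for this display) and assume X is square-integrable; because m is built from G, it can be pulled out of the conditional expectation:

```latex
\begin{aligned}
\operatorname{Var}(X \mid \mathcal{G})
  &= E\big[(X - m)^2 \mid \mathcal{G}\big] \\
  &= E[X^2 \mid \mathcal{G}] - 2m\,E[X \mid \mathcal{G}] + m^2 \\
  &= E[X^2 \mid \mathcal{G}] - 2m^2 + m^2 \\
  &= E[X^2 \mid \mathcal{G}] - m^2 ,
  \qquad m = E[X \mid \mathcal{G}] .
\end{aligned}
```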

A tiny worked example

Keep the die from the earlier guides: roll a fair die, let X be the result, and let Y be 0 if the roll is odd, 1 if even. We already found E[X given Y] equals 3 on odd rolls and 4 on even rolls. Now compute the spread inside each block. On the odd block {1, 3, 5}, the values sit at distances -2, 0, +2 from their mean 3, so the within-block variance is (4 + 0 + 4) / 3 = 8/3. The even block {2, 4, 6} is the same shape around mean 4, again 8/3. So here Var(X given Y) happens to equal 8/3 in both blocks — the spread is the same regardless of which half you are told you are in.

Now hold these two pieces side by side. The conditional mean E[X given Y] is itself a random variable taking the values 3 and 4, each with probability 1/2, so it has its own spread: its mean is 3.5, both values sit 0.5 away from that mean, and its variance is therefore (0.5)^2 = 1/4. Meanwhile the conditional variance Var(X given Y) is the constant 8/3 here, so its average is just 8/3. Add the average leftover spread to the spread of the guesses: 8/3 + 1/4 = 32/12 + 3/12 = 35/12. And the plain variance of a fair die roll is exactly 35/12. That is not a coincidence: it is the identity this whole guide is about.

Die: X = roll, Y = 0 if odd / 1 if even

  Within-block means     E[X|Y]:   3 (odd),   4 (even)
  Within-block variances Var(X|Y):  8/3 (odd), 8/3 (even)

  E[ Var(X|Y) ]  = 8/3 * 1/2 + 8/3 * 1/2 = 8/3      (avg leftover spread)
  Var( E[X|Y] )  = (3-3.5)^2 *1/2 + (4-3.5)^2 *1/2 = 1/4   (spread of guesses)

  Sum = 8/3 + 1/4 = 35/12 = Var(X)    (law of total variance)

Average leftover spread plus spread of the conditional means equals the total variance.
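The arithmetic in the die example can be checked with exact fractions. The following is a minimal sketch, not a general-purpose routine; the helper names `mean` and `variance` are illustrative choices, and the blocks are hard-coded to the odd/even split used above.

```python
from fractions import Fraction

# Fair die: each face has probability 1/6. Y = 0 for odd faces, 1 for even.
faces = range(1, 7)
prob = Fraction(1, 6)

def mean(values, weights):
    """Weighted mean; weights are assumed to sum to 1."""
    return sum(v * w for v, w in zip(values, weights))

def variance(values, weights):
    """Weighted variance around the weighted mean."""
    m = mean(values, weights)
    return sum((v - m) ** 2 * w for v, w in zip(values, weights))

# Split the faces into the two blocks defined by Y.
blocks = {
    0: [x for x in faces if x % 2 == 1],
    1: [x for x in faces if x % 2 == 0],
}

block_prob, cond_mean, cond_var = {}, {}, {}
for y, xs in blocks.items():
    block_prob[y] = prob * len(xs)
    w = [Fraction(1, len(xs))] * len(xs)   # uniform within the block
    cond_mean[y] = mean(xs, w)
    cond_var[y] = variance(xs, w)

ys = sorted(blocks)
py = [block_prob[y] for y in ys]
expected_cond_var = mean([cond_var[y] for y in ys], py)      # E[Var(X|Y)]
var_of_cond_mean = variance([cond_mean[y] for y in ys], py)  # Var(E[X|Y])
total_var = variance(list(faces), [prob] * 6)                # Var(X)

print(expected_cond_var, var_of_cond_mean, total_var)  # 8/3 1/4 35/12
```

Using `Fraction` rather than floats keeps the comparison exact, so the two pieces add to 35/12 with no rounding tolerance needed.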

The decomposition identity

The pattern in the example is a theorem, the law of total variance, sometimes called the variance decomposition or Eve's law. It states Var(X) = E[Var(X given G)] + Var(E[X given G]). In words: the total variability of X splits cleanly into two non-negative pieces — the average of the unexplained variability that remains within blocks, plus the variability of the block-to-block predictions themselves. Nothing is double-counted and nothing leaks; the budget always balances.

The proof is short and rests entirely on tools you already own. Start from Var(X) = E[X^2] - (E[X])^2. For the first term, apply the tower property to peel off a layer: E[X^2] = E[E[X^2 given G]]. Inside, write E[X^2 given G] = Var(X given G) + (E[X given G])^2, the conditional version of the computational formula. So E[X^2] = E[Var(X given G)] + E[(E[X given G])^2]. For the second term, the law of total expectation gives E[X] = E[E[X given G]], so (E[X])^2 = (E[E[X given G]])^2. Subtract: the two squared pieces combine into E[(E[X given G])^2] - (E[E[X given G]])^2, which is precisely Var(E[X given G]). What is left over is E[Var(X given G)]. Done.

Write Var(X) = E[X^2] - (E[X])^2, the ordinary computational formula.
Tower the first term: E[X^2] = E[E[X^2 given G]], then expand the inside as E[X^2 given G] = Var(X given G) + (E[X given G])^2.
Tower the mean: E[X] = E[E[X given G]], so the subtracted square is (E[E[X given G]])^2.
Collect the (E[X given G])^2 pieces into Var(E[X given G]) and read off Var(X) = E[Var(X given G)] + Var(E[X given G]).
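The same steps can be traced numerically on any finite joint distribution. The sketch below assumes the information G is generated by a discrete variable Y and that the joint probabilities are supplied as a dictionary; the function name `variance_parts` and the small example distribution are made up for illustration. Unlike the die, the conditional variance here differs between blocks.

```python
from collections import defaultdict
from fractions import Fraction

def variance_parts(joint):
    """Return (E[Var(X|Y)], Var(E[X|Y]), Var(X)) for a finite joint pmf.

    `joint` maps (x, y) pairs to probabilities that are assumed to sum to 1.
    """
    p_y = defaultdict(Fraction)   # marginal P(Y = y)
    s1 = defaultdict(Fraction)    # sum over x of x * P(x, y)
    s2 = defaultdict(Fraction)    # sum over x of x^2 * P(x, y)
    for (x, y), p in joint.items():
        p_y[y] += p
        s1[y] += x * p
        s2[y] += x * x * p

    ex = sum(s1.values())         # E[X], by the tower property
    ex2 = sum(s2.values())        # E[X^2], by the tower property

    within = Fraction(0)
    second_moment_of_means = Fraction(0)
    for y, py in p_y.items():
        m = s1[y] / py                        # E[X | Y = y]
        v = s2[y] / py - m * m                # Var(X | Y = y)
        within += py * v                      # builds E[Var(X|Y)]
        second_moment_of_means += py * m * m  # builds E[(E[X|Y])^2]

    between = second_moment_of_means - ex * ex   # Var(E[X|Y])
    total = ex2 - ex * ex                        # Var(X)
    return within, between, total

# Hypothetical joint distribution with unequal spread in the two blocks.
joint = {
    (0, "a"): Fraction(1, 4), (2, "a"): Fraction(1, 4),
    (1, "b"): Fraction(1, 8), (5, "b"): Fraction(1, 8), (9, "b"): Fraction(1, 4),
}
within, between, total = variance_parts(joint)
print(within, between, total)  # 6 25/4 49/4
```

Here block "a" has conditional variance 1 and block "b" has conditional variance 11, yet the average within-block variance 6 and the between-block variance 25/4 still add up to the total 49/4.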

Pythagoras in the space of random variables

The previous guide showed that E[X given G] is the orthogonal L^2 projection of X onto the random variables you can build from G. Think of square-integrable random variables as vectors, with inner product E[XY] and squared length E[X^2]. Projecting X onto the subspace of G-measurable variables splits X into two perpendicular pieces: the shadow E[X given G], which lives in the subspace, and the residual X - E[X given G], which is orthogonal to it. Orthogonality here is not a metaphor — the residual is uncorrelated with every variable the information can produce.

Now the law of total variance is just the Pythagorean theorem for this right triangle. Center everything at the mean: the total spread Var(X) is the squared length of X minus its mean; the leg Var(E[X given G]) is the squared length of the projected shadow (how far the guesses spread); the other leg E[Var(X given G)] is the squared length of the residual (the unexplainable wobble). Because the two legs are perpendicular, their squared lengths add to the hypotenuse — exactly Var(X) = Var(E[X given G]) + E[Var(X given G)]. The dry algebra of the proof and this clean picture are the same fact wearing two outfits.
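The right-triangle picture can be checked on the die from the worked example. This is a small sketch under the same setup (X the roll, Y its parity); the names `shadow`, `residual`, and `test_functions` are illustrative, and the three test functions of Y are arbitrary choices standing in for "every variable the information can produce".

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)                 # fair die

def expect(f):
    """E[f(X)] for the fair die."""
    return sum(f(x) * p for x in faces)

def y_of(x):
    """Y = 0 on odd rolls, 1 on even rolls."""
    return (x + 1) % 2

cond_mean = {0: Fraction(3), 1: Fraction(4)}   # E[X|Y] from the worked example
mu = expect(lambda x: x)                       # E[X] = 7/2

def shadow(x):
    """Centered projection: E[X|Y] - E[X]."""
    return cond_mean[y_of(x)] - mu

def residual(x):
    """What the projection misses: X - E[X|Y]."""
    return x - cond_mean[y_of(x)]

# Inner products E[residual * g(Y)] for a few arbitrary functions g of Y.
test_functions = [lambda y: 1, lambda y: y, lambda y: 10 - 7 * y]
inner_products = [
    expect(lambda x, g=g: residual(x) * g(y_of(x))) for g in test_functions
]

leg_shadow = expect(lambda x: shadow(x) ** 2)        # Var(E[X|Y])
leg_residual = expect(lambda x: residual(x) ** 2)    # E[Var(X|Y)]
hypotenuse = expect(lambda x: (x - mu) ** 2)         # Var(X)

print(leg_shadow, leg_residual, hypotenuse)  # 1/4 8/3 35/12
```

Each inner product comes out as an exact zero, which is the orthogonality claim, and the two squared legs sum to the squared hypotenuse.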

Reading the two pieces, and honest cautions

The decomposition is genuinely useful because the two terms have plain-English meanings. E[Var(X given G)] is the unexplained or within-group variance: the spread that survives even after you exploit the information. It equals the mean squared error of the best predictor E[X given G], so no function of the information can do better. Var(E[X given G]) is the explained or between-group variance: how much of X's spread is captured by the fact that the conditional mean moves as the information changes. Their ratio, Var(E[X given G]) / Var(X), is the fraction of variance explained (defined whenever Var(X) > 0), the same idea that powers R-squared in regression and the F-test in analysis of variance.
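To see the ratio move, it helps to compare two different pieces of information about the same die. The sketch below assumes a fair die and represents the information as a partition of the faces; the function name `explained_fraction` and the low/high split are additions for illustration, not part of the earlier example.

```python
from fractions import Fraction

def explained_fraction(blocks):
    """Var(E[X|G]) / Var(X) for equally likely faces.

    `blocks` is a partition of the faces; G tells you which block X fell in.
    Assumes the faces are not all equal, so Var(X) > 0.
    """
    faces = [x for block in blocks for x in block]
    n = len(faces)
    mu = Fraction(sum(faces), n)                       # E[X]
    total = sum((x - mu) ** 2 for x in faces) / n      # Var(X)
    between = sum(
        Fraction(len(b), n) * (Fraction(sum(b), len(b)) - mu) ** 2
        for b in blocks
    )                                                  # Var(E[X|G])
    return between / total

parity = explained_fraction([[1, 3, 5], [2, 4, 6]])    # odd vs even
low_high = explained_fraction([[1, 2, 3], [4, 5, 6]])  # low vs high
print(parity, low_high)  # 3/35 27/35
```

Both partitions cut the die into two equal halves, but parity explains only 3/35 of the variance while the low/high split explains 27/35, because the low/high block means (2 and 5) sit much farther apart than the parity block means (3 and 4).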

Use the identity to compute variances that would be painful head-on. Suppose N, the number of eggs a hen lays, is Poisson with mean lambda, and each egg hatches independently with probability p, giving X chicks. Given N = n, X is Binomial(n, p), so E[X given N] = pN and Var(X given N) = N p (1 - p). Then E[Var(X given N)] = p(1-p) E[N] = p(1-p) lambda, while Var(E[X given N]) = Var(pN) = p^2 Var(N) = p^2 lambda, because a Poisson variable's variance equals its mean. Adding: Var(X) = p(1-p) lambda + p^2 lambda = p lambda. The two-stage bookkeeping does in three lines what a direct attack on the compound distribution would labor over.
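For comparison, here is what the direct attack looks like numerically. This is a sketch only: it truncates the Poisson distribution at a cutoff `n_max` (an assumption that is harmless when lambda is small relative to the cutoff), and the function names and the values lambda = 4, p = 0.3 are chosen purely for illustration.

```python
import math

def poisson_pmf(lam, n_max):
    """P(N = n) for n = 0..n_max, built iteratively to avoid large factorials."""
    pmf = [math.exp(-lam)]
    for n in range(1, n_max + 1):
        pmf.append(pmf[-1] * lam / n)
    return pmf

def chick_variance_direct(lam, p, n_max=80):
    """Var(X) from the compound pmf, with the Poisson sum truncated at n_max."""
    pn = poisson_pmf(lam, n_max)
    # P(X = k) = sum over n >= k of P(N = n) * C(n, k) * p^k * (1-p)^(n-k)
    px = [
        sum(
            pn[n] * math.comb(n, k) * p**k * (1 - p) ** (n - k)
            for n in range(k, n_max + 1)
        )
        for k in range(n_max + 1)
    ]
    mean = sum(k * q for k, q in enumerate(px))
    second = sum(k * k * q for k, q in enumerate(px))
    return second - mean**2

def chick_variance_two_stage(lam, p):
    """Var(X) via the law of total variance."""
    within = p * (1 - p) * lam    # E[Var(X|N)]
    between = p**2 * lam          # Var(E[X|N])
    return within + between

lam, p = 4.0, 0.3
direct = chick_variance_direct(lam, p)
two_stage = chick_variance_two_stage(lam, p)
print(direct, two_stage)
```

The two numbers should agree up to floating-point and truncation error, both close to p * lambda = 1.2, while the two-stage version needs no sum at all.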

A few honesty checks. The whole edifice needs X to be square-integrable, that is, E[X^2] finite. For heavy-tailed laws like the Cauchy the variance is undefined and none of this applies, just as the CLT itself fails there. Also keep the two pieces apart: a small within-group variance does not by itself mean the conditioning is informative (X may have had little spread to begin with), and a small explained variance does not mean X is nearly constant (the within-group spread may still be large); you must look at both legs and their ratio. Finally, do not read 'explained' as 'caused': the decomposition is pure algebra of variances, and explained variance, like correlation, is not causation. The information may merely be a marker for something else doing the real work.