Conditional Expectation as the Best Predictor

What "best" should even mean

By now you can compute E[X given G] and you've seen its two everyday tools, the tower property and taking out what is known. This guide answers a different and very human question: why should we *care* about this object? The honest motivation is prediction. You have a random quantity X you cannot see directly — tomorrow's demand, a hidden signal, a future price — but you do have some information, packaged as a sigma-algebra G. Your job is to commit to a single best guess of X built only from what G lets you know. The whole guide is about pinning down what "best" means and discovering that the winner is exactly E[X given G].

A guess is just some random variable Y that is measurable with respect to G — meaning Y is computable from the information in G alone, never peeking at the parts of the world G cannot see. To grade a guess we need a penalty for being wrong. The choice that makes everything beautiful is the mean squared error, MSE(Y) = E[(X - Y)^2]: take the gap between truth and guess, square it so over- and under-shooting both cost something and big misses cost a lot, then average. The best predictor is the G-measurable Y that makes this number as small as possible. Squaring is a genuine modeling choice, not the only one — but it is the one that turns prediction into clean geometry, as we'll see.

The clean proof that the conditional mean wins

Here is the argument, and it's short enough to carry in your head. Write Xhat = E[X given G] for the candidate, and let Y be any other G-measurable guess. Split the error into two pieces by inserting Xhat: X - Y = (X - Xhat) + (Xhat - Y). The first piece, X - Xhat, is the residual — the part of X that no amount of G-information can explain. The second piece, Xhat - Y, is the difference between two G-measurable guesses, so it is itself G-measurable. The magic is that these two pieces don't interfere when you square and average: the cross term vanishes.

Why does the cross term die? Expand: E[(X - Y)^2] = E[(X - Xhat)^2] + 2·E[(X - Xhat)(Xhat - Y)] + E[(Xhat - Y)^2]. Look at the middle term. The factor (Xhat - Y) is G-measurable, so by taking out what is known we can pull it through a conditional expectation given G. But the conditional expectation of the residual is zero: E[X - Xhat given G] = E[X given G] - Xhat = 0, because Xhat *is* E[X given G]. So the conditional expectation of the whole product is (Xhat - Y)·0 = 0, and by the tower property its plain expectation is 0 too. The same argument works with any square-integrable G-measurable Z in place of Xhat - Y, so the residual is uncorrelated with everything G can build.
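For readers who prefer to see the cross-term argument as one chain of equalities, here is a sketch in LaTeX. It assumes X and Y both have finite second moments, so that every product below is integrable; the symbols \hat X and \mathcal G are just typeset versions of Xhat and G.

```latex
\begin{align*}
\mathbb{E}\big[(X-\hat X)(\hat X - Y)\big]
  &= \mathbb{E}\Big[\,\mathbb{E}\big[(X-\hat X)(\hat X - Y)\,\big|\,\mathcal G\big]\Big]
     && \text{(tower property)} \\
  &= \mathbb{E}\Big[(\hat X - Y)\,\mathbb{E}\big[X-\hat X \,\big|\,\mathcal G\big]\Big]
     && \text{(taking out what is known)} \\
  &= \mathbb{E}\big[(\hat X - Y)\cdot 0\big] \;=\; 0
     && \text{(since } \hat X = \mathbb{E}[X \mid \mathcal G]\text{)}
\end{align*}
```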

MSE(Y) = E[(X - Y)^2]
       = E[(X - Xhat)^2]  +  E[(Xhat - Y)^2]
         \____________/      \____________/
         fixed cost          >= 0, zero only if Y = Xhat
         (irreducible)       (your avoidable error)

  where  Xhat = E[X given G]

The Pythagorean decomposition of error. The first term is out of your control; the second is yours to kill, and it's killed exactly by guessing the conditional mean.

Read the displayed identity. MSE(Y) is the irreducible cost E[(X - Xhat)^2] plus a non-negative term E[(Xhat - Y)^2] that depends on your choice. Since the second term is the expectation of a square, it is at least zero, and it equals zero only when Y = Xhat (up to events of probability zero). Therefore every guess pays at least E[(X - Xhat)^2], and only Xhat = E[X given G] pays exactly that floor. That is the precise sense in which E[X given G] is the best mean-square predictor of X given G: no other G-measurable guess can do better, and any deviation costs you extra, by exactly E[(Xhat - Y)^2], the mean squared distance you strayed.
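The decomposition can be checked exactly on a small made-up model. The sketch below assumes a hypothetical setup that is not part of the guide: two fair coin tosses a and b, with X = a + b and G revealing only the first toss. The names `rival`, `floor`, and `avoidable` are illustrative, and `rival` is an arbitrary competing guess.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy model: two fair coin tosses (a, b), each 0 or 1.
# X = a + b is the quantity to predict; G reveals only the first toss a.
outcomes = list(product([0, 1], repeat=2))
prob = Fraction(1, 4)  # all four outcomes are equally likely

def expect(f):
    """Exact expectation of f(a, b) over the four outcomes."""
    return sum(prob * f(a, b) for a, b in outcomes)

def x(a, b):
    return Fraction(a + b)

def xhat(a, b):
    # E[X given a] = a + E[b] = a + 1/2; it depends on a only.
    return a + Fraction(1, 2)

def rival(a, b):
    # An arbitrary competing guess that also uses a only.
    return Fraction(2 * a)

mse_rival = expect(lambda a, b: (x(a, b) - rival(a, b)) ** 2)
floor = expect(lambda a, b: (x(a, b) - xhat(a, b)) ** 2)
avoidable = expect(lambda a, b: (xhat(a, b) - rival(a, b)) ** 2)
cross = expect(lambda a, b: (x(a, b) - xhat(a, b)) * (xhat(a, b) - rival(a, b)))

print("MSE of rival guess:", mse_rival)
print("irreducible floor :", floor)
print("avoidable part    :", avoidable)
print("cross term        :", cross)
```

Swapping in any other function of a alone for `rival` changes `avoidable` but leaves `floor` untouched, which is the point of the identity.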

The geometry: a projection and a right angle

That decomposition was really the Pythagorean theorem in disguise, and recognizing this is the deepest payoff of the rung. Think of every random variable with a finite second moment as a vector in a space called L^2, where the "length-squared" of a vector Z is E[Z^2] and the "inner product" of two vectors is E[ZW]. In this geometry, distance between X and a guess Y is the square root of E[(X - Y)^2] — exactly our error. The G-measurable random variables form a flat subspace inside L^2: a plane, if you like. Finding the best predictor is finding the point on that plane closest to the vector X.

And the closest point on a plane to an outside point is always the foot of the perpendicular, the orthogonal projection. So E[X given G] is literally the projection of X onto the subspace of G-measurable variables, which is why this viewpoint goes by the name conditional expectation as an L^2 projection. The residual X - Xhat is the perpendicular dropped from X to the plane, and "perpendicular" here means E[(X - Xhat)·Z] = 0 for every G-measurable Z in L^2, which is exactly the cross-term-vanishing fact we proved. The right angle and the vanishing cross term are the same statement, seen from two directions.
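To make the projection picture concrete, here is a minimal sketch on a finite probability space, where a G-measurable variable is simply one that is constant on each atom of G. The probabilities, the values of X, and the two atoms are all made up for illustration; `inner` implements the inner product E[ZW] described above.

```python
from fractions import Fraction

# Hypothetical finite probability space: five outcomes with unequal weights.
probs = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10),
         Fraction(1, 10), Fraction(3, 10)]
x = [Fraction(v) for v in (2, 0, 4, 10, 1)]   # values of X, made up
atoms = [[0, 1, 2], [3, 4]]                   # G only reveals which atom occurred

def inner(z, w):
    """L^2 inner product <Z, W> = E[Z W]."""
    return sum(p * zi * wi for p, zi, wi in zip(probs, z, w))

def indicator(atom):
    return [Fraction(1) if i in atom else Fraction(0) for i in range(len(probs))]

# Indicators of disjoint atoms are mutually orthogonal, so the projection
# coefficient on each one is <X, 1_A> / <1_A, 1_A> = E[X 1_A] / P(A).
basis = [indicator(a) for a in atoms]
coeffs = [inner(x, e) / inner(e, e) for e in basis]

# The projection: on each outcome, the coefficient of the atom containing it.
xhat = [sum(c * e[i] for c, e in zip(coeffs, basis)) for i in range(len(x))]
residual = [xi - hi for xi, hi in zip(x, xhat)]

# Any G-measurable Z is constant on each atom; try an arbitrary one.
z = [Fraction(-7) if i in atoms[0] else Fraction(5, 3) for i in range(len(x))]

print("conditional means per atom:", coeffs)
print("<residual, Z> =", inner(residual, z))
```

The projection coefficients come out as the conditional means on each atom, and the residual has zero inner product with the test variable `z`, as the right-angle description predicts.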

A tiny worked example you can hold

Roll a fair die; let X be the number shown, so E[X] = 3.5. With no information, your best constant guess is 3.5, with squared error Var(X) = E[X^2] - (E[X])^2 = 91/6 - 12.25 ≈ 2.917. Now suppose the only information G you get is the parity: someone will tell you whether the roll is odd or even, nothing more. The best predictor must be a function of parity alone — one value for odd, one for even. The projection says: use the conditional mean on each parity class. Odd faces 1, 3, 5 average to 3; even faces 2, 4, 6 average to 4. So E[X given parity] equals 3 on odd rolls and 4 on even rolls.

Feel the improvement concretely. After projecting, the leftover squared error is the average squared gap within each class: for odds, the deviations of {1,3,5} from 3 are -2, 0, +2, giving an average squared miss of 8/3; for evens, {2,4,6} from 4 give the same 8/3. Averaging across the two equally likely classes leaves 8/3 ≈ 2.667. So learning mere parity drops the error from about 2.917 to about 2.667, a real if modest gain. The amount you gained is exactly 35/12 - 8/3 = 1/4, which is the variance *explained* by the parity information: it equals the variance of the predictor itself, which takes the values 3 and 4 with equal probability. That bookkeeping (total error = explained + unexplained) is the seed of the conditional variance decomposition you'll meet next.
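The arithmetic of the die example, written out in exact fractions as a LaTeX sketch. Here \hat X stands for E[X given parity]; nothing beyond the numbers already in the example is assumed.

```latex
\begin{align*}
\operatorname{Var}(X) &= \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \approx 2.917, \\
\mathbb{E}\big[(X-\hat X)^2\big] &= \tfrac12\cdot\tfrac{8}{3} + \tfrac12\cdot\tfrac{8}{3}
  = \frac{8}{3} \approx 2.667, \\
\operatorname{Var}(\hat X) &= \tfrac12\big(3-\tfrac72\big)^2 + \tfrac12\big(4-\tfrac72\big)^2
  = \frac14, \\
\frac{35}{12} &= \frac{8}{3} + \frac{1}{4}
  \qquad\text{(total = unexplained + explained)}.
\end{align*}
```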

1. Identify the information G and the guesses it allows: here, any function constant on each parity class.
2. On each piece of G, compute the conditional mean of X; that value is the projection on that piece.
3. Assemble these piecewise means into one random variable; that is E[X given G], your best predictor.
4. Score it: the remaining error is the average within-piece variance, and the drop from Var(X) is the variance the information explained.
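The four steps above can be sketched as one small function for any finite sample space. The name `best_predictor` and its argument layout are hypothetical choices for this illustration, and the sketch assumes every piece of G has positive probability.

```python
from fractions import Fraction
from collections import defaultdict

def best_predictor(values, probs, labels):
    """Follow the four steps on a finite sample space.

    values[i] is X on outcome i, probs[i] its probability, and labels[i]
    names the piece of G that outcome i falls in. Returns (xhat, remaining,
    explained), where xhat[i] is E[X given G] evaluated on outcome i.
    """
    # Steps 1-2: group outcomes by piece and compute each conditional mean.
    mass = defaultdict(Fraction)
    weighted = defaultdict(Fraction)
    for v, p, lab in zip(values, probs, labels):
        mass[lab] += p
        weighted[lab] += p * v
    cond_mean = {lab: weighted[lab] / mass[lab] for lab in mass}

    # Step 3: assemble the piecewise means into one random variable.
    xhat = [cond_mean[lab] for lab in labels]

    # Step 4: score it against the no-information benchmark Var(X).
    mean = sum(p * v for v, p in zip(values, probs))
    total = sum(p * (v - mean) ** 2 for v, p in zip(values, probs))
    remaining = sum(p * (v - h) ** 2 for v, p, h in zip(values, probs, xhat))
    return xhat, remaining, total - remaining

# The fair die with parity as the only information.
faces = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6
parity = ["odd" if f % 2 else "even" for f in faces]

xhat, remaining, explained = best_predictor(faces, probs, parity)
print("predictor per face:", [str(h) for h in xhat])
print("remaining error   :", remaining)
print("explained variance:", explained)
```

Using exact fractions rather than floats keeps the bookkeeping identity visible without rounding noise; for real data one would more likely use floating-point arrays.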

Honest limits and a final reframe

Be clear about what "best" does and does not promise. First, it is best only among guesses built from G. Give the predictor more information (a finer sigma-algebra) and it can do at least as well, never worse, because the bigger plane contains the smaller one, so its closest point to X is at least as close. Second, the optimality is in the squared-error average; it does not say E[X given G] will be close on any single trial. In the die example the best guess on an even roll is 4, yet the actual value is 2 or 6 two times out of three, a miss of 2. The conditional mean minimizes the *average* squared miss, not the miss on the next throw. Confusing those is a cousin of the gambler's-fallacy mistake of expecting averages to control individual outcomes.
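Both cautions can be seen numerically on the die. The sketch below compares the best-predictor error under four nested information sets; the third one, parity plus "is the roll above 3?", is a hypothetical refinement added only for illustration. It then lists the individual misses on even rolls.

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

def residual_mse(partition):
    """Mean squared error of the conditional mean for a fair die,
    where `partition` lists the pieces the information can distinguish."""
    total = Fraction(0)
    for piece in partition:
        mean = Fraction(sum(piece), len(piece))
        total += sum((f - mean) ** 2 for f in piece)
    return total / len(faces)

# Hypothetical information sets, each refining the previous one.
nothing = [[1, 2, 3, 4, 5, 6]]
parity = [[1, 3, 5], [2, 4, 6]]
parity_and_half = [[1, 3], [5], [2], [4, 6]]   # parity plus "is it above 3?"
everything = [[f] for f in faces]

errors = [residual_mse(p) for p in (nothing, parity, parity_and_half, everything)]
print("errors, coarse to fine:", [str(e) for e in errors])

# Average optimality is not per-throw accuracy: on even rolls the guess is 4.
misses_on_even = [abs(f - 4) for f in (2, 4, 6)]
print("misses on even rolls  :", misses_on_even)
```

The error never increases as the partition gets finer, yet even the parity-optimal guess misses by 2 on most individual even rolls.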

Two more honest notes. The squared-error framing requires X to have a finite second moment, E[X^2] < infinity, so that it actually lives in L^2 and the projection exists. A heavy-tailed X with finite mean but infinite variance (a Student t with 2 degrees of freedom, say) breaks the geometry, even though E[X given G] can still be defined through its defining property whenever E|X| is finite. The Cauchy, the textbook heavy-tailed offender, is worse still: it has no mean at all, so even that more general definition does not apply. And do not over-read "projection" as "approximation that throws away noise on purpose": the residual X - Xhat is not error you could have avoided with cleverness. It is the genuinely unpredictable part of X relative to G, and any guess that pretends to capture it is just overfitting the unseeable.

Step back and the rung clicks together. The defining property you met earlier — E[X given G] is the G-measurable variable whose averages match X over every G-event — was an algebraic demand. The best-predictor theorem reveals what that demand secretly *is*: it is the equation of a perpendicular, the unique direction in which the leftover error has no usable component. Conditioning is not a formula to memorize; it is the act of projecting reality onto what you are allowed to know. Next we measure the size of the leftover — the conditional variance — and watch total variability split cleanly into the part information explained and the part it could not.