Conditional Expectation Given a Sigma-Algebra

From a single number to a random variable

In the previous guide you saw how conditioning on a whole sigma-algebra G means conditioning on a body of *information* rather than on one event. Now we turn that intuition into a precise object. The first mental shift to make — and it is the whole point of this guide — is that conditional expectation given a sigma-algebra, written E[X given G], is not a number. It is a random variable: a new function on the same sample space, whose value depends on which outcome occurs. Plain E[X] collapses X to one number; E[X given G] keeps a function around, one that has been smoothed down to whatever resolution G can see.

Picture the sample space chopped into blocks by the information in G. On each block, G cannot tell the outcomes apart — they are indistinguishable as far as G is concerned. So E[X given G] is forced to be *constant on every such block*, and the constant it takes there is the ordinary average of X over that block. If your information were the coarsest possible (G = {empty set, whole space}), there is only one block and the average over it is just E[X]: conditional expectation collapses back to the plain mean. If your information were total (G = everything), each outcome is its own block and E[X given G] = X exactly. Every honest G sits between these extremes, giving X a blurred portrait at a chosen sharpness.
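On a finite sample space the block picture can be computed directly. The following is a minimal sketch, not a general implementation: the four-outcome space, its probabilities, the values of X, and the helper name cond_exp are all made up for illustration, and G is assumed to be generated by a finite partition into blocks of positive probability.

```python
from fractions import Fraction


def cond_exp(prob, X, blocks):
    """Return E[X | G] as a dict outcome -> value, for G generated by `blocks`.

    Assumes `blocks` is a partition of the outcomes and every block has
    positive probability.
    """
    Y = {}
    for block in blocks:
        mass = sum(prob[w] for w in block)                # P(block)
        avg = sum(prob[w] * X[w] for w in block) / mass   # probability-weighted average of X
        for w in block:
            Y[w] = avg                                    # constant on the whole block
    return Y


# Hypothetical four-outcome space with unequal probabilities.
prob = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 8), "d": Fraction(1, 8)}
X = {"a": 0, "b": 4, "c": 8, "d": 16}

mean_X = sum(prob[w] * X[w] for w in prob)                 # plain E[X] = 4

coarsest = cond_exp(prob, X, [["a", "b", "c", "d"]])       # one block: no information
finest = cond_exp(prob, X, [["a"], ["b"], ["c"], ["d"]])   # each outcome is its own block
middle = cond_exp(prob, X, [["a", "b"], ["c", "d"]])       # 4/3 on {a, b}, 12 on {c, d}

print(coarsest, finest, middle, sep="\n")
```

Note that "average" here means the probability-weighted average, which only reduces to the plain arithmetic mean when the outcomes in a block are equally likely.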

The two defining properties

The block picture is the right intuition, but for general (possibly continuous) sigma-algebras 'block' is too crude a word, because the conditioning information may not chop the space into tidy lumps at all. The modern definition replaces blocks with a pair of clean requirements. For an integrable X (one with a finite mean), we call Y the conditional expectation E[X given G] if Y satisfies both defining properties: (1) Y is G-measurable, and (2) for every event A that lives in G, the integral of Y over A equals the integral of X over A, so that wherever A has positive probability the average of Y over A equals the average of X over A. Property (2) is the partial-averaging property, and it is the engine of everything.

Y = E[X | G]  is THE conditional expectation iff:

  (1) Measurability:   Y is G-measurable
                       (Y depends only on the information in G)

  (2) Partial averaging:   for every A in G,
         E[ Y * 1_A ]  =  E[ X * 1_A ]
      i.e.  integral of Y over A  =  integral of X over A

  Special case A = whole space:   E[Y] = E[X]

The two conditions that pin down E[X given G] uniquely (up to events of probability zero).

Read property (2) slowly, because it captures the whole idea: 'on every set you are allowed to ask about, Y carries the same total as X'. You may not be able to recover X outcome-by-outcome from G, but you can demand that Y match X's accumulated value on each resolvable region. That is exactly what 'the average of X at this resolution' should mean. Together the two properties force Y to be constant on the finest pieces G can distinguish, with that constant equal to X's average there — recovering the block picture wherever blocks exist, and gracefully covering the cases where they do not.
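When G is generated by a finite partition, the sets in G are exactly the unions of blocks, so both properties can be checked by brute force. The sketch below assumes a fair die, a made-up variable X (the squared face value) and a made-up three-block partition; the helper names sigma_algebra and integral are illustrative only.

```python
from fractions import Fraction
from itertools import combinations

outcomes = [1, 2, 3, 4, 5, 6]
prob = {w: Fraction(1, 6) for w in outcomes}
X = {w: w * w for w in outcomes}            # hypothetical X: squared face value
blocks = [(1, 2), (3, 4, 5), (6,)]          # hypothetical partition generating G

# Candidate Y: the probability-weighted average of X on each block.
Y = {}
for block in blocks:
    mass = sum(prob[w] for w in block)
    avg = sum(prob[w] * X[w] for w in block) / mass
    for w in block:
        Y[w] = avg


def sigma_algebra(blocks):
    """All unions of blocks, including the empty set and the whole space."""
    sets = []
    for r in range(len(blocks) + 1):
        for chosen in combinations(blocks, r):
            sets.append(frozenset(w for block in chosen for w in block))
    return sets


def integral(V, A):
    """E[V * 1_A]: the integral of V over the event A."""
    return sum(prob[w] * V[w] for w in A)


G = sigma_algebra(blocks)

# Property (1): Y is constant on every block, hence G-measurable.
y_measurable = all(len({Y[w] for w in block}) == 1 for block in blocks)

# Property (2): the integrals of Y and X agree on every set in G.
partial_averaging = all(integral(Y, A) == integral(X, A) for A in G)

# X itself passes (2) trivially but fails (1), so it is not the answer.
x_measurable = all(len({X[w] for w in block}) == 1 for block in blocks)

print(len(G), y_measurable, partial_averaging, x_measurable)
```

Both properties are needed: X satisfies partial averaging against itself, and only the measurability requirement rules it out.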

Why it exists at all: the projection argument

It is one thing to write down two properties; it is another to know that a random variable satisfying them actually exists, and is unique. The cleanest existence proof, at least for variables with finite variance, comes from geometry. Think of all random variables with finite variance as vectors in a space where the 'length-squared' of X is E[X^2] and the inner product of two variables U and V is E[UV]. This is a genuine Hilbert space (once variables that agree with probability one are identified), and the finite-variance G-measurable variables form a closed subspace inside it: the subspace of 'things you could know from G alone'.

In any such space, every vector has a unique nearest point inside a closed subspace — its orthogonal projection. The conditional expectation E[X given G] is exactly the projection of X onto the subspace of G-measurable variables. Geometrically it is the shadow X casts on the world G can see: the closest G-measurable variable to X. The error X - E[X given G] is orthogonal to that whole subspace, meaning E[ (X - E[X given G]) * Z ] = 0 for every G-measurable Z. Set Z = 1_A and you recover the partial-averaging property exactly — so the geometric definition and the two-property definition are the same statement seen from two angles.

Two honest caveats. The slick projection proof needs X to have finite variance, so that it lives in the L^2 space at all; for merely integrable X (finite mean but possibly infinite variance) existence still holds, but by a different argument resting on the Radon-Nikodym theorem you met in the measure-theory rung. And the projection is in the mean-square sense, not pointwise — E[X given G] is the variable whose *average squared distance* to X is smallest, which connects directly to the 'best predictor' story in guide 4 of this rung. For now the takeaway is simply: the object exists, it is unique up to probability zero, and geometrically it is the nearest G-measurable shadow of X.
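On a finite sample space the projection picture can be tested numerically. The sketch below rests on assumptions chosen purely for illustration: a five-outcome space with made-up probabilities and values of X, and a G generated by two blocks. Because indicators of disjoint blocks are orthogonal under the inner product E[UV], the projection onto their span uses the usual orthogonal-basis formula, and its coefficients turn out to be the block averages.

```python
import random

prob = [0.1, 0.2, 0.3, 0.25, 0.15]     # hypothetical probabilities (sum to 1)
X = [2.0, -1.0, 4.0, 0.5, 3.0]         # hypothetical values of X
blocks = [[0, 1], [2, 3, 4]]           # hypothetical partition generating G
n = len(prob)


def inner(U, V):
    """Inner product <U, V> = E[U V]."""
    return sum(p * u * v for p, u, v in zip(prob, U, V))


def indicator(block):
    return [1.0 if i in block else 0.0 for i in range(n)]


def combine(coefs, basis):
    """The G-measurable variable sum_k coefs[k] * basis[k]."""
    return [sum(c * e[i] for c, e in zip(coefs, basis)) for i in range(n)]


def mse(Z):
    """Mean squared distance E[(X - Z)^2]."""
    D = [x - z for x, z in zip(X, Z)]
    return inner(D, D)


basis = [indicator(b) for b in blocks]  # orthogonal, since the blocks are disjoint

# Projection onto an orthogonal basis: coefficient_k = <X, e_k> / <e_k, e_k>,
# which here equals E[X 1_B] / P(B), the block average.
coefs = [inner(X, e) / inner(e, e) for e in basis]
projection = combine(coefs, basis)

# The error is orthogonal to every block indicator (up to rounding).
residual = [x - y for x, y in zip(X, projection)]
orthogonality = [inner(residual, e) for e in basis]

# No other G-measurable candidate gets closer to X in mean square.
rng = random.Random(0)
best = mse(projection)
others = [mse(combine([rng.uniform(-10, 10) for _ in basis], basis)) for _ in range(1000)]

print(coefs, orthogonality, best, min(others))
```

The random search is only a sanity check, not a proof; the guarantee comes from the Pythagorean identity E[(X - Z)^2] = E[(X - Y)^2] + E[(Y - Z)^2] for the projection Y and any G-measurable Z of finite variance.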

A tiny worked example

Numbers make the abstraction land. Roll a fair die, so X is the face value, uniform on {1, 2, 3, 4, 5, 6} with E[X] = 3.5. Let G be the information 'is the result even or odd?' — a tiny sigma-algebra with just two real blocks: the odds {1, 3, 5} and the evens {2, 4, 6}. To build E[X given G] we average X within each block. Over the odds, (1 + 3 + 5)/3 = 3; over the evens, (2 + 4 + 6)/3 = 4. So E[X given G] is the random variable that equals 3 whenever the roll is odd and 4 whenever it is even.

Check measurability: the value 3-or-4 depends only on parity, which is exactly the information in G. Pass.
Check partial averaging on A = odds: average of Y over the odds is 3, and average of X over the odds is (1+3+5)/3 = 3. Match.
Check partial averaging on A = evens: average of Y is 4, average of X is (2+4+6)/3 = 4. Match.
Check the overall mean: E[Y] = (1/2)(3) + (1/2)(4) = 3.5 = E[X]. The smoothing preserved the global average, as it must.
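The four checks above can be reproduced with exact arithmetic. This is a small sketch of the die example only; the variable names are illustrative. It reports the integrals E[Y * 1_A] and E[X * 1_A], which are the block averages multiplied by P(A), so the odds give 3 * (1/2) = 3/2 rather than 3.

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)                       # fair die: each face has probability 1/6
odds = [w for w in faces if w % 2 == 1]
evens = [w for w in faces if w % 2 == 0]


def block_average(block):
    # Outcomes are equally likely, so the block average is the plain mean.
    return Fraction(sum(block), len(block))


# E[X | G] as a function on the sample space: 3 on odd faces, 4 on even faces.
Y = {w: block_average(odds) if w % 2 == 1 else block_average(evens) for w in faces}

# Partial averaging on each event: (E[Y * 1_A], E[X * 1_A]).
checks = {}
for name, A in [("odds", odds), ("evens", evens), ("whole space", faces)]:
    integral_Y = sum(p * Y[w] for w in A)
    integral_X = sum(p * w for w in A)
    checks[name] = (integral_Y, integral_X)

for name, (lhs, rhs) in checks.items():
    print(name, lhs, rhs, lhs == rhs)
```

The whole-space row is the overall-mean check: both integrals equal 7/2.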

Notice what happened to the spread. X ranged over six values; E[X given G] takes only two, 3 and 4, hugging much closer to the centre 3.5. That is general: smoothing to lower resolution can only shrink variance, never grow it, so Var(E[X given G]) <= Var(X). Here Var(X) = 35/12, about 2.92, while Var(E[X given G]) = 1/4. The variance that disappeared is precisely the within-block scatter that G can no longer see, and tracking that lost piece is the subject of conditional variance in guide 5. This is also a clean illustration of E[X given G] as a function of the conditioning variable: the special case E[X given W] that arises when G is generated by a single variable W (here, the parity of the roll).
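The variance bookkeeping for the die can be done exactly. The sketch below assumes the same fair die and parity information; the helper names mean and var are illustrative. It computes the variance of X, the variance of the smoothed variable, and the within-block scatter E[(X - E[X given G])^2], and shows that the last two add up to the first.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)                                   # fair die

X = {w: Fraction(w) for w in faces}
Y = {w: Fraction(3) if w % 2 == 1 else Fraction(4) for w in faces}   # E[X | parity]


def mean(V):
    return sum(p * V[w] for w in faces)


def var(V):
    m = mean(V)
    return sum(p * (V[w] - m) ** 2 for w in faces)


var_X = var(X)                                       # total spread of X
var_Y = var(Y)                                       # spread that survives the smoothing
within = mean({w: (X[w] - Y[w]) ** 2 for w in faces})  # scatter inside the blocks

print(var_X, var_Y, within, var_Y + within == var_X)
```

Here the surviving variance is 1/4 and the within-block scatter is 8/3, which together make up the full 35/12.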

Traps, and what comes next

A few misconceptions trip up almost everyone. First and loudest: E[X given G] is a random variable, not a number — only E[X] and E[X given A] for a fixed event A are numbers. The moment you condition on a whole sigma-algebra (or a whole variable), the answer is a function of the outcome. Second, do not confuse 'G-measurable' with 'independent of G'. If X is itself G-measurable (G already knows X), then E[X given G] = X — there is nothing left to average out. At the opposite end, if X is independent of G then conditioning tells you nothing and E[X given G] = E[X], the constant. Most variables sit between, and these two endpoints are the sanity anchors.

Third, beware reading the partial-averaging property as 'Y equals X on A'. It says the *integrals* match over A, not the pointwise values; E[X given G] typically differs from X at individual outcomes (in the die example it disagrees with the roll on every face except 3 and 4), agreeing only in accumulated totals over G-sets. And fourth, do not expect E[X given G] to be computable from G's labels alone without knowing the distribution of X: the labels tell you which block you are in, but you still need X's averages within each block to fill in the values.
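The two sanity anchors from the second trap can be seen side by side on a pair of fair dice. This sketch is illustrative only: X is taken to be the first die, and G is generated in turn by the first die (so G already knows X), by the second die (independent of X), and by the sum of the two (somewhere in between). The helper cond_exp is a made-up name that averages X over the level sets of the generating variable.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # all 36 ordered rolls of two dice
p = Fraction(1, 36)                            # each roll is equally likely


def cond_exp(X, label):
    """E[X | G] where G is generated by `label`; blocks are the level sets of `label`."""
    total = defaultdict(Fraction)              # E[X * 1_block], keyed by label value
    mass = defaultdict(Fraction)               # P(block), keyed by label value
    for w in omega:
        total[label(w)] += p * X(w)
        mass[label(w)] += p
    return {w: total[label(w)] / mass[label(w)] for w in omega}


def first_die(w):
    return w[0]


known = cond_exp(first_die, lambda w: w[0])           # G already sees X: returns X itself
independent = cond_exp(first_die, lambda w: w[1])     # G independent of X: constant E[X] = 7/2
between = cond_exp(first_die, lambda w: w[0] + w[1])  # G knows the sum: half the sum

print(known[(2, 5)], independent[(2, 5)], between[(2, 5)])
```

The third case also illustrates the fourth trap: knowing only that the sum is 7 says which block you are in, but the value 7/2 comes from averaging the first die over that block under the dice's distribution.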

With the object firmly defined, the rest of the rung is about working with it fluently. Guide 3 develops the two workhorse rules: the tower property (a coarse smoothing of a fine smoothing is just the coarse smoothing) and 'taking out what is known' (any factor G already sees pulls outside the conditional expectation). Guide 4 cashes in the projection picture by showing E[X given G] is literally the best mean-square predictor of X from G, and guide 5 measures the scatter left behind with conditional variance. Everything downstream rests on the single definition you have just unpacked: the unique, G-measurable variable whose averages match X on every set G can resolve.