Where we are: conditioning as a random variable
In the previous guide we promoted conditional expectation from a number to a random variable. Given a sigma-algebra G that encodes the information you have, E[X given G] is itself a random variable: it is G-measurable (you can compute it from the information in G alone), and it has the averaging property that E[X given G] integrates to the same total as X over every event in G. Those two clauses are the whole definition — everything in this guide is squeezed out of them.
Keep one concrete picture in mind throughout. Let X be a person's height, and let G be the information "which country they live in." Then E[X given G] is the random variable that, for each person, returns their country's average height. It is constant within each country (that is what G-measurable means here — it can only depend on country) and it matches the true heights on average inside every country. The two rules below, the tower property and taking out what is known, are simply the two most useful things you can do with such a country-average variable.
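The two defining clauses can be checked by hand on a tiny population. The following is a minimal sketch, assuming a made-up five-person population with the uniform probability on it; the country names, the heights, and the helper `conditional_expectation` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical toy population: (country, height in cm). Not real data.
people = [
    ("Northland", 166),
    ("Northland", 174),
    ("Southland", 158),
    ("Southland", 160),
    ("Southland", 165),
]

def conditional_expectation(values, labels):
    """Return E[X given G] row by row, where G is generated by `labels`.

    Each row receives the average of `values` over all rows sharing its
    label, under the uniform probability on the rows.
    """
    totals, counts = defaultdict(Fraction), defaultdict(int)
    for v, g in zip(values, labels):
        totals[g] += v
        counts[g] += 1
    return [totals[g] / counts[g] for g in labels]

countries = [c for c, _ in people]
heights = [Fraction(h) for _, h in people]
cond = conditional_expectation(heights, countries)

for c in sorted(set(countries)):
    in_c = [i for i, g in enumerate(countries) if g == c]
    # Clause 1 (G-measurable): a single value per country.
    assert len({cond[i] for i in in_c}) == 1
    # Clause 2 (averaging): same total as X over the event "lives in c".
    assert sum(cond[i] for i in in_c) == sum(heights[i] for i in in_c)

print([str(m) for m in cond])
```

Exact fractions are used instead of floats so that the averaging property can be tested with `==` rather than a tolerance.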
The tower property: averaging an average
The tower property says that if you take the conditional average and then average that, you get the plain average back: E[ E[X given G] ] = E[X]. In the heights picture this is almost obvious. Average each country's average height, weighting each country by how many people live there, and you recover the overall average height. No information was created or destroyed by the detour through countries — averaging the country-averages just reassembles the whole population. This special case, where you average E[X given G] all the way down to a number, is exactly the law of total expectation you met one rung earlier, now stated for a sigma-algebra instead of a partition.
The full strength of the tower property shows up when you have two nested layers of information. Suppose H is coarser than G — H knows less, say only "which continent," while G knows "which country." Then the tower property reads E[ E[X given G] given H ] = E[X given H]. The slogan is the coarser one wins: re-conditioning a fine average on coarser information collapses straight down to the coarse average. Average the country-averages within each continent and you simply get the continent-average. The finer step in the middle leaves no trace.
Heights X, H = continent (coarse), G = country (fine):

person   X     country   avg = E[X|G]   continent
--------------------------------------------------
Ann      172   Japan     168            Asia
Bo       164   Japan     168            Asia
Cy       180   Spain     180            Europe

E[X|H = Asia]        = (172+164)/2 = 168   <- direct
E[E[X|G]|H = Asia]   = (168+168)/2 = 168   <- via country avgs

Same number: the coarser sigma-algebra (continent) wins.
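The three-person example can be replayed mechanically. This is a small sketch, assuming the uniform probability on the three people; the helper `cond_exp` is a hypothetical name introduced here, not part of any library.

```python
from collections import defaultdict
from fractions import Fraction

def cond_exp(values, labels):
    """E[values given labels] under the uniform probability on the rows."""
    totals, counts = defaultdict(Fraction), defaultdict(int)
    for v, g in zip(values, labels):
        totals[g] += v
        counts[g] += 1
    return [totals[g] / counts[g] for g in labels]

# Ann, Bo, Cy from the example above.
heights = [Fraction(172), Fraction(164), Fraction(180)]
country = ["Japan", "Japan", "Spain"]     # G, the fine information
continent = ["Asia", "Asia", "Europe"]    # H, the coarse information

country_avg = cond_exp(heights, country)         # E[X given G]
via_country = cond_exp(country_avg, continent)   # E[ E[X given G] given H ]
direct = cond_exp(heights, continent)            # E[X given H]

# Nested form of the tower property: the coarser sigma-algebra wins.
assert via_country == direct

# Special case: averaging E[X given G] down to a number recovers E[X].
overall = sum(country_avg) / len(country_avg)
assert overall == sum(heights) / len(heights)

print([str(v) for v in via_country], str(overall))
```

Note that the per-person average of `country_avg` automatically weights Japan twice, which is the "weight each country by its population" step from the text.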
Taking out what is known
The second work-horse rule is taking out what is known, sometimes called pulling out a measurable factor. If a random variable Y is already determined by the information in G (that is, Y is G-measurable), then inside a conditional expectation given G, Y behaves exactly like a constant: E[ Y * X given G ] = Y * E[X given G], provided the product Y * X is integrable (for instance, when Y is bounded). The intuition is plain. Once G is known, Y is no longer random at all; you already know its value, so it slides out of the averaging just as a constant 7 would slide out of an ordinary expectation, E[7X] = 7 E[X].
Back to heights. Suppose Y is "the average income of your country," also a country-level quantity and hence G-measurable. To find E[ Y * X given G ], the conditional average within a country of country-income times height, you do not need to re-average Y, because within a single country Y is one fixed number. You just multiply that fixed Y by the country's average height E[X given G]. Knowing the country pins Y down completely, so Y passes through the averaging unchanged. A G-measurable factor is dead weight to the conditional average; it rides along outside.
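A quick numerical check makes the rule concrete, and also shows what goes wrong when the factor is not G-measurable. This is only a sketch: the income figures are invented for illustration, and `cond_exp` is a hypothetical helper computing group averages under the uniform probability on the three people.

```python
from collections import defaultdict
from fractions import Fraction

def cond_exp(values, labels):
    """E[values given labels] under the uniform probability on the rows."""
    totals, counts = defaultdict(Fraction), defaultdict(int)
    for v, g in zip(values, labels):
        totals[g] += v
        counts[g] += 1
    return [totals[g] / counts[g] for g in labels]

heights = [Fraction(172), Fraction(164), Fraction(180)]
country = ["Japan", "Japan", "Spain"]

# Y: country-level average income (hypothetical numbers), hence G-measurable.
avg_income = {"Japan": Fraction(40), "Spain": Fraction(30)}
Y = [avg_income[c] for c in country]

lhs = cond_exp([y * x for y, x in zip(Y, heights)], country)    # E[Y X given G]
rhs = [y * m for y, m in zip(Y, cond_exp(heights, country))]    # Y E[X given G]
assert lhs == rhs  # Y slides out like a constant

# Z: personal income (hypothetical), which varies inside Japan, so it is
# NOT G-measurable and may not be pulled out.
Z = [Fraction(50), Fraction(30), Fraction(30)]
lhs_bad = cond_exp([z * x for z, x in zip(Z, heights)], country)
rhs_bad = [z * m for z, m in zip(Z, cond_exp(heights, country))]
assert lhs_bad != rhs_bad

print([str(v) for v in lhs], [str(v) for v in lhs_bad])
```

The second half is there to show that measurability is doing real work: the identity is not a general property of products.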
Independent information adds nothing
There is a partner rule that completes the picture. Taking out what is known says G-measurable variables become constants. The mirror statement, conditioning on independent information, says that if X is independent of G, then knowing G tells you nothing about X, so E[X given G] = E[X], the plain unconditional mean. Learning which country someone lives in changes your guess of their height only insofar as height and country are related; if they were genuinely unrelated, the country-average would be the same everywhere and equal to the global average.
Be careful here, because this is where a classic trap lives. Independence is what makes conditioning collapse to the unconditional mean, and independence is strictly stronger than mere lack of correlation. Write Y for the variable that generates G. Then X and Y can have zero correlation yet still be dependent, and in that situation E[X given G] can genuinely vary with Y even though Cov(X, Y) = 0. So you may NOT shortcut to E[X given G] = E[X] just because X and Y are uncorrelated; you need real independence. Conversely, independent variables with finite second moments are always uncorrelated, so the implication runs one way only.
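A standard small example of the trap takes Y uniform on {-1, 0, 1} and X = Y^2. The sketch below computes the covariance and the conditional mean exactly; the variable names are illustrative only.

```python
from fractions import Fraction

# Y is uniform on {-1, 0, 1}; X = Y**2 is a deterministic function of Y.
outcomes = [-1, 0, 1]
p = Fraction(1, 3)

E_Y = sum(p * y for y in outcomes)
E_X = sum(p * y**2 for y in outcomes)
E_XY = sum(p * y**3 for y in outcomes)   # X * Y = Y**3
cov = E_XY - E_X * E_Y

# Given Y = y, X equals y**2 with certainty, so E[X given Y = y] = y**2.
cond_mean = {y: Fraction(y**2) for y in outcomes}

# Dependence: P(X = 1, Y = 0) = 0, but P(X = 1) * P(Y = 0) = (2/3) * (1/3).
p_joint = Fraction(0)
p_product = Fraction(2, 3) * Fraction(1, 3)

assert cov == 0                        # uncorrelated
assert p_joint != p_product            # yet dependent
assert set(cond_mean.values()) == {0, 1}   # conditional mean varies with Y
assert E_X == Fraction(2, 3)           # and differs from E[X] for every y

print(cov, E_X, cond_mean)
```

Here the conditional mean is 1 or 0 depending on Y, while the unconditional mean is 2/3, so replacing E[X given G] by E[X] would be wrong at every outcome.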
The geometry: projection onto what you know
All of this snaps into a single picture once you put it in L^2 space, the world of random variables with finite second moment, where the "length" of X is the square root of E[X^2] and the "angle" between X and Y is governed by E[XY]. In this geometry, [[conditional-expectation-as-l2-projection|E[X given G] is the orthogonal projection]] of X onto the subspace of all square-integrable G-measurable variables, that is, the closest such variable to X in mean-square distance. This is the deep reason it is the best predictor, which the next guide develops in full: the projection is the foot of the perpendicular, the nearest point in the flat plane of "things you could know."
From this single image both work-horse rules fall out as geometry. The tower property is iterated projection: projecting onto the fine subspace and then onto the coarser subspace inside it is the same as projecting straight onto the coarse subspace, and the coarse one wins because it is the final landing place. Taking out what is known says that multiplying by a bounded G-measurable factor Y commutes with the projection: Y times a variable in the plane stays in the plane, and Y times the error stays orthogonal to the plane, so Y * E[X given G] is the projection of Y * X. The error X minus E[X given G] is orthogonal to every square-integrable G-measurable variable; that orthogonality is the averaging property wearing geometric clothes.
One honest caution about scope. This crisp projection picture lives in L^2 and needs X to have a finite second moment, E[X^2] < infinity. Conditional expectation itself is more general — it is defined whenever E[|X|] < infinity, with no second moment required — and the tower property and taking-out rules hold in that wider L^1 world too. So the geometry is the most beautiful way to see the rules and the right intuition to carry around, but it is a special case, not the definition. The rules are true more broadly than the picture that explains them.
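On a finite population every variable has a finite second moment, so the projection picture can be verified directly. The sketch below, assuming the uniform probability on the three people from the heights example, checks that the error is orthogonal to each country indicator and that any other country-level guess is farther from X. The helpers `cond_exp`, `inner`, and `sq_dist`, and the competing guess, are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def cond_exp(values, labels):
    """E[values given labels] under the uniform probability on the rows."""
    totals, counts = defaultdict(Fraction), defaultdict(int)
    for v, g in zip(values, labels):
        totals[g] += v
        counts[g] += 1
    return [totals[g] / counts[g] for g in labels]

def inner(u, v):
    """L^2 inner product E[UV] under the uniform probability on the rows."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

def sq_dist(u, v):
    """Squared L^2 distance E[(U - V)^2]."""
    diff = [a - b for a, b in zip(u, v)]
    return inner(diff, diff)

heights = [Fraction(172), Fraction(164), Fraction(180)]
country = ["Japan", "Japan", "Spain"]

proj = cond_exp(heights, country)                  # E[X given G]
error = [x - m for x, m in zip(heights, proj)]     # X - E[X given G]

# Orthogonality: the country indicators span the G-measurable variables,
# so it is enough to test the error against each indicator.
for c in sorted(set(country)):
    indicator = [1 if g == c else 0 for g in country]
    assert inner(error, indicator) == 0

# Any other country-level predictor (an arbitrary guess) is farther from X,
# and the gap is exactly its squared distance to the projection (Pythagoras).
guess = {"Japan": Fraction(170), "Spain": Fraction(178)}
other = [guess[g] for g in country]
assert sq_dist(heights, proj) < sq_dist(heights, other)
assert sq_dist(heights, other) == sq_dist(heights, proj) + sq_dist(proj, other)

print(sq_dist(heights, proj), sq_dist(heights, other))
```

The Pythagorean identity in the last assertion is the quantitative form of "foot of the perpendicular": the extra error of any other predictor is its distance from the projection.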
Putting the rules to work
Watch the rules combine on a small problem. Roll a fair die and let N be the result; then flip N fair coins and let X be the number of heads. We want E[X]. Conditioning on N is natural because once N is fixed, X is just Binomial(N, 1/2), whose mean is N/2. So E[X given N] = N/2. Here G is the sigma-algebra generated by N, and N/2 is G-measurable because it is a function of N alone. Now the tower property finishes it in one stroke: E[X] = E[ E[X given N] ] = E[N/2] = (1/2) E[N] = (1/2)(3.5) = 1.75.
- Choose the information to condition on so the inner problem becomes easy. Here, conditioning on the die N turns X into a plain Binomial.
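- Compute the inner conditional expectation E[X given N] = N/2, a function of the known quantity N. This is the take-out step in action: given N, the number of coins is a known factor, so it comes out of the average and multiplies the per-coin mean 1/2.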
- Apply the tower property: average the inner answer over N. E[N/2] = (1/2)E[N], where the constant 1/2 pulls out by linearity.
- Plug in E[N] = 3.5 to get E[X] = 1.75 — two conditioning rules and a known mean, no messy double sum over all dice-and-coin outcomes.
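The steps above can be cross-checked with exact arithmetic. This sketch computes the inner conditional mean from the binomial probabilities rather than assuming the formula N/2, then compares the tower-property answer with the brute-force double sum the text avoids; the function name `binomial_mean` is illustrative.

```python
from fractions import Fraction
from math import comb

half = Fraction(1, 2)
die_faces = range(1, 7)
p_face = Fraction(1, 6)

def binomial_mean(n):
    """Mean of Binomial(n, 1/2), summed directly from its probabilities."""
    return sum(k * comb(n, k) * half**n for k in range(n + 1))

# Inner step: E[X given N = n] for each face n of the die.
inner_mean = {n: binomial_mean(n) for n in die_faces}
assert all(inner_mean[n] == Fraction(n, 2) for n in die_faces)

# Tower property: average the inner answer over the die.
tower = sum(p_face * inner_mean[n] for n in die_faces)

# Brute force: sum k * P(N = n, X = k) over every die-and-coin outcome.
brute = sum(
    p_face * comb(n, k) * half**n * k
    for n in die_faces
    for k in range(n + 1)
)

assert tower == brute == Fraction(7, 4)
print(tower, float(tower))
```

Both routes give 7/4 = 1.75; the conditioning route simply organizes the same sum so that the inner part is a known mean.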
Two closing threads point forward. First, these rules are the beating heart of a martingale, a process whose conditional expectation of tomorrow given everything known today equals today's value — the tower property is what makes "fair game" consistent across time, and you will lean on it constantly in the rungs ahead. Second, conditional expectation respects convex functions through the conditional Jensen inequality, E[g(X) given G] >= g(E[X given G]) for convex g, the conditional twin of ordinary Jensen. That inequality, combined with taking out what is known, is exactly what the next guides need to prove that E[X given G] is the best mean-square predictor and to crack open conditional variance.