From Conditioning on an Event to Conditioning on Information

What you can already do, and what is missing

From the earlier rungs you are fluent with conditional probability given an event: once you learn that B happened, you recompute everything inside B, using P(A given B) = P(A and B) / P(B). The picture is sharp — knowing B happened shrinks the sample space down to B, and probabilities are renormalized so they add to one again inside that smaller world. You also met the conditional expectation given an event, E[X given B], the ordinary expectation of X computed in that shrunken world. All of this gives you one number per question.
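As a refresher, the event-level formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample space is a single fair die, and the events A and B and the helper names prob, cond_prob, and cond_expectation are hypothetical choices made for this sketch.

```python
from fractions import Fraction

# Assumed sample space: one fair die, every outcome has probability 1/6.
OMEGA = frozenset(range(1, 7))

def prob(event):
    """P(event) under the uniform measure on OMEGA."""
    return Fraction(len(event & OMEGA), len(OMEGA))

def cond_prob(a, b):
    """P(A given B) = P(A and B) / P(B); undefined when P(B) = 0."""
    if prob(b) == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return prob(a & b) / prob(b)

def cond_expectation(b):
    """E[X given B] for X = the roll: the plain average of the outcomes in B."""
    b = b & OMEGA
    if not b:
        raise ValueError("cannot condition on an event of probability zero")
    return Fraction(sum(b), len(b))

A = frozenset({5, 6})      # hypothetical event: the roll is at least 5
B = frozenset({1, 3, 5})   # hypothetical event: the roll is odd

print(cond_prob(A, B))       # 1/3
print(cond_expectation(B))   # 3
```

Each call returns one number, which is exactly the limitation the next paragraph sets out to remove.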

But there is a richer thing you will want very soon, especially once you reach martingales and stochastic processes. Instead of being told one specific event happened, you are told the value of another random variable Y — and then asked, what do we now expect X to be? That is conditional expectation given a variable, written E[X given Y]. The crucial twist is that you usually do not know which value of Y will occur, so the natural answer is not a single number but a whole rule: a number for each possible value y. The output of conditioning is starting to look like a function — and that is the doorway to this entire rung.

Conditioning on a variable, value by value

Let us make this concrete with tiny numbers. Roll a fair die and let X be the result. Let Y be 0 if the roll is odd and 1 if it is even. Conditioning on the event {Y = 0} (an odd roll) shrinks the world to {1, 3, 5}, each now with probability 1/3, so E[X given Y = 0] = (1 + 3 + 5) / 3 = 3. Conditioning on {Y = 1} (an even roll) gives the world {2, 4, 6} and E[X given Y = 1] = (2 + 4 + 6) / 3 = 4. Two events, two ordinary conditional expectations — nothing new yet.
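The two calculations above can be checked mechanically. The sketch below assumes the same fair die; the function name cond_exp_given_value is a hypothetical helper, not notation from this guide.

```python
from fractions import Fraction

OUTCOMES = range(1, 7)   # fair die, each outcome equally likely

def X(w):
    """X is the roll itself."""
    return w

def Y(w):
    """Y is 0 if the roll is odd and 1 if it is even."""
    return (w + 1) % 2

def cond_exp_given_value(y):
    """E[X given Y = y]: average X over the outcomes where Y equals y."""
    block = [w for w in OUTCOMES if Y(w) == y]
    if not block:
        raise ValueError("Y never takes this value")
    return Fraction(sum(X(w) for w in block), len(block))

print(cond_exp_given_value(0))   # 3  (odd rolls 1, 3, 5)
print(cond_exp_given_value(1))   # 4  (even rolls 2, 4, 6)
```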

Now do the bold thing: do not pick a value of Y in advance. Define a new object, E[X given Y], that returns 3 whenever the roll turns out odd (Y = 0) and 4 whenever it turns out even (Y = 1). Because Y is itself random, this object is a random variable, not a number: it takes the value 3 with probability 1/2 and 4 with probability 1/2. This is the single most important mental shift of the rung: conditioning on information produces a random variable whose value depends on which information you receive. The familiar number E[X given Y = y] is just this random variable read off at one particular y.
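One way to make the shift tangible is to write E[X given Y] as a function on the sample space, just like any other random variable, and then tabulate its distribution. The sketch below assumes the fair-die setup; cond_exp_as_rv is a hypothetical name for illustration.

```python
from fractions import Fraction

OUTCOMES = list(range(1, 7))   # fair die, each outcome has probability 1/6

def Y(w):
    """Y is 0 if the roll is odd and 1 if it is even."""
    return (w + 1) % 2

def cond_exp_as_rv(w):
    """Value of E[X given Y] at outcome w, where X is the roll:
    the average roll over all outcomes that share w's value of Y."""
    block = [v for v in OUTCOMES if Y(v) == Y(w)]
    return Fraction(sum(block), len(block))

# The random variable, outcome by outcome.
table = {w: cond_exp_as_rv(w) for w in OUTCOMES}

# Its distribution: add up the probability of every outcome mapped to each value.
distribution = {}
for w in OUTCOMES:
    value = table[w]
    distribution[value] = distribution.get(value, 0) + Fraction(1, 6)

print(table)          # outcomes 1, 3, 5 map to 3; outcomes 2, 4, 6 map to 4
print(distribution)   # value 3 with probability 1/2, value 4 with probability 1/2
```

Note that the function is constant on each group of outcomes that Y cannot tell apart, which is the observation the next section builds on.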

Information is a partition — and a sigma-algebra

Step back and ask what knowing Y actually buys you. In the die example, Y does not let you tell 1 from 3 from 5 — all three odd outcomes give Y = 0, so Y lumps them together. Likewise it cannot separate 2, 4, 6. So observing Y splits the sample space into the two groups {1, 3, 5} and {2, 4, 6}. That is exactly a partition of the sample space: a collection of disjoint blocks that together cover everything. The information Y carries is precisely the ability to say which block you are in — and nothing finer.

Here is the unifying leap. Take that partition and close it under unions and complements, collecting every set you can build from the blocks. In the die example that gives exactly four sets: {1, 3, 5}, {2, 4, 6}, the whole space, and the empty set. What you get is a sigma-algebra, the same structure you met when probability was put on rigorous footing. A sigma-algebra is best read not as dry bookkeeping but as a precise ledger of questions you can answer: an event is in it exactly when, given your information, you can decide yes-or-no whether that event occurred. So a sigma-algebra G is information, packaged so that mathematics can handle it.
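For a finite partition, the closure step can be carried out by brute force: the generated sigma-algebra consists of every union of blocks, including the empty union. The sketch below assumes a finite sample space; generated_sigma_algebra is a hypothetical helper written for this illustration.

```python
from itertools import combinations

def generated_sigma_algebra(blocks):
    """All unions of blocks of a finite partition, including the empty union.
    For a finite partition this is the sigma-algebra the blocks generate."""
    blocks = [frozenset(b) for b in blocks]
    events = set()
    for r in range(len(blocks) + 1):
        for chosen in combinations(blocks, r):
            events.add(frozenset().union(*chosen))
    return events

omega = frozenset(range(1, 7))
sigma_Y = generated_sigma_algebra([{1, 3, 5}, {2, 4, 6}])

for event in sorted(sigma_Y, key=lambda e: (len(e), sorted(e))):
    print(sorted(event))   # [], [1, 3, 5], [2, 4, 6], [1, 2, 3, 4, 5, 6]

# "The roll is odd" can be decided from Y; "the roll is at most 2" cannot.
print(frozenset({1, 3, 5}) in sigma_Y)   # True
print(frozenset({1, 2}) in sigma_Y)      # False
```

A partition with n blocks generates 2**n events this way, so the ledger grows quickly as the information gets finer.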

This reframing is why the rung's title speaks of conditioning on information. Conditioning on the variable Y is the same as conditioning on the sigma-algebra generated by Y — the ledger of all yes-or-no questions Y can settle. The advantage of the sigma-algebra language is generality: a filtration in a stochastic process is a growing chain of sigma-algebras, one for each time, recording everything known so far. Once you can condition on a sigma-algebra, you can condition on the entire past of a process, which is the engine behind martingales.

The averaging principle that ties it together

There is one rule that quietly governs everything, and it is worth seeing now in its simplest dress. If you take E[X given Y], a random variable, and average it over the randomness of Y, you get back the plain old E[X]. In the die example: E[X given Y] equals 3 half the time and 4 half the time, so its average is 3 times 1/2 plus 4 times 1/2 = 3.5 — exactly E[X] for a fair die. Conditioning rearranges the average across blocks; it never creates or destroys it.

This is the law of total expectation, E[E[X given Y]] = E[X], and in the sigma-algebra world it grows up into the tower property, the workhorse of the next two guides. Intuitively it says: predict in two stages (first guess X using the coarse information, then average those guesses) and you land where a single overall average would have landed. It is the expectation analogue of the law of total probability you already know: block averages take the place of conditional probabilities, and each is weighted by the probability of its block.

Die example, X = roll, Y = 0 if odd / 1 if even

  E[X | Y = 0] = (1 + 3 + 5)/3 = 3      P(Y = 0) = 1/2
  E[X | Y = 1] = (2 + 4 + 6)/3 = 4      P(Y = 1) = 1/2

  E[X | Y]  is the RANDOM VARIABLE:   3 (when odd),  4 (when even)

  Average it back:
  E[ E[X | Y] ] = 3*(1/2) + 4*(1/2) = 3.5 = E[X]   (law of total expectation)

E[X given Y] is a random variable; averaging it over Y recovers the unconditional E[X].
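The same check works for any way of grouping the outcomes, not just odd versus even. The sketch below assumes the fair die with X equal to the roll; total_expectation is a hypothetical helper, and the second grouping (roll at most 2 versus larger) is an extra example introduced only for this illustration.

```python
from fractions import Fraction

OUTCOMES = range(1, 7)   # fair die, X is the roll

def total_expectation(Y):
    """E[E[X given Y]]: sum over blocks of P(block) times the block average."""
    blocks = {}
    for w in OUTCOMES:
        blocks.setdefault(Y(w), []).append(w)
    return sum(
        Fraction(len(b), 6) * Fraction(sum(b), len(b))   # P(block) * E[X given block]
        for b in blocks.values()
    )

plain_mean = Fraction(sum(OUTCOMES), 6)            # E[X] = 7/2

by_parity = total_expectation(lambda w: (w + 1) % 2)   # blocks {1,3,5} and {2,4,6}
by_small = total_expectation(lambda w: w <= 2)         # blocks {1,2} and {3,4,5,6}

print(plain_mean, by_parity, by_small)   # 7/2 7/2 7/2
```

The blocks in the second grouping have unequal probabilities (1/3 and 2/3), so the weights matter, yet the overall average is unchanged.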

Honest cautions before we go deeper

A few traps are worth naming now so they do not bite later. First, do not confuse the number and the variable: E[X given Y = 3.5] is meaningless if Y never equals 3.5, but E[X given Y] as a function is perfectly well defined wherever Y lands. Second, E[X given Y] is a function of Y, never of X — once you know which block you are in, X may still wobble within that block, and the conditional expectation reports only the within-block average, not X itself.

Third, do not expect conditioning to leave your prediction unchanged. If X and Y are independent, so that knowing Y tells you nothing about X, then E[X given Y] collapses to the constant E[X], because every block has the same within-block average. But that is the special boring case. The whole point of conditional expectation is the interesting case where the blocks differ, so the information genuinely moves your prediction. Assume conditioning changes things unless independence says otherwise.
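To see both cases side by side, the sketch below enlarges the experiment: a fair die is rolled and, independently, a fair coin is flipped. The coin, the 12-point sample space, and the helper cond_exp are assumptions made for this illustration, not part of the die example above.

```python
from fractions import Fraction
from itertools import product

# Assumed setup: one fair die and one independent fair coin,
# giving 12 equally likely outcomes of the form (roll, face).
OMEGA = list(product(range(1, 7), ("H", "T")))

def cond_exp(X, Y):
    """Return {y: E[X given Y = y]} under the uniform measure on OMEGA."""
    blocks = {}
    for w in OMEGA:
        blocks.setdefault(Y(w), []).append(X(w))
    return {y: Fraction(sum(xs), len(xs)) for y, xs in blocks.items()}

def roll(w):
    return w[0]

def coin(w):
    return w[1]

def parity(w):
    return (w[0] + 1) % 2   # 0 if the roll is odd, 1 if it is even

given_coin = cond_exp(roll, coin)       # independent information
given_parity = cond_exp(roll, parity)   # informative information

print(given_coin)     # both blocks average 7/2, the unconditional E[X]
print(given_parity)   # block averages 3 and 4: the prediction moves
```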

With those guardrails in place, the road ahead is clear. The next guide pins down E[X given G] for a general sigma-algebra G by its defining properties (measurability and an averaging condition), so it works even for continuous Y, where each block {Y = y} has probability zero and the naive ratio P(A and B) / P(B) breaks down because P(B) is zero. After that we drill the tower property and the take-out-what-is-known rule, then meet the beautiful fact that E[X given G] is the best possible predictor of X in the least-squares sense, with its own clean geometry. You have just built the foundation all of that stands on.