The Lebesgue Integral Is Expectation, Done Right

Two formulas for one idea, and why that bothered everyone

All through the earlier rungs you computed expectation two ways. For a discrete variable you summed value times probability, E[X] = sum of x * P(X = x). For a continuous one you integrated value times density, E[X] = integral of x * f(x) dx. They gave sensible answers, but they came from completely separate machinery — one a sum over atoms, the other a density fed through a Riemann integral — and worse, neither handled a mixed distribution that is part lump and part smear. A variable that is 0 with probability 1/2 and uniform on [0, 1] otherwise has no clean pmf and no clean pdf, yet it obviously has a mean. We were patching cases.
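As a quick sanity check on the mixed example above, here is a minimal Python sketch that computes its mean by weighting the lump and the smear separately, then compares against a simulation. The names (p_atom, sample_mixture), the seed, and the sample size are illustrative choices, not part of the original discussion.

```python
from fractions import Fraction
import random

# Mixture from the text: X = 0 with probability 1/2, otherwise X ~ Uniform[0, 1].
p_atom = Fraction(1, 2)          # weight of the lump at 0
atom_value = Fraction(0)         # where the lump sits
uniform_mean = Fraction(1, 2)    # mean of Uniform[0, 1]

# The lump uses the "sum" recipe, the smear uses the "integral" recipe.
exact_mean = p_atom * atom_value + (1 - p_atom) * uniform_mean
print(exact_mean)  # 1/4


def sample_mixture(rng: random.Random) -> float:
    """Draw one value: the atom at 0 half the time, a uniform draw otherwise."""
    return 0.0 if rng.random() < 0.5 else rng.random()


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n = 200_000
estimate = sum(sample_mixture(rng) for _ in range(n)) / n
print(estimate)  # close to 0.25, up to Monte Carlo noise
```

The exact answer needed both recipes glued together by hand, which is precisely the patching the single Lebesgue definition removes.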

Recall from earlier in this rung that a random variable is a measurable function X on a probability space (Omega, F, P). Expectation ought to be just one thing: the average of that function against the measure P. The trouble is that the Riemann integral you learned in calculus slices the horizontal axis — it chops the domain into thin vertical strips and asks for the function's height on each. That works for a smooth curve, but it falls apart the moment the function is wildly discontinuous, exactly the situation a general random variable hands you. The fix is to slice the other way.

Lebesgue's trick: slice the value axis, not the input axis

Henri Lebesgue described his idea with a coin metaphor. To count the money in your pocket, Riemann picks up coins one by one in the order he meets them and adds as he goes. Lebesgue instead first sorts the coins into piles by denomination — all the dimes here, all the quarters there — then multiplies each value by how many coins sit at that value. Same total, but Lebesgue's bookkeeping never cares about the messy order in which the coins appeared. Translated to a function: instead of asking "what is the height above each x?", you ask "how much input maps to roughly each output level y?" and sum y times the size of that input set.
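The coin metaphor can be sketched in a few lines of Python. The pocket contents below are made-up data for illustration only; the point is that both bookkeeping styles reach the same total.

```python
from collections import Counter

# Coin values in cents, in the order they come out of the pocket (made-up data).
pocket = [25, 10, 10, 5, 25, 1, 10, 25, 5, 1]

# Riemann-style: walk the coins in arrival order and add as you go.
riemann_total = 0
for coin in pocket:
    riemann_total += coin

# Lebesgue-style: sort into piles by value, then sum value * pile size.
piles = Counter(pocket)
lebesgue_total = sum(value * count for value, count in piles.items())

print(riemann_total, lebesgue_total)  # 117 117
```

In the Lebesgue version the pile sizes play the role of the measure: only how much input lands on each value matters, not where it appeared.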

That "size of an input set" is exactly the measure you built in the previous guides. In a probability space the relevant size is P, so the size of the set { X is near y } is just a probability. This is why the construction is so natural for us: Lebesgue integration weighs each output level by the probability of landing there, which is the literal meaning of an average. The definition is built in four honest layers, each resting on the one before, and it is worth seeing them.

1. Indicators. For a single event A, define the integral of the indicator 1_A to be P(A). This is the seed: the average of "1 if A happens, else 0" is just the chance of A.
2. Simple functions. A finite combination s = sum of a_k * 1_{A_k} (it takes only finitely many values) gets E[s] = sum of a_k * P(A_k). This is exactly the discrete formula you already trust, now read as a measure-weighted sum.
3. Nonnegative functions by approximation from below. For any X >= 0, define E[X] as the supremum of E[s] over all simple s with 0 <= s <= X. You squeeze X from underneath by staircases and take the least upper bound of their expectations; this always exists (possibly +infinity).
4. General functions by splitting. Write X = X_plus - X_minus into its positive and negative parts, integrate each (they are nonnegative), and subtract. If E[X_plus] and E[X_minus] are not both infinite, E[X] is defined (possibly infinite), and X is called integrable exactly when E[|X|] is finite.
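The approximation from below can be watched numerically. The sketch below assumes one concrete choice that the text does not make: X(w) = w^2 with w uniform on [0, 1], and the dyadic staircase s_n = floor(2^n * X) / 2^n. The function name staircase_expectation is hypothetical.

```python
import math


def staircase_expectation(n: int) -> float:
    """E[s_n] for s_n = floor(2^n * X) / 2^n, where X(w) = w^2 and w ~ Uniform[0, 1].

    The level set {k/2^n <= X < (k+1)/2^n} is the interval
    [sqrt(k/2^n), sqrt((k+1)/2^n)), whose probability is its length.
    """
    levels = 2 ** n
    total = 0.0
    for k in range(levels):
        y = k / levels  # output level taken by the staircase on this slice
        prob = math.sqrt((k + 1) / levels) - math.sqrt(k / levels)  # P(X in slice)
        total += y * prob
    return total


values = [staircase_expectation(n) for n in range(1, 11)]
for n, v in enumerate(values, start=1):
    print(n, round(v, 6))
# The values climb toward E[X] = 1/3 from below as the value axis is sliced finer.
```

Each term is an output level times the probability of landing near it, so the loop is a direct transcription of the simple-function formula with the value axis doing the slicing.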

The real payoff: when can you swap a limit and an integral?

Here is the question that drives almost all of probability theory: if X_n converges to X, does E[X_n] converge to E[X]? In other words, can you push a limit through the integral sign? Under the Riemann integral the standard license is uniform convergence, which is rarely available, and a pointwise limit of Riemann-integrable functions need not even be Riemann integrable. This is not pedantry: the limit theorems you have met, the law of large numbers and the central limit theorem included, each rely at some step on exchanging a limit with an expectation. So the worth of the Lebesgue integral is measured largely by how cleanly it lets you make that swap.

And the swap genuinely can fail, so we cannot just wave it through. Picture a moving spike: let X_n be n on the interval (0, 1/n) and 0 elsewhere, with the input uniform on [0, 1]. For every fixed point the spike eventually slides past and leaves it at 0, so X_n converges to X = 0 pointwise, and E[X] = 0. But each X_n has E[X_n] = n * (1/n) = 1, forever. So lim E[X_n] = 1 while E[lim X_n] = 0: the spike grows taller exactly as fast as its window shrinks, so every X_n carries one full unit of mass, yet in the limit that mass rests on no point at all. The convergence theorems supply exactly the hypotheses that forbid this leak.
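The moving spike is easy to check with exact arithmetic. This is a small sketch using Python's fractions module; the helper names spike and spike_expectation and the sample point 1/10 are illustrative choices.

```python
from fractions import Fraction


def spike(n: int, x: Fraction) -> int:
    """X_n from the text: height n on the open interval (0, 1/n), else 0."""
    return n if 0 < x < Fraction(1, n) else 0


def spike_expectation(n: int) -> Fraction:
    """E[X_n] under the uniform measure on [0, 1]: height times window length."""
    return n * Fraction(1, n)


x = Fraction(1, 10)  # an arbitrary fixed sample point
print([spike(n, x) for n in (1, 5, 9, 10, 11, 100)])  # [1, 5, 9, 0, 0, 0]

# Every expectation is exactly 1, even though the pointwise limit is 0.
print(all(spike_expectation(n) == 1 for n in range(1, 1001)))  # True
```

At the fixed point the values rise while the spike still covers it, then drop to 0 for good once 1/n falls to or below that point, while the expectation never moves.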

Three theorems that let the limit through

Three results, in rising order of usefulness, give you license to swap. The monotone convergence theorem (MCT) says: if 0 <= X_1 <= X_2 <= ... increase up to X, then E[X_n] increases up to E[X], with no exceptions. Monotone, nonnegative growth can never lose mass on the way up, so the limit always comes along. This is the theorem that justified step 3 of the very construction above, and it is also why E[sum of X_k] = sum of E[X_k] holds for nonnegative terms even with infinitely many of them.
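To see the infinite-sum consequence in action, here is a sketch under an assumed example that is not in the text: X_k(w) = w^k / k with w uniform on [0, 1]. Each term is nonnegative with E[X_k] = 1 / (k (k + 1)), and the partial sums increase pointwise to -ln(1 - w) on [0, 1). The helper names and the midpoint-rule approximation are illustrative.

```python
from fractions import Fraction
import math


def expected_partial_sum(K: int) -> Fraction:
    """E[X_1 + ... + X_K] with X_k(w) = w^k / k, using E[X_k] = 1 / (k (k + 1))."""
    return sum(Fraction(1, k * (k + 1)) for k in range(1, K + 1))


def midpoint_expectation(f, cells: int = 10_000) -> float:
    """Approximate E[f(U)] for U ~ Uniform[0, 1] by the midpoint rule."""
    return sum(f((i + 0.5) / cells) for i in range(cells)) / cells


partial = [expected_partial_sum(K) for K in (1, 10, 100, 1000)]
print([float(p) for p in partial])  # increasing toward 1

# Expectation of the pointwise limit of the partial sums.
limit_expectation = midpoint_expectation(lambda w: -math.log(1.0 - w))
print(limit_expectation)  # approximately 1
```

The sum of the expectations telescopes to 1 - 1/(K + 1), and the expectation of the limit is also 1, which is the equality MCT promises for this increasing nonnegative sequence.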

Fatou's lemma is the cautious cousin: for any nonnegative X_n it does not promise equality, only the one-sided E[lim inf X_n] <= lim inf E[X_n]. Read against the moving spike, Fatou says 0 <= 1, which is true and tells you mass can only leak away in the limit, never appear from nowhere. Fatou costs almost no hypotheses, so it is the safe first move when you know nothing else; you reach for equality only when a stronger theorem applies.
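Fatou's inequality can also be strict for a sequence that does not converge at all. The sketch below uses an assumed flip-flop example, not taken from the text: X_n is the indicator of [0, 1/2) for even n and of [1/2, 1] for odd n, with the input uniform on [0, 1]. Because this sequence has period 2, a minimum over a short tail stands in for the lim inf; that shortcut is specific to this example.

```python
def flip_flop(n: int, w: float) -> float:
    """Indicator of [0, 1/2) for even n and of [1/2, 1] for odd n."""
    in_left_half = w < 0.5
    return 1.0 if in_left_half == (n % 2 == 0) else 0.0


def midpoint_expectation(f, cells: int = 1_000) -> float:
    """Approximate E[f(U)] for U ~ Uniform[0, 1] by the midpoint rule."""
    return sum(f((i + 0.5) / cells) for i in range(cells)) / cells


tail = range(100, 110)  # period-2 sequence, so any tail of length >= 2 works

# lim inf of the expectations: every E[X_n] equals 1/2.
liminf_of_expectations = min(
    midpoint_expectation(lambda w, n=n: flip_flop(n, w)) for n in tail
)

# Expectation of the pointwise lim inf: each point sees a 0 every other step.
expectation_of_liminf = midpoint_expectation(
    lambda w: min(flip_flop(n, w) for n in tail)
)

print(expectation_of_liminf, liminf_of_expectations)  # 0.0 0.5
```

Here the left side is 0 and the right side is 1/2, so the inequality holds with room to spare and equality would be false.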

The workhorse is the dominated convergence theorem (DCT). If X_n converges to X (pointwise, or almost everywhere) AND there is a single integrable Y with |X_n| <= Y for all n, then E[X_n] converges to E[X], with full equality. The dominating Y is a fixed ceiling that all the X_n live under, and that ceiling is exactly what stops mass from sneaking off to infinity. The moving spike has no such ceiling: any Y above every spike must be at least sup over n of X_n, which is within 1 of 1/x on (0, 1] and so has infinite integral, and DCT correctly refuses to apply. When you can produce a finite-integral dominator, the swap is yours.

MCT  : 0 <= X_1 <= X_2 <= ... -> X      =>  lim E[X_n] = E[X]      (equality, monotone up)
Fatou: X_n >= 0                          =>  E[lim inf X_n] <= lim inf E[X_n]   (one-sided)
DCT  : X_n -> X  and  |X_n| <= Y, E[Y]<inf =>  lim E[X_n] = E[X]      (equality, with a ceiling)

moving-spike test:  X_n = n on (0, 1/n), else 0,   X_n -> 0
   E[X_n] = 1 for all n,  E[0] = 0
   MCT? no (not monotone)   DCT? no (any ceiling is at least about 1/x, integral = inf)
   Fatou: 0 <= 1   <- the only one of the three that still applies, and it holds

The same example, run past all three theorems. Only Fatou applies to the leaky spike, and it gives the honest inequality rather than a false equality.
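The ceiling condition in DCT can be probed numerically. The sketch below contrasts two cases. For the moving spike, the running maximum max(X_1, ..., X_N) equals N on (0, 1/N) and k on [1/(k+1), 1/k) for k < N; this piecewise description is worked out here and does not appear in the text. The second case, X_n = U^n with U uniform on [0, 1], is an assumed example of a sequence living under the integrable ceiling Y = 1. Function names are hypothetical.

```python
from fractions import Fraction


def envelope_expectation(N: int) -> Fraction:
    """E[max(X_1, ..., X_N)] for the moving spike X_n = n on (0, 1/n)."""
    total = N * Fraction(1, N)  # height N on the window (0, 1/N)
    for k in range(1, N):
        # height k on [1/(k+1), 1/k), an interval of length 1/k - 1/(k+1)
        total += k * (Fraction(1, k) - Fraction(1, k + 1))
    return total


def power_expectation(n: int) -> Fraction:
    """E[U^n] for U ~ Uniform[0, 1]; here |U^n| <= 1, an integrable ceiling."""
    return Fraction(1, n + 1)


# No integrable ceiling for the spike: the envelope's mean keeps growing.
print([float(envelope_expectation(N)) for N in (1, 10, 100, 1000)])

# A dominated sequence: U^n -> 0 almost everywhere and the expectations follow.
print([float(power_expectation(n)) for n in (1, 10, 100, 1000)])
```

The first list grows like a harmonic sum, so any single Y sitting above every spike has infinite expectation. The second list shrinks to 0, matching the expectation of the almost-everywhere limit, as DCT guarantees under a fixed ceiling.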

Honest fine print, and what these theorems quietly fix

A few cautions keep you out of trouble. First, the Lebesgue integral genuinely cannot see a set of probability zero: changing X on such a set never changes E[X]. This is the formal home of "a single point has probability zero" — the value of a continuous variable at any one number is irrelevant to its mean. So all three theorems only need their hypotheses to hold almost everywhere (with probability 1), not at literally every point. That is not a loophole; it is the whole reason the theory is robust.

Second, the dominator in DCT must be a single fixed Y that works for every n at once. Finding a different finite ceiling for each n separately is not enough (each moving spike X_n sits under the constant n, yet no one integrable Y covers them all), and that subtle gap is where most failed swaps hide. Third, none of this contradicts the convergence-in-distribution facts you will meet next; pointwise or almost-everywhere convergence of the variables is a stronger, more hands-on notion than convergence in distribution, and the convergence theorems are tools for the former. Used carelessly, the swap is simply false, as the spike showed: the theorems are permissions, not blanket guarantees.