Linearity of Expectation: The Superpower

The rule, and why it is a big deal

You already know expectation as the long-run average of a random variable, and you have just learned the law of the unconscious statistician for averaging a function of one variable. Now comes the rule that quietly powers half of applied probability. [[linearity-of-expectation|Linearity of expectation]] says that for any random variables X and Y and any constants a and b, E[aX + bY] = a E[X] + b E[Y]. Spelled out: the average of a sum is the sum of the averages, and constants slide straight out front. It looks almost too plain to be worth a name.

The plainness is a disguise. The astonishing part is the fine print that is missing: there is no requirement that X and Y be independent. They can be wildly correlated, tangled, defined on the same coin flips, even literally the same variable — and E[X + Y] is still E[X] + E[Y]. This is what makes it a superpower. Most rules in probability come fenced in by an independence assumption that real problems refuse to honour; linearity simply does not need one. You will spend the rest of this guide cashing in that freedom.

Why no independence is needed

It helps to see why the rule is so forgiving, because the reason is genuinely simple. Picture the underlying sample space as a list of outcomes, each with a probability. The variable X assigns a number to each outcome; so does Y. The new variable X + Y just adds those two numbers outcome by outcome. Now form the average of X + Y the honest way — weight each outcome's combined value by its probability and add it all up.

Because addition can be regrouped freely, that one big sum splits cleanly into the sum of X's contributions plus the sum of Y's contributions — which are exactly E[X] and E[Y]. At no point did we ask how X and Y relate to each other; each outcome carries both its X-value and its Y-value as a fixed pair, and we never needed the joint probabilities to factor. That is the secret: linearity is just the distributive law of arithmetic, applied one outcome at a time. Independence is about how probabilities multiply, and we never multiplied anything.

E[X + Y] = sum over outcomes w of  P(w) * ( X(w) + Y(w) )
         = sum  P(w)*X(w)  +  sum  P(w)*Y(w)
         =        E[X]      +        E[Y]

  -- regrouping a sum needs no independence --

  General form:   E[a1*X1 + a2*X2 + ... + an*Xn]
                = a1*E[X1] + a2*E[X2] + ... + an*E[Xn]

Linearity is just regrouping one weighted sum — independence never enters.
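
The regrouping argument can be checked exactly on a tiny sample space. The sketch below assumes two fair coin flips as the outcomes; the names `X`, `Y`, and `expect` are illustrative choices, not part of any library. Y is defined as the square of X, so the two are about as dependent as variables can be, yet both sides of the identity agree.

```python
from fractions import Fraction
from itertools import product

# Assumed toy sample space: two fair coin flips, each outcome has probability 1/4.
outcomes = list(product("HT", repeat=2))
prob = {w: Fraction(1, 4) for w in outcomes}

def X(w):
    """Number of heads in outcome w."""
    return w.count("H")

def Y(w):
    """A variable completely determined by X, hence strongly dependent on it."""
    return X(w) ** 2

def expect(f):
    """Weight each outcome's value by its probability and add everything up."""
    return sum(prob[w] * f(w) for w in outcomes)

lhs = expect(lambda w: X(w) + Y(w))   # E[X + Y], computed outcome by outcome
rhs = expect(X) + expect(Y)           # E[X] + E[Y], computed separately
print(lhs, rhs)                       # prints: 5/2 5/2
```

Exact fractions are used instead of floats so the equality is literal rather than approximate.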

The indicator trick: counting made easy

Linearity becomes a true superpower when paired with one companion idea: the [[indicator-random-variable|indicator random variable]]. An indicator I_A is simply 1 when event A happens and 0 when it does not. Its expectation is the gentlest fact in the subject: E[I_A] = 1 * P(A) + 0 * P(not A) = P(A). An indicator's average is just the probability of the thing it indicates. That tiny bridge — from a probability to an expectation — is what lets linearity do its work.

Here is the [[indicator-variable-trick|indicator-variable trick]] in full. To find the expected number of times some kind of thing happens, write that count as a sum of indicators, one per possible occurrence: N = I_1 + I_2 + ... + I_n. By linearity, E[N] = E[I_1] + ... + E[I_n] = P(occurrence 1) + ... + P(occurrence n). You have converted a hard counting problem into n easy probability problems and added them up. Crucially, the indicators are often heavily dependent, and you do not care, because linearity ignores dependence.

A two-second example: roll a fair die 60 times; how many sixes do you expect? Let I_k be 1 if roll k is a six. Each has E[I_k] = 1/6, and there are 60 of them, so the expected number of sixes is 60 * (1/6) = 10. That happens to match the mean of a binomial distribution, np, but notice we never summoned the binomial formula or its scary-looking probabilities — linearity gave the mean directly. The same one-line move works even when the trials are not independent, where the binomial formula would not apply at all.
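
As a rough check of the dice example, the sketch below computes the mean two ways: by adding 60 indicator expectations, and by the long route of weighting every possible count by its binomial probability. The variable names are illustrative; the binomial probabilities assume the 60 rolls are independent, which the indicator route never needs.

```python
from fractions import Fraction
from math import comb

n, p = 60, Fraction(1, 6)   # 60 rolls, each a six with probability 1/6

# Indicator route: add up E[I_k] = 1/6 over the 60 rolls.
mean_by_linearity = sum(p for _ in range(n))

# Long route: sum k * P(exactly k sixes) using the binomial probabilities.
mean_by_pmf = sum(
    k * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)
)

print(mean_by_linearity, mean_by_pmf)   # prints: 10 10
```

Both routes give exactly 10, but the first needs one line of arithmetic and the second needs 61 binomial terms.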

Two showpieces: hats and birthdays

The classic hat-check problem shows off the power. Suppose n people throw their hats in a pile and each grabs one back at random; how many people do you expect to get their own hat back? Tracking the full chaos (whose hat went where, and all the dependencies between who got what) is a genuine combinatorial nightmare. The indicator trick walks straight past it. Let I_k be 1 if person k recovers their own hat and 0 otherwise. Person k is equally likely to receive any of the n hats, so P(own hat) = 1/n, giving E[I_k] = 1/n.

Now sum and apply linearity: the expected number of people who get their own hat is E[I_1 + ... + I_n] = n * (1/n) = 1. Exactly one person, on average, no matter how large the crowd — whether 10 people or 10 million. The indicators are strongly dependent (if everyone else has their own hat, the last person must too), yet linearity sailed through untouched. Try getting this answer by listing permutations and you will appreciate just how much labour the rule saved.
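
For small crowds the permutation-listing route is still feasible, which makes it a handy sanity check. The sketch below assumes every assignment of hats is equally likely and averages the number of people holding their own hat over all n! assignments; `expected_own_hats` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import permutations

def expected_own_hats(n):
    """Average number of people with their own hat over all n! hat assignments."""
    total_matches = 0
    assignments = 0
    for perm in permutations(range(n)):
        # perm[person] is the hat that person received; a match is a fixed point.
        total_matches += sum(1 for person, hat in enumerate(perm) if person == hat)
        assignments += 1
    return Fraction(total_matches, assignments)

for n in range(1, 8):
    print(n, expected_own_hats(n))   # the second column is 1 for every n
```

The brute-force loop grows like n!, so it is only practical for small n, whereas the indicator argument covers every n at once.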

The same machine cracks the expected count of shared birthdays. The famous birthday problem usually asks for the probability that some pair matches, which needs the complement and a product of fractions. But the expected number of matching pairs is a one-liner: among n people there are n(n-1)/2 pairs, each matches with probability 1/365 (assuming 365 equally likely birthdays), so by linearity the expected number of coincident pairs is n(n-1)/2 * (1/365). For n = 23 there are 253 pairs, so that is 253/365, or about 0.69 expected pairs. Roughly two-thirds of a match on average from only 23 people is far more than intuition suggests, and it is consistent with the well-known fact that the probability of at least one shared birthday at n = 23 is already just over one half. Linearity turns intimidating combinatorics into arithmetic.
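
The birthday arithmetic can be sketched in a few lines. The code assumes 365 equally likely birthdays, independent across people, with no leap years; both function names are hypothetical. It computes the expected number of matching pairs by linearity and, for comparison, the probability that at least one pair matches via the complement.

```python
from fractions import Fraction
from math import comb

def expected_matching_pairs(n, days=365):
    """One indicator per pair, each with expectation 1/days."""
    return Fraction(comb(n, 2), days)

def prob_some_match(n, days=365):
    """Complement route: 1 minus the probability that all n birthdays differ."""
    p_all_distinct = Fraction(1)
    for k in range(n):
        p_all_distinct *= Fraction(days - k, days)
    return 1 - p_all_distinct

print(float(expected_matching_pairs(23)))   # about 0.693
print(float(prob_some_match(23)))           # about 0.507
```

The two numbers answer different questions: the first is an average count of pairs, the second a probability that the count is at least one.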

The limits: where linearity stops

Honesty about scope is what keeps this superpower from misfiring. Linearity governs the mean of a sum, but it says nothing direct about the spread of a sum. The variance of a sum is NOT in general the sum of the variances: Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y). That extra covariance term measures how X and Y move together, and it vanishes only when they are uncorrelated. So the moment you step from averages to variability, dependence comes roaring back and must be respected — a sharp contrast with the carefree world of E[X + Y].
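
To see the covariance term at work, the sketch below compares two extreme cases on an assumed sample space of two fair coin flips: X and Y as separate flips (independent), and X added to itself (identical). The helpers `E`, `var`, and `cov` are illustrative and simply apply the definitions.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product((0, 1), repeat=2))   # two fair coin flips, 1 = heads
p = Fraction(1, len(outcomes))               # each outcome has probability 1/4

def E(f):
    return sum(p * f(w) for w in outcomes)

def var(f):
    m = E(f)
    return E(lambda w: (f(w) - m) ** 2)

def cov(f, g):
    mf, mg = E(f), E(g)
    return E(lambda w: (f(w) - mf) * (g(w) - mg))

first = lambda w: w[0]    # indicator of heads on the first flip
second = lambda w: w[1]   # indicator of heads on the second flip

for name, X, Y in [("independent", first, second), ("identical", first, first)]:
    total = var(lambda w: X(w) + Y(w))
    print(name, total, var(X) + var(Y), cov(X, Y))
    # The identity from the text holds in both cases.
    assert total == var(X) + var(Y) + 2 * cov(X, Y)
# prints: independent 1/2 1/2 0
#         identical 1 1/2 1/4
```

In the identical case the variance of the sum is double the sum of the variances, and the gap is exactly 2 Cov(X, Y).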

Two final cautions round out the picture. First, linearity needs the individual expectations to actually exist; for a heavy-tailed variable whose mean is infinite or undefined (the Cauchy is the standard cautionary tale), the rule has nothing finite to add. Second, do not stretch linearity to products: E[XY] = E[X] E[Y] is not part of the rule, and it holds only when X and Y are uncorrelated, which independence guarantees. Recall that the implication runs one way: independent variables (with finite variances) are uncorrelated, but uncorrelated variables need not be independent. Linearity is the rare, beautiful tool that asks for almost nothing and gives back a great deal; just keep it on its own turf of means and sums, and it will serve you for the rest of probability.
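
A standard small example shows that zero covariance does not make variables independent. The sketch assumes X is uniform on {-1, 0, 1} and sets Y = X squared; this choice of example and the variable names are illustrative, not taken from the text above.

```python
from fractions import Fraction

values = (-1, 0, 1)      # assumed support of X, each value with probability 1/3
p = Fraction(1, 3)

E_X = sum(p * x for x in values)              # 0
E_Y = sum(p * x * x for x in values)          # E[X^2] = 2/3
E_XY = sum(p * x * (x * x) for x in values)   # E[X^3] = 0
covariance = E_XY - E_X * E_Y

# Independence would require P(X = 0 and Y = 0) = P(X = 0) * P(Y = 0).
p_joint = p          # Y = 0 happens exactly when X = 0
p_product = p * p    # P(X = 0) * P(Y = 0) = 1/3 * 1/3

print(covariance, p_joint, p_product)   # prints: 0 1/3 1/9
```

The covariance is zero, so the variance of X + Y would add cleanly, yet Y is completely determined by X.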