The Variance of a Sum and Conditioning

Means add for free; spreads do not

You already own a beautiful, unconditional gift from an earlier rung: linearity of expectation. No matter how X and Y are tangled together, E[aX + bY] = a E[X] + b E[Y]. Means just add, dependence be damned. It is tempting to hope variance behaves the same way, that Var(X + Y) is simply Var(X) + Var(Y). That hope is right exactly when X and Y have zero covariance, and wrong the moment they don't. Spread, unlike the mean, can feel whether two variables move together.

Why the difference? Variance is built from a square, and squares create cross terms. Start from the definition with means subtracted: Var(X + Y) = E[((X - E[X]) + (Y - E[Y]))^2]. Expand the square inside and you get three pieces: (X - E[X])^2, (Y - E[Y])^2, and twice the product (X - E[X])(Y - E[Y]). Take expectations of each and the first two are Var(X) and Var(Y) — but the third is exactly twice the covariance you met two guides ago. The square refuses to forget how the two deviations line up.

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)

  Cov > 0  ->  sum is MORE spread than the parts alone
  Cov < 0  ->  sum is LESS spread (the two partly cancel)
  Cov = 0  ->  variances simply add

The master formula. The 2 Cov(X, Y) cross term is the whole story; everything else in this guide is reading it carefully.
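
To see the master formula hold on actual numbers, here is a minimal sketch that checks it exactly on a small joint distribution. The four probabilities in `joint` are made up for illustration, and the helper names `expect` and `var` are ours, not from any library.

```python
from fractions import Fraction as F

# Hypothetical joint distribution of (X, Y): {(x, y): probability}.
joint = {
    (0, 0): F(3, 10),
    (0, 1): F(1, 10),
    (1, 0): F(2, 10),
    (1, 1): F(4, 10),
}

def expect(g):
    """E[g(X, Y)] under the joint distribution."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

def var(g):
    """Var(g(X, Y)) = E[g^2] - (E[g])^2."""
    return expect(lambda x, y: g(x, y) ** 2) - expect(g) ** 2

ex = expect(lambda x, y: x)
ey = expect(lambda x, y: y)
cov_xy = expect(lambda x, y: (x - ex) * (y - ey))

var_x = var(lambda x, y: x)
var_y = var(lambda x, y: y)
var_sum = var(lambda x, y: x + y)

# Both sides of Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
print(var_sum, var_x + var_y + 2 * cov_xy)
```

Exact fractions are used instead of floats so the two sides can be compared with plain equality. Here the covariance is positive, so the sum is more spread out than Var(X) + Var(Y) alone would suggest.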

When the cross term vanishes — and what it buys you

If X and Y are independent, their covariance is zero (you proved this last rung: independence forces the product of deviations to average out to nothing). The cross term disappears and you recover the clean Var(X + Y) = Var(X) + Var(Y). Be honest about the exact condition, though: the cross term vanishes whenever Cov(X, Y) = 0, and that holds for merely uncorrelated variables too. So variances add under the weaker assumption of zero correlation — you do not need full independence. This is the one place where uncorrelated is genuinely enough, and it's worth knowing because it's easier to check.
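
A small sketch of "uncorrelated is enough", using a standard textbook pair that is not taken from this guide: X uniform on {-1, 0, 1} and Y = X^2. Y is completely determined by X, so the two are as dependent as can be, yet their covariance is zero and the variances still add.

```python
from fractions import Fraction as F

# X is uniform on {-1, 0, 1}; Y = X^2 is completely determined by X.
xs = [-1, 0, 1]
p = F(1, 3)

def expect(g):
    """E[g(X)] for X uniform on xs."""
    return sum(p * g(x) for x in xs)

def var(g):
    """Var(g(X)) = E[g^2] - (E[g])^2."""
    return expect(lambda x: g(x) ** 2) - expect(g) ** 2

# Cov(X, Y) = E[XY] - E[X] E[Y], with Y = X^2.
cov_xy = expect(lambda x: x * x**2) - expect(lambda x: x) * expect(lambda x: x**2)
var_x = var(lambda x: x)
var_y = var(lambda x: x**2)
var_sum = var(lambda x: x + x**2)

# Dependence check: P(X=0, Y=0) versus P(X=0) * P(Y=0).
p_joint = p          # X = 0 forces Y = 0
p_product = p * p    # P(X=0) = 1/3 and P(Y=0) = P(X=0) = 1/3

print(cov_xy, var_sum, var_x + var_y, p_joint, p_product)
```

The joint probability and the product of marginals disagree, so X and Y are not independent, but the cross term is still exactly zero.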

This additivity is the engine behind a fact you'll lean on for the rest of probability. Take n independent copies of a variable, each with variance sigma^2, and add them: the sum has variance n·sigma^2. Now average instead of sum — divide by n. Using Var(cX) = c^2 Var(X), the average has variance (1/n^2)·(n·sigma^2) = sigma^2 / n. The spread of an average shrinks like 1/n, so its standard deviation shrinks like 1 over the square root of n. That sigma over root-n is the famous standard error, and the slow square-root decay is why doubling your accuracy costs four times the data.
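
The sigma^2 / n claim can be checked exactly rather than by simulation. The sketch below assumes a fair six-sided die as the underlying variable (our choice of example, with variance 35/12) and enumerates every outcome of n independent rolls.

```python
from fractions import Fraction as F
from itertools import product

faces = range(1, 7)    # one fair six-sided die
sigma2 = F(35, 12)     # variance of a single roll

def var_of_average(n):
    """Exact variance of the mean of n independent rolls, by enumeration."""
    p = F(1, 6 ** n)   # every outcome tuple is equally likely
    means = [F(sum(rolls), n) for rolls in product(faces, repeat=n)]
    m1 = sum(p * m for m in means)
    m2 = sum(p * m * m for m in means)
    return m2 - m1 ** 2

# Compare the enumerated variance with sigma^2 / n.
for n in (1, 2, 4):
    print(n, var_of_average(n), sigma2 / n)
```

Going from n = 1 to n = 4 divides the variance by four, which only halves the standard deviation: that is the four-times-the-data cost in miniature.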

The general formula: bilinearity does the heavy lifting

What about a sum of many variables, possibly all dependent? The two-variable rule generalizes by the same square-expansion logic, and the clean way to organize it is the bilinearity of covariance: covariance is linear in each slot, so the Cov of a sum is the sum of the Covs. Applied to Var(X1 + ... + Xn) = Cov(X1 + ... + Xn, X1 + ... + Xn), it splits into all n^2 pairwise covariances: the n diagonal terms Cov(Xi, Xi) = Var(Xi), plus an off-diagonal Cov(Xi, Xj) for each ordered pair with i ≠ j. Because Cov is symmetric, each unordered pair is counted twice, which is where the factor of 2 comes from.

Written out, Var(X1 + ... + Xn) is the sum of the n individual variances plus twice the sum of Cov(Xi, Xj) over every distinct pair i < j. The variances sit on the diagonal of the table of all pairings; the covariances fill everything off it. If all the pairs happen to be uncorrelated, every off-diagonal term is zero and you're back to plain additivity, Var(X1 + ... + Xn) = Var(X1) + ... + Var(Xn). Otherwise the off-diagonal traffic is where the action is.
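
Here is a minimal sketch of the diagonal-plus-twice-the-pairs formula on three deliberately dependent variables. The construction (X1 = A, X2 = A + B, X3 = B - C from independent fair bits A, B, C) is a hypothetical example chosen so everything can be enumerated exactly.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical example: three dependent variables built from fair bits A, B, C.
outcomes = [(a, a + b, b - c) for a, b, c in product((0, 1), repeat=3)]
p = F(1, len(outcomes))    # all 8 outcomes equally likely
n = 3

def expect(g):
    """E[g(X1, X2, X3)], where g takes the outcome tuple."""
    return sum(p * g(xs) for xs in outcomes)

mean = [expect(lambda xs, i=i: xs[i]) for i in range(n)]

def cov(i, j):
    """Cov(Xi, Xj) as the expected product of deviations."""
    return expect(lambda xs: (xs[i] - mean[i]) * (xs[j] - mean[j]))

# Sum of variances plus twice the sum over pairs i < j.
diagonal = sum(cov(i, i) for i in range(n))
pairs = sum(cov(i, j) for i in range(n) for j in range(i + 1, n))
by_formula = diagonal + 2 * pairs

# Direct computation of Var(X1 + X2 + X3) for comparison.
total_mean = expect(sum)
direct = expect(lambda xs: (sum(xs) - total_mean) ** 2)

print(diagonal, pairs, by_formula, direct)
```

In this example the variances contribute 5/4 and the cross terms contribute another 1, so the off-diagonal part is nearly half of the total.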

That pair count carries a warning worth absorbing. With n terms there are only n variances but about n^2/2 covariances, so when variables are positively correlated the cross terms can dominate completely. Pile 100 assets that each wobble a little but tend to fall together, and the portfolio's variance is driven far more by the 4950 covariances than by the 100 individual variances. This is exactly why diversification works through low or negative correlation, not merely through having many pieces: adding more positively-correlated parts does not tame the spread, it can feed it.
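
To put rough numbers on the 100-asset warning, the sketch below assumes every asset has the same variance and every pair has the same correlation `rho`. That equal-correlation setup, the value rho = 3/10, and the function name `portfolio_variance` are simplifying assumptions for illustration, not anything stated in this guide.

```python
from fractions import Fraction as F
from math import comb

def portfolio_variance(n, sigma2, rho):
    """Split Var(X1 + ... + Xn) into its diagonal and cross-term parts,
    assuming equal variances sigma2 and a common pairwise correlation rho."""
    diagonal = n * sigma2
    cross = 2 * comb(n, 2) * rho * sigma2    # Cov(Xi, Xj) = rho * sigma2
    return diagonal, cross

print(comb(100, 2))    # number of unordered pairs among 100 assets

for rho in (F(0), F(3, 10)):
    diagonal, cross = portfolio_variance(100, F(1), rho)
    print(rho, diagonal, cross, diagonal + cross)
```

With unit variances and rho = 3/10, the 100 variances contribute 100 while the cross terms contribute 2970. Under this assumption, dividing by n^2 = 10000 gives a variance of 0.307 for the average, which is barely above rho times sigma^2 = 0.3: more positively correlated pieces cannot push it below that floor.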

Conditioning: split the mean before you compute it

Now a different and equally powerful move. Earlier in this ladder you met conditional expectation: E[X given Y = y] is the mean of X once you're told Y took the value y. As y varies this is a number that depends on y, so E[X given Y] is itself a random variable — a function of Y. The law of total expectation says you can recover the plain mean of X by averaging these conditional means over Y: E[X] = E[E[X given Y]]. In words: figure out the answer separately in each possible world, then average those answers, weighted by how likely each world is.

This law of total expectation is the expectation-level twin of the law of total probability you used in the conditional-probability rung: same divide-and-conquer spirit, lifted from probabilities to averages. It shines on problems with a random number of random pieces. Suppose a shop serves a random number N of customers in a day, and each spends a random amount with mean E[X] = 20 dollars, independent of N. Condition on N first: given N = n, the expected total is n·20. So E[total given N] = 20·N, and averaging over N gives E[total] = 20·E[N]. If E[N] = 50 customers, the expected daily take is 1000 dollars, computed without ever wrestling the messy distribution of the total directly.

Pick a helper variable Y you wish you knew — usually one that makes X easy once it's fixed (here, the count N).
Compute the inner conditional mean E[X given Y = y] as an ordinary, often easy, expectation.
Read that off as a function of Y, giving the random variable E[X given Y].
Average over Y: E[X] = E[E[X given Y]]. The outer average reattaches the uncertainty in Y you had set aside.
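
The four steps above can be sketched on a toy version of the shop. The distributions here are hypothetical and deliberately tiny (at most two customers, each spending 10 or 30 with equal chance, so the mean spend is still 20), which lets a brute-force enumeration confirm the shortcut.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical toy shop: tiny distributions so everything can be enumerated.
n_pmf = {0: F(1, 4), 1: F(1, 4), 2: F(1, 2)}    # P(N = n)
spend_pmf = {10: F(1, 2), 30: F(1, 2)}          # per-customer spend, mean 20

mean_spend = sum(p * x for x, p in spend_pmf.items())
mean_n = sum(p * n for n, p in n_pmf.items())

# Inner mean E[total | N = n] = n * mean_spend, then average over N.
by_conditioning = sum(p * n * mean_spend for n, p in n_pmf.items())

# Brute force: enumerate every (N, spends) outcome and average the total.
brute_force = F(0)
for n, pn in n_pmf.items():
    for spends in product(spend_pmf, repeat=n):
        prob = pn
        for s in spends:
            prob *= spend_pmf[s]    # customers spend independently
        brute_force += prob * sum(spends)

print(by_conditioning, brute_force, mean_spend * mean_n)
```

The conditioning route needs only the two means, while the brute-force route has to walk through the full distribution of the total. With the guide's own numbers the same one-liner gives 20 · 50 = 1000.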

Splitting variance: within plus between

Conditioning splits variance too, but with a twist that's genuinely beautiful. The law of total variance says Var(X) = E[Var(X given Y)] + Var(E[X given Y]). Two terms, and each is doing honest work. The first, E[Var(X given Y)], is the average leftover spread of X inside each world once Y is known — the variability you can't explain away even after learning Y. The second, Var(E[X given Y]), is how much the conditional mean itself swings as Y changes — the spread that Y does explain. Total uncertainty equals unexplained-within plus explained-between.

A picture makes it stick. Imagine students' test scores across several classes. The first term, E[Var(X given Y)], is the average spread of scores within a class — kids in the same room still differ. The second, Var(E[X given Y]), is how much the class averages differ from one another — the between-class gap. The total variation in scores is exactly these two added together, never more and never less. This within-plus-between decomposition is the backbone of analysis of variance and of every "how much does this factor explain?" question in statistics.
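
A minimal sketch of the within-plus-between split, using made-up scores for two classes. It assumes a student is drawn uniformly at random from all five, and it uses the population variance (divide by the count, not the count minus one) so the identity holds exactly.

```python
from fractions import Fraction as F

# Hypothetical scores; a student is drawn uniformly at random from all five.
classes = {"A": [60, 70, 80], "B": [80, 90]}

def mean(values):
    return F(sum(values), len(values))

def variance(values):
    """Population variance: average squared deviation from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

all_scores = [s for scores in classes.values() for s in scores]
total = variance(all_scores)    # Var(X)

# Weight of each class = probability that the random student is in it.
weight = {c: F(len(s), len(all_scores)) for c, s in classes.items()}

within = sum(weight[c] * variance(s) for c, s in classes.items())    # E[Var(X | Y)]
grand_mean = sum(weight[c] * mean(s) for c, s in classes.items())    # E[E[X | Y]]
between = sum(weight[c] * (mean(s) - grand_mean) ** 2
              for c, s in classes.items())                           # Var(E[X | Y])

print(within, between, total)
```

In this toy data the within-class spread is 50 and the between-class spread is 54, adding to a total variance of 104, so knowing the class explains a little over half of the variation in scores.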

Where this rung was heading

Step back and see what this rung built. You learned to handle several variables at once through joint, marginal, and conditional distributions; you measured their linear togetherness with covariance and correlation; you got the firm warning that uncorrelated is not independent. This last guide ties it off: the variance of a sum reveals exactly where covariance bites, and conditioning lets you take any mean or variance apart and reassemble it. These are not just identities to memorize — they're the tools that make multi-variable probability computable instead of frightening.

These two ideas also point straight up the ladder. The variance-of-an-average result, sigma^2 / n, is the seed of the law of large numbers, which controls how a sample mean settles toward its true mean, and of the central limit theorem, which says that settling happens through a normal-shaped bell. Both are several rungs ahead. The central limit theorem needs finite variance to work at all, and so does the sigma^2 / n route to the law of large numbers, even though the law itself survives with only a finite mean. Conditional expectation, meanwhile, grows into a whole language for prediction and for the theory of martingales and stochastic processes much further on. You're leaving this rung with the additive structure of spread and the divide-and-conquer power of conditioning firmly in hand.