Covariance and Correlation

Do they move together? Defining covariance

In the previous guide you learned to test whether two random variables are independent — whether knowing one tells you nothing about the other. But independence is all-or-nothing, and most real pairs live somewhere in between. Height and weight are not independent, yet neither does one perfectly fix the other; they merely *lean* in the same direction. We want a number that measures that lean: when X happens to be above its mean, does Y tend to be above its mean too, or below? Covariance is exactly that number.

Write mu_X = E[X] and mu_Y = E[Y] for the two means. For a single draw, look at the two deviations from the mean, (X - mu_X) and (Y - mu_Y), and multiply them. If both are positive (both above their means) or both negative (both below), the product is positive. If one is up while the other is down, the product is negative. The covariance is just the average of that product over the joint distribution: Cov(X, Y) = E[(X - mu_X)(Y - mu_Y)]. A positive covariance says the two usually drift the same way; a negative one says they usually drift in opposite ways; near zero says there is no consistent linear pull either way.

There is a much friendlier computational formula, and it mirrors the one you already use for variance. Expanding the product and using linearity of expectation collapses everything to Cov(X, Y) = E[XY] - E[X] E[Y]. So you only need the expected value of the product XY and the two separate means. Notice the family resemblance to Var(X) = E[X^2] - (E[X])^2: covariance is to a *pair* what variance is to a *single* variable. In fact, Cov(X, X) = E[X^2] - (E[X])^2 = Var(X) — covariance of a variable with itself is just its own variance.
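The expansion mentioned above can be written out step by step. This is only a sketch of the algebra; it assumes the expectations involved are finite and uses nothing beyond linearity of expectation, with mu_X = E[X] and mu_Y = E[Y] treated as constants.

```latex
\[
\begin{aligned}
\operatorname{Cov}(X, Y)
  &= E\big[(X - \mu_X)(Y - \mu_Y)\big] \\
  &= E\big[XY - \mu_X Y - \mu_Y X + \mu_X \mu_Y\big] \\
  &= E[XY] - \mu_X E[Y] - \mu_Y E[X] + \mu_X \mu_Y \\
  &= E[XY] - \mu_X \mu_Y - \mu_X \mu_Y + \mu_X \mu_Y \\
  &= E[XY] - E[X]\,E[Y].
\end{aligned}
\]
```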

  Cov(X, Y) = E[(X - mu_X)(Y - mu_Y)]     (definition)
            = E[XY] - E[X] E[Y]           (computational form)

  Cov(X, X) = E[X^2] - (E[X])^2 = Var(X)

  X, Y independent  =>  E[XY] = E[X] E[Y]  =>  Cov(X, Y) = 0

Two equivalent formulas for covariance, the link to variance, and what independence forces.
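As a quick check, the two formulas can be compared on a small joint distribution. The sketch below is illustrative only: the table `joint` and the helper `expect` are made up for this example, and exact fractions are used so the two results can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical joint distribution of (X, Y): maps (x, y) -> probability.
joint = {
    (0, 0): Fraction(3, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(2, 10),
    (1, 1): Fraction(4, 10),
}


def expect(g, pmf):
    """E[g(X, Y)] under a finite joint pmf."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())


mu_x = expect(lambda x, y: x, joint)
mu_y = expect(lambda x, y: y, joint)

# Definition: average product of the two deviations from the mean.
cov_definition = expect(lambda x, y: (x - mu_x) * (y - mu_y), joint)

# Computational form: E[XY] - E[X] E[Y].
cov_shortcut = expect(lambda x, y: x * y, joint) - mu_x * mu_y

# Cov(X, X) should match Var(X) = E[X^2] - (E[X])^2.
cov_xx = expect(lambda x, y: (x - mu_x) * (x - mu_x), joint)
var_x = expect(lambda x, y: x * x, joint) - mu_x ** 2

# An independent pair with the same marginals: joint = product of marginals.
px = {0: Fraction(2, 5), 1: Fraction(3, 5)}
py = {0: Fraction(1, 2), 1: Fraction(1, 2)}
independent = {(x, y): px[x] * py[y] for x in px for y in py}
cov_independent = expect(lambda x, y: x * y, independent) - mu_x * mu_y

print(cov_definition, cov_shortcut, cov_xx, var_x, cov_independent)
```

Both routes give the same covariance for the dependent table, and the product table built from the same marginals has covariance zero.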

The catch with covariance: it has no fixed scale

Covariance answers "which direction?" cleanly, but it is hopeless at answering "how strongly?" The problem is units. Covariance carries the units of X times the units of Y, so if X is a height in metres and Y a weight in kilograms, Cov(X, Y) is measured in metre-kilograms — a quantity with no intuitive size. Worse, it changes the moment you rescale. Measure the same height in centimetres instead of metres and every X is multiplied by 100, which multiplies the covariance by 100 as well, even though absolutely nothing about the underlying relationship has changed.

This rescaling behaviour is itself a useful fact, and it comes from a deeper property called bilinearity. Covariance is linear in each slot separately: sums split apart, Cov(X + Z, Y) = Cov(X, Y) + Cov(Z, Y), constant factors pull out, Cov(aX, Y) = a Cov(X, Y), and the covariance of anything with a constant is zero. Put together, these give Cov(aX + b, cY + d) = ac Cov(X, Y). The additive shifts b and d vanish entirely, since sliding a variable up or down does not change how it co-varies with another, while the scale factors a and c pull straight out front. That is exactly why switching metres to centimetres multiplied the covariance by 100. Bilinearity is the engine behind nearly every covariance manipulation you will do, so it is worth holding onto.
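The identity Cov(aX + b, cY + d) = ac Cov(X, Y) is easy to check numerically. The following is a minimal sketch: the heights and weights are made-up numbers, `cov` is a hypothetical helper that treats the listed pairs as equally likely, and the constants a, b, c, d are arbitrary.

```python
from fractions import Fraction


def cov(xs, ys):
    """Population covariance of equally likely pairs (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Made-up heights (metres) and weights (kilograms), purely illustrative.
heights_m = [Fraction(150, 100), Fraction(160, 100),
             Fraction(170, 100), Fraction(180, 100)]
weights_kg = [Fraction(50), Fraction(60), Fraction(62), Fraction(80)]

base = cov(heights_m, weights_kg)

# Metres -> centimetres is the case a = 100, b = 0, c = 1, d = 0.
heights_cm = [100 * h for h in heights_m]
rescaled = cov(heights_cm, weights_kg)

# General check with arbitrary constants: shifts drop out, scales multiply.
a, b, c, d = 3, 7, -2, 5
transformed = cov([a * x + b for x in heights_m],
                  [c * y + d for y in weights_kg])

print(base, rescaled, transformed)
```

On this data the covariance jumps by a factor of 100 under the unit change, and the shifted and scaled pair has covariance ac times the original, with b and d leaving no trace.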

Correlation: covariance with the units stripped out

The fix is to divide the covariance by the right amount of "size" so the units cancel. The natural measure of each variable's own scale is its standard deviation, sigma_X and sigma_Y. Dividing by their product gives the correlation coefficient, written with the Greek letter rho: rho = Cov(X, Y) / (sigma_X sigma_Y), defined whenever both standard deviations are nonzero. Because the metre-kilograms on top are cancelled by the metres-times-kilograms on the bottom, rho is a pure, dimensionless number. Rescaling X from metres to centimetres now multiplies top and bottom by 100 alike, so rho does not budge, which is exactly the stability we wanted.

Dividing by the standard deviations also pins rho to a fixed range. The Cauchy-Schwarz inequality guarantees that |Cov(X, Y)| can never exceed sigma_X sigma_Y, which forces -1 <= rho <= 1. The boundaries carry real meaning: rho = +1 happens exactly when Y is an increasing straight-line function of X (Y = aX + b with a > 0), and rho = -1 when it is a decreasing one. Values in between measure how tightly the cloud of points hugs a straight line. So rho near 0.9 is a strong upward linear trend, rho near -0.2 a weak downward one, and rho near 0 no linear trend at all.
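Both properties of rho, stability under a change of units and the extreme values on exact straight lines, can be seen on a few numbers. This is a rough sketch with made-up data; `correlation` is a hypothetical helper that treats the listed pairs as equally likely and uses floating point, so comparisons are approximate.

```python
import math


def correlation(xs, ys):
    """Population correlation of equally likely pairs (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov_xy / math.sqrt(var_x * var_y)


# Made-up heights (metres) and weights (kilograms), purely illustrative.
heights_m = [1.50, 1.60, 1.70, 1.80]
weights_kg = [50.0, 60.0, 62.0, 80.0]

rho_m = correlation(heights_m, weights_kg)
rho_cm = correlation([100 * h for h in heights_m], weights_kg)

# Exact straight lines sit on the boundaries of the range.
rho_up = correlation(heights_m, [2 * h + 1 for h in heights_m])
rho_down = correlation(heights_m, [-3 * h + 4 for h in heights_m])

print(rho_m, rho_cm, rho_up, rho_down)
```

Here rho is roughly 0.95 whether height is in metres or centimetres, while the two exact lines give +1 and -1 up to floating-point rounding.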

Why covariance matters: the variance of a sum

Covariance is not just a descriptive score; it is the missing piece in one of the most-used formulas in probability. You already know E[X + Y] = E[X] + E[Y] always, with no conditions. Variance is not so generous. The general rule is Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y). The extra term is the covariance, doubled, and it is precisely the price of the two variables moving together. If they tend to rise and fall in step (positive covariance), their sum swings more wildly than the parts would suggest; if they tend to cancel (negative covariance), the sum is calmer.
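For readers who want to see where the extra term comes from, here is a short derivation sketch. It assumes both variances are finite and uses only the definition of variance together with E[X + Y] = mu_X + mu_Y.

```latex
\[
\begin{aligned}
\operatorname{Var}(X + Y)
  &= E\Big[\big((X - \mu_X) + (Y - \mu_Y)\big)^2\Big] \\
  &= E\big[(X - \mu_X)^2\big] + E\big[(Y - \mu_Y)^2\big]
     + 2\,E\big[(X - \mu_X)(Y - \mu_Y)\big] \\
  &= \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y).
\end{aligned}
\]
```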

Now the payoff of independence becomes vivid. When X and Y are independent, E[XY] = E[X] E[Y], so Cov(X, Y) = 0, and the cross term vanishes: Var(X + Y) = Var(X) + Var(Y). Variances simply add. This is the engine behind countless results: the variance of a sum of n independent, identically distributed draws being n times the variance of one draw, the standard error of their average shrinking like 1/sqrt(n), and ultimately the central limit theorem. The whole machinery of "averaging reduces noise" runs on covariance being zero.
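The additivity of variance for independent draws can be checked exactly by listing every outcome. The sketch below uses independent fair dice as an assumed example and a hypothetical helper `variance`; it enumerates all equally likely outcomes for one, two and three dice rather than simulating.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)


def variance(values):
    """Population variance of equally likely values."""
    values = list(values)
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((v - mean) ** 2 for v in values) / n


var_one = variance(faces)

# Variance of the total of n independent dice, from all 6**n outcomes.
var_sum = {
    n: variance(sum(rolls) for rolls in product(faces, repeat=n))
    for n in (1, 2, 3)
}

# Variance of the average of n rolls: Var(sum / n) = Var(sum) / n**2.
var_mean = {n: var_sum[n] / n ** 2 for n in var_sum}

for n in var_sum:
    print(n, var_sum[n], var_mean[n])
```

The total of n dice has variance n times that of one die, so the average has variance divided by n, and its standard deviation shrinks like 1/sqrt(n).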

A small worked number. Roll a fair die; let X be the roll and Y = 7 - X (the value on the opposite face). Then E[X] = 3.5 and Var(X) = Var(Y) = 35/12.
Compute Cov(X, Y) by bilinearity: Cov(X, 7 - X) = Cov(X, 7) - Cov(X, X) = 0 - Var(X) = -35/12. They move in perfect opposition.
Correlation: sigma_X sigma_Y = sqrt(35/12) * sqrt(35/12) = 35/12, so rho = Cov(X, Y) / (sigma_X sigma_Y) = (-35/12) / (35/12) = -1, the perfect-negative-line extreme. That is unsurprising, since Y is exactly -X plus a constant.
Variance of the sum: X + Y = 7 is constant, so Var(X + Y) = 0. Check it: Var(X) + Var(Y) + 2 Cov(X, Y) = 35/12 + 35/12 - 2(35/12) = 0. The negative covariance cancels the spread exactly.
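The worked numbers above can be reproduced directly from the definition. This is a small sketch, not the only way to organise the computation; the helper `E` is hypothetical and simply averages a function of the roll over the six equally likely faces.

```python
import math
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # each face of a fair die


def E(g):
    """Expected value of g(X) for a fair die roll X."""
    return sum(p * g(x) for x in faces)


mean_x = E(lambda x: x)
mean_y = E(lambda x: 7 - x)

var_x = E(lambda x: x * x) - mean_x ** 2
var_y = E(lambda x: (7 - x) ** 2) - mean_y ** 2

# Computational form: E[XY] - E[X] E[Y] with Y = 7 - X.
cov_xy = E(lambda x: x * (7 - x)) - mean_x * mean_y

# The square root forces floating point for rho.
rho = float(cov_xy) / math.sqrt(var_x * var_y)

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
var_sum = var_x + var_y + 2 * cov_xy

print(var_x, var_y, cov_xy, rho, var_sum)
```

The exact fractions match the hand computation: both variances are 35/12, the covariance is -35/12, and the variance of the sum is 0.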

Two honest warnings

The first warning is the one this rung is built around, and it is the single most common mistake with correlation. Zero correlation does not mean independence. Independence forces covariance to zero, but the arrow does not reverse. Because rho measures only the *linear* part of a relationship, a perfectly deterministic but curved relationship can show rho = 0. The classic example: let X be symmetric about zero (with a finite third moment) and set Y = X^2. Then Y is completely determined by X, as dependent as can be, yet Cov(X, Y) = E[X * X^2] - E[X] E[X^2] = E[X^3] - E[X] E[X^2] = 0, because symmetry forces the odd moments E[X] and E[X^3] to be zero. We will dwell on this exact gap in the next guide, because so much bad reasoning hides in it.
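A concrete instance makes the gap visible. In the sketch below, X is assumed to be uniform on {-2, -1, 0, 1, 2}, one arbitrary choice of a distribution symmetric about zero, and Y = X^2. The covariance comes out exactly zero even though the joint probabilities do not factor.

```python
from fractions import Fraction

# Assumed for illustration: X uniform on a set symmetric about zero.
support = [-2, -1, 0, 1, 2]
p = Fraction(1, len(support))


def E(g):
    """Expected value of g(X)."""
    return sum(p * g(x) for x in support)


# Cov(X, Y) with Y = X^2 is E[X^3] - E[X] E[X^2].
cov_xy = E(lambda x: x ** 3) - E(lambda x: x) * E(lambda x: x ** 2)

# Independence would require P(X = 2, Y = 4) = P(X = 2) * P(Y = 4).
p_joint = p                   # X = 2 forces Y = 4
p_x2 = p                      # P(X = 2)
p_y4 = 2 * p                  # Y = 4 when X = -2 or X = 2
p_product = p_x2 * p_y4

print(cov_xy, p_joint, p_product)
```

The covariance is 0, yet P(X = 2, Y = 4) = 1/5 while the product of the marginals is 2/25, so X and Y are uncorrelated but not independent.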

The second warning is about meaning, not mathematics: correlation is not causation. A large rho between two variables tells you they move together, nothing about *why*. Ice-cream sales and drowning deaths are strongly correlated, but neither causes the other — hot weather drives both. A hidden common cause, reverse causation, or pure coincidence can each manufacture a high correlation. Correlation is a genuine, useful signal that something links the variables; deciding *what* links them, and in which direction, is a separate question that data alone rarely settles.
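A hidden common cause is easy to imitate in a toy simulation. Everything in the sketch below is invented for illustration: a standardised "temperature" drives two series, each with its own independent noise, and neither series influences the other. The model, the noise level and the seed are all assumptions, and because the simulation knows the temperature exactly it can simply subtract it, which real data would not allow.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable


def correlation(xs, ys):
    """Population correlation of equally likely pairs (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov_xy / math.sqrt(var_x * var_y)


n = 5000

# Hypothetical model: temperature drives both series, plus independent noise.
temperature = [random.gauss(0, 1) for _ in range(n)]
ice_cream = [t + random.gauss(0, 0.5) for t in temperature]
drownings = [t + random.gauss(0, 0.5) for t in temperature]

# Under this model the theoretical correlation is 1 / (1 + 0.25) = 0.8.
rho_raw = correlation(ice_cream, drownings)

# Remove the common cause: what remains is the two independent noises.
ice_resid = [i - t for i, t in zip(ice_cream, temperature)]
drown_resid = [d - t for d, t in zip(drownings, temperature)]
rho_resid = correlation(ice_resid, drown_resid)

print(rho_raw, rho_resid)
```

The raw series are strongly correlated although neither affects the other, and once the shared driver is taken out the correlation falls to roughly zero.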

There is one important family where the gap between uncorrelated and independent does close. When X and Y are jointly normal, that is, the pair follows a bivariate normal distribution, zero correlation really does imply independence. The word "jointly" matters: two variables that are each normal on their own, but not jointly normal, can still be uncorrelated and dependent. This is a genuine exception, not the general rule, and it is exactly why the normal case is so beloved and so easy to reason about. Outside that comfortable world, keep the two warnings firmly in mind: rho = 0 can still hide deep dependence, and rho far from 0 still says nothing about cause.
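The reason the normal case behaves so well can be read off its density. As a sketch, assuming the standard bivariate normal density with means mu_X, mu_Y, standard deviations sigma_X, sigma_Y and correlation rho with |rho| < 1, setting rho = 0 removes the cross term, and the joint density splits into the product of the two marginal normal densities, which is exactly the definition of independence.

```latex
\[
f(x, y) = \frac{1}{2\pi\,\sigma_X \sigma_Y \sqrt{1 - \rho^2}}
\exp\!\left(
  -\frac{1}{2(1 - \rho^2)}
  \left[
    \frac{(x - \mu_X)^2}{\sigma_X^2}
    - \frac{2\rho\,(x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y}
    + \frac{(y - \mu_Y)^2}{\sigma_Y^2}
  \right]
\right)
\]
\[
\rho = 0 \;\Longrightarrow\;
f(x, y) =
\underbrace{\frac{1}{\sqrt{2\pi}\,\sigma_X}
  \exp\!\left(-\frac{(x - \mu_X)^2}{2\sigma_X^2}\right)}_{f_X(x)}
\cdot
\underbrace{\frac{1}{\sqrt{2\pi}\,\sigma_Y}
  \exp\!\left(-\frac{(y - \mu_Y)^2}{2\sigma_Y^2}\right)}_{f_Y(y)}
\]
```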