Chebyshev's Inequality and the Weak Law

From the mean to the spread

In the previous guide you met Markov's inequality: for a non-negative variable X and any threshold a > 0, P(X >= a) <= E[X] / a. It is wonderfully cheap — the mean alone buys you a tail bound — but it is also wonderfully crude, because it knows nothing about how X spreads out around its mean. Two variables with the same average can behave completely differently: one hugs its mean tightly, the other swings wildly. Markov cannot tell them apart. This guide fixes exactly that blind spot.

The quantity that measures spread is the variance, Var(X) = E[(X - mu)^2], where mu = E[X]. Its square root is the standard deviation sigma, which lives in the same units as X itself. The whole trick of this guide is one move: apply Markov's inequality not to X, but to the squared distance from the mean, (X - mu)^2. That non-negative quantity is exactly the thing whose expectation is the variance — so Markov, fed this new variable, will hand us a bound stated in terms of sigma^2.

Deriving Chebyshev in two lines

Let us do the move slowly. We want to bound the probability that X lands far from its mean, say at least k standard deviations away: the event |X - mu| >= k*sigma. Squaring both sides (both are non-negative) keeps the event the same: it equals (X - mu)^2 >= k^2 * sigma^2. Now (X - mu)^2 is non-negative, so we are allowed to feed it to Markov with the threshold a = k^2 * sigma^2. Markov says P((X - mu)^2 >= a) <= E[(X - mu)^2] / a, and the numerator E[(X - mu)^2] is just Var(X) = sigma^2.

event:   |X - mu| >= k*sigma                        (X falls k SDs from the mean)
square:  (X - mu)^2 >= k^2 * sigma^2                 (same event, both sides >= 0)
Markov:  P((X-mu)^2 >= k^2 sigma^2) <= E[(X-mu)^2] / (k^2 sigma^2)
but:     E[(X - mu)^2] = Var(X) = sigma^2
result:  P(|X - mu| >= k*sigma) <= 1 / k^2

Chebyshev's inequality is just Markov applied to the squared deviation. The sigma^2 cancels, leaving the clean bound 1/k^2.

Substituting a = k^2 * sigma^2 and cancelling the sigma^2 gives the headline result, Chebyshev's inequality: P(|X - mu| >= k*sigma) <= 1 / k^2. In words: the chance of being at least k standard deviations from the mean is at most one over k squared. It is a tail bound that, unlike Markov, charges for distance measured in the variable's own natural ruler — the standard deviation — and it works for any distribution with a finite variance, no shape assumptions at all.

What the bound actually says — and what it doesn't

Put a few numbers in to feel it. With k = 2, Chebyshev promises P(|X - mu| >= 2*sigma) <= 1/4: at most a quarter of the probability can sit two or more standard deviations out, for ANY distribution with finite variance. With k = 3 it falls to 1/9 ≈ 0.11. Equivalently, at least 1 - 1/k^2 of the mass lies within k standard deviations of the mean — at least 75% within two, at least 89% within three. That is a universal guarantee you can quote without knowing anything else about X.

Two cautions before we move on. First, Chebyshev needs a finite variance to even be stated; for a heavy-tailed variable like the Cauchy, sigma^2 is infinite and the bound is vacuous (it just says "<= infinity"). Second, the bound is two-sided and symmetric — it controls both tails together — so it can be wasteful for a lopsided distribution where almost all the danger is on one side. It is a blunt, dependable tool, not a sharp one.

Aiming Chebyshev at an average

Here is where the inequality earns its keep. Take n independent, identically distributed variables X_1, ..., X_n, each with mean mu and finite variance sigma^2, and form their sample average X-bar = (X_1 + ... + X_n) / n. By linearity of expectation, E[X-bar] = mu — the average is centred on the true mean. The key is its variance. Because the terms are independent, the variance of the sum is just the sum of the variances, n*sigma^2, and dividing the sum by n divides its variance by n^2. So Var(X-bar) = n*sigma^2 / n^2 = sigma^2 / n.

Read that result again, because it is the whole engine of statistics: the average of n independent copies has variance sigma^2 / n, which shrinks toward zero as n grows. The sample mean is far less variable than a single draw — its standard deviation is sigma / sqrt(n), the famous "square-root-of-n" law. Averaging does not make any single measurement more accurate, but it makes the average itself cling ever more tightly to mu.

The weak law of large numbers

Now combine the two ideas and watch the weak law of large numbers fall out almost for free. Apply Chebyshev to the average X-bar, using its mean mu and its variance sigma^2 / n. For any fixed tolerance epsilon > 0, the chance the average misses mu by more than epsilon is P(|X-bar - mu| >= epsilon) <= Var(X-bar) / epsilon^2 = sigma^2 / (n * epsilon^2). The numerator is fixed; the denominator grows with n; so the whole bound goes to zero as n goes to infinity.

Center and spread the average: E[X-bar] = mu and Var(X-bar) = sigma^2 / n.
Apply Chebyshev to X-bar with threshold epsilon: P(|X-bar - mu| >= epsilon) <= Var(X-bar) / epsilon^2.
Substitute the variance: the bound becomes sigma^2 / (n * epsilon^2).
Let n grow: the bound tends to 0, so P(|X-bar - mu| >= epsilon) tends to 0 for every epsilon.

That last line is precisely the definition of convergence in probability: X-bar converges in probability to mu. The sample average becomes arbitrarily likely to sit arbitrarily close to the true mean as the sample grows. This is the weak law, and the proof we just gave — Chebyshev on the average — is the textbook one. It is also why the law needs a finite variance in this elementary form; sharper arguments relax that, but the variance route is the cleanest first proof.

What the weak law does and does not promise

The weak law is about the average, not about sums "evening out." A common myth says that after a run of heads, tails are "due" so that the counts will balance — that is the gambler's fallacy, and it is false: independent trials have no memory. What actually converges is the proportion of heads, X-bar, toward 1/2; the absolute difference between the number of heads and tails can and typically does grow without bound. The average is tamed precisely because dividing by n crushes the variance; the running total is not tamed at all.

It is also worth separating two laws. The weak law says, for each large n, X-bar is very likely near mu — a statement about probabilities that can still allow rare big misses at any particular n. The strong law of large numbers says something stronger: with probability 1, the entire sequence X-bar settles to mu and stays there. The strong law is harder to prove and is not what Chebyshev alone gives you, but it is the deeper truth behind the everyday intuition that long-run frequencies equal probabilities.