The Law of Large Numbers

The promise: averages settle down

Flip a fair coin ten times and you might see 7 heads — a proportion of 0.7, nowhere near a half. Flip it ten thousand times and the proportion will sit suspiciously close to 0.5. Everyone has a gut feeling that "things even out in the long run," and the law of large numbers is the precise theorem behind that feeling. It says: if you average more and more independent draws from the same distribution, the average converges to the true expected value. The vague proverb becomes a clean mathematical promise.

Let us fix notation we will reuse throughout this guide. Draw X_1, X_2, X_3, ... independently from one distribution, each with the same mean mu = E[X] and (for now) a finite variance sigma^2 = Var(X). The sample mean of the first n draws is X-bar_n = (X_1 + X_2 + ... + X_n) / n. The law of large numbers is a statement about what happens to X-bar_n as n grows without bound: it closes in on mu. The whole subject of this rung, the different *modes* in which a sequence can converge, exists precisely so we can say exactly which kind of "closes in" the law delivers.
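
To see the notation in action, here is a minimal simulation sketch. The function name sample_mean_of_flips and the seed are illustrative choices, not part of the guide; a fair coin is modelled as a 0/1 draw with mu = 0.5, and X-bar_n is the proportion of heads.

```python
import random

def sample_mean_of_flips(n, seed=0):
    """Return X-bar_n, the proportion of heads in n simulated fair-coin flips."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    heads = sum(rng.random() < 0.5 for _ in range(n))
    return heads / n

if __name__ == "__main__":
    # The sample mean should drift toward mu = 0.5 as n grows.
    for n in (10, 100, 10_000, 1_000_000):
        print(n, sample_mean_of_flips(n))
```

The exact values depend on the seed, but the printed proportions should cluster more tightly around 0.5 as n increases.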

The weak law, proved with one inequality

The weak law of large numbers says: for any tolerance epsilon > 0, no matter how small, P(|X-bar_n - mu| > epsilon) tends to 0 as n grows. In words: the probability that the sample mean misses the true mean by more than your chosen margin shrinks to zero. That is exactly the definition of convergence in probability from the previous guide — X-bar_n converges in probability to mu. It is a statement about each fixed large n: for that n, a big miss is very unlikely.

What is lovely is how cheaply we can prove it when the variance is finite. The key is that averaging *crushes* variance. Since the draws are independent, the variance of a sum is the sum of the variances (no covariance terms survive), so Var(X_1 + ... + X_n) = n sigma^2. Dividing the sum by n divides the variance by n^2, giving Var(X-bar_n) = sigma^2 / n. As n grows, the spread of the sample mean around mu shrinks toward zero. The average is being squeezed onto a single point.

Now feed that shrinking variance into Chebyshev's inequality, which bounds how far any random variable with finite variance can stray from its mean using only that variance: P(|X-bar_n - mu| > epsilon) <= Var(X-bar_n) / epsilon^2 = sigma^2 / (n epsilon^2). The right side goes to 0 as n grows, for any fixed epsilon. That is the weak law, done in two lines. Notice it even hands you a concrete sample-size rule: to keep the chance of missing by more than epsilon below a chosen level delta, it suffices to take n >= sigma^2 / (delta epsilon^2).

Independence  =>  Var(X1 + ... + Xn) = n * sigma^2
Divide by n   =>  Var(X-bar_n)       = sigma^2 / n

Chebyshev:  P(|X-bar_n - mu| > epsilon)  <=  sigma^2 / (n * epsilon^2)  -->  0

Example (fair coin, mu = 0.5, sigma^2 = 0.25), within epsilon = 0.05:
   bound = 0.25 / (n * 0.0025) = 100 / n
   n = 1000  ->  bound 0.10        n = 10000  ->  bound 0.01

The two-line proof of the weak law: shrinking variance plus Chebyshev's inequality.
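
The bound in the figure can be checked numerically. The sketch below is one way to do it, under stated assumptions: the helper names (chebyshev_bound, sample_size_for, empirical_miss_rate) and the failure level delta are introduced here for illustration only, and the empirical check uses a simulated fair coin.

```python
import math
import random

def chebyshev_bound(sigma2, n, eps):
    """Chebyshev upper bound on P(|X-bar_n - mu| > eps), capped at 1."""
    return min(1.0, sigma2 / (n * eps ** 2))

def sample_size_for(sigma2, eps, delta):
    """Smallest n for which the Chebyshev bound is at most delta."""
    return math.ceil(sigma2 / (delta * eps ** 2))

def empirical_miss_rate(n, eps, trials=500, seed=0):
    """Fraction of simulated fair-coin experiments whose mean misses 0.5 by more than eps."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n))
        if abs(heads / n - 0.5) > eps:
            misses += 1
    return misses / trials

if __name__ == "__main__":
    # Fair coin: sigma^2 = 0.25, tolerance eps = 0.05, as in the example above.
    for n in (1000, 10_000):
        print(n, chebyshev_bound(0.25, n, 0.05))
    print("observed miss rate at n = 1000:", empirical_miss_rate(1000, 0.05))
```

The observed miss rate should come out well under the bound. Chebyshev uses nothing but the variance, so it is a guarantee rather than a sharp estimate.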

Weak versus strong: two kinds of certainty

The weak law guarantees that for each large n, a big miss is improbable — but it leaves open an unsettling possibility. Could the sample mean keep flickering, wandering away from mu now and then forever, as long as each individual excursion is rare? The strong law of large numbers closes that door. It promises that, with probability 1, the *entire sequence* X-bar_1, X-bar_2, X-bar_3, ... actually converges to mu. Pick one infinite run of the experiment, watch the running average, and it will settle down to mu and stay there.

That stronger guarantee is exactly almost sure convergence, also from the previous guide. The distinction matters, and the hierarchy of convergence modes tells us why: almost sure convergence implies convergence in probability, but not the other way around. So the strong law is genuinely stronger — it implies the weak law for free, while the weak law alone could in principle hold even if some sequences never truly settled. The strong law rules that out: the set of outcomes where the average fails to converge has probability zero.
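
A simulation cannot verify an almost-sure statement, since it only ever sees a finite stretch of one run, but it can illustrate what the strong law is about. The sketch below follows a single simulated run of fair-coin flips and reports, for a few starting points n, the worst miss |X-bar_m - mu| over every m from n to the end of the run. The function name tail_max_deviation and the finite horizon are assumptions made for this illustration.

```python
import random

def tail_max_deviation(n_flips, checkpoints, mu=0.5, seed=1):
    """For one run, map each checkpoint n to max over m in [n, n_flips] of |X-bar_m - mu|."""
    rng = random.Random(seed)
    heads = 0
    deviations = []  # deviations[m - 1] holds |X-bar_m - mu|
    for m in range(1, n_flips + 1):
        heads += rng.random() < 0.5
        deviations.append(abs(heads / m - mu))
    # Sweep backwards so each entry becomes the worst miss from that index onward.
    worst = 0.0
    tail_max = [0.0] * n_flips
    for i in range(n_flips - 1, -1, -1):
        worst = max(worst, deviations[i])
        tail_max[i] = worst
    return {n: tail_max[n - 1] for n in checkpoints}

if __name__ == "__main__":
    for n, worst in tail_max_deviation(200_000, (10, 1000, 100_000)).items():
        print(f"from flip {n:>7} onward, worst miss = {worst:.4f}")
```

The weak law speaks about the miss at one fixed n. The strong law speaks about this "worst miss from n onward" quantity, which on almost every run shrinks to zero as n grows.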

What the law does NOT say

The law of large numbers is widely believed and just as widely misquoted. The deadliest error is the gambler's fallacy: "I've seen six reds in a row at roulette, so black is now overdue." The wheel has no memory; spins are independent, and the next spin is just as likely red as ever. The law does not promise that a deficit of blacks gets *repaid*. It promises only that the *average* over a huge number of spins approaches the true mean — and it gets there not by correcting past misses but by drowning them in an ocean of fresh, indifferent trials.

Look again at the numbers to feel this. After 6 reds, suppose the wheel spins 10,000 more times, and treat red versus black as a fair coin flip. Those 6 reds are a fixed lump that never gets cancelled; they simply become a vanishing fraction of the total. The sum of deviations does not shrink. In fact the *sum* of reds minus blacks typically grows in absolute size like the square root of n. It is the *per-spin* deviation, that sum divided by n, that melts away. "Evening out" is dilution, not repayment. This is the single deepest distinction in the whole topic.
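
The dilution argument is easy to watch in a simulation. The sketch below is illustrative only: it assumes red and black are equally likely, starts with a surplus of 6 reds, and then adds fresh independent spins. The name surplus_after is hypothetical.

```python
import random

def surplus_after(n_more, head_start=6, seed=0):
    """Return (reds minus blacks, that surplus per spin) after n_more fresh fair spins."""
    rng = random.Random(seed)
    surplus = head_start  # the 6 reds already seen; nothing below tries to cancel them
    for _ in range(n_more):
        surplus += 1 if rng.random() < 0.5 else -1
    total_spins = head_start + n_more
    return surplus, surplus / total_spins

if __name__ == "__main__":
    for n_more in (100, 10_000, 400_000):
        surplus, per_spin = surplus_after(n_more)
        print(f"{n_more:>7} more spins: surplus = {surplus:+d}, per spin = {per_spin:+.5f}")
```

The raw surplus has no tendency to return to zero, while the per-spin figure shrinks because its denominator keeps growing.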

Two more guardrails. First, the law assumes the draws are independent and identically distributed with a finite mean; if the mean does not exist — as for the Cauchy distribution, whose heavy tails make E[X] undefined — the sample mean never settles at all, and the law simply does not apply. Second, the law tells you *that* the average converges, never *how fast*. The rate, the size of the typical leftover wobble, is a different and finer question — and it is precisely what the central limit theorem in the next guide answers.
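
The Cauchy caveat can be made concrete. The sample mean of n independent standard Cauchy draws is itself standard Cauchy, so the chance that it lands more than 1 away from the centre stays at one half no matter how large n is. The sketch below compares that with standard normal draws; the function names and the threshold of 1 are choices made for this illustration.

```python
import math
import random

def cauchy_mean(n, rng):
    """Sample mean of n standard Cauchy draws, generated by the inverse-CDF method."""
    return sum(math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)) / n

def normal_mean(n, rng):
    """Sample mean of n standard normal draws (finite mean 0, variance 1)."""
    return sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n

def big_miss_rate(mean_fn, n, eps=1.0, trials=1000, seed=0):
    """Fraction of repeated experiments in which the sample mean lies more than eps from 0."""
    rng = random.Random(seed)
    return sum(abs(mean_fn(n, rng)) > eps for _ in range(trials)) / trials

if __name__ == "__main__":
    for n in (1, 100):
        print(f"n = {n:>3}: normal {big_miss_rate(normal_mean, n):.3f}, "
              f"Cauchy {big_miss_rate(cauchy_mean, n):.3f}")
```

For the normal draws the miss rate should collapse toward zero as n grows, while for the Cauchy draws it should hover near one half at every n. Averaging buys nothing when the mean does not exist.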

Where the law quietly does your work

Once you trust the law, a lot of everyday reasoning becomes legitimate. When pollsters quote an average from a sample, when an insurer charges a premium it expects to cover claims, when a physicist reads a steady value off a noisy detector — all of it leans on the same fact: a sample mean approximates a population mean once n is large enough. The law is the bridge between the abstract expectation (a number you can rarely observe directly) and the sample average (a number you can actually compute from data).

The cleanest application is Monte Carlo estimation. Want a quantity that is hard to compute by formula — say the probability of a complicated event, or the area of an irregular region? Write it as an expectation E[g(X)], simulate many independent draws, average g over them, and the law guarantees that average converges to the answer. Estimating pi by throwing random darts at a square and counting how many land inside an inscribed circle is exactly this: the fraction inside is a sample mean of an indicator, and the law pins it to the true probability pi/4.

Write the target as an expectation. To estimate pi/4, let X be a random point in the unit square and g(X) = 1 if it falls inside the quarter-circle, 0 otherwise. Then E[g(X)] = pi/4, the area of the quarter-circle.
Simulate independent draws. Generate n random points and evaluate g on each, giving 0/1 values Y_1, ..., Y_n — an independent, identically distributed sample.
Average them. The sample mean Y-bar_n = (Y_1 + ... + Y_n) / n is just the fraction of darts inside the quarter-circle.
Invoke the law. By the strong law, Y-bar_n converges (with probability 1) to E[g(X)] = pi/4, so 4 * Y-bar_n homes in on pi. More darts means a tighter estimate, but the law alone won't tell you the error bar; that has to wait for the CLT.
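
The four steps above can be sketched in a few lines. This is a minimal illustration rather than a tuned estimator: the name estimate_pi, the seed, and the use of Python's standard random module are assumptions made here.

```python
import random

def estimate_pi(n, seed=0):
    """Estimate pi as 4 times the fraction of n random points that fall in the quarter-circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()  # X: a uniform point in the unit square
        if x * x + y * y <= 1.0:           # g(X) = 1 inside the quarter-circle, else 0
            inside += 1
    y_bar = inside / n                     # sample mean of the 0/1 values Y_1, ..., Y_n
    return 4 * y_bar

if __name__ == "__main__":
    for n in (100, 10_000, 200_000):
        print(n, estimate_pi(n))
```

Each run gives a slightly different estimate depending on the seed. The law says the estimates close in on pi as n grows, but quantifying the error at a given n needs more than the law itself.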