The Central Limit Theorem

From the law of large numbers to a sharper question

The previous guide settled where the sample average goes: by the weak law of large numbers the mean of n independent, identically distributed draws collapses onto the true expectation mu as n grows. That is a statement about the *centre*. But it leaves a richer question untouched: how does the average wobble *around* mu before it gets there? The error X-bar_n - mu shrinks to zero, yet at any finite n it is a random quantity with its own shape. The [[prob-central-limit-theorem|central limit theorem]] (CLT) describes that shape, and the answer is astonishingly universal.

Here is the picture in words. Take any distribution with a finite mean mu and a finite variance sigma^2: a lopsided die, a coin flip, a waiting time, almost anything. Draw n values from it independently and add them up. The sum is itself random, but once it is centred and rescaled, its histogram smooths into the same familiar bell as n grows: the [[prob-normal-distribution|normal distribution]]. The individual ingredient is forgotten; only its mean and variance survive into the limit. That erasure of detail is what makes the theorem feel like magic, and it is why the bell curve turns up in heights, measurement errors, and exam scores alike.

The statement, stated carefully

To state the theorem we must standardise, because the raw sum runs off to infinity and its spread grows too. Let X_1, X_2, ... be independent and identically distributed with mean mu and finite variance sigma^2 > 0. The sample average X-bar_n has mean mu and variance sigma^2/n, so its standard deviation is sigma/sqrt(n). Subtract the mean and divide by that standard deviation to get a clean dimensionless quantity Z_n = (X-bar_n - mu) / (sigma / sqrt(n)) = sqrt(n) (X-bar_n - mu) / sigma. The CLT says Z_n converges in distribution to the standard normal Normal(0, 1) as n goes to infinity.

X_1, ..., X_n  iid,  E[X_i] = mu,  Var(X_i) = sigma^2  (finite, > 0)

   S_n     = X_1 + ... + X_n            (the sum)
   X-bar_n = S_n / n                    (the average)

   Z_n = (X-bar_n - mu) / (sigma / sqrt(n))
       = (S_n - n*mu) / (sigma * sqrt(n))

   As n -> infinity:   Z_n  -->  Normal(0, 1)   (in distribution)

   so for large n,  X-bar_n  is approximately  Normal(mu, sigma^2 / n)

The classical (i.i.d.) central limit theorem: centre, scale by sqrt(n), and the bell appears.
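
The statement can be checked by simulation. The sketch below is illustrative only: the choice of Exponential(1) ingredients (mean 1, standard deviation 1), the sample size n = 200, the trial count, and the helper name `standardised_mean` are all assumptions made for the example, not part of the theorem.

```python
import math
import random
import statistics

def standardised_mean(n, rng):
    """One draw of Z_n = sqrt(n) * (X-bar_n - mu) / sigma for Exponential(1) terms."""
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(rate=1)
    xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
    return math.sqrt(n) * (xbar - mu) / sigma

rng = random.Random(0)  # fixed seed so the run is repeatable
n, trials = 200, 5000
z = [standardised_mean(n, rng) for _ in range(trials)]

normal = statistics.NormalDist()  # standard normal: mean 0, sd 1
for x in (-1.0, 0.0, 1.0):
    empirical = sum(v <= x for v in z) / trials
    print(f"P(Z_n <= {x:+.0f}): empirical {empirical:.3f}, normal {normal.cdf(x):.3f}")
```

The empirical probabilities should sit close to the normal ones even though a single exponential draw is strongly skewed; the small remaining mismatch is the finite-n error.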

Read the convergence honestly. "Converges in distribution" means the cumulative distribution function of Z_n approaches that of the standard normal at every point — it is a statement about probabilities and shapes, the weakest of the four convergence modes from the first guide of this rung, not a claim that Z_n itself settles down to a fixed random value. The two scalings are doing different jobs: dividing the sum by n (the law of large numbers) kills the randomness, while dividing by sqrt(n) (the CLT) keeps exactly the right amount alive to see its shape. That factor of sqrt(n), not n, is the heart of the whole result.
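
The different jobs of the two scalings can be seen numerically. This is a rough sketch assuming Uniform(0, 1) ingredients (mean 0.5, standard deviation sqrt(1/12)); the names `sample_mean` and `spread` are made up for the illustration.

```python
import math
import random
import statistics

rng = random.Random(1)               # fixed seed for repeatability
MU, SIGMA = 0.5, math.sqrt(1 / 12)   # mean and sd of Uniform(0, 1)
TRIALS = 2000

def sample_mean(n):
    """Average of n independent Uniform(0, 1) draws."""
    return sum(rng.random() for _ in range(n)) / n

spread = {}
for n in (10, 100, 1000):
    errors = [sample_mean(n) - MU for _ in range(TRIALS)]
    lln_sd = statistics.pstdev(errors)   # spread of X-bar_n - mu: shrinks with n
    clt_sd = math.sqrt(n) * lln_sd       # spread of sqrt(n) * (X-bar_n - mu): stays near sigma
    spread[n] = (lln_sd, clt_sd)
    print(f"n={n:5d}  sd of error {lln_sd:.4f}  sd after sqrt(n) scaling {clt_sd:.4f}")

print(f"sigma = {SIGMA:.4f}")
```

The first column collapses towards zero, which is the law of large numbers; the second hovers around sigma for every n, which is the stable scale on which the CLT describes the shape.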

Why a bell, and why always the same one?

The cleanest reason uses the characteristic function, a tool the transforms rung built for exactly this. Its great advantage over the moment generating function is honesty: the mgf may fail to exist for heavy-tailed laws, but the characteristic function phi_X(t) = E[e^(itX)] exists for every distribution. And it turns sums into products: the characteristic function of a sum of independent variables is the product of their characteristic functions, so adding independent pieces is just multiplying their transforms.

1. Standardise each term to mean 0 and variance 1, so its characteristic function has the Taylor expansion phi(t) = 1 - t^2/2 + o(t^2), where the linear term vanishes (mean 0) and the t^2 coefficient is set by the variance.
2. The characteristic function of the standardised sum Z_n is phi(t / sqrt(n)) raised to the power n, because independence turns the sum into a product and the sqrt(n) scaling shrinks the argument.
3. Substitute the expansion: [1 - t^2/(2n) + o(1/n)]^n. This is the classic limit [1 + a/n]^n -> e^a with a = -t^2/2, giving e^(-t^2/2).
4. But e^(-t^2/2) is exactly the characteristic function of the standard normal, and a characteristic function determines its distribution uniquely. Because pointwise convergence of characteristic functions implies convergence in distribution (Lévy's continuity theorem), the limit must be Normal(0, 1).
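
The steps above can be written as a single chain. The symbol Y_k = (X_k - mu)/sigma for the standardised terms is notation introduced here for the sketch, so that Z_n = (Y_1 + ... + Y_n)/sqrt(n) and phi is the common characteristic function of the Y_k; t is held fixed throughout.

```latex
\begin{align*}
\varphi_{Z_n}(t)
  &= \mathbb{E}\!\left[e^{\,itZ_n}\right]
   = \prod_{k=1}^{n} \mathbb{E}\!\left[e^{\,itY_k/\sqrt{n}}\right]
   = \left[\varphi\!\left(\frac{t}{\sqrt{n}}\right)\right]^{n} \\
  &= \left[1 - \frac{t^{2}}{2n} + o\!\left(\frac{1}{n}\right)\right]^{n}
   \;\longrightarrow\; e^{-t^{2}/2}
   \qquad (n \to \infty).
\end{align*}
```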

Notice what survived the limit and what did not. Only the first two terms of the expansion — the mean (forced to 0 by centring) and the variance (forced to 1 by scaling) — reached the answer; every higher detail of the original distribution, its skewness, its kurtosis, its exact shape, was crushed by the sqrt(n) shrinkage. That is the precise mechanism behind the universality: the bell is not a special property of dice or coins, it is the unique fixed point that survives when you add many small independent things and rescale. The normal is the attractor of summation.
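
The erasure of detail can be checked directly on characteristic functions. The sketch below assumes three illustrative ingredient laws, each already standardised to mean 0 and variance 1 (a fair plus-or-minus-one coin, a uniform on [-sqrt(3), sqrt(3)], and an Exponential(1) shifted to mean 0); the function names and the evaluation point t = 1.5 are choices made for this example.

```python
import cmath
import math

SQRT3 = math.sqrt(3.0)

# Characteristic functions of three laws, each with mean 0 and variance 1.
def cf_coin(t):          # +1 or -1 with probability 1/2 each
    return complex(math.cos(t))

def cf_uniform(t):       # Uniform(-sqrt(3), sqrt(3)); formula valid for t != 0
    return complex(math.sin(SQRT3 * t) / (SQRT3 * t))

def cf_exponential(t):   # Exponential(1) minus 1: skewed, mean 0, variance 1
    return cmath.exp(-1j * t) / (1 - 1j * t)

def cf_standardised_sum(cf, t, n):
    """Characteristic function of Z_n, namely phi(t / sqrt(n)) ** n."""
    return cf(t / math.sqrt(n)) ** n

t = 1.5
target = math.exp(-t * t / 2)  # standard normal characteristic function at t
errors = {}
for name, cf in [("coin", cf_coin), ("uniform", cf_uniform), ("exponential", cf_exponential)]:
    errors[name] = [abs(cf_standardised_sum(cf, t, n) - target) for n in (10, 100, 10_000)]
    print(f"{name:12s}", ["%.5f" % e for e in errors[name]])
```

All three rows head to zero, so three quite different shapes share one limit. The skewed exponential should lag behind the two symmetric laws, since its third-moment term dies more slowly than their fourth-moment terms.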

A tiny worked feel, and how fast it converges

Roll a single fair die and you get a flat, blocky distribution over 1 to 6, nothing bell-like at all, with mean mu = 3.5 and variance sigma^2 = 35/12, about 2.92. Now roll several dice and look at the average. With just two dice the histogram of the sum is already a tidy triangle peaking at 7; with five it is visibly humped and roughly symmetric; by ten or so it is hard to tell from a normal curve by eye. Nothing about a single die hints at this: the bell is born purely from the act of averaging, and you have watched the same erasure of detail the characteristic-function argument predicted.
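
The dice picture can be reproduced exactly rather than by eye. The sketch below builds the exact distribution of the sum of k dice by repeated convolution and compares it with the normal density of matching mean 3.5k and variance 35k/12; the helper names `dice_sum_pmf` and `max_gap_to_normal` are invented for this illustration. It works with the sum, which differs from the average only by the factor k.

```python
from fractions import Fraction
from statistics import NormalDist

def dice_sum_pmf(k):
    """Exact {total: probability} for the sum of k fair six-sided dice."""
    pmf = {0: Fraction(1)}
    for _ in range(k):
        nxt = {}
        for total, p in pmf.items():
            for face in range(1, 7):
                nxt[total + face] = nxt.get(total + face, Fraction(0)) + p / 6
        pmf = nxt
    return pmf

def max_gap_to_normal(k):
    """Largest difference between the exact pmf and the matching normal density."""
    normal = NormalDist(mu=3.5 * k, sigma=(35 * k / 12) ** 0.5)
    return max(abs(float(p) - normal.pdf(s)) for s, p in dice_sum_pmf(k).items())

gaps = {k: max_gap_to_normal(k) for k in (1, 2, 5, 10)}
for k, gap in gaps.items():
    print(f"{k:2d} dice: largest gap between exact pmf and normal density = {gap:.4f}")
```

Because the possible sums are spaced one apart, the probability of each sum is directly comparable to the density at that point, and the largest gap should shrink as dice are added.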

But "converges" is an asymptotic promise, and honesty demands we ask how good the approximation is at a *finite* n. The [[berry-esseen-theorem|Berry-Esseen theorem]] answers this: provided the third absolute moment rho = E[|X - mu|^3] is finite, the gap between the true cumulative distribution function of Z_n and the standard normal one is, at every point, at most C * rho / (sigma^3 * sqrt(n)), where C is a universal constant under 1. Two lessons fall out. First, the error bound shrinks like 1/sqrt(n), which is slow: quadrupling the sample only halves it. Second, the more skewed or heavy-tailed the ingredient (large rho relative to sigma^3), the larger the n you need before the bell is trustworthy.
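
For the fair die the bound can be worked out and compared with the true error. This is a sketch under assumptions: the value C = 0.4748 is taken from published bounds on the constant and is not fixed by the text, and the helper names are made up for the example. The true gap is computed from the exact distribution of the sum, checking the step-function CDF just before and at each jump.

```python
import math
from statistics import NormalDist

FACES = range(1, 7)
MU = sum(FACES) / 6                                        # 3.5
SIGMA = math.sqrt(sum((f - MU) ** 2 for f in FACES) / 6)   # sqrt(35/12)
RHO = sum(abs(f - MU) ** 3 for f in FACES) / 6             # E|X - mu|^3
C = 0.4748  # assumed admissible constant (the text only requires C < 1)

def berry_esseen_bound(n):
    """Upper bound C * rho / (sigma^3 * sqrt(n)) on the CDF gap."""
    return C * RHO / (SIGMA ** 3 * math.sqrt(n))

def dice_sum_pmf(n):
    """Exact {total: probability} for the sum of n fair dice (floats)."""
    pmf = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for total, p in pmf.items():
            for face in FACES:
                nxt[total + face] = nxt.get(total + face, 0.0) + p / 6
        pmf = nxt
    return pmf

def kolmogorov_gap(n):
    """Largest |P(Z_n <= z) - Phi(z)|, checked just before and at each jump."""
    phi = NormalDist().cdf
    pmf = dice_sum_pmf(n)
    cdf_before, gap = 0.0, 0.0
    for total in sorted(pmf):
        z = (total - n * MU) / (SIGMA * math.sqrt(n))
        cdf_at = cdf_before + pmf[total]
        gap = max(gap, abs(cdf_before - phi(z)), abs(cdf_at - phi(z)))
        cdf_before = cdf_at
    return gap

for n in (1, 4, 16, 64):
    print(f"n={n:3d}  true gap {kolmogorov_gap(n):.4f}  bound {berry_esseen_bound(n):.4f}")
```

The bound is a worst-case guarantee over all distributions with the same rho and sigma, so the true gap for the symmetric die should come in well underneath it; both columns roughly halve each time n is quadrupled.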

What the theorem does not say

The single most important hypothesis is finite variance. The CLT we stated needs sigma^2 < infinity, and the most famous failure is the Cauchy distribution, whose tails are so heavy that even its mean is undefined. Average n independent Cauchy draws and you do not get a tightening bell: you get back the very same Cauchy distribution, no matter how large n is. Adding more data buys you nothing because a single freakish outlier can dominate the whole sum. The next guide is devoted to exactly why finite variance is the load-bearing assumption.
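
The Cauchy failure is easy to see in simulation. The sketch below draws standard Cauchy values by the inverse-CDF transform tan(pi * (U - 1/2)) and measures the interquartile range of the sample mean, since the variance does not exist; the function names, seed, and trial count are assumptions of the example. A standard Cauchy has quartiles at -1 and +1, so its interquartile range is 2.

```python
import math
import random
import statistics

rng = random.Random(2)  # fixed seed for repeatability

def cauchy():
    """One standard Cauchy draw via the inverse-CDF transform of a uniform."""
    return math.tan(math.pi * (rng.random() - 0.5))

def iqr_of_means(n, trials=4000):
    """Interquartile range of the average of n Cauchy draws, over many trials."""
    means = [sum(cauchy() for _ in range(n)) / n for _ in range(trials)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1

iqr = {n: iqr_of_means(n) for n in (1, 10, 100)}
for n, width in iqr.items():
    print(f"n={n:4d}  interquartile range of the sample mean = {width:.3f}")
```

For a finite-variance ingredient the width would shrink like 1/sqrt(n); here it should stay near 2 at every n, which is the statement that the average of Cauchy draws is again standard Cauchy.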

A second trap is treating the CLT as a license to call everything normal. It speaks about the distribution of a *sum or average* of many independent contributions; it says nothing about a single raw observation. Heights are roughly normal because they are the sum of many small genetic and environmental effects, but file sizes, incomes, and city populations are heavy-tailed and stubbornly non-normal — they are not built by adding many comparable independent pieces. And the i.i.d. assumption can be relaxed (the Lindeberg condition lets terms differ in distribution, provided no single one dominates), but it cannot simply be dropped: strong dependence or one giant term breaks the result.