Using the CLT: Approximations and Pitfalls

From theorem to recipe

You spent the last three guides earning the central limit theorem: for independent, identically distributed pieces with mean mu and finite variance sigma^2, the standardized sum (Sn - n mu) / (sigma sqrt(n)) converges in distribution to a standard normal. That is a statement about a limit, an idealized object reached only as n goes to infinity. But nobody ever has infinitely many data points. The whole practical value of the theorem is that you turn it around and use it backwards: for a large-but-finite n, the distribution of the sum (or the average) is approximately normal, and you compute with that approximation as if it were exact. This guide is about doing that honestly.

The recipe has exactly two ingredients you must get right: the center and the spread. A sum of n pieces has mean n mu and variance n sigma^2, so its standard deviation is sigma sqrt(n). An average X-bar of n pieces has mean mu and variance sigma^2 / n, so its standard deviation is sigma / sqrt(n). Once you know the right mean and the right standard deviation, the CLT says you may treat the quantity as Normal with those two numbers: convert the value you care about to a z-score, then read off the probability from the standard-normal table. Almost every misuse of the CLT is really a slip in one of these two numbers, so it is worth slowing down on them.

Sum of n iid:     mean = n*mu      sd = sigma*sqrt(n)
Average of n iid: mean = mu        sd = sigma/sqrt(n)

z = (value - mean) / sd
P(quantity <= value) ~ Phi(z)        [Phi = standard normal CDF]

The whole recipe on one card: get the mean and sd, standardize, look up Phi.
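
The card translates almost line for line into code. The sketch below is one possible rendering using only Python's standard library; the function name clt_prob_at_most and the example numbers are illustrative choices, not part of the recipe itself.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, sd 1


def clt_prob_at_most(value, mu, sigma, n, kind="sum"):
    """Approximate P(quantity <= value) for a sum or average of n iid pieces."""
    if kind == "sum":
        mean, sd = n * mu, sigma * sqrt(n)
    elif kind == "average":
        mean, sd = mu, sigma / sqrt(n)
    else:
        raise ValueError("kind must be 'sum' or 'average'")
    z = (value - mean) / sd       # standardize
    return STD_NORMAL.cdf(z)      # look up Phi(z)


# Hypothetical example: 100 pieces with mu = 0.5 and sigma = 0.5.
p_sum = clt_prob_at_most(55, mu=0.5, sigma=0.5, n=100, kind="sum")
p_avg = clt_prob_at_most(0.55, mu=0.5, sigma=0.5, n=100, kind="average")
print(round(p_sum, 4), round(p_avg, 4))
```

Both calls land on z = 1, because 'sum at most 55' and 'average at most 0.55' are the same event written two ways. Getting the same answer from both rows of the card is a quick sanity check on the two numbers.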

A worked approximation, end to end

Let us make it concrete with the friendliest case: a coin. Flip a fair coin 100 times and ask for the chance of getting 60 or more heads. The exact answer is a sum of binomial probabilities — a finite but tedious calculation. The CLT lets you skip it. Each flip is a piece with mean mu = 0.5 and variance sigma^2 = 0.25, so sigma = 0.5. The count of heads is a sum of n = 100 such pieces, so its mean is 100 * 0.5 = 50 and its standard deviation is 0.5 * sqrt(100) = 5. We have our two numbers.

Identify mean and sd of the count: mean = 50, sd = 5, from the sum row of the recipe card above.
Apply a continuity correction. The count is a whole number, but the normal is continuous, so 'at least 60' becomes 'at least 59.5' to fairly split the gap between 59 and 60.
Standardize: z = (59.5 - 50) / 5 = 9.5 / 5 = 1.9.
Look up the tail: P(Z >= 1.9) ~ 0.0287, so about a 2.9% chance. The exact binomial answer is about 0.0284 — the approximation is excellent.
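
If you would rather verify those numbers than take them on trust, the comparison is short. The sketch below computes the exact binomial tail and the normal approximation with and without the continuity correction; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, k = 100, 0.5, 60

# Exact: P(X >= 60) as a finite sum of binomial probabilities.
exact = sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

# Normal approximation: mean and sd of the count, then standardize.
mean = n * p
sd = sqrt(n * p * (1 - p))
phi = NormalDist().cdf

with_cc = 1 - phi((k - 0.5 - mean) / sd)   # 'at least 59.5', z = 1.9
without_cc = 1 - phi((k - mean) / sd)      # 'at least 60',   z = 2.0

print(f"exact      {exact:.4f}")
print(f"with cc    {with_cc:.4f}")
print(f"without cc {without_cc:.4f}")
```

Dropping the correction moves z from 1.9 to 2.0 and lands noticeably further from the exact answer, which is why the half-unit shift is worth the trouble for whole-number counts.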

How fast, and how to check before you trust

The coin worked beautifully, but 'large n' is not a magic word; the real question is how large is large enough, and that depends on the shape of the underlying pieces. The honest quantitative answer is the Berry-Esseen theorem, which puts a hard number on the worst-case error of the normal approximation: the largest gap, over all x, between the CDF of the standardized sum and Phi(x). Provided the third absolute moment rho = E[|X - mu|^3] is finite, that gap is at most C * rho / (sigma^3 * sqrt(n)), where C is a universal constant (a bit under 0.5). Two lessons fall straight out: the error shrinks like 1/sqrt(n), which is slow, and it is inflated by skew and heavy tails through the third-moment factor rho / sigma^3.
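
To see the bound in numbers, here is a small sketch for coin-like pieces (Bernoulli with success probability p), where rho and sigma have closed forms. Two assumptions are mine: C is rounded up to 0.5, which keeps the bound valid but slightly loose, and the biased coin with p = 0.01 is just an illustrative skewed case.

```python
from math import sqrt

C = 0.5  # assumption: conservative round-up of the universal constant


def bernoulli_shape_factor(p):
    """rho / sigma^3 for a single Bernoulli(p) piece."""
    q = 1 - p
    rho = p * q * (p**2 + q**2)   # E|X - p|^3 = p*q^3 + q*p^3
    sigma = sqrt(p * q)
    return rho / sigma**3


def berry_esseen_bound(p, n):
    """Worst-case CDF error bound C * rho / (sigma^3 * sqrt(n))."""
    return C * bernoulli_shape_factor(p) / sqrt(n)


for p in (0.5, 0.01):
    for n in (100, 10_000):
        print(p, n, round(berry_esseen_bound(p, n), 4))
```

For the fair coin the shape factor is exactly 1, so the bound is simply C / sqrt(n). For p = 0.01 the factor is close to 10, so the same guarantee needs roughly a hundred times as many flips. Keep in mind that this is a worst-case bound: the actual error is often much smaller, as the coin example showed.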

That second lesson is the practical heart of it. A symmetric, well-behaved population like the coin reaches normality almost immediately — n in the tens is plenty. A strongly skewed population, like incomes or insurance claims or waiting times, can need n in the hundreds or thousands before the bell shape settles in, and even then the far tails are the last thing to converge. The old classroom slogan 'n >= 30 is enough' is a rule of thumb, not a theorem; it is fine for mildly non-normal data and badly optimistic for heavily skewed data. Treat 30 as a starting suspicion, never as a guarantee.
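
One way to watch a skewed population lag behind is to pick a case where the exact answer is available. The sketch below assumes waiting-time-like pieces that are Exponential with rate 1 (my choice for illustration), whose sum has an Erlang distribution with a closed-form tail, and compares the exact chance of landing two standard deviations above the mean with the normal value.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def erlang_upper_tail(n, x):
    """Exact P(S_n >= x) where S_n is a sum of n iid Exponential(1) pieces.

    Uses the identity P(S_n >= x) = P(Poisson(x) <= n - 1); each term is
    computed in log space so large n does not underflow or overflow.
    """
    return sum(exp(-x + k * log(x) - lgamma(k + 1)) for k in range(n))


def tail_two_sd_above(n):
    # Exponential(1) has mu = 1 and sigma = 1: the sum has mean n, sd sqrt(n).
    return erlang_upper_tail(n, n + 2 * sqrt(n))


normal_tail = 1 - NormalDist().cdf(2.0)   # what the CLT recipe predicts

for n in (30, 100, 1000):
    exact = tail_two_sd_above(n)
    print(n, round(exact, 4), round(normal_tail, 4), round(exact / normal_tail, 2))
```

The last column is the ratio of the true tail to the normal tail. It sits well above 1 at n = 30 and drifts toward 1 only slowly, which is the 1/sqrt(n) rate showing up in an upper tail.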

Where the CLT quietly fails

Now the pitfalls, the part most courses skip. The first and deepest failure is the one you met last guide: the CLT requires a finite variance, and when the variance is infinite there is no sigma to standardize by and no bell curve to converge to. The cleanest example is the Cauchy distribution, whose tails are so heavy that even its mean is undefined. Average n Cauchy variables and you do not get a tighter and tighter distribution around some center — you get back exactly the same Cauchy you started with, no matter how large n is. Averaging buys you nothing; the CLT has no grip at all. Heavy-tailed data in the wild (some financial returns, some network and file-size data) sit close enough to this regime that the normal approximation can be dangerously wrong.
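
A small simulation makes the Cauchy claim tangible. Because the variance does not exist, the sketch below measures spread with the interquartile range instead, which for a standard Cauchy is 2. The seed, the repetition count, and the helper names are arbitrary illustrative choices.

```python
import random
from math import pi, tan
from statistics import quantiles


def cauchy_draw(rng):
    # Inverse-CDF draw from a standard Cauchy distribution.
    return tan(pi * (rng.random() - 0.5))


def iqr_of_averages(n, reps, rng):
    """Interquartile range of `reps` averages, each over n Cauchy draws."""
    means = [sum(cauchy_draw(rng) for _ in range(n)) / n for _ in range(reps)]
    q1, _, q3 = quantiles(means, n=4)
    return q3 - q1


rng = random.Random(0)   # fixed seed so reruns give the same numbers
for n in (1, 10, 100):
    print(n, round(iqr_of_averages(n, reps=2000, rng=rng), 2))
```

For a finite-variance population the spread of the average would shrink by a factor of 10 between n = 1 and n = 100. Here it should hover around 2 at every n, up to simulation noise.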

The second failure is dependence. The classic CLT assumes the pieces are independent (or at least that dependence dies away fast enough). When observations are strongly correlated — consecutive days in a time series, repeated measures on the same person, clustered survey responses — the effective amount of independent information is far less than n, and the true standard deviation of the average is much larger than the naive sigma / sqrt(n) would claim. Plugging in sigma / sqrt(n) anyway produces error bars that are too narrow and confidence that is unearned. The fix is not to abandon the CLT but to use a version honest about dependence, which still gives normality but with the correct, larger spread.
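
To put a size on 'much larger', the sketch below assumes one specific dependence pattern chosen for illustration: a stationary series whose lag-k correlation is phi^k, as in a first-order autoregression. For any stationary series, Var(X-bar) equals sigma^2 / n times the factor 1 + 2 * sum over k of (1 - k/n) * corr(k), and the code evaluates that factor directly.

```python
from math import sqrt


def variance_inflation(phi, n):
    """Var(average) divided by the naive sigma^2 / n.

    Assumes lag-k correlation phi**k (an AR(1)-style pattern).
    """
    return 1 + 2 * sum((1 - k / n) * phi**k for k in range(1, n))


sigma, n = 1.0, 100
for phi in (0.0, 0.5, 0.8):
    factor = variance_inflation(phi, n)
    naive_sd = sigma / sqrt(n)
    true_sd = naive_sd * sqrt(factor)
    n_eff = n / factor                     # effective sample size
    print(phi, round(naive_sd, 3), round(true_sd, 3), round(n_eff, 1))
```

With phi = 0.8 the variance of the average is between eight and nine times the naive value, so 100 correlated observations carry roughly as much information as a dozen independent ones. With phi = 0 the factor is exactly 1 and the usual recipe is recovered.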

The third failure is non-identical pieces in which a few terms dominate. The Lindeberg-Feller version of the CLT, met earlier in this rung, relaxes 'identically distributed' to a condition saying, roughly, that no single term is allowed to dominate the sum. When that condition breaks, that is, when one or two terms carry most of the total variance, the sum keeps the fingerprint of those few big terms and need not look normal at all. This is the honest reason the CLT is a statement about many small comparable contributions adding up, not about any sum whatsoever.
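
A crude first screen for this failure is to ask what share of the total variance the single largest term carries. The sketch below is only that screen, not the Lindeberg condition itself: a small share is necessary for the condition but does not prove it. The function name and the two variance lists are hypothetical.

```python
def largest_variance_share(variances):
    """Fraction of the sum's total variance carried by its biggest term."""
    total = sum(variances)
    return max(variances) / total


balanced = [1.0] * 100                 # 100 comparable contributions
lopsided = [1.0] * 99 + [10_000.0]     # one term swamps the rest

print(round(largest_variance_share(balanced), 4))
print(round(largest_variance_share(lopsided), 4))
```

In the balanced case each term carries 1% of the variance. In the lopsided case one term carries about 99%, so the sum is essentially that one term plus a little noise and inherits its shape.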

Three traps of interpretation

Even when the CLT genuinely applies, it is easy to read it wrong. Trap one: the CLT is not the law of large numbers, and the two are constantly confused. The law of large numbers says the average converges to mu; that is, the spread of the average collapses to zero. The CLT is the refinement that describes the SHAPE of that collapsing spread along the way: blown up by the factor sqrt(n), the deviations of the average from mu look normal. One says where the average goes; the other says how it fluctuates around that destination. And neither says that individual outcomes 'even out': independent trials have no memory, and expecting past results to be compensated for is the gambler's fallacy all over again.

Trap two: the CLT is about the distribution of the sum or average, not about the data themselves. People sometimes say 'my data are normal because of the CLT' — but the CLT never claims your raw observations become normal as you collect more of them; the population shape is whatever it is and does not change. What becomes normal is the sampling distribution of the average computed from many observations. Mixing these up is the difference between a histogram of incomes (still skewed) and a histogram of average incomes from many samples (approximately bell-shaped).
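
The two histograms can be compared numerically through their skewness. The sketch below assumes a skewed stand-in population, Exponential with rate 1, whose theoretical skewness is 2; the average of 50 such draws has theoretical skewness 2 / sqrt(50), about 0.28. The seed, the sample sizes, and the helper name are illustrative choices.

```python
import random
from statistics import fmean


def skewness(xs):
    """Sample skewness: third central moment over variance^(3/2)."""
    m = fmean(xs)
    m2 = fmean([(x - m) ** 2 for x in xs])
    m3 = fmean([(x - m) ** 3 for x in xs])
    return m3 / m2**1.5


rng = random.Random(0)   # fixed seed so reruns give the same numbers

# 2000 samples, each of 50 raw observations from the skewed population.
samples = [[rng.expovariate(1.0) for _ in range(50)] for _ in range(2000)]

raw = [x for sample in samples for x in sample]   # all raw observations pooled
means = [fmean(sample) for sample in samples]     # one average per sample

raw_skew = skewness(raw)
mean_skew = skewness(means)
print("raw data skewness:      ", round(raw_skew, 2))
print("sample-average skewness:", round(mean_skew, 2))
```

Pooling 100,000 raw observations does nothing to their skewness; it simply estimates the population value more precisely. Only the averages move toward symmetry.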

Trap three: in practice you almost never know the true sigma, so you plug in an estimate from the data. That is legitimate, and the reason it stays legitimate is a tool from earlier in this rung — Slutsky's theorem, which says that if the standardized average converges to a normal and your estimate of sigma converges to the true sigma, then the version using the estimated sigma still converges to that same normal. This is the quiet machinery that lets real-world confidence intervals work at all. The same family of results includes the delta method, which extends normal approximations from an average to a smooth function of an average — so you can put error bars on things like a ratio or a log of an estimate, not just the estimate itself.
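
In code, the plug-in step and the delta method each take a line or two. The sketch below builds a normal-approximation confidence interval for a mean using the estimated sigma, then carries it over to the log of the mean using the delta-method rule se(log X-bar) ~ se(X-bar) / X-bar. The function names are hypothetical, and the data are placeholder numbers chosen so the arithmetic is easy to check, not real measurements.

```python
from math import log, sqrt
from statistics import NormalDist, fmean, stdev


def mean_ci(data, level=0.95):
    """Normal-approximation CI for the mean, with sigma estimated from data."""
    n = len(data)
    xbar = fmean(data)
    se = stdev(data) / sqrt(n)                    # estimated sigma / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)     # about 1.96 for 95%
    return xbar, se, (xbar - z * se, xbar + z * se)


def log_mean_ci(data, level=0.95):
    """Delta method: se(log(xbar)) is approximately se(xbar) / xbar."""
    xbar, se, _ = mean_ci(data, level)
    se_log = se / xbar                            # derivative of log is 1/x
    z = NormalDist().inv_cdf(0.5 + level / 2)
    center = log(xbar)
    return center, se_log, (center - z * se_log, center + z * se_log)


data = list(range(1, 101))   # placeholder: 100 positive numbers

print(mean_ci(data))
print(log_mean_ci(data))
```

Both intervals lean on n being reasonably large, since that is what Slutsky's theorem and the delta method are statements about. For a handful of observations the estimated sigma is itself too noisy for the plain normal quantile to be trusted.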