Where the variance hides in the statement
In the previous guide you met the central limit theorem in its standard form: for independent, identically distributed variables X_1, ..., X_n with mean mu and finite variance sigma^2, the standardized sum (X_1 + ... + X_n - n*mu) / (sigma * sqrt(n)) converges in distribution to a standard normal. Read that statement slowly and you will spot sigma^2 sitting in three places at once. It is assumed to be finite; it sets the very scale, sigma * sqrt(n), that we divide by; and if we divide by sqrt(n) alone, it survives as the variance of the limiting bell curve. The theorem is built out of the variance, so it should be no surprise that taking the variance away dismantles it.
The deep reason the variance is exactly the right currency is the square-root-of-n scaling. A sum of n independent copies has a mean that grows like n and a standard deviation that grows like sqrt(n), because the variances add: Var(X_1 + ... + X_n) = n * sigma^2. The sum therefore spreads out as sqrt(n), which is why we divide by sqrt(n) and no other power. But "the standard deviation grows like sqrt(n)" is only true when each term contributes a finite amount of variance. The whole machine assumes that no single term, however far out, can dominate the pile. Remove the finite variance and that promise evaporates: one freak term can outweigh all the others combined.
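The sqrt(n) growth is easy to check numerically. The following is a minimal sketch, assuming Uniform(0, 1) terms as a stand-in for any finite-variance law; the function name `sd_of_sum`, the replicate count, and the seed are illustrative choices, not part of the theorem.

```python
import math
import random
import statistics

SIGMA = math.sqrt(1 / 12)  # standard deviation of a single Uniform(0, 1) draw


def sd_of_sum(n, replicates=2000, seed=0):
    """Empirical standard deviation of a sum of n Uniform(0, 1) draws."""
    rng = random.Random(seed)
    sums = [sum(rng.random() for _ in range(n)) for _ in range(replicates)]
    return statistics.stdev(sums)


# Compare the observed spread of the sum with the prediction sigma * sqrt(n).
for n in (10, 40, 160):
    observed = sd_of_sum(n)
    predicted = SIGMA * math.sqrt(n)
    print(f"n={n:4d}  observed sd={observed:.3f}  sigma*sqrt(n)={predicted:.3f}")
```

Quadrupling n should roughly double the observed spread, up to simulation noise.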
The Cauchy: a distribution that refuses to average
The cleanest place to watch the theorem fail is the Cauchy distribution, whose density is the gentle hump f(x) = 1 / (pi * (1 + x^2)). It looks innocent (symmetric, peaked at zero, vaguely bell-like), but its density decays only like 1/x^2 in the tails, so the tail probability P(|X| > x) decays only like 1/x, far too slowly. The defining integral for the mean, the integral of x * f(x), does not converge absolutely; the Cauchy has no well-defined expectation, and therefore certainly no finite variance. It fails the CLT's hypothesis before we even begin.
Now here is the genuinely startling part. Take n independent Cauchy variables and average them. You would hope, as with any well-behaved law, that the average X-bar tightens around some center as n grows. It does not. A remarkable fact — provable in one line with the characteristic function — is that the average of n independent standard Cauchy variables is itself exactly standard Cauchy, for every n. Averaging a thousand of them gives you back the very same wide distribution you started with. The spread never shrinks; the law of large numbers fails for the Cauchy too, and the bell curve never appears.
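A small simulation makes the contrast concrete. This is a sketch under stated assumptions: spread is measured by the interquartile range (IQR), because the Cauchy has no variance to estimate, and the helper names, replicate count, and seed are hypothetical. A standard Cauchy has quartiles at -1 and +1, so its IQR is 2.

```python
import math
import random
import statistics


def standard_cauchy(rng):
    """One standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))


def standard_normal(rng):
    """One standard normal draw, used as the well-behaved comparison."""
    return rng.gauss(0.0, 1.0)


def iqr_of_means(draw, n, replicates=2000, seed=1):
    """Interquartile range of the sample mean of n draws, over many replicates."""
    rng = random.Random(seed)
    means = [sum(draw(rng) for _ in range(n)) / n for _ in range(replicates)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


for n in (1, 10, 100):
    print(f"n={n:3d}  normal IQR={iqr_of_means(standard_normal, n):.3f}"
          f"  Cauchy IQR={iqr_of_means(standard_cauchy, n):.3f}")
```

The normal column should shrink like 1/sqrt(n), while the Cauchy column should hover near 2 for every n.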
One line of proof, with the characteristic function
It is worth seeing the calculation, because it shows precisely where the usual CLT machinery jams. The characteristic function turns sums into products: for independent variables, the characteristic function of a sum is the product of the individual ones. For n independent standard Cauchy variables, each contributes e^(-|t|), so their sum S_n has characteristic function e^(-n|t|). The average is S_n / n, and rescaling by 1/n replaces t with t/n inside the function. Watch what happens.
Follow it through. A single standard Cauchy has phi(t) = e^(-|t|), so the independent sum S_n has phi_{S_n}(t) = (e^(-|t|))^n = e^(-n|t|). The average is X-bar = S_n / n, and rescaling by 1/n means evaluating at t/n: phi_{X-bar}(t) = phi_{S_n}(t/n) = e^(-n*|t/n|) = e^(-|t|). Dividing the sum by n cancels the n inside the exponent exactly, leaving the characteristic function of a single Cauchy. The average is stuck as a Cauchy and never concentrates. Contrast this with the CLT's rescaling by sqrt(n). Applied to the Cauchy sum it gives phi_{S_n}(t/sqrt(n)) = e^(-sqrt(n)|t|), which tends to 0 for every t != 0, so it is not the characteristic function of any limiting distribution: the rescaled sum simply spreads out without bound. In the finite-variance case the same sqrt(n) rescaling instead converges to the Gaussian e^(-t^2/2).
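For reference, the same calculation in display form. The notation (\varphi for the characteristic function, \bar X for the average) is just a typeset version of the symbols used above.

```latex
\begin{align*}
\varphi_{X_1}(t) &= e^{-|t|}, \\
\varphi_{S_n}(t) &= \bigl(e^{-|t|}\bigr)^{n} = e^{-n|t|}, \\
\varphi_{\bar X}(t) &= \varphi_{S_n}(t/n) = e^{-n\,|t/n|} = e^{-|t|}, \\
\varphi_{S_n/\sqrt{n}}(t) &= \varphi_{S_n}\bigl(t/\sqrt{n}\bigr) = e^{-\sqrt{n}\,|t|}
  \xrightarrow[n\to\infty]{} 0 \qquad (t \neq 0).
\end{align*}
```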
Compare this with a finite-variance variable, whose characteristic function near zero looks like 1 - (sigma^2 * t^2)/2 + ..., a smooth parabola. That little t^2 term is the variance, and it is exactly what the CLT proof exponentiates into the Gaussian e^(-t^2/2) after the sqrt(n) rescaling. The Cauchy's characteristic function instead has a corner at the origin — it behaves like 1 - |t| + ..., with no t^2 term to find, because there is no variance to supply one. No quadratic term, no Gaussian limit. The missing variance is visible right there in the shape of the transform.
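The role of the t^2 term can be checked with exact formulas rather than simulation. The sketch below assumes fair +/-1 coin flips as the finite-variance example (mean 0, variance 1, characteristic function cos(t)); the function names are illustrative.

```python
import math


def coin_cf_of_standardized_sum(t, n):
    """Characteristic function of (X_1 + ... + X_n) / sqrt(n) for fair +/-1 flips.

    A single flip has characteristic function cos(t), so the rescaled sum
    has cos(t / sqrt(n)) ** n.
    """
    return math.cos(t / math.sqrt(n)) ** n


def cauchy_cf_of_sqrt_scaled_sum(t, n):
    """Characteristic function of S_n / sqrt(n) for standard Cauchy terms."""
    return math.exp(-abs(t) / math.sqrt(n)) ** n


t = 1.0
print(f"Gaussian target exp(-t^2/2) = {math.exp(-t * t / 2):.6f}")
for n in (10, 100, 10_000):
    print(f"n={n:6d}  coin: {coin_cf_of_standardized_sum(t, n):.6f}"
          f"  Cauchy: {cauchy_cf_of_sqrt_scaled_sum(t, n):.3e}")
```

The coin column should approach the Gaussian target, while the Cauchy column should collapse toward zero, which reflects the corner at the origin.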
How heavy is too heavy?
The Cauchy is not a freak; it is the visible tip of a whole family. The right way to measure danger is the tail index. Suppose a distribution's tail decays like a power, P(|X| > x) approximately C / x^alpha for large x. The variance is the integral of x^2 against the density, so it is finite only when the tail dies fast enough to kill that extra x^2 — which works out to alpha > 2. The mean survives whenever alpha > 1. So there are three regimes, and only one of them gives you the ordinary bell curve.
    tail: P(|X| > x) ~ C / x^alpha

    alpha > 2      : variance finite                 -> classical CLT applies, Gaussian limit
    1 < alpha < 2  : mean finite, variance INFINITE  -> no Gaussian; alpha-stable limit
    alpha <= 1     : mean infinite too               -> even the law of large numbers can fail
                     (the Cauchy is the alpha = 1 boundary case)

    (At alpha = 2 exactly the variance is infinite, yet the limit is still Gaussian under the slightly larger scale sqrt(n log n).)
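The thresholds alpha > 1 and alpha > 2 both come from one integral. A short derivation, assuming the pure power-law tail above holds for all large x; the exponent p and the cutoff x_0 are notation introduced here for illustration.

```latex
\begin{align*}
\mathbb{E}\,|X|^{p}
  &= \int_{0}^{\infty} p\,x^{p-1}\,P(|X| > x)\,dx, \\
\int_{x_0}^{\infty} p\,x^{p-1}\cdot\frac{C}{x^{\alpha}}\,dx
  &= pC \int_{x_0}^{\infty} x^{\,p-1-\alpha}\,dx
  \;<\; \infty
  \iff p - 1 - \alpha < -1
  \iff p < \alpha .
\end{align*}
```

Taking p = 1 gives the condition for a finite mean, and p = 2 gives the condition for a finite variance.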
A concrete example is the Pareto distribution used to model wealth, city sizes, and insurance losses, with tail P(X > x) = (x_m / x)^alpha for x >= x_m. With alpha = 3 it has a finite variance and the CLT works fine. With alpha = 1.5 the mean exists but the variance is infinite, and the fluctuations of a sum around its mean are dominated by the single largest term: the error in the average of a hundred such losses is essentially set by the biggest loss, not by a smooth Gaussian blur. This is not a pathology invented by mathematicians; power-law tails are routinely reported for financial returns, file sizes, and earthquake energies, and when the tail index drops below 2 the CLT's reassuring bell curve quietly does not apply.
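One rough way to see a single term taking over the spread is to ask what share of the sum of squares comes from the largest draw. This diagnostic is an illustrative choice made here, not a standard test; the sketch relies on `random.paretovariate`, which samples a Pareto law with x_m = 1, and the sample sizes and seed are arbitrary.

```python
import random
import statistics


def largest_square_share(alpha, n, rng):
    """Share of the sum of squares contributed by the single largest draw."""
    squares = [rng.paretovariate(alpha) ** 2 for _ in range(n)]
    return max(squares) / sum(squares)


def median_share(alpha, n=1000, replicates=200, seed=2):
    """Median of that share over many independent samples of size n."""
    rng = random.Random(seed)
    shares = [largest_square_share(alpha, n, rng) for _ in range(replicates)]
    return statistics.median(shares)


for alpha in (3.0, 1.5):
    print(f"alpha={alpha}: median share of largest squared term = "
          f"{median_share(alpha):.3f}")
```

With alpha = 3 the largest squared term should be a small fraction of the total, while with alpha = 1.5 it should account for a substantial part of it, however large n is made.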
The honest fine print: finite variance is sufficient, almost necessary
Two refinements keep us honest. First, identical distributions are not actually required: the Lindeberg-Feller theorem gives a CLT for sums of independent but differently distributed terms, provided each term has a finite variance AND the Lindeberg condition holds, which in particular forces every single term's variance to be a vanishing fraction of the total. That second clause is the precise statement of the intuition we met at the start: no term may take over the pile. Finite variance alone is not the whole story when the terms differ; you also need them to be individually negligible.
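For readers who want the condition written out, here is its usual form. The symbols mu_k, sigma_k^2, s_n, and epsilon are notation introduced here: the mean and variance of the k-th term, the standard deviation of the sum, and an arbitrary positive threshold.

```latex
\begin{gather*}
s_n^{2} = \sum_{k=1}^{n} \sigma_k^{2}, \\
\text{for every } \varepsilon > 0:\quad
\lim_{n\to\infty} \frac{1}{s_n^{2}} \sum_{k=1}^{n}
  \mathbb{E}\!\left[(X_k - \mu_k)^{2}\,
  \mathbf{1}\{\,|X_k - \mu_k| > \varepsilon s_n\,\}\right] = 0, \\
\text{which implies}\quad
\max_{1 \le k \le n} \frac{\sigma_k^{2}}{s_n^{2}} \longrightarrow 0 .
\end{gather*}
```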
Second, when finite variance does fail, the bell curve is not replaced by chaos; it is replaced by a different limit. The generalized central limit theorem says that heavy-tailed sums with tail index alpha < 2, properly centered and rescaled (by n^(1/alpha), not sqrt(n)), converge to an alpha-stable distribution. The normal is simply the alpha = 2 member of this family; the Cauchy is the alpha = 1 member. So the CLT we love is one special case of a richer law, and the role of finite variance is to single out the Gaussian as the attractor rather than one of its heavy-tailed cousins.
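The n^(1/alpha) scale can be glimpsed in simulation. This is a rough sketch, assuming a symmetrized Pareto with alpha = 1.5 (a random sign makes the mean zero, so no centering is needed); the function names, the two sample sizes, and the replicate count are arbitrary illustrative choices, and convergence at these sizes is only approximate.

```python
import random
import statistics

ALPHA = 1.5  # tail index: mean finite, variance infinite


def symmetric_pareto(rng):
    """Pareto(alpha=1.5, x_m=1) magnitude with a random sign, so the mean is 0."""
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return sign * rng.paretovariate(ALPHA)


def iqr_of_sums(n, replicates=500, seed=3):
    """Interquartile range of S_n = X_1 + ... + X_n over many replicates."""
    rng = random.Random(seed)
    sums = [sum(symmetric_pareto(rng) for _ in range(n))
            for _ in range(replicates)]
    q1, _, q3 = statistics.quantiles(sums, n=4)
    return q3 - q1


for n in (100, 1600):
    spread = iqr_of_sums(n)
    print(f"n={n:5d}  IQR/sqrt(n)={spread / n ** 0.5:.3f}"
          f"  IQR/n^(1/alpha)={spread / n ** (1 / ALPHA):.3f}")
```

Dividing by sqrt(n) should leave a spread that keeps growing with n, while dividing by n^(1/alpha) should leave it roughly stable.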
Putting the hypothesis in its place
Step back and the picture is tidy. The central limit theorem is a statement with hypotheses, not a slogan that fits everything. Independence keeps the terms from conspiring; finite variance fixes the sqrt(n) scale and supplies the t^2 term that becomes the Gaussian; and a negligibility condition (automatic under identical distributions) keeps any single term from running the show. Honour all three and the bell curve is guaranteed. Break the variance condition and you fall into a different, heavier world with its own beautiful but non-Gaussian limits.
The single most useful habit this guide can leave you with is to ask, before invoking the CLT on real data, "does this thing even have a finite variance?" For heights, measurement errors, and coin flips the answer is plainly yes and the bell curve is well earned. For losses, returns, network traffic, and anything with a power-law tail, the answer may quietly be no — and assuming the CLT there is one of the most expensive mistakes in applied probability. The next guide turns to using the theorem well when it does apply, and the pitfalls that remain even then.