Why one notion of convergence is not enough
Up to now "x_n converges to L" has meant the plain thing from calculus: the numbers x_n get and stay arbitrarily close to a fixed limit L. That single definition served you well because each x_n was just a number. But in this rung the objects in the sequence are random variables — each X_n is a whole gamble, a function on the sample space — and asking whether X_1, X_2, X_3, ... "approaches" some limit X turns out to have several honest, genuinely different answers. The reason is subtle: a random variable carries not just a value but an entire pattern of chances, so "getting close" can mean close in value on each outcome, close in probability, or merely close in the shape of its distribution.
This is not pedantry. The two great results ahead live in three different modes. The weak law of large numbers says the sample average converges in one weaker sense; the strong law upgrades that to a stronger sense; and the central limit theorem is a statement in yet a third, weakest mode. If you blur the modes together you will think these theorems say the same thing or contradict each other; they do neither. So we slow down here and lay out all four modes carefully, from strongest to weakest, with a tiny concrete picture for each.
Almost sure: convergence outcome by outcome
The strongest of the four is almost sure convergence. Remember that each X_n is a function on the sample space: fix one underlying outcome omega — one full run of the experiment, say one infinite coin-toss history — and the numbers X_1(omega), X_2(omega), X_3(omega), ... form an ordinary sequence of real numbers. Almost sure convergence asks for that ordinary calculus limit to hold for essentially every omega: the set of outcomes where the sequence fails to converge has probability zero. It is allowed to fail on a few freak histories, as long as those freak histories together carry zero probability.
Picture tossing a fair coin forever and letting X_n be the running fraction of heads after n tosses. For a set of histories carrying probability 1, that fraction homes in on 1/2 and, from some point on, stays within any tolerance you name. There do exist bizarre histories (all heads forever, for instance) where the fraction is stuck at 1 and never approaches 1/2; almost sure convergence simply notes that the collection of all such bad histories has total probability zero. This "with probability 1" promise is exactly the language of the strong law of large numbers, and it is why we call that law strong.
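To watch the "one history at a time" viewpoint in action, here is a minimal simulation sketch. The function name running_fraction_of_heads and the use of a random seed to stand in for the outcome omega are illustrative choices, not part of the guide. A finite simulation can only suggest the behaviour; it cannot prove a probability-1 statement.

```python
import random

def running_fraction_of_heads(n_tosses, seed):
    """Return the list X_1, ..., X_n for one simulated coin-toss history."""
    rng = random.Random(seed)  # the seed plays the role of the outcome omega
    heads = 0
    fractions = []
    for n in range(1, n_tosses + 1):
        heads += rng.random() < 0.5  # True counts as 1, False as 0
        fractions.append(heads / n)
    return fractions

# Three different histories (three different omegas) of the same experiment.
for seed in (1, 2, 3):
    path = running_fraction_of_heads(100_000, seed)
    checkpoints = [path[n - 1] for n in (10, 1_000, 100_000)]
    print(seed, [round(x, 4) for x in checkpoints])
```

Each printed row is an ordinary sequence of numbers sampled at n = 10, 1,000 and 100,000. Almost sure convergence is the claim that rows like these settle near 1/2 for every history outside a set of probability zero.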
In probability and in mean: two more notions
Convergence in probability is weaker and asks less. It does not track each history to the end; it only requires that for any tolerance epsilon you pick, the chance that X_n strays more than epsilon from the limit shrinks to zero as n grows: P(|X_n - X| > epsilon) -> 0. The crucial difference from almost sure convergence is where the limit sits. Here we take a probability at each fixed n and then let n grow; almost sure convergence takes the limit along each history first and only then asks for a probability. So convergence in probability says that at each large n it is very unlikely to be far off, but it permits rare excursions to keep happening forever, just less and less often. Almost sure convergence says that for almost every history the excursions eventually stop altogether.
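Written side by side, the two definitions differ only in whether the limit sits inside or outside the probability. The block below is a standard formulation in LaTeX (the arrow notation assumes the amsmath package); the third line restates almost sure convergence in the "excursions eventually stop" form.

```latex
% Almost sure: the limit is taken along each history, INSIDE the probability.
X_n \xrightarrow{\text{a.s.}} X
\iff
P\Bigl(\bigl\{\omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\bigr\}\Bigr) = 1

% In probability: a probability at each fixed n, with the limit OUTSIDE.
X_n \xrightarrow{P} X
\iff
\lim_{n\to\infty} P\bigl(|X_n - X| > \varepsilon\bigr) = 0
\quad \text{for every } \varepsilon > 0

% Equivalent form of almost sure convergence: excursions stop.
X_n \xrightarrow{\text{a.s.}} X
\iff
P\bigl(|X_n - X| > \varepsilon \ \text{for infinitely many } n\bigr) = 0
\quad \text{for every } \varepsilon > 0
```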
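A vivid way to feel the gap: imagine a blinking light that at step n flashes with probability 1/n, independently of every other step, so it flashes ever more rarely but never permanently switches off. At any late moment you are very unlikely to catch it mid-flash (so it converges in probability to "dark"). Yet the flash probabilities 1/n add up to infinity, and for independent flashes the second Borel-Cantelli lemma then guarantees that almost every infinite timeline contains infinitely many flashes (so it does not converge almost surely). The independence and the slow decay both matter: if the flash probabilities shrank like 1/n^2, their sum would be finite and the flashes would eventually stop on almost every timeline. The weak law of large numbers is precisely a convergence-in-probability statement, which is why the strong law is a real upgrade rather than a restatement.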
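The blinking light can be made concrete under one specific assumption, which is ours rather than the guide's: the light flashes at step n with probability 1/n, independently across steps. The sketch below compares the chance of a flash at one late step with the chance of at least one flash somewhere in the stretch from n to 10n. The helper names prob_some_flash and simulated_share are hypothetical.

```python
import random

def prob_some_flash(first, last):
    """Exact P(at least one flash at steps first..last).

    P(no flash) is the product of (1 - 1/n) = (n - 1)/n over the window,
    which telescopes to (first - 1) / last.
    """
    return 1 - (first - 1) / last

def simulated_share(first, last, n_paths, rng):
    """Share of simulated timelines with at least one flash in first..last."""
    hits = 0
    for _ in range(n_paths):
        # any() stops scanning a timeline at its first flash in the window
        if any(rng.random() < 1.0 / n for n in range(first, last + 1)):
            hits += 1
    return hits / n_paths

rng = random.Random(0)
for first in (10, 100, 1_000):
    last = 10 * first
    print(f"P(flash at step {first}) = {1 / first:.3f}   "
          f"P(some flash in [{first}, {last}]): "
          f"exact {prob_some_flash(first, last):.4f}, "
          f"simulated {simulated_share(first, last, 300, rng):.3f}")
```

The first number in each row shrinks toward zero, which is convergence in probability. The exact window probability stays close to 0.9 however late the window starts, which is why the flashes never stop along a typical timeline.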
The third notion measures closeness with an average instead of a probability. Convergence in r-th mean requires E[|X_n - X|^r] -> 0; the most common case r = 2, convergence in mean square, demands E[(X_n - X)^2] -> 0. This is the natural mode when you care about expected squared error, as in much of statistics and signal work. Mean-square convergence forces convergence in probability too: Markov's inequality applied to (X_n - X)^2 gives P(|X_n - X| > epsilon) <= E[(X_n - X)^2] / epsilon^2, and the right side goes to zero. The converse fails. A sequence can converge in probability while mean-square convergence breaks down, because rare-but-enormous values keep the average error large even as the probability of a miss shrinks.
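A small exact calculation shows how rare-but-enormous values separate the two modes. The example is an assumption chosen for illustration: X_n equals n with probability 1/n and 0 otherwise, with limit X = 0. The function names are hypothetical.

```python
from fractions import Fraction

def miss_probability(n):
    """P(|X_n - 0| > eps) for any 0 < eps < n, where X_n = n w.p. 1/n, else 0."""
    return Fraction(1, n)

def mean_square_error(n):
    """E[(X_n - 0)^2] = n^2 * (1/n) + 0^2 * (1 - 1/n) = n."""
    return Fraction(n) ** 2 * Fraction(1, n)

eps = Fraction(1, 2)
for n in (10, 100, 1_000):
    p = miss_probability(n)
    mse = mean_square_error(n)
    markov_bound = mse / eps ** 2
    # Markov's inequality still holds; the bound is simply useless here.
    assert p <= markov_bound
    print(n, float(p), float(mse))
```

The miss probability falls like 1/n, so X_n converges to 0 in probability, while the mean squared error grows like n, so it does not converge in mean square.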
In distribution: only the shape converges
The weakest and most permissive is convergence in distribution. It does not ask the random variables to get close to each other at all — it only asks that their distributions get close. Formally, convergence in distribution holds when the cumulative distribution functions match in the limit, F_n(x) -> F(x), at every point x where the limit F is continuous. The X_n may be defined on completely unrelated experiments; all that converges is the pattern of chances, the silhouette of the histogram, not the values themselves.
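The clause "at every point x where the limit F is continuous" is easy to skip past, so here is a tiny sketch of why it is there. The example is an illustrative assumption: X_n is the constant 1/n, and the limit X is the constant 0. The helper cdf_of_constant is a hypothetical name.

```python
def cdf_of_constant(c):
    """CDF of a random variable that always equals c: F(x) = 1 if x >= c else 0."""
    return lambda x: 1.0 if x >= c else 0.0

F_limit = cdf_of_constant(0.0)  # CDF of the limit X = 0

for x in (-0.5, 0.0, 0.5):
    values = [cdf_of_constant(1.0 / n)(x) for n in (1, 10, 100, 1_000)]
    print(f"x = {x:+.1f}: F_n(x) = {values}, F(x) = {F_limit(x)}")
```

At x = -0.5 and x = 0.5 the values F_n(x) reach F(x). At x = 0, where F jumps, F_n(0) stays at 0 for every n while F(0) = 1. Nobody would deny that 1/n approaches 0, so the definition has to exempt the jump points of F.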
This is exactly the mode the central limit theorem speaks in. The CLT does not claim a standardized sample average settles on one number. It cannot, since that quantity stays random forever. It claims something subtler and beautiful: the distribution of the standardized average approaches the standard normal shape. The bell curve is a limit of shapes, reached in distribution. How does one prove a statement about shapes? The deep machinery uses the characteristic function: by the Levy continuity theorem, convergence in distribution is equivalent to pointwise convergence of these transforms, and that is the lever the CLT proof pulls.
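A rough simulation sketch of "a limit of shapes": compare the empirical CDF of the standardized coin-toss average with the standard normal CDF at a few points. The sample sizes, the check points and the function names are all illustrative assumptions, and the printed gaps include ordinary sampling noise.

```python
import math
import random

def standardized_mean_of_coin(n, rng):
    """(Xbar_n - 1/2) divided by the sd of Xbar_n, which is 1 / (2 sqrt(n))."""
    heads = bin(rng.getrandbits(n)).count("1")  # n fair bits drawn in one call
    return (heads / n - 0.5) * 2.0 * math.sqrt(n)

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def max_cdf_gap(n, trials, points, rng):
    """Largest |empirical CDF - normal CDF| over the given points."""
    sample = [standardized_mean_of_coin(n, rng) for _ in range(trials)]
    gaps = []
    for x in points:
        empirical = sum(z <= x for z in sample) / trials
        gaps.append(abs(empirical - normal_cdf(x)))
    return max(gaps)

rng = random.Random(0)
for n in (4, 30, 1_000):
    print(n, round(max_cdf_gap(n, 4_000, (-1.0, 0.0, 1.0), rng), 3))
```

Each individual standardized average stays random for every n. What shrinks is the gap between two CDFs, which is the only thing convergence in distribution talks about.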
How the four modes line up
These four are not a flat list — they form a ladder of strength, and knowing the implications among them saves real confusion later. The arrows run one way only: almost sure convergence implies convergence in probability, and convergence in r-th mean also implies convergence in probability, and convergence in probability implies convergence in distribution. The reverse arrows fail in general, which is exactly why we needed four names and not one. The blinking-light picture earlier was a counterexample showing convergence in probability does not give almost sure convergence.
Strength ladder (arrow = "implies"):

    almost sure -----+
                     |
                     +---> in probability ---> in distribution
                     |
    r-th mean -------+

    Reverse arrows fail in general.
    Special case: if the limit is a CONSTANT,
        in probability <==> in distribution.

Who lives where:

    Strong LLN ........ almost sure
    Weak LLN .......... in probability
    CLT ............... in distribution

Two honest cautions before you climb on. First, almost sure and r-th mean are not comparable to each other: neither implies the other, because one controls almost every history while the other controls an average, and these can disagree. Second, the tools you already own connect straight to this ladder: when X_n has mean mu, Chebyshev's inequality bounds P(|X_n - mu| > epsilon) by Var(X_n)/epsilon^2, which is the cleanest route to convergence in probability and so the engine of the weak law in the next guide. With the modes sorted, the law of large numbers and the central limit theorem will read not as slogans but as precise claims, each tagged with the exact sense in which something converges.
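As a preview of how Chebyshev drives convergence in probability, the sketch below takes X_n to be the fraction of heads in n fair tosses, so mu = 1/2 and Var(X_n) = 1/(4n), and sets the bound next to the exact binomial probability. The choice epsilon = 1/20 and the function names are illustrative assumptions.

```python
import math
from fractions import Fraction

def exact_miss_probability(n, eps):
    """Exact P(|Xbar_n - 1/2| > eps) for n fair tosses, as a binomial sum."""
    half = Fraction(1, 2)
    far = sum(math.comb(n, k) for k in range(n + 1)
              if abs(Fraction(k, n) - half) > eps)
    return Fraction(far, 2 ** n)

def chebyshev_bound(n, eps):
    """Var(Xbar_n) / eps^2 with Var(Xbar_n) = (1/4) / n, capped at 1."""
    return min(Fraction(1), Fraction(1, 4) / (n * eps ** 2))

eps = Fraction(1, 20)
for n in (100, 400, 1_600):
    exact = exact_miss_probability(n, eps)
    bound = chebyshev_bound(n, eps)
    assert exact <= bound  # Chebyshev is an upper bound, often a loose one
    print(n, f"{float(exact):.5f}", f"{float(bound):.5f}")
```

The bound is crude next to the exact probability, but it decays like 1/n for every fixed epsilon, and that decay is all that convergence in probability requires.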