The Union Bound and the Borel-Cantelli Lemmas

The one-line inequality everything leans on

The earlier guides in this rung each squeezed a single random variable: Markov's inequality bounded how far X strays above its mean, Chebyshev's used the variance, and the Chernoff and Hoeffding bounds gave exponentially small tails for sums. This last guide changes the question. Instead of 'how big can one quantity get?', we ask 'what is the chance that at least one of many bad things happens?' The tool is so simple it is almost embarrassing — and so useful it is everywhere.

The union bound (also called Boole's inequality) says: for any events A_1, A_2, ..., A_n, P(A_1 or A_2 or ... or A_n) <= P(A_1) + P(A_2) + ... + P(A_n). The probability that any of them happens is no more than the sum of their individual probabilities. Notice what is NOT required: the events need not be independent, need not be disjoint, need not have anything to do with each other. You just add up the separate chances, and that sum is an upper ceiling on the chance that something goes wrong.

Why is it true? Picture each event as a region of paint splashed onto the sample space. The union 'A_1 or A_2 or ...' is the total painted area. If you instead add up the areas one event at a time, every point that two events both cover gets counted twice, every point three cover gets counted thrice, and so on — you can only ever over-count, never under-count. So the sum of the parts is at least the area of the union. The bound is exact precisely when the events are mutually exclusive (no overlap, nothing double-counted); the more they overlap, the slacker it gets. It is the rough, one-sided cousin of the exact inclusion-exclusion principle, which keeps all the overlap-correction terms instead of throwing them away.

Why a crude bound is so powerful: failure of the worst part

The union bound shines when you have many things that each almost never fail, and you want them all to succeed at once. Flip it around with the complement rule: if A_i is 'component i fails', then 'everything works' is the complement of 'something fails', so P(all succeed) = 1 - P(some fail) >= 1 - sum P(A_i). To guarantee the whole system works with probability at least 0.99, it is enough to make the individual failure probabilities add up to at most 0.01. You never have to untangle how the failures correlate — a colossal simplification.

Here is a tiny worked number. Suppose you run 1000 independent statistical tests and each, by bad luck alone, raises a false alarm with probability 0.001. What is the chance that at least one false alarm appears? The union bound says at most 1000 x 0.001 = 1, which here is useless (a probability can never exceed 1, so the bound told you nothing). But shrink each test's threshold so its false-alarm rate is 0.00001; now the chance of any false alarm at all is at most 1000 x 0.00001 = 0.01. This is exactly the logic behind the Bonferroni correction in statistics: to keep the chance of any false discovery below a level, divide that level by the number of tests. The union bound is the whole proof.

The first Borel-Cantelli lemma: rare events eventually stop

Now stretch the union bound to infinitely many events and you get something that feels like magic: a way to decide, with certainty, whether a sequence of events keeps happening forever or shuts off after some point. The key phrase is 'infinitely often' — the event that A_n occurs for infinitely many values of n, no matter how far out you look there is always another one waiting. The first Borel-Cantelli lemma gives a clean test for ruling this out.

It says: if the probabilities sum to a finite total, sum P(A_n) < infinity, then with probability 1 only finitely many of the A_n ever occur. In plain words, if the chances die down fast enough that their grand total is finite, the events are guaranteed to stop — there is some last time A_n happens, and after that, never again. The proof is one application of the union bound. The chance that A_n still happens at some index n or beyond is at most the tail sum P(A_n) + P(A_(n+1)) + ..., and because the whole series converges, that tail can be driven below any tiny number you like by going far enough out. A probability that is smaller than every positive number is zero. So 'infinitely often' has probability zero, which means 'finitely often' has probability one.

First Borel-Cantelli   (no independence needed):

   sum_n P(A_n) < infinity     ==>    P(A_n infinitely often) = 0
                                      i.e. only finitely many A_n occur, a.s.

   proof in one line:
      P(some A_k with k >= n) <= sum_{k>=n} P(A_k)  --> 0   (union bound + convergent tail)


Second Borel-Cantelli   (INDEPENDENCE required):

   sum_n P(A_n) = infinity  AND  A_n independent
                                ==>    P(A_n infinitely often) = 1


   The sum being finite or infinite is the whole switch:
      converges  ->  events stop forever
      diverges   ->  events keep returning (if independent)

The two lemmas as a single dichotomy: whether the series of probabilities converges decides whether the events stop or keep coming back.

The second lemma: when rare events refuse to stop

The first lemma only ever concludes 'probability zero'. To conclude the opposite — that events DO keep coming back forever — you need a converse, and here a price must be paid. The second Borel-Cantelli lemma says: if the probabilities sum to infinity, sum P(A_n) = infinity, AND the events A_n are independent, then with probability 1 infinitely many of them occur. The divergent sum now forces the events to recur forever rather than to stop.

The independence hypothesis is not red tape — it is genuinely indispensable, and dropping it makes the claim false. Here is a one-line counterexample to keep you honest. Let A be a single event with P(A) = 1/2, and define A_n = A for every n (the same event, repeated, perfectly correlated rather than independent). Then sum P(A_n) = 1/2 + 1/2 + ... = infinity, so the divergence condition holds. But 'A_n infinitely often' happens exactly when A happens, which has probability 1/2, not 1. The divergent sum did not force recurrence, because the events were not independent. Independence is what stops the probability mass from piling onto the same outcomes.

Together the two lemmas form a beautiful all-or-nothing law for independent events: the count of occurrences is either finite for sure (sum converges) or infinite for sure (sum diverges) — never anything in between, never something like 'a 60% chance of recurring'. That stark 'zero or one' verdict is no accident; it is a special case of Kolmogorov's zero-one law, which says any 'tail event' (one whose truth does not depend on any finite number of the trials) of independent variables must have probability exactly 0 or 1. 'A_n happens infinitely often' is precisely such a tail event — changing the first million trials cannot affect whether something happens infinitely often.

Putting it to work: convergence, monkeys, and the law of large numbers

The first lemma is the standard route to proving that something happens almost surely. A sequence X_n converges to X with probability 1 exactly when, for every small tolerance epsilon, the bad events A_n = '|X_n - X| > epsilon' occur only finitely often. So the recipe is: bound P(|X_n - X| > epsilon) for each n (Markov, Chebyshev, and the Chernoff or Hoeffding bounds are your suppliers here), check that these bounds sum to something finite, and the first Borel-Cantelli lemma hands you almost sure convergence for free. This is the bridge from the tail bounds of this whole rung to the strongest mode of convergence in the next.

The second lemma supplies the famous reassurances about persistence. Toss a fair coin forever: the event 'heads on toss n' is independent across n and each has probability 1/2, so sum P = infinity and by the second lemma you get heads infinitely often, with probability 1 — and the same for tails, and for any fixed finite pattern like ten heads in a row (each block has a fixed positive probability, the blocks along a subsequence are independent, the sum diverges). This is the rigorous version of the 'infinite monkeys' idea: any fixed finite text will appear infinitely often in an endless stream of random letters. Crucially this is NOT the gambler's fallacy — the coin has no memory and is not 'due' for anything. It is simply that with infinitely many independent chances, each with the same positive probability, a divergent sum guarantees endless recurrence.

Finally, this machinery is one of the cleanest routes to the strong law of large numbers — the statement that the running average of independent identically distributed variables converges to the true mean with probability 1, not merely in probability. Be careful about what that promise is, though. The strong law is about the average settling down to mu; it does not say the sum, or the raw count of heads minus tails, ever 'evens out' or returns to zero. In fact, by these same lemmas the gap between heads and tails grows without bound and is large infinitely often — what shrinks is that gap divided by n. The averages converge; the discrepancies do not vanish. Holding those two facts together at once is the mark of really understanding this rung.