Hoeffding and Concentration of Sums

From one variable to a whole sum

In the previous guide you sharpened the Chernoff bound: instead of bounding a tail with the mean (Markov) or the variance (Chebyshev), you fed the whole moment generating function into Markov's inequality and got a tail that falls off *exponentially*. The Chernoff recipe was 'optimize over a knob t and watch the bound collapse'. Hoeffding's inequality is the moment that recipe pays off in full. It takes a sum of many independent bounded variables and hands you one clean exponential tail — no distribution details required, only the range each variable lives in.

Picture the most ordinary thing in statistics: a sum, or its cousin the average. Let X_1, X_2, ..., X_n be independent, and let S = X_1 + ... + X_n be their sum. The weak law of large numbers already told you the *average* S/n settles toward its mean — but it was vague about speed, only promising the deviation probability eventually goes to zero. Hoeffding fills that gap with a number you can actually use: for any deviation t you choose, it tells you a concrete, exponentially small ceiling on P(S - E[S] >= t). It turns 'eventually concentrates' into 'concentrates *this* tightly, *this* fast'.

The inequality itself

Here is the one assumption Hoeffding needs, and it is wonderfully cheap: each X_i is bounded, living inside some interval [a_i, b_i]. You do not need to know its mean, its variance, its shape — only that it cannot escape that range. Under that single condition, the Hoeffding inequality says: P(S - E[S] >= t) <= exp(-2 t^2 / sum_i (b_i - a_i)^2). The whole personality of each variable is compressed into one number, its range width (b_i - a_i). A wide-ranging variable contributes a lot of room to wander; a tightly bounded one contributes almost none.

Read the exponent slowly, because it carries all the meaning. The deviation t sits *squared* on top, so the bound is brutally sensitive to how far out you go — doubling the deviation quarters the exponent's distance from zero, crushing the probability. On the bottom sits the total squared range. If every variable shares the same range of width c, that denominator becomes n * c^2, and the bound is exp(-2 t^2 / (n c^2)). To keep the average's deviation fixed you set t = n*epsilon, and the exponent becomes -2 n epsilon^2 / c^2 — a bound that *itself* shrinks exponentially in n. That is the quantitative heart the weak law was missing.

Hoeffding (one-sided), X_i independent, a_i <= X_i <= b_i :

    P( S - E[S] >= t )  <=  exp( -2 t^2 / sum_i (b_i - a_i)^2 )

For the average of n equal-range variables (width c), with t = n*eps:

    P( S/n - mean >= eps )  <=  exp( -2 n eps^2 / c^2 )

Two-sided (both tails):  multiply the right-hand side by 2.

  eps^2  on top  -> tiny deviations are cheap, big ones vanish fast
  n      on top  -> more data => exponentially tighter
  c^2    below   -> wider variables => looser bound

Hoeffding's inequality, with each piece of the exponent labelled by what it controls.

A tiny worked poll

Let us make it concrete with the most familiar bounded variable of all: a yes/no answer. Each polled person gives an indicator X_i that is 1 for 'yes' and 0 for 'no', so a_i = 0 and b_i = 1 and the range width is exactly 1. The average S/n is the observed fraction of 'yes' votes, and its mean is the true population fraction p. We want to know: with n = 1000 people, how likely is our poll to be off by more than 5 percentage points (epsilon = 0.05)?

Identify the ranges. Each X_i is in [0, 1], so c = b_i - a_i = 1 and the denominator sum is n * c^2 = 1000.
Pick the deviation of the average: epsilon = 0.05, so the sum's deviation is t = n*epsilon = 1000 * 0.05 = 50.
Plug into the exponent: -2 t^2 / (n c^2) = -2 * (50)^2 / 1000 = -2 * 2500 / 1000 = -5.
One tail: P(S/n - p >= 0.05) <= e^(-5) ≈ 0.0067. Both tails (off by 5 points either way): double it, about 0.013.

So with just a thousand people, the probability of being off by five points or more is below about 1.3% — and notice we never needed to know p. That p-free guarantee is exactly why this argument underpins real polling margins, A/B test stopping rules, and the generalization bounds of machine learning: you bound the worst case without knowing the answer you are trying to estimate. Want a tighter guarantee? Because n sits in the exponent, quadrupling to n = 4000 sends the exponent to -20 and the tail to e^(-20), an almost unimaginably small number. Reliability is *cheap* in this regime — that is the gift of exponential concentration.

Why bounded? Sub-Gaussian and the engine inside

Why does boundedness do all the work? The engine, hidden under the hood, is the Chernoff machinery from the last guide applied to each piece. The key fact (Hoeffding's lemma) is that a bounded variable in [a, b] is sub-Gaussian: its moment generating function is controlled by that of a Gaussian whose variance comes from the range. A sub-Gaussian variable is one whose tails decay at least as fast as a normal's. Independence then lets the MGFs *multiply*, range widths add up in the exponent, and the single Chernoff optimization over t closes the bound. Boundedness is just the cleanest sufficient condition for being sub-Gaussian.

This also tells you honestly where Hoeffding *stops*. If a variable is unbounded with heavy tails — incomes, insurance claims, anything where one rare giant value dominates — the MGF may not even exist, sub-Gaussianity fails, and Hoeffding simply does not apply. (Recall the Chernoff bound needs the MGF to exist at all, and some distributions have no MGF even when a characteristic function always does.) For such cases you reach for tools that need only a variance, like Chebyshev, or for the central limit theorem — but the CLT itself needs *finite variance* and famously fails for the Cauchy distribution. There is no free lunch: stronger conclusions ask for stronger assumptions.

When independence bends: Azuma and martingales

Hoeffding's one hard requirement is independence — the pieces must not pull on each other. But many real processes are not a clean sum of independent parts: a quantity that builds up step by step, where each step depends on the history so far. Remarkably, you can still get a Hoeffding-shaped exponential tail, as long as the running process is a martingale — a sequence that is 'fair' in the sense that, given everything so far, the expected next change is zero (a fair game's fortune, neither drifting up nor down on average).

The Azuma-Hoeffding inequality says: if a martingale's steps are bounded (each increment lives in a range of width c_i), then the same exponential tail holds, P(deviation >= t) <= exp(-2 t^2 / sum_i c_i^2). It is the same shape, with 'independent variables' swapped for 'bounded martingale increments'. This single generalization is what lets concentration arguments reach far beyond simple sums into algorithms, random graphs, and any quantity that changes a little at each step without one step ever dominating. Hoeffding was the doorway; Azuma is the hallway it opens onto.

Step back and see the whole rung as one staircase. Markov used only the mean; Chebyshev added the variance and gave the weak law; Chernoff fed in the full MGF for an exponential tail; Hoeffding turned that into a clean, distribution-free bound for sums of bounded variables; and Azuma loosened independence to bounded-increment martingales. Each rung kept the same Markov-on-a-clever-transform skeleton but bought a sharper or broader conclusion. Next, the final guide trades exponential sharpness for sheer breadth — the humble union bound and the Borel-Cantelli lemmas, which control infinitely many events at once.