The Geometric Distribution and Waiting for Success

Flipping the binomial question on its head

In the previous guide you fixed the number of trials at some n and asked how many of them were successes — that count is the binomial distribution. Both the binomial and the model we meet now are built from the very same raw material: a long run of identical, independent Bernoulli trials, each a yes-or-no experiment that succeeds with probability p. What changes is the question. Instead of fixing the number of trials and counting successes, we now fix the target — the first success — and count the trials it takes to reach it.

This is the natural shape of a huge family of real questions. How many sales calls until the first "yes"? How many die rolls until the first 6? How many times must you reload a flaky web page before it finally loads? In each case the trials are repeated under the same conditions and you are waiting for the first success. The geometric distribution is the answer to exactly that waiting question, and it is the simplest member of the discrete model zoo whose story is about *when*, not *how many*.

Building the pmf one failure at a time

You can derive the whole probability mass function from scratch with nothing but the multiplication rule for independent trials. For the first success to land exactly on trial k, two things must happen in sequence: the first k-1 trials must all be failures, AND the k-th trial must be a success. A single failure has probability 1-p, and the trials are independent, so k-1 failures in a row have probability (1-p)^(k-1). Multiply by the probability p of the success that finally arrives, and you have the answer.

P(X = k) = (1 - p)^(k-1) * p      for k = 1, 2, 3, ...

example (p = 1/6, the first 6 on a die):
  P(X = 1) = (5/6)^0 * 1/6 = 0.1667
  P(X = 2) = (5/6)^1 * 1/6 = 0.1389
  P(X = 3) = (5/6)^2 * 1/6 = 0.1157
  ...probabilities shrink by a factor of 5/6 each step

The geometric pmf: failures pile up as a power, then one success closes the deal. Each term is 5/6 of the one before — a geometric sequence, which is where the name comes from.

Notice the shape these numbers trace out. The single most likely value is always k = 1: getting the success immediately is the most probable single outcome, even when p is small, because every later value has to survive extra failures first. From there the probabilities fall off geometrically, never quite reaching zero. That is why the support runs over all the positive integers — there is always some tiny chance you wait a very long time. And the terms do add up to exactly 1, because the infinite sum p + p(1-p) + p(1-p)^2 + ... is a geometric series summing to p / (1 - (1-p)) = 1, a satisfying sign that the model is self-consistent.
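The derivation above translates almost word for word into code. The following is a minimal sketch, not a library implementation; the function name geometric_pmf is chosen here purely for illustration. It reproduces the three die-roll values and checks that the probabilities accumulate toward 1.

```python
def geometric_pmf(k: int, p: float) -> float:
    """P(X = k): k - 1 failures in a row, then one success."""
    if k < 1:
        return 0.0  # the support starts at k = 1
    return (1 - p) ** (k - 1) * p


p = 1 / 6  # chance of rolling a 6 on a fair die
first_three = [round(geometric_pmf(k, p), 4) for k in (1, 2, 3)]
print(first_three)  # [0.1667, 0.1389, 0.1157]

# Partial sums creep toward 1 as more of the tail is included.
partial_sum = sum(geometric_pmf(k, p) for k in range(1, 201))
print(partial_sum)
```

Summing only the first 200 terms already lands within floating-point error of 1, because the leftover tail is (5/6)^200.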

How long should you expect to wait?

The expected value of a geometric variable is beautifully simple: E[X] = 1/p. If each trial succeeds with probability p, that is, about one time in every 1/p trials, then on average it takes 1/p trials to see the first success. Roll a die for the first 6 and you expect 1/(1/6) = 6 rolls. Aim for an event with p = 0.01 and you expect 100 attempts. This matches intuition: rarer successes make you wait proportionally longer. It also gives you a quick sanity number for any waiting problem.

But the average alone hides an important warning, the kind this rung keeps stressing: the spread of the waiting time is enormous. The variance of the geometric distribution is Var(X) = (1-p)/p^2, which for small p is roughly 1/p^2, so the standard deviation is about 1/p, nearly as large as the mean itself. With p = 0.01 you expect 100 trials, but a standard deviation near 100 means waits of 30 or 250 are unremarkable: about 26% of waits end within 30 trials, and about 8% run past 250. The geometric distribution is right-skewed and long-tailed; quoting only "on average 100" badly understates how wildly the actual wait can swing.

E[X]   = 1 / p                 (expected number of trials)
Var(X) = (1 - p) / p^2         (spread, large for small p)

p = 1/6 :  E[X] = 6,    SD = sqrt((5/6)/(1/36)) = sqrt(30) ~ 5.48
p = 0.01:  E[X] = 100,  SD ~ 99.5  (almost as big as the mean!)

Mean and variance of the geometric distribution. The standard deviation grows almost as fast as the mean, so the wait is far less predictable than the single number 1/p suggests.
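One way to build trust in these formulas is to compare them with a simulation. The sketch below is illustrative only: the helper names, the seed, and the sample size of 100,000 are arbitrary choices made here. It evaluates the closed forms for both values of p, then estimates the mean wait for a 6 by actually "rolling" with Python's standard random module.

```python
import math
import random


def geometric_mean(p: float) -> float:
    return 1 / p


def geometric_variance(p: float) -> float:
    return (1 - p) / p ** 2


def sample_wait(p: float, rng: random.Random) -> int:
    """Run Bernoulli(p) trials until the first success; return the trial count."""
    trials = 1
    while rng.random() >= p:  # a draw of p or more counts as a failure
        trials += 1
    return trials


for p in (1 / 6, 0.01):
    sd = math.sqrt(geometric_variance(p))
    print(f"p={p:.4f}  mean={geometric_mean(p):.1f}  sd={sd:.2f}")

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
waits = [sample_wait(1 / 6, rng) for _ in range(100_000)]
sample_mean = sum(waits) / len(waits)
print(sample_mean)  # should land close to 6
```

Individual simulated waits vary a great deal even though their average settles near 6, which is the practical meaning of a standard deviation almost as large as the mean.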

The astonishing lack of memory

Here is the property that makes the geometric distribution famous, and it trips up almost everyone the first time. Suppose you have already rolled a die 10 times without a single 6. How many more rolls until your first 6? The honest answer is: still 6 more on average, exactly as if you had not rolled at all. The past failures bought you nothing. Formally this is the memorylessness property: P(X > m + n | X > m) = P(X > n) for any nonnegative integers m and n. The geometric is the *only* distribution on the positive integers with this property.

Why is it true? Because the trials are independent, the die simply does not know or care what came before. Each fresh roll is a brand-new Bernoulli trial with the same probability p, so the count of *additional* rolls until success has the very same geometric distribution as if you were starting clean. You can even see it in the algebra: P(X > n) = (1-p)^n (the chance the first n trials are all failures), and (1-p)^(m+n) / (1-p)^m = (1-p)^n, so the m cancels completely.
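The cancellation is easy to confirm numerically. This is a small sketch under the same die-rolling setup; survival and conditional_survival are names introduced here for illustration. It computes the conditional probability straight from its definition and compares it with the unconditional tail.

```python
def survival(n: int, p: float) -> float:
    """P(X > n): the first n trials are all failures."""
    return (1 - p) ** n


def conditional_survival(m: int, n: int, p: float) -> float:
    """P(X > m + n | X > m), from the definition of conditional probability."""
    return survival(m + n, p) / survival(m, p)


p = 1 / 6
m = 10  # rolls already made without a 6
for n in (1, 3, 6):
    # The two printed probabilities agree up to floating-point rounding.
    print(n, survival(n, p), conditional_survival(m, n, p))
```

Changing m to any other nonnegative integer leaves the right-hand column unchanged, which is the memorylessness property in action.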

Waiting for the r-th success: the negative binomial

The geometric distribution waits for the *first* success. The obvious generalization is to wait for the *r-th* success, and that gives the negative binomial distribution, the star of guide 4 in this rung. The link is clean and worth holding in your head: the geometric distribution is exactly the negative binomial with r = 1. And just as a binomial count is a sum of independent Bernoulli indicators, a negative binomial wait is a sum of r independent geometric waits — wait for the first success, then reset and wait for the next, r times over.

This decomposition is more than a curiosity; it is a tool. Because expectation is linear, the expected wait for r successes is just r copies of the single-success wait: E = r/p. Variance adds for independent pieces too, giving Var = r(1-p)/p^2. You never have to memorize the negative binomial's mean and variance separately — they are the geometric's numbers multiplied by r. Building distributions out of simpler independent pieces, then summing their means and variances, is a habit that will pay off again and again across this entire ladder.
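The "sum of r geometric waits" picture can be checked by simulation. The sketch below is illustrative: r = 3, p = 0.25, the seed, and the sample size are arbitrary values picked here, and the helper names are hypothetical. It builds each negative binomial wait by chaining geometric waits and compares the sample mean and variance with r/p and r(1-p)/p^2.

```python
import random


def geometric_wait(p: float, rng: random.Random) -> int:
    """Trials until the first success in a stream of Bernoulli(p) trials."""
    trials = 1
    while rng.random() >= p:
        trials += 1
    return trials


def negative_binomial_wait(r: int, p: float, rng: random.Random) -> int:
    """Trials until the r-th success: r geometric waits back to back."""
    return sum(geometric_wait(p, rng) for _ in range(r))


r, p = 3, 0.25
rng = random.Random(1)  # fixed seed (arbitrary) for repeatability
waits = [negative_binomial_wait(r, p, rng) for _ in range(50_000)]

sample_mean = sum(waits) / len(waits)
sample_var = sum((w - sample_mean) ** 2 for w in waits) / (len(waits) - 1)

print(sample_mean, r / p)                # theory: 12
print(sample_var, r * (1 - p) / p ** 2)  # theory: 36
```

The simulation never uses the negative binomial pmf; it only reuses the geometric wait, which is the decomposition at work.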

A quick checklist for any waiting problem:

1. Confirm you are watching repeated independent trials with the same success probability p (the Bernoulli engine). If p drifts between trials, the geometric model does not apply.
2. Ask what you are waiting for. The first success means geometric (r = 1); the r-th success means negative binomial.
3. For an exact probability, use P(X = k) = (1-p)^(k-1) * p, or sum it to get tail probabilities like P(X > n) = (1-p)^n.
4. For a quick gut check, use E[X] = 1/p for the mean and remember the spread is large, so do not trust the mean as a tight prediction.
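The steps above can be walked through on a concrete case. The sketch below assumes a made-up sales scenario with p = 0.2 per call; the numbers and function names are hypothetical and chosen only for illustration. It computes an exact probability, gets a tail probability both by summing the pmf and from the closed form, and finishes with the mean and standard deviation.

```python
def check_p(p: float) -> None:
    if not 0 < p <= 1:
        raise ValueError("p must satisfy 0 < p <= 1")


def pmf(k: int, p: float) -> float:
    check_p(p)
    return (1 - p) ** (k - 1) * p


def tail_by_summing(n: int, p: float) -> float:
    """P(X > n) as 1 minus the probability of success within n trials."""
    return 1 - sum(pmf(k, p) for k in range(1, n + 1))


def tail_closed_form(n: int, p: float) -> float:
    """P(X > n) = (1 - p)^n: all of the first n trials fail."""
    check_p(p)
    return (1 - p) ** n


p = 0.2  # assumed chance that any single call ends in a "yes"
print(pmf(3, p))               # first "yes" exactly on call 3
print(tail_by_summing(5, p))   # no "yes" in the first 5 calls
print(tail_closed_form(5, p))  # same quantity, closed form
print(1 / p, ((1 - p) / p ** 2) ** 0.5)  # mean and standard deviation
```

The two tail computations agree up to rounding, and the standard deviation (about 4.5) sits close to the mean of 5, a reminder not to read the mean as a tight forecast.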

When the geometric model fits — and when it lies

The geometric distribution earns its keep whenever you genuinely have a stream of identical, independent yes/no trials and you care about the wait to the first yes. But its honesty depends on two assumptions that the real world loves to break, and recognizing the breakage is half the skill of choosing the right discrete model. First, the trials must be independent — if a failed sales call makes you sharper on the next one (or more discouraged), the trials influence each other and memorylessness fails. Second, p must stay constant — if the success probability changes over time, no single geometric model describes the whole stream.

A common real example where it nearly-but-not-quite fits: drawing cards from a deck until your first ace. Each draw is success-or-failure, but because you do not replace the cards, the probability of an ace shifts after every draw — that is sampling without replacement, which belongs to the hypergeometric family in guide 4, not the geometric. The geometric is the *with-replacement* waiting model. Keeping that distinction sharp is exactly the kind of judgment the last guide of this rung will train.
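To see how far the two models drift apart, the card example can be computed exactly. This is a sketch assuming a standard 52-card deck with 4 aces; the function names are introduced here for illustration. It uses Python's fractions module so that no rounding enters the comparison.

```python
from fractions import Fraction


def first_ace_without_replacement(k: int, aces: int = 4, deck: int = 52) -> Fraction:
    """P(first ace appears on draw k) when drawn cards are not returned."""
    if not 1 <= k <= deck:
        return Fraction(0)
    non_aces = deck - aces
    prob = Fraction(1)
    for i in range(k - 1):
        # Draw i + 1 must be a non-ace among the deck - i cards still left.
        prob *= Fraction(non_aces - i, deck - i)
    # Draw k must be an ace among the deck - (k - 1) cards still left.
    return prob * Fraction(aces, deck - (k - 1))


def first_ace_with_replacement(k: int, aces: int = 4, deck: int = 52) -> Fraction:
    """Geometric pmf with p = aces / deck (each card is returned and reshuffled)."""
    p = Fraction(aces, deck)
    return (1 - p) ** (k - 1) * p


for k in (1, 2, 20, 50):
    print(k,
          float(first_ace_without_replacement(k)),
          float(first_ace_with_replacement(k)))
```

The two models agree on the first draw and then separate. By draw 50 the without-replacement probability is exactly zero, since only 48 non-aces exist, while the geometric model still assigns that draw a small positive chance.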

Finally, a caution that echoes the whole rung's spirit: a clean formula does not certify a clean model. The geometric pmf will happily compute a number for any p you feed it, even when the underlying trials are not really independent or identically distributed. The formula is honest about the math but cannot check your assumptions for you. Before you trust P(X = k) = (1-p)^(k-1) * p, ask whether the trials in front of you truly are independent Bernoulli repeats. If they are, the geometric distribution is one of the most elegant and reliable tools you own.