Bernoulli and Binomial: Counting Successes

The atom: one yes-or-no trial

You arrive at this rung already knowing what a random variable is and how a probability mass function assigns weights to its possible values. Now we stop talking in the abstract and meet the named distributions that show up over and over in real problems. The very first one is the simplest object in all of discrete probability: a single experiment with exactly two outcomes. A coin lands heads or tails. A patient responds to a drug or does not. An email is spam or it is not. We call any such two-outcome experiment a [[bernoulli-trial|Bernoulli trial]].

To turn that into a number, label one outcome "success" (worth 1) and the other "failure" (worth 0). The choice of which is the success is yours and is just bookkeeping — success need not be a happy event; for a quality inspector a "success" might mean finding a defect. Let p be the probability of success, so 1 - p is the probability of failure. The resulting random variable X, which equals 1 with probability p and 0 with probability 1 - p, follows the [[prob-bernoulli-distribution|Bernoulli distribution]], written X ~ Bernoulli(p). Its entire pmf fits on one line.

X ~ Bernoulli(p)

  P(X = 1) = p          (success)
  P(X = 0) = 1 - p      (failure)

  E[X]   = p
  Var(X) = p(1 - p)

The Bernoulli distribution in full: a single trial with success probability p.

The mean and variance fall out instantly and are worth feeling in your bones. Since X only takes the values 0 and 1, E[X] = 0*(1 - p) + 1*p = p. For the variance use Var(X) = E[X^2] - (E[X])^2; but X is 0 or 1, so X^2 = X, hence E[X^2] = p and Var(X) = p - p^2 = p(1 - p). That p(1 - p) is largest at p = 1/2, where the outcome is most uncertain, and shrinks to 0 as p approaches 0 or 1, where the result is nearly a foregone conclusion. Uncertainty is highest exactly when the coin is fairest.
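If you would rather see those two lines of algebra checked by machine, here is a minimal sketch that computes the mean and variance straight from the pmf. The function name bernoulli_moments is ours, invented for illustration, and exact fractions are used so no rounding gets in the way.

```python
from fractions import Fraction


def bernoulli_moments(p):
    """Return (mean, variance) of Bernoulli(p), computed directly from the pmf."""
    pmf = {0: 1 - p, 1: p}                       # value -> probability
    mean = sum(x * w for x, w in pmf.items())    # E[X]
    second_moment = sum(x**2 * w for x, w in pmf.items())  # E[X^2]
    return mean, second_moment - mean**2         # Var(X) = E[X^2] - (E[X])^2


# The variance should peak at p = 1/2 and shrink toward the extremes.
for p in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
    mean, var = bernoulli_moments(p)
    print(f"p = {p}: E[X] = {mean}, Var(X) = {var}")
```

Trying a few more values of p is a quick way to convince yourself that no choice beats the variance of 1/4 reached at p = 1/2.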

Stacking trials: the binomial distribution

A single trial is rarely the question. We usually want: out of n trials, how many successes? Flip a coin 10 times — how many heads? Send 200 emails — how many bounce? To get a clean answer we make two honest assumptions. First, the n trials are independent: the result of one tells you nothing about the others. Second, they are identically distributed: every trial has the same success probability p. Under those two conditions, the count of successes X = X_1 + X_2 + ... + X_n, a sum of n independent Bernoulli(p) variables, follows the [[prob-binomial-distribution|binomial distribution]], X ~ Binomial(n, p).

Where does its formula come from? Suppose we want exactly k successes in n trials. One specific way this can happen — say success on trials 1 through k and failure afterward — has probability p^k * (1 - p)^(n - k), because the trials are independent so their probabilities multiply. But that is only one arrangement. Any other sequence with k successes and n - k failures has the very same probability, since multiplication does not care about order. So we must count how many such sequences exist: that is the number of ways to choose which k of the n positions are the successes, namely the binomial coefficient C(n, k), the "n choose k" you met counting combinations earlier.
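The counting argument can be checked by brute force for small n. The sketch below lists every sequence of n trials, keeps those with exactly k successes, and confirms two things: there are C(n, k) of them, and each has the same probability. The helper names and the sample values n = 5, k = 2, p = 1/3 are assumptions chosen only for illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb


def sequences_with_k_successes(n, k):
    """List every 0/1 sequence of length n that contains exactly k ones."""
    return [seq for seq in product((0, 1), repeat=n) if sum(seq) == k]


def sequence_probability(seq, p):
    """Multiply per-trial probabilities, which is valid for independent trials."""
    prob = 1
    for outcome in seq:
        prob *= p if outcome == 1 else 1 - p
    return prob


n, k, p = 5, 2, Fraction(1, 3)
seqs = sequences_with_k_successes(n, k)

print(len(seqs), comb(n, k))                       # both count the arrangements
print({sequence_probability(s, p) for s in seqs})  # a single shared probability
```

Enumeration grows as 2^n, so this is a check for small cases, not a way to compute binomial probabilities in practice.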

X ~ Binomial(n, p)

  P(X = k) = C(n, k) * p^k * (1 - p)^(n - k),   k = 0, 1, ..., n

  C(n, k) = n! / ( k! (n - k)! )

  E[X]   = n p
  Var(X) = n p (1 - p)

The binomial pmf: count the arrangements C(n, k), times the probability p^k (1 - p)^(n - k) of any one of them.

A tiny worked number anchors it. Flip a fair coin (p = 1/2) three times and ask for exactly 2 heads. Then P(X = 2) = C(3, 2) * (1/2)^2 * (1/2)^1 = 3 * (1/8) = 3/8. The three arrangements are HHT, HTH, THH — you can literally list them, and the formula just counted them for you. As a sanity check, summing P(X = k) over all k from 0 to n always gives exactly 1; that is the binomial theorem expanding (p + (1 - p))^n = 1^n = 1, which is a satisfying reason the pmf is built the way it is.
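The same worked number can be reproduced in a few lines. This is a minimal sketch, not a library routine: binomial_pmf is a name we made up, and it leans on math.comb from the Python standard library for C(n, k).

```python
from fractions import Fraction
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): arrangements times probability of one."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


p = Fraction(1, 2)
print(binomial_pmf(2, 3, p))                         # 3/8
print(sum(binomial_pmf(k, 3, p) for k in range(4)))  # 1
```

Passing a Fraction for p keeps the arithmetic exact; passing a float such as 0.5 works too, at the cost of ordinary rounding error.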

Mean and variance the easy way

You could grind out E[X] for the binomial by summing k * C(n, k) * p^k * (1 - p)^(n - k) over all k. Please do not. There is a far more beautiful route that also teaches a habit you will use everywhere: build the binomial out of its Bernoulli atoms and use linearity of expectation. Write X = X_1 + ... + X_n where each X_i is the indicator of success on trial i — it equals 1 if that trial succeeded and 0 otherwise. This is the [[indicator-variable-trick|indicator-variable trick]], and it is one of the most powerful moves in all of probability.

Write the count as a sum of indicators: X = X_1 + X_2 + ... + X_n, with each X_i ~ Bernoulli(p).
Apply linearity: E[X] = E[X_1] + ... + E[X_n]. Linearity needs NO independence — it is always true — so this step is free.
Each E[X_i] = p, so E[X] = n p. Done — no sums, no factorials.
For variance you now DO use independence: variance of a sum of independent variables is the sum of variances, so Var(X) = Var(X_1) + ... + Var(X_n) = n * p(1 - p).
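If you are curious whether the shortcut really agrees with the long sum, you can let the computer do the grinding. The sketch below computes the mean and variance by brute force from the pmf and compares them with n p and n p (1 - p). The function names and the sample values n = 10, p = 3/10 are assumptions for illustration.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def moments_by_brute_force(n, p):
    """Mean and variance from the definition, summing over every k."""
    mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
    second_moment = sum(k**2 * binomial_pmf(k, n, p) for k in range(n + 1))
    return mean, second_moment - mean**2


n, p = 10, Fraction(3, 10)
mean, var = moments_by_brute_force(n, p)

print(mean, n * p)            # long way vs. indicator shortcut
print(var, n * p * (1 - p))   # long way vs. sum of Bernoulli variances
```

Exact fractions make the comparison an equality rather than an approximate match.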

The shape, and what it tells you

Plot P(X = k) against k and a clear picture emerges. The shape of the binomial is a single hump: probability rises to a peak near k = n p and falls away on both sides. When p = 1/2 the hump is symmetric around n/2; when p is small the hump leans toward the left (few successes are most likely), and when p is large it leans right. The most likely value, the mode, sits at or next to n p. So if you flip a fair coin 100 times, the count clusters around 50 with a standard deviation of sqrt(100 * 1/2 * 1/2) = 5, and getting, say, 70 heads is possible but lives far out in the thin tail, four standard deviations above the mean.
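A short numerical look at the hump, as a sketch: it finds the mode by scanning the pmf and compares the peak with the tail for 100 fair flips. The names binomial_pmf and binomial_mode are ours, and floats are used here, so the printed values carry ordinary rounding.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def binomial_mode(n, p):
    """Most likely count, found by scanning k = 0..n (first one wins on ties)."""
    return max(range(n + 1), key=lambda k: binomial_pmf(k, n, p))


# The peak tracks n p as p moves from small to large.
for p in (0.1, 0.5, 0.9):
    print(f"p = {p}: mode = {binomial_mode(100, p)}, n p = {100 * p}")

# Peak versus tail for 100 fair coin flips.
print("P(X = 50) =", binomial_pmf(50, 100, 0.5))
print("P(X = 70) =", binomial_pmf(70, 100, 0.5))
```

Feeding the list of pmf values to any plotting tool you like will show the lean to the left or right directly.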

Two limits are worth previewing because they connect this guide to the rest of the rung. When n is large and p moderate, the bumpy binomial smooths into the familiar bell curve — a foretaste of the central limit theorem, since X is a sum of many independent pieces. When n is large but p is tiny so that n p stays moderate, the binomial slides instead toward the Poisson distribution of guide 3, the law of rare events. The same counting model thus has two famous offspring depending on how you push it to the limit.
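To preview the second limit numerically, the sketch below sets a binomial with large n and tiny p next to a Poisson with the same mean. The Poisson pmf used here, exp(-lam) * lam^k / k!, is borrowed from outside this guide, and the values n = 1000 and p = 0.003 are assumptions picked so that n p = 3.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def poisson_pmf(k, lam):
    """P(Y = k) for Y ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)


n, p = 1000, 0.003   # many trials, each a rare event
lam = n * p          # match the means: lam = n p

print(" k  binomial  poisson")
for k in range(8):
    b = binomial_pmf(k, n, p)
    q = poisson_pmf(k, lam)
    print(f"{k:2d}  {b:.5f}   {q:.5f}")
```

Rerunning with a larger p at the same n, for example p = 0.3, shows the two columns drifting apart, which is the point of the "p is tiny" condition.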

Honest fine print and traps

The binomial is only valid when its two assumptions actually hold, and skipping that check is the number-one way people misuse it. Independence is the fragile one. If you draw 5 cards from a deck and count the aces, the draws are NOT independent: once you have drawn an ace, only 3 aces remain among 51 cards, so the chance that the next card is an ace drops from 4/52 to 3/51. That is drawing without replacement, and the correct model is the hypergeometric distribution of guide 4, not the binomial. The binomial assumes you sample with replacement; it is also a good approximation, though not an exact model, when the pool is so large that removing a few items barely moves the odds.
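The card example can be put in numbers. The sketch below compares the exact without-replacement probabilities with what you would get by wrongly treating the 5 draws as Binomial(5, 4/52). The hypergeometric formula is brought in ahead of guide 4 purely for comparison, and both function names are ours.

```python
from fractions import Fraction
from math import comb


def hypergeometric_pmf(k, population, successes, draws):
    """P(exactly k successes) when drawing without replacement."""
    ways = comb(successes, k) * comb(population - successes, draws - k)
    return Fraction(ways, comb(population, draws))


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), i.e. drawing with replacement."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


p_ace = Fraction(4, 52)
print(" k  without replacement  binomial (wrong model)")
for k in range(6):
    exact = hypergeometric_pmf(k, population=52, successes=4, draws=5)
    naive = binomial_pmf(k, 5, p_ace)
    print(f"{k:2d}  {float(exact):.6f}             {float(naive):.6f}")
```

The clearest symptom is the last row: five aces cannot come out of a deck holding four, yet the binomial still assigns that outcome a small positive probability.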

One more clean fact that ties trials together: the binomial is closed under addition. If X ~ Binomial(n, p) and Y ~ Binomial(m, p) are independent and share the SAME p, then X + Y ~ Binomial(n + m, p). This is obvious once you think in atoms — you are just pooling n + m independent Bernoulli(p) trials and counting all the successes together. It is a baby case of the sum of independent variables and of convolution, ideas a later rung develops fully. But it fails the moment the two groups have different success probabilities; then there is no single p, and the sum is not binomial at all.
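Both halves of that claim can be checked with a small convolution, sketched below under made-up sample values. The pmf of a sum of two independent counts is the convolution of their pmfs; with a shared p it matches Binomial(n + m, p) exactly, and with different p it does not match the only binomial on 7 trials that has the right mean.

```python
from fractions import Fraction
from math import comb


def binomial_pmf_list(n, p):
    """pmf of Binomial(n, p) as a list indexed by k = 0..n."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]


def convolve(a, b):
    """pmf of X + Y for independent X, Y with pmfs a and b."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


# Same p: pooling 3 + 4 trials gives Binomial(7, p).
p = Fraction(1, 3)
same_p = convolve(binomial_pmf_list(3, p), binomial_pmf_list(4, p))
print(same_p == binomial_pmf_list(7, p))

# Different p: the sum has mean 3/5 + 12/5 = 3, so a Binomial(7, q) with that
# mean would need q = 3/7. The pmfs still disagree.
mixed = convolve(binomial_pmf_list(3, Fraction(1, 5)),
                 binomial_pmf_list(4, Fraction(3, 5)))
print(mixed == binomial_pmf_list(7, Fraction(3, 7)))
```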