A transform built for counting
The earlier guides in this rung gave you the moment generating function, M(t) = E[e^(tX)], a single function that quietly carries a whole distribution and turns sums into products. The mgf is wonderfully general, but it works hardest when X can be any real number. A huge family of random variables, though, is more modest: they only ever take the values 0, 1, 2, 3, .... Counts of heads, of customers, of defects, of goals. For these whole-number counters there is a transform that fits like a glove — the probability generating function, or pgf.
The definition is one tiny tweak of the mgf. Instead of E[e^(tX)], write G(s) = E[s^X], where s is just a number, usually thought of as somewhere in the range from -1 to 1. (They are even close relatives: setting s = e^t recovers the mgf, since s^X = e^(tX).) That swap looks cosmetic, but for a whole-number-valued variable it produces something delightfully concrete: a power series in s whose coefficients are exactly the probabilities. Write out E[s^X] over the possible values and you get G(s) = P(X = 0) + P(X = 1) s + P(X = 2) s^2 + P(X = 3) s^3 + ... — the whole pmf packed into one polynomial-like object.
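To make the definition concrete, here is a minimal sketch that evaluates G(s) directly from a finite pmf. The helper name pgf and the two-coin-flip example are illustrative choices, not something taken from the earlier guides.

```python
from fractions import Fraction

def pgf(pmf, s):
    """Evaluate G(s) = sum_k P(X = k) * s**k for a finite pmf given as a list."""
    return sum(p * s**k for k, p in enumerate(pmf))

# Hypothetical example: number of heads in two fair coin flips.
two_flips = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

g_at_zero = pgf(two_flips, 0)                # only the constant term survives
g_at_one = pgf(two_flips, 1)                 # all the probabilities added up
g_at_half = pgf(two_flips, Fraction(1, 2))   # an interior point of [-1, 1]
```

Exact fractions are used so that the coefficients stay readable as probabilities; floats would work just as well for a quick check.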
The coefficients ARE the probabilities
Here is the pgf's signature trick, the thing the mgf cannot do nearly as cleanly. Because G(s) = P(X=0) + P(X=1) s + P(X=2) s^2 + ..., every probability sits in plain sight as a coefficient. Set s = 0 and everything with an s in it vanishes, leaving G(0) = P(X = 0). Differentiate once and the constant term dies, the s-term becomes a constant, and at s = 0 you read off G'(0) = P(X = 1). Differentiate twice, evaluate at 0, and divide by 2 to get P(X = 2): the s^2 term has become 2 P(X = 2), which is where the 2 comes from. In general P(X = k) is the k-th derivative at 0 divided by k-factorial. This is why people say the function 'generates' the probabilities: turn the crank (differentiate and evaluate at 0) and out they come, one at a time.
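The crank can be turned mechanically. The sketch below stores a pgf as its list of coefficients, differentiates it k times, evaluates at 0, and divides by k-factorial. The function names and the three-coin-flip pgf are assumptions made for illustration. Since the coefficients are the probabilities to begin with, this is a demonstration of the rule rather than a shortcut.

```python
from fractions import Fraction
from math import factorial

def differentiate(coeffs):
    """Coefficients of the derivative of sum_k coeffs[k] * s**k."""
    return [k * c for k, c in enumerate(coeffs)][1:]

def prob_from_pgf(coeffs, k):
    """P(X = k) = G^(k)(0) / k!, with G stored as its coefficient list."""
    for _ in range(k):
        coeffs = differentiate(coeffs)
    value_at_zero = coeffs[0] if coeffs else 0   # constant term of G^(k)
    return Fraction(value_at_zero) / factorial(k)

# Hypothetical pgf: G(s) = 1/8 + 3/8 s + 3/8 s^2 + 1/8 s^3
# (number of heads in three fair coin flips).
G = [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]

recovered = [prob_from_pgf(G, k) for k in range(5)]
```

The last entry, for k = 4, comes out as zero because a cubic has no fourth-order term, matching the fact that three flips cannot give four heads.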
A second handle gives the moments. The pgf recovers moments too, but through what are called factorial moments rather than ordinary ones, and that is no accident — counting problems are naturally about products like X(X-1). The clean facts: G(1) = 1 always (it is just the total probability summing to one, a free sanity check), G'(1) = E[X] gives the mean, and G''(1) = E[X(X-1)]. From those two you rebuild the variance with Var(X) = G''(1) + G'(1) - (G'(1))^2, because E[X^2] = E[X(X-1)] + E[X] = G''(1) + G'(1). So the very same function hands you probabilities by differentiating at 0 and moments by differentiating at 1.
G(s) = E[s^X] = P(X=0) + P(X=1)s + P(X=2)s^2 + ...

probabilities (differentiate at s = 0):
    P(X=k) = G^(k)(0) / k!        e.g. G(0) = P(X=0)

moments (differentiate at s = 1):
    G(1)   = 1                    (total probability)
    G'(1)  = E[X]                 (the mean)
    G''(1) = E[X(X-1)]            (factorial moment)
    Var(X) = G''(1) + G'(1) - (G'(1))^2
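The moment rules can be checked on a small example. For a finite pmf, differentiating the series term by term and setting s = 1 gives G^(m)(1) = sum over k of k(k-1)...(k-m+1) P(X = k), which is what the sketch below computes. The helper name and the relabelled-die example are assumptions chosen for illustration.

```python
from fractions import Fraction

def pgf_derivative_at_one(pmf, order):
    """G^(order)(1) = sum_k k(k-1)...(k-order+1) * P(X = k) for a finite pmf."""
    total = Fraction(0)
    for k, p in enumerate(pmf):
        falling = 1
        for j in range(order):
            falling *= (k - j)      # builds the falling factorial k(k-1)...
        total += falling * p
    return total

# Hypothetical example: a fair six-sided die relabelled 0..5.
die = [Fraction(1, 6)] * 6

total_probability = pgf_derivative_at_one(die, 0)   # G(1)
mean = pgf_derivative_at_one(die, 1)                # G'(1) = E[X]
second_factorial = pgf_derivative_at_one(die, 2)    # G''(1) = E[X(X-1)]
variance = second_factorial + mean - mean**2
```

With this pmf the mean is 5/2 and the variance is 35/12, the familiar variance of a fair die (shifting the labels from 1..6 to 0..5 does not change it).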
A worked example: the Poisson, in one move
Take a Poisson count with rate lambda, the number of emails in an hour, say. Its pmf is P(X = k) = e^(-lambda) lambda^k / k-factorial. Plug into the definition and the sum collapses beautifully: G(s) = sum over k of s^k e^(-lambda) lambda^k / k-factorial = e^(-lambda) times the sum over k of (lambda s)^k / k-factorial. That last sum is the series for e^(lambda s), so G(s) = e^(-lambda) e^(lambda s) = e^(lambda(s-1)). That single compact formula, G(s) = e^(lambda(s-1)), is the whole Poisson distribution in your pocket.
- Check the sanity condition: G(1) = e^(lambda(1-1)) = e^0 = 1. Good — total probability is one.
- Mean: differentiate, G'(s) = lambda e^(lambda(s-1)), so E[X] = G'(1) = lambda. The average count is exactly the rate.
- Factorial moment: G''(s) = lambda^2 e^(lambda(s-1)), so G''(1) = lambda^2 = E[X(X-1)].
- Variance: Var(X) = G''(1) + G'(1) - (G'(1))^2 = lambda^2 + lambda - lambda^2 = lambda. Mean and variance both equal lambda — the famous Poisson signature, recovered without summing a single infinite series by hand.
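The Poisson results above can be sanity-checked numerically. The sketch below compares a truncated version of the defining series against the closed form e^(lambda(s-1)), then recomputes the mean and variance from the truncated pmf. The rate 2.5 and the 60-term cutoff are arbitrary assumptions; the cutoff only needs to be large enough that the neglected tail is negligible for the chosen rate.

```python
from math import exp, factorial, isclose

TERMS = 60   # arbitrary truncation point for the infinite sums

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson count with rate lam."""
    return exp(-lam) * lam**k / factorial(k)

def pgf_series(s, lam):
    """Truncated sum of s**k * P(X = k)."""
    return sum(s**k * poisson_pmf(k, lam) for k in range(TERMS))

def pgf_closed(s, lam):
    """Closed form G(s) = exp(lam * (s - 1))."""
    return exp(lam * (s - 1))

lam = 2.5   # hypothetical rate

series_matches = all(
    isclose(pgf_series(s, lam), pgf_closed(s, lam), rel_tol=1e-9)
    for s in (-1.0, 0.0, 0.3, 1.0)
)

# Mean and variance via the factorial-moment route, from the truncated pmf.
mean = sum(k * poisson_pmf(k, lam) for k in range(TERMS))
second_factorial = sum(k * (k - 1) * poisson_pmf(k, lam) for k in range(TERMS))
variance = second_factorial + mean - mean**2
```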
Sums of counts become products of pgfs
The pgf inherits the headline superpower of all these transforms: it turns adding into multiplying. If X and Y are independent, then for Z = X + Y the pgf factors, G_Z(s) = G_X(s) times G_Y(s), the exact same multiply-on-sums rule you met for the mgf of a sum. The reason is the same one-liner: s^(X+Y) = s^X times s^Y, and independence lets the expectation of a product split into a product of expectations. So instead of grinding the messy convolution of two count distributions, you multiply two tidy functions.
Watch it work in one line. Add two independent Poissons with rates lambda and mu: G_Z(s) = e^(lambda(s-1)) times e^(mu(s-1)) = e^((lambda+mu)(s-1)). That product is, on its face, the pgf of a Poisson with rate lambda + mu. Conclusion, with no calculus and no infinite sum: the total of two independent Poisson counts is itself Poisson, with the rates simply adding. The same move shows that a sum of n independent Bernoulli indicators, each with pgf (1 - p + p s), has pgf (1 - p + p s)^n. That is the pgf of the binomial, which is exactly the picture of the binomial as n coin flips added up.
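For counts with finitely many values the pgf is a polynomial, so multiplying pgfs is ordinary polynomial multiplication, and the coefficient arithmetic is the same as the convolution it replaces. The sketch below multiplies the Bernoulli pgf by itself n times and compares the resulting coefficients with the binomial pmf. The values of p and n and the helper name multiply are assumptions for illustration.

```python
from fractions import Fraction
from math import comb

def multiply(a, b):
    """Coefficients of the product of two polynomials given as coefficient lists."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y     # s^i * s^j contributes to s^(i+j)
    return out

p = Fraction(1, 3)          # hypothetical success probability
n = 4                       # hypothetical number of trials
bernoulli = [1 - p, p]      # pgf (1 - p) + p s as a coefficient list

total = [Fraction(1)]       # pgf of the constant 0, the empty sum
for _ in range(n):
    total = multiply(total, bernoulli)

# Binomial pmf from the textbook formula, for comparison.
binomial = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
```

With exact fractions the two lists agree entry for entry, which is the coefficient-level statement that (1 - p + p s)^n is the binomial pgf.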
What it pins down, and where it stops
The pgf does not just summarize a count distribution — it determines it. Two non-negative integer variables with the same pgf on an interval of s have the same probabilities, term by term, because matching power series must have matching coefficients. So when you multiply pgfs and recognize the product, that recognition is a genuine proof, not a lucky guess: the Poisson-plus-Poisson and Bernoulli-to-binomial arguments above are airtight. This uniqueness is the same principle behind the mgf uniqueness theorem, just stated for the count case where the coefficients are literally the probabilities.
Be honest about the boundaries, though. The pgf is a specialist: it only lives on non-negative integer counts, so it says nothing about heights, waiting times, or temperatures. And while G(s) converges safely for s between -1 and 1 (the probabilities sum to one, so the series behaves there), reading moments off requires the derivatives at s = 1 to be finite, which can fail for heavy-tailed counts with no finite mean. When you outgrow the count world (continuous variables, possibly negative values, distributions with no finite moments at all), you graduate to the characteristic function, the one transform that exists for every distribution without exception. It is the subject of the next guide.
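The warning about heavy-tailed counts can be made concrete with one standard example, chosen here as an assumption for illustration: P(X = k) = 6 / (pi^2 k^2) for k = 1, 2, 3, .... These probabilities sum to one because the sum of 1/k^2 is pi^2/6, but the terms of the mean, k P(X = k), shrink only like 1/k, so the partial sums behind G'(1) keep growing. A minimal sketch:

```python
from math import pi

def truncated_mean(n_terms):
    """Partial sum of k * P(X = k), where P(X = k) = 6 / (pi^2 k^2) for k >= 1."""
    return sum(k * 6 / (pi**2 * k**2) for k in range(1, n_terms + 1))

# Partial sums over 10, 100, ..., 100000 terms.
partial_means = [truncated_mean(10**e) for e in range(1, 6)]

# Growth from one decade to the next; it does not shrink toward zero.
increments = [b - a for a, b in zip(partial_means, partial_means[1:])]
```

Each tenfold increase in the cutoff adds a roughly constant amount to the partial sum, the signature of logarithmic divergence, so there is no finite G'(1) to read a mean from.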