The Moment Generating Function

One function that remembers everything

By now you can compute an expectation, a variance, and in principle any of the moments E[X], E[X^2], E[X^3], and so on. Each moment is a separate integral or sum, and each tells you one fact about the shape of a random variable X — where it sits, how spread out it is, how lopsided. A natural dream is to bottle all of them at once. The moment generating function, or mgf, is exactly that bottle: a single function of a dummy variable t that, once you have it, hands you every moment on demand.

The definition is short. The mgf of X is M(t) = E[e^(tX)] — the expected value of e raised to the power t times X — viewed as a function of t near 0. That is it. You take the ordinary exponential, plug in tX, and average. For a discrete X you sum e^(tx) times P(X = x) over the values; for a continuous X you integrate e^(tx) times the density. The output is not a probability and not a single moment; it is a whole curve in t whose *behaviour near t = 0* secretly stores the entire list of moments.
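To make the definition concrete, here is a minimal sketch of the discrete case, assuming the distribution is given as a finite table of values and probabilities. The helper name `mgf_discrete` and the fair-die example are illustrative choices, not part of the guide; a continuous X would replace the sum with a numerical integral against the density.

```python
import math

def mgf_discrete(pmf, t):
    """M(t) = sum over x of e^(t*x) * P(X = x).

    `pmf` is a hypothetical representation: a dict mapping each value x
    to its probability P(X = x).
    """
    return sum(math.exp(t * x) * p for x, p in pmf.items())

# Illustrative distribution: a fair six-sided die.
die = {x: 1 / 6 for x in range(1, 7)}

# At t = 0 every weight e^(0*x) equals 1, so M(0) is just the total probability.
m_at_zero = mgf_discrete(die, 0.0)

# Away from 0 the mgf is an exponentially weighted average of the values.
m_at_half = mgf_discrete(die, 0.5)

print(m_at_zero, m_at_half)
```

Whatever the distribution, M(0) = 1, which is a handy sanity check on any mgf you compute or derive.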

How it generates the moments

Here is the mechanism that earns the mgf its name, and it is the headline result of this guide; the next guide will put it through its paces. Expand the exponential inside the expectation: M(t) = E[1 + tX + (t^2/2!) X^2 + (t^3/3!) X^3 + ...]. By linearity of expectation the average of a sum is the sum of averages, so M(t) = 1 + t E[X] + (t^2/2!) E[X^2] + (t^3/3!) E[X^3] + .... (Swapping an expectation with an infinite sum needs justification; the swap is valid whenever M(t) is finite on an open interval around 0, which is the setting assumed here.) Read off the structure: the mgf is a power series in t whose coefficients are the moments, scaled by factorials. The moments were never lost; up to those factorials, they are simply the Taylor coefficients of M at 0.

That gives a clean recipe for pulling any moment back out, captured by the slogan that the mgf generates moments: differentiate M(t) k times and set t = 0. Each derivative strips off one factor of t and slides the series down one notch. After k derivatives the terms below t^k have vanished, the term (t^k/k!) E[X^k] has become the constant E[X^k] (differentiating t^k k times yields k!, which cancels the factorial), and every later term still carries a factor of t, so setting t = 0 kills it. So the k-th moment is the k-th derivative of M at zero. The mgf does not just *store* the moments; it lets you *recover* each one with a derivative instead of a fresh integral.

M(t) = E[e^(tX)] = 1 + t E[X] + (t^2/2!) E[X^2] + (t^3/3!) E[X^3] + ...

  M'(0)   = E[X]          (first moment / the mean)
  M''(0)  = E[X^2]        (second moment)
  M^(k)(0)= E[X^k]        (k-th moment)

  Var(X)  = M''(0) - (M'(0))^2 = E[X^2] - (E[X])^2

Differentiate the mgf k times, set t = 0, and out drops the k-th moment; the mean and variance follow immediately.
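The recipe can be checked numerically. The sketch below approximates M'(0) and M''(0) with central finite differences rather than exact calculus, so the results are approximate; the helper names, the step size `h`, and the fair-die example are all illustrative assumptions.

```python
import math

def mgf(pmf, t):
    """M(t) = E[e^(tX)] for a finite pmf given as {value: probability}."""
    return sum(math.exp(t * x) * p for x, p in pmf.items())

def moments_from_mgf(pmf, h=1e-3):
    """Approximate M'(0) and M''(0) by central differences with step h.

    This is a numerical stand-in for differentiating by hand; the error
    shrinks roughly like h^2 until floating-point rounding takes over.
    """
    m_minus, m_zero, m_plus = mgf(pmf, -h), mgf(pmf, 0.0), mgf(pmf, h)
    first = (m_plus - m_minus) / (2 * h)                 # approximates E[X]
    second = (m_plus - 2 * m_zero + m_minus) / (h * h)   # approximates E[X^2]
    return first, second

# Illustrative distribution: a fair six-sided die.
die = {x: 1 / 6 for x in range(1, 7)}

mean, second_moment = moments_from_mgf(die)
variance = second_moment - mean ** 2   # Var(X) = M''(0) - (M'(0))^2

# Direct sums, for comparison with the derivative-based values.
exact_mean = sum(x * p for x, p in die.items())
exact_second = sum(x * x * p for x, p in die.items())

print(mean, exact_mean)
print(second_moment, exact_second)
print(variance, exact_second - exact_mean ** 2)
```

For the die the exact values are E[X] = 3.5, E[X^2] = 91/6 and Var(X) = 35/12, and the finite-difference estimates should land close to them.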

A tiny worked example

Let X be a single coin flip — a Bernoulli variable that is 1 with probability p and 0 with probability 1 - p. Its mgf is a two-term sum: M(t) = e^(t*1) p + e^(t*0) (1 - p) = p e^t + (1 - p). No calculus yet — just plug each value into e^(tx) and weight by its probability. Already this little formula contains the mean and the variance, waiting to be differentiated out.

Write the mgf: M(t) = p e^t + (1 - p).
First derivative: M'(t) = p e^t, so M'(0) = p e^0 = p. Hence E[X] = p — the flip averages p, exactly as expected.
Second derivative: M''(t) = p e^t too, so M''(0) = p. Hence E[X^2] = p. (No surprise: since X is only 0 or 1, X^2 = X, so they share a mean.)
Variance: Var(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1 - p). The whole computation came from one formula and two derivatives.
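The four steps above can be replayed in a few lines. This sketch hard-codes the derivatives found by hand and compares them with direct sums over the two outcomes; the value p = 0.3 and the function names are illustrative assumptions.

```python
import math

p = 0.3  # illustrative success probability; any value in [0, 1] works

def M(t):
    """Bernoulli mgf from the worked example: p e^t + (1 - p)."""
    return p * math.exp(t) + (1 - p)

def dM(t):
    """First derivative, computed by hand: p e^t."""
    return p * math.exp(t)

def d2M(t):
    """Second derivative, also p e^t."""
    return p * math.exp(t)

# Moments read off at t = 0.
mean_from_mgf = dM(0.0)
second_from_mgf = d2M(0.0)
var_from_mgf = second_from_mgf - mean_from_mgf ** 2

# The same quantities as direct sums over the outcomes 0 and 1.
pmf = {0: 1 - p, 1: p}
mean_direct = sum(x * q for x, q in pmf.items())
second_direct = sum(x * x * q for x, q in pmf.items())
var_direct = second_direct - mean_direct ** 2

print(mean_from_mgf, mean_direct)
print(var_from_mgf, var_direct, p * (1 - p))
```

Both routes give mean p and variance p(1 - p); the mgf route simply replaces two sums with two derivatives.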

Notice what just happened. We never set up a separate sum for the mean and another for E[X^2]; we wrote one function and differentiated. For a richer distribution — a Poisson, an exponential, a normal — the payoff is far bigger, because those direct moment integrals can be painful while the mgf is a tidy closed form you differentiate as many times as you like. That labour-saving is real, but it is not even the deepest reason the mgf matters, as the next sections show.

Two superpowers: sums and uniqueness

The mgf's first superpower is how gracefully it handles sums of independent variables — the engine behind much of probability. If X and Y are independent and S = X + Y, then M_S(t) = E[e^(t(X+Y))] = E[e^(tX) e^(tY)], and because independence lets the expectation of a product split into a product of expectations, this is M_X(t) * M_Y(t). In words: the mgf of a sum of independent variables is the product of their mgfs. A messy convolution of densities has become a simple multiplication — the mgf-of-a-sum rule that the next guide leans on to add up distributions almost effortlessly.
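A small numerical check of the product rule, under the assumption that both variables are discrete with finite tables of probabilities. The sketch builds the distribution of X + Y the slow way, by convolution, and compares its mgf with the product of the individual mgfs at a few values of t. The names `mgf` and `convolve` and the die-plus-coin example are illustrative.

```python
import math
from collections import defaultdict

def mgf(pmf, t):
    """M(t) = E[e^(tX)] for a finite pmf given as {value: probability}."""
    return sum(math.exp(t * x) * p for x, p in pmf.items())

def convolve(pmf_x, pmf_y):
    """pmf of X + Y, assuming X and Y are independent."""
    out = defaultdict(float)
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            out[x + y] += px * py   # independence: joint probability is the product
    return dict(out)

# Illustrative independent pair: a fair die and a fair coin (0 or 1).
die = {k: 1 / 6 for k in range(1, 7)}
coin = {0: 0.5, 1: 0.5}
total = convolve(die, coin)

# Compare M_S(t) with M_X(t) * M_Y(t) at a handful of t values.
gaps = [
    abs(mgf(total, t) - mgf(die, t) * mgf(coin, t))
    for t in (-1.0, -0.3, 0.0, 0.4, 1.0)
]
print(max(gaps))
```

The gaps should be at the level of floating-point rounding. The double loop in `convolve` is the work that multiplying mgfs lets you avoid.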

The second superpower is the uniqueness theorem, the subject of guide 5: if two random variables have the same mgf on an open interval around 0, they have the same distribution. This is what makes the multiplication trick truly useful — once you recognize the product M_X(t) * M_Y(t) as the mgf of, say, a known family, you may conclude S belongs to that family, with no further work. It is the reason an mgf can stand in for the whole distribution: it is, as an earlier rung put it, the complete description in transform clothing.
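As an illustration of recognizing a product of mgfs, take n independent Bernoulli(p) flips. The product rule gives (p e^t + 1 - p)^n, and the sketch below compares that with the mgf computed directly from the Binomial(n, p) probabilities. Agreement at a few values of t is only a sanity check; it is the uniqueness theorem that licenses the conclusion that the sum is binomial. The values of n and p and the function names are illustrative assumptions.

```python
import math

def bernoulli_mgf(p, t):
    """mgf of one Bernoulli(p) flip: p e^t + (1 - p)."""
    return p * math.exp(t) + (1 - p)

def binomial_mgf_from_pmf(n, p, t):
    """mgf of Binomial(n, p), summed directly from its probabilities."""
    return sum(
        math.comb(n, k) * p ** k * (1 - p) ** (n - k) * math.exp(t * k)
        for k in range(n + 1)
    )

n, p = 5, 0.3  # illustrative parameters

# Product of n identical Bernoulli mgfs versus the binomial mgf.
gaps = [
    abs(bernoulli_mgf(p, t) ** n - binomial_mgf_from_pmf(n, p, t))
    for t in (-0.5, 0.0, 0.25, 1.0)
]
print(max(gaps))
```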

The catch — and the fix

Now the honest caveat, because the mgf has a real weakness. The defining average E[e^(tX)] can be infinite. At t = 0 it always equals 1, but for heavy-tailed distributions the e^(tX) weight grows so fast on the tail that the sum or integral diverges for t arbitrarily close to 0 (for the Cauchy, at every t except 0). Then M is not finite on any open interval around 0, and the mgf simply does not exist as a usable function. The Cauchy distribution is the classic offender: its tails are so fat that even E[X] fails to exist, whereas an mgf finite near 0 would force every moment to be finite, so there is no mgf at all. An mgf that does not exist cannot generate moments, multiply over sums, or pin down a distribution; the superpowers vanish with it.
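The divergence can be seen numerically, at least suggestively. The sketch below integrates e^(tx) against the standard Cauchy density 1/(pi (1 + x^2)) over [-A, A] with a simple midpoint rule and lets the cutoff A grow. Growing truncated integrals are evidence of divergence, not a proof; the function name, cutoffs, and step size are illustrative assumptions.

```python
import math

def truncated_cauchy_mgf(t, cutoff, step=0.01):
    """Midpoint-rule estimate of the integral of e^(t x) / (pi (1 + x^2))
    over [-cutoff, cutoff]. Only the truncated integral is finite for t != 0."""
    n = int(round(2 * cutoff / step))
    width = 2 * cutoff / n
    total = 0.0
    for i in range(n):
        x = -cutoff + (i + 0.5) * width
        total += math.exp(t * x) / (math.pi * (1 + x * x))
    return total * width

cutoffs = (50, 100, 200)

# At t = 0 the integrand is just the density, so the values approach 1 from below.
at_zero = [truncated_cauchy_mgf(0.0, a) for a in cutoffs]

# At a small positive t the values keep growing as the cutoff widens.
at_small_t = [truncated_cauchy_mgf(0.1, a) for a in cutoffs]

print(at_zero)
print(at_small_t)
```

With t = 0 the estimates settle just under 1, while with t = 0.1 they keep climbing by orders of magnitude: the exponential weight eventually overwhelms the 1/x^2 decay of the tail.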

This is where the rest of the rung earns its keep, by offering two repairs. For variables that take only the counts 0, 1, 2, ..., the probability generating function (pgf) of guide 3, defined as E[s^X], is tailor-made and always finite for s in [0, 1]. And for *any* random variable whatsoever, the characteristic function of guide 4, defined as E[e^(itX)] with an imaginary i in the exponent, is always finite, because e^(itX) lies on the unit circle, has modulus 1, and never blows up. The characteristic function is the mgf's bulletproof cousin: it multiplies over sums and pins down the distribution in every case, and it generates whichever moments actually exist, with no existence worries of its own.
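To see the repair at work on the Cauchy distribution, the sketch below estimates its characteristic function by a truncated midpoint-rule integral and compares the result with the known closed form e^(-|t|) for the standard Cauchy. The cutoff, step size, and function name are illustrative assumptions, and the estimate is approximate.

```python
import cmath
import math

def cauchy_cf_estimate(t, cutoff=500.0, step=0.01):
    """Midpoint-rule estimate of E[e^(i t X)] for a standard Cauchy X,
    truncating the integral to [-cutoff, cutoff]."""
    n = int(round(2 * cutoff / step))
    width = 2 * cutoff / n
    total = 0j
    for i in range(n):
        x = -cutoff + (i + 0.5) * width
        density = 1.0 / (math.pi * (1 + x * x))
        # e^(i t x) has modulus 1, so each term is no larger than the density itself.
        total += cmath.exp(1j * t * x) * density
    return total * width

ts = (0.5, 1.0, 2.0)
estimates = [cauchy_cf_estimate(t) for t in ts]

for t, phi in zip(ts, estimates):
    print(t, phi.real, math.exp(-abs(t)))
```

Unlike the mgf integral, this one converges without trouble: the oscillating factor never grows, so the heavy tails contribute very little.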

So hold the mgf in its proper place. When it exists — and for the everyday families like the binomial, Poisson, exponential, and normal it does, on a neighbourhood of 0 — it is the friendliest tool there is, because it lives in plain real-valued calculus you already know. The pgf is its specialist for counts, and the characteristic function is its always-available backstop. Across the next four guides you will see the same three jobs — generate moments, turn sums into products, and uniquely identify a distribution — done by whichever transform is right for the variable in front of you.