Generating Moments and Summing Independent Variables

Two jobs, one function

In the previous guide you met the moment generating function, defined as M_X(t) = E[e^(tX)] — the expected value of e raised to t times your random variable, viewed as a function of the dial t. On its own that might look like a strange thing to bother computing. This guide is where it pays off, because the same M_X(t) quietly does two completely different jobs that we usually have to grind through separately. First, it generates moments: every E[X], E[X^2], E[X^3] is hiding inside it, recoverable by differentiation. Second, it turns the messy operation of adding independent variables into plain multiplication. Learn to use those two facts and a lot of probability suddenly becomes bookkeeping.

Why should one function carry both gifts? The secret is the exponential e^(tX). Differentiating it in t pulls down a factor of X each time, which is what surfaces the moments; and the exponential of a sum factors into a product, e^(t(X+Y)) = e^(tX) e^(tY), which is what turns sums into products. Everything below is just these two algebraic facts about the exponential, dressed up in expectation. Keep that in mind and the formulas will feel inevitable rather than memorized.

Generating moments: differentiate at zero

Here is the first superpower, the one that names the whole thing. Expand the exponential as its power series: e^(tX) = 1 + tX + (t^2 / 2!) X^2 + (t^3 / 3!) X^3 + ... . Take the expectation term by term (linearity handles each term, and swapping the expectation with the infinite sum is legitimate whenever the mgf is finite on an open interval around t = 0) and you get M_X(t) = 1 + t E[X] + (t^2 / 2!) E[X^2] + (t^3 / 3!) E[X^3] + ... . Look at what that says: M_X(t) is a power series in t whose coefficients are the moments divided by factorials. The mgf is a tidy package that holds every moment of X at once, and the reason it generates the moments is now visible: they are, up to those factorials, its Taylor coefficients.

In practice you do not expand the series; you differentiate and set t = 0. Differentiating the series once and plugging in t = 0 kills every term except the one that was linear in t, leaving M_X'(0) = E[X]. Differentiate twice and set t = 0 and you get M_X''(0) = E[X^2]. In general, the n-th derivative of M_X evaluated at 0 equals the n-th moment, E[X^n]. The point t = 0 is special because every term that still carries a power of t vanishes there, leaving exactly the moment you want.
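Written out in full, the differentiate-at-zero rule looks like this. It is a sketch that assumes the mgf is finite on an open interval around t = 0, so that the series may be differentiated term by term:

```latex
\begin{align*}
M_X(t)       &= \sum_{k=0}^{\infty} \frac{t^{k}}{k!}\, E[X^{k}], \\
M_X^{(n)}(t) &= \sum_{k=n}^{\infty} \frac{t^{k-n}}{(k-n)!}\, E[X^{k}], \\
M_X^{(n)}(0) &= E[X^{n}].
\end{align*}
```

In the second line the terms with k < n have been differentiated away; in the third, every term with k > n still carries a positive power of t and vanishes at zero, so only the k = n term survives.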

Take the exponential mgf: for X ~ Exponential(rate lambda), M_X(t) = lambda / (lambda - t), valid for t < lambda.
First derivative: M_X'(t) = lambda / (lambda - t)^2. Set t = 0: M_X'(0) = lambda / lambda^2 = 1 / lambda. So E[X] = 1 / lambda.
Second derivative: M_X''(t) = 2 lambda / (lambda - t)^3. Set t = 0: M_X''(0) = 2 / lambda^2. So E[X^2] = 2 / lambda^2.
Assemble the variance with the computational formula: Var(X) = E[X^2] - (E[X])^2 = 2 / lambda^2 - 1 / lambda^2 = 1 / lambda^2. No integrals fought, just two derivatives.

Compare that to finding E[X] and E[X^2] for the exponential the honest way — two integration-by-parts battles. The mgf replaced both with routine differentiation. That is the day-to-day value of the first superpower: once you know M_X(t) in closed form, every moment is a derivative away, and the variance falls out of E[X^2] - (E[X])^2 with no extra effort.
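If you would like to see the exponential example confirmed numerically, here is a minimal sketch. It approximates the two derivatives at zero with central finite differences rather than computing them symbolically; the rate lam = 2.0, the step size h, and the helper names are all illustrative choices, not anything fixed by the theory.

```python
def mgf_exponential(t, lam):
    """Closed-form mgf of Exponential(rate lam); only valid for t < lam."""
    if t >= lam:
        raise ValueError("mgf is infinite for t >= lam")
    return lam / (lam - t)

def first_derivative_at_zero(f, h=1e-4):
    # Central difference approximation to f'(0).
    return (f(h) - f(-h)) / (2 * h)

def second_derivative_at_zero(f, h=1e-4):
    # Central difference approximation to f''(0).
    return (f(h) - 2 * f(0.0) + f(-h)) / (h * h)

lam = 2.0  # illustrative rate, chosen arbitrarily

def M(t):
    return mgf_exponential(t, lam)

mean = first_derivative_at_zero(M)             # approximates E[X]   = 1 / lam
second_moment = second_derivative_at_zero(M)   # approximates E[X^2] = 2 / lam^2
variance = second_moment - mean ** 2           # approximates Var(X) = 1 / lam^2

print(mean, second_moment, variance)
```

With lam = 2 the three printed numbers should sit very close to 1/2, 1/2 and 1/4, matching the formulas from the worked example up to finite-difference error.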

Summing independent variables: convolution becomes multiplication

Now the second superpower, and it is the real reason transforms are indispensable. Suppose you add two independent variables, S = X + Y. You learned in the previous rung that the density of a sum of independent variables is a convolution — a sticky integral that smears one distribution across the other. Convolutions are painful: add three or ten variables and the integral nests miserably. The mgf cuts straight through it. Because S = X + Y, we have M_S(t) = E[e^(t(X+Y))] = E[e^(tX) e^(tY)], and when X and Y are independent the expectation of a product splits into a product of expectations, giving M_S(t) = E[e^(tX)] E[e^(tY)] = M_X(t) M_Y(t).

Adding independent variables, two views:

  densities:   f_S  =  f_X  *  f_Y        (convolution -- hard)
  mgfs:        M_S(t) = M_X(t) M_Y(t)      (ordinary product -- easy)

  n iid copies:  M_{X1+...+Xn}(t) = [ M_X(t) ]^n

The same addition of independent variables is a convolution of densities but a plain product of mgfs.
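Here is a small sketch of the two views side by side, using a made-up example that is not part of the derivation above: two independent fair dice. It builds the pmf of the sum the hard way (a discrete convolution), computes its mgf, and compares that with the product of the individual mgfs. The helper names are hypothetical.

```python
import math

def mgf_from_pmf(pmf, t):
    """E[e^(tX)] for a discrete variable given as {value: probability}."""
    return sum(p * math.exp(t * x) for x, p in pmf.items())

def convolve(pmf_x, pmf_y):
    """pmf of X + Y for independent discrete X and Y."""
    out = {}
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            out[x + y] = out.get(x + y, 0.0) + px * py
    return out

die = {face: 1 / 6 for face in range(1, 7)}  # a fair six-sided die
total = convolve(die, die)                   # pmf of the sum of two dice

for t in (-0.5, 0.0, 0.3, 1.0):
    lhs = mgf_from_pmf(total, t)     # mgf of the sum, via the convolution
    rhs = mgf_from_pmf(die, t) ** 2  # product of the two individual mgfs
    print(t, lhs, rhs)
```

The two columns should agree to floating-point precision at every t, even though only the left one ever touched a convolution.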

That single rule, that the mgf of a sum of independent variables is the product of their mgfs, is the workhorse. For n independent and identically distributed copies it becomes M_X(t) raised to the n-th power: adding n things is raising one mgf to the n-th power. Try it on the Poisson distribution. Let X ~ Poisson(lambda), whose mgf is e^(lambda(e^t - 1)), and add an independent Y ~ Poisson(mu): multiply the mgfs to get e^(lambda(e^t - 1)) times e^(mu(e^t - 1)) = e^((lambda + mu)(e^t - 1)). That is exactly the mgf of a Poisson(lambda + mu). So the sum of independent Poissons is Poisson, with the rates simply added, a fact that is a slog by convolution but a one-line multiplication here. The Poisson is "closed under addition," and the mgf shows it in a single step.
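To check the Poisson claim against the slog it replaces, the sketch below does the convolution numerically and compares it with the Poisson(lambda + mu) pmf. The rates 1.5 and 2.5 and the cutoff at k = 29 are arbitrary illustrative choices.

```python
import math

def poisson_pmf(k, rate):
    """P(N = k) for N ~ Poisson(rate)."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def pmf_of_sum(k, lam, mu):
    # Discrete convolution: P(X + Y = k) = sum over j of P(X = j) P(Y = k - j)
    return sum(poisson_pmf(j, lam) * poisson_pmf(k - j, mu) for j in range(k + 1))

lam, mu = 1.5, 2.5  # illustrative rates

# Largest discrepancy between the convolution and the Poisson(lam + mu) pmf
max_gap = max(
    abs(pmf_of_sum(k, lam, mu) - poisson_pmf(k, lam + mu))
    for k in range(30)
)
print(max_gap)
```

The printed gap should be at the level of floating-point rounding, which is what you expect if the convolution and the Poisson(lambda + mu) pmf are the same function.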

Why the answer is trustworthy: the uniqueness theorem

In the Poisson example we multiplied two mgfs, recognized the product as a Poisson(lambda + mu) mgf, and declared the sum to be that Poisson. That last step needs justifying: how do we know two different distributions cannot share the same mgf, leaving us guessing? The guarantee is the mgf uniqueness theorem: if two random variables have the same mgf and that mgf is finite on an open interval around t = 0, then they have the same distribution, full stop. The mgf is a fingerprint. So matching a computed mgf to a known one really does pin down the distribution — the recognition step is legitimate, not hand-waving.

Read the fine print, though, because it matters. The theorem demands the mgf be finite near zero. That is exactly where the mgf has a weakness: for some distributions E[e^(tX)] is infinite for every t other than zero, so the mgf does not exist in any useful sense. Heavy-tailed distributions like the Cauchy are the classic offenders — their tails are so fat that e^(tX) has infinite expectation, and the mgf machinery simply has nothing to grip. When the mgf is absent, you cannot use it to generate moments (indeed the Cauchy has no mean to generate) and you cannot use it to add variables.
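You can watch the Cauchy failure happen numerically. The sketch below integrates e^(tx) against the standard Cauchy density over a truncated range [-bound, bound] and lets the bound grow. It is an illustration, not a proof, and the choices t = 0.1, the list of bounds, and the midpoint rule are all assumptions made for the demonstration.

```python
import math

def truncated_cauchy_mgf(t, bound, steps=200_000):
    """Midpoint-rule integral of e^(t x) * f(x) over [-bound, bound],
    where f(x) = 1 / (pi (1 + x^2)) is the standard Cauchy density."""
    width = 2 * bound / steps
    total = 0.0
    for i in range(steps):
        x = -bound + (i + 0.5) * width
        total += math.exp(t * x) / (math.pi * (1 + x * x))
    return total * width

t = 0.1  # any nonzero t behaves the same way
values = [truncated_cauchy_mgf(t, bound) for bound in (50, 100, 200, 400)]
print(values)
```

For a distribution with a genuine mgf these truncated integrals would settle down as the bound grows. Here they keep climbing, because the exponential growth of e^(tx) overwhelms the 1/x^2 decay of the density.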

Putting both powers to work

The two superpowers are at their best together. Want the mean and variance of a sum of independent variables? Multiply the mgfs, then differentiate the product at zero, and you have the moments of the whole sum without ever touching a convolution. There is an even cleaner shortcut that makes this explicit. Take logarithms: the cumulant generating function is K_X(t) = log M_X(t). Because the mgf of a sum of independent variables multiplies, its logarithm adds: K_S(t) = K_X(t) + K_Y(t). And the first two derivatives of K at zero hand you the mean and the variance directly: K_X'(0) = E[X] and K_X''(0) = Var(X). So for independent variables, means add and variances add, which you can now see is just "the log of a product is a sum of logs." (Means add even without independence, by linearity of expectation; it is the variance rule that genuinely needs it.)
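As a quick check of the cumulant shortcut, the sketch below adds the cumulant generating functions of an Exponential(rate 2) and an independent Poisson(3), then reads the mean and variance of the sum off the first two derivatives at zero. The pairing of distributions, the parameter values, and the finite-difference helpers are illustrative assumptions.

```python
import math

def cgf_exponential(t, lam):
    # log of the Exponential mgf lam / (lam - t), valid for t < lam
    return math.log(lam / (lam - t))

def cgf_poisson(t, rate):
    # log of the Poisson mgf e^(rate (e^t - 1))
    return rate * (math.exp(t) - 1)

def d1_at_zero(f, h=1e-4):
    # Central difference approximation to f'(0).
    return (f(h) - f(-h)) / (2 * h)

def d2_at_zero(f, h=1e-4):
    # Central difference approximation to f''(0).
    return (f(h) - 2 * f(0.0) + f(-h)) / (h * h)

lam, rate = 2.0, 3.0  # illustrative parameters

def K_sum(t):
    # Independence: the cgf of the sum is the sum of the cgfs.
    return cgf_exponential(t, lam) + cgf_poisson(t, rate)

mean_of_sum = d1_at_zero(K_sum)      # approximates 1/lam   + rate = 3.5
variance_of_sum = d2_at_zero(K_sum)  # approximates 1/lam^2 + rate = 3.25

print(mean_of_sum, variance_of_sum)
```

Notice that no product of mgfs was ever formed: adding the two logarithms and differentiating is enough.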

A flagship example ties it all together: the normal distribution. Its mgf is M(t) = e^(mu t + sigma^2 t^2 / 2). Add two independent normals, X ~ Normal(mu1, sigma1^2) and Y ~ Normal(mu2, sigma2^2): multiply the mgfs, which means add the exponents, giving e^((mu1 + mu2) t + (sigma1^2 + sigma2^2) t^2 / 2). That is the mgf of a Normal(mu1 + mu2, sigma1^2 + sigma2^2). By the uniqueness theorem the sum is that normal — the famous fact that the sum of independent normals is normal, with means and variances adding. Three lines of algebra replace a daunting convolution of two bell curves.
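A simulation makes the normal result tangible. The sketch below draws independent normals, adds them, and compares the sample mean, the sample variance, and an empirical mgf value at one point with what Normal(mu1 + mu2, sigma1^2 + sigma2^2) predicts. The parameters, sample size, seed, and the test point t = 0.2 are arbitrary choices; agreement here is consistent with the result but does not by itself prove normality.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

mu1, sigma1 = 1.0, 2.0    # illustrative parameters for X
mu2, sigma2 = -3.0, 1.5   # illustrative parameters for Y

n = 100_000
sums = [random.gauss(mu1, sigma1) + random.gauss(mu2, sigma2) for _ in range(n)]

sample_mean = statistics.fmean(sums)
sample_var = sum((s - sample_mean) ** 2 for s in sums) / n

# Empirical mgf of the sum at one point, versus the Normal mgf formula
t = 0.2
empirical_mgf = statistics.fmean([math.exp(t * s) for s in sums])
theoretical_mgf = math.exp((mu1 + mu2) * t + (sigma1 ** 2 + sigma2 ** 2) * t ** 2 / 2)

print(sample_mean, mu1 + mu2)                  # theory: -2.0
print(sample_var, sigma1 ** 2 + sigma2 ** 2)   # theory: 6.25
print(empirical_mgf, theoretical_mgf)
```

Each printed pair should agree up to sampling noise, which shrinks as n grows.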

One honest caveat before you run wild. Sums of variables multiply mgfs; you never add mgfs. There is no rule "M_{X+Y} = M_X + M_Y", and there could not be: every mgf equals 1 at t = 0, while a sum of two mgfs equals 2 there. And the whole sum-as-product shortcut requires independence, because the step that split E[e^(tX) e^(tY)] into a product of expectations is not valid in general for dependent variables. With those two guardrails in mind, the mgf (and, where it fails to exist, the characteristic function waiting in the next guide) turns moment-hunting and variable-summing from grinding integration into clean algebra.