From mean and variance to a whole family
By now you can summarize a random variable with two numbers: the expectation E[X], which marks where the distribution balances, and the variance Var(X), which measures how far it typically spreads from that balance point. Those two are genuinely useful, but think about what they cannot see. Two distributions can share the very same mean and the very same variance and still look completely different — one symmetric, one lopsided; one with thin tails, one prone to wild outliers. Center and spread are only the first two facts about a shape. To capture the rest, we need more numbers of the same kind.
The unifying idea is the moment. The k-th moment of X is simply the expected value of the k-th power, E[X^k]. The first moment, E[X^1], is just the mean. Higher moments E[X^2], E[X^3], E[X^4], and so on, each weigh the values of X more and more heavily the farther they are from zero, because raising to a larger power amplifies large numbers dramatically. Each moment is computed exactly the way you already know — through the law of the unconscious statistician, summing or integrating x^k against the distribution. So moments are not a new tool; they are the same expectation machine, fed the functions x, x^2, x^3, and onward.
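As a small illustration of "the same expectation machine, fed x, x^2, x^3", the sketch below computes the first four raw moments of a fair six-sided die. The die and the helper name raw_moment are illustrative choices, not something fixed by the text above.

```python
from fractions import Fraction


def raw_moment(pmf, k):
    """Return E[X^k] for a discrete distribution given as {value: probability}."""
    # Law of the unconscious statistician: sum x^k * P(X = x) directly,
    # without ever working out the distribution of X^k itself.
    return sum(Fraction(x) ** k * p for x, p in pmf.items())


# A fair six-sided die (hypothetical example distribution).
die = {x: Fraction(1, 6) for x in range(1, 7)}

moments = {k: raw_moment(die, k) for k in range(1, 5)}
for k, m in moments.items():
    print(f"E[X^{k}] = {m} (about {float(m):.3f})")
```

Exact fractions are used so that no rounding sneaks in; notice how quickly the values grow with k, which is the amplification of large outcomes described above.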
Skewness: which way does the distribution lean?
The third central moment, E[(X - mu)^3], answers a question the mean and variance cannot: is the distribution symmetric, or does it lean to one side? The cube is the key. When X is above the mean, (X - mu)^3 is positive; when below, it is negative; and the cube punishes distance harshly. So if a distribution has a long thin tail stretching to the right, the rare large positive deviations get cubed into big positive contributions that outweigh the many small negative ones, and the third central moment comes out positive. A long left tail makes it negative. For a perfectly symmetric distribution it is exactly zero whenever it exists, because positive and negative cubes cancel.
To make this comparable across distributions of different scales, we divide by the cube of the standard deviation, giving the dimensionless skewness: skewness = E[(X - mu)^3] / sigma^3. Dividing by sigma^3 strips out the units, so a salary distribution in dollars and a height distribution in centimetres can be compared on the same scale. Positive skew means a tail to the right (think incomes: most people clustered low, a few enormous earners pulling the tail out). Negative skew means a tail to the left. Every symmetric distribution has zero skew, the normal distribution being the famous example; the converse can fail, since an asymmetric distribution can also have zero skew when its two sides happen to balance in the cube.
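To see skewness separate two distributions that mean and variance cannot tell apart, here is a minimal sketch. Both toy distributions are invented for illustration and were chosen so that each has mean 0 and variance 1; the helper names are assumptions as well.

```python
from fractions import Fraction
from math import sqrt


def mean(pmf):
    """E[X] for a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())


def central_moment(pmf, k):
    """E[(X - mu)^k] for a discrete distribution."""
    mu = mean(pmf)
    return sum((x - mu) ** k * p for x, p in pmf.items())


def skewness(pmf):
    """Third central moment divided by sigma^3 (dimensionless)."""
    sigma = sqrt(central_moment(pmf, 2))
    return float(central_moment(pmf, 3)) / sigma ** 3


# Symmetric: -1 or +1 with equal probability.
symmetric = {Fraction(-1): Fraction(1, 2), Fraction(1): Fraction(1, 2)}
# Lopsided: usually a little below zero, occasionally far above it.
lopsided = {Fraction(-1, 2): Fraction(4, 5), Fraction(2): Fraction(1, 5)}

for name, pmf in [("symmetric", symmetric), ("lopsided", lopsided)]:
    print(name, "mean:", mean(pmf), "variance:", central_moment(pmf, 2),
          "skewness:", skewness(pmf))
```

The first two moments agree exactly, yet the skewness is 0 for the symmetric case and positive for the lopsided one, whose rare large value sits on the right.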
Kurtosis: how heavy are the tails?
The fourth central moment, E[(X - mu)^4], scaled to a dimensionless number by dividing by sigma^4, gives kurtosis. Because the power is even, sign no longer matters — left and right deviations both contribute positively — so kurtosis is blind to lean and instead measures something else: how much of the distribution's behavior comes from rare, extreme deviations. The fourth power weights far-out values so heavily that kurtosis is essentially a tail-weight detector. High kurtosis means heavy tails: most of the time things are calm, but extreme events are more common than a normal curve would suggest.
The normal distribution is the natural yardstick: its kurtosis is exactly 3. Because that baseline is so handy, people often quote excess kurtosis, which subtracts 3, so that a normal has excess kurtosis 0. A distribution with positive excess kurtosis has tails heavier than a normal's — it is more prone to outliers. Financial returns are a classic real example: they look roughly bell-shaped most days but produce crashes far more often than a normal would, so they have positive excess kurtosis. This is exactly the kind of risk a quick glance at mean and variance hides.
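The same kind of check works for kurtosis. The sketch below compares two symmetric toy distributions, both invented for illustration with mean 0 and variance 1, that differ only in how much of their variance comes from rare, far-out values.

```python
from fractions import Fraction


def central_moment(pmf, k):
    """E[(X - mu)^k] for a discrete distribution given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** k * p for x, p in pmf.items())


def kurtosis(pmf):
    """Fourth central moment divided by sigma^4 (sigma^4 is the variance squared)."""
    return central_moment(pmf, 4) / central_moment(pmf, 2) ** 2


def excess_kurtosis(pmf):
    """Kurtosis measured relative to the normal baseline of 3."""
    return kurtosis(pmf) - 3


# All of the mass sits exactly one standard deviation from the mean.
two_point = {-1: Fraction(1, 2), 1: Fraction(1, 2)}
# Mostly at the mean, with occasional jumps two standard deviations out.
rare_jumps = {-2: Fraction(1, 8), 0: Fraction(3, 4), 2: Fraction(1, 8)}

for name, pmf in [("two_point", two_point), ("rare_jumps", rare_jumps)]:
    print(name, "variance:", central_moment(pmf, 2),
          "kurtosis:", kurtosis(pmf), "excess:", excess_kurtosis(pmf))
```

With identical mean, variance, and skewness, the two-point distribution has excess kurtosis -2 while the one with rare jumps has excess kurtosis +1, so only the fourth moment reveals the difference.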
Here is the catch that ties this rung together. Every moment is an expectation of a power of X, and that expectation exists only if the relevant integral or sum converges. For heavy-tailed distributions some moments are simply infinite or undefined. The Cauchy distribution is the cautionary tale: it has no mean at all (so no variance, no skewness, no kurtosis), because its tails are so fat that even E[X] fails to converge. The Pareto distribution with a shape parameter (tail exponent) between 1 and 2 has a finite mean but infinite variance. So moments are powerful, but they are not guaranteed to exist: a distribution is not obliged to hand you the numbers you want.
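The Pareto case can be made concrete. The sketch below assumes the standard Pareto (Type I) density alpha * x^(-alpha - 1) on x >= 1, with shape alpha = 1.5 picked purely for illustration. It uses the closed form of the integral of x^k against that density, truncated at an upper limit b, to show one moment settling down and the next one growing without bound.

```python
import math


def pareto_raw_moment(alpha, k):
    """E[X^k] for a Pareto (Type I) with shape alpha and minimum value 1."""
    # The integral of x^k * alpha * x^(-alpha - 1) over [1, inf)
    # converges only when k < alpha.
    if k >= alpha:
        return math.inf
    return alpha / (alpha - k)


def truncated_moment(alpha, k, b):
    """Integral of x^k * alpha * x^(-alpha - 1) over [1, b], for k != alpha."""
    if k == alpha:
        raise ValueError("closed form below assumes k != alpha")
    return alpha * (b ** (k - alpha) - 1.0) / (k - alpha)


alpha = 1.5  # hypothetical shape parameter, between 1 and 2
for b in (1e2, 1e4, 1e6):
    print(f"b = {b:.0e}:",
          "first moment so far =", round(truncated_moment(alpha, 1, b), 4),
          "| second moment so far =", round(truncated_moment(alpha, 2, b), 1))

print("E[X]   =", pareto_raw_moment(alpha, 1))
print("E[X^2] =", pareto_raw_moment(alpha, 2))
```

As b grows, the truncated first moment approaches 3 while the truncated second moment keeps increasing, which is what "finite mean but infinite variance" means in practice.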
The moment generating function: one machine, all the moments
Computing each moment separately by integrating x^k is tedious. The moment generating function, or mgf, packages all moments of X into a single function of a helper variable t. It is defined as the expectation of e^(tX): M(t) = E[e^(tX)]. At first this looks like an odd thing to compute, but the magic is in the exponential's Taylor series. Since e^(tX) = 1 + tX + (tX)^2/2! + (tX)^3/3! + ..., taking expectations term by term gives M(t) = 1 + t E[X] + t^2 E[X^2]/2! + t^3 E[X^3]/3! + ... — every moment is sitting inside, tagged by its own power of t.
That structure is why it is called a *generating* function: the moments are encoded as the coefficients of its Taylor expansion. To pull out the k-th moment, you differentiate M(t) k times and set t = 0. Each derivative knocks down one power of t and evaluating at zero kills every other term, leaving exactly E[X^k]. So M'(0) = E[X], M''(0) = E[X^2], and onward. One function, differentiated repeatedly, hands you the entire ladder of moments — no separate integral for each.
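A rough numerical check of "differentiate and set t = 0" is sketched below. It assumes an exponential distribution with rate 2 (an illustrative choice), whose mgf is rate / (rate - t) for t < rate, and it approximates the derivatives at zero with central finite differences instead of differentiating symbolically.

```python
RATE = 2.0  # hypothetical rate parameter of an exponential distribution


def mgf_exponential(t, rate=RATE):
    """M(t) = rate / (rate - t); finite only for t < rate."""
    if t >= rate:
        raise ValueError("the mgf is infinite for t >= rate")
    return rate / (rate - t)


def first_derivative_at_zero(f, h=1e-3):
    """Central-difference approximation of f'(0)."""
    return (f(h) - f(-h)) / (2 * h)


def second_derivative_at_zero(f, h=1e-3):
    """Central-difference approximation of f''(0)."""
    return (f(h) - 2 * f(0.0) + f(-h)) / h ** 2


mean = first_derivative_at_zero(mgf_exponential)            # approximates E[X]
second_moment = second_derivative_at_zero(mgf_exponential)  # approximates E[X^2]
variance = second_moment - mean ** 2

print("mean     ~", round(mean, 6))
print("E[X^2]   ~", round(second_moment, 6))
print("variance ~", round(variance, 6))
```

For an exponential with rate 2 the exact values are E[X] = 1/2, E[X^2] = 1/2, and Var(X) = 1/4, so the finite-difference estimates should land very close to those numbers.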
M(t) = E[e^(tX)]    (definition)
     = 1 + t E[X] + (t^2/2!) E[X^2] + ...    (Taylor series)
M'(0) = E[X]    (1st moment = mean)
M''(0) = E[X^2]    (2nd raw moment)
Var(X) = M''(0) - (M'(0))^2 = E[X^2] - (E[X])^2
The mgf earns its keep beyond bookkeeping. It is a near-fingerprint of a distribution: by the uniqueness theorem, if two random variables have the same mgf on an interval around zero, they have the same distribution. That turns hard probability questions into algebra: for example, the mgf of a sum of independent variables is just the product of their mgfs, which is how you can prove a sum of independent normals is again normal, almost without effort.
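The product rule for independent sums can be checked directly on a small case. The sketch below uses two independent fair dice (an illustrative choice): it builds the distribution of their sum by brute-force enumeration, then compares the mgf of that sum with the square of a single die's mgf at a few arbitrary values of t.

```python
import math
from itertools import product


def mgf(pmf, t):
    """M(t) = E[e^(tX)] for a discrete distribution given as {value: probability}."""
    return sum(p * math.exp(t * x) for x, p in pmf.items())


die = {x: 1 / 6 for x in range(1, 7)}

# Distribution of the sum of two independent dice: independence means
# each pair of outcomes has probability px * py.
two_dice = {}
for (x, px), (y, py) in product(die.items(), repeat=2):
    two_dice[x + y] = two_dice.get(x + y, 0.0) + px * py

for t in (-0.5, 0.1, 0.3):
    mgf_of_sum = mgf(two_dice, t)
    product_of_mgfs = mgf(die, t) ** 2
    print(f"t = {t:+.1f}: {mgf_of_sum:.6f} vs {product_of_mgfs:.6f}")
```

The two columns agree up to floating-point rounding. The enumeration over pairs is the step that gets expensive for longer sums, whereas multiplying mgfs stays cheap.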
When the mgf fails, and the limits of moments
Be honest about a real limitation: the mgf does not always exist. It is usable only when E[e^(tX)] is finite for all t in some open interval around zero, and for a heavy-tailed distribution that expectation can be infinite for every t > 0. The lognormal distribution is a famous example: all of its moments are finite, yet E[e^(tX)] is infinite for every t > 0, so it has no mgf on any interval around zero. It is also a case where the moments alone do not identify the distribution, because other distributions share exactly the same moment sequence. The Cauchy, with no moments to speak of, also has no mgf. When the mgf is missing, the machine simply will not start.
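One way to see the lognormal failure is to look at the series that would have to add up to the mgf. The sketch below assumes a standard lognormal (mu = 0, sigma = 1, both illustrative defaults), whose moments are E[X^k] = exp(k*mu + k^2 * sigma^2 / 2), and prints the logarithm of the k-th series term t^k E[X^k] / k! for a small positive t.

```python
import math


def log_series_term(k, t, mu=0.0, sigma=1.0):
    """log of t^k * E[X^k] / k! for a lognormal(mu, sigma), with t > 0."""
    log_moment = k * mu + 0.5 * (k * sigma) ** 2  # log E[X^k], finite for every k
    return k * math.log(t) + log_moment - math.lgamma(k + 1)


t = 0.01  # even a tiny positive t runs into trouble
for k in (1, 10, 20, 50, 100):
    print(f"k = {k:3d}: log(term) = {log_series_term(k, t):10.1f}")
```

Every individual moment is finite, but the k^2 / 2 growth in log E[X^k] eventually overwhelms both k * log(t) and log(k!), so the terms blow up instead of shrinking. A series whose terms grow like this cannot converge for any t > 0, which is consistent with the mgf being infinite there.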
Probabilists fix this with a sturdier cousin, the characteristic function, E[e^(itX)], where i is the imaginary unit. The crucial difference is that |e^(itX)| = 1 always, so the expectation is always finite no matter how heavy the tails. That is why the characteristic function exists for every distribution — Cauchy and lognormal included — while the mgf is a sometimes-available convenience. The characteristic function carries the same fingerprint and sum-into-product magic; it is simply the version that never breaks. You will meet it properly when we use transforms to prove the central limit theorem.
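A quick simulation shows the characteristic function behaving well even for the Cauchy. The sketch below is a Monte Carlo estimate under stated assumptions: standard Cauchy samples are drawn by the inverse-CDF formula tan(pi * (U - 1/2)), the seed and sample size are arbitrary, and the estimate is compared with the known characteristic function of the standard Cauchy, exp(-|t|).

```python
import cmath
import math
import random


def cauchy_sample(rng):
    """Draw one standard Cauchy value by inverting its CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def empirical_cf(samples, t):
    """Average of e^(itX) over the samples: an estimate of E[e^(itX)]."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [cauchy_sample(rng) for _ in range(100_000)]

t = 1.0
estimate = empirical_cf(samples, t)
exact = math.exp(-abs(t))

# Each term has modulus 1 no matter how extreme the sample is.
largest_modulus = max(abs(cmath.exp(1j * t * x)) for x in samples)

print("estimate of E[e^(itX)]:", estimate)
print("exact value exp(-|t|): ", exact)
print("largest |e^(itX)| seen:", largest_modulus)
```

The sample mean of a Cauchy never settles down, but the average of e^(itX) does, because every term is pinned to the unit circle regardless of how wild the underlying sample is.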
Step back and see what this rung has given you. Expectation set the center; variance set the spread; skewness and kurtosis read the lean and the tails; and the generating functions bundle the whole family into one object you can differentiate or multiply. But hold one more caution. A finite list of moments rarely pins a distribution down completely, and the mean in particular can mislead whenever the data are skewed or heavy-tailed — which is exactly when the mode or the median may describe a 'typical' value far better. Moments are a powerful language for shape, but they are a description, not the whole truth — and knowing when they fall short is itself part of mastering them.
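The closing caution about the mean is easy to see on a tiny data set. The numbers below are made up purely for illustration (think of incomes in thousands, with one very large earner); they are not real data.

```python
import statistics

# Invented toy data: eight similar values and one extreme one.
incomes = [30, 32, 35, 38, 40, 42, 45, 50, 1000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
below_mean = sum(1 for x in incomes if x < mean_income)

print("mean:  ", round(mean_income, 2))
print("median:", median_income)
print(f"{below_mean} of {len(incomes)} values lie below the mean")
```

The single extreme value drags the mean far above what almost every observation looks like, while the median stays in the middle of the cluster.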