The question: what is the distribution of X + Y?
You have met two ways to push a random variable through a function: the cdf method, where you write down P(g(X) <= z) and differentiate, and the change-of-variables formula with its Jacobian. Both handled a function of ONE variable, or a clean one-to-one map. Now we attack the single most common compound quantity in all of probability: a sum, Z = X + Y, built from two pieces. How many people walk through two doors combined? How long do two queued jobs take together? What is the total of two dice, two measurement errors, two days of rainfall? The answer 'X + Y' is itself random, and we want its whole distribution, not just its average.
One warning before we start: knowing the average of the sum is easy; knowing its whole shape is the hard part. Linearity of expectation gives E[X + Y] = E[X] + E[Y] for FREE, no independence needed. And if X and Y are independent, the variances add too: Var(X + Y) = Var(X) + Var(Y). But the mean and variance are just two numbers; they do not tell you whether the sum is bell-shaped, skewed, or spiky. To get the full density of Z we need a genuinely new tool, and that tool is convolution.
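To make the warning concrete, here is a small exact check with fair dice. It is only a sketch: the helper name mean_var and the "Y is a copy of X" example are illustrative choices, not part of the text above. It compares an independent pair with a perfectly dependent pair.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)  # probability of each face of a fair die


def mean_var(outcomes):
    """Exact mean and variance of a list of (value, probability) pairs."""
    mean = sum(v * w for v, w in outcomes)
    var = sum((v - mean) ** 2 * w for v, w in outcomes)
    return mean, var


# One fair die on its own.
mean_x, var_x = mean_var([(x, p) for x in faces])

# Independent case: Y is a second die, so every pair (x, y) has weight 1/36.
mean_ind, var_ind = mean_var([(x + y, p * p) for x, y in product(faces, faces)])

# Dependent case (an illustrative extreme): Y is a copy of X, so X + Y = 2X.
mean_dep, var_dep = mean_var([(2 * x, p) for x in faces])

print(mean_x, var_x)      # 7/2 35/12
print(mean_ind, var_ind)  # 7 35/6
print(mean_dep, var_dep)  # 7 35/3
```

Both sums have mean 7, as linearity promises, but only the independent pair has variance equal to Var(X) + Var(Y).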
The discrete picture: count every way to make the total
Start where the idea is impossible to misread: dice. Roll two fair dice and ask for the total Z = X + Y. There is exactly one way to get 2 (a 1 and a 1) but five ways to get 6 (1+5, 2+4, 3+3, 4+2, 5+1). That little count — sum over all the splits z = x + y — is the whole idea. Because the dice are independent, the chance of each split (x, then y = z - x) factors: P(X = x and Y = z - x) = P(X = x) times P(Y = z - x). The independence is what lets the joint probability split into a product; without it we would need the full joint pmf instead.
Add up over every split that reaches z and you have the pmf of the sum, the discrete convolution: P(Z = z) = sum over x of P(X = x) times P(Y = z - x). Read it as a slogan: to land the total on z, x can be anything, as long as the partner y picks up the slack z - x. As x slides up, z - x slides down in lockstep, sweeping out every pairing that hits the target. For the dice, P(Z = 6) = sum of P(X = x) P(Y = 6 - x) over x = 1..5, which is 5 times (1/6)(1/6) = 5/36 — exactly the five ways we counted. This is the convolution of distributions in its plainest form.
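The counting argument above can be sketched in a few lines. The function name convolve_pmf and the dictionary representation of a pmf are assumptions made for this illustration. The loop runs over pairs (x, y) and files each product under the total x + y, which is the same bookkeeping as summing over x with partner z - x.

```python
from fractions import Fraction


def convolve_pmf(pmf_x, pmf_y):
    """pmf of X + Y for independent X and Y, each given as {value: probability}."""
    pmf_z = {}
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            # Independence lets the joint probability factor as px * py.
            pmf_z[x + y] = pmf_z.get(x + y, 0) + px * py
    return pmf_z


die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = convolve_pmf(die, die)

print(two_dice[2])             # 1/36
print(two_dice[6])             # 5/36
print(two_dice[7])             # 1/6
print(sum(two_dice.values()))  # 1
```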
The continuous picture: slide, flip, and overlap
For continuous variables the sum becomes an integral instead of a sum, but the spirit is identical. If X and Y are independent with densities f and g, the density of Z = X + Y is the convolution integral: h(z) = integral over all x of f(x) g(z - x) dx. Same slogan as the dice: x ranges over everything, the partner soaks up z - x, and you accumulate all the ways to reach the total — except now 'count' becomes 'integrate density'. You can derive this directly from the cdf method: P(Z <= z) is the probability mass over the region x + y <= z in the plane, and differentiating that double integral in z drops you onto exactly this single integral.
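The cdf-method derivation mentioned above can be written out as follows. This is a sketch under two assumptions: G denotes the cdf of Y (a name introduced here), and the densities are well behaved enough that the derivative may be taken inside the integral.

```latex
\begin{aligned}
P(Z \le z)
  &= \iint_{x + y \le z} f(x)\, g(y)\, dy\, dx
   = \int_{-\infty}^{\infty} f(x) \left[ \int_{-\infty}^{z - x} g(y)\, dy \right] dx
   = \int_{-\infty}^{\infty} f(x)\, G(z - x)\, dx, \\[4pt]
h(z)
  &= \frac{d}{dz} P(Z \le z)
   = \int_{-\infty}^{\infty} f(x)\, g(z - x)\, dx .
\end{aligned}
```

The joint density factors as f(x) g(y) in the first line only because X and Y are independent.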
The name 'slide and flip' earns itself in the formula. Look at g(z - x) as a function of x: the minus sign FLIPS g left-to-right, and the z SLIDES the flipped copy along. At each position z you multiply the two curves point-by-point and measure the overlapping area — that area is the new density's height at z. So convolution physically blurs one shape with another: where both put a lot of mass, the overlap is large and the sum is likely; where they barely meet, the sum is rare. This is why summing tends to SMOOTH and SPREAD: two boxy uniforms convolve into a triangle, and the triangle is gentler than either box.
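The box-into-triangle claim can be checked numerically. The sketch below approximates the convolution integral with a plain midpoint rule; the function names, the integration window [-1, 3], and the step count are all illustrative choices rather than anything prescribed above.

```python
def uniform_density(x):
    """Density of Uniform(0, 1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def convolve_at(f, g, z, lo=-1.0, hi=3.0, steps=4000):
    """Midpoint-rule approximation of the integral of f(x) * g(z - x) over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        # g(z - x) is g flipped in x and slid along by z.
        total += f(x) * g(z - x)
    return total * dx


def triangle_density(z):
    """Closed form for the density of the sum of two independent Uniform(0, 1)."""
    return max(0.0, 1.0 - abs(z - 1.0))


for z in (0.25, 0.5, 1.0, 1.5, 2.5):
    approx = convolve_at(uniform_density, uniform_density, z)
    print(z, round(approx, 3), triangle_density(z))
```

The two printed columns should agree to within the grid resolution: the overlap of the two boxes grows linearly until z = 1 and then shrinks linearly back to zero.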
discrete: P(Z=z) = sum_x P(X=x) * P(Y = z - x)
continuous: h(z) = integral f(x) * g(z - x) dx
slide = the shift by z; flip = the minus sign in g(z - x)
meaning: 'x can be anything; the partner must equal z - x'
works ONLY when X, Y are independent
example (two dice): P(Z=6) = 5 * (1/6)(1/6) = 5/36
When the family is closed: sums that keep their shape
Convolution usually changes the shape, but for a few special families the shape is preserved, and these are the workhorses of probability. Add two independent normals and you get another normal, with the means and variances simply adding: if X ~ Normal(mu1, sigma1^2) and Y ~ Normal(mu2, sigma2^2) are independent, then X + Y ~ Normal(mu1 + mu2, sigma1^2 + sigma2^2). Add two independent Poissons with rates lambda1 and lambda2 and you get a Poisson with rate lambda1 + lambda2: total arrivals from two independent sources are themselves Poisson. These families are called 'closed under convolution', and that closure is precisely why they show up so relentlessly when we add things up.
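The Poisson closure can be verified directly from the discrete convolution sum. In this sketch the rates 1.5 and 2.5 and the function names are arbitrary illustrative choices. Since a Poisson variable is never negative, x only needs to run from 0 to z.

```python
import math


def poisson_pmf(k, lam):
    """P(N = k) for N ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_sum_pmf(z, lam1, lam2):
    """P(X + Y = z) via the convolution sum over x = 0..z."""
    return sum(poisson_pmf(x, lam1) * poisson_pmf(z - x, lam2) for x in range(z + 1))


lam1, lam2 = 1.5, 2.5  # illustrative rates

# Largest disagreement between the convolved pmf and Poisson(lam1 + lam2).
max_gap = max(
    abs(poisson_sum_pmf(z, lam1, lam2) - poisson_pmf(z, lam1 + lam2))
    for z in range(30)
)
print(max_gap)  # should be at the level of floating-point rounding
```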
There is a beautiful one to feel directly: independent exponentials. A single exponential wait has a sharp peak at zero, decaying away. Convolve two of the same rate and the peak moves OFF zero into a hump — that hump is a gamma (an Erlang). Stacking n waiting times gives the n-stage gamma, which is exactly why the time until the n-th event in a Poisson process is gamma-shaped. You can SEE convolution at work: the more independent waits you add, the more the sharp exponential corner rounds into a smooth, increasingly symmetric mound. That rounding-toward-symmetry is a preview of the deepest fact in the subject.
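Here is a hedged numerical check of the two-exponential case. The function names and the rate 2.0 are assumptions for illustration, and the reference formula rate^2 * z * exp(-rate * z) is the standard two-stage Erlang density. Because both densities vanish for negative arguments, the integral only needs to run over 0 <= x <= z.

```python
import math


def exp_density(x, rate):
    """Density of an exponential wait with the given rate."""
    return rate * math.exp(-rate * x) if x >= 0.0 else 0.0


def sum_of_two_exp_density(z, rate, steps=2000):
    """Midpoint-rule approximation of the integral of f(x) * f(z - x) over [0, z]."""
    if z <= 0.0:
        return 0.0
    dx = z / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        total += exp_density(x, rate) * exp_density(z - x, rate)
    return total * dx


def erlang2_density(z, rate):
    """Two-stage Erlang (gamma with shape 2) density."""
    return rate ** 2 * z * math.exp(-rate * z) if z >= 0.0 else 0.0


rate = 2.0  # illustrative rate
for z in (0.1, 0.5, 1.0, 2.0):
    print(z, sum_of_two_exp_density(z, rate), erlang2_density(z, rate))
```

The integrand f(x) f(z - x) equals rate^2 * exp(-rate * z) for every x between 0 and z, so the integral is just that constant times the interval length z. That factor of z is what pulls the density down to zero at the origin and creates the hump, which peaks at z = 1/rate.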
The shortcut, and why sums build the bell curve
Convolution integrals get ugly fast: adding three or four variables means nesting integral inside integral, and most students never want to do that twice. There is a famous escape hatch. Transforms turn convolution into multiplication. The moment generating function of a sum of independent variables is just the PRODUCT of the individual mgfs: M_Z(t) = M_X(t) M_Y(t). So instead of grinding an integral you multiply two functions, then recognise the answer. The normal and Poisson 'closure' facts above fall out in one line this way — multiply the mgfs and the product is again the mgf of the same family.
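The product rule for mgfs is easy to test on the dice. This sketch, with illustrative helper names convolve_pmf and mgf, builds the pmf of the total the slow way by convolution, then compares its mgf with the square of a single die's mgf at a few arbitrary values of t.

```python
import math


def convolve_pmf(pmf_x, pmf_y):
    """pmf of X + Y for independent X and Y, each given as {value: probability}."""
    pmf_z = {}
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            pmf_z[x + y] = pmf_z.get(x + y, 0.0) + px * py
    return pmf_z


def mgf(pmf, t):
    """Moment generating function E[exp(t * V)] of a finite pmf."""
    return sum(p * math.exp(t * v) for v, p in pmf.items())


die = {face: 1 / 6 for face in range(1, 7)}
two_dice = convolve_pmf(die, die)

for t in (-0.5, 0.0, 0.3):
    # Left: mgf of the convolved pmf. Right: product of the two individual mgfs.
    print(t, mgf(two_dice, t), mgf(die, t) ** 2)
```

The two columns should match up to rounding, which is the identity M_Z(t) = M_X(t) M_Y(t) in action.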
Now the payoff. Every time you add an independent variable you convolve once more, and we saw that convolution smooths and rounds toward symmetry. Pile up enough independent pieces and the shape of the sum, once centred and rescaled, converges to one universal bell. This is the central limit theorem, the reason normal curves appear wherever measurements add up. Be honest about the fine print, though: the classical CLT needs each piece to have FINITE variance. The Cauchy distribution breaks it: its convolutions never settle toward a bell, and the average of n independent Cauchy variables has exactly the same distribution as a single one. So convolution is the machinery, finite variance is the licence, and the bell curve is the reward when both hold.
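A small simulation can illustrate the fine print. Everything here is an illustrative setup rather than a prescribed experiment: the helper names, the seed, the 2000 trials, and the use of the interquartile range as a spread measure (chosen because the Cauchy has no variance to estimate). It compares how the spread of a sample mean changes from n = 1 to n = 100 for uniform and for Cauchy draws.

```python
import math
import random


def draw_uniform(rng):
    """One Uniform(0, 1) draw (finite variance)."""
    return rng.random()


def draw_cauchy(rng):
    """One standard Cauchy draw via the inverse cdf (no finite variance)."""
    return math.tan(math.pi * (rng.random() - 0.5))


def iqr(values):
    """Rough interquartile range: upper quartile minus lower quartile."""
    s = sorted(values)
    return s[(3 * len(s)) // 4] - s[len(s) // 4]


def spread_of_mean(draw, n, trials=2000, seed=0):
    """IQR of the sample mean of n draws, estimated over many repeated trials."""
    rng = random.Random(seed)
    means = [sum(draw(rng) for _ in range(n)) / n for _ in range(trials)]
    return iqr(means)


u1, u100 = spread_of_mean(draw_uniform, 1), spread_of_mean(draw_uniform, 100)
c1, c100 = spread_of_mean(draw_cauchy, 1), spread_of_mean(draw_cauchy, 100)

print("uniform mean spread, n=1 vs n=100:", u1, u100)
print("cauchy  mean spread, n=1 vs n=100:", c1, c100)
```

With finite variance the spread of the mean should shrink by roughly a factor of 10 when n goes from 1 to 100. For the Cauchy it should stay near 2, the interquartile range of a single standard Cauchy draw.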