Gamma, Beta, and the Distribution Family Tree

From a zoo to a family tree

By now this rung has handed you a small menagerie of continuous shapes: the flat uniform, the falling exponential, and the beautiful symmetric normal bell curve. It is tempting to treat each as a separate fact to memorize — its own formula, its own mean, its own picture. That is the wrong mental model, and it makes everything harder than it needs to be. The truth is far kinder: these are relatives, branches of one family tree, and almost every continuous distribution you will meet is some other one with an operation applied to it.

There are only a handful of operations that grow the tree, and you already know all of them from earlier rungs. Add independent copies of a distribution and you climb to a new branch. Square a variable, or sum several squares, and you land on another. Take a ratio of two relatives and you get a third. Push a parameter to a limit and one shape melts into another. Once you see the operations, the names stop being arbitrary and the formulas stop being things to dread. This guide is the map that ties the whole zoo together.

The gamma: adding up exponentials

Start from the exponential, the waiting time for one event in a memoryless stream — say, the time until the next customer walks in. Now ask a slightly bigger question: how long until the *third* customer arrives? You are waiting for one exponential gap, then a second, then a third, and the total time is the sum of three independent exponentials. That sum is no longer exponential; it has a new shape, and it is called the gamma distribution. The gamma has two knobs: a shape parameter (here, the number of events you are waiting for, often written k or alpha) and a rate lambda inherited from the exponentials.

When that shape parameter is a whole number, the gamma has its own honest nickname: the Erlang distribution, which is exactly "the time of the k-th arrival in a Poisson process." The shape is intuitive. With shape 1 the gamma *is* the exponential: one wait, the spike-at-zero-then-falling curve you already know. As the shape grows, you are adding up more and more independent waits, so the curve pulls away from zero, develops a hump, and grows more symmetric. That is a first quiet hint of the central limit theorem at work, since a sum of many independent pieces leans toward a bell. The mean is simply the sum of the parts: E[X] = k / lambda, exactly k times the mean 1/lambda of a single exponential.
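To see the "sum of waits" picture in numbers, here is a minimal simulation sketch using only Python's standard library. The function names, the sample size, and the seed are illustrative choices, not part of the guide. It builds the k-th arrival time by literally adding k exponential gaps and compares the sample mean with k / lambda.

```python
import random
import statistics


def kth_arrival_time(k, lam, rng):
    """One draw of the k-th arrival time: k independent Exponential(lam) gaps added up."""
    return sum(rng.expovariate(lam) for _ in range(k))


def mean_kth_arrival(k, lam, n=50_000, seed=0):
    """Sample mean of n simulated k-th arrival times (n and seed are arbitrary choices)."""
    rng = random.Random(seed)
    return statistics.fmean(kth_arrival_time(k, lam, rng) for _ in range(n))


k, lam = 3, 2.0
print("simulated mean:", round(mean_kth_arrival(k, lam), 3))
print("k / lambda    :", k / lam)
```

The two printed numbers should agree to within simulation noise; raising n tightens the match.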

Be careful with the word "add," though. Summing independent gammas stays inside the gamma family only when they share the same rate lambda: a Gamma(k1, lambda) plus an independent Gamma(k2, lambda) is a Gamma(k1 + k2, lambda). With different rates the sum is something messier that has no tidy name. The discrete world teaches the same lesson: independent Poissons always add, but independent binomials add only when they share the same success probability. Named families are closed under addition only under specific conditions, never automatically.

The chi-squared: squaring normals

Now take a different operation — squaring — and apply it to the family's most important member, the normal. Take a standard normal variable Z (mean 0, variance 1) and square it. The result Z^2 cannot be negative, and small values near zero are most likely while large squares are rare; that lopsided, non-negative shape is the chi-squared distribution with 1 degree of freedom. Add up the squares of d independent standard normals and you get the chi-squared distribution with d degrees of freedom — the d is just how many squared normals you summed.

Here is where the tree reveals its secret wiring: the chi-squared is not a new species at all. The sum of d squared standard normals turns out to be exactly a gamma distribution, specifically a Gamma with shape d/2 and rate 1/2. So squaring-and-summing normals and adding-up-exponentials, two operations that sound utterly different, land on branches of the very same gamma tree. That is not a coincidence to memorize; it is why the chi-squared inherits the gamma's mean so cleanly: shape over rate gives E[X] = (d/2) / (1/2) = d, one unit of mean per degree of freedom. The chi-squared is the engine behind variance estimation and the goodness-of-fit tests you will meet in statistics later.

Two roads to the SAME gamma tree:

   Exp(lambda) + Exp(lambda) + ... (k terms)   =  Gamma(shape = k,   rate = lambda)
   Z1^2 + Z2^2 + ... + Zd^2  (Zi standard normal) =  Gamma(shape = d/2, rate = 1/2)
                                                  =  Chi-squared(d)

   Means:  E[Gamma(k, lambda)] = k / lambda
           E[Chi-squared(d)]   = d

Adding exponentials and summing squared normals are two branches of one gamma family.
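The second road in the diagram can be checked the same way. The sketch below, with hypothetical function names and an arbitrary seed, draws a chi-squared variable two ways: by summing d squared standard normals, and by sampling a gamma with shape d/2 directly. One detail to watch: Python's random.gammavariate takes a shape and a *scale*, so rate 1/2 is passed as scale 2.

```python
import random
import statistics


def sum_of_squared_normals(d, rng):
    """Z1^2 + ... + Zd^2 for independent standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))


def compare_roads(d, n=40_000, seed=1):
    """Return (mean via squared normals, mean via Gamma(shape=d/2, rate=1/2))."""
    rng = random.Random(seed)
    by_squares = [sum_of_squared_normals(d, rng) for _ in range(n)]
    # gammavariate(alpha, beta) uses beta as a scale, so rate 1/2 becomes scale 2.0.
    by_gamma = [rng.gammavariate(d / 2, 2.0) for _ in range(n)]
    return statistics.fmean(by_squares), statistics.fmean(by_gamma)


d = 5
squares_mean, gamma_mean = compare_roads(d)
print("mean of summed squares:", round(squares_mean, 3))
print("mean of gamma draws   :", round(gamma_mean, 3))
print("degrees of freedom d  :", d)
```

Both sample means should land near d. Matching means alone does not prove the distributions are identical, but it is a quick sanity check on the parameterization.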

The beta: a ratio that lives on [0, 1]

The gamma and the chi-squared both stretch over an unbounded range, from zero out to infinity. But many quantities are honestly trapped between 0 and 1: a proportion, a probability, a fraction of voters, a batting average. For those you want a flexible shape that lives only on the interval [0, 1], and that is the beta distribution. The beta has two shape parameters, usually called alpha and beta, and tuning them lets the curve be flat, bell-shaped, U-shaped, or piled up at either end. The flat uniform on (0,1) is just the special case alpha = beta = 1, so the uniform you met first in this rung is secretly the simplest beta of all.

Where does the beta come from on the tree? From a ratio. Take two independent gamma variables that share a rate, X with shape alpha and Y with shape beta, and form X / (X + Y). The shared rate cancels in the fraction, the scale washes out, and what survives is a number squeezed into [0, 1] whose distribution is exactly a Beta(alpha, beta). So the beta is the gamma family's way of asking "what *fraction* of the total does this part account for?", which is why it is the natural home for proportions. Its mean is delightfully simple: E[X / (X + Y)] = alpha / (alpha + beta), just the first shape's share of the two shape parameters.
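A short simulation makes the "rate cancels" claim concrete. In this sketch (function names, shapes, rates, and seed are all illustrative assumptions) the share X / (X + Y) is simulated at two very different rates; the sample mean should sit near alpha / (alpha + beta) both times, and every draw should stay inside [0, 1].

```python
import random
import statistics


def gamma_share(alpha, beta, rate, rng):
    """X / (X + Y) for independent X ~ Gamma(alpha, rate) and Y ~ Gamma(beta, rate)."""
    # random.gammavariate expects a scale, which is 1 / rate.
    x = rng.gammavariate(alpha, 1.0 / rate)
    y = rng.gammavariate(beta, 1.0 / rate)
    return x / (x + y)


def share_summary(alpha, beta, rate, n=40_000, seed=2):
    """Return (sample mean, smallest draw, largest draw) of the simulated shares."""
    rng = random.Random(seed)
    shares = [gamma_share(alpha, beta, rate, rng) for _ in range(n)]
    return statistics.fmean(shares), min(shares), max(shares)


alpha, beta = 2, 5
for rate in (0.5, 10.0):
    mean, lo, hi = share_summary(alpha, beta, rate)
    print(f"rate {rate}: mean share {mean:.3f}, range [{lo:.3f}, {hi:.3f}]")
print("alpha / (alpha + beta):", round(alpha / (alpha + beta), 3))
```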

The beta has one more starring role you can feel intuitively. If you treat an unknown probability p as itself uncertain and give it a beta prior, then watch coin flips, the updated belief is again a beta — you just add your successes to alpha and your failures to beta. That tidy "add the data to the parameters" update is why the beta is the natural partner, the conjugate prior, for binomial data in Bayesian inference. A beta starts as your honest guess about a proportion and tightens, flip by flip, toward the truth.
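The "add the data to the parameters" update is simple enough to write out in a few lines. The sketch below is illustrative only: the function names and the sequence of flips are made up for the example. It starts from the flat Beta(1, 1) prior, updates on a handful of flips, and uses the standard beta variance formula, alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1)), to show the belief tightening.

```python
def update_beta(alpha, beta, flips):
    """Conjugate update: each success adds 1 to alpha, each failure adds 1 to beta."""
    for success in flips:
        if success:
            alpha += 1
        else:
            beta += 1
    return alpha, beta


def beta_mean(alpha, beta):
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))


# Hypothetical data: True is heads (success), False is tails (failure).
flips = [True, True, False, True, False, True, True]

prior = (1, 1)  # the flat uniform on (0, 1)
posterior = update_beta(*prior, flips)

print("prior    :", prior, "mean", beta_mean(*prior), "variance", beta_variance(*prior))
print("posterior:", posterior, "mean", beta_mean(*posterior), "variance", beta_variance(*posterior))
```

With five heads and two tails the posterior is Beta(6, 3): the mean moves from 1/2 toward the observed share of heads, and the variance shrinks.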

The normal at the center, and the limits that lead to it

If the gamma is the trunk for waiting times and squares, the normal is the gravitational center of the whole forest, and the reason is the central limit theorem from the previous guides. Add up enough independent contributions of comparable size, each with finite variance, and the sum, suitably rescaled, drifts toward the normal almost regardless of where the pieces came from. That is why the normal keeps appearing as a *limit* of other distributions: a binomial with many trials, a Poisson with a large mean, and a chi-squared with many degrees of freedom all settle into bell curves once centered and rescaled. The normal is less one more animal in the zoo than the shape the whole zoo bends toward when you add things up.

The same trunk grows two distributions you will lean on constantly in statistics, both built by ratios involving normals and chi-squareds. Divide a standard normal by the square root of an independent (scaled) chi-squared and you get Student's t-distribution — a bell curve with heavier tails that accounts for the extra uncertainty of estimating a standard deviation from a small sample; as the degrees of freedom grow, its tails thin and it slides back toward the normal. Take a ratio of two independent chi-squareds, each divided by its degrees of freedom, and you get the F-distribution, the workhorse for comparing two variances. Same handful of operations — squares, sums, ratios — generating the entire statistical toolkit.
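Both constructions can be built directly from standard normals, which is a useful way to convince yourself the recipes are right. The sketch below uses hypothetical helper names and arbitrary seeds and sample sizes. It checks two things: that the t-distribution puts more mass beyond |2| when the degrees of freedom are small, and that the simulated F mean is close to d2 / (d2 - 2), a standard fact for d2 > 2 that the guide itself does not state.

```python
import math
import random
import statistics


def chi_squared(d, rng):
    """Sum of d squared independent standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))


def student_t(d, rng):
    """Standard normal divided by the square root of an independent chi-squared over d."""
    z = rng.gauss(0.0, 1.0)
    return z / math.sqrt(chi_squared(d, rng) / d)


def f_ratio(d1, d2, rng):
    """Ratio of two independent chi-squareds, each divided by its degrees of freedom."""
    return (chi_squared(d1, rng) / d1) / (chi_squared(d2, rng) / d2)


def t_tail_fraction(d, cutoff=2.0, n=20_000, seed=3):
    """Share of simulated t draws with absolute value above the cutoff."""
    rng = random.Random(seed)
    return sum(abs(student_t(d, rng)) > cutoff for _ in range(n)) / n


def f_sample_mean(d1, d2, n=20_000, seed=4):
    rng = random.Random(seed)
    return statistics.fmean(f_ratio(d1, d2, rng) for _ in range(n))


print("t tail beyond |2|, d = 5 :", t_tail_fraction(5))
print("t tail beyond |2|, d = 30:", t_tail_fraction(30))
print("F(4, 10) sample mean:", round(f_sample_mean(4, 10), 3), "vs", 10 / (10 - 2))
```

For comparison, a standard normal puts roughly 4.6% of its mass beyond |2|; the d = 30 tail should already be close to that, while the d = 5 tail is noticeably heavier.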

Reading the tree: a worked walkthrough

The payoff of the family tree is that you can decode a new distribution by its operations instead of opening a formula sheet. Suppose a server processes jobs in a memoryless stream and you ask: what is the total time to finish the first 4 jobs, and how does the *fraction* of that time spent on the first 2 behave? Each gap is exponential; their sum is a gamma; and a fraction of the total is a beta. Let us walk it through.

Identify the atom. Each inter-job gap is independent and memoryless, so each is Exponential(lambda). This is the building block, the shape-1 gamma.
Apply the operation: add. The time for the first 4 jobs is a sum of 4 independent exponentials of the same rate, which is Gamma(shape = 4, rate = lambda) — equivalently the Erlang of order 4.
Read off the mean for free. Since means add, E[total] = 4 / lambda — no integral, just four exponential means stacked.
Form the fraction. Split the total as X (first 2 jobs, a Gamma with shape 2) plus Y (last 2 jobs, a Gamma with shape 2). The fraction X / (X + Y) is a Beta(alpha = 2, beta = 2), with mean 2 / (2 + 2) = 1/2 — symmetric, exactly as fairness suggests.
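The four steps above can be replayed as a simulation. This is a sketch under stated assumptions: the rate lambda = 0.5, the sample size, and the seed are arbitrary, and the function name is made up. Each trial draws four exponential gaps, records the total, and records the fraction of it taken by the first two jobs.

```python
import random
import statistics


def simulate_four_jobs(lam, n=40_000, seed=5):
    """Return (mean total time for 4 jobs, mean fraction spent on the first 2)."""
    rng = random.Random(seed)
    totals = []
    fractions = []
    for _ in range(n):
        gaps = [rng.expovariate(lam) for _ in range(4)]  # the atoms
        total = sum(gaps)                                # add: Gamma(4, lam)
        first_two = gaps[0] + gaps[1]                    # X, a Gamma(2, lam)
        totals.append(total)
        fractions.append(first_two / total)              # ratio: Beta(2, 2)
    return statistics.fmean(totals), statistics.fmean(fractions)


lam = 0.5
mean_total, mean_fraction = simulate_four_jobs(lam)
print("mean total   :", round(mean_total, 3), "vs 4 / lambda =", 4 / lam)
print("mean fraction:", round(mean_fraction, 3), "vs 1/2")
```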

Notice what just happened: three named distributions (exponential, gamma, beta) and their means (1/lambda, 4/lambda, and 1/2), all read straight off the steps (atom, sum, ratio) without integrating a single density. One honest reminder to carry forward: a density is not a probability, and a single exact value (the total being exactly 4/lambda, say) has probability zero, just as it did for every continuous variable in this rung. The family tree tells you the *shape* and the *summary numbers*; to turn it into an actual probability you still integrate the density over an interval. That is the whole continuous rung in one move, and the launchpad for the expectation and joint-distribution rungs ahead.
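That last step, integrating the density over an interval, can be sketched numerically for the walkthrough's Gamma(4, lambda) total. The code below uses the standard Erlang density, lambda^k * t^(k-1) * exp(-lambda * t) / (k-1)!, and a plain midpoint rule; the interval endpoints, the rate, and the function names are illustrative assumptions. As a cross-check it compares the result with the closed-form Erlang CDF, which is valid for whole-number k.

```python
import math


def erlang_pdf(t, k, lam):
    """Density of Gamma(shape=k, rate=lam) for whole-number k. A height, not a probability."""
    return lam ** k * t ** (k - 1) * math.exp(-lam * t) / math.factorial(k - 1)


def erlang_cdf(t, k, lam):
    """P(T <= t): one minus the chance of seeing fewer than k arrivals by time t."""
    return 1.0 - sum(
        math.exp(-lam * t) * (lam * t) ** n / math.factorial(n) for n in range(k)
    )


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


k, lam = 4, 2.0
a, b = 1.5, 2.5  # an interval around the mean 4 / lambda = 2

prob = integrate(lambda t: erlang_pdf(t, k, lam), a, b)
print("P(a <= T <= b) by integration:", prob)
print("same probability from the CDF:", erlang_cdf(b, k, lam) - erlang_cdf(a, k, lam))

# A density value can exceed 1, which a probability never can.
print("density at t = 0.3 with lambda = 10:", erlang_pdf(0.3, 4, 10.0))
```

The two probabilities should match to many decimal places, and the final line shows a density height above 1, a concrete reminder that heights and probabilities are different things.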