From Probability to Statistics

The arrow turned around

Everything you have done so far in this ladder points in one direction: you fix a model — a coin with bias p, a Poisson with rate lambda, a normal with mean mu and variance sigma^2 — and then you compute what the data will look like. That is probability: model in, data out. Statistics is the same machinery run in reverse. The data already happened, and the model is the unknown. You hold a fixed sample of numbers and ask: what value of p, or lambda, or mu, made this sample plausible? The whole subject is the art of running the probability arrow backward.

The bridge between the two worlds is one quiet shift of viewpoint. In probability you wrote a density p(x given theta) and read it as a function of x, the data, with the parameter theta held fixed. In statistics you read the very same expression as a function of theta, with x held fixed at the values you observed. Read forward it is a density; read backward it is a likelihood. Same formula, opposite question. That single re-reading is the doorway from this rung of probability into all of statistics.
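To make the re-reading concrete, here is a minimal sketch that evaluates one formula in both directions. The helper name normal_pdf and the specific numbers are illustrative assumptions, not something fixed by the text.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Read forward: the parameters are fixed and x varies. This is a density in x.
density_in_x = [normal_pdf(x, mu=0.0, sigma=1.0) for x in (-1.0, 0.0, 1.0)]

# Read backward: x is fixed at the observed value and mu varies.
# The same formula is now a likelihood in mu.
x_observed = 1.0  # hypothetical single observation
likelihood_in_mu = [normal_pdf(x_observed, mu=m, sigma=1.0) for m in (0.0, 1.0, 2.0)]

print(density_in_x)
print(likelihood_in_mu)
```

Of the three candidate values of mu tried here, the likelihood is largest at mu = 1.0, the value that matches the observation. That is the backward question answering itself in miniature.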

An estimator is a random variable

Here is the idea that quietly unifies the whole subject, and it is pure probability. An estimator is a rule that turns a sample into a number — the sample mean, the sample variance, the largest observation. Before you collect the data, the sample X(1), ..., X(n) is a list of random variables, so any function of them is itself a random variable. The sample mean Xbar = (X(1) + ... + X(n)) / n is not a fixed number; it has its own distribution, called the sampling distribution. Statistics is, in large part, the study of those sampling distributions — because they tell you how much your single answer would have jiggled had the world dealt you a different sample.

Two probability theorems from earlier rungs are doing the heavy lifting. The law of large numbers promises that as n grows, Xbar converges to the true mean mu, so a sensible estimator homes in on the truth as data accumulates. The central limit theorem then describes the leftover wobble: for large n, and provided the variance sigma^2 is finite, Xbar is approximately Normal(mu, sigma^2 / n). That sigma^2 / n is the engine room of statistics. The standard deviation of the estimator, sigma / sqrt(n), is its standard error, and the sqrt(n) is why halving your uncertainty costs four times the data, not twice.
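The sigma / sqrt(n) rule is easy to check by simulation. The sketch below assumes normal data with mu = 0 and sigma = 1 (an arbitrary choice for illustration), and the function name simulate_sample_means is hypothetical.

```python
import math
import random
import statistics

def simulate_sample_means(n, reps, mu=0.0, sigma=1.0, seed=0):
    """Draw `reps` independent samples of size n and return their sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(reps)
    ]

for n in (25, 100):
    means = simulate_sample_means(n, reps=4000)
    observed_se = statistics.stdev(means)   # spread of Xbar across simulated worlds
    predicted_se = 1.0 / math.sqrt(n)       # sigma / sqrt(n) with sigma = 1
    print(n, round(observed_se, 3), round(predicted_se, 3))
```

Going from n = 25 to n = 100 is four times the data, and the observed spread of the sample mean should drop by about half, in line with the prediction.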

Maximum likelihood: let the data vote

If the likelihood ranks parameter values by how well they explain the data, the obvious move is to pick the winner: the value of theta that makes the observed data most probable. That is maximum likelihood estimation, the workhorse of classical statistics. When the data points are modeled as independent, which is the usual starting assumption, the likelihood is a product of per-point densities, and products are awkward to maximize. The standard trick is to take the logarithm, which turns the product into a sum and, since log is strictly increasing, does not move the location of the maximum, and then set the derivative of the log-likelihood to zero. That single derivative is where your calculus from the foundations finally pays a dividend.

1. Write the likelihood L(theta) = product of p(x(i) given theta) over the observed points x(1), ..., x(n).
2. Take logs to get the log-likelihood l(theta) = sum of log p(x(i) given theta); products become sums.
3. Differentiate l(theta) with respect to theta and set it to zero: this gives the likelihood equation.
4. Solve for theta and check it is a maximum (the second derivative is negative there), not a minimum or saddle.

Work it once and it sticks. Flip a coin n times and see k heads. The likelihood is L(p) = p^k * (1-p)^(n-k); the log-likelihood is l(p) = k log p + (n-k) log(1-p); its derivative is k/p - (n-k)/(1-p). Setting that to zero gives k(1-p) = (n-k)p, so k = np and p = k/n. The second derivative, -k/p^2 - (n-k)/(1-p)^2, is negative everywhere on 0 < p < 1, so this is a maximum. The maximum likelihood estimate of a coin's bias is simply the fraction of heads you saw: exactly the answer your gut would have shouted, now derived rather than guessed. The same recipe on the Poisson hands you lambda = sample mean, and on the normal it hands you mu = sample mean. That is a satisfying confirmation that the machinery agrees with common sense on the easy cases before you trust it on the hard ones.
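The coin result can also be checked by brute force, which is a useful habit before trusting a derivation on harder models. The sketch below simply evaluates the log-likelihood on a fine grid and picks the best point. The data (37 heads in 100 flips) and the function names are made up for illustration.

```python
import math

def coin_log_likelihood(p, k, n):
    """l(p) = k log p + (n - k) log(1 - p), defined for 0 < p < 1."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def grid_mle(k, n, steps=10_000):
    """Return the grid point strictly inside (0, 1) with the largest log-likelihood."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda p: coin_log_likelihood(p, k, n))

k, n = 37, 100  # hypothetical data: 37 heads in 100 flips
p_hat = grid_mle(k, n)
print(p_hat, k / n)
```

A grid search is crude and only works for one or two parameters, but it needs no calculus at all, which makes it an honest cross-check on the formula p = k/n.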

Sufficiency: when a summary loses nothing

Notice something striking about the coin example: the answer depended only on k, the total number of heads — not on the order in which they fell, not on which flips were heads. You could throw away the entire sequence, keep only the count, and lose nothing about p. A statistic with that property is called a sufficient statistic: once you know its value, the original data carries no further information about the parameter. Formally, the conditional distribution of the full sample given the sufficient statistic does not depend on theta at all — every drop of theta-relevant information has been squeezed into that one number.
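A quick numerical check makes the point. Under the independent-flips model, two sequences with the same number of heads should have the same probability at every value of p, whatever the order. The sequences and the function name below are illustrative assumptions.

```python
import math

def sequence_likelihood(flips, p):
    """Probability of one specific sequence of independent flips (1 = heads, 0 = tails)."""
    prob = 1.0
    for flip in flips:
        prob *= p if flip == 1 else (1.0 - p)
    return prob

# Two hypothetical sequences with the same count of heads (k = 3) in a different order.
seq_a = [1, 1, 0, 0, 1, 0]
seq_b = [0, 1, 0, 1, 0, 1]

for p in (0.2, 0.5, 0.8):
    like_a = sequence_likelihood(seq_a, p)
    like_b = sequence_likelihood(seq_b, p)
    print(p, math.isclose(like_a, like_b))
```

The comparison uses math.isclose rather than exact equality because multiplying the same factors in a different order can differ in the last floating-point digit.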

There is a clean test for sufficiency, the factorization criterion: T is sufficient for theta exactly when the likelihood splits as L(theta) = g(T(x), theta) * h(x), where h does not involve theta. In words, theta touches the data only through T. This is not mere tidiness; it is data compression with a guarantee. A million coin flips collapse to a single count, and a sample from a normal collapses to just the sum and the sum of squares. The guarantee, made precise by the Rao-Blackwell theorem, is that under squared-error loss any estimator built from the raw data can be matched or beaten by one that depends on the data only through the sufficient summary. The smallest such summary, a minimal sufficient statistic, is the most compact honest description of your data for the question at hand.
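As a worked instance of the criterion, take the coin model with the flips coded as x_i = 1 for heads and x_i = 0 for tails (the coding is an assumption made here for illustration) and let T(x) be the number of heads. The likelihood of the observed sequence then factors with h(x) = 1:

```latex
\begin{aligned}
L(p) &= \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i}
      = p^{\sum_i x_i}\,(1-p)^{\,n-\sum_i x_i} \\
     &= \underbrace{p^{T(x)}\,(1-p)^{\,n-T(x)}}_{g(T(x),\,p)}
        \cdot \underbrace{1}_{h(x)},
\qquad T(x) = \sum_{i=1}^{n} x_i = k .
\end{aligned}
```

The parameter p appears only alongside T(x), which is exactly what the criterion asks for, so the count k is sufficient for p.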

One honest caveat: sufficiency is always relative to a model. The count k is sufficient for the bias of a coin only if you have already committed to the model 'independent flips with a fixed p'. If you suspect the coin's bias drifts over time, the order suddenly matters again and k is no longer sufficient. A sufficient statistic compresses everything the model says is relevant — and nothing about whether the model is right in the first place.

The bootstrap: a thousand worlds from one dataset

Maximum likelihood hands you a single number; the sampling distribution tells you how much it would wobble — but computing that wobble usually needs a formula, and for a fancy estimator (the median, a ratio, a trimmed mean) the formula may be hopeless. The bootstrap is the gloriously simple workaround, and it leans on exactly the Monte Carlo thinking from guides 2 and 3 of this rung. The trick: you cannot resample from the true population, because you do not have it — but your sample is your best picture of that population. So treat the sample as if it were the population, and draw new samples from it.

1. Start with your one real sample of size n. From it, draw n points at random with replacement: a bootstrap sample (some originals appear more than once, some not at all).
2. Compute your estimator (mean, median, whatever you care about) on that bootstrap sample. Record the number.
3. Repeat the resample-and-compute step thousands of times. You now have thousands of estimator values.
4. The spread of those values approximates the sampling distribution: their standard deviation estimates the standard error, and their 2.5th and 97.5th percentiles give an approximate 95% confidence interval (the percentile interval).
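The steps above can be sketched in a few lines. This is a minimal illustration, not a production routine: the data are simulated from an exponential distribution purely as a stand-in for a skewed real sample, the estimator is the median, and the helper names bootstrap_estimates and percentile are hypothetical.

```python
import random
import statistics

def bootstrap_estimates(sample, estimator, reps=2000, seed=0):
    """Resample with replacement `reps` times and apply the estimator each time."""
    rng = random.Random(seed)
    n = len(sample)
    return [estimator(rng.choices(sample, k=n)) for _ in range(reps)]

def percentile(sorted_values, q):
    """Simple nearest-index percentile for q in [0, 1] on an already sorted list."""
    index = round(q * (len(sorted_values) - 1))
    return sorted_values[index]

# Hypothetical skewed data standing in for your one real sample.
data_rng = random.Random(42)
sample = [data_rng.expovariate(1.0) for _ in range(200)]

estimates = sorted(bootstrap_estimates(sample, statistics.median))
standard_error = statistics.stdev(estimates)   # bootstrap standard error of the median
ci_low = percentile(estimates, 0.025)          # lower end of the percentile interval
ci_high = percentile(estimates, 0.975)         # upper end of the percentile interval

print(round(standard_error, 3), round(ci_low, 3), round(ci_high, 3))
```

Swapping statistics.median for any other function of a list (a trimmed mean, a ratio) changes nothing else in the code, which is the whole appeal: the recipe does not care whether a formula for the standard error exists.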

It feels like cheating, pulling extra information out of thin air, but it is not. Resampling with replacement does not invent new facts; it reveals how unstable your estimator already is by replaying the one piece of randomness you can see, namely which points happened to land in your sample. Be honest about the limits, though. The bootstrap is only as good as the assumption that your sample resembles the population, so it struggles with tiny samples and with quantities that depend on rare extremes. The maximum is the standard example: no bootstrap sample can contain a value larger than the largest one you actually observed, so every resampled maximum is at most the sample maximum and the bootstrap systematically understates the tail. It is a brilliant default, not a universal solvent.
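The failure on the maximum is easy to see in a small experiment. The sketch below assumes a sample of 50 draws from Uniform(0, 1), chosen only because the true upper limit, 1, is known. It then bootstraps the sample maximum.

```python
import random

rng = random.Random(7)
sample = [rng.uniform(0.0, 1.0) for _ in range(50)]  # hypothetical data, true upper limit 1.0
sample_max = max(sample)

# Bootstrap the maximum: resample with replacement, take the max each time.
boot_maxima = [max(rng.choices(sample, k=len(sample))) for _ in range(2000)]

never_exceeds = all(m <= sample_max for m in boot_maxima)
share_equal = sum(m == sample_max for m in boot_maxima) / len(boot_maxima)

print(never_exceeds, round(share_equal, 2))
```

A large share of the resampled maxima land exactly on the observed maximum, because the largest point is included in a given resample with probability 1 - (1 - 1/n)^n, roughly 0.64 for n = 50. The bootstrap distribution is therefore a spike at the sample maximum plus values below it, with nothing between the sample maximum and the true limit.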

Two roads up the same mountain

You now hold both halves of statistics. The frequentist road, traveled in this guide, fixes theta as an unknown constant and treats the data as random: maximum likelihood, standard errors, the bootstrap, and the confidence interval, whose honest reading is subtle — '95% of intervals built this way would cover the true theta', not 'there is a 95% chance theta lies in this one interval'. The Bayesian road, from guide 1, treats theta itself as a random variable with a prior, updates it with the likelihood, and reports a credible interval, which you genuinely may read as 'theta lies here with probability 0.95'. The two answers often nearly coincide with lots of data, and diverge most when data is scarce and the prior speaks loudly.
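The frequentist reading of a confidence interval is a claim about the procedure, and a simulation shows it directly. The sketch below assumes normal data with a known sigma, so the interval is Xbar plus or minus 1.96 * sigma / sqrt(n). The true mean, sigma, and sample size are arbitrary illustrative values, and the function name coverage is hypothetical.

```python
import math
import random
import statistics

def coverage(true_mu=5.0, sigma=2.0, n=30, reps=4000, seed=1):
    """Fraction of simulated 95% intervals that contain the true mean."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)  # known-sigma interval, an assumption here
    hits = 0
    for _ in range(reps):
        xbar = statistics.fmean([rng.gauss(true_mu, sigma) for _ in range(n)])
        if xbar - half_width <= true_mu <= xbar + half_width:
            hits += 1
    return hits / reps

print(coverage())
```

The printed fraction should sit close to 0.95. Each individual interval either contains true_mu or it does not; the 95% describes how often the recipe succeeds across repeated samples, which is exactly the subtle reading described above.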

Step back and admire how little new machinery this took. Estimators are random variables; their behavior is governed by the law of large numbers and the central limit theorem; likelihood is just a density read backward; sufficiency is a conditional-distribution statement; the bootstrap is Monte Carlo applied to your own sample. Every tool in this guide is a probability idea you already met, merely pointed at the inverse question. That is the real lesson of the bridge: statistics is not a separate subject bolted onto probability — it is probability, run in reverse, with the courage to admit that the model is the thing we do not know.