Order Statistics: The Max, the Min, and the Median

Sorting turns a sample into new random variables

Imagine you draw five independent measurements of how long a bus takes to arrive: 7, 3, 11, 4, 9 minutes. Before you look, each is its own random variable. Now sort them: 3, 4, 7, 9, 11. The sorted list is a sequence of new random variables called the order statistics, written X_(1), X_(2), ..., X_(n), where X_(1) is the smallest (the min), X_(n) is the largest (the max), and the middle one is the median. The key mental shift is that X_(1) is no longer "the first thing you measured"; it is "whichever of the measurements turned out smallest", and that makes it a genuinely new random quantity with its own distribution.

Why care? Order statistics are everywhere the extreme or the typical value matters. The max models the strongest flood in a hundred years or the worst-case latency in a server farm; the min models the first component to fail and so the lifetime of a chain; the median gives a robust sense of the center that one wild outlier cannot drag around. We will assume throughout that the n original draws are independent and identically distributed — an i.i.d. sample from one common distribution with cdf F and density f — because that assumption is what makes the algebra clean. The whole subject is really just the cdf method from earlier in this rung, applied with one good idea.

The max and the min: one clever event each

Start with the maximum, because it has the slickest trick. To find its cdf we ask: what is P(X_(n) <= x)? Here is the good idea — the largest of the n values is at most x if and only if every single value is at most x. That converts a statement about the messy maximum into a statement about all n variables at once, and because they are independent the joint probability factors into a simple product. With a common cdf F, each draw satisfies P(X_i <= x) = F(x), so P(X_(n) <= x) = F(x)^n. That is the entire derivation of the distribution of the maximum.

The minimum yields to the mirror-image trick, but you must flip to the complement first. The smallest value exceeds x if and only if all of them exceed x, so P(X_(1) > x) = (1 - F(x))^n. That gives the survival side directly; subtract from 1 to get the cdf: P(X_(1) <= x) = 1 - (1 - F(x))^n. A tiny number check makes it stick: with n = 3 i.i.d. draws from the uniform distribution on [0, 1], where F(x) = x, the max has cdf x^3 and the min has cdf 1 - (1 - x)^3. At x = 0.5 the max is below 0.5 with probability only 0.125, while the min is below 0.5 with probability 0.875 — exactly the intuition that with three tries the largest tends to sit high and the smallest tends to sit low.
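
As a quick check of the two cdf formulas, here is a minimal Python sketch that evaluates them at x = 0.5 for n = 3 uniform draws and compares the result with a seeded simulation. The function names (max_cdf, min_cdf, simulate), the seed, and the trial count are illustrative choices, not part of the derivation.

```python
import random

def max_cdf(F_x, n):
    """P(X_(n) <= x) for n i.i.d. draws, given the value F_x = F(x)."""
    return F_x ** n

def min_cdf(F_x, n):
    """P(X_(1) <= x): one minus the chance that all n draws exceed x."""
    return 1.0 - (1.0 - F_x) ** n

def simulate(n, x, trials=100_000, seed=0):
    """Estimate both cdfs at x by brute force, using uniform(0, 1) draws."""
    rng = random.Random(seed)
    max_hits = 0
    min_hits = 0
    for _ in range(trials):
        draws = [rng.random() for _ in range(n)]
        if max(draws) <= x:
            max_hits += 1
        if min(draws) <= x:
            min_hits += 1
    return max_hits / trials, min_hits / trials

# For the uniform distribution on [0, 1], F(x) = x.
exact_max = max_cdf(0.5, 3)
exact_min = min_cdf(0.5, 3)
sim_max, sim_min = simulate(n=3, x=0.5)

print(f"max: exact {exact_max:.3f}, simulated {sim_max:.3f}")
print(f"min: exact {exact_min:.3f}, simulated {sim_min:.3f}")
```

The exact values are 0.125 and 0.875; the simulated frequencies should land close to them, up to sampling noise.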

Densities of the extremes, and a worked example

Once you have a cdf you get the density by differentiating, exactly the recover-the-density move from earlier in this rung. Differentiating F(x)^n by the chain rule gives the density of the maximum: f_max(x) = n * F(x)^(n-1) * f(x). Read it as a story: to land the maximum exactly at x, one of the n draws must sit at x (the f(x) factor), and the other n - 1 must all fall below it (the F(x)^(n-1) factor), and there are n choices for which draw is the high one (the leading n). The minimum mirrors it: f_min(x) = n * (1 - F(x))^(n-1) * f(x), where now the other n - 1 must all sit above.

Let us make it concrete with a lifetime problem. Suppose a gadget contains 4 independent components, each with lifetime exponentially distributed so that F(x) = 1 - e^(-x) for x >= 0 (time in years, mean 1 year). If the gadget dies the instant its *first* component dies, its lifetime is the minimum of the four. Then 1 - F(x) = e^(-x), and f_min(x) = 4 * (e^(-x))^3 * e^(-x) = 4 e^(-4x). That is just an exponential with rate 4 — the minimum of n i.i.d. exponentials is again exponential, with rates added. So a 4-component series gadget has mean life 1/4 year, four times shorter than a single component, which matches the gut feeling that a chain is only as strong as its weakest link.
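
The gadget calculation can be checked two ways in a short Python sketch: the general minimum-density formula is compared with the closed form 4 e^(-4x) at a few points, and a seeded simulation estimates the mean lifetime. The helper name min_density, the test points, and the trial count are assumptions made for illustration.

```python
import math
import random

def min_density(x, n, F, f):
    """Density of the minimum of n i.i.d. draws: n * (1 - F(x))^(n-1) * f(x)."""
    return n * (1.0 - F(x)) ** (n - 1) * f(x)

# Rate-1 exponential lifetime, as in the gadget example.
def F(x):
    return 1.0 - math.exp(-x)

def f(x):
    return math.exp(-x)

# The general formula should agree with the closed form 4 * e^(-4x).
formula_agrees = all(
    math.isclose(min_density(x, 4, F, f), 4.0 * math.exp(-4.0 * x), rel_tol=1e-9)
    for x in (0.1, 0.5, 2.0)
)

# Simulate the gadget: its lifetime is the smallest of 4 component lifetimes.
rng = random.Random(0)
trials = 100_000
lifetimes = [min(rng.expovariate(1.0) for _ in range(4)) for _ in range(trials)]
mean_life = sum(lifetimes) / trials

print("formula matches closed form:", formula_agrees)
print(f"simulated mean life: {mean_life:.3f} years (theory: 0.250)")
```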

The general k-th order statistic and the median

The min and max were special because every other draw had to fall on one side. For a general X_(k), such as the median (the middle value), the bookkeeping is richer but the picture is the same. To put the k-th smallest exactly at x, you need one draw at x, exactly k - 1 draws below x, and the remaining n - k draws above x. A single draw lands below x with probability F(x) and above x with probability 1 - F(x), while the draw "at" x contributes the density f(x), that is, probability f(x) dx of falling in a sliver of width dx. Counting the ways to assign the n draws to those three roles is a multinomial choice, giving the density:

Density of the k-th order statistic of an i.i.d. sample (cdf F, density f):

  f_(k)(x) = [ n! / ((k-1)! * 1! * (n-k)!) ] * F(x)^(k-1) * f(x) * (1 - F(x))^(n-k)
             \___________ count ___________/   \below/   \at/   \___above___/

Special cases:
  k = n  (max):  f_(n)(x) = n * F(x)^(n-1) * f(x)
  k = 1  (min):  f_(1)(x) = n * f(x) * (1 - F(x))^(n-1)

One master formula: pick which draw sits at x, how many fall below, how many above.
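
The master formula translates almost word for word into code. The sketch below, in Python, assumes a uniform parent on [0, 1] purely for illustration; the function name order_stat_density and the midpoint-rule integration are choices made here, not something the formula requires. It checks that the k = n and k = 1 cases reduce to the max and min densities, and that the density of the middle order statistic integrates to about 1.

```python
import math

def order_stat_density(x, k, n, F, f):
    """Density of the k-th smallest of n i.i.d. draws with cdf F and density f."""
    # Multinomial count: n! / ((k-1)! * 1! * (n-k)!)
    count = math.factorial(n) // (math.factorial(k - 1) * math.factorial(n - k))
    return count * F(x) ** (k - 1) * f(x) * (1.0 - F(x)) ** (n - k)

# Uniform parent on [0, 1] (an assumption for this example): F(x) = x, f(x) = 1.
def F(x):
    return x

def f(x):
    return 1.0

n = 5
x = 0.3

# Special cases: k = n gives the max density, k = 1 gives the min density.
max_matches = math.isclose(
    order_stat_density(x, n, n, F, f), n * F(x) ** (n - 1) * f(x)
)
min_matches = math.isclose(
    order_stat_density(x, 1, n, F, f), n * f(x) * (1.0 - F(x)) ** (n - 1)
)

# Midpoint-rule integral of the k = 3 density over [0, 1]; a density must total 1.
steps = 10_000
h = 1.0 / steps
total = sum(order_stat_density((i + 0.5) * h, 3, n, F, f) for i in range(steps)) * h

print("max case matches:", max_matches)
print("min case matches:", min_matches)
print(f"integral of the k = 3 density: {total:.6f}")
```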

The sample median is just X_(k) with k chosen at the middle (for odd n, k = (n+1)/2). Plugging into the master formula and integrating x against that density gives the median's expected value, a robust estimate of the center. Here is an honest caution that trips people up: even when the parent distribution is symmetric, the median's density is generally not the same shape as a single draw's density, and for a skewed parent the *expected* sample median need not equal the population median. The two do agree when the parent is symmetric about its center and the expectation exists. Order statistics are their own distributions, not copies of the parent. A second honest point: a single sorted value still has zero probability of equaling any exact x. As always for continuous variables, the density is not a probability; only its integral over an interval is.
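
To see the caution about the expected median in numbers, the following Python sketch integrates x against the median's density for n = 3 draws from a rate-1 exponential, a skewed parent chosen here only as an example. The integration range [0, 40], the step count, and the function name order_stat_density are assumptions of the sketch. For this parent the exact answer works out to 5/6, while the population median is ln 2.

```python
import math

def order_stat_density(x, k, n, F, f):
    """Density of the k-th smallest of n i.i.d. draws with cdf F and density f."""
    count = math.factorial(n) // (math.factorial(k - 1) * math.factorial(n - k))
    return count * F(x) ** (k - 1) * f(x) * (1.0 - F(x)) ** (n - k)

# Rate-1 exponential parent (skewed to the right).
def F(x):
    return 1.0 - math.exp(-x)

def f(x):
    return math.exp(-x)

n = 3
k = (n + 1) // 2  # middle order statistic for odd n

# Midpoint rule on [0, 40]; the density beyond 40 is negligible for this parent.
upper = 40.0
steps = 200_000
h = upper / steps
expected_median = 0.0
for i in range(steps):
    x = (i + 0.5) * h
    expected_median += x * order_stat_density(x, k, n, F, f) * h

population_median = math.log(2.0)

print(f"expected sample median (n = 3): {expected_median:.4f}")
print(f"population median:              {population_median:.4f}")
```

The expected sample median comes out near 0.833, noticeably above the population median of about 0.693, so the two quantities really are different for this skewed parent.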

The range, and why uniform order statistics are the secret skeleton

Two order statistics combine into a famous third quantity: the range, R = X_(n) - X_(1), the spread from smallest to largest. The sample range is the simplest measure of dispersion there is, and finding its distribution needs the *joint* density of the min and the max rather than either alone. This is the same combine-then-transform spirit as the convolution and Jacobian guides earlier in this rung, where a function of several variables is pushed through to a new one. A quick sanity check: adding a draw can only raise the max or lower the min, so the range never shrinks as n grows, and more samples mean a better chance of catching an extreme on each end.
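
The widening of the range is easy to watch in a simulation. This Python sketch assumes a uniform parent on [0, 1], where the expected max is n/(n+1) and the expected min is 1/(n+1), so the expected range is (n-1)/(n+1). The function name mean_range, the sample sizes, and the trial count are illustrative choices.

```python
import random

def mean_range(n, trials=50_000, seed=0):
    """Average of max - min over many samples of n uniform(0, 1) draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        draws = [rng.random() for _ in range(n)]
        total += max(draws) - min(draws)
    return total / trials

sizes = (2, 5, 20)
mean_ranges = {n: mean_range(n) for n in sizes}

for n in sizes:
    theory = (n - 1) / (n + 1)
    print(f"n = {n:2d}: simulated mean range {mean_ranges[n]:.3f}, theory {theory:.3f}")
```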

There is one deep simplification worth carrying away. The order statistics of a uniform sample on [0, 1] are the universal building block, for the following reason: feed any continuous data through its own cdf F (the probability integral transform you met in the previous guide) and you get uniform values, and crucially *sorting commutes with that monotone transform*. So the order statistics of any continuous distribution are just F-inverse applied to the order statistics of a uniform sample. Uniform order statistics have an exceptionally clean form: for n i.i.d. uniforms on [0, 1], X_(k) follows a Beta distribution with parameters k and n - k + 1, with mean simply k / (n + 1).
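
The claim that sorting commutes with a monotone transform can be checked directly. The Python sketch below uses the rate-1 exponential as the target distribution, an assumption made for the example, with F-inverse(u) = -ln(1 - u). It compares "transform, then sort" with "sort, then transform" on a small seeded uniform sample; the name exp_quantile and the sample size are illustrative.

```python
import math
import random

def exp_quantile(u):
    """F-inverse for the rate-1 exponential: solves 1 - e^(-x) = u."""
    return -math.log(1.0 - u)

rng = random.Random(0)
uniforms = [rng.random() for _ in range(7)]

# Route 1: turn uniforms into exponential draws, then sort them.
transform_then_sort = sorted(exp_quantile(u) for u in uniforms)

# Route 2: sort the uniforms first, then push each through F-inverse.
sort_then_transform = [exp_quantile(u) for u in sorted(uniforms)]

same = transform_then_sort == sort_then_transform
print("sorting commutes with F-inverse:", same)
```

Because F-inverse is increasing, it preserves the order of the values, so both routes should produce the same list.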

That little formula is delightfully concrete. With n i.i.d. uniforms on [0, 1], the expected positions of the sorted values are 1/(n+1), 2/(n+1), ..., n/(n+1): the sorted points lay themselves out evenly, cutting the interval into n + 1 gaps of equal expected length 1/(n+1), the two end margins included. For n = 4 the expected sorted values are 0.2, 0.4, 0.6, 0.8: tidy fifths, not the quarters a beginner might guess. That single example captures the spirit of the whole rung: a function of random variables (here, sorting) produces a new, fully describable distribution, and the cdf method is the screwdriver that opens every one of them.
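
The n = 4 example can be confirmed with a small seeded simulation. This Python sketch averages each sorted position over many uniform samples and compares the averages with k/(n+1); the trial count and variable names are assumptions of the sketch.

```python
import random

n = 4
trials = 100_000
rng = random.Random(0)

# Running sums of the 1st, 2nd, ..., n-th smallest value across all samples.
sums = [0.0] * n
for _ in range(trials):
    ordered = sorted(rng.random() for _ in range(n))
    for i, value in enumerate(ordered):
        sums[i] += value

means = [s / trials for s in sums]
expected = [k / (n + 1) for k in range(1, n + 1)]

for k, (m, e) in enumerate(zip(means, expected), start=1):
    print(f"k = {k}: simulated mean {m:.3f}, theory {e:.3f}")
```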