Z-Scores and the 68-95-99.7 Rule

One bell to rule them all

In the previous guide you met the normal distribution X ~ Normal(mu, sigma^2): a whole family of bell curves, one for every choice of centre mu and spread sigma. That sounds like infinitely many shapes to learn, but here is the liberating fact: they are all the same shape, merely shifted along the axis and stretched or squeezed. A heights-in-centimetres bell and a test-scores bell differ only in where their peak sits and how wide they are; slide one along the axis and rescale it, and it becomes the other exactly. So instead of studying a thousand bells, we study just one and learn how to translate everything else into its language.

That one reference bell is the standard normal, written Z ~ Normal(0, 1): centred at mu = 0 with standard deviation sigma = 1. Its job is to be the universal yardstick. The recipe that turns any normal X into Z is short and is worth carving into memory: subtract the mean, then divide by the standard deviation. The number you get out is called the z-score, and it answers a single clean question — how many standard deviations is this value above (positive) or below (negative) the mean?

Written as a formula the recipe is Z = (X - mu) / sigma, and a few worked numbers fix it in place. Take a test where scores are X ~ Normal(70, 8^2), so mu = 70 and sigma = 8. A score of 86 gives z = (86 - 70)/8 = +2.0, meaning two standard deviations above average. A score of 62 gives z = (62 - 70)/8 = -1.0, one standard deviation below. A score of exactly 70 gives z = 0, dead centre. Notice that the same z = +2.0 would describe a height two sigma above the mean or a salary two sigma above the mean; that is the whole point of a common ruler.
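The recipe is short enough to write as a one-line function. The sketch below is only an illustration: the name z_score is made up for this guide, and the inputs are the test-score numbers used above (mu = 70, sigma = 8).

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies above (+) or below (-) mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# The three worked scores from the text, on a test with mu = 70, sigma = 8.
for score in (86, 62, 70):
    print(score, z_score(score, mu=70, sigma=8))
```

The loop prints z = 2.0 for a score of 86, -1.0 for 62 and 0.0 for 70, matching the hand calculations.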

Why standardising leaves the shape intact

Why is it safe to subtract and divide like this without breaking anything? Because of a rule you already proved a couple of rungs back about shifting and scaling a random variable. If you add a constant to X, the mean moves by that constant and the spread is untouched; if you multiply X by a constant, the mean scales by that constant and the standard deviation by its absolute value, so the variance scales by its square. Writing it cleanly: for the transform a + bX you get E[a + bX] = a + b E[X] and Var(a + bX) = b^2 Var(X). The z-score is exactly this transform with a = -mu/sigma and b = 1/sigma; see the scale-and-shift rule from the random-variables rung if it has gone fuzzy.

Run those rules on Z = (X - mu)/sigma and the magic falls out by plain arithmetic. The new mean is E[Z] = (E[X] - mu)/sigma = (mu - mu)/sigma = 0. The new variance is Var(Z) = Var(X)/sigma^2 = sigma^2/sigma^2 = 1, so its standard deviation is 1. And because shifting and scaling a normal variable yields another normal variable (a special, not generic, property of this family), Z is genuinely Normal(0, 1). Standardising therefore relocates the peak to zero and resets the ruler to one without distorting the curve, exactly as sliding and resizing a photo keeps every face in proportion.
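If you would like to see the arithmetic confirmed, Python's standard library offers statistics.NormalDist, which supports shifting and scaling by a constant. The sketch below assumes the test-score model from earlier (mu = 70, sigma = 8); the sample size and seed are arbitrary choices made for the illustration.

```python
from statistics import NormalDist, fmean, pstdev

# Test-score model from the text: mean 70, standard deviation 8.
X = NormalDist(mu=70, sigma=8)

# Shift by the mean, then scale by the standard deviation.
Z = (X - 70) / 8
print(Z.mean, Z.stdev)          # mean 0.0 and standard deviation 1.0
print(Z == NormalDist(0, 1))    # True: Z is the standard normal

# The same check on simulated data (seed and sample size are arbitrary).
scores = X.samples(50_000, seed=12345)
z_values = [(x - 70) / 8 for x in scores]
print(fmean(z_values), pstdev(z_values))  # close to 0 and 1, not exact
```

The first check is exact because NormalDist tracks only the mean and standard deviation; the simulated figures merely land near 0 and 1, as sampling noise allows.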

The 68-95-99.7 rule

Now that every normal speaks the same standardised language, one set of probabilities serves them all. The empirical rule — also called the three-sigma rule — says that for any normal distribution, about 68% of the probability lies within one standard deviation of the mean, about 95% within two, and about 99.7% within three. In z-score terms that is simply: P(-1 < Z < 1) is roughly 0.68, P(-2 < Z < 2) is roughly 0.95, and P(-3 < Z < 3) is roughly 0.997. These are not three separate facts to memorise per distribution; they are three facts about the single standard bell that you import into every problem.

Picture it on the curve. The bell is symmetric about zero, so the 68% inside one sigma leaves 32% in the two tails combined — 16% in each. The 95% inside two sigma leaves 5%, so 2.5% in each tail; this is the source of the famous "95% interval" used everywhere in statistics. Inside three sigma sits 99.7%, so only 0.3% of the probability — three parts in a thousand — falls beyond plus-or-minus three standard deviations, split as 0.15% per tail. A value out past z = 3 is genuinely rare for a normal variable, which is what makes the rule a quick lie-detector for unusual data.

  interval         z range       prob inside   prob in EACH tail
  mu +/- 1 sigma   -1 < Z < 1    ~ 0.68        ~ 0.16
  mu +/- 2 sigma   -2 < Z < 2    ~ 0.95        ~ 0.025
  mu +/- 3 sigma   -3 < Z < 3    ~ 0.997       ~ 0.0015

(symmetry: prob in each tail = (1 - prob inside) / 2)

The empirical rule, with the tail split that symmetry forces.
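The figures 68, 95 and 99.7 are rounded. A minimal sketch, using the standard library's statistics.NormalDist, recovers the unrounded values from the cumulative distribution function; the helper name prob_within is invented for this illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mu = 0, sigma = 1


def prob_within(k):
    """Probability that Z falls strictly between -k and +k."""
    return Z.cdf(k) - Z.cdf(-k)


for k in (1, 2, 3):
    inside = prob_within(k)
    each_tail = (1 - inside) / 2  # symmetry splits the remainder evenly
    print(f"within {k} sigma: {inside:.4f}   each tail: {each_tail:.5f}")
```

The inside probabilities come out at about 0.6827, 0.9545 and 0.9973, so the rule's round numbers are slightly off in the second and third decimal places, which is fine for quick mental estimates.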

Putting it to work: comparing and locating values

The z-score's first superpower is comparison across different scales. Suppose Mei scored 86 on a maths test that was Normal(70, 8^2) and Lin scored 84 on a history test that was Normal(75, 5^2). Who did better relative to their class? Compute z-scores: Mei is z = (86 - 70)/8 = +2.0, Lin is z = (84 - 75)/5 = +1.8. Mei sits two standard deviations above her class average and Lin sits 1.8 above hers, so Mei did slightly better in relative terms, even though the raw scores are close and on different tests. The z-score is what lets apples and oranges be ranked on one common ruler.
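The comparison can be sketched in a few lines. The variable names are made up for the example, and turning each z-score into a percentile assumes both tests really are normal, as the text supposes.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

# (raw score, class mean, class standard deviation) from the text.
mei_z = (86 - 70) / 8   # maths test
lin_z = (84 - 75) / 5   # history test

# Under the normality assumption, cdf(z) is the fraction of the class below.
mei_percentile = Z.cdf(mei_z)
lin_percentile = Z.cdf(lin_z)

print(f"Mei: z = {mei_z:.1f}, percentile = {mei_percentile:.3f}")
print(f"Lin: z = {lin_z:.1f}, percentile = {lin_percentile:.3f}")
print("Higher relative standing:", "Mei" if mei_z > lin_z else "Lin")
```

Ranking by z-score and ranking by percentile always agree, because the cumulative distribution function is increasing.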

State the model and the question: with X ~ Normal(70, 8^2), so mu = 70 and sigma = 8, what fraction of students score above 86?
Standardise the cutoff: z = (86 - 70)/8 = +2.0, so "X above 86" is the same event as "Z above 2".
Use the rule: inside +/- 2 sigma is ~95%, leaving ~5% in the two tails, so each tail is ~2.5%.
Read off the answer: P(Z > 2) is about 0.025, so roughly 2.5% of students score above 86.
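The four steps above can be sketched in code, which also shows how close the rule's rounded answer is to the value given by the cumulative distribution function. This uses the standard library's statistics.NormalDist; the variable names are illustrative only.

```python
from statistics import NormalDist

X = NormalDist(mu=70, sigma=8)   # model from the text
cutoff = 86

# Step 2: standardise the cutoff.
z = (cutoff - X.mean) / X.stdev  # 2.0

# Steps 3 and 4 via the empirical rule: 95% inside, remainder split in two.
p_rule = (1 - 0.95) / 2          # about 0.025

# The same upper-tail probability from the CDF, without rounding.
p_exact = 1 - X.cdf(cutoff)

print(f"z = {z:.1f}, rule: {p_rule:.4f}, exact: {p_exact:.4f}")
```

The CDF gives roughly 0.0228 where the rule gives 0.025, so the shortcut overstates the tail slightly but lands in the right neighbourhood.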

Run the same machine in reverse to find a value from a probability, for instance the cutoff for the top 2.5%, or the percentiles of a distribution. You first locate the z that leaves the desired tail probability (here z is about 2 for the top 2.5%; that is the rule's rounded figure, and the more precise value is about 1.96), then un-standardise by reversing the recipe: X = mu + z sigma = 70 + 2 * 8 = 86. This forward-and-back motion is exactly the quantile function at work: the cumulative distribution function turns a value into a left-tail probability, and the quantile function turns a probability back into a value.
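The reverse direction can be sketched with inv_cdf, the quantile function that statistics.NormalDist provides. The model is again the test-score example (mu = 70, sigma = 8), and the variable names are chosen for illustration.

```python
from statistics import NormalDist

Z = NormalDist()                 # standard normal
X = NormalDist(mu=70, sigma=8)   # test-score model from the text

# Top 2.5% means 97.5% of the probability lies to the left of the cutoff.
z_cut = Z.inv_cdf(0.975)         # about 1.96, which the rule rounds to 2

# Un-standardise: X = mu + z * sigma.
x_cut = X.mean + z_cut * X.stdev

# Round trip: the CDF turns the value back into the left-tail probability.
p_back = X.cdf(x_cut)

print(f"z = {z_cut:.3f}, cutoff = {x_cut:.2f}, left-tail prob = {p_back:.3f}")
```

The cutoff comes out near 85.7 rather than 86, the small gap being the price of rounding 1.96 up to 2. Calling X.inv_cdf(0.975) directly gives the same cutoff in one step.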

What the rule is not, and where it breaks

The empirical rule is a fact about the normal distribution, not about data in general. If your data is skewed, heavy-tailed, or bimodal, the 68-95-99.7 percentages can be wildly off — there is nothing forcing real measurements to be bell-shaped. A blunter, weaker bound called Chebyshev's inequality (a tail bound you met among the inequalities) does hold for any distribution with finite variance: at least 75% of the probability lies within two sigma and at least 89% within three. Notice how much weaker those guarantees are than 95% and 99.7%; the extra tightness of the empirical rule is the reward for the strong assumption of normality, and it evaporates the moment that assumption fails.
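To make the contrast concrete, the sketch below lines up three things for k = 1, 2, 3: the Chebyshev lower bound 1 - 1/k^2, the normal probability, and the probability for one skewed example. The exponential distribution with rate 1 (mean 1, standard deviation 1) is brought in here purely as an illustration of skewed data; it is not part of the guide's running example, and the function names are invented.

```python
import math
from statistics import NormalDist


def chebyshev_lower_bound(k):
    """Guaranteed minimum probability within k sigma, for any finite variance."""
    return 1 - 1 / k**2


def normal_within(k):
    """Probability within k sigma for a normal distribution."""
    Z = NormalDist()
    return Z.cdf(k) - Z.cdf(-k)


def exponential_within(k):
    """Probability within k sigma for Exponential(rate=1): mean 1, sigma 1."""
    def cdf(x):
        return 1 - math.exp(-x) if x > 0 else 0.0
    return cdf(1 + k) - cdf(1 - k)


for k in (1, 2, 3):
    print(f"k={k}: Chebyshev >= {chebyshev_lower_bound(k):.3f}, "
          f"normal = {normal_within(k):.3f}, "
          f"exponential = {exponential_within(k):.3f}")
```

For the exponential example about 86% of the probability sits within one sigma, not 68%, and about 98.2% within three, not 99.7%. Every figure still respects the Chebyshev bound, as it must.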

Two more traps deserve a word. The z-score by itself does not tell you a probability unless the variable really is normal — for a t-distribution or other heavy-tailed shape, a z of 3 is far less rare than 0.15% per tail, because more probability lives out in the tails. And remember from the density discussion: a z-score is a location, not a probability, and the height of the bell at that location is a density, not a probability. Probabilities for a continuous variable are areas under the curve between two points, which is exactly why every question above became "how much area lies beyond this z?" rather than "what is the curve's height here?".
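A quick way to convince yourself that height is not probability is to look at a narrow bell. In the sketch below the choice sigma = 0.1 is arbitrary and made only so the peak rises above 1, something no probability could do.

```python
from statistics import NormalDist

narrow = NormalDist(mu=0, sigma=0.1)  # arbitrary narrow bell for illustration

# Height of the curve at its peak: a density, and here it exceeds 1.
height = narrow.pdf(0)

# Area between one sigma below and one sigma above the mean: a probability.
area = narrow.cdf(0.1) - narrow.cdf(-0.1)

print(f"density at the peak: {height:.3f}")
print(f"probability within one sigma: {area:.4f}")
```

The peak height is close to 4, while the area within one sigma is still about 0.68, the same as for every other normal distribution.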