The Normal Distribution and the Bell Curve

The shape that keeps coming back

So far in this rung you have met models with sharp personalities. The uniform is a flat slab: every value in its range is equally favoured. The exponential is a downhill slope that starts high and decays, the memoryless waiting time of the previous guides. The normal distribution, by contrast, is the one with no edges and no corners — a smooth, symmetric hill that rises to a single peak in the middle and tapers off gently on both sides. People call it the bell curve for the obvious reason: drawn on paper, it looks like the silhouette of a bell.

What makes the normal worth a whole guide of its own is not its prettiness but its stubborn ubiquity. Heights of adult women, errors in a careful measurement, the daily noise in a sensor reading, the total of many small independent nudges: over and over, when you collect such data and draw a histogram, the same hill appears. It is so common that for two centuries it was simply called the error curve, and under its other name, the Gaussian, it was printed beside Gauss's portrait on the old German ten-mark note. We will see later why this is no coincidence; it falls out of a theorem, not luck.

Two knobs: where it sits and how wide it spreads

Every normal distribution is fully described by exactly two numbers, and they have refreshingly direct meanings. The first is the mean mu (the Greek letter mu), which is also the median and the mode here — by symmetry the peak sits right over mu. Slide mu and the whole bell slides left or right along the axis without changing shape at all; mu is the location knob. The second is the standard deviation sigma, which controls width. A small sigma gives a tall, narrow, concentrated spike; a large sigma gives a short, fat, spread-out mound. Crucially, sliding or stretching the bell never breaks its normality: a shifted, rescaled normal is still normal.
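The two knobs can be tried out directly. The sketch below leans on Python's standard-library statistics.NormalDist, which supports multiplying a normal by a constant and adding a constant to it; the values 70 and 10 are purely illustrative assumptions.

```python
from statistics import NormalDist

standard = NormalDist(mu=0.0, sigma=1.0)   # the reference bell
mu, sigma = 70.0, 10.0                     # illustrative location and width

# Stretch by sigma, then slide by mu: the result is still a normal.
moved = standard * sigma + mu
print(moved.mean, moved.stdev)

# The peak sits over mu, and mean, median and mode coincide there.
print(moved.median, moved.mode)

# A wider bell is a shorter bell: peak height falls as sigma grows.
for s in (0.5, 1.0, 2.0):
    print(s, NormalDist(0.0, s).pdf(0.0))
```

The last loop prints a peak height that halves each time sigma doubles, which is the 'tall and narrow versus short and fat' trade-off in numbers.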

Here is the actual formula for the density of a Normal(mu, sigma^2). Do not be alarmed by it — almost no one computes with it by hand. f(x) = (1 / (sigma * sqrt(2*pi))) * e^(-(x - mu)^2 / (2*sigma^2)). You can read its shape straight off the algebra. The term (x - mu)^2 measures how far x is from the centre; the minus sign and the exponential e^(-...) turn 'far from the centre' into 'rapidly vanishing height', which is exactly why the tails drop off. The 2*sigma^2 in the denominator says that the bigger sigma is, the slower that drop-off, hence a wider bell. The clutter out front, 1/(sigma*sqrt(2*pi)), is just the bookkeeping constant that makes the total area under the curve equal to 1.

               1
f(x) = ----------------- * e^(-(x - mu)^2 / (2 * sigma^2))
        sigma*sqrt(2pi)

  (x - mu)^2   -> distance from the centre, squared
  e^(-...)     -> turns distance into fast-fading height (the tails)
  2*sigma^2    -> larger sigma => slower fade => wider bell
  1/(sigma*sqrt(2pi)) -> makes the total area equal 1

  X ~ Normal(mu, sigma^2):  E[X] = mu,   Var(X) = sigma^2

The normal density, with each piece labelled by the job it does in shaping the bell.
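If you would like to see the formula at work, here is a minimal sketch that types it in term by term and checks the 'area equals 1' bookkeeping numerically. The helper names normal_pdf and area_under are hypothetical, and the midpoint rule is just one simple way to approximate an integral.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the Normal(mu, sigma^2) density at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))   # makes the total area 1
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)   # squared distance, scaled by spread
    return coeff * math.exp(exponent)

def area_under(f, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

# Larger sigma lowers the peak, but the total area stays (approximately) 1.
for sigma in (0.5, 1.0, 2.0):
    peak = normal_pdf(0.0, 0.0, sigma)
    total = area_under(lambda x: normal_pdf(x, 0.0, sigma), -10 * sigma, 10 * sigma)
    print(f"sigma={sigma}: peak={peak:.4f}, area={total:.6f}")
```

Integrating over ten sigmas on each side is an arbitrary cut-off; the tails beyond it hold a negligible sliver of the area.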

A density is a height, not a probability

It is tempting to read f(x) as 'the probability that X equals x', but for any continuous variable that is simply wrong, and the normal is no exception. The value f(x) is a HEIGHT of the curve, not a probability — and indeed for a tall, narrow bell f(x) can exceed 1, which no probability ever could. This is the density-is-not-probability point you met earlier in the rung, and it bites hardest here because the bell looks so concrete. For a continuous X, the probability of landing exactly on any single number is zero: P(X = mu) = 0, peak and all. There is no contradiction — the peak is where outcomes pile up densely, not where any one outcome carries weight.
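A quick numerical illustration may help here. The sketch below assumes a deliberately narrow bell (sigma = 0.1, an arbitrary choice) and uses the standard library's statistics.NormalDist to compare the height at the peak with the probability of ever-smaller windows around it.

```python
from statistics import NormalDist

narrow = NormalDist(mu=0.0, sigma=0.1)   # a tall, narrow bell (illustrative)

peak_height = narrow.pdf(0.0)            # a density height, and it is above 1
print(f"height at the peak: {peak_height:.3f}")

# Probability is area: shrink the window around the peak and it heads to zero.
for half_width in (0.1, 0.01, 0.001, 0.0001):
    prob = narrow.cdf(half_width) - narrow.cdf(-half_width)
    print(f"P(|X| <= {half_width}) = {prob:.6f}")
```

The height never changes, yet the probability of the window shrinks with its width, which is the sense in which P(X = mu) = 0.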

Probability for a continuous variable lives in AREAS, not heights. P(a <= X <= b) is the area under the bell between a and b — the integral of f from a to b. Because a single point has zero width it has zero area, which is why P(X = x) = 0, and it is also why you can be casual about endpoints: P(X < b) and P(X <= b) are equal here. The whole curve encloses area 1, the total probability. To get the probability that X falls within one standard deviation of the mean, you shade the strip from mu - sigma to mu + sigma and measure its area; it comes out near 0.68, the first number of the famous rule the next guide is devoted to.
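The 'probability is area' idea can be sketched in a few lines. The code below assumes the standard closed form of the normal cumulative area in terms of the error function math.erf; the function names and the values mu = 5, sigma = 2 are hypothetical choices for illustration.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ Normal(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_between(a, b, mu=0.0, sigma=1.0):
    """Area under the bell between a and b."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)

mu, sigma = 5.0, 2.0   # illustrative values; any mu and sigma give the same answer
within_one_sigma = prob_between(mu - sigma, mu + sigma, mu, sigma)
print(f"P(mu - sigma <= X <= mu + sigma) = {within_one_sigma:.4f}")   # 0.6827

# A single point has zero width, hence zero area.
print(prob_between(mu, mu, mu, sigma))   # 0.0
```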

One bell rules them all: standardising

Because every normal is the same shape just shifted by mu and stretched by sigma, there is a single master bell that all others reduce to: the standard normal, Z ~ Normal(0, 1), centred at 0 with standard deviation 1. The trick that performs the reduction is the z-score. Given X ~ Normal(mu, sigma^2), define Z = (X - mu) / sigma. Subtracting mu re-centres the bell at 0; dividing by sigma rescales it so that its standard deviation is 1 (a squeeze when sigma is above 1, a stretch when it is below). The remarkable fact is that this new Z is exactly standard normal, no matter what mu and sigma you started with. So a z-score answers one clean question: how many standard deviations above or below the mean is this value?

This is why a single table or a single function on your calculator can handle every normal problem in the world. Suppose adult resting heart rate is roughly Normal(mu = 70, sigma^2 = 100), so sigma = 10 beats per minute. Someone with a rate of 90 has z = (90 - 70) / 10 = 2.0 — two standard deviations above average. A rate of 55 has z = (55 - 70) / 10 = -1.5, one and a half below. The z-score has stripped away the units (beats, dollars, centimetres) and left a pure position on the universal bell, from which any probability can be looked up. The next guide turns these z-scores into the 68-95-99.7 rule and reads real probabilities off them.
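The heart-rate arithmetic above is short enough to check by machine. In this minimal sketch the helper name z_score is hypothetical, and the height figures in the second half are made-up values used only to show that the units cancel.

```python
import math

def z_score(x, mu, sigma):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

mu, sigma = 70.0, 10.0            # resting heart rate model from the text
for rate in (90.0, 55.0):
    print(rate, z_score(rate, mu, sigma))   # 2.0, then -1.5

# Units drop out: the same height in centimetres or metres has the same z.
z_cm = z_score(180.0, 170.0, 8.0)     # illustrative numbers, in cm
z_m = z_score(1.80, 1.70, 0.08)       # the same numbers, in m
print(math.isclose(z_cm, z_m))
```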

1. Write down the model: X ~ Normal(mu, sigma^2), and read off mu (the centre) and sigma (the spread, the square root of the variance).
2. Standardise the value of interest: Z = (X - mu) / sigma. This converts your value into 'how many sigmas from the mean'.
3. Look up the area for that z on the standard normal, whether from a table, a calculator, or the 68-95-99.7 rule for the round cases.
4. Translate the area back into the probability you actually wanted (above, below, or between), remembering symmetry: the area below -z equals the area above +z.
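The recipe can be sketched end to end for one question: how likely is a resting heart rate above 90 under the model from the text? The 'table lookup' here is assumed to be the standard error-function formula for the standard normal area; the function name standard_normal_cdf is hypothetical.

```python
import math

def standard_normal_cdf(z):
    """Area under the standard normal to the left of z (the 'table lookup')."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Write down the model and read off mu and sigma.
mu, variance = 70.0, 100.0
sigma = math.sqrt(variance)

# Standardise the value of interest.
z = (90.0 - mu) / sigma

# Look up the area below that z on the standard normal.
area_below = standard_normal_cdf(z)

# Translate: we wanted the area ABOVE 90, the complement of the area below.
prob_above = 1.0 - area_below
print(f"z = {z}, P(X > 90) = {prob_above:.4f}")   # z = 2.0, P(X > 90) = 0.0228

# Symmetry: the area below -z matches the area above +z.
print(math.isclose(standard_normal_cdf(-z), prob_above))
```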

Why the bell is everywhere: a glimpse of the CLT

Now the deep question: why does this one shape govern heights, errors, and so much else, when nothing about a person's height looks like e^(-(x-mu)^2/...)? The answer is the central limit theorem, the crown jewel waiting at the top of this ladder. Loosely, it says that when you ADD UP many small, independent influences — none of them dominating — their total tends toward a normal distribution, almost regardless of how each individual influence is shaped. Height is the sum of many genetic and nutritional nudges; a measurement error is the sum of many tiny disturbances. Add enough small independent pieces and the bell emerges on its own.
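You can watch the bell emerge in a small simulation. The sketch below is an illustration under assumed settings, not a proof: each 'influence' is taken to be a flat uniform draw on [0, 1), thirty of them are added per total, and the seed is fixed only so that the run is repeatable.

```python
import random
import statistics

random.seed(0)   # fixed seed so the run is repeatable

def sum_of_uniforms(n_terms):
    """Total of n_terms independent Uniform(0, 1) nudges (none is bell-shaped)."""
    return sum(random.random() for _ in range(n_terms))

n_terms, n_samples = 30, 20_000
totals = [sum_of_uniforms(n_terms) for _ in range(n_samples)]

mean = statistics.fmean(totals)
sd = statistics.stdev(totals)

# For a normal, about 68% of the mass lies within one standard deviation.
within_one_sd = sum(abs(t - mean) <= sd for t in totals) / n_samples
print(f"mean={mean:.3f}  sd={sd:.3f}  within one sd={within_one_sd:.3f}")
```

Each uniform has mean 1/2 and variance 1/12, so the totals should centre near 15 with a standard deviation near sqrt(30/12), and the fraction within one standard deviation should land close to the bell's 0.68 even though no single ingredient was bell-shaped.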

The normal has one more gift that makes it the natural attractor here: it is closed under addition. If X and Y are independent normals, then X + Y is again exactly normal, with the means and the VARIANCES adding: E[X+Y] = E[X] + E[Y] and, by independence, Var(X+Y) = Var(X) + Var(Y). Notice the variances add, not the standard deviations — so two independent Normal(0, 1) variables sum to Normal(0, 2), whose spread is sqrt(2), not 2. This stability is why, once a quantity is built from many normal-ish additive parts, it stays a single clean bell instead of degenerating into some lumpy mess, and it is the thread we will pull all the way to the central limit theorem.
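The 'variances add, standard deviations do not' point is easy to get wrong, so a quick check is worthwhile. This sketch uses the standard library's statistics.NormalDist, whose addition of two instances treats them as independent, and backs it up with a seeded simulation; the sample size is an arbitrary choice.

```python
import math
import random
import statistics
from statistics import NormalDist

x = NormalDist(mu=0.0, sigma=1.0)
y = NormalDist(mu=0.0, sigma=1.0)

total = x + y   # sum of two independent normals, handled analytically
print(total.mean, total.variance, total.stdev)   # variance near 2, stdev near sqrt(2)

# Cross-check by simulation: add independent draws and measure the spread.
random.seed(1)
sums = [random.gauss(0.0, 1.0) + random.gauss(0.0, 1.0) for _ in range(50_000)]
sample_var = statistics.variance(sums)
print(f"simulated variance of X + Y: {sample_var:.3f}")
```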

When the bell is the wrong model

Because the normal is so beloved, the real skill is knowing when NOT to use it. Three honest warning signs. First, skew: incomes, house prices, and city populations have a long right tail and a hard floor at zero. A symmetric bell with its left tail running into negative values is a poor fit, and the log-normal often serves better. Second, hard boundaries: a normal puts positive probability on every interval of the real line, negative stretches included, so it can never be exactly right for a quantity that cannot be negative, like a waiting time or a weight; it can only be a convenient approximation when mu sits many sigmas above zero.
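How many sigmas above zero is 'many'? A rough sketch: compute the probability a normal model would assign to impossible negative values, for a few assumed ratios of mu to sigma. The helper name mass_below_zero is hypothetical.

```python
from statistics import NormalDist

def mass_below_zero(mu, sigma):
    """Probability a Normal(mu, sigma^2) model puts on negative values."""
    return NormalDist(mu, sigma).cdf(0.0)

# mu expressed in units of sigma: 1, 2, 3 and 5 sigmas above zero.
for ratio in (1.0, 2.0, 3.0, 5.0):
    leaked = mass_below_zero(mu=ratio, sigma=1.0)
    print(f"mu = {ratio:.0f} sigma: P(X < 0) = {leaked:.2e}")
```

With mu only one sigma above zero the model spends roughly 16% of its probability on values that cannot occur; by five sigmas the leak is below one in a million, which is when the approximation becomes harmless.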

Third, fat tails. The normal's tails fade extraordinarily fast: values beyond four or five sigma are so rare they are practically forbidden. But many real systems, especially in finance and in the sizes of natural disasters, produce extreme events far more often than a bell predicts. Modelling such data as normal can lull you into thinking a five-sigma crash 'cannot happen', when in truth the right model has heavier tails. The cruellest example is the Cauchy distribution: it looks like a bell at a glance, yet its tails are so heavy that it has NO finite mean and NO finite variance, which is exactly why the central limit theorem refuses to apply to it. Looking bell-shaped is not the same as being normal.
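To put rough numbers on 'fat', the sketch below compares two-sided tail probabilities for the standard normal and the standard Cauchy. It assumes the textbook Cauchy cumulative area 1/2 + atan(x)/pi, and the comparison is only qualitative: k counts standard deviations for the normal but scale units for the Cauchy, which has no standard deviation to count in.

```python
import math

def normal_two_sided_tail(k):
    """P(|Z| > k) for a standard normal Z, which equals erfc(k / sqrt(2))."""
    return math.erfc(k / math.sqrt(2.0))

def cauchy_two_sided_tail(k):
    """P(|C| > k) for a standard Cauchy C, from its CDF 1/2 + atan(x)/pi."""
    return 1.0 - (2.0 / math.pi) * math.atan(k)

for k in (1, 2, 3, 4, 5):
    print(f"k={k}: normal={normal_two_sided_tail(k):.2e}  "
          f"cauchy={cauchy_two_sided_tail(k):.2e}")
```

At k = 5 the normal tail is below one in a million while the Cauchy still keeps more than a tenth of its probability out there, despite the two curves looking alike near the centre.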