Expectation: The Long-Run Average

What number would you bet on?

You have spent the earlier rungs learning what a random variable is — a rule that attaches a number to each outcome of a chance experiment — and how its probabilities are spread out across the possible values. A whole distribution is a lot to hold in your head, though. Often you want a single number that summarizes "where it sits": one figure you could plan around, bet on, or report. The most important such summary is the expectation, written E[X], and it answers a precise question — if you ran this random variable again and again and averaged the results, what number would that average settle on?

Start with the most familiar average. If a class scores 70, 80, 80, and 90 on a test, the average is (70 + 80 + 80 + 90) / 4 = 80. Notice 80 appeared twice and so counts twice — already this is a kind of weighting. Now imagine you do not have a finished list of four scores but instead a random variable X that takes the value 70 with probability 1/4, 80 with probability 1/2, and 90 with probability 1/4. The natural average is to weight each value by how often it shows up: 70 times 1/4 plus 80 times 1/2 plus 90 times 1/4, which is again 80. That weighting-by-probability is the whole idea.

The weighted-average definition

For a discrete random variable, the expectation is the sum of each value times its probability: E[X] = sum over x of x times P(X = x). Every value pulls on the average in proportion to its probability mass — a value the variable almost never takes barely tugs at all, while a likely value pulls hard. A vivid mechanical picture makes this stick: imagine the number line as a thin ruler, and place a lump of weight P(X = x) at each value x. The expectation E[X] is exactly the balance point, the spot where the ruler would sit level on a fingertip. Probability is mass; expectation is the centre of mass.
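The balance-point picture can be checked numerically. The sketch below is only an illustration: the helper name expectation and the dictionary representation of the distribution are assumptions made for this example, and the distribution is the test-score one used earlier. It computes the weighted average and then confirms that the signed distances from that point, weighted by probability mass, cancel out.

```python
from fractions import Fraction

def expectation(pmf):
    """Weighted average of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

# Test-score distribution from the text: 70, 80, 90 with masses 1/4, 1/2, 1/4.
pmf = {70: Fraction(1, 4), 80: Fraction(1, 2), 90: Fraction(1, 4)}

mean = expectation(pmf)

# Net "torque" about the mean: each lump's signed distance times its mass.
# A ruler balances exactly where this sum is zero.
net_torque = sum((x - mean) * p for x, p in pmf.items())

print(mean)        # 80
print(net_torque)  # 0
```

Exact fractions are used so that the zero is exact rather than a rounding accident; with floats the same check would need a small tolerance.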

What if X is continuous, with no separate lumps of mass but a smooth density f(x) instead? The sum becomes an integral, with the density playing the role the probabilities played before: E[X] = integral of x times f(x) dx over all x. The picture is identical: a continuous bar of clay whose thickness at each point is f(x), balanced at its centre of mass. One piece of fine print matters here: a density f(x) is not a probability (it can even exceed 1), and at any single point the probability is exactly zero. The density only becomes probability once you integrate it over an interval, so we weight by f(x) dx, never by f(x) alone.

Discrete:    E[X] = sum_x  x * P(X = x)
Continuous:  E[X] = integral  x * f(x) dx

Example (discrete), the test scores:
  E[X] = 70*(1/4) + 80*(1/2) + 90*(1/4)
       = 17.5 + 40 + 22.5
       = 80

Example (continuous), X ~ Uniform(0, 10):
  f(x) = 1/10 for 0 <= x <= 10
  E[X] = integral_0^10  x * (1/10) dx
       = (1/10) * [x^2 / 2]_0^10
       = (1/10) * 50 = 5     (the midpoint, as the balance picture predicts)

Discrete sums and continuous integrals are the same weighted average — value times probability mass.
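To see the integral as "a sum of value times a thin slice of probability", here is a minimal numerical sketch for the Uniform(0, 10) example. The function names and the choice of the midpoint rule are assumptions for illustration, not part of the guide; any reasonable quadrature would serve.

```python
def expectation_by_midpoint_rule(density, lo, hi, steps=100_000):
    """Approximate the integral of x * f(x) dx over [lo, hi] with thin slices."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width        # midpoint of the i-th slice
        total += x * density(x) * width   # value times (density times width)
    return total

def uniform_density(x):
    """Density of Uniform(0, 10): constant 1/10 on [0, 10], zero elsewhere."""
    return 0.1 if 0 <= x <= 10 else 0.0

approx_mean = expectation_by_midpoint_rule(uniform_density, 0.0, 10.0)
print(approx_mean)  # very close to 5, the midpoint
```

Each term in the loop is x times f(x) times a small width, which is the discrete formula with f(x) dx standing in for P(X = x).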

Expectation is not the most likely value

Here is the most common beginner trap, worth meeting head-on. The expectation is an average, not a typical outcome, and it need not be a value the variable can even take. Roll a fair six-sided die: E[X] = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5. You will never roll a 3.5 — yet it is the correct long-run average per roll. The expectation is the centre of gravity, and a balance point can sit between the weights, in empty space. Saying "I expect a 3.5" is loose English; the precise meaning is purely about the long-run average.

The expectation can also mislead in a different way when the distribution is lopsided. The single most probable value has its own name, the mode, and the value that splits the probability in half is the median; neither has to equal E[X]. With a long tail on one side, a few extreme outcomes drag the mean far from where most of the probability sits. A village where everyone earns a modest wage but one resident is a billionaire has a sky-high average income that describes nobody. This gap between the mean and the typical value is exactly the caution flagged under "when the mean misleads": the mean is one honest summary, but it is not the only one, and for skewed, heavy-tailed quantities it is often a poor description of a typical outcome.
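The village story can be made concrete with a few lines of code. The numbers below are invented purely for illustration (99 residents on a wage of 30,000 and one resident with an income of one billion); only the qualitative gap between the three summaries matters.

```python
import statistics

# Hypothetical village: 99 modest earners and one billionaire (made-up figures).
incomes = [30_000] * 99 + [1_000_000_000]

mean_income = statistics.mean(incomes)      # pulled far upward by one resident
median_income = statistics.median(incomes)  # splits the residents in half
mode_income = statistics.mode(incomes)      # the most common income

# How many residents actually earn at least the "average" income?
at_or_above_mean = sum(1 for income in incomes if income >= mean_income)

print(mean_income, median_income, mode_income, at_or_above_mean)
```

With these figures the mean is just over ten million, while the median and mode both sit at the modest wage, and only one resident earns at least the mean.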

Why "long-run average" is more than a slogan

Calling E[X] the long-run average is not just suggestive language — there is a theorem that makes it literally true. Suppose you draw independent copies X_1, X_2, ..., X_n from the same distribution and form their sample average (X_1 + ... + X_n) / n. The law of large numbers says this sample average closes in on E[X] as n grows: roll a die ten thousand times and the running average will hover near 3.5. This is what justifies the whole interpretation. The expectation is precisely the value that repeated experience converges to.
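A short simulation shows the running average of die rolls settling near 3.5. This is a sketch under stated assumptions: the seed value is arbitrary and fixed only for reproducibility, and the helper name running_averages is made up for this example.

```python
import random

def running_averages(rolls):
    """Return the average of the first n rolls, for n = 1, 2, ..., len(rolls)."""
    total = 0
    averages = []
    for n, roll in enumerate(rolls, start=1):
        total += roll
        averages.append(total / n)
    return averages

rng = random.Random(0)  # arbitrary fixed seed so the run is repeatable
rolls = [rng.randint(1, 6) for _ in range(10_000)]
averages = running_averages(rolls)

# Early averages wander; later ones hover close to E[X] = 3.5.
for n in (10, 100, 1_000, 10_000):
    print(n, round(averages[n - 1], 3))
```

The exact printed values depend on the seed, but the drift toward 3.5 as n grows does not.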

Two honest cautions keep this from being misunderstood. First, the law of large numbers is about the average, not about sums "evening out." After a streak of low rolls, the average drifts back toward 3.5 because later rolls dilute the early ones, not because the dice owe you anything; the gap between the running total and its expected value of 3.5n can actually keep growing even as the average converges. Believing the dice will "correct" themselves is the gambler's fallacy: independent trials have no memory, and a fair coin that just landed heads five times is still exactly 50/50 on the next toss. Second, the theorem needs the expectation to exist in the first place. For the Cauchy distribution, whose tails are so heavy that E[X] is undefined, the sample average never settles down, no matter how large n gets.
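One way to see the second caution is to repeat the whole experiment many times and ask how much the sample averages disagree with one another. The sketch below is an illustration, not a proof. The seed, the repeat count, and the use of the interquartile range as a measure of disagreement are all choices made for this example, and the Cauchy draws use the standard inverse-CDF formula tan(pi * (U - 1/2)).

```python
import math
import random
import statistics

def roll_die(rng):
    return rng.randint(1, 6)

def standard_cauchy(rng):
    """Draw one standard Cauchy value by inverting its CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))

def sample_mean(draw, n, rng):
    return sum(draw(rng) for _ in range(n)) / n

def interquartile_range(values):
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

rng = random.Random(1)  # arbitrary fixed seed
repeats = 200           # independent repetitions of each experiment
spread = {}
for n in (10, 1_000):
    for name, draw in (("die", roll_die), ("cauchy", standard_cauchy)):
        means = [sample_mean(draw, n, rng) for _ in range(repeats)]
        spread[(name, n)] = interquartile_range(means)

# For the die, the spread of sample means shrinks as n grows.
# For the Cauchy, it stays roughly the same size.
for key, value in sorted(spread.items()):
    print(key, round(value, 3))
```

For the die, going from 10 to 1,000 rolls makes the averages agree far more closely. For the Cauchy, averaging more draws buys essentially nothing, which is what "never settles down" looks like in practice.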

Where this rung is heading

Expectation is the foundation stone for everything in this rung, so it helps to see the road ahead. So far you can only take the expectation of X itself, but you will constantly want the expectation of some function of X — like X^2, or a payoff g(X). The next guide gives the clean shortcut for that, the law of the unconscious statistician, which lets you compute E[g(X)] by reweighting g(x) with the same probabilities, no new distribution needed.
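As a small preview of that shortcut, the sketch below computes E[g(X)] two ways for g(x) = x^2: by reweighting g(x) with the probabilities of X, and by first building the distribution of Y = g(X) explicitly. The example distribution (equal mass on -2, -1, 0, 1, 2) and the helper names are assumptions chosen for illustration.

```python
from collections import defaultdict
from fractions import Fraction

def expectation_of_function(g, pmf):
    """Shortcut: sum of g(x) * P(X = x), using X's own probabilities."""
    return sum(g(x) * p for x, p in pmf.items())

def distribution_of(g, pmf):
    """Long way, step 1: build the distribution of Y = g(X)."""
    out = defaultdict(Fraction)
    for x, p in pmf.items():
        out[g(x)] += p   # values of x that map to the same y pool their mass
    return dict(out)

def square(x):
    return x * x

# Hypothetical example: X is equally likely to be -2, -1, 0, 1, or 2.
pmf_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}

shortcut = expectation_of_function(square, pmf_x)

pmf_y = distribution_of(square, pmf_x)            # masses on 0, 1, and 4
long_way = sum(y * p for y, p in pmf_y.items())   # long way, step 2

mean_x = sum(x * p for x, p in pmf_x.items())

print(shortcut, long_way)   # 2 2
print(square(mean_x))       # 0
```

Both routes give 2, while squaring E[X] gives 0, a reminder that E[g(X)] is generally not g(E[X]).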

After that comes the single most useful property of expectation: linearity, the rule E[aX + bY] = a E[X] + b E[Y]. Its quiet superpower is that it holds whether or not X and Y are independent — you may add expectations even when the variables are tangled together, which makes otherwise brutal calculations melt away. Then we measure spread, not just centre, with the variance Var(X) = E[X^2] - (E[X])^2 and its square root the standard deviation; and we close the rung with the higher moments, the moment generating function, skewness, and the shape of a distribution. Master the balance-point idea here and the rest of the rung becomes a series of natural next steps.
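Linearity without independence and the variance formula can both be checked exactly on a fair die. In the sketch below, Y = 7 - X is a deliberately dependent partner for X (knowing one determines the other), and the coefficients a = 2, b = 3 are arbitrary; these choices and the helper name expect are assumptions made for the example.

```python
import math
from fractions import Fraction

outcomes = range(1, 7)
p = Fraction(1, 6)   # fair die: each face has probability 1/6

def expect(f):
    """Expectation of f(X) for a fair die roll X."""
    return sum(f(x) * p for x in outcomes)

# Y = 7 - X is completely determined by X, so X and Y are far from independent.
e_x = expect(lambda x: x)        # 7/2
e_y = expect(lambda x: 7 - x)    # 7/2

a, b = 2, 3
lhs = expect(lambda x: a * x + b * (7 - x))   # E[aX + bY], computed directly
rhs = a * e_x + b * e_y                       # a E[X] + b E[Y]

# Variance via Var(X) = E[X^2] - (E[X])^2, then the standard deviation.
variance = expect(lambda x: x * x) - e_x ** 2
std_dev = math.sqrt(variance)

print(lhs, rhs)     # 35/2 35/2
print(variance)     # 35/12
print(round(std_dev, 4))
```

The two sides of the linearity rule agree exactly even though X and Y move in lockstep, which is the point of the "quiet superpower" described above.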