Markov's Inequality: A Bound from the Mean

One number, a promise about the tail

You have spent the earlier rungs learning to describe a random variable in full — its whole distribution, its expectation, its variance, its generating functions. But often you do not know the full distribution. You know almost nothing except, say, the average. The remarkable claim of this rung is that even a single summary number already forbids certain behavior. Markov's inequality is the first and simplest such promise: if a quantity cannot be negative, its mean alone caps how often it can be far above that mean.

Here is the statement. Let X be a random variable that is never negative, so X is at least 0 always, and let a be any positive number. Then P(X >= a) <= E[X] / a. In words: the chance that X reaches the threshold a is at most the mean divided by that threshold. The bigger the threshold relative to the mean, the smaller the chance — and you get this guarantee knowing nothing about X except E[X] and the fact that it is nonnegative.

A concrete picture: suppose the average income on a street is 50 (in some unit), and income is never negative. Markov says at most 1/4 of households can earn 200 or more, because E[X]/a = 50/200 = 1/4. It says nothing about who they are or what the distribution looks like — just that no more than a quarter can sit that high. Notice how little input bought how strong a conclusion.

Why it is true: the indicator trick

The proof is short and worth owning, because the same move reappears all through this rung. The key is the indicator random variable: define a quantity that equals 1 when the event {X >= a} happens and 0 otherwise. Call it 1{X >= a}. Its expectation is exactly the probability we want, because E[1{X >= a}] = P(X >= a) — averaging a 0/1 variable just gives the fraction of the time it is 1.

Now the one clever line. Compare X with the scaled indicator a * 1{X >= a}, outcome by outcome. If X happens to be at least a, the indicator is 1, so the right side is a, and X >= a means X is at least that. If X is below a, the indicator is 0, so the right side is 0, and X >= 0 because X is nonnegative. So in every single outcome, X >= a * 1{X >= a}. A pointwise inequality survives taking expectations, so E[X] >= a * E[1{X >= a}] = a * P(X >= a). Divide both sides by the positive number a and you have P(X >= a) <= E[X] / a.

If X >= a:   1{X>=a} = 1,  so a*1{X>=a} = a   and  X >= a   = a*1{X>=a}
If X <  a:   1{X>=a} = 0,  so a*1{X>=a} = 0   and  X >= 0   = a*1{X>=a}

In both cases:    X  >=  a * 1{X >= a}
Take E[.]:     E[X]  >=  a * P(X >= a)
Divide by a:   P(X >= a)  <=  E[X] / a

The whole proof: X dominates the scaled indicator in every outcome, so expectations keep the inequality.

Nonnegativity is the whole engine

Read the proof again and notice exactly where X being nonnegative was used: in the case X < a, we needed the right side 0 to be no bigger than X, which holds only because X >= 0. Drop that assumption and the inequality can fail outright. Imagine a variable that is +1000 with probability 1/2 and -1000 with probability 1/2. Its mean is 0, so the naive Markov bound E[X]/a would say P(X >= 1000) <= 0, yet the true probability is 1/2. The bound is simply false here, because a large negative tail dragged the mean down without limiting the positive tail. Markov needs X >= 0 (or more generally a known lower bound) for a reason.

One more honest caution about reading the statement. Markov bounds the tail event {X >= a}, a genuine probability. Do not confuse this with the value of a density at a point. A density can be large without any contradiction, since a density is not a probability and a single point carries probability zero. Markov is about how much mass sits at or beyond a threshold, not about the height of a curve.

How loose is it? Tight in the worst case, weak in practice

Markov's inequality is often very loose, and it is important to be honest about that. Take X uniform on the integers 1 through 100, so E[X] = 50.5. Markov says P(X >= 80) <= 50.5/80 = 0.63, while the true probability is 21/100 = 0.21. The bound is correct but far above reality. This bluntness is the price of using only one number: a single mean cannot possibly capture the shape of the tail.

And yet, with no extra information, Markov cannot be improved — it is tight. Here is a distribution that meets it exactly: let X equal a with probability E[X]/a, and equal 0 the rest of the time. Then X is nonnegative, its mean works out to a * (E[X]/a) = E[X] as required, and P(X >= a) is exactly E[X]/a. So among all nonnegative variables with that mean, this two-spike distribution is the worst case, and Markov nails it. The lesson: Markov is loose for any particular friendly distribution, but it is the best universal bound when the mean is all you are allowed to assume.

Where Markov leads: the seed of concentration

Markov is loose by itself, but it is the universal first step of a powerful recipe. Whenever you bound the probability of a tail event, you push the event onto some nonnegative quantity and apply Markov to that. Feed it the squared deviation and you get Chebyshev, the topic of the next guide and the engine behind the weak law of large numbers. Feed it the exponential e^(tX) and you get the Chernoff bound, whose tails shrink exponentially. Every theorem in this rung is, underneath, a clever choice of what nonnegative thing to hand to Markov.

Markov also has a direct life of its own as the first-moment method: a workhorse for proving that something is rare. If N counts how many times a bad event occurs and N is a nonnegative integer, then P(N >= 1) <= E[N], simply by taking a = 1 in Markov. So if the expected number of bad events is tiny, the probability that even one happens is tiny. This first-moment method is how, for example, one shows certain random structures almost surely avoid a forbidden pattern, and it is one of the most reused tricks in probabilistic combinatorics.