From events to numbers
You already have the cast of characters. There is a sample space, the set of everything that could happen, and there are events, which are subsets of it. You know how to combine events with the algebra of events — unions, intersections, complements. What you do not yet have is a way to say how *likely* an event is. That is the missing piece: a rule that takes an event and hands back a single number, a probability.
Think of P as a measuring instrument. Point it at an event A and it reads back P(A), a number on a scale from 0 to 1. Zero means "effectively impossible", one means "effectively certain", and everything in between is a shade of plausibility. The whole question of this guide is: what rules must such an instrument obey to deserve the name *probability*? Remarkably, just three short demands are enough.
Kolmogorov's three axioms
In 1933 Andrey Kolmogorov wrote down the rules that the whole modern subject rests on. They are called the Kolmogorov axioms, and there are three. First, non-negativity: every event gets a probability that is at least zero, so P(A) >= 0. You never read a negative likelihood. Second, normalization: the whole sample space has probability one, P(S) = 1. Something in S is guaranteed to happen, so the total weight is exactly one — no more, no less.
The third axiom is the powerful one: additivity. If two events A and B are mutually exclusive (they cannot both happen, so A and B share no outcomes), then P(A or B) = P(A) + P(B). Disjoint chances simply add. Roll a fair die: the chance of "a 2 or a 5" is 1/6 + 1/6 = 2/6 = 1/3, because a single roll cannot be both. This is the engine that turns separate pieces of likelihood into a total.
Axiom 1 (non-negativity): P(A) >= 0 for every event A
Axiom 2 (normalization): P(S) = 1
Axiom 3 (additivity): A, B disjoint => P(A or B) = P(A) + P(B)
Full form (countable additivity): A_1, A_2, ... pairwise disjoint => P(A_1 or A_2 or ...) = P(A_1) + P(A_2) + ...
When the sample space is infinite — say, the number of phone calls in an hour, which could be 0, 1, 2, and on forever — additivity is sharpened into countable additivity: it must hold not just for two disjoint events but for any countable list of them. This stronger version is what lets probability handle limits and infinite sums cleanly, and it is doing quiet work behind almost every continuous distribution you will meet later. The trio of sample space, events, and P obeying these rules is exactly the probability space the rest of the subject lives inside.
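For a finite sample space the three axioms can be checked mechanically. The sketch below is one way to do it, not a standard library routine: the helper names (all_events, make_P, satisfies_axioms) are invented for illustration, and it assumes P is built by summing a weight attached to each outcome. It only tests the two-event form of additivity, which is all a finite space needs.

```python
from fractions import Fraction
from itertools import combinations

def all_events(sample_space):
    """Every subset of a finite sample space, as frozensets."""
    outcomes = list(sample_space)
    return [frozenset(c)
            for r in range(len(outcomes) + 1)
            for c in combinations(outcomes, r)]

def make_P(weights):
    """Build P from per-outcome weights: P(A) is the sum of the weights in A."""
    def P(event):
        return sum((weights[o] for o in event), Fraction(0))
    return P

def satisfies_axioms(sample_space, P):
    """Check the three axioms by brute force over every event."""
    events = all_events(sample_space)
    non_negative = all(P(A) >= 0 for A in events)           # axiom 1
    normalized = P(frozenset(sample_space)) == 1            # axiom 2
    additive = all(P(A | B) == P(A) + P(B)                  # axiom 3
                   for A in events for B in events
                   if not (A & B))                          # disjoint pairs only
    return non_negative and normalized and additive

die = frozenset(range(1, 7))
fair = make_P({face: Fraction(1, 6) for face in die})
broken = make_P({face: Fraction(1, 5) for face in die})     # weights sum to 6/5

print(satisfies_axioms(die, fair))    # True
print(satisfies_axioms(die, broken))  # False: normalization fails
print(fair({2, 5}))                   # 1/3
```

Exact fractions are used instead of floats so that the equality checks are not disturbed by rounding.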
Everything else is a theorem
The beauty of starting from so few rules is that every other familiar fact about probability is now *derived*, not assumed. Take the complement rule: an event A and its complement (A not happening) are mutually exclusive and together fill all of S. So P(A) + P(not A) = P(S) = 1, which rearranges to P(not A) = 1 - P(A). The chance it does not rain is one minus the chance it does — and that is a theorem, squeezed out of axioms 2 and 3.
- P of the impossible event is zero: the empty set is disjoint from S, and S or (nothing) = S, so additivity gives P(S) = P(S) + P(empty), which forces P(empty) = 0. (But beware: the reverse is not guaranteed; we return to this below.)
- Monotonicity: if A is contained in B, then P(A) <= P(B). Split B into A and the part of B outside A; the two pieces are disjoint, so P(B) = P(A) + P(B and not A) >= P(A) by additivity and non-negativity. A bigger event cannot be less likely than a smaller one nested inside it.
- Every probability is at most one: since A is contained in S, monotonicity gives P(A) <= P(S) = 1. So 0 <= P(A) <= 1 always. The lower end of that famous range is axiom 1, but the upper end is a consequence, not an axiom.
- The general addition rule: for events that may overlap, P(A or B) = P(A) + P(B) - P(A and B). You subtract the shared part so it is not double-counted — the smallest case of the inclusion-exclusion principle.
Notice how the general addition rule contains the third axiom as a special case: when A and B are disjoint, P(A and B) = 0 and the correction vanishes, leaving plain P(A) + P(B). The axioms are the seed; results like these are the tree. Once you trust the three rules, you can prove what you need rather than memorise a long list.
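The derived rules above can be spot-checked on a concrete model. This is a minimal sketch, assuming a fair six-sided die with equally likely faces; the events A, B, and C are arbitrary choices made for the illustration. A check on one model is not a proof, but it is a quick way to catch a misremembered formula.

```python
from fractions import Fraction

S = frozenset(range(1, 7))           # fair six-sided die (an assumption)

def P(event):
    """Equally likely outcomes: favourable count over total count."""
    return Fraction(len(event), len(S))

A = frozenset({2, 4, 6})             # "even"
B = frozenset({4, 5, 6})             # "at least 4"
C = frozenset({4, 6})                # nested inside A

complement_ok = P(S - A) == 1 - P(A)                    # complement rule
empty_ok = P(frozenset()) == 0                          # impossible event
monotone_ok = C <= A and P(C) <= P(A)                   # monotonicity
addition_ok = P(A | B) == P(A) + P(B) - P(A & B)        # general addition rule

print(complement_ok, empty_ok, monotone_ok, addition_ok)  # True True True True
print(P(A | B))                                           # 2/3
```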
Reading the axioms honestly
A few subtleties are worth getting right early, because they trip people up for years otherwise. First: "impossible" and "probability zero" are not the same idea. Every impossible event has probability zero, but in infinite models the reverse fails — pick a number uniformly from the interval [0, 1] and the chance of landing on exactly 0.5 is zero, yet 0.5 is a perfectly possible outcome. Probability zero means *negligible in the total*, not *forbidden*. The mirror statement holds at the top: probability one means "almost certain", not strictly guaranteed.
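The probability-zero point can be made concrete with shrinking intervals. The sketch below assumes the usual uniform model on [0, 1], in which an interval's probability is its length; the function name uniform_P is invented for the illustration. By monotonicity, the single point 0.5 can be no more likely than any interval containing it, and those intervals can be made as small as we like.

```python
from fractions import Fraction

def uniform_P(a, b):
    """P of [a, b] under the uniform model on [0, 1]: the length of its part inside [0, 1]."""
    lo = max(a, Fraction(0))
    hi = min(b, Fraction(1))
    return max(hi - lo, Fraction(0))

half = Fraction(1, 2)
for k in (1, 2, 3, 6):
    eps = Fraction(1, 10**k)
    # Interval of half-width eps centred on 0.5; its probability is 2 * eps.
    print(k, uniform_P(half - eps, half + eps))

print(uniform_P(half, half))  # 0: the interval [0.5, 0.5] has length zero
```

The loop prints 1/5, 1/50, 1/500, and 1/500000: the bound on the chance of hitting exactly 0.5 drops below any positive number, so the only value left is zero, even though 0.5 is a legitimate outcome.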
Second: additivity is *only* for mutually exclusive events. The seductive mistake is to write P(A or B) = P(A) + P(B) for events that can both happen. Ask for the probability that a card drawn is a heart or a face card and you cannot just add 13/52 + 12/52 — the king, queen, and jack of hearts get counted twice. You must subtract the overlap, P(A and B) = 3/52, giving 22/52. Whenever you reach for a plus sign, check that the pieces truly cannot co-occur.
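The card example can be checked by brute-force counting. This sketch assumes a standard 52-card deck with every card equally likely to be drawn; the rank and suit labels are just one convenient encoding.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = set(product(ranks, suits))                 # 52 (rank, suit) pairs

hearts = {card for card in deck if card[1] == "hearts"}
faces = {card for card in deck if card[0] in {"J", "Q", "K"}}

def P(event):
    """Equally likely cards: favourable count over deck size."""
    return Fraction(len(event), len(deck))

naive = P(hearts) + P(faces)                        # double-counts the overlap
correct = P(hearts) + P(faces) - P(hearts & faces)  # general addition rule

print(len(hearts & faces))   # 3
print(naive)                 # 25/52
print(correct)               # 11/26, i.e. 22/52 in lowest terms
print(P(hearts | faces))     # 11/26, counting the union directly
```

Counting the union directly and applying the subtraction formula agree, while the naive sum overshoots by exactly the 3/52 overlap.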
What the axioms leave open
It is worth being clear about what these three rules deliberately do *not* settle. They are silent on where the numbers come from. In the next guide we will see the classical definition — when outcomes are equally likely, P(A) is just the count of favourable outcomes over the total count — but that is one *model* consistent with the axioms, not a fourth axiom. Equally likely is an assumption you choose to make, true for a fair die, false for a thumbtack landing point-up.
The axioms are also silent on *meaning*. Is P(A) the long-run frequency of A over many repetitions, or a measured degree of belief? The mathematics works identically either way, which is exactly why one framework serves gamblers, physicists, and forecasters alike. That choice of interpretation is the subject of the last guide in this rung. For now, hold the satisfying thought that all of probability — every distribution, every theorem you will meet — is built on a foundation you can state in three lines.