Conditional Probability: Updating on Information

Information changes the question

In the earlier rungs you treated probability as fixed: roll a fair die, and P(it shows a 6) = 1/6, full stop. But real life feeds you partial information, and that information should move your numbers. Suppose a friend rolls the die behind a screen and tells you only "it came up even." The honest 1/6 is now stale. Given that the result is even, the only possibilities left are 2, 4, and 6, and the 6 is one of three equally likely survivors, so the probability is 1/3. Nothing about the die changed — what changed is what you know.
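
The die-behind-a-screen update can be checked by brute-force counting. The following is a minimal sketch, assuming a fair six-sided die; the variable names are illustrative only.

```python
from fractions import Fraction

# Sample space of one fair die; every face is equally likely.
faces = [1, 2, 3, 4, 5, 6]

# Before any information: one face in six is a 6.
p_six = Fraction(sum(1 for f in faces if f == 6), len(faces))

# After hearing "it came up even": keep only the surviving faces.
even_faces = [f for f in faces if f % 2 == 0]
p_six_given_even = Fraction(sum(1 for f in even_faces if f == 6), len(even_faces))

print(p_six)             # 1/6
print(p_six_given_even)  # 1/3
```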

That quantity, the probability of one event once you know another has occurred, is the conditional probability of the first given the second, written P(A given B); in standard notation this is P(A | B). Read it aloud as "the probability of A, given that B happened." It is among the most useful ideas in probability, because most interesting questions are really conditional: not "is this patient sick?" but "is this patient sick given the test was positive?"; not "will it rain?" but "will it rain given the sky is this grey?"

The definition: shrink the sample space

Here is the picture behind every conditional probability, and it is worth holding in your mind permanently. Your original sample space is the full set of outcomes, each carrying some probability. When you learn that B happened, you do something drastic: you throw away every outcome outside B. The world has shrunk to B. This is what it means to say that conditioning shrinks the sample space: B becomes your new, smaller universe.

But shrinking the universe disturbs the bookkeeping. The surviving outcomes (those in B) carried only P(B) worth of probability among them, not the full 1. To make them a valid probability distribution again, summing to 1 over the new universe, you rescale by dividing by P(B). The piece of A that survives is the overlap "A and B," so the probability of A in the new world is the share of B's probability that also lies in A. That gives the definition.

P(A given B) = P(A and B) / P(B),    defined only when P(B) > 0

     whole sample space            condition on B (shrink to B)
   +---------------------+        +---------------------+
   |    A                |        |          | B        |
   |   +------+          |        |     +----+----+     |
   |   |  A&B |   B      |  --->  |     |A&B | (rest    |
   |   +------+----+     |        |     +----+  of B)   |
   |          |    |     |        |          | B        |
   +----------+----+-----+        +---------------------+

   P(A given B) = fraction of B that also lies in A

Conditioning on B deletes everything outside B, then rescales by dividing by P(B) so the survivors sum to 1.

Notice the small but vital fine print written into the formula: it requires P(B) > 0. You cannot condition on something that has probability zero, because dividing by zero is meaningless — and intuitively, you cannot rescale a universe that has no probability mass to spread around. (Conditioning on probability-zero events can be made sense of with heavier machinery, but that is far up the ladder; here, always insist P(B) > 0.)
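
The delete-then-rescale move also works when outcomes are not equally likely. Here is a minimal sketch; the `condition` helper and the loaded die are hypothetical illustrations, not something the guide defines. It keeps only the outcomes inside B, divides each by P(B), and refuses to proceed when P(B) = 0.

```python
from fractions import Fraction

def condition(dist, in_b):
    """Return the distribution conditioned on event B.

    dist maps each outcome to its probability; in_b(outcome) is True
    for outcomes inside B.
    """
    p_b = sum(p for outcome, p in dist.items() if in_b(outcome))
    if p_b == 0:
        raise ValueError("cannot condition on an event with probability 0")
    # Delete everything outside B, then rescale the survivors by 1 / P(B).
    return {o: p / p_b for o, p in dist.items() if in_b(o)}

# Hypothetical loaded die: the 6 is three times as likely as any other face.
loaded = {face: Fraction(1, 8) for face in range(1, 6)}
loaded[6] = Fraction(3, 8)

given_even = condition(loaded, lambda face: face % 2 == 0)
print(given_even[6])             # 3/5
print(sum(given_even.values()))  # 1
```

For this loaded die P(even) = 5/8, so the 6 ends up with (3/8) / (5/8) = 3/5 of the new universe, and the three surviving faces again sum to 1.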

A worked count makes it concrete

Let us nail it with a tiny, fully countable example. A standard deck has 52 cards; 4 of them are kings. Draw one card. P(king) = 4/52 = 1/13. Now a helpful onlooker tells you the card is a face card (jack, queen, or king). There are 12 face cards, so B = "face card" has P(B) = 12/52. The kings that are also face cards — well, all 4 kings are face cards, so "A and B" is just the 4 kings, with P(A and B) = 4/52.

Plug in: P(king given face card) = P(A and B) / P(B) = (4/52) / (12/52) = 4/12 = 1/3. The 52s cancel, leaving the clean count: among the 12 equally likely face cards, 4 are kings. The information "face card" lifted the probability of a king from 1/13 all the way to 1/3, because it discarded the 40 non-face cards that were dragging the odds down. The formula and the shrink-the-sample-space picture are saying the very same thing — in the equally-likely case, P(A given B) is just (number of outcomes in A and B) / (number of outcomes in B).
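
The same count can be reproduced by building the deck explicitly. This is a small sketch, assuming a standard 52-card deck represented as (rank, suit) pairs; the representation is an illustrative choice.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

# B = "face card": shrink the sample space to jacks, queens, and kings.
face_cards = [card for card in deck if card[0] in {"J", "Q", "K"}]
# A and B = kings that survive inside the shrunken space.
kings_among_faces = [card for card in face_cards if card[0] == "K"]

p_king = Fraction(sum(1 for card in deck if card[0] == "K"), len(deck))
p_king_given_face = Fraction(len(kings_among_faces), len(face_cards))

print(len(deck), len(face_cards))  # 52 12
print(p_king)                      # 1/13
print(p_king_given_face)           # 1/3
```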

Reading the definition backwards: the multiplication rule

The definition P(A given B) = P(A and B) / P(B) is most famous read forwards, but its everyday workhorse form comes from multiplying both sides by P(B). That rearrangement gives P(A and B) = P(B) * P(A given B), the multiplication rule. In words: the chance that both A and B happen equals the chance that B happens, times the chance that A happens given B already did. You build a joint event one stage at a time, each stage conditioned on the stages before it.

This staged view chains naturally to more events, which is why it is also called the chain rule: P(A and B and C) = P(A) * P(B given A) * P(C given A and B). Try it on cards. Draw 2 cards without replacement; what is P(both aces)? Stage one: P(first ace) = 4/52. Stage two, conditioned on the first ace being gone: P(second ace given first ace) = 3/51. Multiply: (4/52) * (3/51) = 12/2652 = 1/221. The conditional 3/51 is the multiplication rule quietly accounting for the shrunken deck — exactly the without-replacement counting you met in the previous rung, now wearing its probability clothes.
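
One way to convince yourself that the staged product is right is to compare it with a brute-force count over every ordered pair of cards. The sketch below assumes a simplified deck in which only "ace" versus "not ace" matters; the labels are illustrative.

```python
from fractions import Fraction
from itertools import permutations

# 4 aces and 48 other cards; suits do not matter for this question.
deck = [("A", i) for i in range(4)] + [("x", i) for i in range(48)]

# Stage by stage, as in the chain rule.
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)
chain = p_first_ace * p_second_ace_given_first

# Brute force: every ordered pair of distinct cards is equally likely.
pairs = list(permutations(deck, 2))
both_aces = sum(1 for a, b in pairs if a[0] == "A" and b[0] == "A")
brute = Fraction(both_aces, len(pairs))

print(both_aces, len(pairs))  # 12 2652
print(chain, brute)           # 1/221 1/221
```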

When information changes nothing: independence

Conditioning usually moves the probability, but sometimes it does not, and that special case has a name. Two events are independent when learning that one happened leaves the other's probability untouched: P(A given B) = P(A), which needs P(B) > 0 to make sense. Substitute that into the multiplication rule and the clutter falls away, leaving the famous product form: P(A and B) = P(A) * P(B). The product form is taken as the formal definition of independent events because it still works when P(B) = 0, and it is visibly symmetric: if B tells you nothing about A, then A tells you nothing about B.
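
The product form gives a mechanical test for independence. The sketch below assumes two fair dice rolled together; the helper names `prob` and `independent` are hypothetical. It checks one independent pair of events and one dependent pair.

```python
from fractions import Fraction
from itertools import product

# 36 equally likely (first, second) pairs from two fair dice.
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate on a (first, second) roll."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

def independent(a, b):
    """Product-form test: P(A and B) == P(A) * P(B)."""
    return prob(lambda r: a(r) and b(r)) == prob(a) * prob(b)

first_is_6 = lambda r: r[0] == 6
second_is_6 = lambda r: r[1] == 6
sum_is_8 = lambda r: r[0] + r[1] == 8

print(independent(first_is_6, second_is_6))  # True
print(independent(first_is_6, sum_is_8))     # False
```

The second pair fails because a first roll of 6 raises the chance of a total of 8 from 5/36 to 1/6.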

A roll of one die and a roll of another are independent: knowing the first showed a 6 tells you nothing about the second, so P(second is 6 given first is 6) is still 1/6. But beware a misconception this invites: the gambler's fallacy. After five sixes in a row from a fair die, the sixth roll's probability of a six is still exactly 1/6. The die has no memory; independent trials do not "balance out" in the short run to repay a streak. The law that does make long-run frequencies settle near 1/6, the law of large numbers, is about the average over many rolls, not about any single upcoming roll being "due."
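
The "no memory" claim can be verified exactly rather than by simulation. This sketch, assuming six independent rolls of a fair die, enumerates all sequences, shrinks to those that open with five sixes, and asks what share of them end in a six.

```python
from fractions import Fraction
from itertools import product

# All 6**6 = 46656 equally likely sequences of six rolls of a fair die.
sequences = product(range(1, 7), repeat=6)

streaks = 0          # sequences whose first five rolls are all sixes
streak_then_six = 0  # of those, sequences whose sixth roll is also a six
for seq in sequences:
    if all(roll == 6 for roll in seq[:5]):
        streaks += 1
        if seq[5] == 6:
            streak_then_six += 1

p_six_after_streak = Fraction(streak_then_six, streaks)
print(streaks, streak_then_six)  # 6 1
print(p_six_after_streak)        # 1/6
```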

Why this idea anchors the whole rung

Everything in the next four guides grows from the single formula P(A given B) = P(A and B) / P(B). When you split a complicated event by all the ways it could arise and add up the conditional pieces, you get the law of total probability (guide 2). When you flip a conditional from P(B given A) to the answer you actually wanted, P(A given B), you get Bayes' theorem (guide 3). When conditioning changes nothing you get independence (guide 4). And when human intuition fights the arithmetic, you get the famous puzzles and fallacies (guide 5).

To compute any conditional probability, work through five steps:

1. Name the two events cleanly: A is what you want the probability of; B is what you have been told happened.
2. Confirm P(B) > 0: the formula only lets you condition on an event with positive probability.
3. Find P(A and B), the overlap, and P(B); in equally-likely settings just count outcomes.
4. Divide: P(A given B) = P(A and B) / P(B). Sanity-check that the answer lies between 0 and 1 and counts only outcomes inside the shrunken universe B.
5. Ask whether the answer equals P(A). If yes, A and B are independent; if not, B carried real information about A.
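
The steps above can be sketched as a single function for the equally-likely case. The name `conditional_report` and the card predicates are hypothetical, chosen only for illustration; the function counts outcomes, guards against P(B) = 0, divides, and compares the result with P(A).

```python
from fractions import Fraction
from itertools import product

def conditional_report(outcomes, in_a, in_b):
    """Follow the five steps for equally likely outcomes.

    Returns (P(A given B), True if A and B are independent).
    """
    n_b = sum(1 for o in outcomes if in_b(o))
    if n_b == 0:  # step 2: P(B) must be positive
        raise ValueError("P(B) is 0; cannot condition on B")
    n_ab = sum(1 for o in outcomes if in_a(o) and in_b(o))  # step 3: overlap
    p_a_given_b = Fraction(n_ab, n_b)                       # step 4: divide
    p_a = Fraction(sum(1 for o in outcomes if in_a(o)), len(outcomes))
    return p_a_given_b, p_a_given_b == p_a                  # step 5: compare

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))

# Step 1: name the events as predicates on a (rank, suit) card.
is_face = lambda card: card[0] in {"J", "Q", "K"}
is_king = lambda card: card[0] == "K"
is_heart = lambda card: card[1] == "hearts"

print(conditional_report(deck, is_king, is_face))   # (Fraction(1, 3), False)
print(conditional_report(deck, is_heart, is_face))  # (Fraction(1, 4), True)
```

Learning "face card" changes the chance of a king but not the chance of a heart, so only the second pair of events is independent.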

Hold on to the central image as you climb: a conditional probability is what is left after you delete the impossible and rescale the rest. Every fancy theorem above is just careful arithmetic on that one move. Master "shrink the sample space, then renormalize," and the rest of this rung is built on solid ground.