
The Law of Total Probability

When you cannot find a probability head-on, split the world into a few clean cases, find the probability inside each, and weave them back together as a weighted average. That weaving is the law of total probability — and it is the engine that powers Bayes' theorem next.

The problem: a probability you cannot see directly

In the previous guide you met conditional probability and the idea that conditioning shrinks the sample space: once you know B happened, you throw away every outcome outside B and rescale what is left. That is a wonderful tool when somebody hands you the condition. But real questions often run the other way. You can easily say how likely something is *inside* each scenario, yet the scenario itself is uncertain, and what you actually want is the plain, unconditional probability across all of them.

Picture a concrete example. A factory makes light bulbs on two machines. Machine A makes 70% of the bulbs and 2% of its bulbs are defective; machine B makes the other 30% and 5% of its bulbs are defective. You pick one bulb from the day's output at random. What is the chance it is defective? You do not know which machine made it — that is precisely what is uncertain. You only know the defect rate *given* the machine. So you cannot read P(defective) off any single line; you have to combine the two stories.

The trick is to imagine the uncertain scenario as the *first* random thing that happens — which machine made the bulb — and the thing you care about as the *second*. We already know how to handle two-stage randomness from the multiplication rule: P(A and B) = P(A) * P(B given A). The law of total probability is just the multiplication rule applied to every branch and then summed.
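The two-stage picture can be checked by simulation. The sketch below is an illustration rather than part of the method itself: the function name `simulate_defect_rate`, the number of trials, and the seed are assumptions chosen for the example. It draws the machine first, then the defect given that machine, and counts how often the bulb comes out defective.

```python
import random

def simulate_defect_rate(n_trials=200_000, seed=0):
    """Estimate P(defective) by simulating the two stages in order."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    defect_rate = {"A": 0.02, "B": 0.05}  # P(defective given machine), from the text
    defective = 0
    for _ in range(n_trials):
        # Stage 1: which machine made the bulb? A with probability 0.70.
        machine = "A" if rng.random() < 0.70 else "B"
        # Stage 2: is the bulb defective, given that machine?
        if rng.random() < defect_rate[machine]:
            defective += 1
    return defective / n_trials

estimate = simulate_defect_rate()
print(f"simulated P(defective) ~ {estimate:.4f}")
```

The estimate is only approximate and moves a little with the seed, but it should settle near the exact value that the rest of this guide works out by hand.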

First you need a clean partition

The whole method rests on cutting the sample space into pieces the right way. A partition is a collection of events B_1, B_2, ..., B_k that are mutually exclusive (no two can happen together) and exhaustive (together they cover everything, so one of them must happen). Think of slicing a pizza: the slices do not overlap, and they leave no crumb of the pizza uncovered. In the factory, "made by A" and "made by B" form a partition of the sample space — every bulb was made by exactly one of the two machines.

Why do both conditions matter? Mutual exclusivity means that when we later add the pieces, nothing gets double-counted — no bulb is charged to both machines. Exhaustiveness means that when we add the pieces, nothing is left out — every bulb is accounted for somewhere. The simplest partition of all is an event and its complement, A and not-A, which is just two slices. Most real partitions are these few-piece ones; you rarely need a fancy one.
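Both conditions can be tested mechanically when the sample space is small enough to list. The following is a minimal sketch: the helper `is_partition` and the die-roll sample space are assumptions made for illustration, not something the factory example supplies.

```python
def is_partition(pieces, sample_space):
    """Return True if the pieces are mutually exclusive and exhaustive."""
    seen = set()
    for piece in pieces:
        if seen & piece:   # overlaps an earlier piece: not mutually exclusive
            return False
        seen |= piece
    return seen == set(sample_space)  # exhaustive: nothing left uncovered

die = {1, 2, 3, 4, 5, 6}
print(is_partition([{1, 2}, {3, 4}, {5, 6}], die))       # True
print(is_partition([{1, 2, 3}, {3, 4, 5, 6}], die))      # False: 3 is in two pieces
print(is_partition([{1, 2}, {3, 4}], die))               # False: 5 and 6 are missing
print(is_partition([{2, 4, 6}, die - {2, 4, 6}], die))   # True: event and complement
```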

The law itself: a weighted average of conditionals

Now the statement. Let B_1, ..., B_k be a partition, and let A be any event you care about. The law of total probability says: P(A) = P(A and B_1) + P(A and B_2) + ... + P(A and B_k). The logic is almost too simple. The B_i tile the whole space, so the event A is chopped into the slivers "A and B_1", "A and B_2", and so on; those slivers do not overlap (because the B_i do not), so their probabilities simply add to give all of A.
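To see the slivers add up on something small enough to count by hand, here is a sketch using one roll of a fair die. The die, the three-piece partition, and the event "roll at least 4" are all assumptions picked for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}      # one roll of a fair die
partition = [{1, 2}, {3, 4}, {5, 6}]   # non-overlapping, covers everything
A = {4, 5, 6}                          # event: roll at least 4

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

slivers = [prob(A & B_i) for B_i in partition]  # P(A and B_i) for each piece
print(slivers)       # [Fraction(0, 1), Fraction(1, 6), Fraction(1, 3)]
print(sum(slivers))  # 1/2
print(prob(A))       # 1/2
```

The first sliver is empty because A never happens inside {1, 2}; the law still holds, since that piece simply contributes zero.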

But we rarely know P(A and B_i) directly; we know the conditionals. So expand each joint piece with the multiplication rule, P(A and B_i) = P(B_i) * P(A given B_i), and you get the form you will actually use: P(A) = sum over i of P(B_i) * P(A given B_i). (Each branch needs P(B_i) > 0 for its conditional to be defined.) Read that out loud and you hear what it means: P(A) is a weighted average of the conditional probabilities P(A given B_i), where each weight is how likely that branch is, P(B_i). The likely branches count more; the rare ones count less; the weights P(B_i) sum to 1 because the B_i are mutually exclusive and together exhaustive.

P(A) = P(B_1) * P(A given B_1)
     + P(B_2) * P(A given B_2)
     + ... + P(B_k) * P(A given B_k)

        weights P(B_i) sum to 1
        each P(A given B_i) is the chance of A inside branch i

The working form: weight each branch by how likely it is, P(B_i), then sum.
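The working form translates almost word for word into code. This is a minimal sketch, assuming the branches arrive as two parallel lists; the function name `total_probability` and the three-branch numbers in the usage line are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def total_probability(weights, conditionals):
    """P(A) = sum of P(B_i) * P(A given B_i) over the branches of a partition."""
    if len(weights) != len(conditionals):
        raise ValueError("need one conditional per branch")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("branch weights must sum to 1")
    return sum(w * c for w, c in zip(weights, conditionals))

# Hypothetical three-branch partition: weights 0.25, 0.25, 0.5.
print(total_probability([0.25, 0.25, 0.5], [0.0, 0.5, 1.0]))  # 0.625
```

The check on the weights catches a partition that overlaps or leaves something out before the sum is taken, which is where most mistakes with this law tend to start.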

Working the factory example all the way through

Let A be "the bulb is defective," and partition on the machine. We were given the weights P(A_machine) = 0.70 and P(B_machine) = 0.30, and the conditionals P(defective given A_machine) = 0.02 and P(defective given B_machine) = 0.05. Before plugging into the law, a picture helps: imagine 1000 bulbs flowing through. About 700 come from machine A and 300 from machine B. Of the 700, about 2%, or 14, are defective; of the 300, about 5%, or 15, are defective. That is 29 defective bulbs out of 1000, i.e. 0.029, and the formula gives exactly that.

  1. Name the target event A ("defective") and choose the partition (which machine: A_machine, B_machine).
  2. Write each branch weight: P(A_machine) = 0.70, P(B_machine) = 0.30. Check they sum to 1.
  3. Write each conditional: P(A given A_machine) = 0.02, P(A given B_machine) = 0.05.
  4. Multiply within each branch, then add: 0.70 * 0.02 + 0.30 * 0.05 = 0.014 + 0.015 = 0.029.
  5. Sanity-check against the 1000-bulb picture: 14 + 15 = 29 out of 1000 — matches.
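The five steps above can be replayed in a few lines. This sketch uses exact fractions so that no floating-point rounding blurs the comparison with the 1000-bulb picture; the dictionary names are assumptions made for the example.

```python
from fractions import Fraction

# Steps 2 and 3: branch weights and conditionals from the factory example.
weight = {"A_machine": Fraction(70, 100), "B_machine": Fraction(30, 100)}
defect_given = {"A_machine": Fraction(2, 100), "B_machine": Fraction(5, 100)}

assert sum(weight.values()) == 1  # the weights must sum to 1

# Step 4: multiply within each branch, then add.
contribution = {m: weight[m] * defect_given[m] for m in weight}
p_defective = sum(contribution.values())

print(contribution["A_machine"], contribution["B_machine"])  # 7/500 3/200
print(p_defective, float(p_defective))                       # 29/1000 0.029

# Step 5: scale to 1000 bulbs to recover the picture.
print(1000 * contribution["A_machine"], 1000 * contribution["B_machine"])  # 14 15
```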

Notice where the answer 0.029 sits: between the two defect rates 0.02 and 0.05, and much closer to 0.02 because machine A dominates production. That is the signature of a weighted average, and it is a free error-check. If your total probability ever lands outside the range of the per-branch conditionals — say you compute 0.06 here — you have made an arithmetic slip, because an average can never exceed its largest ingredient or fall below its smallest.
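That error-check is cheap enough to automate. A minimal sketch, with the helper name `within_branch_range` being an assumption:

```python
def within_branch_range(total, conditionals):
    """A weighted average must lie between its smallest and largest ingredient."""
    return min(conditionals) <= total <= max(conditionals)

print(within_branch_range(0.029, [0.02, 0.05]))  # True
print(within_branch_range(0.06, [0.02, 0.05]))   # False: signals a slip
```

Passing the check does not prove the answer is right, since a wrong total can still land inside the range; failing it does prove something went wrong.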

Why this is the doorway to Bayes

The law of total probability runs forward: from causes (which machine) to effects (defective). It answers "given the setup, how likely is the symptom?" The very next guide reverses the arrow. You observe a defective bulb and ask, "how likely is it that machine B made it?" — effect back to cause. That reversal is Bayes' theorem, and it is built directly on top of what you just did: the denominator in Bayes' theorem is exactly the total probability P(A) you computed here.

So the work was not just to get one number. By computing P(A) as a weighted average over a partition, you have already built the harder half of every Bayes calculation. When you reach Bayes you will simply take one of the branch contributions, P(B_i) * P(A given B_i), and divide it by the total P(A) — turning a forward weight into a backward, updated belief. Master the partition-and-sum move now and Bayes will feel like a one-line afterthought.
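As a preview of that division, here is a sketch that reuses the factory numbers: each branch contribution is divided by the total to answer the backward question. The variable names are assumptions, and the next guide is where the reasoning behind this step is developed properly.

```python
from fractions import Fraction

weight = {"A_machine": Fraction(7, 10), "B_machine": Fraction(3, 10)}
defect_given = {"A_machine": Fraction(1, 50), "B_machine": Fraction(1, 20)}

# Forward: branch contributions and their total, as in this guide.
contribution = {m: weight[m] * defect_given[m] for m in weight}
p_defective = sum(contribution.values())  # 29/1000

# Backward: given a defective bulb, how likely is each machine?
posterior = {m: contribution[m] / p_defective for m in weight}
print(posterior["A_machine"], posterior["B_machine"])  # 14/29 15/29
```

Machine B makes only 30% of the bulbs, yet it accounts for 15 of the 29 defective ones, which matches the 1000-bulb picture.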