In short
When an outcome can be reached through several different routes, the probability of the outcome is a weighted sum: the total probability theorem says P(A) = \sum_i P(A \mid B_i) P(B_i) over a partition \{B_i\} of the sample space. For multi-stage experiments, a tree diagram lays out every path, and each path has a probability equal to the product of the branches along it. The probability of any event is the sum over all paths that reach it.
Three urns sit on a table. Urn A has 3 red and 2 blue balls. Urn B has 1 red and 4 blue. Urn C has 4 red and 1 blue. You pick an urn uniformly at random — each one with probability \tfrac{1}{3} — and then draw one ball from the chosen urn without looking. What is the probability that the ball is red?
There isn't one simple number to point at. The answer depends on which urn you ended up picking. If you picked A, it is \tfrac{3}{5}. If you picked B, it is \tfrac{1}{5}. If you picked C, it is \tfrac{4}{5}. Those three numbers have to combine somehow to give a single overall probability for "the ball is red."
This is the setting where ordinary conditional probability graduates to something more structured. You are dealing with two layers of uncertainty: which urn (first random choice) and what colour (second random choice given the urn). Each possible first-layer outcome has its own second-layer distribution. To extract a clean probability for an event that sits at the bottom, you need a tool that combines the two layers cleanly.
That tool is the total probability theorem, and its picture-language is the probability tree.
A tree diagram for the urn problem
Draw a tree. The root is the starting state. At the first split, three branches fan out to urns A, B, C — each with probability \tfrac{1}{3}. From each urn, two branches fan out to R (red) and \bar{R} (not red — i.e. blue), with probabilities that depend on the urn.
Each path from the root to a leaf represents one complete outcome of the experiment: "urn A, then red", for instance. The probability of that complete outcome is the product of the branches along the path. The path to "A then R" has probability \tfrac{1}{3} \cdot \tfrac{3}{5} = \tfrac{3}{15}. Similarly for every other path.
The six leaves and their probabilities are:
| Path | Probability |
|---|---|
| A \to R | \tfrac{1}{3} \cdot \tfrac{3}{5} = \tfrac{3}{15} |
| A \to \bar{R} | \tfrac{1}{3} \cdot \tfrac{2}{5} = \tfrac{2}{15} |
| B \to R | \tfrac{1}{3} \cdot \tfrac{1}{5} = \tfrac{1}{15} |
| B \to \bar{R} | \tfrac{1}{3} \cdot \tfrac{4}{5} = \tfrac{4}{15} |
| C \to R | \tfrac{1}{3} \cdot \tfrac{4}{5} = \tfrac{4}{15} |
| C \to \bar{R} | \tfrac{1}{3} \cdot \tfrac{1}{5} = \tfrac{1}{15} |
The total of all six is \tfrac{3 + 2 + 1 + 4 + 4 + 1}{15} = \tfrac{15}{15} = 1. The tree partitions all of probability-space into these six mutually exclusive paths.
To answer the original question — what is the probability of drawing a red ball — you add up the paths that end in R:

P(R) = \tfrac{3}{15} + \tfrac{1}{15} + \tfrac{4}{15} = \tfrac{8}{15}

So \tfrac{8}{15} \approx 0.533. Just over half. Because the three urns are equally likely, this is exactly the simple average of the three single-urn probabilities, \tfrac{1}{3}\left(\tfrac{3}{5} + \tfrac{1}{5} + \tfrac{4}{5}\right) = \tfrac{8}{15}, and it sits between the smallest (\tfrac{1}{5}, from urn B) and the largest (\tfrac{4}{5}, from urn C).
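If you want to check the tree bookkeeping mechanically, here is a minimal Python sketch (the dictionaries and variable names are mine, not part of the problem) that enumerates the six leaves with exact fractions:

```python
from fractions import Fraction as F

# Branch probabilities for the urn experiment.
urns = {"A": F(1, 3), "B": F(1, 3), "C": F(1, 3)}           # first stage
red_given_urn = {"A": F(3, 5), "B": F(1, 5), "C": F(4, 5)}  # second stage

# Each root-to-leaf path has probability = product of its branches.
leaves = {}
for urn, p_urn in urns.items():
    leaves[(urn, "R")] = p_urn * red_given_urn[urn]
    leaves[(urn, "not R")] = p_urn * (1 - red_given_urn[urn])

assert sum(leaves.values()) == 1  # sanity check: leaf probabilities sum to 1

# The event "red" is the sum over all paths ending in R.
p_red = sum(p for (urn, colour), p in leaves.items() if colour == "R")
print(p_red)  # 8/15
```

Using `fractions.Fraction` rather than floats keeps every leaf probability exact, so the sum-to-one check can be an exact equality.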
The total probability theorem
What you just did has a name. You partitioned the sample space by the urn — every outcome belongs to exactly one of the three urn-categories — and then added up the conditional probabilities for each category, each weighted by the probability of that category.
Total probability theorem
Let B_1, B_2, \ldots, B_n be a partition of the sample space S. That is:
- The events B_i are pairwise disjoint: B_i \cap B_j = \varnothing whenever i \neq j.
- Their union is the whole sample space: B_1 \cup B_2 \cup \cdots \cup B_n = S.
- Each B_i has positive probability: P(B_i) > 0.
Then for any event A,

P(A) = \sum_{i=1}^{n} P(A \mid B_i) \, P(B_i).
Reading the theorem. A partition is a way of splitting the sample space into non-overlapping "branches." For any event A whose probability you want, find the conditional probability of A given each branch, multiply by the probability of that branch, and add up the products. The result is the total probability of A.
Why is this true? Using the definition of conditional probability, P(A \mid B_i) P(B_i) = P(A \cap B_i). So the right-hand side of the theorem is

\sum_{i=1}^{n} P(A \cap B_i).
Because the B_i partition the sample space, the events A \cap B_1, A \cap B_2, \ldots, A \cap B_n are pairwise disjoint and their union is exactly A. By the additivity of probability, the sum equals P(A). Done.
In the urn problem, the partition was \{A, B, C\} — the three urns. The event was R (red ball). The theorem gave

P(R) = \tfrac{1}{3} \cdot \tfrac{3}{5} + \tfrac{1}{3} \cdot \tfrac{1}{5} + \tfrac{1}{3} \cdot \tfrac{4}{5} = \tfrac{8}{15}.
Same answer, in equation form. The tree is the picture; the theorem is the algebra.
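The algebraic form translates directly into a small reusable helper. In this sketch, `total_probability` is a hypothetical name of my own; it is nothing more than the weighted sum from the theorem, plus a check that the inputs really form a partition:

```python
from fractions import Fraction as F

def total_probability(priors, conditionals):
    """P(A) = sum_i P(A | B_i) P(B_i) over a partition {B_i}."""
    assert sum(priors.values()) == 1, "partition probabilities must sum to 1"
    return sum(conditionals[b] * p for b, p in priors.items())

# Urn example: partition {A, B, C}, event R = "red ball".
p_red = total_probability(
    priors={"A": F(1, 3), "B": F(1, 3), "C": F(1, 3)},
    conditionals={"A": F(3, 5), "B": F(1, 5), "C": F(4, 5)},
)
print(p_red)  # 8/15
```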
Two worked examples
Example 1: a factory with three machines
A factory has three machines producing the same component. Machine M_1 produces 50\% of the total output and has a defect rate of 2\%. Machine M_2 produces 30\% with a defect rate of 3\%. Machine M_3 produces 20\% with a defect rate of 5\%. A component is selected at random from the day's production. What is the probability that it is defective?
Step 1. Identify the partition. The components can be split by which machine produced them: M_1, M_2, M_3. Every component belongs to exactly one of these, so they form a partition.
Why: before you apply the total probability theorem, you must have a partition. Here the "cause" is the machine, and each component has exactly one source.
Step 2. List the partition probabilities and the conditional probabilities.
| B_i | P(B_i) | P(\text{defective} \mid B_i) |
|---|---|---|
| M_1 | 0.50 | 0.02 |
| M_2 | 0.30 | 0.03 |
| M_3 | 0.20 | 0.05 |
Step 3. Apply the total probability theorem.

P(D) = 0.50 \cdot 0.02 + 0.30 \cdot 0.03 + 0.20 \cdot 0.05 = 0.010 + 0.009 + 0.010 = 0.029
Why: each product is the fraction of components that are both from that machine and defective; adding them across all machines gives the overall defect rate.
Step 4. Interpret. About 2.9\% of the factory's overall output is defective. Notice that this is not the simple average (2 + 3 + 5)/3 = 3.33\% — it is a weighted average, with the machines' shares of the output doing the weighting. Since M_1 has both the largest share and the lowest defect rate, it pulls the overall rate below the simple average.
Result: The probability that a random component is defective is P(D) = 0.029, or 2.9\%.
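The same weighted sum in a few lines of Python (the dictionary names are mine, for illustration):

```python
# Factory example: machine shares and defect rates from the problem statement.
shares = {"M1": 0.50, "M2": 0.30, "M3": 0.20}
defect_rate = {"M1": 0.02, "M2": 0.03, "M3": 0.05}

# Total probability: sum over the partition of share * conditional defect rate.
p_defective = sum(shares[m] * defect_rate[m] for m in shares)
print(round(p_defective, 3))  # 0.029
```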
Example 2: two bags and a two-stage draw
You have two bags. Bag 1 contains 4 red and 2 blue balls. Bag 2 contains 2 red and 6 blue balls. You pick a bag uniformly at random, draw a ball, note its colour, and without returning it, draw a second ball from the same bag. What is the probability that the second ball is red?
Step 1. Notice that this is a three-stage experiment: pick a bag, draw the first ball, draw the second ball. Both the bag choice and the first draw are relevant to the probability of the second draw.
Step 2. Partition the sample space by the pair (bag, first-ball colour). The four cases are:
- Bag 1, first ball red
- Bag 1, first ball blue
- Bag 2, first ball red
- Bag 2, first ball blue
These four cases are disjoint and cover everything.
Why: a clean partition needs to capture every piece of information that affects the probability of the target event. Both the bag and the first colour affect the composition of the bag for the second draw.
Step 3. Compute the probability of each branch and the conditional probability of "second ball red" given each branch.
Probability of picking bag 1 is \tfrac{1}{2}. Given bag 1, probability of first ball red is \tfrac{4}{6} = \tfrac{2}{3}, probability of first ball blue is \tfrac{2}{6} = \tfrac{1}{3}. So

P(\text{bag 1, first red}) = \tfrac{1}{2} \cdot \tfrac{2}{3} = \tfrac{1}{3}, \qquad P(\text{bag 1, first blue}) = \tfrac{1}{2} \cdot \tfrac{1}{3} = \tfrac{1}{6}.
For bag 2: probability of first red is \tfrac{2}{8} = \tfrac{1}{4}, probability of first blue is \tfrac{6}{8} = \tfrac{3}{4}. So

P(\text{bag 2, first red}) = \tfrac{1}{2} \cdot \tfrac{1}{4} = \tfrac{1}{8}, \qquad P(\text{bag 2, first blue}) = \tfrac{1}{2} \cdot \tfrac{3}{4} = \tfrac{3}{8}.
Now compute the conditional probability of "second ball red" for each branch. After drawing from bag 1 without replacement:
- After bag 1 first red, bag 1 has 3 red and 2 blue, so P(\text{2nd red}) = \tfrac{3}{5}.
- After bag 1 first blue, bag 1 has 4 red and 1 blue, so P(\text{2nd red}) = \tfrac{4}{5}.
For bag 2:
- After bag 2 first red, bag 2 has 1 red and 6 blue, so P(\text{2nd red}) = \tfrac{1}{7}.
- After bag 2 first blue, bag 2 has 2 red and 5 blue, so P(\text{2nd red}) = \tfrac{2}{7}.
Why: without replacement, the second draw's distribution depends on what was taken out on the first draw — that is why you must condition on the first colour, not just the bag.
Step 4. Combine.

P(\text{2nd red}) = \tfrac{1}{3} \cdot \tfrac{3}{5} + \tfrac{1}{6} \cdot \tfrac{4}{5} + \tfrac{1}{8} \cdot \tfrac{1}{7} + \tfrac{3}{8} \cdot \tfrac{2}{7}
Work each term:
- \tfrac{1}{3} \cdot \tfrac{3}{5} = \tfrac{1}{5}
- \tfrac{1}{6} \cdot \tfrac{4}{5} = \tfrac{4}{30} = \tfrac{2}{15}
- \tfrac{1}{8} \cdot \tfrac{1}{7} = \tfrac{1}{56}
- \tfrac{3}{8} \cdot \tfrac{2}{7} = \tfrac{6}{56} = \tfrac{3}{28}
Find a common denominator. \tfrac{1}{5} + \tfrac{2}{15} = \tfrac{3}{15} + \tfrac{2}{15} = \tfrac{5}{15} = \tfrac{1}{3}. And \tfrac{1}{56} + \tfrac{3}{28} = \tfrac{1}{56} + \tfrac{6}{56} = \tfrac{7}{56} = \tfrac{1}{8}.
So

P(\text{2nd red}) = \tfrac{1}{3} + \tfrac{1}{8} = \tfrac{8}{24} + \tfrac{3}{24} = \tfrac{11}{24}.
Result: The probability that the second ball is red is \tfrac{11}{24}.
Notice one thing about the answer: \tfrac{11}{24} is less than \tfrac{1}{2}, and it is exactly what you would get for P(\text{first ball red}) too — because the two marginal draws from the same randomly-chosen bag have the same distribution. The first ball and the second ball, unconditionally, are equally likely to be red. This is a subtle consistency check: the two draws are not independent, but they are exchangeable, and exchangeable variables have the same marginal distribution.
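The four-cell partition, and the exchangeability check, can both be verified with exact fractions. A minimal sketch (dictionary names are mine):

```python
from fractions import Fraction as F

# Four-cell partition (bag, first-ball colour) with branch probabilities.
branch_prob = {
    ("bag1", "red"):  F(1, 2) * F(2, 3),   # 1/3
    ("bag1", "blue"): F(1, 2) * F(1, 3),   # 1/6
    ("bag2", "red"):  F(1, 2) * F(1, 4),   # 1/8
    ("bag2", "blue"): F(1, 2) * F(3, 4),   # 3/8
}
# P(2nd red | branch), from the bag's contents after the first draw.
second_red_given = {
    ("bag1", "red"):  F(3, 5),
    ("bag1", "blue"): F(4, 5),
    ("bag2", "red"):  F(1, 7),
    ("bag2", "blue"): F(2, 7),
}

p_second_red = sum(branch_prob[c] * second_red_given[c] for c in branch_prob)
print(p_second_red)  # 11/24

# Exchangeability check: the unconditional first draw has the same probability.
p_first_red = branch_prob[("bag1", "red")] + branch_prob[("bag2", "red")]
assert p_first_red == p_second_red
```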
Tactics for complex tree problems
Complex conditional-probability problems all reduce to the same workflow. Here are the moves that reliably work.
Draw the tree before you compute anything. The tree forces you to be explicit about which events are first-stage, which are second-stage, and how the branches split. Trying to compute probabilities in your head without a tree is where students lose track.
Each complete path has probability equal to the product of its branches. This is the multiplication rule, P(A \cap B) = P(A) P(B \mid A), applied stage by stage.
The probabilities of all leaves sum to 1. Always. If they don't, you made a computational error somewhere. Sum-to-one is your most important sanity check.
The probability of any event is the sum of the leaf probabilities for all paths that reach it. This is the total probability theorem, phrased in tree language.
When an event can be reached along several different paths, sum every one of them. For example, "at least one red in two draws" is a single event, but it happens along three different root-to-leaf paths — (R, R), (R, B), (B, R) — and you must add all three, not just pick one. (Equivalently, compute the complement: one minus the probability of the single path (B, B).) Missing paths is the other way students lose answers.
Check symmetries when they exist. If the problem is symmetric under swapping R and B, or under swapping two identical bags, the answer should respect that symmetry. If it doesn't, look for an error.
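These tactics lend themselves to a generic tree walk. The sketch below (a structure of my own devising, not from the text) builds the full tree for two draws without replacement from a bag of 4 red and 2 blue balls, checks that the leaves sum to 1, and sums all three paths for "at least one red":

```python
from fractions import Fraction as F

def draw_tree(red, blue, depth):
    """Yield (path, probability) for `depth` draws without replacement
    from a bag holding `red` red and `blue` blue balls."""
    if depth == 0:
        yield (), F(1)
        return
    total = red + blue
    if red:  # branch: draw a red ball, multiply by the branch probability
        for path, p in draw_tree(red - 1, blue, depth - 1):
            yield ("R",) + path, F(red, total) * p
    if blue:  # branch: draw a blue ball
        for path, p in draw_tree(red, blue - 1, depth - 1):
            yield ("B",) + path, F(blue, total) * p

leaves = dict(draw_tree(4, 2, 2))
assert sum(leaves.values()) == 1  # leaves always sum to 1

# "At least one red" collects three paths: (R,R), (R,B), (B,R).
p_at_least_one_red = sum(p for path, p in leaves.items() if "R" in path)
print(p_at_least_one_red)  # 14/15
```

Note the same answer drops out of the complement: 1 - \tfrac{2}{6} \cdot \tfrac{1}{5} = \tfrac{14}{15}.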
Common confusions
- "P(A \mid B) equals P(B \mid A)." Almost never. P(A \mid B) = \frac{P(A \cap B)}{P(B)} and P(B \mid A) = \frac{P(A \cap B)}{P(A)}. The two denominators are different, so the fractions are generally different. Swapping them is the most common error in Bayes-flavoured problems. (The relationship between the two is Bayes' theorem, covered in the next article.)
- "The partition can overlap as long as it covers the sample space." No. The whole point of a partition is that every outcome belongs to exactly one branch. If two branches share an outcome, you will double-count when you add up leaf probabilities.
- "You can ignore the first stage once you know the second stage's conditional probabilities." Only if the stages are independent. When the first stage affects the second (like drawing without replacement), the first-stage distribution determines which second-stage distribution applies, and you must condition on the first stage even when computing a second-stage probability.
- "Adding the two second-draw-red conditional probabilities gives you P(\text{2nd red})." No — you need to weight them by how likely each first-draw outcome was. Unweighted sums of conditional probabilities are almost never the right answer.
- "If the bags are identical, the choice doesn't matter." True only when the bags are literally identical. In the factory example, the three machines produce different fractions of the output and have different defect rates — so the choice very much matters, and you cannot just pick one machine's rate.
Going deeper
If you can draw trees and apply the total probability theorem, you have the working toolkit. The rest is for readers who want to see how total probability connects to Bayes' theorem, how it generalises to continuous settings, and what to do when the partition has to be refined during the problem.
From total probability to Bayes
The total probability theorem tells you how to compute P(A) from a partition. Bayes' theorem goes the other way: given that A occurred, it tells you which branch B_i is most likely to have caused it.

P(B_i \mid A) = \frac{P(A \mid B_i) \, P(B_i)}{\sum_{j=1}^{n} P(A \mid B_j) \, P(B_j)}
The denominator is exactly the total probability of A — the answer you just computed. The numerator is the probability of the specific path B_i \to A. So Bayes' theorem is saying: "given that A happened, the probability it came from branch i is the weight of path i divided by the total weight of all paths that reach A."
For the factory problem, given that a component is defective, which machine most likely produced it?

P(M_1 \mid D) = \frac{0.50 \cdot 0.02}{0.029} \approx 0.345, \qquad P(M_2 \mid D) = \frac{0.30 \cdot 0.03}{0.029} \approx 0.310, \qquad P(M_3 \mid D) = \frac{0.20 \cdot 0.05}{0.029} \approx 0.345
Roughly a 31\% chance the defective component came from M_2, even though M_2 only produces 30\% of the output — the conditional probability is nudged up slightly because M_2's defect rate is above the overall average. You will see Bayes' theorem in its own article; everything you need to state it is in the total probability theorem you already know.
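The posterior over machines is the factory computation divided through by the total probability of "defective". A short sketch (dictionary names are mine):

```python
# Machine shares and defect rates from the factory example.
shares = {"M1": 0.50, "M2": 0.30, "M3": 0.20}
defect_rate = {"M1": 0.02, "M2": 0.03, "M3": 0.05}

# Denominator: the total probability of a defective component.
p_defective = sum(shares[m] * defect_rate[m] for m in shares)

# Bayes: each path weight divided by the total weight of all paths to "defective".
posterior = {m: shares[m] * defect_rate[m] / p_defective for m in shares}
for m, p in posterior.items():
    print(m, round(p, 3))
```

The posterior probabilities sum to 1 by construction, which is another handy sanity check.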
Continuous partitions
The partition in the total probability theorem does not have to be finite, and it does not even have to be countable. If Y is a continuous random variable and A is some event whose probability depends on Y, then

P(A) = \int P(A \mid Y = y) \, f_Y(y) \, dy,
where f_Y is the density of Y and the integral is over the range of Y. This is the continuous analogue of the discrete partition: instead of summing over n branches, you integrate over a continuum of "branches." Conceptually, nothing changes — it is the same theorem with a richer partition.
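As a concrete sketch of this (an example of my own, not from the text): a coin whose bias Y is uniform on [0, 1], so P(\text{heads} \mid Y = y) = y and f_Y(y) = 1. The integral \int_0^1 y \cdot 1 \, dy = \tfrac{1}{2} can be approximated by a midpoint Riemann sum:

```python
# Continuous total probability, approximated numerically:
# P(heads) = integral of P(heads | Y=y) * f_Y(y) dy over [0, 1],
# with P(heads | Y=y) = y and f_Y(y) = 1 (uniform bias).
N = 100_000
dy = 1.0 / N
p_heads = sum((i + 0.5) * dy * 1.0 * dy for i in range(N))  # midpoint rule
print(round(p_heads, 6))  # 0.5
```

Conceptually this is the same weighted sum as before, just over a fine grid of "branches" instead of a handful.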
Refining a partition
Sometimes the partition you start with is too coarse, and you need to refine it to compute the answer. In the two-bag example, a first attempt might partition by "which bag" alone and ask "given a bag, what is the probability the second ball is red?" — but without knowing the first draw, the second draw is not just a simple fraction of reds in the bag. You have to split each bag-branch further by what happened on the first draw, giving a four-cell partition.
This is a general tactic: when the conditional probability on a branch is itself a compound probability, refine the branch into sub-branches until the conditional probability on each leaf is elementary. The sum P(A) = \sum P(A \mid B_i) P(B_i) still applies — with the refined partition the sum has more terms, but each term is easier to compute.
Where this leads next
You now have the machinery to handle multi-stage probability problems with confidence. The next set of ideas takes this machinery in two directions: the reverse direction (Bayes' theorem — inference from effect to cause) and the expectation direction (computing average values of random variables using a partition).
- Bayes Theorem — how to update beliefs about which branch you are on, given an observation at the leaf. The total probability theorem is the denominator of every Bayes calculation.
- Random Variables - Discrete — how to package the outcomes of a random experiment as a number, and compute its distribution.
- Expectation and Variance - Discrete — computing the average value of a random variable, often using a partition argument that mirrors the total probability theorem.
- Introduction to Inference — using conditional probability to reason backwards from data to hypotheses about the world.