In short
When an outcome can be reached through several different routes, the probability of the outcome is a weighted sum: the total probability theorem says P(A) = \sum_i P(A \mid B_i) P(B_i) over a partition \{B_i\} of the sample space. For multi-stage experiments, a tree diagram lays out every path, and each path has a probability equal to the product of the branches along it. The probability of any event is the sum over all paths that reach it.
Three urns sit on a table. Urn A has 3 red and 2 blue balls. Urn B has 1 red and 4 blue. Urn C has 4 red and 1 blue. You pick an urn uniformly at random — each one with probability \tfrac{1}{3} — and then draw one ball from the chosen urn without looking. What is the probability that the ball is red?
There isn't one simple number to point at. The answer depends on which urn you ended up picking. If you picked A, it is \tfrac{3}{5}. If you picked B, it is \tfrac{1}{5}. If you picked C, it is \tfrac{4}{5}. Those three numbers have to combine somehow to give a single overall probability for "the ball is red."
This is the setting where ordinary conditional probability graduates to something more structured. You are dealing with two layers of uncertainty: which urn (first random choice) and what colour (second random choice given the urn). Each possible first-layer outcome has its own second-layer distribution. To extract a clean probability for an event that sits at the bottom, you need a tool that combines the two layers cleanly.
That tool is the total probability theorem, and its picture-language is the probability tree.
A tree diagram for the urn problem
Draw a tree. The root is the starting state. At the first split, three branches fan out to urns A, B, C — each with probability \tfrac{1}{3}. From each urn, two branches fan out to R (red) and \bar{R} (not red — i.e. blue), with probabilities that depend on the urn.
Each path from the root to a leaf represents one complete outcome of the experiment: "urn A, then red", for instance. The probability of that complete outcome is the product of the branches along the path. The path to "A then R" has probability \tfrac{1}{3} \cdot \tfrac{3}{5} = \tfrac{3}{15}. Similarly for every other path.
The six leaves and their probabilities are:
| Path | Probability |
|---|---|
| A \to R | \tfrac{1}{3} \cdot \tfrac{3}{5} = \tfrac{3}{15} |
| A \to \bar{R} | \tfrac{1}{3} \cdot \tfrac{2}{5} = \tfrac{2}{15} |
| B \to R | \tfrac{1}{3} \cdot \tfrac{1}{5} = \tfrac{1}{15} |
| B \to \bar{R} | \tfrac{1}{3} \cdot \tfrac{4}{5} = \tfrac{4}{15} |
| C \to R | \tfrac{1}{3} \cdot \tfrac{4}{5} = \tfrac{4}{15} |
| C \to \bar{R} | \tfrac{1}{3} \cdot \tfrac{1}{5} = \tfrac{1}{15} |
The total of all six is \tfrac{3 + 2 + 1 + 4 + 4 + 1}{15} = \tfrac{15}{15} = 1. The tree partitions all of probability-space into these six mutually exclusive paths.
To answer the original question — what is the probability of drawing a red ball — you add up the paths that end in R:

P(R) = \tfrac{3}{15} + \tfrac{1}{15} + \tfrac{4}{15} = \tfrac{8}{15}

So \tfrac{8}{15} \approx 0.533. Just over half. Because the three urns are equally likely, this is exactly the simple average of the three single-urn probabilities, \tfrac{1}{3}\left(\tfrac{3}{5} + \tfrac{1}{5} + \tfrac{4}{5}\right) = \tfrac{8}{15}, and it sits between the smallest (\tfrac{1}{5}, from urn B) and the largest (\tfrac{4}{5}, from urn C).
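If you want to check the tree bookkeeping mechanically, here is a minimal Python sketch (the dictionaries and variable names are mine, not part of the problem) that enumerates the six leaves with exact fractions:

```python
from fractions import Fraction as F

# Branch probabilities for the urn experiment.
urns = {"A": F(1, 3), "B": F(1, 3), "C": F(1, 3)}           # first stage
red_given_urn = {"A": F(3, 5), "B": F(1, 5), "C": F(4, 5)}  # second stage

# Each root-to-leaf path has probability = product of its branches.
leaves = {}
for urn, p_urn in urns.items():
    leaves[(urn, "R")] = p_urn * red_given_urn[urn]
    leaves[(urn, "not R")] = p_urn * (1 - red_given_urn[urn])

assert sum(leaves.values()) == 1  # sanity check: leaf probabilities sum to 1

# The event "red" is the sum over all paths ending in R.
p_red = sum(p for (urn, colour), p in leaves.items() if colour == "R")
print(p_red)  # 8/15
```

Using `fractions.Fraction` rather than floats keeps every leaf probability exact, so the sum-to-one check can be an exact equality.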
The total probability theorem
What you just did has a name. You partitioned the sample space by the urn — every outcome belongs to exactly one of the three urn-categories — and then added up the conditional probabilities for each category, each weighted by the probability of that category.
Total probability theorem
Let B_1, B_2, \ldots, B_n be a partition of the sample space S. That is:
- The events B_i are pairwise disjoint: B_i \cap B_j = \varnothing whenever i \neq j.
- Their union is the whole sample space: B_1 \cup B_2 \cup \cdots \cup B_n = S.
- Each B_i has positive probability: P(B_i) > 0.
Then for any event A,

P(A) = \sum_{i=1}^{n} P(A \mid B_i) \, P(B_i).
Reading the theorem. A partition is a way of splitting the sample space into non-overlapping "branches." For any event A whose probability you want, find the conditional probability of A given each branch, multiply by the probability of that branch, and add up the products. The result is the total probability of A.
Why is this true? Using the definition of conditional probability, P(A \mid B_i) P(B_i) = P(A \cap B_i). So the right-hand side of the theorem is

\sum_{i=1}^{n} P(A \cap B_i).
Because the B_i partition the sample space, the events A \cap B_1, A \cap B_2, \ldots, A \cap B_n are pairwise disjoint and their union is exactly A. By the additivity of probability, the sum equals P(A). Done.
In the urn problem, the partition was \{A, B, C\} — the three urns. The event was R (red ball). The theorem gave

P(R) = \tfrac{1}{3} \cdot \tfrac{3}{5} + \tfrac{1}{3} \cdot \tfrac{1}{5} + \tfrac{1}{3} \cdot \tfrac{4}{5} = \tfrac{8}{15}.
Same answer, in equation form. The tree is the picture; the theorem is the algebra.
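The algebraic form translates directly into a small reusable helper. In this sketch, `total_probability` is a hypothetical name of my own; it is nothing more than the weighted sum from the theorem, plus a check that the inputs really form a partition:

```python
from fractions import Fraction as F

def total_probability(priors, conditionals):
    """P(A) = sum_i P(A | B_i) P(B_i) over a partition {B_i}."""
    assert sum(priors.values()) == 1, "partition probabilities must sum to 1"
    return sum(conditionals[b] * p for b, p in priors.items())

# Urn example: partition {A, B, C}, event R = "red ball".
p_red = total_probability(
    priors={"A": F(1, 3), "B": F(1, 3), "C": F(1, 3)},
    conditionals={"A": F(3, 5), "B": F(1, 5), "C": F(4, 5)},
)
print(p_red)  # 8/15
```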
Two worked examples
Example 1: a factory with three machines
A factory has three machines producing the same component. Machine M_1 produces 50\% of the total output and has a defect rate of 2\%. Machine M_2 produces 30\% with a defect rate of 3\%. Machine M_3 produces 20\% with a defect rate of 5\%. A component is selected at random from the day's production. What is the probability that it is defective?
Step 1. Identify the partition. The components can be split by which machine produced them: M_1, M_2, M_3. Every component belongs to exactly one of these, so they form a partition.
Why: before you apply the total probability theorem, you must have a partition. Here the "cause" is the machine, and each component has exactly one source.
Step 2. List the partition probabilities and the conditional probabilities.
| B_i | P(B_i) | P(\text{defective} \mid B_i) |
|---|---|---|
| M_1 | 0.50 | 0.02 |
| M_2 | 0.30 | 0.03 |
| M_3 | 0.20 | 0.05 |
Step 3. Apply the total probability theorem.

P(D) = 0.50 \cdot 0.02 + 0.30 \cdot 0.03 + 0.20 \cdot 0.05 = 0.010 + 0.009 + 0.010 = 0.029
Why: each product is the fraction of components that are both from that machine and defective; adding them across all machines gives the overall defect rate.
Step 4. Interpret. About 2.9\% of the factory's overall output is defective. Notice that this is not the simple average (2 + 3 + 5)/3 = 3.33\% — it is a weighted average, with the machines' shares of the output doing the weighting. Since M_1 has both the largest share and the lowest defect rate, it pulls the overall rate below the simple average.
Result: The probability that a random component is defective is P(D) = 0.029, or 2.9\%.
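The same weighted sum in a few lines of Python (the dictionary names are mine, for illustration):

```python
# Factory example: machine shares and defect rates from the problem statement.
shares = {"M1": 0.50, "M2": 0.30, "M3": 0.20}
defect_rate = {"M1": 0.02, "M2": 0.03, "M3": 0.05}

# Total probability: sum over the partition of share * conditional defect rate.
p_defective = sum(shares[m] * defect_rate[m] for m in shares)
print(round(p_defective, 3))  # 0.029
```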
Example 2: two bags and a two-stage draw
You have two bags. Bag 1 contains 4 red and 2 blue balls. Bag 2 contains 2 red and 6 blue balls. You pick a bag uniformly at random, draw a ball, note its colour, and without returning it, draw a second ball from the same bag. What is the probability that the second ball is red?
Step 1. Notice that this is a three-stage experiment: pick a bag, draw the first ball, draw the second ball. Both the bag choice and the first draw are relevant to the probability of the second draw.
Step 2. Partition the sample space by the pair (bag, first-ball colour). The four cases are:
- Bag 1, first ball red
- Bag 1, first ball blue
- Bag 2, first ball red
- Bag 2, first ball blue
These four cases are disjoint and cover everything.
Why: a clean partition needs to capture every piece of information that affects the probability of the target event. Both the bag and the first colour affect the composition of the bag for the second draw.
Step 3. Compute the probability of each branch and the conditional probability of "second ball red" given each branch.
Probability of picking bag 1 is \tfrac{1}{2}. Given bag 1, probability of first ball red is \tfrac{4}{6} = \tfrac{2}{3}, probability of first ball blue is \tfrac{2}{6} = \tfrac{1}{3}. So

P(\text{bag 1, first red}) = \tfrac{1}{2} \cdot \tfrac{2}{3} = \tfrac{1}{3}, \qquad P(\text{bag 1, first blue}) = \tfrac{1}{2} \cdot \tfrac{1}{3} = \tfrac{1}{6}.
For bag 2: probability of first red is \tfrac{2}{8} = \tfrac{1}{4}, probability of first blue is \tfrac{6}{8} = \tfrac{3}{4}. So

P(\text{bag 2, first red}) = \tfrac{1}{2} \cdot \tfrac{1}{4} = \tfrac{1}{8}, \qquad P(\text{bag 2, first blue}) = \tfrac{1}{2} \cdot \tfrac{3}{4} = \tfrac{3}{8}.
Now compute the conditional probability of "second ball red" for each branch. After drawing from bag 1 without replacement:
- After bag 1 first red, bag 1 has 3 red and 2 blue, so P(\text{2nd red}) = \tfrac{3}{5}.
- After bag 1 first blue, bag 1 has 4 red and 1 blue, so P(\text{2nd red}) = \tfrac{4}{5}.
For bag 2:
- After bag 2 first red, bag 2 has 1 red and 6 blue, so P(\text{2nd red}) = \tfrac{1}{7}.
- After bag 2 first blue, bag 2 has 2 red and 5 blue, so P(\text{2nd red}) = \tfrac{2}{7}.
Why: without replacement, the second draw's distribution depends on what was taken out on the first draw — that is why you must condition on the first colour, not just the bag.
Step 4. Combine.

P(\text{2nd red}) = \tfrac{1}{3} \cdot \tfrac{3}{5} + \tfrac{1}{6} \cdot \tfrac{4}{5} + \tfrac{1}{8} \cdot \tfrac{1}{7} + \tfrac{3}{8} \cdot \tfrac{2}{7}
Work each term:
- \tfrac{1}{3} \cdot \tfrac{3}{5} = \tfrac{1}{5}
- \tfrac{1}{6} \cdot \tfrac{4}{5} = \tfrac{4}{30} = \tfrac{2}{15}
- \tfrac{1}{8} \cdot \tfrac{1}{7} = \tfrac{1}{56}
- \tfrac{3}{8} \cdot \tfrac{2}{7} = \tfrac{6}{56} = \tfrac{3}{28}
Find a common denominator. \tfrac{1}{5} + \tfrac{2}{15} = \tfrac{3}{15} + \tfrac{2}{15} = \tfrac{5}{15} = \tfrac{1}{3}. And \tfrac{1}{56} + \tfrac{3}{28} = \tfrac{1}{56} + \tfrac{6}{56} = \tfrac{7}{56} = \tfrac{1}{8}.
So

P(\text{2nd red}) = \tfrac{1}{3} + \tfrac{1}{8} = \tfrac{8}{24} + \tfrac{3}{24} = \tfrac{11}{24}.
Result: The probability that the second ball is red is \tfrac{11}{24}.
Notice one thing about the answer: \tfrac{11}{24} is less than \tfrac{1}{2}, and it is exactly what you would get for P(\text{first ball red}) too — because the two marginal draws from the same randomly-chosen bag have the same distribution. The first ball and the second ball, unconditionally, are equally likely to be red. This is a subtle consistency check: the two draws are not independent, but they are exchangeable, and exchangeable variables have the same marginal distribution.
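The four-cell partition, and the exchangeability check, can both be verified with exact fractions. A minimal sketch (dictionary names are mine):

```python
from fractions import Fraction as F

# Four-cell partition (bag, first-ball colour) with branch probabilities.
branch_prob = {
    ("bag1", "red"):  F(1, 2) * F(2, 3),   # 1/3
    ("bag1", "blue"): F(1, 2) * F(1, 3),   # 1/6
    ("bag2", "red"):  F(1, 2) * F(1, 4),   # 1/8
    ("bag2", "blue"): F(1, 2) * F(3, 4),   # 3/8
}
# P(2nd red | branch), from the bag's contents after the first draw.
second_red_given = {
    ("bag1", "red"):  F(3, 5),
    ("bag1", "blue"): F(4, 5),
    ("bag2", "red"):  F(1, 7),
    ("bag2", "blue"): F(2, 7),
}

p_second_red = sum(branch_prob[c] * second_red_given[c] for c in branch_prob)
print(p_second_red)  # 11/24

# Exchangeability check: the unconditional first draw has the same probability.
p_first_red = branch_prob[("bag1", "red")] + branch_prob[("bag2", "red")]
assert p_first_red == p_second_red
```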
Tactics for complex tree problems
Complex conditional-probability problems all reduce to the same workflow. Here are the moves that reliably work.
Draw the tree before you compute anything. The tree forces you to be explicit about which events are first-stage, which are second-stage, and how the branches split. Trying to compute probabilities in your head without a tree is where students lose track.
Each complete path has probability equal to the product of its branches. This is the multiplication rule, P(A \cap B) = P(A) P(B \mid A), applied stage by stage.
The probabilities of all leaves sum to 1. Always. If they don't, you made a computational error somewhere. Sum-to-one is your most important sanity check.
The probability of any event is the sum of the leaf probabilities for all paths that reach it. This is the total probability theorem, phrased in tree language.
When an event can be reached along several different paths, sum every one of them. For example, "at least one red in two draws" is a single event, but it happens along three different root-to-leaf paths — (R, R), (R, B), (B, R) — and you must add all three, not just pick one. (Equivalently, compute the complement: one minus the probability of the single path (B, B).) Missing paths is the other way students lose answers.
Check symmetries when they exist. If the problem is symmetric under swapping R and B, or under swapping two identical bags, the answer should respect that symmetry. If it doesn't, look for an error.
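These tactics lend themselves to a generic tree walk. The sketch below (a structure of my own devising, not from the text) builds the full tree for two draws without replacement from a bag of 4 red and 2 blue balls, checks that the leaves sum to 1, and sums all three paths for "at least one red":

```python
from fractions import Fraction as F

def draw_tree(red, blue, depth):
    """Yield (path, probability) for `depth` draws without replacement
    from a bag holding `red` red and `blue` blue balls."""
    if depth == 0:
        yield (), F(1)
        return
    total = red + blue
    if red:  # branch: draw a red ball, multiply by the branch probability
        for path, p in draw_tree(red - 1, blue, depth - 1):
            yield ("R",) + path, F(red, total) * p
    if blue:  # branch: draw a blue ball
        for path, p in draw_tree(red, blue - 1, depth - 1):
            yield ("B",) + path, F(blue, total) * p

leaves = dict(draw_tree(4, 2, 2))
assert sum(leaves.values()) == 1  # leaves always sum to 1

# "At least one red" collects three paths: (R,R), (R,B), (B,R).
p_at_least_one_red = sum(p for path, p in leaves.items() if "R" in path)
print(p_at_least_one_red)  # 14/15
```

Note the same answer drops out of the complement: 1 - \tfrac{2}{6} \cdot \tfrac{1}{5} = \tfrac{14}{15}.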
Common confusions
- "P(A \mid B) equals P(B \mid A)." Almost never. P(A \mid B) = \frac{P(A \cap B)}{P(B)} and P(B \mid A) = \frac{P(A \cap B)}{P(A)}. The two denominators are different, so the fractions are generally different. Swapping them is the most common error in Bayes-flavoured problems. (The relationship between the two is Bayes' theorem, covered in the next article.)
- "The partition can overlap as long as it covers the sample space." No. The whole point of a partition is that every outcome belongs to exactly one branch. If two branches share an outcome, you will double-count when you add up leaf probabilities.
- "You can ignore the first stage once you know the second stage's conditional probabilities." Only if the stages are independent. When the first stage affects the second (like drawing without replacement), the first-stage distribution determines which second-stage distribution applies, and you must condition on the first stage even when computing a second-stage probability.
- "Adding the two second-draw-red conditional probabilities gives you P(\text{2nd red})." No — you need to weight them by how likely each first-draw outcome was. Unweighted sums of conditional probabilities are almost never the right answer.
- "If the bags are identical, the choice doesn't matter." True only when the bags are literally identical. In the factory example, the three machines produce different fractions of the output and have different defect rates — so the choice very much matters, and you cannot just pick one machine's rate.
Going deeper
If you can draw trees and apply the total probability theorem, you have the working toolkit. The rest is for readers who want to see how total probability connects to Bayes' theorem, how it generalises to continuous settings, and what to do when the partition has to be refined during the problem.
From total probability to Bayes
The total probability theorem tells you how to compute P(A) from a partition. Bayes' theorem goes the other way: given that A occurred, it tells you which branch B_i is most likely to have caused it.

P(B_i \mid A) = \frac{P(A \mid B_i) \, P(B_i)}{\sum_{j=1}^{n} P(A \mid B_j) \, P(B_j)}
The denominator is exactly the total probability of A — the answer you just computed. The numerator is the probability of the specific path B_i \to A. So Bayes' theorem is saying: "given that A happened, the probability it came from branch i is the weight of path i divided by the total weight of all paths that reach A."
For the factory problem, given that a component is defective, which machine most likely produced it?

P(M_1 \mid D) = \frac{0.50 \cdot 0.02}{0.029} \approx 0.345, \qquad P(M_2 \mid D) = \frac{0.30 \cdot 0.03}{0.029} \approx 0.310, \qquad P(M_3 \mid D) = \frac{0.20 \cdot 0.05}{0.029} \approx 0.345
Roughly a 31\% chance the defective component came from M_2, even though M_2 only produces 30\% of the output — the conditional probability is nudged up slightly because M_2's defect rate is above the overall average. You will see Bayes' theorem in its own article; everything you need to state it is in the total probability theorem you already know.
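The posterior over machines is the factory computation divided through by the total probability of "defective". A short sketch (dictionary names are mine):

```python
# Machine shares and defect rates from the factory example.
shares = {"M1": 0.50, "M2": 0.30, "M3": 0.20}
defect_rate = {"M1": 0.02, "M2": 0.03, "M3": 0.05}

# Denominator: the total probability of a defective component.
p_defective = sum(shares[m] * defect_rate[m] for m in shares)

# Bayes: each path weight divided by the total weight of all paths to "defective".
posterior = {m: shares[m] * defect_rate[m] / p_defective for m in shares}
for m, p in posterior.items():
    print(m, round(p, 3))
```

The posterior probabilities sum to 1 by construction, which is another handy sanity check.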
Continuous partitions
The partition in the total probability theorem does not have to be finite, and it does not even have to be countable. If Y is a continuous random variable and A is some event whose probability depends on Y, then

P(A) = \int P(A \mid Y = y) \, f_Y(y) \, dy,
where f_Y is the density of Y and the integral is over the range of Y. This is the continuous analogue of the discrete partition: instead of summing over n branches, you integrate over a continuum of "branches." Conceptually, nothing changes — it is the same theorem with a richer partition.
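As a concrete sketch of this (an example of my own, not from the text): a coin whose bias Y is uniform on [0, 1], so P(\text{heads} \mid Y = y) = y and f_Y(y) = 1. The integral \int_0^1 y \cdot 1 \, dy = \tfrac{1}{2} can be approximated by a midpoint Riemann sum:

```python
# Continuous total probability, approximated numerically:
# P(heads) = integral of P(heads | Y=y) * f_Y(y) dy over [0, 1],
# with P(heads | Y=y) = y and f_Y(y) = 1 (uniform bias).
N = 100_000
dy = 1.0 / N
p_heads = sum((i + 0.5) * dy * 1.0 * dy for i in range(N))  # midpoint rule
print(round(p_heads, 6))  # 0.5
```

Conceptually this is the same weighted sum as before, just over a fine grid of "branches" instead of a handful.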
Refining a partition
Sometimes the partition you start with is too coarse, and you need to refine it to compute the answer. In the two-bag example, a first attempt might partition by "which bag" alone and ask "given a bag, what is the probability the second ball is red?" — but without knowing the first draw, the second draw is not just a simple fraction of reds in the bag. You have to split each bag-branch further by what happened on the first draw, giving a four-cell partition.
This is a general tactic: when the conditional probability on a branch is itself a compound probability, refine the branch into sub-branches until the conditional probability on each leaf is elementary. The sum P(A) = \sum P(A \mid B_i) P(B_i) still applies — with the refined partition the sum has more terms, but each term is easier to compute.
Where this leads next
You now have the machinery to handle multi-stage probability problems with confidence. The next set of ideas takes this machinery in two directions: the reverse direction (Bayes' theorem — inference from effect to cause) and the expectation direction (computing average values of random variables using a partition).
- Bayes Theorem — how to update beliefs about which branch you are on, given an observation at the leaf. The total probability theorem is the denominator of every Bayes calculation.
- Random Variables - Discrete — how to package the outcomes of a random experiment as a number, and compute its distribution.
- Expectation and Variance - Discrete — computing the average value of a random variable, often using a partition argument that mirrors the total probability theorem.
- Introduction to Inference — using conditional probability to reason backwards from data to hypotheses about the world.