In short

The conditional probability of A given B is the probability that A happens once you already know B has happened. Its definition is

P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{(provided } P(B) > 0 \text{)}.

Rearranging this gives the multiplication theorem P(A \cap B) = P(B) \cdot P(A \mid B), which is the standard tool for computing probabilities of chains of dependent events.

You roll two dice under a cup. You cannot see the dice, but a friend who can see them tells you: "the sum is at least 10." What is the probability that you rolled a pair of sixes?

Think about what has changed. Before your friend spoke, you had 36 equally likely outcomes and exactly one of them was (6, 6), so the probability of double sixes was \dfrac{1}{36} \approx 0.028. That number described your knowledge of the experiment before anyone told you anything.

After your friend speaks, your knowledge is different. You have eliminated all the outcomes whose sum is less than 10. What is left is just a handful of outcomes — (4, 6), (5, 5), (6, 4), (5, 6), (6, 5), (6, 6) — six outcomes in total. Under the old assumption of equally likely rolls, those six are still equally likely among themselves, and exactly one of them is (6, 6). So the probability of double sixes, given that the sum is at least 10, is \dfrac{1}{6} \approx 0.167.
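The counting above is easy to check by brute force. A short enumeration (an illustrative sketch) of all 36 ordered rolls:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes of two dice.
rolls = list(product(range(1, 7), repeat=2))

# Condition: the friend says the sum is at least 10.
given = [r for r in rolls if sum(r) >= 10]

# Among the surviving outcomes, count double sixes.
favourable = [r for r in given if r == (6, 6)]

p_unconditional = Fraction(1, len(rolls))              # 1/36
p_conditional = Fraction(len(favourable), len(given))  # 1/6

print(p_unconditional, p_conditional)  # 1/36 1/6
```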

The probability changed — not because the dice changed (they already landed, before your friend looked) but because your information changed. Conditional probability is the machinery for updating probabilities when new information comes in. It is the tool that makes probability responsive to evidence, and it is the heart of every statistical inference, every spam filter, every weather forecast, every medical test. The formula behind it is a single fraction, but it powers a startling amount of the world.

The concept image

Here is the image to hold in your head. You start with a rectangle — the sample space S — and two overlapping regions A and B inside it. The probability of A is the area of A divided by the area of S. Straightforward.

Now someone tells you "B happened." Immediately your world shrinks. The parts of S outside B are no longer possible. The only outcomes left are those in B — everything else has been ruled out. Your new universe is B, and you are still asking: "where inside this new universe is A?" That question is asking about the overlap A \cap B, which is the part of A that survives inside the new universe B.

The new probability of A is the ratio of the area of A \cap B to the area of B — not the area of S. Because B is now the entire world for you, and you are measuring A \cap B against that new total.

*Figure: conditioning on event $B$ restricts the universe to $B$ (two Venn diagrams, before and after $B$ is known).*
Before you learn anything, $P(A)$ is the fraction of $S$ that is in $A$. After you learn that $B$ has occurred, your universe shrinks to $B$, and the probability of $A$ becomes the fraction of $B$ that is in $A$ — the ratio of the overlap to $B$ alone.

That single picture is the whole content of the definition. Everything that follows is algebra.

The definition

Conditional probability

Let A and B be events in a sample space S with P(B) > 0. The conditional probability of A given B, written P(A \mid B), is

P(A \mid B) \;=\; \frac{P(A \cap B)}{P(B)}.

It is the probability that A occurs, recomputed under the assumption that B has already occurred.

Read the definition carefully. The numerator is the probability of both A and B happening — the part of the original probability "accounting for" the overlap between the two events. The denominator is the probability of B alone — the size of the new universe. Dividing one by the other rescales the overlap against the new universe, so that everything inside B has conditional probabilities that add up to 1.

The requirement P(B) > 0 is there because you cannot divide by zero. "What is the probability of A given an event that can't happen?" is not a meaningful question.

Why the formula has to be this

There is a way to derive the formula from the classical picture, which is worth doing at least once so that the formula stops feeling arbitrary.

Imagine the sample space S has n equally likely outcomes. Let A have n(A) outcomes, B have n(B) outcomes, and A \cap B have n(A \cap B) outcomes. By classical probability,

P(A) = \frac{n(A)}{n}, \qquad P(B) = \frac{n(B)}{n}, \qquad P(A \cap B) = \frac{n(A \cap B)}{n}.

Now restrict your attention to the outcomes in B. These n(B) outcomes are still equally likely (nothing has changed about the underlying experiment — only your information). Among these n(B) outcomes, the ones where A is also true are exactly the n(A \cap B) outcomes in the overlap. By the classical formula applied inside this new universe,

P(A \mid B) = \frac{n(A \cap B)}{n(B)}.

Divide the numerator and the denominator by n:

P(A \mid B) = \frac{n(A \cap B)/n}{n(B)/n} = \frac{P(A \cap B)}{P(B)}.

That is exactly the definition. The formula is not an arbitrary choice — it is forced on you the moment you accept that "restricting attention to B" means "throwing away outcomes not in B, keeping the rest equally likely."
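In a finite, equally likely sample space the two sides of this identity can be compared directly. A quick numerical check (a sketch, reusing the dice example from the opening):

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))   # sample space, n = 36
A = {r for r in S if r == (6, 6)}          # double sixes
B = {r for r in S if sum(r) >= 10}         # sum at least 10

n = len(S)
# Left-hand side: count inside the restricted universe B.
lhs = Fraction(len(A & B), len(B))
# Right-hand side: ratio of the unconditional probabilities.
rhs = Fraction(len(A & B), n) / Fraction(len(B), n)

assert lhs == rhs == Fraction(1, 6)
```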

Properties of conditional probability

Once you have P(\cdot \mid B), it behaves exactly like an ordinary probability function — a fact worth checking because it is the reason every theorem about probability works inside a conditional world too.

Property 1: Non-negativity. P(A \mid B) \ge 0 for any event A, because both P(A \cap B) and P(B) are non-negative and P(B) > 0.

Property 2: Certainty inside B. P(B \mid B) = \dfrac{P(B \cap B)}{P(B)} = \dfrac{P(B)}{P(B)} = 1. Once B is known, B is certain — which is common sense.

Property 3: Countable additivity. If A_1 and A_2 are mutually exclusive (i.e., A_1 \cap A_2 = \emptyset), then

P(A_1 \cup A_2 \mid B) = P(A_1 \mid B) + P(A_2 \mid B).

The proof is a line of algebra: (A_1 \cup A_2) \cap B = (A_1 \cap B) \cup (A_2 \cap B), and the two pieces on the right are still mutually exclusive, so their probabilities add. Divide both sides by P(B) and you get the identity. The same argument works for any countable collection of pairwise disjoint events, which is what gives full countable additivity.

These three properties are exactly the axioms of probability. So P(\cdot \mid B) really is a probability function — a new one, operating on the same sample space, but with the outcomes outside B stripped out.

Property 4: Complement inside the condition. P(A^c \mid B) = 1 - P(A \mid B). This is a direct consequence of Property 3 applied to the partition \{A, A^c\}.

Property 5: Monotonicity. If A_1 \subseteq A_2, then P(A_1 \mid B) \le P(A_2 \mid B). A larger event in the original world is still a larger event in the conditional world — which you can prove by dividing the usual monotonicity inequality by P(B).
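All five properties can be spot-checked numerically. A sketch (with events chosen just for illustration) that verifies them on the two-dice sample space:

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))

def P(event):
    """Classical probability on the two-dice sample space."""
    return Fraction(len(event), len(S))

def P_given(A, B):
    """Conditional probability P(A | B); requires P(B) > 0."""
    assert P(B) > 0
    return P(A & B) / P(B)

B = {r for r in S if sum(r) >= 10}          # sum at least 10
A1 = {r for r in S if r[0] == 6}            # first die shows 6
A2 = {r for r in S if r[0] >= 5}            # first die shows 5 or 6, so A1 is a subset of A2
A3 = {r for r in S if r[0] == 5}            # first die shows 5; disjoint from A1

assert P_given(A1, B) >= 0                                        # Property 1: non-negativity
assert P_given(B, B) == 1                                         # Property 2: certainty inside B
assert P_given(A1 | A3, B) == P_given(A1, B) + P_given(A3, B)     # Property 3: additivity
assert P_given(S - A1, B) == 1 - P_given(A1, B)                   # Property 4: complement
assert P_given(A1, B) <= P_given(A2, B)                           # Property 5: monotonicity
```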

The multiplication theorem

Take the definition of conditional probability and multiply both sides by P(B):

P(A \cap B) = P(B) \cdot P(A \mid B).

That is the multiplication theorem. In words: "the probability that both A and B happen equals the probability that B happens times the probability that A happens given that B already did."

There is, of course, no reason to prefer B as the first event — you can equally well condition on A:

P(A \cap B) = P(A) \cdot P(B \mid A).

Both are correct; which one you use depends on which conditional probability is easier to compute in your problem.

Multiplication for three events. You can chain the theorem to handle three events. For A, B, C with P(A \cap B) > 0:

P(A \cap B \cap C) = P(A) \cdot P(B \mid A) \cdot P(C \mid A \cap B).

The pattern continues. For n events, you take the probability of the first, multiply by the probability of the second given the first, multiply by the probability of the third given the first two, and so on. This is the standard tool for computing probabilities of sequences of dependent events — drawing cards without replacement, picking balls from an urn without putting them back, progressing through stages of a game.
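The chained pattern fits in a small helper. A sketch (the name `chain_probability` is just for illustration) that multiplies the successive conditional factors, here for drawing three aces in a row without replacement:

```python
from fractions import Fraction

def chain_probability(factors):
    """Multiply a sequence of conditional probability factors:
    P(A1) * P(A2 | A1) * P(A3 | A1 and A2) * ..."""
    result = Fraction(1)
    for f in factors:
        result *= f
    return result

# Three aces in a row from a 52-card deck, without replacement:
# P(A1) = 4/52, P(A2 | A1) = 3/51, P(A3 | A1 and A2) = 2/50.
p = chain_probability([Fraction(4, 52), Fraction(3, 51), Fraction(2, 50)])
print(p)  # 1/5525
```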

Example 1: drawing from a deck without replacement

Example 1: two aces in a row

Two cards are drawn one after the other from a standard deck of 52, without replacement. Find the probability that both cards are aces.

Step 1. Define the events. Let A = "first card is an ace" and B = "second card is an ace." You want P(A \cap B).

Why: the question "both cards are aces" is an intersection — both events must hold on the same trial.

Step 2. Compute P(A). There are 4 aces out of 52 cards, and the first card is chosen uniformly at random.

P(A) = \frac{4}{52} = \frac{1}{13}.

Why: on the first draw, the deck is untouched, and classical probability gives the fraction 4/52 directly.

Step 3. Compute P(B \mid A). Given that the first card was an ace, the deck now has 51 cards and only 3 of them are aces. Under this conditional information,

P(B \mid A) = \frac{3}{51} = \frac{1}{17}.

Why: the first card is gone from the deck. The new sample space has 51 equally likely outcomes (the remaining cards), of which 3 are aces. Classical probability applied to this new, smaller sample space gives the conditional probability.

Step 4. Apply the multiplication theorem.

P(A \cap B) = P(A) \cdot P(B \mid A) = \frac{1}{13} \cdot \frac{1}{17} = \frac{1}{221}.

Result: The probability of drawing two aces in a row is \dfrac{1}{221} \approx 0.0045.

*Figure: tree diagram for drawing two aces without replacement.*
The tree diagram for two successive draws without replacement. The probability of the path *"ace, ace"* is the product of the probability on each edge: $\dfrac{4}{52} \cdot \dfrac{3}{51} = \dfrac{1}{221}$.
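The answer can also be confirmed by enumerating every ordered pair of distinct cards. An exhaustive check (an illustrative sketch, representing each card by an index 0 to 51 with the first four indices standing for the aces):

```python
from fractions import Fraction
from itertools import permutations

deck = range(52)        # cards 0..51; let 0..3 stand for the four aces

def is_ace(card):
    return card < 4

# All 52 * 51 = 2652 equally likely ordered two-card draws.
draws = list(permutations(deck, 2))
both_aces = [d for d in draws if is_ace(d[0]) and is_ace(d[1])]

p = Fraction(len(both_aces), len(draws))
print(p)  # 1/221
```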

Example 2: the two-children puzzle

Example 2: a family has two children

A family is known to have exactly two children. You are told that at least one of them is a girl. What is the probability that both children are girls? Assume each child is independently a boy or a girl with probability 1/2.

Step 1. List the sample space. Using B for boy and G for girl, and writing older child first, the sample space of two-child families is

S = \{BB, BG, GB, GG\},

and each of the four outcomes is equally likely (probability 1/4).

Why: with two children and two possibilities each, there are 2 \times 2 = 4 equally likely outcomes. Writing birth order explicitly keeps the outcomes genuinely uniform.

Step 2. Define the events. Let E = "at least one girl" and F = "both girls." You want P(F \mid E).

E = \{BG, GB, GG\}, \qquad F = \{GG\}.

Why: translate the English into subsets of S. "At least one girl" keeps any family with one or more girls, which rules out BB and keeps the other three. "Both girls" is the single outcome GG.

Step 3. Compute P(E), P(F \cap E), and apply the definition.

P(E) = \frac{3}{4}, \qquad P(F \cap E) = P(\{GG\}) = \frac{1}{4}.

Note that F \subseteq E, so F \cap E = F itself.

P(F \mid E) = \frac{P(F \cap E)}{P(E)} = \frac{1/4}{3/4} = \frac{1}{3}.

Why: the conditional probability rescales the overlap (F \cap E) against the new universe (E). Before the condition, P(F) = 1/4. After the condition, P(F \mid E) = 1/3 — the probability has increased because knowing E excluded one outcome (BB) that was not in F either.

Result: The probability that both children are girls, given that at least one is a girl, is \dfrac{1}{3} — not \dfrac{1}{2} as many people's first guess.

*Figure: the two-children sample space with conditioning; $BB$ is crossed out and $GG$ is highlighted as the favourable outcome.*
Conditioning on *"at least one girl"* removes $BB$ from the sample space. The remaining three outcomes ($BG$, $GB$, $GG$) are still equally likely, so each has probability $1/3$. The favourable outcome $GG$ is just one of them, giving $P(F \mid E) = 1/3$.

The answer \dfrac{1}{3} often surprises people. The naive guess is \dfrac{1}{2}: "the other child is either a boy or a girl, so the probability is one half." But that reasoning treats BG and GB as the same outcome, which they are not in a sample space that distinguishes birth order. The correct reasoning keeps them separate, notices that the condition "at least one girl" eliminates only BB, and finds that the overlap GG is one out of three equally likely surviving outcomes. Conditional probability is full of little shocks like this — it is why careful bookkeeping matters.
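This kind of bookkeeping is easy to mechanise. A sketch that enumerates the four families and applies the definition directly:

```python
from fractions import Fraction
from itertools import product

# Sample space: (older child, younger child), each B or G, equally likely.
S = list(product("BG", repeat=2))       # [('B','B'), ('B','G'), ('G','B'), ('G','G')]

E = [f for f in S if "G" in f]          # at least one girl
F = [f for f in S if f == ("G", "G")]   # both girls

# P(F | E) = |F intersect E| / |E|, since all outcomes are equally likely.
p_F_given_E = Fraction(len([f for f in E if f in F]), len(E))
print(p_F_given_E)  # 1/3
```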

Common confusions

A few things students reliably get wrong about conditional probability.

Going deeper

The definition and the multiplication theorem are enough to compute conditional probabilities in any problem you will meet in a first course. The rest of this section covers the law of total probability and the connection to the axiomatic framework.

Law of total probability

Suppose you can't directly compute P(A), but you can compute P(A \mid B) and P(A \mid B^c) — the probability of A under each of two scenarios. Then you can recover P(A) by averaging:

P(A) = P(A \mid B) \cdot P(B) + P(A \mid B^c) \cdot P(B^c).

The proof is one line. Partition A as A = (A \cap B) \cup (A \cap B^c). These two pieces are disjoint, so their probabilities add:

P(A) = P(A \cap B) + P(A \cap B^c).

Apply the multiplication theorem to each:

P(A) = P(B) \cdot P(A \mid B) + P(B^c) \cdot P(A \mid B^c).

This formula, and its multi-scenario generalisation P(A) = \sum_i P(A \mid B_i) P(B_i) over any partition \{B_i\} of the sample space, is called the law of total probability. It is the tool that lets you compute the unconditional probability of a complicated event by breaking it into cleaner conditional pieces. It is also the scaffold that Bayes' Theorem is built on.
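The multi-scenario form translates directly into code. A sketch (the helper name and the two-machine factory numbers are invented for illustration) that averages conditional probabilities over a partition:

```python
from fractions import Fraction

def total_probability(scenarios):
    """Law of total probability: P(A) = sum over i of P(A | B_i) * P(B_i).
    `scenarios` is a list of (P(A | B_i), P(B_i)) pairs over a partition."""
    priors = [p_b for _, p_b in scenarios]
    assert sum(priors) == 1, "the B_i must partition the sample space"
    return sum(p_a_given_b * p_b for p_a_given_b, p_b in scenarios)

# Hypothetical example: machine 1 makes 60% of the parts with a 2% defect
# rate; machine 2 makes 40% of the parts with a 5% defect rate.
p_defect = total_probability([
    (Fraction(2, 100), Fraction(60, 100)),   # P(defect | machine 1), P(machine 1)
    (Fraction(5, 100), Fraction(40, 100)),   # P(defect | machine 2), P(machine 2)
])
print(p_defect)  # 4/125, i.e. 0.032
```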

A concrete use: the test for a rare disease

Suppose a disease affects 1\% of the population. A test for the disease is 99\% accurate in both directions: it returns positive on 99\% of people who have the disease, and negative on 99\% of people who don't. A person takes the test and gets a positive result. What is the probability they actually have the disease?

Let D = "person has the disease," and + = "test is positive." You know P(D) = 0.01, P(+ \mid D) = 0.99, and P(+ \mid D^c) = 0.01. You want P(D \mid +).

By the definition, P(D \mid +) = \dfrac{P(D \cap +)}{P(+)}. The numerator is P(D) \cdot P(+ \mid D) = 0.01 \cdot 0.99 = 0.0099. The denominator, by the law of total probability, is

P(+) = P(+ \mid D) P(D) + P(+ \mid D^c) P(D^c) = 0.99 \cdot 0.01 + 0.01 \cdot 0.99 = 0.0198.

So

P(D \mid +) = \frac{0.0099}{0.0198} = 0.5.

Half. Even with a 99\% accurate test and a positive result, the probability you actually have the disease is only fifty percent, because the disease is rare and a huge number of "false positives" come from the healthy population. This is a standard puzzle in medical statistics and it is why doctors repeat tests. It also sets up the subject of the next article: Bayes' theorem, which is exactly the formula for P(D \mid +) expressed compactly.
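The arithmetic above fits in a few lines. A sketch that reproduces the computation with exact rational arithmetic:

```python
from fractions import Fraction

p_D = Fraction(1, 100)                # P(D): disease prevalence
p_pos_given_D = Fraction(99, 100)     # P(+ | D): positive if diseased
p_pos_given_not_D = Fraction(1, 100)  # P(+ | D^c): false-positive rate

# Law of total probability for the denominator.
p_pos = p_pos_given_D * p_D + p_pos_given_not_D * (1 - p_D)

# Definition of conditional probability for P(D | +).
p_D_given_pos = (p_pos_given_D * p_D) / p_pos
print(p_D_given_pos)  # 1/2
```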

Where this leads next

You now know what conditional probability is, how to compute it from the definition, and how it combines with the multiplication theorem to handle sequences of dependent events. The next articles push the idea further.