In short

Bayes' theorem tells you how to update a probability after seeing new evidence:

P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}.

The left side is the posterior — your new probability of A after learning B. On the right, P(A) is the prior — your probability of A before. The formula is the single most important tool for inference, the backbone of every spam filter, and the reason a positive medical test for a rare disease does not mean you have it.

A rare disease affects one person in every thousand. A highly accurate test exists: if you have the disease, the test comes back positive 99\% of the time. If you don't have the disease, the test comes back positive only 1\% of the time. You take the test. It is positive. What is the probability that you actually have the disease?

Most people's first guess is something like "the test is 99\% accurate, so there's a 99\% chance I have the disease." Hold that answer in your head — because by the end of this article you will see that the correct answer is about 9\%, not 99\%. Not 99, nine. The difference is not a rounding error. The difference is that most people, including most doctors in informal surveys, answer this kind of question wrong by a factor of ten, and the reason is that they are skipping the step of "what fraction of the population even has the disease in the first place." That step is the prior probability, and Bayes' theorem is the formula that forces you to use it.

You have already met the one ingredient you need: conditional probability. Bayes' theorem is a two-line rearrangement of the definition of conditional probability, and its importance is wildly out of proportion to how hard it is to derive. Once you have it, you can do things that seem almost magical — take a piece of evidence, any piece of evidence, and read off the revised probability of any hypothesis you care about. Spam filters do this for every email, in real time. Medical systems do it for every test result. Search engines do it for every query. Machine learning does it constantly. This one formula, under different names and in different notations, is running in the background of a huge part of modern civilisation.

The formula in one line

Start from the definition of conditional probability in two directions. For events A and B with P(A), P(B) > 0,

P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)}.

From the second equation, P(A \cap B) = P(A) \cdot P(B \mid A). Substitute that into the first:

P(A \mid B) = \frac{P(A) \cdot P(B \mid A)}{P(B)}.

That is Bayes' theorem. One line of algebra, starting from two applications of the definition of conditional probability. The reason it is so useful is that the two conditional probabilities on the left and right — P(A \mid B) and P(B \mid A) — are usually not equal, and one of them is usually much easier to compute than the other. The formula flips from "the one you know" to "the one you want."
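The flip can be checked numerically. A minimal sketch in Python, using made-up numbers for a toy pair of events:

```python
# A toy joint distribution over two events A and B (made-up numbers).
p_a_and_b = 0.12   # P(A and B)
p_a = 0.30         # P(A)
p_b = 0.40         # P(B)

# The definition of conditional probability, in both directions:
p_a_given_b = p_a_and_b / p_b    # P(A | B)
p_b_given_a = p_a_and_b / p_a    # P(B | A)

# Bayes' theorem recovers P(A | B) from P(B | A):
p_a_given_b_bayes = p_b_given_a * p_a / p_b
```

Note that the two conditional probabilities really are different here (0.3 versus 0.4), yet the formula converts one into the other exactly.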

Bayes' theorem

Let A and B be events with P(A), P(B) > 0. Then

P(A \mid B) \;=\; \frac{P(B \mid A) \cdot P(A)}{P(B)}.

Reading the four pieces:

  • P(A) — the prior probability of A, before evidence B is considered.
  • P(B \mid A) — the likelihood of the evidence B under hypothesis A.
  • P(B) — the marginal probability of the evidence (the normalising constant).
  • P(A \mid B) — the posterior probability of A, after evidence B is incorporated.

Filling in the denominator

In most problems, you are not handed P(B) directly. You have to compute it yourself. The tool is the law of total probability: if A has complement A^c, then

P(B) = P(B \mid A) \cdot P(A) + P(B \mid A^c) \cdot P(A^c).

Substituting this back into Bayes' theorem gives the expanded form,

P(A \mid B) \;=\; \frac{P(B \mid A) \cdot P(A)}{P(B \mid A) \cdot P(A) + P(B \mid A^c) \cdot P(A^c)}.

This is the version you actually use in practice, because it is written entirely in terms of things you are usually told: the prior P(A), the likelihood of the evidence under the hypothesis P(B \mid A), and the likelihood of the evidence under the alternative hypothesis P(B \mid A^c). Every medical test problem is of this form. Every spam filter problem is of this form. Once you recognise it, you can compute posteriors in your sleep.
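The expanded form is easy to wrap in a small helper. A sketch in Python (the function name is mine, not standard):

```python
def posterior(prior, likelihood, likelihood_alt):
    """Expanded Bayes' theorem for hypothesis A and evidence B.
    prior          = P(A)
    likelihood     = P(B | A)
    likelihood_alt = P(B | A^c)
    Returns P(A | B)."""
    numerator = likelihood * prior
    return numerator / (numerator + likelihood_alt * (1 - prior))
```

For example, posterior(0.001, 0.99, 0.01) gives about 0.09, the rare-disease answer.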

The fully general version, for a partition of the sample space into many alternative hypotheses A_1, A_2, \ldots, A_n:

P(A_k \mid B) \;=\; \frac{P(B \mid A_k) \cdot P(A_k)}{\sum_{i=1}^{n} P(B \mid A_i) \cdot P(A_i)}.

The numerator is the probability that A_k and B both occurred. The denominator is the total probability of B, summed over every possible cause A_i. The ratio is the fraction of B-outcomes that were caused by A_k specifically — which is exactly "given that B happened, what's the probability it was because of A_k?"
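The general version translates directly into code. A sketch with a hypothetical function name:

```python
def bayes_over_partition(priors, likelihoods):
    """General Bayes over a partition A_1, ..., A_n.
    priors[i]      = P(A_i)
    likelihoods[i] = P(B | A_i)
    Returns the list of posteriors P(A_i | B)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]  # P(B and A_i)
    total = sum(joint)        # P(B), by the law of total probability
    return [j / total for j in joint]
```

The denominator normalises the joint probabilities, so the returned posteriors always sum to one.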

Prior and posterior — the language of updating

The words prior and posterior are the reason Bayes' theorem feels like inference and not just algebra. Here is what they mean, in plain language.

Your prior is your probability estimate for a hypothesis before you see any evidence. It is whatever you believed about the world at the start of the problem — gathered from background knowledge, from base rates, from common sense. In the rare-disease problem, the prior for "this person has the disease" is 0.001, because that is the fraction of the whole population that has it. In a spam filter, the prior for "this email is spam" is around 0.5, because roughly half of all emails are spam. The prior is your starting point.

Then you see some evidence — a test result, a word in an email, a witness statement. Bayes' theorem takes the prior and the likelihood of the evidence, and produces a posterior: your updated probability for the hypothesis, after incorporating the evidence. The posterior equals the prior only in the special case where the evidence is completely uninformative, meaning P(B \mid A) = P(B \mid A^c).

[Figure: a flow diagram. The prior P(A) (what you knew before evidence) combines with the evidence and its likelihood P(B | A) (what you saw) to give the posterior P(A | B) (what you know now); Bayes' theorem, P(A | B) = P(B | A) · P(A) / P(B), is the transformation between the stages.]
Bayes' theorem as an updating pipeline. The prior is your starting estimate of $A$. The evidence arrives, with its known likelihood $P(B \mid A)$. The theorem combines them into the posterior — your new estimate. The posterior can then act as the prior for the next update, and the cycle repeats.

The posterior can be used as the prior for the next round of updating. If another piece of evidence arrives, you plug the old posterior in as the new prior, run Bayes' theorem again, and get a doubly updated posterior. This is how spam filters handle email that contains many words: each word updates the probability, and the final probability is the result of all the updates chained together. The process is sometimes called Bayesian updating, and it is a formal, quantitative model of "changing your mind in response to evidence."

Example 1: the rare disease test


A disease affects 0.1\% of the population. A diagnostic test has these properties: if a person has the disease, the test returns positive 99\% of the time (true positive rate). If a person doesn't have the disease, the test returns positive 1\% of the time (false positive rate). A randomly chosen person takes the test and receives a positive result. What is the probability that they have the disease?

Step 1. Name the events and identify the prior.

Let D = "person has the disease" and T = "test is positive." The prior — the probability before seeing the test — is the base rate of the disease in the population:

P(D) = 0.001, \qquad P(D^c) = 0.999.

Why: the word "randomly chosen" is the cue to use the base rate as the prior. Before the test, there is no evidence, so the probability that this person has the disease is whatever fraction of the population has it.

Step 2. Write down the likelihoods.

P(T \mid D) = 0.99, \qquad P(T \mid D^c) = 0.01.

The first is the true positive rate — the probability the test correctly flags a sick person. The second is the false positive rate — the probability the test wrongly flags a healthy person.

Why: the problem gives you both of these directly. You will need both to fill in the denominator of Bayes' theorem.

Step 3. Compute the marginal P(T) via the law of total probability.

P(T) = P(T \mid D) \cdot P(D) + P(T \mid D^c) \cdot P(D^c)
= 0.99 \cdot 0.001 + 0.01 \cdot 0.999
= 0.00099 + 0.00999 = 0.01098.

Why: you need the total probability that the test is positive, regardless of disease status. It has two sources — true positives from sick people and false positives from healthy people — and you sum them.

Step 4. Apply Bayes' theorem.

P(D \mid T) = \frac{P(T \mid D) \cdot P(D)}{P(T)} = \frac{0.99 \cdot 0.001}{0.01098} = \frac{0.00099}{0.01098} \approx 0.0902.

Result: The posterior probability of having the disease, given a positive test, is about 9\% — far below the 99\% most people guess.
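The four steps can be reproduced in a few lines. A sketch in Python, mirroring the numbers above:

```python
# Step 1: prior (base rate of the disease)
p_d = 0.001

# Step 2: likelihoods of a positive test
p_t_given_d = 0.99        # true positive rate
p_t_given_not_d = 0.01    # false positive rate

# Step 3: marginal P(T) via the law of total probability
p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)

# Step 4: Bayes' theorem
p_d_given_t = p_t_given_d * p_d / p_t
```

p_t comes out to 0.01098 and p_d_given_t to about 0.0902, matching the calculation above.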

[Figure: a 10,000-person grid. 10 people have the disease and ~10 of them test positive (true positives); 9,990 are healthy and ~100 of them test positive (false positives). Positive tests total 10 + 100 = 110, so P(D | T) ≈ 10/110 ≈ 9%.]
Bayes' theorem made concrete. Out of 10,000 people, 10 have the disease and 9 or 10 of them test positive; 9990 are healthy but 100 of them still test positive (the $1\%$ false positive rate). Among all 110 positive tests, only 10 are true positives. The posterior probability of disease given a positive test is about $10/110 \approx 9\%$.

Why is the answer so low? Because the disease is rare. The denominator of Bayes' theorem contains both the true positives (ten out of ten thousand) and the false positives (about one hundred out of ten thousand). The false positives vastly outnumber the true positives — because the healthy population is a thousand times larger than the sick population, and even a tiny false positive rate adds up over such a large group. The posterior can only be accurate if the prior is included, and it is exactly the prior that the intuitive answer forgets.
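The counting argument can be written out directly: track expected counts in a population of 10,000 and take the ratio.

```python
n = 10_000
sick = n * 0.001                 # 10 people have the disease
healthy = n - sick               # 9,990 do not
true_pos = sick * 0.99           # ~9.9 sick people test positive
false_pos = healthy * 0.01       # ~99.9 healthy people test positive

# fraction of positive tests that are true positives
p_d_given_t = true_pos / (true_pos + false_pos)
```

The counts are just the probabilities scaled by 10,000, so the ratio is identical to the Bayes' theorem answer of about 9%.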

This is a real effect with real consequences. In a 1978 study, physicians and students at Harvard Medical School were asked this kind of question about a screening test for a rare disease, and the majority gave answers that were an order of magnitude too high. Repeated studies since then have shown the same pattern. The only reliable way to get the right answer is to write down the prior and run the theorem.

Example 2: detecting a biased coin


You have two coins in a bag. One is fair (probability of heads = 0.5) and the other is biased (probability of heads = 0.9). You pick a coin at random — so each is equally likely — toss it once, and it lands heads. What is the probability that you picked the biased coin?

Step 1. Name the events. Let F = "picked the fair coin" and B = "picked the biased coin." Let H = "the toss lands heads." The priors:

P(F) = 0.5, \qquad P(B) = 0.5.

Why: you pick at random, so both coins are equally likely before the toss.

Step 2. Write down the likelihoods.

P(H \mid F) = 0.5, \qquad P(H \mid B) = 0.9.

Why: these come straight from the description of each coin.

Step 3. Compute P(H) by the law of total probability.

P(H) = P(H \mid F) \cdot P(F) + P(H \mid B) \cdot P(B) = 0.5 \cdot 0.5 + 0.9 \cdot 0.5 = 0.25 + 0.45 = 0.70.

Step 4. Apply Bayes' theorem to find P(B \mid H).

P(B \mid H) = \frac{P(H \mid B) \cdot P(B)}{P(H)} = \frac{0.9 \cdot 0.5}{0.70} = \frac{0.45}{0.70} \approx 0.643.

Result: After seeing one heads, the probability that you picked the biased coin has risen from 50\% to about 64\%.
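The same four steps in code, using the coin numbers (a minimal sketch):

```python
# priors: the coin was picked at random
p_fair, p_biased = 0.5, 0.5

# likelihoods of heads under each hypothesis
p_h_given_fair, p_h_given_biased = 0.5, 0.9

# marginal P(H) via the law of total probability
p_h = p_h_given_fair * p_fair + p_h_given_biased * p_biased

# posterior probability of the biased coin after one heads
p_biased_given_h = p_h_given_biased * p_biased / p_h
```

p_h is 0.70 and the posterior is 0.45/0.70 ≈ 0.643.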

Now suppose you toss the same coin again and it lands heads a second time. Update again, using the old posterior 0.643 as the new prior for the biased coin, and 0.357 for the fair coin:

P(H_2) = 0.5 \cdot 0.357 + 0.9 \cdot 0.643 = 0.1785 + 0.5787 = 0.7572,
P(B \mid HH) = \frac{0.9 \cdot 0.643}{0.7572} = \frac{0.5787}{0.7572} \approx 0.764.

A third heads in a row: update once more, with 0.764 as the prior for the biased coin and 0.236 for the fair coin:

P(H_3) = 0.5 \cdot 0.236 + 0.9 \cdot 0.764 = 0.1180 + 0.6876 = 0.8056,
P(B \mid HHH) = \frac{0.9 \cdot 0.764}{0.8056} = \frac{0.6876}{0.8056} \approx 0.854.

Each heads makes the biased-coin hypothesis more and more likely. After enough heads, you are virtually certain the coin is biased.
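Chaining the updates is neatly expressed in code. Because each heads multiplies the biased-coin odds by the same factor, k single-toss updates collapse into one closed form (a sketch; the function name is mine):

```python
def p_biased_after_heads(k):
    """Posterior that the coin is biased after k heads in a row,
    starting from a 50/50 prior. Chaining k Bayes updates gives
    0.9**k / (0.9**k + 0.5**k)."""
    return 0.9**k / (0.9**k + 0.5**k)

trajectory = [round(p_biased_after_heads(k), 3) for k in range(5)]
# posterior after k = 0, 1, 2, 3, 4 heads in a row
```

The values, computed without intermediate rounding, are 0.5, 0.643, 0.764, 0.854, 0.913.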

[Figure: bar chart of P(biased | k heads in a row): 0.50 at k = 0, then 0.64, 0.76, 0.85, and 0.91 after one, two, three, and four heads.]
Bayesian updating in action. Each additional heads nudges the posterior probability of the biased coin higher. After four heads, the posterior is about $0.91$ — you are highly confident, but not certain, that the coin you picked is the biased one.

This is how Bayesian inference works. Start with a prior. See evidence. Update. See more evidence. Update again. The posterior accumulates information from every observation, and over time it converges on the truth — if the truth is in your list of hypotheses.

Applications

Bayes' theorem is probably the most-used non-arithmetic formula in modern technology. A few places it shows up:

Spam filtering. An email classifier starts with a prior (50\% spam is typical for a fresh filter), then for each word in the email, updates the probability using the likelihood of that word appearing in spam versus ham. After processing all the words, the posterior is either above a threshold (spam folder) or below (inbox).
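The word-by-word update can be sketched in a few lines. The word likelihoods below are made up for illustration; a real filter estimates them from a corpus of labelled mail.

```python
# P(word | spam), P(word | ham) -- hypothetical values for illustration
word_likelihoods = {
    "free":    (0.20, 0.02),
    "meeting": (0.01, 0.10),
    "winner":  (0.15, 0.01),
}

def spam_probability(words, prior=0.5):
    """Run one Bayes update per word, feeding each posterior
    back in as the next prior."""
    p = prior
    for w in words:
        p_spam, p_ham = word_likelihoods[w]
        p = p_spam * p / (p_spam * p + p_ham * (1 - p))
    return p
```

An email containing "free" and "winner" ends up near-certain spam; one containing only "meeting" drops well below the prior.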

Medical diagnosis. A doctor looks at symptoms, test results, and patient history, and combines them into a posterior probability of each possible diagnosis. Every round of tests is a Bayes update on the previous round's posterior.

Machine learning. Many probabilistic classifiers are running Bayes' theorem in some form. Naive Bayes classifiers are named after the theorem directly; more sophisticated models like Bayesian neural networks use the theorem to update distributions over weights as training data arrives.

Search and retrieval. Search engines rank documents using probabilistic models in which "the user is looking for topic T" is a hypothesis, and the words in the query are evidence that update the probability of each topic.

Legal reasoning. "Probability of guilt given the DNA evidence" is a Bayes' theorem calculation, not a direct read-off of the match probability. Failing to apply Bayes correctly in court has produced genuine miscarriages of justice — confusing P(\text{match} \mid \text{innocent}) with P(\text{innocent} \mid \text{match}) is the same mistake as confusing P(T \mid D) with P(D \mid T) in the medical test example.

Scientific inference. Bayesian statistics treats every unknown parameter as having a probability distribution, and uses Bayes' theorem to update that distribution whenever new data is collected. This is the default mode of modern astronomy, genetics, climate science, and much of physics.

What unites these applications is a single pattern: "I have a hypothesis; I have some prior belief about how likely it is; I have observed some evidence whose likelihood under the hypothesis I can compute; I want a posterior belief." Bayes' theorem is the formal rule for turning those three inputs into that one output, and it is the same rule every time.

Common confusions

A few things students reliably get wrong about Bayes' theorem the first time.

  • Confusing P(A \mid B) with P(B \mid A). They are different quantities with usually very different values — in the disease example, P(T \mid D) = 0.99 while P(D \mid T) \approx 0.09. Bayes' theorem is precisely the tool for converting one into the other.
  • Forgetting the prior. The likelihood of the evidence is not enough on its own; the base rate P(A) must enter the calculation, and omitting it is exactly the mistake behind the intuitive "99\%" answer.
  • Treating P(B) as given. In most problems the marginal probability of the evidence is not stated directly; it must be computed with the law of total probability, summing over every hypothesis that could have produced the evidence.

Going deeper

If you just need to run Bayes' theorem on exam problems, you have everything you need. The rest of this section is about why the theorem is more than a formula — it is a philosophy of inference, and the subject of Bayesian statistics is built on top of it.

Odds form

There is a second way to write Bayes' theorem that turns out to be enormously useful. Write the theorem for A and for A^c:

P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}, \qquad P(A^c \mid B) = \frac{P(B \mid A^c) P(A^c)}{P(B)}.

Divide the first by the second. The denominators P(B) cancel, and you get

\frac{P(A \mid B)}{P(A^c \mid B)} = \frac{P(B \mid A)}{P(B \mid A^c)} \cdot \frac{P(A)}{P(A^c)}.

The ratio on the left is the posterior odds — the ratio of the probability of A to the probability of A^c, given the evidence. The ratio on the right splits into two parts: the likelihood ratio P(B \mid A) / P(B \mid A^c), and the prior odds P(A) / P(A^c). So in words:

Posterior odds = likelihood ratio × prior odds.

This is Bayes' theorem in odds form, and it is often easier to compute with than the probability form. It avoids the law of total probability in the denominator entirely. And when evidence arrives in stages, you can just multiply the likelihood ratios together — one for each piece of evidence — to get the cumulative update, as long as the pieces of evidence are conditionally independent given the hypothesis.
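The odds form makes the rare-disease example a one-line computation. A sketch, using the same numbers as the disease example earlier:

```python
prior_odds = 0.001 / 0.999       # prior odds of disease: about 1 to 999
likelihood_ratio = 0.99 / 0.01   # a positive test is 99x more likely if sick

posterior_odds = likelihood_ratio * prior_odds

# convert odds back to a probability: p = odds / (1 + odds)
posterior_prob = posterior_odds / (1 + posterior_odds)
```

posterior_prob comes out to about 0.0902, the same answer as the probability form, with no law-of-total-probability denominator needed.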

Extended form for many hypotheses

When there are n mutually exclusive and exhaustive hypotheses A_1, \ldots, A_n, Bayes' theorem becomes

P(A_k \mid B) = \frac{P(B \mid A_k) P(A_k)}{\sum_{i=1}^n P(B \mid A_i) P(A_i)}.

This is what powers every naive Bayes spam classifier, every speech recognition system, every genetic inference algorithm. The partition \{A_1, \ldots, A_n\} is the list of possible causes, each P(A_i) is the prior, each P(B \mid A_i) is the likelihood of the evidence under cause i, and the denominator normalises so that the posteriors across all causes sum to one.
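Here is the extended form on a toy problem with three hypothetical causes and equal priors (the likelihoods are made up for illustration):

```python
priors = [1/3, 1/3, 1/3]          # three equally likely causes
likelihoods = [0.2, 0.5, 0.8]     # P(B | A_i) for each cause

joint = [l * p for l, p in zip(likelihoods, priors)]   # P(B and A_i)
total = sum(joint)                                     # P(B)
post = [j / total for j in joint]                      # P(A_i | B)
```

The posteriors sum to one and are proportional to likelihood times prior; here the third cause gets posterior 0.8/1.5 ≈ 0.53.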

Why the formula is exactly this

Stepping back: Bayes' theorem is the unique way to update probabilities consistently with the axioms of probability. This is a theorem in its own right — proved by Richard Cox in the 1940s — and it says that if your beliefs are numeric (a number for each proposition) and you want them to obey a small set of common-sense rules (transitivity, compatibility with logical operations), then Bayes' theorem is forced on you. You cannot update probabilities in any other way and still satisfy the axioms. It is not a choice of convention. It is the only rule consistent with both probability and logic.

That is why the theorem feels powerful out of proportion to its algebraic simplicity. It is not just a formula that happens to be useful. It is the only way to reason quantitatively about uncertain evidence that is coherent with the rest of probability theory.

Where this leads next

You now know what Bayes' theorem says, how to derive it, how to apply it, and why it is everywhere. The next articles push you into actual statistical inference — using probability to answer questions about real data.