In short
The axiomatic approach defines probability using three rules (axioms): every probability is at least zero, the probability of the entire sample space is one, and the probability of a union of mutually exclusive events is the sum of their individual probabilities. Every other result in probability theory — complement rules, addition formulas, Bayes' theorem — is a logical consequence of these three axioms.
Toss a fair coin. You say the probability of heads is \frac{1}{2}. Toss a fair die. The probability of rolling a 6 is \frac{1}{6}. These numbers feel natural — you have been using them since class 9, computed as "favourable outcomes divided by total outcomes."
But here is the problem. That formula — P(A) = \frac{n(A)}{n(S)} — only works when all outcomes are equally likely. What if the coin is biased? What if the die is loaded? What if the experiment is "wait for the next bus" and the outcomes are not discrete at all, but a continuous range of arrival times? The classical definition breaks down in all these cases. You need something deeper — a definition that does not assume equal likelihood, one that works for every probability problem you will ever encounter.
The answer is to stop trying to define what probability is in terms of counting, and instead lay down a small set of rules — axioms — that any reasonable assignment of probabilities must satisfy. Then prove everything else from those rules. This is the axiomatic approach to probability, and it is the foundation of the entire subject.
Three rules. That is all.
The setup: you have a random experiment (toss a coin, roll a die, pick a card, measure a waiting time). The set of all possible outcomes is the sample space S. An event is any subset of S — it is the collection of outcomes you care about. A probability function P is a rule that assigns a number P(A) to each event A.
What properties should P have? Think about what would make sense. If you toss a die, P(\text{roll a 3}) should not be negative — there is no such thing as a "negative chance." The probability that something happens should be 1 — the die will land on some face, guaranteed. And if two events cannot both happen at once (like rolling a 2 and rolling a 5 on a single toss), the probability of one or the other should be the sum of the two probabilities.
Those three intuitions are the axioms.
Kolmogorov's Axioms of Probability
Let S be a sample space and let P be a function that assigns a real number P(A) to every event A \subseteq S. Then P is a probability function if it satisfies:
Axiom 1 (Non-negativity). For every event A,
P(A) \geq 0.
Axiom 2 (Normalization). For the entire sample space,
P(S) = 1.
Axiom 3 (Countable additivity). If A_1, A_2, A_3, \ldots are mutually exclusive events (that is, A_i \cap A_j = \varnothing for all i \neq j), then
P(A_1 \cup A_2 \cup A_3 \cup \cdots) = P(A_1) + P(A_2) + P(A_3) + \cdots
That is the entire foundation. Every theorem in probability — from the complement rule to Bayes' theorem to the central limit theorem — is proved from these three axioms and nothing else. The axioms do not tell you what number to assign to each event. They only tell you what rules those numbers must follow. Whether P(\text{heads}) = 0.5 or 0.7 depends on the coin. But whatever the number is, it must be non-negative, the probabilities of all outcomes must add to 1, and mutually exclusive events must add.
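To make this concrete, here is a minimal Python sketch of a finite probability assignment checked against the axioms. The names `satisfies_axioms` and `P` are illustrative, not from any library; on a finite sample space, defining P of an event as the sum over its outcomes builds Axiom 3 in by construction.

```python
from math import isclose

# A probability assignment for a fair die: outcome -> probability.
p = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}

def satisfies_axioms(p):
    """Check Kolmogorov's axioms for a finite sample space."""
    nonneg = all(prob >= 0 for prob in p.values())      # Axiom 1
    normalized = isclose(sum(p.values()), 1.0)           # Axiom 2
    return nonneg and normalized

def P(event, p):
    """Probability of an event (a set of outcomes) via additivity (Axiom 3)."""
    return sum(p[outcome] for outcome in event)

print(satisfies_axioms(p))   # True
print(P({2, 4, 6}, p))       # 0.5 for a fair die
```

The same checker rejects an assignment like `{1: 0.7, 2: 0.7}` (sum exceeds 1) or one with a negative value, which is exactly what the axioms rule out.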
Building the first results
The power of axioms is that you can prove things. Here are the essential properties of probability, each derived directly from the three axioms.
Property 1: The probability of the empty set is zero
The impossible event — the event where nothing in S happens — has probability zero.
Claim: P(\varnothing) = 0.
Proof. The sample space S and the empty set \varnothing are disjoint (they share no outcomes, since \varnothing has no outcomes at all). So by Axiom 3:
P(S \cup \varnothing) = P(S) + P(\varnothing)
But S \cup \varnothing = S, so:
P(S) = P(S) + P(\varnothing)
Subtract P(S) from both sides:
P(\varnothing) = 0
This is not an axiom — it is a consequence. The axioms force the impossible event to have probability zero. There was no choice in the matter.
Property 2: The complement rule
This is the single most useful identity in basic probability. If A is an event, its complement A' (also written \bar{A} or A^c) is the event "A does not happen."
Claim: P(A') = 1 - P(A).
Proof. The events A and A' are mutually exclusive: they share no outcomes (an outcome either belongs to A or it does not). Together, they cover the entire sample space: A \cup A' = S. So by Axiom 3:
P(A \cup A') = P(A) + P(A')
And by Axiom 2, P(A \cup A') = P(S) = 1. Combining:
P(A) + P(A') = 1, \quad \text{so} \quad P(A') = 1 - P(A)
This is extraordinarily useful. Whenever computing P(A) directly is hard, compute P(A') instead and subtract from 1. For instance, the probability that at least one of three dice shows a 6 is harder to compute directly (you have to account for overlaps), but P(\text{no sixes at all}) = (5/6)^3, so P(\text{at least one six}) = 1 - (5/6)^3 = 1 - 125/216 = 91/216.
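The three-dice calculation can be checked by brute force. This sketch enumerates all 6^3 equally likely triples and compares the direct count against the complement-rule shortcut (exact arithmetic via `fractions.Fraction` avoids floating-point noise):

```python
from fractions import Fraction
from itertools import product

# Direct count over all 6^3 = 216 equally likely triples...
triples = list(product(range(1, 7), repeat=3))
direct = Fraction(sum(1 for t in triples if 6 in t), len(triples))

# ...versus the complement rule: P(at least one six) = 1 - (5/6)^3.
via_complement = 1 - Fraction(5, 6) ** 3

print(direct, via_complement)   # 91/216 91/216
```

The direct count has to wade through every triple; the complement rule needs one subtraction. That asymmetry is why "at least one" problems almost always go through the complement.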
Property 3: Probability is bounded between 0 and 1
Claim: For every event A, 0 \leq P(A) \leq 1.
Proof. Axiom 1 gives P(A) \geq 0. The complement rule gives P(A') = 1 - P(A). But Axiom 1 also says P(A') \geq 0, so 1 - P(A) \geq 0, which means P(A) \leq 1. Combining both: 0 \leq P(A) \leq 1.
No probability can ever be negative or greater than 1. If you ever compute P = 1.3 or P = -0.2, you have made an error — the axioms guarantee it.
Property 4: Monotonicity
If event A is a subset of event B — meaning every outcome in A is also in B — then A cannot be more probable than B.
Claim: If A \subseteq B, then P(A) \leq P(B).
Proof. Since A \subseteq B, you can split B into two disjoint pieces: B = A \cup (B \cap A'). The sets A and B \cap A' are disjoint (they share no outcomes), so by Axiom 3:
P(B) = P(A) + P(B \cap A')
By Axiom 1, P(B \cap A') \geq 0. So P(B) \geq P(A).
This matches intuition perfectly. Rolling a 6 is a subset of rolling an even number (\{2, 4, 6\}), and indeed P(\{6\}) = 1/6 \leq P(\{2, 4, 6\}) = 3/6. (Check subset claims carefully before applying this property: rolling a prime, \{2, 3, 5\}, is not a subset of rolling an odd number, \{1, 3, 5\}, because 2 is prime but not odd.)
Property 5: The addition rule for two events
This is so important that the next article is devoted entirely to it. Here is the preview.
Claim: For any two events A and B,
P(A \cup B) = P(A) + P(B) - P(A \cap B)
Proof. Split the union A \cup B into three mutually exclusive pieces:
A \cup B = (A \cap B') \cup (A \cap B) \cup (A' \cap B)
By Axiom 3 (the three pieces are disjoint):
P(A \cup B) = P(A \cap B') + P(A \cap B) + P(A' \cap B)
Now observe that A itself splits into two disjoint parts: A = (A \cap B') \cup (A \cap B), so P(A) = P(A \cap B') + P(A \cap B), giving P(A \cap B') = P(A) - P(A \cap B).
Similarly, B = (A' \cap B) \cup (A \cap B), so P(A' \cap B) = P(B) - P(A \cap B).
Substituting both into the equation above:
P(A \cup B) = [P(A) - P(A \cap B)] + P(A \cap B) + [P(B) - P(A \cap B)] = P(A) + P(B) - P(A \cap B)
The subtraction at the end corrects for double-counting: when you add P(A) and P(B), the overlap region A \cap B gets counted twice, so you subtract it once to get the right answer.
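A quick enumeration on a fair die makes the double-counting correction tangible. In this sketch (the helper `P` is illustrative), A is "even" and B is "greater than 3"; their overlap \{4, 6\} would be counted twice without the subtraction:

```python
from fractions import Fraction

S = set(range(1, 7))          # fair-die sample space, all outcomes equally likely

def P(E):
    """Classical probability on a finite, equally likely sample space."""
    return Fraction(len(E), len(S))

A = {2, 4, 6}                 # "even"
B = {4, 5, 6}                 # "greater than 3"

lhs = P(A | B)                          # direct: P(A ∪ B) = P({2, 4, 5, 6})
rhs = P(A) + P(B) - P(A & B)            # addition rule with overlap correction
print(lhs, rhs)                         # 2/3 2/3
```

Dropping the `- P(A & B)` term would give 1/2 + 1/2 = 1, which is wrong precisely because 4 and 6 sit in both events.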
Seeing it with numbers
Let's make this concrete with two worked examples.
Example 1: A loaded die
A six-sided die is loaded so that the probability of each face is proportional to the number on that face. Find the probability of rolling an even number.
Step 1. Set up the probability assignment. The probabilities must be proportional to the face values, so P(\{k\}) = ck for some constant c, where k = 1, 2, 3, 4, 5, 6.
Why: "proportional to the face value" means P(\{1\}) : P(\{2\}) : \cdots : P(\{6\}) = 1 : 2 : 3 : 4 : 5 : 6. Introducing a constant c captures this.
Step 2. Use Axiom 2 to find c. The probabilities of all outcomes must sum to 1:
c(1 + 2 + 3 + 4 + 5 + 6) = 21c = 1, \quad \text{so} \quad c = \frac{1}{21}
Why: Axiom 2 forces P(S) = 1. This is the constraint that pins down c.
Step 3. Write out the individual probabilities:
P(\{1\}) = \frac{1}{21}, \quad P(\{2\}) = \frac{2}{21}, \quad P(\{3\}) = \frac{3}{21}, \quad P(\{4\}) = \frac{4}{21}, \quad P(\{5\}) = \frac{5}{21}, \quad P(\{6\}) = \frac{6}{21}
Why: each probability is non-negative (Axiom 1 is satisfied) and they sum to 1 (Axiom 2 is satisfied). This is a valid probability assignment.
Step 4. Find P(\text{even}). The event "even" is \{2, 4, 6\}. These are mutually exclusive outcomes, so by Axiom 3:
P(\text{even}) = P(\{2\}) + P(\{4\}) + P(\{6\}) = \frac{2}{21} + \frac{4}{21} + \frac{6}{21} = \frac{12}{21} = \frac{4}{7}
Why: on a fair die, P(\text{even}) = 1/2. On this loaded die, larger numbers are more likely, and since 2 + 4 + 6 > 1 + 3 + 5, even numbers are collectively more probable.
Result: P(\text{even}) = \dfrac{4}{7} \approx 0.571.
The bar chart makes the answer visible: the three red bars (even faces) take up more than half the total height, because the larger (and hence more probable) faces include 4 and 6, both of which are even.
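The whole calculation fits in a few lines of exact arithmetic. This sketch reproduces the worked steps (names like `p_even` are illustrative):

```python
from fractions import Fraction

faces = range(1, 7)
c = Fraction(1, sum(faces))               # Axiom 2: c(1 + ... + 6) = 21c = 1, so c = 1/21
p = {k: c * k for k in faces}             # P({k}) = k/21, proportional to the face value

assert sum(p.values()) == 1               # Axiom 2 holds exactly
p_even = sum(p[k] for k in (2, 4, 6))     # Axiom 3: disjoint outcomes add
print(p_even)                             # 4/7
```

Changing the weighting (say, probability proportional to k^2) only changes the line defining `c`; Axiom 2 always pins down the constant.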
Example 2: Complement rule in action
A bag contains 10 balls numbered 1 through 10. Two balls are drawn at random without replacement. Find the probability that at least one ball is greater than 8.
Step 1. Identify the complement. "At least one ball greater than 8" is the complement of "both balls are \leq 8." The complement is easier to count.
Why: "at least one" problems almost always become simpler through complements. Instead of tracking overlapping cases (exactly one > 8, or both > 8), compute the one case where none is > 8.
Step 2. Count the sample space. Two balls from 10, order does not matter: \binom{10}{2} = 45.
Why: each pair of distinct balls is one outcome, and all pairs are equally likely (random draw).
Step 3. Count the complement event. Both balls \leq 8 means both are chosen from the 8 balls \{1, 2, \ldots, 8\}: \binom{8}{2} = 28.
Why: to avoid any ball greater than 8, you are restricted to the 8 smaller balls.
Step 4. Apply the complement rule.
P(\text{at least one ball} > 8) = 1 - P(\text{both} \leq 8) = 1 - \frac{28}{45} = \frac{17}{45}
Why: this is Property 2 in action. P(A) = 1 - P(A'), derived from the axioms.
Result: P(\text{at least one ball} > 8) = \dfrac{17}{45} \approx 0.378.
The picture shows why complements are so powerful: instead of listing the 17 pairs that satisfy "at least one > 8" (which requires tracking two sub-cases), you list the 28 pairs that don't — a single, clean count — and subtract.
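The counts in Steps 2–4 can be verified by listing every pair explicitly. This sketch uses `itertools.combinations` to enumerate the 45 equally likely draws:

```python
from fractions import Fraction
from itertools import combinations

pairs = list(combinations(range(1, 11), 2))                   # all C(10,2) = 45 draws
both_small = sum(1 for a, b in pairs if a <= 8 and b <= 8)    # C(8,2) = 28

p_at_least_one_big = 1 - Fraction(both_small, len(pairs))
print(len(pairs), both_small, p_at_least_one_big)   # 45 28 17/45
```

Counting the 17 favourable pairs directly would mean splitting into "exactly one > 8" and "both > 8"; the complement collapses it to one count.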
Common confusions
A few things students reliably get wrong about the axiomatic approach.
- "Probability zero means impossible." Not quite. P(\varnothing) = 0 and the empty event is indeed impossible. But in continuous probability (which you will meet later), individual points can have probability zero without being impossible. For instance, the probability of a dart hitting exactly the point (0.3141592\ldots, 0.2718\ldots) on a dartboard is zero, but the dart does land somewhere. In the finite/countable case — the only case in this article — probability zero does mean impossible.
- "The axioms tell you what the probabilities are." They do not. The axioms are rules that any valid probability assignment must satisfy. They say nothing about whether P(\text{heads}) = 0.5 or 0.7 — that depends on the physical setup. The axioms are the grammar; the specific probability values are the vocabulary.
- "Axiom 3 only works for two events." No — it works for any finite or countably infinite collection of mutually exclusive events. The case of two events is a special case (A_3 = A_4 = \cdots = \varnothing).
- "P(A) + P(B) = P(A \cup B) always." Only when A and B are mutually exclusive. If they overlap, you need the addition formula: P(A \cup B) = P(A) + P(B) - P(A \cap B). Forgetting the overlap correction is one of the most common errors in probability.
Going deeper
If you came here to understand what the axioms of probability are and what basic properties follow from them, you have it — you can stop here. The rest of this section is for readers who want the historical context, the subtleties of the axiom system, and the connection to measure theory.
Why axioms, and not a definition?
Before the axiomatic approach, there were two competing "definitions" of probability. The classical definition (P(A) = n(A)/n(S), due to Laplace) requires equally likely outcomes — which is circular, because "equally likely" means "each has the same probability." The frequentist definition (P(A) is the long-run relative frequency) is experimentally grounded but mathematically vague — it doesn't say how long the "long run" must be, or why the relative frequency should converge.
The axiomatic approach sidesteps both problems. It does not try to define what probability "really is." Instead, it says: whatever probability means to you — a degree of belief, a long-run frequency, a physical symmetry — the number you assign must satisfy three axioms. Then the entire theory follows from those rules. This freed mathematicians from philosophical debates and let them build a rigorous, general theory.
Sigma-algebras and measure theory
In the definition above, we quietly said "every event A \subseteq S." For finite sample spaces (dice, cards, coins), this is fine — there are finitely many subsets, and you can assign a probability to each. But for continuous sample spaces (like the interval [0, 1]), the collection of subsets is vast, and it turns out you cannot consistently assign probabilities to all of them. The resolution is to restrict attention to a special collection of subsets called a sigma-algebra (\sigma-algebra), which is closed under complements and countable unions. Probability is then a function defined only on this sigma-algebra, not on all subsets.
This is the starting point of measure theory — the branch of mathematics where probability lives when you need full generality. For the problems in school-level probability (finite or countable sample spaces), you will never need sigma-algebras. But knowing they exist tells you that the three axioms are not the whole story — they are the visible part of a deeper structure.
Kolmogorov's contribution
The axiomatization of probability was published in 1933 in a monograph titled Grundbegriffe der Wahrscheinlichkeitsrechnung ("Foundations of the Theory of Probability"). The key insight was that probability is a special case of a measure — a function that assigns sizes to sets — satisfying the extra constraint that the total measure is 1. This made probability a branch of measure theory, and suddenly all the powerful tools of real analysis became available to probabilists. The entire modern theory of stochastic processes, random variables, and statistical inference rests on this foundation.
Finite additivity vs countable additivity
Axiom 3 says the additivity rule works for countably many disjoint events, not just finitely many. This distinction matters. Finite additivity (the rule holds for any finite number of disjoint events) is weaker and can lead to pathological probability models that violate our intuition about limits. Countable additivity is the stronger requirement that makes limits work properly: if A_1 \subseteq A_2 \subseteq A_3 \subseteq \cdots is an increasing sequence of events with A_n \to A, then countable additivity guarantees P(A_n) \to P(A). Without it, probability would not connect to convergence, and the entire theory of large-sample statistics would collapse.
Where this leads next
You now have the foundation. Every result from here on is built on the three axioms you just learned.
- Addition Theorem — the full proof of P(A \cup B) = P(A) + P(B) - P(A \cap B), extended to three events and then to n events via inclusion-exclusion.
- Conditional Probability — what happens to probabilities when you learn new information, and the multiplication theorem.
- Independent Events — when knowing one event tells you nothing about another, and why independence is not the same as mutual exclusivity.
- Bayes' Theorem — how to reverse a conditional probability, and why it changes how you think about evidence.
- Random Variables - Discrete — attaching numerical values to outcomes, which is how probability connects to statistics.