In short

The axiomatic approach defines probability using three rules (axioms): every probability is at least zero, the probability of the entire sample space is one, and the probability of a union of mutually exclusive events is the sum of their individual probabilities. Every other result in probability theory — complement rules, addition formulas, Bayes' theorem — is a logical consequence of these three axioms.

Toss a fair coin. You say the probability of heads is \frac{1}{2}. Toss a fair die. The probability of rolling a 6 is \frac{1}{6}. These numbers feel natural — you have been using them since class 9, computed as "favourable outcomes divided by total outcomes."

But here is the problem. That formula — P(A) = \frac{n(A)}{n(S)} — only works when all outcomes are equally likely. What if the coin is biased? What if the die is loaded? What if the experiment is "wait for the next bus" and the outcomes are not discrete at all, but a continuous range of arrival times? The classical definition breaks down in all these cases. You need something deeper — a definition that does not assume equal likelihood, one that works for every probability problem you will ever encounter.

The answer is to stop trying to define what probability is in terms of counting, and instead lay down a small set of rules — axioms — that any reasonable assignment of probabilities must satisfy. Then prove everything else from those rules. This is the axiomatic approach to probability, and it is the foundation of the entire subject.

Three rules. That is all.

The setup: you have a random experiment (toss a coin, roll a die, pick a card, measure a waiting time). The set of all possible outcomes is the sample space S. An event is any subset of S — it is the collection of outcomes you care about. A probability function P is a rule that assigns a number P(A) to each event A.

What properties should P have? Think about what would make sense. If you toss a die, P(\text{roll a 3}) should not be negative — there is no such thing as a "negative chance." The probability that something happens should be 1 — the die will land on some face, guaranteed. And if two events cannot both happen at once (like rolling a 2 and rolling a 5 on a single toss), the probability of one or the other should be the sum of the two probabilities.

Those three intuitions are the axioms.

Kolmogorov's Axioms of Probability

Let S be a sample space and let P be a function that assigns a real number P(A) to every event A \subseteq S. Then P is a probability function if it satisfies:

Axiom 1 (Non-negativity). For every event A,

P(A) \geq 0

Axiom 2 (Normalization). For the entire sample space,

P(S) = 1

Axiom 3 (Countable additivity). If A_1, A_2, A_3, \ldots are mutually exclusive events (that is, A_i \cap A_j = \varnothing for all i \neq j), then

P(A_1 \cup A_2 \cup A_3 \cup \cdots) = P(A_1) + P(A_2) + P(A_3) + \cdots

That is the entire foundation. Every theorem in probability — from the complement rule to Bayes' theorem to the central limit theorem — is proved from these three axioms and nothing else. The axioms do not tell you what number to assign to each event. They only tell you what rules those numbers must follow. Whether P(\text{heads}) = 0.5 or 0.7 depends on the coin. But whatever the number is, it must be non-negative, the probabilities of all outcomes must add to 1, and mutually exclusive events must add.
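For a finite sample space, the three axioms are easy to check mechanically: assign a probability to each outcome, verify none is negative, and verify the total is 1 (additivity over disjoint events then holds automatically, since an event's probability is the sum over its outcomes). Here is a minimal sketch; the helper name `satisfies_axioms` is ours, not standard.

```python
from fractions import Fraction

def satisfies_axioms(p):
    """Check Kolmogorov's axioms for a finite sample space.

    `p` maps each outcome to its probability. In the finite case,
    Axiom 3 reduces to "P(event) = sum of its outcome probabilities",
    so only non-negativity and normalization need explicit checks.
    """
    non_negative = all(prob >= 0 for prob in p.values())  # Axiom 1
    normalized = sum(p.values()) == 1                     # Axiom 2
    return non_negative and normalized

# A fair die satisfies the axioms; a "model" whose probabilities
# total 5/4 does not, no matter how plausible each number looks.
fair_die = {k: Fraction(1, 6) for k in range(1, 7)}
bad_model = {1: Fraction(1, 2), 2: Fraction(3, 4)}
```

Using exact `Fraction` arithmetic avoids the floating-point rounding that would make `sum(...) == 1` unreliable.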

Building the first results

The power of axioms is that you can prove things. Here are the essential properties of probability, each derived directly from the three axioms.

Property 1: The probability of the empty set is zero

The impossible event — the event containing no outcomes at all — has probability zero.

Claim: P(\varnothing) = 0.

Proof. The sample space S and the empty set \varnothing are disjoint (they share no outcomes, since \varnothing has no outcomes at all). So by Axiom 3:

P(S \cup \varnothing) = P(S) + P(\varnothing)

But S \cup \varnothing = S, so:

P(S) = P(S) + P(\varnothing)

Subtract P(S) from both sides:

P(\varnothing) = 0

This is not an axiom — it is a consequence. The axioms force the impossible event to have probability zero. There was no choice in the matter.

Property 2: The complement rule

This is the single most useful identity in basic probability. If A is an event, its complement A' (also written \bar{A} or A^c) is the event "A does not happen."

Claim: P(A') = 1 - P(A).

Proof. The events A and A' are mutually exclusive: they share no outcomes (an outcome either belongs to A or it does not). Together, they cover the entire sample space: A \cup A' = S. So by Axiom 3:

P(A \cup A') = P(A) + P(A')

And by Axiom 2, P(A \cup A') = P(S) = 1. Combining:

1 = P(A) + P(A')
P(A') = 1 - P(A)

This is extraordinarily useful. Whenever computing P(A) directly is hard, compute P(A') instead and subtract from 1. For instance, the probability that at least one of three dice shows a 6 is harder to compute directly (you have to account for overlaps), but P(\text{no sixes at all}) = (5/6)^3, so P(\text{at least one six}) = 1 - (5/6)^3 = 1 - 125/216 = 91/216.
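The three-dice calculation above can be double-checked by brute force, since there are only \(6^3 = 216\) equally likely rolls. This sketch computes the answer both ways, via the complement rule and via direct enumeration:

```python
from fractions import Fraction
from itertools import product

# Complement rule: P(at least one six) = 1 - P(no sixes).
p_no_six = Fraction(5, 6) ** 3
p_at_least_one = 1 - p_no_six

# Brute-force check: enumerate all 6^3 = 216 equally likely rolls
# and count those containing at least one six.
rolls = list(product(range(1, 7), repeat=3))
direct = Fraction(sum(1 for r in rolls if 6 in r), len(rolls))
```

Both routes give \(91/216\); the complement route is one subtraction, the direct route needs the full enumeration.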

Property 3: Probability is bounded between 0 and 1

Claim: For every event A, 0 \leq P(A) \leq 1.

Proof. Axiom 1 gives P(A) \geq 0. The complement rule gives P(A') = 1 - P(A). But Axiom 1 also says P(A') \geq 0, so 1 - P(A) \geq 0, which means P(A) \leq 1. Combining both: 0 \leq P(A) \leq 1.

No probability can ever be negative or greater than 1. If you ever compute P = 1.3 or P = -0.2, you have made an error — the axioms guarantee it.

Property 4: Monotonicity

If event A is a subset of event B — meaning every outcome in A is also in B — then A cannot be more probable than B.

Claim: If A \subseteq B, then P(A) \leq P(B).

Proof. Since A \subseteq B, you can split B into two disjoint pieces: B = A \cup (B \cap A'). The sets A and B \cap A' are disjoint (they share no outcomes), so by Axiom 3:

P(B) = P(A) + P(B \cap A')

By Axiom 1, P(B \cap A') \geq 0. So P(B) \geq P(A).

This matches intuition perfectly. For example, rolling a 6 is a subset of rolling an even number (\{2, 4, 6\}), and indeed P(\{6\}) = 1/6 \leq P(\{2, 4, 6\}) = 3/6. (Choose examples carefully here: rolling a prime, \{2, 3, 5\}, is not a subset of rolling an odd number, \{1, 3, 5\}, because 2 is prime but not odd.)
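The fair-die instance of monotonicity, including the exact gap P(B \cap A') from the proof, can be verified in a few lines (the helper `P` is ours):

```python
from fractions import Fraction

# Fair die: A = {6} is a subset of B = {2, 4, 6} (the even faces).
p = {k: Fraction(1, 6) for k in range(1, 7)}
A, B = {6}, {2, 4, 6}

def P(event):
    """Probability of an event = sum of its outcome probabilities (Axiom 3)."""
    return sum(p[k] for k in event)

assert A <= B                       # A is a subset of B
assert P(A) <= P(B)                 # monotonicity: 1/6 <= 3/6
assert P(B) - P(A) == P(B - A)      # the gap is exactly P(B ∩ A')
```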

Property 5: The addition rule for two events

This is so important that the next article is devoted entirely to it. Here is the preview.

Claim: For any two events A and B,

P(A \cup B) = P(A) + P(B) - P(A \cap B)

Proof. Split the union A \cup B into three mutually exclusive pieces:

A \cup B = (A \cap B') \cup (A \cap B) \cup (A' \cap B)
[Figure: Venn diagram of two overlapping circles A and B inside a rectangle S, split into three disjoint regions: A ∩ B' (the part of A outside B), A ∩ B (the overlap), and A' ∩ B (the part of B outside A).]
The union $A \cup B$ splits into three mutually exclusive pieces: the part of $A$ outside $B$, the overlap $A \cap B$, and the part of $B$ outside $A$. Axiom 3 says the probability of the union is the sum of these three pieces.

By Axiom 3 (the three pieces are disjoint):

P(A \cup B) = P(A \cap B') + P(A \cap B) + P(A' \cap B)

Now observe that A itself splits into two disjoint parts: A = (A \cap B') \cup (A \cap B), so P(A) = P(A \cap B') + P(A \cap B), giving P(A \cap B') = P(A) - P(A \cap B).

Similarly, B = (A' \cap B) \cup (A \cap B), so P(A' \cap B) = P(B) - P(A \cap B).

Substituting both into the equation above:

P(A \cup B) = [P(A) - P(A \cap B)] + P(A \cap B) + [P(B) - P(A \cap B)]
P(A \cup B) = P(A) + P(B) - P(A \cap B)

The subtraction at the end corrects for double-counting: when you add P(A) and P(B), the overlap region A \cap B gets counted twice, so you subtract it once to get the right answer.
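The double-counting correction can be checked concretely on a fair die. Taking A = "even" and B = "greater than 3" (an illustrative choice, not from the text above), the overlap \{4, 6\} is what the subtraction removes:

```python
from fractions import Fraction

# Fair die: A = "even" = {2,4,6}, B = "greater than 3" = {4,5,6}.
p = {k: Fraction(1, 6) for k in range(1, 7)}

def P(event):
    """Probability of an event = sum of its outcome probabilities (Axiom 3)."""
    return sum(p[k] for k in event)

A, B = {2, 4, 6}, {4, 5, 6}
lhs = P(A | B)                  # direct: A ∪ B = {2,4,5,6}, so 4/6
rhs = P(A) + P(B) - P(A & B)    # addition rule: 3/6 + 3/6 - 2/6
```

Both sides come out to 2/3, and dropping the subtraction would give 1, visibly overcounting the overlap.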

Seeing it with numbers

Let's make this concrete with two worked examples.

Example 1: A loaded die

A six-sided die is loaded so that the probability of each face is proportional to the number on that face. Find the probability of rolling an even number.

Step 1. Set up the probability assignment. The probabilities must be proportional to the face values, so P(\{k\}) = ck for some constant c, where k = 1, 2, 3, 4, 5, 6.

Why: "proportional to the face value" means P(\{1\}) : P(\{2\}) : \cdots : P(\{6\}) = 1 : 2 : 3 : 4 : 5 : 6. Introducing a constant c captures this.

Step 2. Use Axiom 2 to find c. The probabilities of all outcomes must sum to 1:

c(1 + 2 + 3 + 4 + 5 + 6) = 1 \implies 21c = 1 \implies c = \frac{1}{21}

Why: Axiom 2 forces P(S) = 1. This is the constraint that pins down c.

Step 3. Write out the individual probabilities:

P(\{1\}) = \frac{1}{21}, \quad P(\{2\}) = \frac{2}{21}, \quad P(\{3\}) = \frac{3}{21}, \quad P(\{4\}) = \frac{4}{21}, \quad P(\{5\}) = \frac{5}{21}, \quad P(\{6\}) = \frac{6}{21}

Why: each probability is non-negative (Axiom 1 is satisfied) and they sum to 1 (Axiom 2 is satisfied). This is a valid probability assignment.

Step 4. Find P(\text{even}). The event "even" is \{2, 4, 6\}. These are mutually exclusive outcomes, so by Axiom 3:

P(\{2, 4, 6\}) = P(\{2\}) + P(\{4\}) + P(\{6\}) = \frac{2}{21} + \frac{4}{21} + \frac{6}{21} = \frac{12}{21} = \frac{4}{7}

Why: on a fair die, P(\text{even}) = 1/2. On this loaded die, larger numbers are more likely, and since 2 + 4 + 6 > 1 + 3 + 5, even numbers are collectively more probable.

Result: P(\text{even}) = \dfrac{4}{7} \approx 0.571.

[Figure: bar chart of the loaded die, one bar per face with heights 1/21 through 6/21; the bars for the even faces 2, 4, and 6 are highlighted in red and together total 4/7.]
Each bar represents the probability of one face of the loaded die. The red bars (even faces) together make up $4/7$ of the total probability — more than half, because the heavier faces are disproportionately even.

The bar chart makes the answer visible: the three red bars (even faces) take up more than half the total height, because the larger (and hence more probable) faces include 4 and 6, both of which are even.
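The four steps of the loaded-die example translate directly into a short computation, with Axiom 2 pinning down the constant c:

```python
from fractions import Fraction

# Step 1-2: face k has probability c*k; Axiom 2 forces the total to be 1.
c = Fraction(1, sum(range(1, 7)))        # c = 1/(1+2+...+6) = 1/21
p = {k: c * k for k in range(1, 7)}      # Step 3: P({k}) = k/21

# Step 4: "even" = {2, 4, 6}; disjoint outcomes add by Axiom 3.
p_even = p[2] + p[4] + p[6]
```

The exact arithmetic confirms p_even = 12/21 = 4/7.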

Example 2: Complement rule in action

A bag contains 10 balls numbered 1 through 10. Two balls are drawn at random without replacement. Find the probability that at least one ball is greater than 8.

Step 1. Identify the complement. "At least one ball greater than 8" is the complement of "both balls are \leq 8." The complement is easier to count.

Why: "at least one" problems almost always become simpler through complements. Instead of tracking overlapping cases (exactly one > 8, or both > 8), compute the one case where none is > 8.

Step 2. Count the sample space. Two balls from 10, order does not matter: \binom{10}{2} = 45.

Why: each pair of distinct balls is one outcome, and all pairs are equally likely (random draw).

Step 3. Count the complement event. Both balls \leq 8 means both are chosen from the 8 balls \{1, 2, \ldots, 8\}: \binom{8}{2} = 28.

Why: to avoid any ball greater than 8, you are restricted to the 8 smaller balls.

Step 4. Apply the complement rule.

P(\text{both} \leq 8) = \frac{28}{45}
P(\text{at least one} > 8) = 1 - \frac{28}{45} = \frac{17}{45}

Why: this is Property 2 in action. P(A) = 1 - P(A'), derived from the axioms.

Result: P(\text{at least one ball} > 8) = \dfrac{17}{45} \approx 0.378.

[Figure: a rectangle representing all 45 equally likely pairs, split into two regions: 28 pairs where both balls are ≤ 8 (P = 28/45) and 17 pairs where at least one ball is > 8 (P = 17/45).]
The 45 equally likely pairs split cleanly into two groups. The complement (both $\leq 8$) has 28 pairs. The event (at least one $> 8$) has $45 - 28 = 17$ pairs. The complement rule converts one into the other with a single subtraction.

The picture shows why complements are so powerful: instead of listing the 17 pairs that satisfy "at least one > 8" (which requires tracking two sub-cases), you list the 28 pairs that don't — a single, clean count — and subtract.
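Because the sample space has only 45 pairs, the whole example can be verified by enumeration, counting both the complement and the event:

```python
from fractions import Fraction
from itertools import combinations

# All unordered pairs of distinct balls from {1, ..., 10}.
pairs = list(combinations(range(1, 11), 2))      # C(10, 2) = 45 pairs

# The complement: both balls <= 8, i.e. both drawn from {1, ..., 8}.
both_small = [pr for pr in pairs if max(pr) <= 8]  # C(8, 2) = 28 pairs

# Complement rule (Property 2): P(at least one > 8) = 1 - P(both <= 8).
p_event = 1 - Fraction(len(both_small), len(pairs))
```

The counts match the binomial coefficients in Steps 2 and 3, and p_event comes out to 17/45.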

Common confusions

A few things students reliably get wrong about the axiomatic approach.

The axioms tell you what number to assign. They do not. They are constraints, not a recipe: whether P(\text{heads}) is 0.5 or 0.7 depends on the coin, and the axioms only rule out inconsistent assignments (negative probabilities, totals other than 1, or failures of additivity).

Axiom 3 applies to any collection of events. It applies only to mutually exclusive events. For overlapping events, P(A) + P(B) counts the intersection twice, which is exactly why the addition rule subtracts P(A \cap B).

P(\varnothing) = 0 is a fourth axiom. It is not an axiom at all. Property 1 above derives it from the three axioms; there was no choice in the matter.

Going deeper

If you came here to understand what the axioms of probability are and what basic properties follow from them, you have it — you can stop here. The rest of this section is for readers who want the historical context, the subtleties of the axiom system, and the connection to measure theory.

Why axioms, and not a definition?

Before the axiomatic approach, there were two competing "definitions" of probability. The classical definition (P(A) = n(A)/n(S), due to Laplace) requires equally likely outcomes — which is circular, because "equally likely" means "each has the same probability." The frequentist definition (P(A) is the long-run relative frequency) is experimentally grounded but mathematically vague — it doesn't say how long the "long run" must be, or why the relative frequency should converge.

The axiomatic approach sidesteps both problems. It does not try to define what probability "really is." Instead, it says: whatever probability means to you — a degree of belief, a long-run frequency, a physical symmetry — the number you assign must satisfy three axioms. Then the entire theory follows from those rules. This freed mathematicians from philosophical debates and let them build a rigorous, general theory.

Sigma-algebras and measure theory

In the definition above, we quietly said "every event A \subseteq S." For finite sample spaces (dice, cards, coins), this is fine — there are finitely many subsets, and you can assign a probability to each. But for continuous sample spaces (like the interval [0, 1]), the collection of subsets is vast, and it turns out you cannot consistently assign probabilities to all of them. The resolution is to restrict attention to a special collection of subsets called a sigma-algebra (\sigma-algebra), which is closed under complements and countable unions. Probability is then a function defined only on this sigma-algebra, not on all subsets.

This is the starting point of measure theory — the branch of mathematics where probability lives when you need full generality. For the problems in school-level probability (finite or countable sample spaces), you will never need sigma-algebras. But knowing they exist tells you that the three axioms are not the whole story — they are the visible part of a deeper structure.

Kolmogorov's contribution

The axiomatization of probability was published in 1933 in a monograph titled Grundbegriffe der Wahrscheinlichkeitsrechnung ("Foundations of the Theory of Probability"). The key insight was that probability is a special case of a measure — a function that assigns sizes to sets — satisfying the extra constraint that the total measure is 1. This made probability a branch of measure theory, and suddenly all the powerful tools of real analysis became available to probabilists. The entire modern theory of stochastic processes, random variables, and statistical inference rests on this foundation.

Finite additivity vs countable additivity

Axiom 3 says the additivity rule works for countably many disjoint events, not just finitely many. This distinction matters. Finite additivity (the rule holds for any finite number of disjoint events) is weaker and can lead to pathological probability models that violate our intuition about limits. Countable additivity is the stronger requirement that makes limits work properly: if A_1 \subseteq A_2 \subseteq A_3 \subseteq \cdots is an increasing sequence of events with A_n \to A, then countable additivity guarantees P(A_n) \to P(A). Without it, probability would not connect to convergence, and the entire theory of large-sample statistics would collapse.
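The limit statement can be illustrated numerically. Take the increasing events A_n = "a head appears within the first n tosses of a fair coin" (an illustrative choice, not from the text above), so P(A_n) = 1 - (1/2)^n, increasing toward P(A) = 1 for the limiting event "a head eventually appears":

```python
# P(A_n) = 1 - (1/2)^n for A_n = "a head within the first n tosses".
# Countable additivity guarantees P(A_n) -> P(A) = 1 as A_n grows to A.
probs = [1 - 0.5 ** n for n in (1, 5, 10, 20, 50)]

# The sequence is strictly increasing and approaches 1.
assert all(a < b for a, b in zip(probs, probs[1:]))
```

With floats, 1 - 0.5**50 is already within about 10^{-15} of 1; the point is the guaranteed convergence, which finite additivity alone does not deliver.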

Where this leads next

You now have the foundation. Every result from here on is built on the three axioms you just learned.