Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Safety: the five Raft invariants
Open a notebook and write down what you would actually check, after every single message a Raft cluster sends, to convince yourself the protocol is correct. The honest answer is exactly five things. Lose any one of them and the cluster can commit two different commands at the same log index — a corruption no amount of subsequent agreement can repair.
In short
Raft's correctness is five sentences. Election Safety: at most one leader per term. Leader Append-Only: a leader only appends to its own log. Log Matching: same (index, term) ⇒ identical prefix. Leader Completeness: every committed entry survives in every later leader. State Machine Safety: no two replicas apply different commands at the same index.
The first four are local rules enforced by tiny code paths; the fifth is a theorem you get for free. The subtle one is Leader Completeness, and its enforcement requires the current-term commit rule — a leader may only mark an entry committed if the entry is from its own term. Drop that one clause and the Figure-8 anomaly fires: an entry is replicated to a majority, looks committed, and gets overwritten anyway.
The previous chapter built log replication and the commit index on the gentleman's assumption that the protocol just works. This is where you cash that cheque. Correctness in distributed systems is never "I traced the happy path" — it is "for every interleaving of crashes, partitions, message reorderings, and elections an adversary can throw at me, no replica's state machine ever disagrees with another at any committed index." For a Raft-based ledger that is moving ₹50,000 between PaisaBridge accounts, "disagrees" means the bank is broken — one replica thinks Riya has the money, another thinks Rahul does, and there is no later message that can repair the split.
The Raft paper proves this property via four supporting invariants and a fifth that follows. This chapter does three things. First, it walks the five precisely — statement, mechanism, what each one delivers. Second, it ships a 60-line Python checker you can run after every step of a simulated trace; the checker is what turns "I think it's correct" into "I have asserted it on a million random executions." Third, it walks the Figure-8 anomaly — the one scenario where a naive commit rule, even with everything else in place, breaks Leader Completeness — and shows the one-line fix.
The invariant stack
The five invariants are not parallel. They form a stack: each one is enforced (or proven) using the ones below it. Drawing the dependency makes the rest of the chapter cleaner.
Now walk each one.
1. Election Safety
Statement. For any term
T, at most one server is leader during termT.
Mechanism. A candidate becomes leader only on receiving votes from a strict majority of the cluster — floor(N/2) + 1 votes from N configured servers. Each server casts at most one vote per term, and that vote is persisted to disk (in votedFor) before the response is sent. Any two majorities of N servers must intersect in at least one server (pigeonhole: > N/2 + > N/2 > N). That common server cannot have voted for two different candidates in the same term, so two candidates cannot both reach majority.
What it delivers. Within any single term there is a single source of truth for new entries. The leader's local log decides cluster history for that term; followers do not append on their own.
Why the durable votedFor is non-negotiable: imagine the alternative. A server in Bengaluru votes for candidate A in term 7, the rack loses power, the server restarts thinking it never voted in term 7, and votes for candidate B. Both A and B reach majority. Now there are two leaders in term 7 — both append entries at index 50, the followers diverge, and State Machine Safety dies. The fsync of votedFor before sending the vote response is the wall preventing this; skipping it for "performance" is a not-uncommon mistake in homegrown Raft implementations and produces a cluster that looks correct under no failures and silently corrupts under crash-restart.
This is a per-term invariant only. Across terms the cluster can and does have many leaders — Figure-8 below depends on it. Election Safety just keeps each individual term clean.
2. Leader Append-Only
Statement. A leader never overwrites or deletes entries in its own log; it only appends.
Mechanism. The leader's code path for log mutation has exactly one operation: log.append(entry). There is no truncation method, no overwrite path, no "fix-up" routine. When a server transitions to leader, its log is whatever it was at end-of-candidacy; from that moment until it steps down, the log only grows.
What it delivers. Indices and meanings are stable during a leader's tenure. Once leader L has decided that index k holds (t=4, "PUT x=7"), that decision is not revoked by L. Followers that learned (k, 4, "PUT x=7") from L will not later be told a different story by L.
Why the asymmetry with followers matters: the corresponding statement is not true for followers. A follower's log can be truncated by AppendEntries — that is exactly how the repair loop fixes divergence. Leaders are append-only, followers are mutable. The protocol exploits this asymmetry by treating the leader's log as authoritative and forcing every follower to converge to a prefix of it. If leaders were also mutable, the cluster would have no anchor to converge toward.
3. Log Matching
Statement. If two logs contain an entry with the same
(index, term), then the two logs are identical in all entries up through that index.
Mechanism. Two facts in combination. (a) Within a term there is at most one leader (Election Safety) and that leader is append-only (Leader Append-Only), so a (index, term) pair globally identifies a unique entry — exactly one leader could have written at that index in that term, and that leader wrote at most one thing there. (b) The AppendEntries RPC carries prevLogIndex and prevLogTerm; the follower accepts the new batch only if its log already contains an entry with that exact pair. By induction on log length, if the pair at index k matches and the check at k-1 passes on every accept, then all entries 1..k match.
What it delivers. A single-pair consistency check certifies the entire prefix. Followers can be brought into agreement with the leader by checking twelve bytes of metadata, not by replaying the whole log.
Why both ingredients are needed: the prevLog check alone would not be enough if leaders could overwrite — the check would pass against a "current" version of an entry that the leader had since changed. Leader Append-Only alone would not be enough without the check — followers could append entries received from old leaders and disagree on the prefix even at matching (index, term) pairs because nothing forced the prefix to agree. Together they form the induction: same pair ⇒ same entry ⇒ same predecessors enforced by the check ⇒ same prefix all the way down.
4. Leader Completeness
Statement. If an entry is committed in term
T, then that entry appears in the log of every leader of every term greater thanT.
Mechanism. Two parts working in tandem.
- Up-to-date-log vote restriction. A voter rejects a candidate whose log is "older" than its own. "Older" is defined by the last log entry: compare last log terms first; if equal, compare last log indices. A candidate is granted a vote only if its last entry is at least as recent as the voter's.
- Current-term commit rule. A leader may directly commit only entries from its own current term. Entries from previous terms become committed implicitly, when a current-term entry at a higher index commits — at which point Log Matching forces every entry below it to be committed too.
What it delivers. A new leader's log is guaranteed to contain every entry that any previous leader committed. Committed history is durable across leader changes; the protocol has no way to "un-commit" something just because the leader who committed it died.
The geometric reason the up-to-date rule works: a committed entry sits on a majority. A new leader is elected by a majority. Any two majorities of an N-server cluster overlap in at least one server. That intersecting server has the committed entry (it is in the commit majority) and voted for the new leader (it is in the election majority). The vote restriction forces the candidate's log to be at least as up-to-date as the intersecting voter's, so the candidate's log has the entry too. The current-term rule plugs the one gap that makes this argument actually true under all interleavings — the gap that Figure-8 walks straight through.
A 60-line invariant checker
The cleanest way to internalise the five invariants is to write code that asserts them. Open your editor. Type this in. Do not copy-paste — type it. The point is to feel which fields each invariant touches.
# raft_invariants.py — a checker you run after every step of a Raft trace
from collections import defaultdict
def check_invariants(servers, history):
"""Assert all five invariants on the current state of `servers`.
`servers` is a dict {server_id: ServerState}. ServerState has:
- current_term: int
- state: 'leader' | 'follower' | 'candidate'
- log: list of Entry(term, cmd); index 0 unused, log[1] is index 1
- commit_index: int
- voted_for: dict mapping term -> server_id (durable history of votes)
- applied: list of Entry; what reached the state machine, in order
`history` is the flat sequence of (term, leader_id) facts seen so far,
used to assert Leader Append-Only across leader transitions.
"""
leaders_by_term = defaultdict(set)
for sid, s in servers.items():
if s.state == 'leader':
leaders_by_term[s.current_term].add(sid)
# 1. Election Safety
for term, set_of_leaders in leaders_by_term.items():
assert len(set_of_leaders) <= 1, \
f"two leaders in term {term}: {set_of_leaders}"
# 2. Leader Append-Only
for sid, s in servers.items():
prev = history.get((sid, s.current_term))
if s.state == 'leader' and prev is not None:
# leader's log must be a (non-strict) extension of its earlier log
assert s.log[:len(prev)] == prev, \
f"server {sid} truncated its leader log in term {s.current_term}"
# 3. Log Matching: same (index, term) ⇒ same prefix
for a in servers.values():
for b in servers.values():
for i in range(1, min(len(a.log), len(b.log))):
if a.log[i].term == b.log[i].term:
assert a.log[:i+1] == b.log[:i+1], \
f"log-matching violated at index {i}"
# 4. Leader Completeness: committed entries survive into later leaders
for s in servers.values():
if s.state != 'leader':
continue
for other in servers.values():
for i in range(1, min(s.commit_index, other.commit_index) + 1):
# any earlier-committed (index, term) must appear in this leader
if i < len(other.log) and other.log[i].term < s.current_term:
assert i < len(s.log) and s.log[i] == other.log[i], \
f"leader {s.current_term} missing committed entry {i}"
# 5. State Machine Safety: no two servers applied a different cmd at i
for a in servers.values():
for b in servers.values():
for i in range(min(len(a.applied), len(b.applied))):
assert a.applied[i] == b.applied[i], \
f"state-machine divergence at applied index {i}"
Walk it line by line.
leaders_by_term = defaultdict(set). Election Safety is a per-term property, so we group active leaders by their current term. The set matters: we are asserting that each set has size ≤ 1. With a list the same leader appearing twice in our snapshot (an ordering artefact) would falsely trip the assertion.
assert s.log[:len(prev)] == prev. Leader Append-Only is enforced not by inspecting code paths but by comparing snapshots: the leader's log must be a prefix-extension of any earlier snapshot of its log within the same term. The history dict is keyed by (sid, term), so a server returning to leadership in a later term gets a fresh slate — that is correct, because the constraint is per-tenure.
The double for a, b loop in Log Matching. We compare every pair of logs at every index where their terms agree. This is O(N² · L) and absurd at production scale, but the checker is only meant to run on a randomised trace with N=5 servers and a few hundred log entries; cost is irrelevant. The point is that the assertion expresses the invariant directly: matching (index, term) ⇒ matching prefix.
Leader Completeness. The check looks at every leader and every other server. If the other server has a committed entry from an earlier term (other.log[i].term < s.current_term), the current leader must hold an identical entry at that index. We compare s.log[i] == other.log[i] rather than just term-and-index — Log Matching guarantees that if they have the same (index, term) they have the same payload, but the assertion is more readable when stated directly.
State Machine Safety. Pure pairwise comparison of the applied lists. Nothing clever; no use of the other invariants. If a divergence ever shows up here, it is the invariant the user actually cares about, and the bug is in the upstream commit-or-replicate machinery.
Run this checker after every transition (every AppendEntries, every vote, every commit-index advance) of a randomised Raft simulator and you will get a property-based test for the protocol. With the current-term commit rule in place, ten million random executions will not trip the assertions. Remove the rule — that one if entry.term != self.current_term: continue line in try_advance_commit_index — and the State Machine Safety assertion fires within a few thousand executions, on the Figure-8 interleaving.
$ python simulate.py --servers 5 --trials 100000 --seed 42
ok: 100000/100000 traces, 0 invariant violations
$ python simulate.py --servers 5 --trials 100000 --seed 42 --buggy-commit
trace 3147: AssertionError: state-machine divergence at applied index 2
server S5 applied (term=3, "Y")
server S2 applied (term=2, "X")
The buggy-commit run is reproducing Figure-8.
The Figure-8 anomaly
Figure-8 is the scenario the Raft paper uses (its actual figure 8) to motivate the current-term commit rule. It shows a sequence where an entry replicated to a majority can still be lost if the leader's commit rule is naive. The scenario is adversarial — it requires specific failures, partitions, and elections to align — but every step is a legal Raft transition.
The Figure-8 trace, step by step
Setup. Five servers S1..S5 running a Raft-backed ledger for a small fintech in Pune. Cluster size is 5, majority is 3.
Term 2 — S1 leader. S1 wins and a client writes operation X (transfer ₹1 lakh from Asha to Dipti). S1 appends X at index 2 (term 2), ships it to S2 who acknowledges. X is now on {S1, S2}. Before S1 ships further, the rack S1 is on loses power.
Term 3 — S5 leader. S5, which never received X, wins the election. The up-to-date check passes against S3 and S4 (their logs are empty at this index, so anything beats them). A client connected to S5 sends operation Y (refund ₹1 lakh to Asha because the merchant cancelled). S5 appends Y at index 2 (term 3) but is partitioned from {S2, S3, S4} before replicating it.
Term 4 — S1 leader again. S1's rack is back. A new election happens; S1 wins term 4 (its log has X at index 2, term 2 — beats the empty logs of S3 and S4). As leader of term 4, S1 re-replicates X to S3. X is now on {S1, S2, S3} — three of five, a majority.
Now the question. Naive Raft would see the replica count and commit X. Buggy Raft would tell the application: "X is committed; the transfer happened." Asha's account would be marked debited.
Term 5 — disaster. S1 crashes again before doing anything else in term 4. S5 reconnects, runs for office in term 5. S5's log has Y at index 2 with term 3. S2 / S3 / S4 have last log term = 2 (entry X). S5's last log term is 3, strictly greater. The up-to-date check says S5 is more up-to-date, so S5 wins votes from S3 and S4 (and votes for itself). S5 becomes leader of term 5 and ships its log to everyone — Y overwrites X at index 2 on every server.
X was "committed" in term 4 and silently lost in term 5. The application was told the transfer happened. It did not. State Machine Safety violated, and the bank is now broken.
The fix. When S1 is leader in term 4, the entry X is from term 2 — a previous term. The current-term commit rule says: a leader may directly commit only entries from its own current term. So in term 4, S1 is forbidden from committing X even though three replicas hold it. X commits only if S1 first replicates a fresh term-4 entry (call it Z) to a majority and that current-term entry commits — at which point Log Matching guarantees X (sitting below Z in the log) is on the same majority and is implicitly committed.
The implementation is one conditional in try_advance_commit_index:
def try_advance_commit_index(self):
"""Recompute commit_index after a follower acknowledged AppendEntries."""
for n in range(self.commit_index + 1, len(self.log)):
if self.log[n].term != self.current_term:
continue # the Figure-8 fix: never directly commit a previous-term entry
replicas = 1 + sum(1 for f in self.followers if self.match_index[f] >= n)
if replicas > len(self.cluster) // 2:
self.commit_index = n
The continue is the entire fix. Without it, S1 in term 4 finds n=2 with three replicas and sets commit_index = 2. With it, self.log[2].term == 2 != 4 == self.current_term, the loop skips, X stays uncommitted, and Asha's account is not marked debited until S1 has actually anchored a current-term entry on top.
Common confusions
-
"Election Safety alone prevents split-brain." No. It only prevents two leaders in the same term. Two leaders in different terms is normal — Figure-8 has four of them. What stops a deposed leader from committing entries is the durable
votedForand term comparison: any AppendEntries from a leader with a smaller term is rejected, so the deposed leader cannot reach majority on anything. -
"Log Matching means logs are always identical." No. Log Matching says identical prefixes given a matching
(index, term)pair somewhere. Followers regularly diverge from the leader on the suffix beyond the last matching pair; the AppendEntries repair loop is what truncates and overwrites that divergent suffix until the prefixes line up. -
"An entry on a majority is committed." No — and this is the subtle part. An entry from the current term, on a majority, is committed. An entry from a previous term, on a majority, is not yet committed by the current leader; it becomes committed only when a current-term entry above it commits. The Figure-8 anomaly is what happens when you ignore this distinction.
-
"Leader Completeness is enforced by the up-to-date vote restriction." Half-true. The up-to-date check is necessary but not sufficient on its own; you also need the current-term commit rule. Without the latter, the up-to-date check is meaningful only against entries that genuinely got committed under the rule, and "committed" without the rule is a moving target. The two work as a pair.
-
"State Machine Safety is what Raft enforces." Technically no. Raft enforces invariants 1–4; State Machine Safety follows as a theorem from 2, 3, and 4. This matters when you write tests: assertions on invariants 1–4 are enforcement assertions (they should never fail at runtime if your code is right); the assertion on invariant 5 is a correctness assertion on the whole protocol (it should never fail if invariants 1–4 are correctly enforced).
-
"This proof depends on synchronous clocks." It does not. Nothing in the safety argument uses wall-clock time. Liveness — "eventually a leader gets elected" — uses randomised election timeouts to break symmetry, but that is a progress argument, not a safety argument. Raft's safety holds under arbitrary message delays, reorderings, and partitions, as long as messages are not corrupted (which TLS plus checksums handle at a lower layer).
Going deeper
Why the proof is by induction on the log
The Log Matching argument is presented as a static "same pair ⇒ same prefix" rule, but the proof is inductive. Base case: the empty log matches the empty log trivially. Inductive step: assume any two logs agreeing at indices 1..k-1 actually contain identical entries there. AppendEntries with prevLogIndex = k-1 only succeeds if both logs have the same (k-1, term_{k-1}); the inductive hypothesis then says the prefix 1..k-1 is identical; the new entry at k is whatever the leader sent, which is the same to all followers; so prefix 1..k is identical. The induction works because Election Safety pins down "what the leader sent" to a single thing per (index, term). Drop Election Safety and the induction breaks because two leaders in the same term could have sent different entries with the same (index, term).
Where the Raft paper's proof handwaves
§5.4 of the paper is admirably concrete but glosses one corner: the precise statement of "up-to-date." The paper says "the log with the later last term is more up-to-date; if the terms are equal, the longer log is more up-to-date." Implementations that compare lengths-as-indices versus lengths-as-counts have been observed to disagree by 1, which usually does not matter but in adversarial traces can. Heidi Howard's TLA+ specification pins this down precisely; if you are implementing Raft for production, read the spec, not just the paper.
Verdi: a Coq-verified Raft
The University of Washington's Verdi project produced a Raft implementation whose safety properties are machine-checked in Coq. The verification effort (Wilcox, Woos, et al., 2015) found subtle issues in the prose proof — most notably around how membership changes interact with the current-term commit rule. If you want the highest confidence Raft you will read this year, it is theirs. The performance overhead is real but the bugs found-and-fixed make for fascinating reading.
Multi-Paxos has the same fix, expressed differently
Multi-Paxos faces the same hazard: a freshly elected proposer cannot certify a value chosen in a previous round without first running "phase 1" to learn what previous proposers proposed. In Raft, the analogous step is implicit in becoming leader (the up-to-date check ensures you have the committed history) but the commit rule is what enforces that you cannot certify a previous-term entry as committed until you have made some progress in your own term. Howard and Mortier's Paxos vs Raft (2020) shows the two protocols are isomorphic at the safety layer; Raft just compiles the same fix into a more readable conditional.
Operational side: fencing and lease-based reads
Safety as proved here covers writes — committed log entries — but production clusters must also serve reads safely. A leader that is partitioned can still believe it is the leader and answer reads from a stale state machine. The standard fix is ReadIndex: the leader records its current commit index, sends a heartbeat round to confirm leadership, and waits until its apply_index reaches that recorded index before answering. A weaker variant uses leases, which trade a clock-skew assumption for a faster read path. Neither changes the five invariants; both add a sixth concern (linearizable reads) that is layered on top. See etcd's read-only request handling for production code paths.
Where this leads next
- Membership changes (joint consensus) — the five invariants must keep holding while the cluster's notion of "majority" changes mid-flight. Joint consensus and single-server changes are the two ways to do that.
- Log compaction and snapshots — when logs are truncated, Log Matching has to be extended: the snapshot's last-included index and term act as a synthetic entry for consistency-check purposes.
- Paxos, Multi-Paxos, ZAB — what Raft simplified — the same five safety properties, expressed in three different vocabularies.
- The replicated state machine abstraction — why State Machine Safety is the property the application layer cares about, and how this entire stack delivers it.
References
- Diego Ongaro and John Ousterhout, In Search of an Understandable Consensus Algorithm (Extended), §5.4 (Safety) and figure 8. The original paper; the safety proof is in §5.4 and Figure 8 is exactly what this chapter walks through.
- Diego Ongaro, Consensus: Bridging Theory and Practice (PhD thesis, Stanford 2014). Chapter 3 has the most complete proof of the five invariants; chapter 4 extends them across membership changes.
- Heidi Howard, TLA+ specifications and proofs of Raft. Machine-checked safety with a careful treatment of where the prose proof handwaves.
- Doug Woos et al., Planning for Change in a Formal Verification of the Raft Consensus Protocol (CPP 2016). The Verdi project's Coq-verified Raft, including issues uncovered in the original prose proof.
- Heidi Howard and Richard Mortier, Paxos vs Raft: Have we reached consensus on distributed consensus? (PaPoC 2020). Compares the safety mechanisms of the two algorithms and argues they are isomorphic.
- etcd authors, etcd-io/raft README and design notes. Production-grade implementation; see
node.goandraft.gofor the commit-index advancement code that ships the current-term rule. - Internal: Log replication and the commit index — the chapter immediately preceding this one; required reading for the AppendEntries machinery the safety argument rides on top of.