Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Safety: the five Raft invariants

Open a notebook and write down what you would actually check, after every single message a Raft cluster sends, to convince yourself the protocol is correct. The honest answer is exactly five things. Lose any one of them and the cluster can commit two different commands at the same log index — a corruption no amount of subsequent agreement can repair.

In short

Raft's correctness is five sentences. Election Safety: at most one leader per term. Leader Append-Only: a leader only appends to its own log. Log Matching: same (index, term) ⇒ identical prefix. Leader Completeness: every committed entry survives in every later leader. State Machine Safety: no two replicas apply different commands at the same index.

The first four are local rules enforced by tiny code paths; the fifth is a theorem you get for free. The subtle one is Leader Completeness, and its enforcement requires the current-term commit rule — a leader may only mark an entry committed if the entry is from its own term. Drop that one clause and the Figure-8 anomaly fires: an entry is replicated to a majority, looks committed, and gets overwritten anyway.

The previous chapter built log replication and the commit index on the gentleman's assumption that the protocol just works. This is where you cash that cheque. Correctness in distributed systems is never "I traced the happy path" — it is "for every interleaving of crashes, partitions, message reorderings, and elections an adversary can throw at me, no replica's state machine ever disagrees with another at any committed index." For a Raft-based ledger that is moving ₹50,000 between PaisaBridge accounts, "disagrees" means the bank is broken — one replica thinks Riya has the money, another thinks Rahul does, and there is no later message that can repair the split.

The Raft paper proves this property via four supporting invariants and a fifth that follows. This chapter does three things. First, it walks the five precisely — statement, mechanism, what each one delivers. Second, it ships a 60-line Python checker you can run after every step of a simulated trace; the checker is what turns "I think it's correct" into "I have asserted it on a million random executions." Third, it walks the Figure-8 anomaly — the one scenario where a naive commit rule, even with everything else in place, breaks Leader Completeness — and shows the one-line fix.

The invariant stack

The five invariants are not parallel. They form a stack: each one is enforced (or proven) using the ones below it. Drawing the dependency makes the rest of the chapter cleaner.

The five Raft invariants and their enforcing mechanismsFive horizontally stacked rectangles labelled with the five invariants from bottom to top: Election Safety, Leader Append-Only, Log Matching, Leader Completeness, State Machine Safety. Each is paired with a box on the right naming the mechanism that enforces it. The top invariant is highlighted as a consequence rather than a separately enforced rule.Five invariants — what enforces what1. Election SafetyAt most one leader per term2. Leader Append-OnlyA leader never overwrites or deletes its own log3. Log MatchingSame (index, term) ⇒ identical prefix4. Leader CompletenessCommitted entries appear in every future leader5. State Machine SafetyNo two replicas apply different commands at same index(consequence of 2 + 3 + 4)Majority votes + at-most-one-vote-per-term(durable votedFor, fsync before reply)Leader code: only appends to log(no truncation paths in leader state)prevLogIndex / prevLogTerm match in AppendEntries(induction over log length)Up-to-date-log vote restriction+ current-term commit rule (Figure-8 fix)solid arrows: enforced by mechanism on the right · dashed: contributes upward
The dependency picture. The bottom four invariants are each enforced by one local rule. The top one is a theorem composed from the three above Election Safety; that is what the dashed up-arrows mean. Read the article bottom-up — each layer assumes the one below.

Now walk each one.

1. Election Safety

Statement. For any term T, at most one server is leader during term T.

Mechanism. A candidate becomes leader only on receiving votes from a strict majority of the cluster — floor(N/2) + 1 votes from N configured servers. Each server casts at most one vote per term, and that vote is persisted to disk (in votedFor) before the response is sent. Any two majorities of N servers must intersect in at least one server (pigeonhole: > N/2 + > N/2 > N). That common server cannot have voted for two different candidates in the same term, so two candidates cannot both reach majority.

What it delivers. Within any single term there is a single source of truth for new entries. The leader's local log decides cluster history for that term; followers do not append on their own.

Why the durable votedFor is non-negotiable: imagine the alternative. A server in Bengaluru votes for candidate A in term 7, the rack loses power, the server restarts thinking it never voted in term 7, and votes for candidate B. Both A and B reach majority. Now there are two leaders in term 7 — both append entries at index 50, the followers diverge, and State Machine Safety dies. The fsync of votedFor before sending the vote response is the wall preventing this; skipping it for "performance" is a not-uncommon mistake in homegrown Raft implementations and produces a cluster that looks correct under no failures and silently corrupts under crash-restart.

This is a per-term invariant only. Across terms the cluster can and does have many leaders — Figure-8 below depends on it. Election Safety just keeps each individual term clean.

2. Leader Append-Only

Statement. A leader never overwrites or deletes entries in its own log; it only appends.

Mechanism. The leader's code path for log mutation has exactly one operation: log.append(entry). There is no truncation method, no overwrite path, no "fix-up" routine. When a server transitions to leader, its log is whatever it was at end-of-candidacy; from that moment until it steps down, the log only grows.

What it delivers. Indices and meanings are stable during a leader's tenure. Once leader L has decided that index k holds (t=4, "PUT x=7"), that decision is not revoked by L. Followers that learned (k, 4, "PUT x=7") from L will not later be told a different story by L.

Why the asymmetry with followers matters: the corresponding statement is not true for followers. A follower's log can be truncated by AppendEntries — that is exactly how the repair loop fixes divergence. Leaders are append-only, followers are mutable. The protocol exploits this asymmetry by treating the leader's log as authoritative and forcing every follower to converge to a prefix of it. If leaders were also mutable, the cluster would have no anchor to converge toward.

3. Log Matching

Statement. If two logs contain an entry with the same (index, term), then the two logs are identical in all entries up through that index.

Mechanism. Two facts in combination. (a) Within a term there is at most one leader (Election Safety) and that leader is append-only (Leader Append-Only), so a (index, term) pair globally identifies a unique entry — exactly one leader could have written at that index in that term, and that leader wrote at most one thing there. (b) The AppendEntries RPC carries prevLogIndex and prevLogTerm; the follower accepts the new batch only if its log already contains an entry with that exact pair. By induction on log length, if the pair at index k matches and the check at k-1 passes on every accept, then all entries 1..k match.

What it delivers. A single-pair consistency check certifies the entire prefix. Followers can be brought into agreement with the leader by checking twelve bytes of metadata, not by replaying the whole log.

Why both ingredients are needed: the prevLog check alone would not be enough if leaders could overwrite — the check would pass against a "current" version of an entry that the leader had since changed. Leader Append-Only alone would not be enough without the check — followers could append entries received from old leaders and disagree on the prefix even at matching (index, term) pairs because nothing forced the prefix to agree. Together they form the induction: same pair ⇒ same entry ⇒ same predecessors enforced by the check ⇒ same prefix all the way down.

4. Leader Completeness

Statement. If an entry is committed in term T, then that entry appears in the log of every leader of every term greater than T.

Mechanism. Two parts working in tandem.

  • Up-to-date-log vote restriction. A voter rejects a candidate whose log is "older" than its own. "Older" is defined by the last log entry: compare last log terms first; if equal, compare last log indices. A candidate is granted a vote only if its last entry is at least as recent as the voter's.
  • Current-term commit rule. A leader may directly commit only entries from its own current term. Entries from previous terms become committed implicitly, when a current-term entry at a higher index commits — at which point Log Matching forces every entry below it to be committed too.

What it delivers. A new leader's log is guaranteed to contain every entry that any previous leader committed. Committed history is durable across leader changes; the protocol has no way to "un-commit" something just because the leader who committed it died.

The geometric reason the up-to-date rule works: a committed entry sits on a majority. A new leader is elected by a majority. Any two majorities of an N-server cluster overlap in at least one server. That intersecting server has the committed entry (it is in the commit majority) and voted for the new leader (it is in the election majority). The vote restriction forces the candidate's log to be at least as up-to-date as the intersecting voter's, so the candidate's log has the entry too. The current-term rule plugs the one gap that makes this argument actually true under all interleavings — the gap that Figure-8 walks straight through.

A 60-line invariant checker

The cleanest way to internalise the five invariants is to write code that asserts them. Open your editor. Type this in. Do not copy-paste — type it. The point is to feel which fields each invariant touches.

# raft_invariants.py — a checker you run after every step of a Raft trace
from collections import defaultdict

def check_invariants(servers, history):
    """Assert all five invariants on the current state of `servers`.

    `servers` is a dict {server_id: ServerState}. ServerState has:
      - current_term: int
      - state: 'leader' | 'follower' | 'candidate'
      - log: list of Entry(term, cmd); index 0 unused, log[1] is index 1
      - commit_index: int
      - voted_for: dict mapping term -> server_id (durable history of votes)
      - applied: list of Entry; what reached the state machine, in order
    `history` is the flat sequence of (term, leader_id) facts seen so far,
    used to assert Leader Append-Only across leader transitions.
    """
    leaders_by_term = defaultdict(set)
    for sid, s in servers.items():
        if s.state == 'leader':
            leaders_by_term[s.current_term].add(sid)

    # 1. Election Safety
    for term, set_of_leaders in leaders_by_term.items():
        assert len(set_of_leaders) <= 1, \
            f"two leaders in term {term}: {set_of_leaders}"

    # 2. Leader Append-Only
    for sid, s in servers.items():
        prev = history.get((sid, s.current_term))
        if s.state == 'leader' and prev is not None:
            # leader's log must be a (non-strict) extension of its earlier log
            assert s.log[:len(prev)] == prev, \
                f"server {sid} truncated its leader log in term {s.current_term}"

    # 3. Log Matching: same (index, term) ⇒ same prefix
    for a in servers.values():
        for b in servers.values():
            for i in range(1, min(len(a.log), len(b.log))):
                if a.log[i].term == b.log[i].term:
                    assert a.log[:i+1] == b.log[:i+1], \
                        f"log-matching violated at index {i}"

    # 4. Leader Completeness: committed entries survive into later leaders
    for s in servers.values():
        if s.state != 'leader':
            continue
        for other in servers.values():
            for i in range(1, min(s.commit_index, other.commit_index) + 1):
                # any earlier-committed (index, term) must appear in this leader
                if i < len(other.log) and other.log[i].term < s.current_term:
                    assert i < len(s.log) and s.log[i] == other.log[i], \
                        f"leader {s.current_term} missing committed entry {i}"

    # 5. State Machine Safety: no two servers applied a different cmd at i
    for a in servers.values():
        for b in servers.values():
            for i in range(min(len(a.applied), len(b.applied))):
                assert a.applied[i] == b.applied[i], \
                    f"state-machine divergence at applied index {i}"

Walk it line by line.

leaders_by_term = defaultdict(set). Election Safety is a per-term property, so we group active leaders by their current term. The set matters: we are asserting that each set has size ≤ 1. With a list the same leader appearing twice in our snapshot (an ordering artefact) would falsely trip the assertion.

assert s.log[:len(prev)] == prev. Leader Append-Only is enforced not by inspecting code paths but by comparing snapshots: the leader's log must be a prefix-extension of any earlier snapshot of its log within the same term. The history dict is keyed by (sid, term), so a server returning to leadership in a later term gets a fresh slate — that is correct, because the constraint is per-tenure.

The double for a, b loop in Log Matching. We compare every pair of logs at every index where their terms agree. This is O(N² · L) and absurd at production scale, but the checker is only meant to run on a randomised trace with N=5 servers and a few hundred log entries; cost is irrelevant. The point is that the assertion expresses the invariant directly: matching (index, term) ⇒ matching prefix.

Leader Completeness. The check looks at every leader and every other server. If the other server has a committed entry from an earlier term (other.log[i].term < s.current_term), the current leader must hold an identical entry at that index. We compare s.log[i] == other.log[i] rather than just term-and-index — Log Matching guarantees that if they have the same (index, term) they have the same payload, but the assertion is more readable when stated directly.

State Machine Safety. Pure pairwise comparison of the applied lists. Nothing clever; no use of the other invariants. If a divergence ever shows up here, it is the invariant the user actually cares about, and the bug is in the upstream commit-or-replicate machinery.

Run this checker after every transition (every AppendEntries, every vote, every commit-index advance) of a randomised Raft simulator and you will get a property-based test for the protocol. With the current-term commit rule in place, ten million random executions will not trip the assertions. Remove the rule — that one if entry.term != self.current_term: continue line in try_advance_commit_index — and the State Machine Safety assertion fires within a few thousand executions, on the Figure-8 interleaving.

$ python simulate.py --servers 5 --trials 100000 --seed 42
ok: 100000/100000 traces, 0 invariant violations

$ python simulate.py --servers 5 --trials 100000 --seed 42 --buggy-commit
trace 3147: AssertionError: state-machine divergence at applied index 2
   server S5 applied (term=3, "Y")
   server S2 applied (term=2, "X")

The buggy-commit run is reproducing Figure-8.

The Figure-8 anomaly

Figure-8 is the scenario the Raft paper uses (its actual figure 8) to motivate the current-term commit rule. It shows a sequence where an entry replicated to a majority can still be lost if the leader's commit rule is naive. The scenario is adversarial — it requires specific failures, partitions, and elections to align — but every step is a legal Raft transition.

The Figure-8 unsafe-commit scenarioFive servers S1 through S5 across four terms (2, 3, 4, 5). In term 2 S1 is leader and replicates entry X to S2. In term 3 S5 wins and writes Y at the same index but only on its own log. In term 4 S1 returns and re-replicates X to S3, reaching a numerical majority. If S1 commits X here on replica count alone, in term 5 S5 wins again on the up-to-date check and overwrites X with Y across the cluster.Figure-8: why majority replication alone is not enough to committerm 2term 3term 4term 5S1 (L)S2S3S4S5X t2X t2S1 crashesdownX t2Y t3S5 wins term 3X t2X t2X t2Y t3S1 returns, term 4re-ships X to S33/5 have X — dangerdownY t3Y t3Y t3Y t3S5 wins term 5last log term: 3 > 2overwrites X with YFix: in term 4, S1 must NOT commit X by replica count alone — X is from term 2, not S1's current term.
Figure-8 across five servers and four terms. Solid-bordered boxes are X (term 2); thick-bordered boxes are Y (term 3). By term 4, X is on three of five servers — a numerical majority — yet committing it on count alone is unsafe, because S5 can still legally win term 5 (its last log term, 3, beats the followers' last log term, 2) and force its log onto everyone, overwriting X.

The Figure-8 trace, step by step

Setup. Five servers S1..S5 running a Raft-backed ledger for a small fintech in Pune. Cluster size is 5, majority is 3.

Term 2 — S1 leader. S1 wins and a client writes operation X (transfer ₹1 lakh from Asha to Dipti). S1 appends X at index 2 (term 2), ships it to S2 who acknowledges. X is now on {S1, S2}. Before S1 ships further, the rack S1 is on loses power.

Term 3 — S5 leader. S5, which never received X, wins the election. The up-to-date check passes against S3 and S4 (their logs are empty at this index, so anything beats them). A client connected to S5 sends operation Y (refund ₹1 lakh to Asha because the merchant cancelled). S5 appends Y at index 2 (term 3) but is partitioned from {S2, S3, S4} before replicating it.

Term 4 — S1 leader again. S1's rack is back. A new election happens; S1 wins term 4 (its log has X at index 2, term 2 — beats the empty logs of S3 and S4). As leader of term 4, S1 re-replicates X to S3. X is now on {S1, S2, S3} — three of five, a majority.

Now the question. Naive Raft would see the replica count and commit X. Buggy Raft would tell the application: "X is committed; the transfer happened." Asha's account would be marked debited.

Term 5 — disaster. S1 crashes again before doing anything else in term 4. S5 reconnects, runs for office in term 5. S5's log has Y at index 2 with term 3. S2 / S3 / S4 have last log term = 2 (entry X). S5's last log term is 3, strictly greater. The up-to-date check says S5 is more up-to-date, so S5 wins votes from S3 and S4 (and votes for itself). S5 becomes leader of term 5 and ships its log to everyone — Y overwrites X at index 2 on every server.

X was "committed" in term 4 and silently lost in term 5. The application was told the transfer happened. It did not. State Machine Safety violated, and the bank is now broken.

The fix. When S1 is leader in term 4, the entry X is from term 2 — a previous term. The current-term commit rule says: a leader may directly commit only entries from its own current term. So in term 4, S1 is forbidden from committing X even though three replicas hold it. X commits only if S1 first replicates a fresh term-4 entry (call it Z) to a majority and that current-term entry commits — at which point Log Matching guarantees X (sitting below Z in the log) is on the same majority and is implicitly committed.

The implementation is one conditional in try_advance_commit_index:

def try_advance_commit_index(self):
    """Recompute commit_index after a follower acknowledged AppendEntries."""
    for n in range(self.commit_index + 1, len(self.log)):
        if self.log[n].term != self.current_term:
            continue  # the Figure-8 fix: never directly commit a previous-term entry
        replicas = 1 + sum(1 for f in self.followers if self.match_index[f] >= n)
        if replicas > len(self.cluster) // 2:
            self.commit_index = n

The continue is the entire fix. Without it, S1 in term 4 finds n=2 with three replicas and sets commit_index = 2. With it, self.log[2].term == 2 != 4 == self.current_term, the loop skips, X stays uncommitted, and Asha's account is not marked debited until S1 has actually anchored a current-term entry on top.

Common confusions

  • "Election Safety alone prevents split-brain." No. It only prevents two leaders in the same term. Two leaders in different terms is normal — Figure-8 has four of them. What stops a deposed leader from committing entries is the durable votedFor and term comparison: any AppendEntries from a leader with a smaller term is rejected, so the deposed leader cannot reach majority on anything.

  • "Log Matching means logs are always identical." No. Log Matching says identical prefixes given a matching (index, term) pair somewhere. Followers regularly diverge from the leader on the suffix beyond the last matching pair; the AppendEntries repair loop is what truncates and overwrites that divergent suffix until the prefixes line up.

  • "An entry on a majority is committed." No — and this is the subtle part. An entry from the current term, on a majority, is committed. An entry from a previous term, on a majority, is not yet committed by the current leader; it becomes committed only when a current-term entry above it commits. The Figure-8 anomaly is what happens when you ignore this distinction.

  • "Leader Completeness is enforced by the up-to-date vote restriction." Half-true. The up-to-date check is necessary but not sufficient on its own; you also need the current-term commit rule. Without the latter, the up-to-date check is meaningful only against entries that genuinely got committed under the rule, and "committed" without the rule is a moving target. The two work as a pair.

  • "State Machine Safety is what Raft enforces." Technically no. Raft enforces invariants 1–4; State Machine Safety follows as a theorem from 2, 3, and 4. This matters when you write tests: assertions on invariants 1–4 are enforcement assertions (they should never fail at runtime if your code is right); the assertion on invariant 5 is a correctness assertion on the whole protocol (it should never fail if invariants 1–4 are correctly enforced).

  • "This proof depends on synchronous clocks." It does not. Nothing in the safety argument uses wall-clock time. Liveness — "eventually a leader gets elected" — uses randomised election timeouts to break symmetry, but that is a progress argument, not a safety argument. Raft's safety holds under arbitrary message delays, reorderings, and partitions, as long as messages are not corrupted (which TLS plus checksums handle at a lower layer).

Going deeper

Why the proof is by induction on the log

The Log Matching argument is presented as a static "same pair ⇒ same prefix" rule, but the proof is inductive. Base case: the empty log matches the empty log trivially. Inductive step: assume any two logs agreeing at indices 1..k-1 actually contain identical entries there. AppendEntries with prevLogIndex = k-1 only succeeds if both logs have the same (k-1, term_{k-1}); the inductive hypothesis then says the prefix 1..k-1 is identical; the new entry at k is whatever the leader sent, which is the same to all followers; so prefix 1..k is identical. The induction works because Election Safety pins down "what the leader sent" to a single thing per (index, term). Drop Election Safety and the induction breaks because two leaders in the same term could have sent different entries with the same (index, term).

Where the Raft paper's proof handwaves

§5.4 of the paper is admirably concrete but glosses one corner: the precise statement of "up-to-date." The paper says "the log with the later last term is more up-to-date; if the terms are equal, the longer log is more up-to-date." Implementations that compare lengths-as-indices versus lengths-as-counts have been observed to disagree by 1, which usually does not matter but in adversarial traces can. Heidi Howard's TLA+ specification pins this down precisely; if you are implementing Raft for production, read the spec, not just the paper.

Verdi: a Coq-verified Raft

The University of Washington's Verdi project produced a Raft implementation whose safety properties are machine-checked in Coq. The verification effort (Wilcox, Woos, et al., 2015) found subtle issues in the prose proof — most notably around how membership changes interact with the current-term commit rule. If you want the highest confidence Raft you will read this year, it is theirs. The performance overhead is real but the bugs found-and-fixed make for fascinating reading.

Multi-Paxos has the same fix, expressed differently

Multi-Paxos faces the same hazard: a freshly elected proposer cannot certify a value chosen in a previous round without first running "phase 1" to learn what previous proposers proposed. In Raft, the analogous step is implicit in becoming leader (the up-to-date check ensures you have the committed history) but the commit rule is what enforces that you cannot certify a previous-term entry as committed until you have made some progress in your own term. Howard and Mortier's Paxos vs Raft (2020) shows the two protocols are isomorphic at the safety layer; Raft just compiles the same fix into a more readable conditional.

Operational side: fencing and lease-based reads

Safety as proved here covers writes — committed log entries — but production clusters must also serve reads safely. A leader that is partitioned can still believe it is the leader and answer reads from a stale state machine. The standard fix is ReadIndex: the leader records its current commit index, sends a heartbeat round to confirm leadership, and waits until its apply_index reaches that recorded index before answering. A weaker variant uses leases, which trade a clock-skew assumption for a faster read path. Neither changes the five invariants; both add a sixth concern (linearizable reads) that is layered on top. See etcd's read-only request handling for production code paths.

Where this leads next

References

  1. Diego Ongaro and John Ousterhout, In Search of an Understandable Consensus Algorithm (Extended), §5.4 (Safety) and figure 8. The original paper; the safety proof is in §5.4 and Figure 8 is exactly what this chapter walks through.
  2. Diego Ongaro, Consensus: Bridging Theory and Practice (PhD thesis, Stanford 2014). Chapter 3 has the most complete proof of the five invariants; chapter 4 extends them across membership changes.
  3. Heidi Howard, TLA+ specifications and proofs of Raft. Machine-checked safety with a careful treatment of where the prose proof handwaves.
  4. Doug Woos et al., Planning for Change in a Formal Verification of the Raft Consensus Protocol (CPP 2016). The Verdi project's Coq-verified Raft, including issues uncovered in the original prose proof.
  5. Heidi Howard and Richard Mortier, Paxos vs Raft: Have we reached consensus on distributed consensus? (PaPoC 2020). Compares the safety mechanisms of the two algorithms and argues they are isomorphic.
  6. etcd authors, etcd-io/raft README and design notes. Production-grade implementation; see node.go and raft.go for the commit-index advancement code that ships the current-term rule.
  7. Internal: Log replication and the commit index — the chapter immediately preceding this one; required reading for the AppendEntries machinery the safety argument rides on top of.