In short
Raft's correctness rests on five invariants. They are not independent; they form a stack where each one builds on the layer below.
- Election Safety — at most one leader per term. Enforced by majority votes plus the at-most-one-vote-per-term rule.
- Leader Append-Only — a leader never overwrites or deletes entries in its own log. Enforced trivially in the leader's code.
- Log Matching — if two logs share an entry at the same (index, term), then they are identical for all entries up to that index. Enforced by the AppendEntries prevLogIndex/prevLogTerm consistency check.
- Leader Completeness — any entry committed in term T appears in the log of every leader for every term greater than T. Enforced by the up-to-date-log restriction on votes (last log term, then last log index).
- State Machine Safety — no two servers ever apply a different command at the same log index. A consequence of 2, 3, and 4.
The fourth one is the subtle one. A naive "majority replication = commit" rule violates it: an entry from a previous term can be replicated to a majority and still get overwritten by a future leader. The Figure-8 scenario shows this in five servers and four terms, and the fix is the current-term commit rule — a leader may commit only entries from its own current term; previous-term entries get committed implicitly when a current-term entry above them commits. One conditional in try_advance_commit_index is the difference between a correct Raft and a paper that gets cited as a counterexample.
State Machine Safety is what your application actually wants. Everything else exists to deliver it.
The previous two chapters built Raft as an algorithm: chapter 100 elected a single leader per term, chapter 101 replicated the log and advanced a commit index. They worked under the gentleman's assumption that the protocol is correct. This chapter is where you check.
Correctness in distributed systems is rarely "I traced through one happy path and it worked." It is "for every interleaving of message delays, crashes, restarts, partitions, and recoveries the adversary can throw at me, the system never violates property P." For Raft, P is State Machine Safety: no two replicas of the same state machine ever disagree about the command at any log index. If the cluster commits "transfer Rs. 5000 from A to B" at index 47 on three nodes and "transfer Rs. 5000 from B to A" at index 47 on the other two, the bank is broken — and no amount of subsequent agreement can repair it.
The Raft paper's §5.4 proves State Machine Safety holds. It does so by establishing four prior invariants, each enforced by a small, local mechanism, and then showing the fifth falls out as a consequence. This chapter walks the stack from the bottom — the easy ones — up to the hard one, then shows the Figure-8 anomaly that forces the trickiest part of the rule, and finishes with the Python that gets it right.
The invariant stack
We will walk the stack from the bottom up.
1. Election Safety
Statement. For any term T, at most one server is leader during term T.
Mechanism. A candidate must collect votes from a strict majority of the cluster (floor(N/2) + 1 of the N configured servers) to become leader. Each server casts at most one vote per term, persisted to disk in votedFor before the response is sent. Any two majorities of an N-server cluster intersect in at least one server: two sets each of size greater than N/2 have combined size greater than N, so by pigeonhole they must share a member. That common server cannot have voted for both candidates in the same term, so two candidates cannot both reach majority.
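The intersection argument is small enough to check exhaustively. A throwaway sketch (not part of any Raft implementation) that brute-forces it for small clusters:

```python
from itertools import combinations

def majority(n: int) -> int:
    """Smallest strict majority of an n-server cluster: floor(n/2) + 1."""
    return n // 2 + 1

def all_majorities_intersect(n: int) -> bool:
    # Every pair of majority-sized subsets of n servers shares a member.
    servers = range(n)
    return all(
        set(a) & set(b)
        for a in combinations(servers, majority(n))
        for b in combinations(servers, majority(n))
    )

assert all(all_majorities_intersect(n) for n in (1, 2, 3, 4, 5, 6, 7))
```

Note that majority(4) == 3, not 2: an even-sized cluster pays for its extra server with a larger quorum, which is why clusters are usually deployed with odd sizes.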
Safety property delivered. Within any single term there is a single source of truth for new log entries. The leader's local log decides the cluster's history for that term; followers do not append on their own.
Why durable votedFor is non-negotiable: imagine the alternative. A server votes for candidate A in term 7, crashes, restarts thinking it has not voted in term 7, and votes for B. Both A and B reach majority. Term 7 now has two leaders; both append entries at index k; the followers diverge; State Machine Safety dies. The fsync of votedFor before sending the vote response is the wall preventing this. Skipping it for "performance" — a not-uncommon mistake in homegrown Raft implementations — produces a cluster that is correct under no failures and silently corrupt under crash-restart.
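The persistence step can be sketched as a write-fsync-rename sequence. This is a minimal illustration, assuming a JSON state file; the function names and file layout are illustrative, not from any particular implementation:

```python
import json
import os

def record_vote(state_path: str, term: int, candidate_id: str) -> None:
    """Persist (term, voted_for) durably BEFORE the vote response is sent.

    Write a temp file, fsync it, then atomically rename over the old state.
    A crash at any point leaves either the old vote or the new one on disk —
    never a blank votedFor for a term the server already voted in.
    """
    tmp = state_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"term": term, "voted_for": candidate_id}, f)
        f.flush()
        os.fsync(f.fileno())       # the wall against crash-restart double votes
    os.replace(tmp, state_path)    # atomic rename on POSIX

def load_vote(state_path: str) -> dict:
    with open(state_path) as f:
        return json.load(f)
```

In the RequestVote handler, record_vote must complete before the RPC reply goes on the wire; replying first and persisting later reintroduces exactly the double-vote scenario above.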
This is a per-term invariant. Across terms multiple leaders can and will exist (in fact the Figure-8 scenario relies on it). Election Safety only says: within one term, one leader. It is not enough on its own to guarantee anything about the log; it just gives the next invariants a clean foundation to stand on.
2. Leader Append-Only
Statement. A leader never overwrites or deletes entries in its own log. It only appends.
Mechanism. The leader's code path for log mutation has exactly one operation: append. There is no truncation method, no overwrite path, no "fix up" routine. When a server transitions to leader, its log is whatever it was as a candidate; from that moment until it steps down, the only way the log changes is by log.append(entry) from a fresh client request.
Safety property delivered. During a leader's tenure, indices are stable and meanings are stable. Once the leader has decided that index k holds entry (t=4, "PUT x=7"), that decision will not be revoked while the leader still leads. Followers that have learned (k, 4, "PUT x=7") from this leader will not later be told a different story by the same leader.
Why this is trivial but important: the corresponding statement is not true for followers. A follower's log can be truncated by AppendEntries — that is precisely how the repair loop in chapter 101 fixes divergence. So the asymmetry matters: leaders are append-only, followers are mutable. The protocol exploits this asymmetry by treating the leader's log as the cluster's authority and forcing every follower to converge to a prefix of it. If leaders were also mutable, the cluster would have no anchor.
Leader Append-Only is the easiest invariant to enforce — just don't write code that does the wrong thing — and it is a necessary input to the Log Matching argument. If leaders could rewrite history, then "shares an (index, term)" would not pin down a unique entry, because the same (index, term) pair could mean different things at different points in the same term.
3. Log Matching
Statement. If two logs contain an entry with the same index and term, then the two logs are identical in all entries up through that index.
Mechanism. Two facts together. (a) Within a single term there is at most one leader (Election Safety) and that leader is append-only (Leader Append-Only), so a given (index, term) pair globally identifies a unique log entry — there is exactly one leader who could have written at that index in that term, and that leader wrote at most one thing there. (b) The AppendEntries RPC's prevLogIndex/prevLogTerm check forces a follower to verify, before accepting a batch, that the entry immediately preceding it matches the leader's. By induction on the index: an entry only ever enters a log via an accept whose predecessor matched, and by (a) a matching (index, term) pair denotes a unique entry, so if two logs match at (index_k, term_k) they match at (index_{k-1}, term_{k-1}), and so on down to index 1.
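A follower-side sketch of the check, under the common teaching simplification that the entire suffix after prevLogIndex is replaced by the leader's batch (log[0] is a term-0 sentinel so prevLogIndex == 0 always matches; names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Entry:
    term: int
    cmd: str = ""

def handle_append_entries(log, prev_log_index, prev_log_term, entries):
    """Accept a batch only if the entry just before it matches the leader's.

    This single-pair check is the Log Matching mechanism: passing it
    certifies (by induction) the whole prefix up to prev_log_index.
    """
    if prev_log_index >= len(log) or log[prev_log_index].term != prev_log_term:
        return False  # reject; leader decrements next_index and retries
    # Followers are mutable: drop any conflicting suffix, take the leader's.
    del log[prev_log_index + 1:]
    log.extend(entries)
    return True
```

This is the repair loop's engine: each rejection walks prev_log_index back one step; the first acceptance truncates the diverged suffix and converges the follower to a prefix of the leader's log.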
Safety property delivered. Followers can be brought into agreement with the leader by checking a single pair, not by replaying the whole log. This is what makes the AppendEntries repair loop O(divergence) instead of O(history).
Why the proof needs both ingredients: just having the prevLog check would not be enough if leaders could overwrite — the check would pass against a "current" version of an entry that the leader had since changed. Just having Leader Append-Only would not be enough without the check — followers could append entries from old leaders and disagree on the prefix even at matching (index, term) pairs because nothing forced the prefix to agree. Together they give the induction: same (index, term) ⇒ same entry ⇒ same predecessors enforced by the check ⇒ same prefix all the way down.
Log Matching is what makes Raft implementable. Without it, the leader would have to send the entire log on every RPC or maintain a Merkle tree to do consistency checks. With it, one index and one term — a few bytes of metadata — certify everything before them.
4. Leader Completeness
Statement. If an entry is committed in term T, then that entry appears in the log of every leader of every term greater than T.
Mechanism. Two parts working together:
- Up-to-date-log vote restriction. A voter rejects a candidate whose log is "older" than its own. "Older" is defined by the last log entry: compare last log terms first; if equal, compare last log indices. Only a candidate with a last entry at least as recent as the voter's gets a vote.
- Current-term commit rule. A leader may directly commit only entries from its own current term. Entries from previous terms become committed implicitly, when a current-term entry at a higher index commits (which by Log Matching forces every entry below it to be committed too).
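The vote restriction is a two-key comparison, which Python's tuple ordering expresses directly (a minimal sketch; the function name is illustrative):

```python
def candidate_is_up_to_date(cand_last_term: int, cand_last_index: int,
                            my_last_term: int, my_last_index: int) -> bool:
    """The voter's side of the up-to-date check.

    Grant-eligible iff the candidate's last entry is at least as recent:
    compare last log terms first, break ties on last log index. Python's
    lexicographic tuple comparison implements exactly this order.
    """
    return (cand_last_term, cand_last_index) >= (my_last_term, my_last_index)
```

Note the asymmetry this encodes: a later last log term beats any amount of extra length, which is exactly the comparison S5 wins in term 5 of the Figure-8 scenario below.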
Safety property delivered. A new leader's log is guaranteed to contain every entry that any previous leader committed. This means committed history is durable across leader changes — the protocol has no way to "un-commit" something just because the leader who committed it died.
Why the up-to-date restriction works: a committed entry is, by definition, on a majority of servers. A new leader needs votes from a majority. Any two majorities intersect in at least one server. That intersecting server has the committed entry. The vote restriction forces the candidate's log to be at least as up-to-date as the intersecting voter's, so the candidate's log also has the entry (or a strictly newer one above it that, by Log Matching, implies the entry). Therefore every winning candidate has every committed entry. The current-term rule plugs the gap that the vote restriction alone leaves — the gap that Figure-8 walks straight through.
This is the invariant that takes work. The other three are local enforcement; this one needs the global argument about majority intersection plus the subtle current-term rule. The Figure-8 scenario below is exactly what happens when you have the up-to-date restriction but forget the current-term rule.
The Figure-8 anomaly
The Raft paper's Figure 8 shows a sequence where an entry replicated to a majority can still be lost if the commit rule is naive. The scenario is deliberately adversarial — it requires specific failures and elections to align — but it is a real interleaving the algorithm must handle.
The Figure-8 scenario, step by step
Setup. Five servers S1..S5. Cluster size 5, majority is 3. As in the paper's figure, assume every log begins with a common entry at index 1 from term 1.
Term 2. S1 wins the election and becomes leader. A client sends operation X. S1 appends X at index 2 (with term 2), ships it to S2 who acknowledges. S1 now has X on two servers — itself and S2. Before it can ship to anyone else, S1 crashes.
Term 3. S5, which never received X, wins the election. How? S5 requests votes from S3 and S4, neither of which has X either. The up-to-date check passes: all three logs end at the shared term-1 entry, so last log terms and last log indices are equal. S5 becomes leader of term 3. A client sends operation Y. S5 appends Y at index 2 (with term 3), but before replicating it to anyone, S5 is partitioned away.
Term 4. S1 recovers and rejoins. A new election happens; S1 wins term 4 (its log has X at index 2 with term 2, which beats S3/S4's logs ending at the term-1 entry). As leader of term 4, S1 re-replicates X to S3. Now X is on three servers: S1, S2, S3. A naive Raft would commit X here — three of five replicas have it, that is a majority.
Term 5 (the disaster). S1 crashes again before doing anything else in term 4. S5 reconnects and runs for office in term 5. S5's log has Y at index 2 with term 3 — a last log term of 3, strictly greater than S2/S3's (which end with X, term 2) and S4's (which ends at the term-1 entry). The up-to-date check says S5 is more up-to-date, so S5 wins votes from S3 and S4 (and votes for itself), reaches majority, becomes leader. As leader of term 5, S5 ships its log to everyone; Y overwrites X at index 2 on S1, S2, S3, S4.
X was "committed" in term 4 and lost in term 5. State Machine Safety violated.
The fix. When S1 is leader in term 4, the entry X is from term 2 — a previous term. The current-term commit rule says: a leader may directly commit only entries from its own current term. So in term 4, S1 is forbidden from committing X even though three replicas have it. X only commits if S1 first replicates a fresh entry from term 4 (call it Z) to a majority and that current-term entry commits — at which point Log Matching guarantees X (which sits below Z in the log) is on the same majority and is therefore implicitly committed too.
The implementation is one if statement in the function that recomputes the commit index after each AppendEntries response:
def try_advance_commit_index(self):
    """Recompute commit_index after a follower acknowledged an AppendEntries.

    Walks indices above the current commit_index. For each, counts how many
    replicas have it (1 for self plus followers whose match_index >= n).
    Advances commit_index only if that count is a strict majority AND the
    entry is from the leader's current term. The second clause is the
    Figure-8 fix.
    """
    for n in range(self.commit_index + 1, len(self.log)):
        if self.log[n].term != self.current_term:
            continue  # safety rule: never commit a previous-term entry directly
        replicas = 1 + sum(
            1 for f in self.followers
            if self.match_index[f] >= n
        )
        if replicas > len(self.cluster) // 2:
            self.commit_index = n
The continue is the entire fix. Without it, in our scenario S1 in term 4 reaches the loop, finds n=2 with three replicas, and sets commit_index = 2 — committing X. With it, S1 sees self.log[2].term == 2 != 4 == self.current_term, skips, and X stays uncommitted. Only when S1 replicates a term-4 entry to a majority does the loop advance commit_index to that higher index — and Log Matching means everyone with the term-4 entry also has X, so X is safe by then.
Note the continue rather than break: there could be a higher-index entry from the current term that is committable, even when a lower one is not. But because of Log Matching the implementation can equivalently walk top-down or skip — etcd does it slightly differently for performance, but the safety property is the same.
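To watch the rule act, here is a self-contained replay of the term-4 state. Entry, Leader, and the index layout are illustrative stand-ins for the chapter's node class, not its actual definitions:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    term: int

class Leader:
    """Minimal stand-in holding just the state the commit loop reads."""
    def __init__(self, log, current_term, match_index, cluster_size):
        self.log = log                      # log[0] is a term-0 sentinel
        self.current_term = current_term
        self.commit_index = 0
        self.cluster = range(cluster_size)
        self.followers = list(match_index)  # follower ids
        self.match_index = match_index

    def try_advance_commit_index(self):
        for n in range(self.commit_index + 1, len(self.log)):
            if self.log[n].term != self.current_term:
                continue  # the Figure-8 fix
            replicas = 1 + sum(1 for f in self.followers
                               if self.match_index[f] >= n)
            if replicas > len(self.cluster) // 2:
                self.commit_index = n

# S1 leading term 4: X (term 2) at index 2 sits on S1, S2, S3 — a majority —
# yet must not commit. Index 1 holds the shared term-1 entry.
s1 = Leader(log=[Entry(0), Entry(1), Entry(2)],
            current_term=4,
            match_index={"S2": 2, "S3": 2, "S4": 1, "S5": 0},
            cluster_size=5)
s1.try_advance_commit_index()
assert s1.commit_index == 0   # X stays uncommitted; the naive rule would say 2

# Replicating a fresh term-4 entry Z to a majority commits Z — and with it,
# implicitly, everything below it, X included.
s1.log.append(Entry(4))
s1.match_index.update({"S2": 3, "S3": 3})
s1.try_advance_commit_index()
assert s1.commit_index == 3
```

Deleting the `continue` and rerunning the first call sets commit_index to 2, which is precisely the unsafe commit of X the scenario describes.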
Why the up-to-date rule alone is not enough — the geometric picture
The current-term rule is precisely the rule that makes the majority-intersection argument airtight. Without it, the "committed" entry could be from a previous term, and the up-to-date check could fail to protect it: the check compares only last log entries, so a candidate whose log ends in a later term wins the comparison against every holder of the previous-term entry whether or not it carries that entry — exactly how S5 beats S2 and S3 in term 5. The current-term rule guarantees that anything ever marked committed has a current-term entry at or above it on the committing majority; the intersecting voter's last log term is then at least that term, so any candidate that out-dates the voter must itself carry the committed entry (and, by Log Matching, everything beneath it). That closes the loop.
5. State Machine Safety
Statement. If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.
Mechanism. None directly — this is a theorem, not an enforced rule. The proof composes the previous invariants:
- A server applies the entry at index i only when its apply_index reaches i, which only happens when its commit_index >= i.
- A server's commit_index advances to i either because it is the leader and saw majority replication of a current-term entry at i (or above, with Log Matching propagating downward), or because it received leaderCommit >= i from a leader.
- By Leader Completeness, every leader after the term that committed i has the same entry at i.
- By Log Matching, every server's log at index i, once it agrees with any leader's log at index i, agrees with every leader's log at index i from then on.
- By Leader Append-Only, no leader rewrites its own log at i while leading.
- Therefore every server that ever applies index i applies the same entry there.
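The gate that turns this theorem into running code is a one-way pump from commit_index to the state machine. A minimal sketch, with the node and state machine stubbed out (names illustrative):

```python
from types import SimpleNamespace

def apply_committed(node, state_machine):
    """Apply committed entries to the state machine, in index order.

    apply_index only moves forward and never past commit_index, so a server
    cannot apply an entry the cluster could still rewrite. Here the state
    machine is just a list with .append; a real one would execute commands.
    """
    while node.apply_index < node.commit_index:
        node.apply_index += 1
        state_machine.append(node.log[node.apply_index])

# Demo: only index 1 is committed; index 2 stays out of the state machine.
node = SimpleNamespace(log=["<sentinel>", "PUT x=7", "PUT y=1"],
                       commit_index=1, apply_index=0)
sm = []
apply_committed(node, sm)
assert sm == ["PUT x=7"]
node.commit_index = 2           # a later commit advance...
apply_committed(node, sm)
assert sm == ["PUT x=7", "PUT y=1"]  # ...releases the next entry, in order
```
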
Safety property delivered. The whole point. The state machine layer — your key-value store, your coordination service, your account ledger — sees the same sequence of operations on every replica, in the same order, ever. Reads from any replica are linearizable with respect to the same total order (modulo lease and read-index machinery, which is a separate concern at the read layer).
Why this is the only invariant the application layer cares about: the previous four are means; this is the end. A correctness audit of a Raft implementation reduces to "do these five hold?" And in practice it reduces further to "do invariants 1, 2, 3, and 4 hold under all the failure modes I can model?", because 5 is mechanical from them. This is also why TLA+ specifications of Raft (e.g., Heidi Howard's) prove invariant 4 as the hardest theorem and treat 5 as a corollary.
What this leaves out (the going-deeper)
The five invariants are the safety story for the core Raft algorithm — leader election plus log replication. Production Raft adds two more concerns the original paper handles in separate sections.
Membership changes
When you add or remove servers from the cluster, the very definition of "majority" changes mid-flight. If you go from 3 servers to 5 in one atomic step, there is a window where two disjoint majorities can exist — {S1, S2} (majority of the old 3) and {S3, S4, S5} (majority of the new 5) — and both can elect leaders for the same term. Raft solves this with joint consensus (a configuration that requires majorities of both old and new groups) or with single-server membership changes (add or remove one node at a time so the majority sets always overlap). Either way, the same five invariants must continue to hold across the transition; the proof just gets more careful about which set "majority" refers to. See Ongaro's PhD thesis chapter 4 for the full treatment.
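The joint-consensus quorum test itself is easy to state in code; the real complexity lives in the configuration handling around it. A sketch (function names illustrative):

```python
def joint_quorum(acks: set, old_conf: set, new_conf: set) -> bool:
    """Quorum test during joint consensus.

    A decision needs a majority of BOTH the old and the new configuration,
    so no two decisions in the transition window can come from disjoint
    server sets.
    """
    def has_majority(conf: set) -> bool:
        return len(acks & conf) > len(conf) // 2
    return has_majority(old_conf) and has_majority(new_conf)

old = {"S1", "S2", "S3"}
new = {"S1", "S2", "S3", "S4", "S5"}
assert not joint_quorum({"S1", "S2"}, old, new)        # old-majority only
assert not joint_quorum({"S3", "S4", "S5"}, old, new)  # new-majority only
assert joint_quorum({"S2", "S3", "S4"}, old, new)      # majority of both
```

The first two assertions are exactly the split the atomic-switch approach fails to prevent: each set is a majority of one configuration but not the other, so under joint consensus neither can elect or commit alone.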
Log compaction (snapshots)
Logs cannot grow forever. Production Raft snapshots the state machine periodically and discards log entries below the snapshot index, then ships snapshots (via InstallSnapshot RPC) to followers that are too far behind to catch up by AppendEntries. The Log Matching argument has to be extended: a follower whose log starts at index s (because everything below is in a snapshot) cannot run the consistency check at indices below s. The fix is that the snapshot itself includes the last included index and term, and the follower treats that as a "synthetic entry" for consistency-check purposes. The five invariants still hold; "log" is read as "log plus snapshot prefix."
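The extended consistency check can be sketched as follows, with the snapshot's (lastIncludedIndex, lastIncludedTerm) standing in for the truncated prefix. This is an illustrative shape, not any particular implementation's API:

```python
def prev_entry_matches(snap_index: int, snap_term: int, suffix_terms: list,
                       prev_log_index: int, prev_log_term: int) -> bool:
    """AppendEntries consistency check for a log that starts after a snapshot.

    suffix_terms[k] is the term of the entry at index snap_index + 1 + k.
    """
    if prev_log_index < snap_index:
        return True   # below the snapshot: already committed and applied
    if prev_log_index == snap_index:
        # The snapshot acts as a synthetic entry for the check.
        return prev_log_term == snap_term
    k = prev_log_index - snap_index - 1
    return k < len(suffix_terms) and suffix_terms[k] == prev_log_term
```

The prev_log_index < snap_index branch is safe because everything at or below the snapshot index is committed; in practice the leader would notice the follower is that far behind and ship an InstallSnapshot instead.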
Read-only requests
A leader serving a read does not need to write to the log, but it does need to be sure it is still the leader (otherwise an old leader might serve stale reads from its outdated state machine). The standard solution is ReadIndex: on receiving a read, the leader records its current commit index, sends a heartbeat round to confirm leadership, and waits until its apply_index reaches that recorded index before answering. A weaker variant uses leases (the leader assumes it is still leader for the duration of a lease that expires before any other server can become candidate). Neither changes the five invariants; both add a sixth concern (linearizable reads) that is layered on top.
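The ReadIndex sequence can be sketched synchronously, with the heartbeat round and apply loop stubbed out (class and method names are illustrative, not from any real implementation):

```python
class ReadIndexLeader:
    """Stub leader: real heartbeat/apply plumbing replaced with no-ops."""
    def __init__(self):
        self.commit_index = 7
        self.apply_index = 5
        self.kv = {"x": 42}

    def confirm_leadership(self) -> bool:
        return True  # real code: a heartbeat round acked by a majority

    def wait_until_applied(self, n: int) -> None:
        # real code: block until the apply loop reaches index n
        self.apply_index = max(self.apply_index, n)

    def serve_read(self, key):
        read_index = self.commit_index       # 1. record commit index on arrival
        if not self.confirm_leadership():    # 2. am I still the leader?
            raise RuntimeError("redirect to current leader")
        self.wait_until_applied(read_index)  # 3. let the state machine catch up
        return self.kv[key]                  # 4. linearizable answer
```

Step 2 is what a lease-based variant replaces: instead of a heartbeat round per read, the leader trusts a clock-bounded lease, trading a timing assumption for latency.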
Verification
The Raft paper's prose proof has been mechanised. Heidi Howard and others have produced TLA+ specifications of the core protocol that machine-check the five invariants under all interleavings up to a small bound. Project Verdi at UW used Coq to produce a formally verified Raft implementation. These efforts have found subtle bugs in earlier informal descriptions — most famously around membership change. If you are implementing Raft for production, reading the verification literature is more useful than reading the paper alone; the verification artifacts pin down exactly which corners the prose handwaves.
Why the current-term rule is sometimes called "Phase 2 commit" in the literature
Multi-Paxos has the same problem and the same fix, expressed differently. In Paxos, a freshly elected proposer cannot commit a value chosen in a previous round without first running a "phase 1" that learns what previous proposers proposed. In Raft, the analogous step is implicit in becoming leader (the up-to-date check ensures you have the committed history) but the commit rule is what enforces that you cannot certify a previous-term entry as committed until you have made some progress in your own term. The two protocols are isomorphic in this respect; Raft just makes the rule operational in a way you can read off the code.
What to take away
If you remember nothing else: State Machine Safety is the goal; the other four invariants are the scaffolding. Election Safety stops two leaders coexisting in one term. Leader Append-Only stops a leader from rewriting its own history. Log Matching makes single-pair consistency checks suffice for whole-prefix agreement. Leader Completeness ensures committed history survives leader changes — and the current-term commit rule is what makes Leader Completeness actually true in the face of Figure-8.
The next chapter, Failover, Fencing, and Split-Brain, looks at the operational side: how a real Raft cluster handles partition healing, how clients are kept from talking to deposed leaders, and what fencing tokens add when consensus interacts with external resources that are not themselves under Raft's control.
References
- Diego Ongaro and John Ousterhout, In Search of an Understandable Consensus Algorithm (Extended), §5.4 (Safety) and §8 (Correctness). The original Raft paper; the safety proof is in §5.4 and the Figure-8 scenario is figure 8.
- Diego Ongaro, Consensus: Bridging Theory and Practice (PhD thesis, Stanford 2014). Chapter 3 has the most complete proof of the five invariants; chapter 4 extends them to cluster membership changes.
- Heidi Howard, TLA+ specifications and proofs of Raft and her thesis Distributed Consensus Revised. Machine-checked safety with a careful treatment of where the prose proof handwaves.
- etcd authors, Raft safety documentation and the etcd-io/raft README. Production-grade implementation; see node.go and the commit-index advancement logic.
- Doug Woos et al., Planning for Change in a Formal Verification of the Raft Consensus Protocol. The Verdi project's Coq-verified Raft, which discovered subtle issues in earlier informal proofs.
- Heidi Howard and Richard Mortier, Paxos vs Raft: Have we reached consensus on distributed consensus? (arXiv 2020). Compares the safety mechanisms of the two algorithms and argues they are isomorphic at the safety layer.