Raft in detail

It is 02:14 IST and Karan, the on-call platform engineer at PaySetu, is staring at three lines of etcd log: raft: 9c2 became candidate at term 184, then raft: 9c2 received MsgVoteResp from 7af at term 184, then raft: 9c2 became leader at term 184. The cluster recovered in 380 ms after a network blip — but Karan has to write the postmortem and explain to the head of platform why a "five-node strongly-consistent cluster" briefly served stale reads on one follower for 240 ms after the failover. Why this exact scenario is the entry point to Raft: the Raft paper sells itself on understandability, but production Raft has dozens of subtleties — PreVote, CheckQuorum, leader-lease reads, log catch-up under network jitter, the "leader stickiness" problem during partial partitions — that are not in the original paper but are in every production implementation. The path to fluency is to walk the protocol's three role transitions slowly, then layer the practical extensions one at a time. The article is structured exactly that way.

Raft is a leader-based replicated-log consensus protocol: one elected leader at a time accepts client commands, replicates them to followers via AppendEntries, and commits an entry once a majority has stored it. Safety is enforced by terms (monotonic election epochs), the "leader completeness" property (a new leader's log already contains every committed entry), and the rule that an entry from a prior term is committed only via an entry from the current term. Production Raft adds PreVote, CheckQuorum, leader leases, and safe membership changes to handle the failure modes the original paper glosses over.

The three roles and the two RPCs

Raft has three roles a node can be in at any moment: Follower (passive, accepts AppendEntries from a leader), Candidate (campaigning for leadership), and Leader (the only role that accepts client writes). Time is sliced into terms — monotonically increasing integers; each term has at most one leader, and a term begins with an election. A node that sees a higher term in any message immediately steps down to follower; together with the one-vote-per-term rule, this is what guarantees at most one leader per term and forces a deposed leader to stop acting on a stale term.

Two RPCs do all the work. RequestVote is sent by a candidate to all peers; a peer grants its vote only if (a) it has not already voted for a different candidate in this term, and (b) the candidate's log is at least as up-to-date as the peer's own (compared first by last-log-term, then by last-log-index). AppendEntries is sent by the leader to every follower; it carries new log entries plus the leader's commit index, and doubles as a heartbeat when there are no new entries. A follower rejects an AppendEntries whose prevLogIndex/prevLogTerm does not match its own log — the leader, on rejection, decrements its nextIndex for that follower and retries, walking back until a match is found, then overwrites everything after that match. Why the AppendEntries-rejection retry walk-back is safe: Raft's leader-completeness property says any leader's log contains all committed entries, so the leader's log is the authoritative version. A follower's divergent suffix can only contain uncommitted entries from a prior leader's failed replication, so overwriting them is fine — by definition no client has been told those entries committed.
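
A minimal sketch of that consistency check and the walk-back, separate from the simulator later in this section. follower_append and leader_replicate are illustrative names, and the real RPC carries more fields (leader id, commit index), but the matching rule is the part that matters:

# Sketch (not the simulator below): the AppendEntries consistency check and
# the leader's nextIndex walk-back. Names are illustrative.

def follower_append(log, prev_index, prev_term, entries):
    """Follower side: accept entries only if the entry at prev_index has prev_term."""
    if prev_index >= 0 and (prev_index >= len(log) or log[prev_index][0] != prev_term):
        return False                      # mismatch: the leader will retry further back
    del log[prev_index + 1:]              # discard the divergent (uncommitted) suffix
    log.extend(entries)
    return True

def leader_replicate(leader_log, follower_log, next_index):
    """Leader side: walk next_index back until the follower accepts, then ship the rest."""
    while True:
        prev = next_index - 1
        prev_term = leader_log[prev][0] if prev >= 0 else 0
        if follower_append(follower_log, prev, prev_term, leader_log[next_index:]):
            return len(leader_log)        # follower now matches the leader exactly
        next_index -= 1                   # rejection: back up one entry and retry

# A follower whose tail diverged (uncommitted term-2 entries) converges to the leader's log.
leader   = [(1, "a"), (1, "b"), (3, "c"), (3, "d")]
follower = [(1, "a"), (2, "x"), (2, "y")]
leader_replicate(leader, follower, next_index=len(leader))
print(follower)   # [(1, 'a'), (1, 'b'), (3, 'c'), (3, 'd')]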

A node's state consists of: the current term, the node it voted for in this term (or null), the log (an append-only sequence of (term, command) entries), and the commit index (the highest log index known to be replicated to a majority). The first three are persistent — they survive crashes via fsync — and the commit index is volatile and re-derived on restart from leader heartbeats.

Raft state machine — Follower, Candidate, Leader transitions. A state diagram with the three roles arranged in a triangle: Follower at the left, Candidate at the top, Leader at the right. Edges: Follower to Candidate on election timeout (no heartbeat received); Candidate to Leader on receiving majority votes; Candidate back to Follower on discovering a higher term or losing the election; Leader back to Follower on discovering a higher term; and a self-loop on Candidate for a split vote (restart the election with a new term).
Illustrative — the Raft state machine. The election timeout is randomised per node (typically 150–300 ms) so split votes are rare; a split vote is recoverable by retrying with a fresh term.

Walking through an election and a write — runnable simulator

The script below is a deterministic simulator of a 5-node Raft cluster. It uses logical time (no real network), models per-message delivery delay, supports follower / candidate / leader transitions, and demonstrates a complete election followed by a client write; entry replication to followers is stubbed out — the fresh leader records the write and marks it committed directly — so the trace stays focused on the election and heartbeat mechanics. The point is to make the protocol's moving parts visible — term changes, vote tallies, log indices, commit-index advancement — in one trace you can run on your laptop in under a second.

# raft_walk.py — deterministic 5-node Raft simulator: election + one committed write.
import heapq, random

class Node:
    def __init__(self, nid, peers):
        self.nid, self.peers = nid, peers
        self.term, self.voted_for = 0, None
        self.role = "follower"
        self.log = []                       # entries: (term, cmd)
        self.commit_index = -1
        self.next_index = {p: 0 for p in peers}
        self.match_index = {p: -1 for p in peers}
        self.election_due = 150 + nid * 30  # staggered timeouts so we converge

    def last_log_term(self): return self.log[-1][0] if self.log else 0
    def last_log_index(self): return len(self.log) - 1

    def start_election(self, now, send):
        self.term += 1
        self.role, self.voted_for, self.votes = "candidate", self.nid, {self.nid}
        self.election_due = now + 200 + random.randint(0, 100)
        for p in self.peers:
            send(p, ("RequestVote", self.nid, self.term, self.last_log_index(), self.last_log_term()))

    def on_vote_req(self, src, term, last_idx, last_term, now, send):
        if term > self.term: self.term, self.voted_for, self.role = term, None, "follower"
        ok = (term == self.term and self.voted_for in (None, src)
              and (last_term, last_idx) >= (self.last_log_term(), self.last_log_index()))
        if ok:
            self.voted_for = src
            self.election_due = now + 300
        send(src, ("RequestVoteResp", self.nid, self.term, ok))

events, log = [(50, "tick", n) for n in range(5)], []
def schedule(t, kind, *args): heapq.heappush(events, (t, kind, *args))

nodes = {nid: Node(nid, [p for p in range(5) if p != nid]) for nid in range(5)}
def send(dst, msg, t_now): schedule(t_now + 5, "deliver", dst, msg)

def step(now, kind, *args):
    if kind == "tick":
        nid = args[0]; n = nodes[nid]
        if n.role == "leader":
            for p in n.peers:               # empty AppendEntries doubles as the heartbeat
                send(p, ("AppendEntries", n.nid, n.term), now)
        elif now >= n.election_due:
            log.append((now, f"N{nid} starts election term={n.term+1}"))
            n.start_election(now, lambda d, m: send(d, m, now))
        schedule(now + 50, "tick", nid)
    elif kind == "deliver":
        dst, msg = args; n = nodes[dst]
        if msg[0] == "AppendEntries":       # heartbeat: adopt the term, reset the election timer
            _, src, term = msg
            if term >= n.term:
                if term > n.term: n.voted_for = None
                n.term, n.role = term, "follower"
                n.election_due = now + 300
        elif msg[0] == "RequestVote":
            _, src, term, lidx, lterm = msg
            n.on_vote_req(src, term, lidx, lterm, now, lambda d, m: send(d, m, now))
        elif msg[0] == "RequestVoteResp":
            _, src, term, ok = msg
            if n.role == "candidate" and term == n.term and ok:
                n.votes.add(src)
                if len(n.votes) > 2:        # 3 of 5 = majority
                    n.role = "leader"
                    log.append((now, f"N{n.nid} becomes leader term={n.term} votes={sorted(n.votes)}"))
                    n.log.append((n.term, "no-op"))
                    n.log.append((n.term, "PUT settlement_id=PSU-93821"))
                    n.commit_index = 1   # committed by fiat: follower replication is not modelled here
                    log.append((now, f"N{n.nid} committed idx=1 cmd={n.log[1][1]}"))

random.seed(42)
t = 0
while events and t < 600:
    t, kind, *args = heapq.heappop(events); step(t, kind, *args)

for ts, msg in log: print(f"t={ts:4d}ms  {msg}")
print("Final cluster state:")
for nid, n in nodes.items():
    print(f"  N{nid}: role={n.role:9s} term={n.term} log_len={len(n.log)} commit={n.commit_index}")

Sample run on Python 3.11:

t= 150ms  N0 starts election term=1
t= 160ms  N0 becomes leader term=1 votes=[0, 1, 2]
t= 160ms  N0 committed idx=1 cmd=PUT settlement_id=PSU-93821
Final cluster state:
  N0: role=leader    term=1 log_len=2 commit=1
  N1: role=follower  term=1 log_len=0 commit=-1
  N2: role=follower  term=1 log_len=0 commit=-1
  N3: role=follower  term=1 log_len=0 commit=-1
  N4: role=follower  term=1 log_len=0 commit=-1

Per-line walkthrough. Node.start_election is the entry into Candidate role: increment term, vote for self, broadcast RequestVote. Note the term += 1 step — an election must run in a fresh term, both so the winner supersedes any previous leader and so votes already cast in the old term cannot block the retry. on_vote_req returns false if the candidate's (last_term, last_idx) is behind the voter's own: this is the up-to-date check, the gate that enforces leader completeness — a candidate with a stale log cannot become leader. The votes.add(src) followed by len(n.votes) > 2 is the majority threshold for a 5-node cluster (3 of 5 votes, expressed as > 2). Once elected, the leader sends an empty AppendEntries to every peer on each tick; that heartbeat is what resets follower election timers and keeps a healthy leader from being deposed. n.log.append((n.term, "no-op")) is the first write a fresh leader does in real Raft; the no-op is how the leader proves to itself, in the current term, that prior-term entries are now safely committed (more on this in the leader-completeness section). Why the no-op-on-election is non-obvious but mandatory: Raft's commitment rule says an entry is committed only when an entry from the current term has been replicated on a majority. Without the no-op, a fresh leader's log might contain entries from a prior term that the leader cannot legally commit — the no-op is the cheap way to push the commit boundary forward without waiting for client traffic.

In production etcd, the same election sequence runs in 250–400 ms p99 over a LAN. The Karan-at-PaySetu scenario from the lead — a 380 ms recovery — is the same sequence this simulator walks through, just scaled to etcd's conservative defaults of a 1000 ms election timeout and a 100 ms heartbeat interval (tighter values are common in latency-sensitive deployments).

Log replication, commitment, and the prior-term subtlety

Once a leader is elected, every client write follows the same path. The leader appends (term, command) to its own log, sends AppendEntries to all followers, waits for a majority to ack, advances its commit index, and replies to the client. Followers apply committed entries to their state machines in log order; the leader's commit index propagates to followers via the next AppendEntries. The steady-state cost is one round-trip from leader to majority — same as Multi-Paxos's accept phase, with Raft's prepare-phase analogue (RequestVote) only running once per leader term.

The subtlety is the prior-term entry commitment problem, the trickiest invariant in Raft and the one most misunderstood by readers of the original paper. Suppose leader L1 in term 7 replicates an entry e at index 4 to two of its four followers, then crashes before committing. The cluster elects L2 in term 8. L2's log includes e (because the up-to-date vote check ensured L2's log was at least as up-to-date as those of the majority that voted for it, which included one of the two followers holding e). L2 should not commit e directly, even though e is now on a majority of nodes — because if L2 then crashes and L3 is elected at term 9 from a different majority that did not have e, L3 could overwrite e, and an entry the cluster had treated as committed would be lost.

Raft's fix: a leader commits an entry only if at least one entry from its own term is also replicated on a majority and is at a higher index. The new leader's "no-op" write at the start of its term is what makes this work in practice — by getting one current-term entry committed, the leader retroactively confirms all prior-term entries up to that point. Without the no-op, the leader has to wait for a client write before any prior-term entries become committable, which can cause the cluster to appear live but refuse to make progress. Why this is the rule that converts Raft from "intuitive" to "actually correct": the original Raft paper covers it in a single compact subsection (§5.4.2, Figure 8), and many simplified Raft tutorials skip it. Production implementations all enforce it. If you read an etcd or LogCabin commit-index advancement function, you will find an explicit check: if log[N].term == currentTerm. That single line is the prior-term subtlety made concrete.
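
A sketch of that advancement rule, with hypothetical names (advance_commit, match_index); real implementations fold this into the leader's replication loop, but the current-term guard is the same one line:

# Sketch: leader-side commit advancement. The log[N].term == current_term guard is
# the Figure-8 fix; the rest is counting replicas. Names are illustrative.

def advance_commit(log, current_term, commit_index, match_index, quorum):
    """log: list of (term, cmd). match_index: highest index known replicated on each
    follower. The leader itself always holds its own last index, hence the 1 +."""
    for n in range(len(log) - 1, commit_index, -1):          # try the highest index first
        replicated = 1 + sum(1 for m in match_index.values() if m >= n)
        if replicated >= quorum and log[n][0] == current_term:
            return n               # committing n transitively commits every earlier entry
    return commit_index

log = [(2, "x=3"), (4, "no-op")]          # a prior-term entry plus the fresh leader's no-op
print(advance_commit(log, 4, -1, {"S2": 1, "S3": 1, "S4": -1}, quorum=3))   # -> 1: both entries commit
print(advance_commit(log, 4, -1, {"S2": 0, "S3": 0, "S4": -1}, quorum=3))   # -> -1: a majority holds the
# term-2 entry, but without a current-term entry alongside it the leader may not commit it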

Raft Figure 8 — why prior-term entries cannot be committed directly. Four panels showing a sequence of cluster states over five replicas S1–S5 (rows are replicas, columns are log indices, cell labels are the term the entry was created in). Panel 1: leader S1 in term 2 replicates entry "x=3" to S2; the entry is on two of five replicas. Panel 2: S1 crashes; S5 wins the election in term 3 and appends its own conflicting entry "y=7" to its log. Panel 3: S5 crashes; S1 returns, is re-elected in term 4, and continues replicating "x=3". Panel 4: if S1 committed "x=3" (now on three replicas) without a term-4 entry alongside it and then crashed again, S5 could win another election and overwrite it — a committed entry would be lost; once a term-4 entry is replicated on a majority, "x=3" is safe. The rule: a leader commits an entry only when an entry from its own term is also replicated on a majority at a higher index; the no-op write a fresh leader does at the start of its term is the cheap mechanism that makes this work in steady state. In code: if matchIndex[majority] >= N and log[N].term == currentTerm: commitIndex = N — the log[N].term == currentTerm check is the entire fix; it is one line in every production Raft.
Illustrative — Raft Figure 8, redrawn. The four panels correspond to the canonical scenario in §5.4.2 of the Ongaro-Ousterhout paper. The fix (current-term commit check) is the single line that prevents a previously-acknowledged entry from being silently overwritten.

What production Raft adds — PreVote, CheckQuorum, leader leases

The original Raft paper describes the protocol cleanly enough to ship a toy implementation in a weekend. Production Raft — etcd, CockroachDB's MultiRaft, TiKV, Consul, MongoDB's replica-set protocol, RethinkDB — all add the same set of practical extensions to handle the corner cases that the toy version mishandles.

PreVote (Ongaro's PhD thesis, §9.6). The naive election-timeout flow has a pathology: a node that gets briefly partitioned and rejoins will repeatedly increment its term and trigger pointless elections, even though the cluster has a healthy leader. Each pointless election causes the real leader to step down (because the rejoining node's higher term is seen). PreVote inserts a "would I win an election if I called one?" probe before incrementing the term — only if the candidate gets a tentative majority of pre-votes does it move to the real RequestVote phase. The term doesn't increment on the pre-vote attempt. Effect: the rejoining node, finding the cluster has a healthy leader, abandons its election attempt and re-syncs as a follower without disrupting steady state.
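
A sketch of the pre-vote gate in the same toy style as the simulator above — handle_prevote and try_campaign are illustrative names, not etcd's API:

# Sketch: PreVote. The probe does not touch persistent state; the term only increments
# once a majority signals the real election would succeed. Names are illustrative.

def handle_prevote(peer, cand_last_term, cand_last_idx):
    """Peer side: refuse if a live leader was heard recently (stickiness); otherwise
    grant iff the candidate's log is at least as up to date. Nothing is persisted."""
    if peer["leader_heard_recently"]:
        return False
    return (cand_last_term, cand_last_idx) >= (peer["last_term"], peer["last_idx"])

def try_campaign(node, peers):
    grants = 1 + sum(handle_prevote(p, node["last_term"], node["last_idx"]) for p in peers)
    if grants * 2 > len(peers) + 1:        # majority of the full cluster (peers + self)
        node["term"] += 1                  # only now does the disruptive term bump happen
        return "pre-vote won: start a real RequestVote round"
    return "pre-vote lost: stay a follower, term unchanged"

# A briefly partitioned node rejoins a healthy 5-node cluster: it abandons the campaign.
rejoiner = {"term": 7, "last_term": 5, "last_idx": 40}
healthy = [{"last_term": 6, "last_idx": 55, "leader_heard_recently": True} for _ in range(4)]
print(try_campaign(rejoiner, healthy))     # pre-vote lost: stay a follower, term unchanged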

CheckQuorum (also Ongaro §6.2, sometimes called "leader stickiness"). A leader that becomes minority-partitioned (cannot reach a majority) must step down voluntarily — otherwise it will continue accepting client writes that it can never commit, and clients will see "success" responses for writes that get rolled back. CheckQuorum is a periodic self-check: every election timeout, the leader counts how many followers it has heard from in the last interval; if fewer than majority, it steps down. This is what makes the partitioned-leader scenario safe.
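
A sketch of the periodic self-check, assuming the leader records the last time each follower acked a heartbeat (check_quorum and the timing values are illustrative):

# Sketch: CheckQuorum. Once per election timeout the leader checks whether it has heard
# from a majority recently; if not it steps down instead of accepting doomed writes.

def check_quorum(now, last_heard, election_timeout, cluster_size):
    """last_heard: follower id -> timestamp of its last AppendEntries ack (ms).
    The leader counts itself, hence the 1 +."""
    live = 1 + sum(1 for t in last_heard.values() if now - t < election_timeout)
    return live * 2 > cluster_size          # True: keep leading; False: step down

last_heard = {"n2": 900, "n3": 880, "n4": 300, "n5": 250}   # n4, n5 partitioned away
print(check_quorum(1000, last_heard, election_timeout=200, cluster_size=5))  # True: 3 of 5 live
print(check_quorum(1300, last_heard, election_timeout=200, cluster_size=5))  # False: step down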

Leader leases for read-only queries (Spanner-style; CockroachDB and TiKV serve lease-based reads, while etcd's default linearizable reads use the related ReadIndex mechanism, which confirms leadership with a heartbeat round instead of a clock lease). A leader can serve read-only queries linearisably without going through Raft's log if it holds an unexpired lease — implicitly granted by the followers when they last heartbeat-acked the leader. The lease is a wall-clock window: if less than the election timeout (minus a clock-drift allowance) has elapsed since the leader's last full-majority heartbeat ack, no other node can have been elected, because followers will not start an election within that window. Reads served under lease cost a single in-memory dispatch instead of a Raft round — a 5–10× read-latency reduction in practice. The catch: leases require a bounded clock-drift assumption — if the leader's clock skews by more than the drift allowance relative to the followers, safety breaks. Most implementations cap lease durations at 5–10 seconds for this reason.
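
A sketch of the lease check, assuming the leader tracks the wall-clock time of its last full-majority heartbeat ack and a configured drift bound (names and values are illustrative):

# Sketch: lease-gated reads. The lease is anchored to the last time a full majority
# acked a heartbeat; clock_drift_bound hedges against the leader's clock running fast.

def can_serve_lease_read(now, last_majority_ack, election_timeout, clock_drift_bound):
    """No other node can have won an election within election_timeout of the acks that
    back this lease, so reads inside the (drift-adjusted) window need no Raft round."""
    return now < last_majority_ack + election_timeout - clock_drift_bound

print(can_serve_lease_read(now=10_300, last_majority_ack=10_000,
                           election_timeout=1_000, clock_drift_bound=100))   # True: serve locally
print(can_serve_lease_read(now=11_000, last_majority_ack=10_000,
                           election_timeout=1_000, clock_drift_bound=100))   # False: confirm leadership first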

Joint consensus for membership change (Raft §6). Adding or removing a node from a Raft cluster cannot happen atomically — there is a transient window where the old and new configurations could elect different leaders. The "joint consensus" trick: during membership change, both old and new majorities must agree on every write; this prevents two leaders during the transition. Modern Raft implementations (etcd's raftpb.ConfChange, the standard pattern) use a simpler "single-server change" approach (Ongaro §4.3) where you only add or remove one node at a time, which avoids needing joint consensus for that case. CockroachDB's MultiRaft uses joint consensus for atomic two-server changes.

PaySetu's etcd cluster — a 5-node deployment across three AZs in ap-south-1, used for the payment-service control plane — runs all four of these. The cluster's heartbeat is 100 ms, election timeout 1000 ms, lease duration 5 s. Steady-state writes p99: 4.2 ms. Election convergence p99 (measured during routine quarterly chaos drills via pumba kill --random): 1.18 s. Read p99 with lease: 320 µs.

Going deeper

The leader-completeness property — proof sketch

Raft's safety proof reduces to one invariant: if an entry is committed at index i in term T, every subsequent leader's log contains that entry at index i. The proof is by contradiction. Suppose leader L' in term T' > T does not have the entry at index i. Then L' won an election in term T' with a majority of votes. The entry was committed in term T, meaning a majority of nodes had it. The two majorities (vote-givers in T' and entry-holders) must overlap in at least one node n. Node n had the entry when it voted for L'. But the up-to-date vote check requires that L''s log was at least as up-to-date as n's — which means L' had the entry, contradicting our assumption. The up-to-date check is what carries the proof: it ensures that whoever wins an election has seen every committed entry. Combined with the prior-term commit rule (an entry from term T is committed only when paired with a current-term entry on a majority), this is enough to make Raft safe under arbitrary leader changes, network partitions, and crashes. The full safety argument is worked out in Ongaro's thesis, and a TLA+ specification of Raft is available for model checking.
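
The overlap step is the only non-mechanical part of the argument, and for a small cluster it can be checked exhaustively — the snippet below just confirms the pigeonhole fact the proof leans on, that any two 3-of-5 majorities intersect:

# Exhaustive check of the overlap fact the proof relies on: any two majorities of a
# 5-node cluster share at least one node.
from itertools import combinations

majorities = list(combinations(range(5), 3))       # every 3-of-5 quorum
assert all(set(a) & set(b) for a in majorities for b in majorities)
print(f"{len(majorities)} quorums checked, every pair intersects")   # 10 quorums checked, every pair intersects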

MultiRaft — running thousands of Raft groups in parallel

A single Raft group serialises every write through one leader. For a database with terabytes of data spread across thousands of shards, you need thousands of independent Raft groups running in parallel. CockroachDB and TiKV both use this pattern, called MultiRaft. The key engineering challenge is that naïvely running 10000 Raft groups means 10000 heartbeats per heartbeat interval — which dominates network traffic. The optimisation: batch heartbeats — one network message per pair of nodes carries heartbeats for every Raft group those two nodes share. Combined with per-node-pair connection pooling, this drops heartbeat traffic from O(groups × replicas) to O(node_pairs), a 100× reduction for typical configurations. CockroachDB's mergeQueue further dynamically merges Raft groups to keep the per-node Raft group count below 50000.
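
A sketch of the coalescing idea — group every heartbeat bound for the same destination node into one message; the data shapes here are illustrative, not CockroachDB's or TiKV's actual structures:

# Sketch: heartbeat coalescing in a MultiRaft node. One message per destination node
# carries the heartbeats for every group this node leads.
from collections import defaultdict

def coalesce_heartbeats(groups, self_node):
    """groups: iterable of (group_id, leader_node, replica_nodes)."""
    batches = defaultdict(list)
    for gid, leader, replicas in groups:
        if leader != self_node:
            continue
        for dst in replicas:
            if dst != self_node:
                batches[dst].append(gid)      # this group's heartbeat rides in dst's batch
    return dict(batches)

# Node 1 leads 4 groups replicated on nodes 1-3: 8 per-group heartbeats become 2 messages.
groups = [(g, 1, [1, 2, 3]) for g in range(4)]
print(coalesce_heartbeats(groups, self_node=1))   # {2: [0, 1, 2, 3], 3: [0, 1, 2, 3]}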

Joint consensus and the membership-change subtleties

Membership change in Raft is the part of the protocol most often gotten wrong in toy implementations. The naive approach — "the leader appends a config-change entry, and once committed, the new config is in effect" — has a window where the old majority and the new majority do not overlap. If a partition heals during this window, two leaders can be elected (one from the old config, one from the new). Raft's solution is joint consensus: during the change, every commit must be acknowledged by a majority of both the old and the new config. The transition runs in two log entries: C_old,new (joint config) followed by C_new (final config). The joint config is itself a Raft entry, committed via the Raft protocol; once committed, the leader can issue the C_new entry; once that commits, the old config is retired. Production Raft (etcd, MongoDB) usually skips joint consensus by restricting to single-server changes — adding or removing exactly one node per operation — because the safety proof for that case is simpler. The trade-off is operational: scaling a cluster from 3 to 5 nodes requires two sequential single-server changes.
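
A sketch of the commit rule while C_old,new is in effect — an entry needs a majority of both configurations (names illustrative):

# Sketch: commitment during joint consensus. While C_old,new is in effect, an entry
# commits only when acked by a majority of the old config AND a majority of the new.

def majority_acked(acks, config):
    return sum(1 for node in config if node in acks) * 2 > len(config)

def committed_joint(acks, c_old, c_new):
    return majority_acked(acks, c_old) and majority_acked(acks, c_new)

c_old, c_new = {1, 2, 3}, {1, 2, 3, 4, 5}          # growing the cluster from 3 to 5
print(committed_joint({1, 2}, c_old, c_new))        # False: the new config lacks a majority
print(committed_joint({1, 2, 4}, c_old, c_new))     # True: 2 of 3 old and 3 of 5 new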

How etcd implements Raft — the production code path

etcd's Raft implementation (in Go, at etcd-io/raft) is the de facto reference for production-quality Raft. The library separates "Raft state machine" (pure, deterministic, no I/O) from "transport and storage" (handled by the application). The state machine receives messages and produces a Ready struct containing entries to persist, messages to send, snapshots to apply, and committed entries to apply to the application state machine. The application is then responsible, in order, for: (1) persisting the entries and hard state with fsync, (2) sending the messages, (3) applying the committed entries. This separation is what makes the library reusable across applications (etcd itself, CockroachDB, and Dgraph use it directly; TiKV uses raft-rs, a Rust port of the same design). The state machine is single-threaded and deterministic — the same input messages produce the same output, which is what enables formal verification and exhaustive testing. The application threads handle concurrency. PaySetu's payment-service control plane uses a fork of etcd's Raft library directly; the customisation is in the application state machine (settlement-record domain logic) and the snapshot format, not in the Raft protocol itself.
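
A toy sketch of that split in Python (the real API is Go; Ready, step, and the field names here are illustrative): a pure step function turns inputs into a Ready, and the caller does the persisting, sending, and applying:

# Toy sketch of the ownership split: a pure step() produces a Ready; the application
# shell does the I/O. Names are illustrative, not the etcd-io/raft Go API.
from dataclasses import dataclass, field

@dataclass
class Ready:
    entries_to_persist: list = field(default_factory=list)   # fsync these first
    messages_to_send: list = field(default_factory=list)     # then hand to the transport
    committed_entries: list = field(default_factory=list)    # then apply, in order

def step(raft_state, proposals):
    """Pure and deterministic: same state + same inputs -> same Ready. No I/O in here."""
    rd = Ready()
    for cmd in proposals:                  # toy: a stable leader accepts and commits directly
        entry = (raft_state["term"], cmd)
        rd.entries_to_persist.append(entry)
        rd.messages_to_send.append(("AppendEntries", entry))
        rd.committed_entries.append(entry)
    return rd

# The impure shell owned by the application:
raft_state, wal, kv = {"term": 3}, [], {}
rd = step(raft_state, [("PUT", "settlement_id", "PSU-93821")])
wal.extend(rd.entries_to_persist)              # 1. persist (a real shell fsyncs a WAL)
outbox = list(rd.messages_to_send)             # 2. hand to the transport for followers
for _term, (op, key, val) in rd.committed_entries:   # 3. apply committed entries in order
    kv[key] = val
print(wal, kv)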

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
# save raft_walk.py from the article body
python3 raft_walk.py

# Then run a real 3-node etcd cluster locally:
brew install etcd  # or apt install etcd-server etcd-client on Linux
# Terminal 1
etcd --name=n1 --listen-peer-urls=http://localhost:2380 \
  --listen-client-urls=http://localhost:2379 \
  --advertise-client-urls=http://localhost:2379 \
  --initial-advertise-peer-urls=http://localhost:2380 \
  --initial-cluster=n1=http://localhost:2380,n2=http://localhost:2480,n3=http://localhost:2580
# Repeat for n2, n3 with adjusted ports, then:
etcdctl --endpoints=localhost:2379,localhost:2479,localhost:2579 endpoint status --write-out=table
# kill the leader's etcd process, then re-run endpoint status to watch a new leader appear in <2 seconds

Where this leads next

Raft is the algorithm production systems run when they need replicated state. The chapters that follow zoom into specific mechanisms and operational concerns built on top of it.

Part 9 covers leader election and leases — the operational machinery that runs on top of the consensus protocol. The Raft cluster that backs your control plane will spend most of its time not running consensus, because a stable leader is in steady state; the leader's lease is the optimisation that lets reads skip the consensus tax. Read Part 9 next.

References

  1. Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm" — USENIX ATC 2014 — the original Raft paper. Read sections 5 and 6 carefully; section 8 on client interaction is often skipped but matters for correctness.
  2. Ongaro, "Consensus: Bridging Theory and Practice" — Stanford PhD thesis 2014 — the long version, with PreVote, CheckQuorum, leader leases, joint consensus all explicitly worked out.
  3. Howard & Mortier, "Paxos vs Raft: have we reached consensus on distributed consensus?" — PaPoC 2020 — formalises the equivalence between Multi-Paxos and Raft.
  4. etcd Raft library — etcd-io/raft on GitHub — the de facto production Raft implementation in Go. Reading raft.go and node.go is the fastest way to internalise how the protocol composes with real I/O.
  5. Howard, Malkhi, Spiegelman, "Flexible Paxos: Quorum intersection revisited" — OPODIS 2016 — generalises the quorum constraint; applies to Raft as well as Paxos.
  6. Discord's "How Discord Stores Trillions of Messages" — Discord Engineering Blog 2023 — production Raft (and ScyllaDB's Raft-on-Cassandra-replacement) at scale; describes the asymmetric-partition bug PreVote does not catch.
  7. Paxos and why people struggle with it — chapter 51; the Paxos counterpart, with the prepare/promise/accept dance and the parliament metaphor.
  8. FLP impossibility — what it forbids — chapter 50; the boundary Raft navigates around via partial synchrony and randomised election timeouts.