In short
Raft makes one strong simplifying choice: at any moment there is at most one leader, and all client writes flow through that leader. The leader appends operations to its log and ships them to followers; the cluster's job, when no leader exists, is to elect one as quickly as possible without electing two.
A term is Raft's logical clock — a monotonically increasing integer that labels each leadership epoch. Term 1 had leader A; A died; term 2 had leader B; B got partitioned; term 3 has leader C. Every server stamps its messages with currentTerm, and a server seeing a higher term immediately steps down to follower and updates its own term. Terms partition the timeline into disjoint epochs; "at most one leader per term" is the central safety invariant of the election layer.
Each server is in exactly one of three states: Follower (passive — replies to leaders and candidates), Candidate (actively soliciting votes), Leader (sending heartbeats and replicating entries). A follower that hears nothing for an election timeout (typically randomized in 150-300 ms) bumps its term, becomes a candidate, votes for itself, and broadcasts RequestVote to every peer. A candidate that collects votes from a majority becomes leader. A candidate that sees a higher term, or a heartbeat from a current leader, reverts to follower.
Two rules keep elections safe. (1) At most one vote per term — a server records votedFor durably and refuses second votes within the same term, which forces any winner to have a majority of distinct votes. (2) Up-to-date log restriction — a voter rejects candidates whose log is older than its own (last log term, then last log index), so a candidate winning election necessarily has every committed entry. Randomized timeouts make split votes statistically rare; when one occurs, the term simply advances and a new round begins.
You are running a five-node Raft cluster. Three machines are in a Mumbai DC, two in Bengaluru. Things have been quiet for hours — node 1 is the leader, sending heartbeats every 50 ms, and the four followers are dutifully appending log entries and acking. Then node 1's NIC dies. The followers stop hearing heartbeats. Within 200 ms, one of them will notice the silence, decide the leader is dead, and try to take over.
What happens next is the entire content of this chapter. Done badly, two nodes try to lead simultaneously, the log forks, and you spend tomorrow morning doing data-loss postmortems. Done correctly — and Raft's design is what "correctly" means here — exactly one node wins, the cluster picks up where it left off, clients see a 200-300 ms blip, and the on-call engineer goes back to sleep.
The previous chapter (the replicated state machine abstraction) established why we need consensus. This chapter starts the how: get a single, agreed-upon leader. The next chapter handles what that leader does once elected — replicate log entries.
Why a leader at all
Multi-Paxos and Raft both centre on a single leader because the alternative — every replica accepting writes independently and reconciling — collapses into the very ordering problem consensus is meant to solve. With a serial leader, the log-replication algorithm reduces to: leader picks an index, appends, ships to followers, waits for majority ack. No conflict resolution, no concurrent writes to the same slot, no reordering games. The leader's local log is, by construction, the prefix every follower will eventually adopt.
The cost of having a leader is that the leader becomes a bottleneck (one machine handles all writes) and a single point of progress (no leader, no writes). Raft accepts both costs in exchange for a dramatically simpler protocol. The bottleneck is mitigated by sharding — one leader per shard, not per cluster — and by the observation that consensus is rarely the throughput bottleneck for OLTP-shaped workloads. The progress problem is what election solves: when a leader dies, get a new one fast.
Why "fast" matters: every millisecond without a leader is a millisecond clients cannot write. The election timeout (150-300 ms) bounds the unavailability window from the moment of leader death to the moment a successor commits its first entry. Production targets are typically under 1 second total recovery; some systems push under 200 ms with tighter timeouts at the cost of more spurious elections under load.
Terms as a logical clock
Before electing leaders we need a way to talk about which leadership era we are in. Raft uses a single integer, currentTerm, that every server maintains. Every term has at most one leader. When elections fail or a leader dies, the term increases and a new election begins. Terms label disjoint slices of the timeline; messages always carry the sender's term, and any server receiving a message with term > currentTerm immediately updates its own term and reverts to follower.
Two properties of terms drop out of this design.
Term comparison detects staleness. A delayed message carrying term=5 arriving at a server in term=8 is from a dead epoch and is discarded. A leader from term 5 receiving any message stamped term 8 — be it an election plea, a heartbeat from a newer leader, or even a refused vote — knows it is no longer leader and steps down. This single rule handles most of the awkward "old leader wakes up after partition heals" scenarios without special cases.
Terms order disjoint events. Two events that happened in different terms have a clear ordering: lower term comes first. Within the same term events on different machines are not directly comparable, but Raft's design ensures only one machine matters per term — the leader — so the within-term ordering is just the leader's local log order. This is why terms work as a logical clock for consensus even though they don't track wall time.
Every server persists currentTerm to disk before responding to any RPC that could affect it. Why durable: if a server crashes after voting in term 7 and restarts thinking it is still in term 5, it might vote a second time in term 7 for a different candidate, breaking the at-most-one-leader-per-term invariant. The fsync on currentTerm and votedFor is mandatory; skipping it for "performance" produces silent safety violations under crash-restart.
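The universal term rule is small enough to sketch directly. This is a hypothetical `observe_term` helper over a bare dict, not the full `Server` class shown later in the chapter:

```python
def observe_term(state: dict, msg_term: int) -> bool:
    """Apply the universal term rule to an incoming message.

    state holds 'current_term' and 'role'. Returns True if the message
    should be processed further, False if it is stale and must be dropped.
    """
    if msg_term > state["current_term"]:
        # Newer epoch: adopt the term and revert to follower, whatever we were.
        # (A real server must fsync current_term before replying.)
        state["current_term"] = msg_term
        state["role"] = "Follower"
        return True
    if msg_term < state["current_term"]:
        return False  # message from a dead epoch; discard
    return True  # same term: process normally
```

Note that a stale message is dropped without touching local state, while a newer term demotes even a leader in a single branch.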
The three states
Every Raft server is in exactly one of three states. The state machine of transitions is small enough to draw without omission.
- Follower — passive. Responds to RPCs from leaders and candidates. Never initiates anything except an election when its timeout fires. This is the default state on startup.
- Candidate — actively trying to become leader. Has incremented currentTerm, voted for itself, and sent RequestVote to all peers. Counts incoming vote responses.
- Leader — has won an election. Sends periodic heartbeats (empty AppendEntries) to all followers; handles client requests by appending to its log and replicating.
Transitions are driven by three events: election timeout firing, discovering a higher term, and counting a majority of votes.
startup / discover higher term
↓
┌─────────┐
│Follower │ ←─────────────────── (any state) sees term > currentTerm
└─────────┘
│ election timeout fires
│ (no heartbeat for 150-300 ms)
↓
┌──────────┐
│Candidate │ ─── sees current leader's heartbeat ──→ Follower
└──────────┘ ─── another candidate wins ────────→ Follower
│
│ wins majority of votes
↓
┌────────┐
│ Leader │ ─── steps down on higher term ──→ Follower
└────────┘
Three transitions to remember:
- Follower → Candidate on election timeout. A follower that has not heard from a leader in electionTimeout ms assumes the leader is gone, increments its term, and starts an election.
- Candidate → Leader on majority vote. Once a candidate has floor(N/2) + 1 votes (including its own), it is the unique leader of the new term.
- Anything → Follower on seeing a higher term. The "step down" rule is universal — no Raft server ever stays in candidate or leader state when it learns a newer term exists. This is the mechanism that prevents two leaders from coexisting once the partition heals and they exchange messages.
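The whole transition diagram fits in one function. This is a sketch over a hypothetical role/event encoding, not a full server:

```python
def next_role(role: str, event: str) -> str:
    """State transitions from the diagram above. Events:
    'timeout'          - election timer fired with no leader contact
    'majority'         - candidate counted a majority of votes
    'higher_term'      - saw a message with term > currentTerm
    'leader_heartbeat' - candidate saw a valid heartbeat for the current term
    """
    if event == "higher_term":
        return "Follower"              # universal step-down rule, any state
    if role == "Follower" and event == "timeout":
        return "Candidate"
    if role == "Candidate":
        if event == "majority":
            return "Leader"
        if event == "leader_heartbeat":
            return "Follower"
        if event == "timeout":
            return "Candidate"         # split vote: retry in a new term
    return role                        # everything else: no change
```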
The election timeout — and why it is randomized
A follower decides "the leader is gone" by waiting for an election timeout without receiving any RPC from a leader (or granting a vote, which also resets the timer). Choose this badly and you are in trouble.
Too short — say 20 ms: a normal GC pause or a slow heartbeat trips it, the cluster spends most of its energy on spurious elections, and clients see thrashing leadership. No leader can reliably deliver heartbeats faster than its slowest follower's timeout.
Too long — say 5 seconds: when a leader actually dies, clients are blocked on writes for up to 5 seconds before any follower notices and starts an election. Dead time the user feels.
The Raft paper recommends 150-300 ms with heartbeats every ~50 ms, which gives a 3-6× safety factor over the heartbeat interval. Production tunings vary: etcd defaults to 1000 ms for cross-region forgiveness; CockroachDB uses 200 ms because it expects fast intra-datacenter networks.
But the bigger question is not the magnitude — it is the randomization. Every follower picks its own election timeout uniformly at random from a range, e.g. [150ms, 300ms], and re-randomizes after each election attempt.
The math: if all five followers use the same fixed 150 ms timeout, every leader death triggers a five-way split vote because every follower fires at exactly t=150 ms and each votes for itself. With randomization in [150, 300], the gap between the first-firer and the second-firer is roughly range/N. For range=150 ms, N=5, that gap averages 30 ms — long enough for the first candidate's RequestVote RPCs to reach the others before their own timers expire, soliciting their votes and pre-empting their elections.
Why this works statistically: split votes still occur (when two timers fire close together), but their probability decreases roughly as e^(-range/RTT). For range=150 ms and intra-DC RTT~1 ms, the per-election split probability is well under 1%. When a split does happen, the term simply advances and the next round of randomized timers tries again — convergence is rapid.
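A quick Monte Carlo check of the gap claim. With four surviving followers drawing timers uniformly from [150, 300] ms, the expected gap between the first and second timer to fire is range/(n+1) = 150/5 = 30 ms, matching the figure above:

```python
import random

def mean_first_gap(n_followers: int = 4, lo: float = 0.150, hi: float = 0.300,
                   trials: int = 100_000, seed: int = 1) -> float:
    """Average gap (seconds) between the earliest and second-earliest
    randomized election timers across many simulated leader deaths."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        timers = sorted(rng.uniform(lo, hi) for _ in range(n_followers))
        total += timers[1] - timers[0]
    return total / trials
```

The 30 ms head start is what lets the first candidate's RequestVote RPCs (a millisecond or so in flight within a DC) reach the other followers and reset their timers before they fire.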
The election protocol
A follower's election timer fires. Here is the entire protocol.
Step 1: Become candidate. Increment currentTerm from T to T+1. Set state = Candidate. Set votedFor = self. Persist currentTerm and votedFor to disk. Reset the election timer (with a fresh random value).
Step 2: Vote for self. Count one vote.
Step 3: Send RequestVote. Broadcast RequestVote(term=T+1, candidateId=self, lastLogIndex, lastLogTerm) to all other servers in parallel.
Step 4: Count responses. As RequestVoteResponse(term, voteGranted) arrives:
- If term > T+1, step down to follower, update currentTerm = term, abort.
- If voteGranted, increment the vote count. If count ≥ majority, transition to Leader and immediately send heartbeats to all peers (this claims leadership and prevents new elections).
- Otherwise (rejected), continue waiting.
Step 5: Timeout. If the election timer fires again before a winner emerges, restart the election: increment term to T+2, broadcast again. This is what happens after a split vote.
The voter side is symmetrically simple. On RequestVote(term, candidateId, lastLogIndex, lastLogTerm):
- If term < currentTerm, reject (stale candidate).
- If term > currentTerm, step down to follower and update currentTerm = term. Continue.
- If votedFor is not null and not candidateId, reject (already voted this term).
- If the candidate's log is not at least as up-to-date as ours, reject.
- Otherwise, grant the vote, set votedFor = candidateId, persist, reset the election timer.
The "up-to-date" check — the fourth voter rule above — compares (lastLogTerm, lastLogIndex) lexicographically: a log with a higher last term is more up-to-date; with the same last term, the longer log wins. This is what guarantees an elected leader has every committed entry — chapter 102 (safety arguments and the leader-completeness property) proves this formally; for now the rule is enough.
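Because Python tuples compare lexicographically, the whole rule is one expression. A minimal sketch:

```python
def candidate_up_to_date(cand_last_term: int, cand_last_index: int,
                         voter_last_term: int, voter_last_index: int) -> bool:
    """True iff the candidate's log is at least as up-to-date as the voter's:
    compare last terms first, then last indexes."""
    return (cand_last_term, cand_last_index) >= (voter_last_term, voter_last_index)
```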
Split votes — when randomization is not enough
With five followers and randomized timeouts, two of them might still time out within a few milliseconds of each other. Each becomes a candidate, votes for itself, and broadcasts RequestVote.
When neither candidate reaches majority, both eventually time out, increment the term again, and retry. Because each picks a fresh random timeout, the chance of another simultaneous start is small. The expected number of rounds before a winner emerges is roughly 1 / (1 - p), where p is the per-round split probability — for a well-tuned cluster p < 0.05 and elections converge in 1-2 rounds.
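The expected-rounds figure is just the mean of a geometric distribution, assuming rounds split independently with probability p:

```python
def expected_election_rounds(p_split: float) -> float:
    """Mean number of rounds until one round produces a winner, when each
    round independently ends in a split vote with probability p_split."""
    return 1.0 / (1.0 - p_split)
```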
This is the practical reason Raft uses odd-sized clusters. With four nodes (majority = 3), one failure leaves three voters, and a candidate must win every remaining vote — one slow or split voter stalls the election. With five (majority = 3) you can afford one failure, survive a 2-2 split between two candidates, and simply try again.
Voter rules
Two voter rules are doing the safety work:
At most one vote per term. A server records votedFor durably. If a second RequestVote arrives in the same term from a different candidate, the server rejects it. This is what guarantees at-most-one-leader-per-term: a leader needs a majority of distinct votes, and no server's vote is counted twice. Two simultaneous candidates cannot both get majorities, because the union of two majorities is more than N votes — pigeonhole says at least one server would have voted twice, which the durable votedFor rule prevents.
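The pigeonhole argument checks mechanically: for any cluster size n, two vote sets of majority size must share at least one voter.

```python
def majority(n: int) -> int:
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1

def majorities_must_overlap(n: int) -> bool:
    """Two disjoint majorities would need more servers than exist."""
    return 2 * majority(n) > n
```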
Why durable: imagine F3 votes for F1 in term 5, then crashes before fsync of votedFor. On restart, F3 thinks it has not voted yet. F2's RequestVote arrives and F3 votes for F2 too — both F1 and F2 now have majorities, both think they are leader, and the log forks. The fsync on votedFor is the safety barrier; running Raft without it is a correctness bug, not a performance issue.
Up-to-date log restriction. A voter rejects a candidate whose log is older than its own. Specifically, the candidate's (lastLogTerm, lastLogIndex) must be ≥ the voter's by lexicographic comparison.
This rule prevents a candidate that missed recent log entries from becoming leader and overwriting them. Suppose entry 7 was committed by the previous leader (replicated to a majority including F1, F2, F3 but not F4, F5). If F4 starts an election after the leader dies, it does not have entry 7. The voter rule says: F1, F2, F3 all see F4's lastLogIndex < 7 and reject. F4 cannot win. The next election will be won by some node that does have entry 7, and the committed entry survives.
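The scenario plays out mechanically. A hypothetical sketch with the voters' last log positions, counting who would grant F4 a vote (entry terms here are illustrative):

```python
def grants_vote(voter_log: list, cand_last_term: int, cand_last_index: int) -> bool:
    """Up-to-date check only (assume no voter has voted yet this term).
    voter_log is a list of (term, op) tuples, 1-indexed by position."""
    v_term = voter_log[-1][0] if voter_log else 0
    v_index = len(voter_log)
    return (cand_last_term, cand_last_index) >= (v_term, v_index)

# Entry 7 (say term 3) reached F1-F3 but not F4/F5; F4's log stops at index 6.
full = [(3, f"op{i}") for i in range(1, 8)]   # indexes 1..7
short = full[:6]                              # indexes 1..6
voters = {"F1": full, "F2": full, "F3": full, "F5": short}
granted = [name for name, log in voters.items()
           if grants_vote(log, cand_last_term=3, cand_last_index=6)]
# Only F5 grants; one vote plus F4's self-vote is 2 < 3, so F4 cannot win.
```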
We will return to this rule formally in chapter 102; for now the operational summary is: voters reject stale candidates, so the elected leader's log is always a superset of all committed entries.
Python — start_election and handle_request_vote
The election logic in production code is roughly 30 lines, plus the RPC plumbing.
```python
import asyncio
import random
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    term: int
    op: tuple


@dataclass
class VoteResponse:
    term: int
    vote_granted: bool


@dataclass
class Server:
    server_id: str
    peers: list[str]
    state: str = 'Follower'              # Follower | Candidate | Leader
    current_term: int = 0
    voted_for: str | None = None
    log: list[LogEntry] = field(default_factory=list)
    election_deadline: float = 0.0       # absolute monotonic time
    rpc: 'RPCClient' = None              # sends RequestVote to peers
    storage: 'PersistentStorage' = None  # writes currentTerm, votedFor

    @property
    def last_log_index(self) -> int:
        return len(self.log)

    @property
    def last_log_term(self) -> int:
        return self.log[-1].term if self.log else 0

    def reset_election_timer(self):
        # randomized in [150, 300] ms — the anti-split-vote knob
        self.election_deadline = (
            asyncio.get_event_loop().time() + random.uniform(0.150, 0.300)
        )

    async def start_election(self):
        """Follower or candidate noticed timeout; bump term and solicit votes."""
        self.state = 'Candidate'
        self.current_term += 1
        self.voted_for = self.server_id
        self.storage.persist(self.current_term, self.voted_for)  # fsync!
        self.reset_election_timer()
        term = self.current_term
        votes = 1  # self-vote
        # Snapshot log position now; do not let it change mid-election.
        last_idx, last_term = self.last_log_index, self.last_log_term

        async def ask(peer):
            try:
                resp = await self.rpc.request_vote(
                    peer, term=term, candidate_id=self.server_id,
                    last_log_index=last_idx, last_log_term=last_term,
                )
                return peer, resp
            except (asyncio.TimeoutError, ConnectionError):
                return peer, None

        pending = [asyncio.create_task(ask(p)) for p in self.peers]
        majority = (len(self.peers) + 1) // 2 + 1
        for fut in asyncio.as_completed(pending):
            peer, resp = await fut
            if self.state != 'Candidate' or self.current_term != term:
                return  # stepped down or moved on; abandon
            if resp is None:
                continue  # peer unreachable; its vote is simply missing
            if resp.term > self.current_term:
                self.step_down(resp.term)
                return
            if resp.vote_granted:
                votes += 1
                if votes >= majority:
                    self.become_leader()
                    return

    def handle_request_vote(self, term, candidate_id, last_log_index, last_log_term):
        """Respond to RequestVote from a candidate."""
        # Step down if we see a newer term — universal rule.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.state = 'Follower'
        grant = False
        if term < self.current_term:
            grant = False  # stale candidate
        elif self.voted_for not in (None, candidate_id):
            grant = False  # already voted this term
        elif not self._candidate_log_up_to_date(last_log_index, last_log_term):
            grant = False  # candidate's log is behind
        else:
            grant = True
            self.voted_for = candidate_id
            self.reset_election_timer()  # granting a vote resets the timer
        self.storage.persist(self.current_term, self.voted_for)  # fsync!
        return VoteResponse(term=self.current_term, vote_granted=grant)

    def _candidate_log_up_to_date(self, cand_last_idx, cand_last_term) -> bool:
        # Up-to-date iff (cand_last_term, cand_last_idx) >= (our_last_term, our_last_idx)
        if cand_last_term != self.last_log_term:
            return cand_last_term > self.last_log_term
        return cand_last_idx >= self.last_log_index

    def step_down(self, new_term):
        self.current_term = new_term
        self.voted_for = None
        self.state = 'Follower'
        self.storage.persist(self.current_term, self.voted_for)
        self.reset_election_timer()

    def become_leader(self):
        self.state = 'Leader'
        # ch.101: initialize nextIndex / matchIndex, start sending heartbeats
```
Three things in this code are load-bearing safety, not optimisation:
- The storage.persist(...) calls must fsync before any RPC reply is sent. Skipping them turns at-most-one-vote-per-term into a probabilistic guarantee, which is the same as no guarantee.
- The if self.state != 'Candidate' or self.current_term != term: return check after each vote response. If a higher-term message arrived mid-election, we have already stepped down; counting a stale vote and "winning" the old term is a bug.
- The reset_election_timer() inside handle_request_vote when we grant a vote. This prevents a follower from racing to start its own election while a candidate is mid-flight, soaking up votes that should go to the candidate.
Five-node cluster, leader fails, election unfolds
Setup: five nodes N1-N5. Leader is N1 in term 4. Log is committed up to index 12, all replicas in sync.
t=0 ms. N1 (leader, term 4) sends a heartbeat. All four followers receive and reset their election timers.
t=10 ms. N1's NIC fails. N1 is alive but unreachable.
t=60 ms. N1's heartbeat would have gone out; doesn't reach anyone.
t=150 ms - 300 ms. Each of N2-N5 has a randomized election deadline. Suppose:
- N2: 187 ms
- N3: 234 ms
- N4: 161 ms ← shortest
- N5: 278 ms
t=161 ms. N4's timer fires first.
- N4: currentTerm: 4 → 5, state: Follower → Candidate, votedFor: self, fsync.
- N4 broadcasts RequestVote(term=5, candidateId=N4, lastLogIndex=12, lastLogTerm=4).
- N4 vote count: 1 (self). Needs 3 for majority.
t=163 ms (intra-DC ~1ms RTT). RequestVote arrives at N2, N3, N5.
- N2: currentTerm: 4 → 5, votedFor: N4, fsync. Sends grant. Resets election timer.
- N3: same. fsync, grant.
- N5: same. fsync, grant.
- N1 is unreachable — its NIC is dead, so the RequestVote to N1 times out.
t=165 ms. Vote responses arrive at N4.
- From N2: granted. Count = 2.
- From N3: granted. Count = 3 ≥ majority. N4 becomes leader of term 5.
- N4 immediately sends heartbeat AppendEntries(term=5, ...) to N2, N3, N5.
t=167 ms. Heartbeats arrive. N2, N3, N5 see term=5 from a leader, reset election timers, accept N4 as leader.
t=170 ms onwards. N4 starts processing client writes. The log resumes; entries committed in term 4 are still committed (the up-to-date-log rule guaranteed N4 had them); new entries get term 5.
Total downtime: ~157 ms — from the moment N1's NIC failed (t=10) to the moment N4 was sending heartbeats (t=167). Clients that had requests in flight to N1 will see timeouts and retry; the Raft client library transparently routes retries to N4 once a leader is found.
If two timers had fired close together — say N4 at 161 and N2 at 165 — and N2 became a candidate while N4's RequestVotes were still in flight, the result depends on which RPCs arrived first at each voter. Often one candidate wins outright. Sometimes no one wins (split vote); the term 5 election times out, both candidates re-randomize their timers, and a term 6 election begins. Either way, the cluster reaches a steady state in milliseconds, not seconds.
When N1's NIC eventually recovers — say at t=10 seconds — N1 wakes up still believing it is leader of term 4. It sends a heartbeat with term=4. Followers respond with term=5. N1 sees the higher term, immediately steps down to follower in term 5, and asks N4 to ship it the missing log entries. No data loss, no human intervention.
Common confusions
- "The leader is the most powerful node." No. The leader is just the node currently authorized to write. Any healthy node can win the next election. There is no "primary" or "master" hardware — just a per-term role.
- "Two candidates means the cluster is broken." No. Two simultaneous candidates is a normal occurrence under randomized timeouts; it gets resolved by a split vote and the next election round. Brokenness would be two leaders in the same term, which the at-most-one-vote-per-term rule prevents.
- "Election timeout should be tuned to the application latency." No, it should be tuned to the cluster's network RTT and load profile. Application latency is a separate concern (how fast clients see commits). Election timeout is about how fast the cluster recovers from leader failure.
- "Increasing N improves availability." Up to a point. Adding nodes makes the majority bigger, so each commit waits for more acks, slowing throughput. Five nodes is the sweet spot for most production deployments — tolerates two failures, modest replication overhead. Seven only when you genuinely need to tolerate three simultaneous failures (rare).
- "The leader can decide to step down voluntarily." Raft does support this — a leader can transfer leadership to a designated successor (TransferLeadership in etcd, used during rolling restarts). But the safety properties are unchanged; it is just a controlled election rather than a timeout-driven one.
- "PreVote prevents disruption from rejoining nodes." This is a real refinement, not a confusion. A partitioned node bumps its term repeatedly trying to win elections; on rejoin, its high term forces the current leader to step down even though no real failure occurred. Raft's PreVote optimization (a non-binding "would you vote for me?" check before incrementing the term) prevents this disruption. Most production Raft implementations include PreVote.
Going deeper
Asymmetric election timeouts and the "Pre-Vote" extension
Raft's basic protocol has the disruption issue noted above: a partitioned follower repeatedly increments its term, and on partition heal forces the live leader to step down. The PreVote phase, added in Ongaro's PhD thesis and standard in production implementations (etcd, Consul, CockroachDB), splits the election into two phases: a "would you vote for me?" probe that does not bump the term, followed by the real RequestVote only if the probe succeeds. A partitioned follower's PreVote is rejected (no current leader, no recent heartbeats means voters reply no), so its term never advances and it cannot disrupt the live cluster. Pre-Vote slightly increases election latency in the common case and dramatically improves stability under partial partitions.
Leadership leases and read scaling
A leader sitting idle still has to send heartbeats to retain leadership, and a follower will assume it's gone the moment heartbeats stop. Some Raft variants (CockroachDB's leases, etcd's leader lease) extend this with a time-bounded leadership lease — a leader, once elected, holds the leadership exclusively until the lease expires, even without explicit heartbeats. The benefit: leader-local reads can return without consulting a quorum, because the lease guarantees no other leader exists. Cost: tighter clock synchronization assumptions. Spanner takes the same idea further with TrueTime.
The Joint Consensus protocol for membership changes
Adding or removing a server from the cluster is non-trivial: changing N changes the majority, and the transition window can have two overlapping majorities that disagree. Raft's joint consensus (Section 6 of the Raft paper) uses a transitional configuration that requires majorities of both old and new configurations during the change. Production implementations more often use the simpler one-server-at-a-time approach, where each addition or removal can never produce two disjoint majorities.
Why Multi-Paxos uses a "stable leader" optimization
Pure Paxos has no leader concept; every consensus instance starts fresh. Multi-Paxos (the production form) recognizes that having a stable leader who runs phase-2 directly avoids the phase-1 round, halving message complexity. The "stable leader" in Multi-Paxos is conceptually the same as Raft's leader, but phrased differently: in Paxos, the leader is the proposer who has won phase-1 for a range of slots; in Raft, the leader is the unique server elected for a term. The mechanics differ; the bottleneck-throughput analysis is similar.
Where this leads next
The election layer is now a black box: at any moment, either there is one elected leader or the cluster is between leaders, and elections converge in milliseconds. With a leader in place, chapter 101 (log replication and commit index) turns to what the leader actually does: take client operations, append them to the log, replicate to followers, and decide when an entry is committed.
Two safety properties from this chapter feed forward into log replication:
- At most one leader per term — proved by at-most-one-vote-per-term plus majority quorum.
- Elected leader has every committed entry — established informally by the up-to-date-log rule; chapter 102 proves it formally as the Leader Completeness Property.
Together they mean the leader's log is, by construction, a safe extension of the cluster's committed log. Chapter 101 shows how the leader maintains and extends that log; chapter 102 walks through the partition and crash scenarios where these invariants are tested.
The single sentence to carry forward: terms partition history into disjoint epochs; each term has at most one leader; elected leaders have every committed entry — and randomized timeouts make achieving all three a routine, not an art.
References
- Ongaro and Ousterhout, In Search of an Understandable Consensus Algorithm (Extended Version), USENIX ATC 2014 — the canonical Raft paper. Sections 5.1-5.2 cover election and terms in detail; this chapter follows their structure.
- Ongaro, Consensus: Bridging Theory and Practice, Stanford PhD Dissertation 2014 — the deep version. Chapter 4 covers Pre-Vote, leadership transfer, and election edge cases the conference paper underspecified.
- Howard, Schwarzkopf, Madhavapeddy, Crowcroft, Raft Refloated: Do We Have Consensus?, ACM Operating Systems Review 49(1), 2015 — independent re-implementation; Section 4 walks through the election edge cases that bit the implementers.
- Lamport, Paxos Made Simple, ACM SIGACT News 32(4), 2001 — for comparison, the leaderless and Multi-Paxos formulations of the same problem.
- etcd Raft implementation — github.com/etcd-io/raft — production Go implementation with PreVote, leadership transfer, and the membership-change protocol; the canonical reference for "what does this look like at scale".
- The Raft visualization at raft.github.io — interactive simulator that animates elections, partitions, and split votes; spend ten minutes here before reading the paper for an unreasonably effective intuition boost.