In short

Raft makes one strong simplifying choice: at any moment there is at most one leader, and all client writes flow through that leader. The leader appends operations to its log and ships them to followers; the cluster's job, when no leader exists, is to elect one as quickly as possible without electing two.

A term is Raft's logical clock — a monotonically increasing integer that labels each leadership epoch. Term 1 had leader A; A died; term 2 had leader B; B got partitioned; term 3 has leader C. Every server stamps its messages with currentTerm, and a server seeing a higher term immediately steps down to follower and updates its own term. Terms partition the timeline into disjoint epochs; "at most one leader per term" is the central safety invariant of the election layer.

Each server is in exactly one of three states: Follower (passive — replies to leaders and candidates), Candidate (actively soliciting votes), Leader (sending heartbeats and replicating entries). A follower that hears nothing for an election timeout (typically randomized in 150-300 ms) bumps its term, becomes a candidate, votes for itself, and broadcasts RequestVote to every peer. A candidate that collects votes from a majority becomes leader. A candidate that sees a higher term, or a heartbeat from a current leader, reverts to follower.

Two rules keep elections safe. (1) At most one vote per term — a server records votedFor durably and refuses second votes within the same term, which forces any winner to have a majority of distinct votes. (2) Up-to-date log restriction — a voter rejects candidates whose log is older than its own (last log term, then last log index), so a candidate winning election necessarily has every committed entry. Randomized timeouts make split votes statistically rare; when one occurs, the term simply advances and a new round begins.

You are running a five-node Raft cluster. Three machines are in a Mumbai DC, two in Bengaluru. Things have been quiet for hours — node 1 is the leader, sending heartbeats every 50 ms, and the four followers are dutifully appending log entries and acking. Then node 1's NIC dies. The followers stop hearing heartbeats. Within 200 ms, one of them will notice the silence, decide the leader is dead, and try to take over.

What happens next is the entire content of this chapter. Done badly, two nodes try to lead simultaneously, the log forks, and you spend tomorrow morning doing data-loss postmortems. Done correctly — and Raft's design is what "correctly" means here — exactly one node wins, the cluster picks up where it left off, clients see a 200-300 ms blip, and the on-call engineer goes back to sleep.

The previous chapter (the replicated state machine abstraction) established why we need consensus. This chapter starts the how: get a single, agreed-upon leader. The next chapter handles what that leader does once elected — replicate log entries.

Why a leader at all

Multi-Paxos and Raft both centre on a single leader because the alternative — every replica accepting writes independently and reconciling — collapses into the very ordering problem consensus is meant to solve. With a serial leader, the log-replication algorithm reduces to: leader picks an index, appends, ships to followers, waits for majority ack. No conflict resolution, no concurrent writes to the same slot, no reordering games. The leader's local log is, by construction, the prefix every follower will eventually adopt.

The cost of having a leader is that the leader becomes a bottleneck (one machine handles all writes) and a single point of progress (no leader, no writes). Raft accepts both costs in exchange for a dramatically simpler protocol. The bottleneck is mitigated by sharding — one leader per shard, not per cluster — and by the observation that consensus is rarely the throughput bottleneck for OLTP-shaped workloads. The progress problem is what election solves: when a leader dies, get a new one fast.

Why "fast" matters: every millisecond without a leader is a millisecond clients cannot write. The election timeout (150-300 ms) bounds the unavailability window from the moment of leader death to the moment a successor commits its first entry. Production targets are typically under 1 second total recovery; some systems push under 200 ms with tighter timeouts at the cost of more spurious elections under load.

Terms as a logical clock

Before electing leaders we need a way to talk about which leadership era we are in. Raft uses a single integer, currentTerm, that every server maintains. Every term has at most one leader. When elections fail or a leader dies, the term increases and a new election begins. Terms label disjoint slices of the timeline; messages always carry the sender's term, and any server receiving a message with term > currentTerm immediately updates its own term and reverts to follower.

[Figure: term progression timeline. A horizontal timeline divided into four terms. Term 1: leader A elected, runs, then its NIC dies. Term 2: election, leader B elected, runs until a network partition. Term 3: split vote, no leader, the term times out. Term 4: election, leader C elected, runs steadily. currentTerm increases monotonically across the cluster; some terms have a leader, some have none (split vote, ongoing election). Every term is a disjoint epoch.]
Terms partition the lifetime of the cluster into disjoint epochs. A term either has exactly one leader or no leader at all (when an election fails or is in progress). The term counter advances with every election attempt, regardless of outcome — term 3 ended with no leader and the cluster moved to term 4 to try again.

Two properties of terms drop out of this design.

Term comparison detects staleness. A delayed message carrying term=5 arriving at a server in term=8 is from a dead epoch and is discarded. A leader from term 5 receiving any message stamped term 8 — be it an election plea, a heartbeat from a newer leader, or even a refused vote — knows it is no longer leader and steps down. This single rule handles most of the awkward "old leader wakes up after partition heals" scenarios without special cases.

Terms order disjoint events. Two events that happened in different terms have a clear ordering: lower term comes first. Within the same term events on different machines are not directly comparable, but Raft's design ensures only one machine matters per term — the leader — so the within-term ordering is just the leader's local log order. This is why terms work as a logical clock for consensus even though they don't track wall time.

Every server persists currentTerm to disk before responding to any RPC that could affect it. Why durable: if a server crashes after voting in term 7 and restarts thinking it is still in term 5, it might vote a second time in term 7 for a different candidate, breaking the at-most-one-leader-per-term invariant. The fsync on currentTerm and votedFor is mandatory; skipping it for "performance" produces silent safety violations under crash-restart.

The three states

Every Raft server is in exactly one of three states. The state machine of transitions is small enough to draw without omission.

Transitions are driven by three events: election timeout firing, discovering a higher term, and counting a majority of votes.

       startup / discover higher term
                 ↓
            ┌─────────┐
            │Follower │  ←─────────────────── (any state) sees term > currentTerm
            └─────────┘
                 │ election timeout fires
                 │ (no heartbeat for 150-300 ms)
                 ↓
           ┌──────────┐
           │Candidate │  ─── sees current leader's heartbeat ──→  Follower
           └──────────┘  ─── another candidate wins ────────→     Follower
                 │
                 │ wins majority of votes
                 ↓
            ┌────────┐
            │ Leader │   ─── steps down on higher term ──→ Follower
            └────────┘

Three transitions to remember:

  1. Follower → Candidate on election timeout. A follower that has not heard from a leader in electionTimeout ms assumes the leader is gone, increments its term, and starts an election.
  2. Candidate → Leader on majority vote. Once a candidate has floor(N/2) + 1 votes (including its own), it is the unique leader of the new term.
  3. Anything → Follower on seeing a higher term. The "step down" rule is universal — no Raft server ever stays in candidate or leader state when it learns a newer term exists. This is the mechanism that prevents two leaders from coexisting once the partition heals and they exchange messages.
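
The universal step-down rule can be captured in a single pure function — a sketch, not tied to any particular implementation, showing how term comparison alone decides staleness and demotion:

```python
from enum import Enum

class State(Enum):
    FOLLOWER = "Follower"
    CANDIDATE = "Candidate"
    LEADER = "Leader"

def on_message_term(state: State, current_term: int, msg_term: int):
    """Apply the universal term rule to an incoming message.

    Returns (new_state, new_term, stale):
      - higher msg_term: step down to follower, adopt the term
      - lower msg_term:  keep state, mark the message stale (discard it)
      - equal msg_term:  keep state, process the message normally
    """
    if msg_term > current_term:
        return State.FOLLOWER, msg_term, False
    if msg_term < current_term:
        return state, current_term, True
    return state, current_term, False
```

A leader in term 5 receiving anything stamped term 8 — `on_message_term(State.LEADER, 5, 8)` — comes back a follower in term 8, which is exactly the "old leader wakes up after partition heals" case handled without special-casing.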

The election timeout — and why it is randomized

A follower decides "the leader is gone" by waiting for an election timeout without receiving any RPC from a leader (or granting a vote, which also resets the timer). Choose this badly and you are in trouble.

Too short — say 20 ms: a normal GC pause or one delayed heartbeat trips it, the cluster spends most of its energy on spurious elections, and clients see thrashing leadership. No leader can reliably deliver heartbeats inside a window that a single scheduling hiccup can blow through.

Too long — say 5 seconds: when a leader actually dies, clients are blocked on writes for up to 5 seconds before any follower notices and starts an election. Dead time the user feels.

The Raft paper recommends 150-300 ms with heartbeats every ~50 ms, which gives a 3-6× safety factor over the heartbeat interval. Production tunings vary: etcd defaults to 1000 ms for cross-region forgiveness; CockroachDB uses 200 ms because it expects fast intra-datacenter networks.

But the bigger question is not the magnitude — it is the randomization. Every follower picks its own election timeout uniformly at random from a range, e.g. [150ms, 300ms], and re-randomizes after each election attempt.

[Figure: randomized election timeouts. The leader's last heartbeat marks t=0, after which the leader dies. Five horizontal bars, one per follower, each ending at its own timeout drawn uniformly from [150 ms, 300 ms]: F1 at 214 ms, F2 at 258 ms, F3 at 168 ms (fires first and becomes candidate), F4 at 191 ms and F5 at 270 ms (both reset their timers on F3's RequestVote).]
Why randomization matters: with all followers using identical timeouts, every follower would fire at the same instant after a leader death, all start elections, all vote for themselves, and split the vote 1-1-1-1-1 across five candidates. Randomizing across a 150-ms range makes the first-firer almost always finish its election (and grant heartbeats) before the second-firer's timer expires, suppressing the others.

The math: if all five followers use the same fixed 150 ms timeout, every leader death triggers a five-way split vote because every follower fires at exactly t=150 ms and each votes for itself. With randomization in [150, 300], the gap between the first-firer and the second-firer is roughly range/N. For range=150 ms, N=5, that gap averages 30 ms — long enough for the first candidate's RequestVote RPCs to reach the others before their own timers expire, soliciting their votes and pre-empting their elections.

Why this works statistically: split votes still occur when two timers fire within roughly one network round-trip of each other, and with N followers racing, that probability is on the order of N × RTT / range. For range = 150 ms and intra-DC RTT ≈ 1 ms, the per-election split probability is a few percent at worst. When a split does happen, the term simply advances and the next round of randomized timers tries again — convergence is rapid.
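
These probabilities are easy to sanity-check numerically. The sketch below is a rough Monte Carlo model — it treats a split vote as "the second timer fires within one RTT of the first", ignoring persistence latency and processing delays, so the numbers are illustrative only:

```python
import random

def split_vote_probability(n_followers: int = 5,
                           lo: float = 0.150, hi: float = 0.300,
                           rtt: float = 0.001,
                           trials: int = 100_000, seed: int = 1) -> float:
    """Estimate the chance two election timers fire within one RTT
    of each other — a rough proxy for a split vote."""
    rng = random.Random(seed)
    splits = 0
    for _ in range(trials):
        # Each follower independently draws a timeout from [lo, hi].
        timers = sorted(rng.uniform(lo, hi) for _ in range(n_followers))
        # If the second-earliest fires before the first candidate's
        # RequestVote can arrive (one RTT), both run for leader.
        if timers[1] - timers[0] < rtt:
            splits += 1
    return splits / trials
```

Shrinking the range toward a fixed timeout drives the split probability toward 1, which is the degenerate "every follower fires at t=150 ms" case from the text; widening the range drives it down.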

The election protocol

A follower's election timer fires. Here is the entire protocol.

Step 1: Become candidate. Increment currentTerm from T to T+1. Set state = Candidate. Set votedFor = self. Persist currentTerm and votedFor to disk. Reset the election timer (with a fresh random value).

Step 2: Vote for self. Count one vote.

Step 3: Send RequestVote. Broadcast RequestVote(term=T+1, candidateId=self, lastLogIndex, lastLogTerm) to all other servers in parallel.

Step 4: Count responses. As each RequestVoteResponse(term, voteGranted) arrives: if the response carries a term greater than currentTerm, step down to follower and abandon the election; if voteGranted is true, increment the vote count; on reaching floor(N/2) + 1 votes (the self-vote included), become leader and immediately broadcast heartbeats.

Step 5: Timeout. If the election timer fires again before a winner emerges, restart the election: increment term to T+2, broadcast again. This is what happens after a split vote.

The voter side is symmetrically simple. On RequestVote(term, candidateId, lastLogIndex, lastLogTerm):

  1. If term < currentTerm, reject (stale candidate).
  2. If term > currentTerm, step down to follower and update currentTerm = term. Continue.
  3. If votedFor is not null and not candidateId, reject (already voted this term).
  4. If candidate's log is not at least as up-to-date as ours, reject.
  5. Otherwise, grant the vote, set votedFor = candidateId, persist, reset election timer.

The "up-to-date" check in step 4 compares (lastLogTerm, lastLogIndex) lexicographically: a log with a higher last term is more up-to-date; same last term, the longer log wins. This is what guarantees an elected leader has every committed entry — chapter 102 (safety arguments and the leader-completeness property) proves this formally; for now the rule is enough.
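
The lexicographic comparison maps directly onto tuple comparison — a one-function sketch of the rule:

```python
def candidate_up_to_date(cand_last_term: int, cand_last_index: int,
                         my_last_term: int, my_last_index: int) -> bool:
    """Raft's 'at least as up-to-date' check: compare last log terms,
    break ties by log length. Python tuple comparison is exactly this
    lexicographic order."""
    return (cand_last_term, cand_last_index) >= (my_last_term, my_last_index)
```

Note that a higher last term beats any log length: a short log ending in term 5 defeats a long log ending in term 4, because the term-5 entry proves the candidate heard from a more recent leader.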

Split votes — when randomization is not enough

With five followers and randomized timeouts, two of them might still time out within a few milliseconds of each other. Each becomes a candidate, votes for itself, and broadcasts RequestVote.

[Figure: split vote in term 5 — two candidates, votes split 2-2, neither wins. F1 and F2 simultaneously become candidates, each votes for itself and sends RequestVote to the remaining nodes. F3's vote goes to F1 (whose request arrived first), F4's goes to F2, and F5 is offline and casts no vote. F1 has 2 votes (self + F3), F2 has 2 votes (self + F4); neither reaches the 3-vote majority, and term 5 expires with no leader.]
Two candidates simultaneously start elections in term 5. Their RequestVotes race to the other followers; F3's vote goes to F1 (F1's request arrived first), F4's vote goes to F2 (F2's first), and F5 is offline. Neither candidate reaches the majority of 3 — the term ends without a leader. Each candidate's election timer fires again, picks a new randomized timeout, and term 6 begins.

When neither candidate reaches majority, both eventually time out, increment to term 6, and try again. Because each picks a fresh random timeout, the chance of another simultaneous start is small. The expected number of rounds before a winner emerges is roughly 1 / (1 - p) where p is the per-round split probability — for a well-tuned cluster p < 0.05 and elections converge in 1-2 rounds.

This is one practical reason Raft clusters use odd sizes. A four-node cluster needs 3 of 4 votes for a majority — it tolerates exactly one failure, the same as a three-node cluster, while paying for an extra machine and making 2-2 split votes possible. A five-node cluster also needs 3 votes, but it tolerates two failures, and a 2-2 split between two candidates simply advances the term and tries again.
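
The quorum arithmetic behind this is two one-liners:

```python
def majority(n: int) -> int:
    """Votes needed to win an election in an n-node cluster."""
    return n // 2 + 1

def failures_tolerated(n: int) -> int:
    """Nodes that can be down while a majority is still reachable."""
    return n - majority(n)
```

Tabulating: 3 nodes need 2 votes and tolerate 1 failure; 4 nodes need 3 votes and still tolerate only 1 failure; 5 nodes need 3 votes and tolerate 2. The fourth node buys no fault tolerance — hence odd sizes.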

Voter rules

Two voter rules are doing the safety work:

At most one vote per term. A server records votedFor durably. If a second RequestVote arrives in the same term from a different candidate, the server rejects it. This is what guarantees at-most-one-leader-per-term: a leader needs a majority of distinct votes, and no server's vote is counted twice. Two simultaneous candidates cannot both get majorities, because the union of two majorities is more than N votes — pigeonhole says at least one server would have voted twice, which the durable votedFor rule prevents.

Why durable: imagine F3 votes for F1 in term 5, then crashes before fsync of votedFor. On restart, F3 thinks it has not voted yet. F2's RequestVote arrives and F3 votes for F2 too — both F1 and F2 now have majorities, both think they are leader, and the log forks. The fsync on votedFor is the safety barrier; running Raft without it is a correctness bug, not a performance issue.

Up-to-date log restriction. A voter rejects a candidate whose log is older than its own. Specifically, the candidate's (lastLogTerm, lastLogIndex) must be ≥ the voter's by lexicographic comparison.

This rule prevents a candidate that missed recent log entries from becoming leader and overwriting them. Suppose entry 7 was committed by the previous leader (replicated to a majority including F1, F2, F3 but not F4, F5). If F4 starts an election after the leader dies, it does not have entry 7. The voter rule says: F1, F2, F3 all see F4's lastLogIndex < 7 and reject. F4 cannot win. The next election will be won by some node that does have entry 7, and the committed entry survives.

We will return to this rule formally in chapter 102; for now the operational summary is: voters reject stale candidates, so the elected leader's log is always a superset of all committed entries.

Python — start_election and handle_request_vote

The election logic in production code is roughly 30 lines, plus the RPC plumbing.

import asyncio
import random
from dataclasses import dataclass, field

@dataclass
class LogEntry:
    term: int
    op: tuple

@dataclass
class VoteResponse:
    term: int
    vote_granted: bool

@dataclass
class Server:
    server_id: str
    peers: list[str]
    state: str = 'Follower'           # Follower | Candidate | Leader
    current_term: int = 0
    voted_for: str | None = None
    log: list[LogEntry] = field(default_factory=list)
    election_deadline: float = 0.0    # absolute monotonic time
    rpc: 'RPCClient' = None
    storage: 'PersistentStorage' = None  # writes currentTerm, votedFor

    @property
    def last_log_index(self) -> int:
        return len(self.log)

    @property
    def last_log_term(self) -> int:
        return self.log[-1].term if self.log else 0

    def reset_election_timer(self):
        # randomized in [150, 300] ms — the anti-split-vote knob
        self.election_deadline = (
            asyncio.get_event_loop().time() + random.uniform(0.150, 0.300)
        )

    async def start_election(self):
        """Follower or candidate noticed timeout; bump term and solicit votes."""
        self.state = 'Candidate'
        self.current_term += 1
        self.voted_for = self.server_id
        self.storage.persist(self.current_term, self.voted_for)  # fsync!
        self.reset_election_timer()
        term = self.current_term
        votes = 1  # self-vote

        # Snapshot log position now; do not let it change mid-election.
        last_idx, last_term = self.last_log_index, self.last_log_term

        async def ask(peer):
            try:
                resp = await self.rpc.request_vote(
                    peer, term=term, candidate_id=self.server_id,
                    last_log_index=last_idx, last_log_term=last_term,
                )
                return peer, resp
            except (asyncio.TimeoutError, ConnectionError):
                return peer, None

        pending = [asyncio.create_task(ask(p)) for p in self.peers]
        majority = (len(self.peers) + 1) // 2 + 1

        for fut in asyncio.as_completed(pending):
            peer, resp = await fut
            if self.state != 'Candidate' or self.current_term != term:
                return  # stepped down or moved on; abandon
            if resp is None:
                continue
            if resp.term > self.current_term:
                self.step_down(resp.term)
                return
            if resp.vote_granted:
                votes += 1
                if votes >= majority:
                    self.become_leader()
                    return

    def handle_request_vote(self, term, candidate_id, last_log_index, last_log_term):
        """Respond to RequestVote from a candidate."""
        # Step down if we see a newer term — universal rule.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.state = 'Follower'

        grant = False
        if term < self.current_term:
            grant = False  # stale candidate
        elif self.voted_for not in (None, candidate_id):
            grant = False  # already voted this term
        elif not self._candidate_log_up_to_date(last_log_index, last_log_term):
            grant = False  # candidate's log is behind
        else:
            grant = True
            self.voted_for = candidate_id
            self.reset_election_timer()  # granting a vote resets timer

        self.storage.persist(self.current_term, self.voted_for)  # fsync!
        return VoteResponse(term=self.current_term, vote_granted=grant)

    def _candidate_log_up_to_date(self, cand_last_idx, cand_last_term) -> bool:
        # Up-to-date iff (cand_last_term, cand_last_idx) >= (our_last_term, our_last_idx)
        if cand_last_term != self.last_log_term:
            return cand_last_term > self.last_log_term
        return cand_last_idx >= self.last_log_index

    def step_down(self, new_term):
        self.current_term = new_term
        self.voted_for = None
        self.state = 'Follower'
        self.storage.persist(self.current_term, self.voted_for)
        self.reset_election_timer()

    def become_leader(self):
        self.state = 'Leader'
        # ch.101: initialize nextIndex / matchIndex, start sending heartbeats

Three things in this code are load-bearing safety, not optimisation:

  1. The storage.persist(...) calls must fsync before any RPC reply is sent. Skipping them turns at-most-one-vote-per-term into a probabilistic guarantee, which is the same as no guarantee.
  2. The if self.state != 'Candidate' or self.current_term != term: return check after each vote response. If a higher-term message arrived mid-election, we have already stepped down; counting a stale vote and "winning" the old term is a bug.
  3. The reset_election_timer() inside handle_request_vote when we grant a vote. This prevents a follower from racing to start its own election while a candidate is mid-flight, soaking up votes that should go to the candidate.

Five-node cluster, leader fails, election unfolds

Setup: five nodes N1-N5. Leader is N1 in term 4. Log is committed up to index 12, all replicas in sync.

t=0 ms. N1 (leader, term 4) sends a heartbeat. All four followers receive and reset their election timers.

t=10 ms. N1's NIC fails. N1 is alive but unreachable.

t=60 ms. N1's heartbeat would have gone out; doesn't reach anyone.

t=150 ms - 300 ms. Each of N2-N5 has a randomized election deadline. Suppose:

  • N2: 187 ms
  • N3: 234 ms
  • N4: 161 ms ← shortest
  • N5: 278 ms

t=161 ms. N4's timer fires first.

  • N4: currentTerm: 4 → 5, state: Follower → Candidate, votedFor: self, fsync.
  • N4 broadcasts RequestVote(term=5, candidateId=N4, lastLogIndex=12, lastLogTerm=4).
  • N4 vote count: 1 (self). Needs 3 for majority.

t=163 ms (intra-DC ~1ms RTT). RequestVote arrives at N2, N3, N5.

  • N2: currentTerm: 4 → 5, votedFor: N4, fsync. Sends grant. Resets election timer.
  • N3: same. fsync, grant.
  • N5: same. fsync, grant.
  • N1 also reachable? No, N1's NIC is dead — RequestVote to N1 times out.

t=165 ms. Vote responses arrive at N4.

  • From N2: granted. Count = 2.
  • From N3: granted. Count = 3 ≥ majority. N4 becomes leader of term 5.
  • N4 immediately sends heartbeat AppendEntries(term=5, ...) to N2, N3, N5.

t=167 ms. Heartbeats arrive. N2, N3, N5 see term=5 from a leader, reset election timers, accept N4 as leader.

t=170 ms onwards. N4 starts processing client writes. The log resumes; entries committed in term 4 are still committed (the up-to-date-log rule guaranteed N4 had them); new entries get term 5.

Total downtime: ~157 ms — from the moment N1's NIC failed (t=10 ms) to the moment N4 was sending heartbeats (t=167 ms). Clients that had requests in flight to N1 will see timeouts and retry; the Raft client library transparently routes retries to N4 once a leader is found.

If two timers had fired close together — say N4 at 161 and N2 at 165 — and N2 became a candidate while N4's RequestVotes were still in flight, the result depends on which RPCs arrived first at each voter. Often one candidate wins outright. Sometimes no one wins (split vote), term 5 expires, both step down with election deadline reset, and term 6 begins. Either way, the cluster reaches a steady state in milliseconds, not seconds.

When N1's NIC eventually recovers — say at t=10 seconds — N1 wakes up still believing it is leader of term 4. It sends a heartbeat with term=4. Followers respond with term=5. N1 sees the higher term, immediately steps down to follower in term 5, and asks N4 to ship it the missing log entries. No data loss, no human intervention.

Going deeper

Asymmetric election timeouts and the "Pre-Vote" extension

Raft's basic protocol has a disruption problem: a follower isolated by a partition repeatedly times out and increments its term, and when the partition heals its inflated term forces the live leader to step down. The Pre-Vote phase, added in Ongaro's PhD thesis and standard in production implementations (etcd, Consul, CockroachDB), splits the election into two phases: a "would you vote for me?" probe that does not bump the term, followed by the real RequestVote only if the probe succeeds. A partitioned follower's PreVote is rejected (voters still hearing heartbeats from a live leader reply no), so its term never advances and it cannot disrupt the live cluster. Pre-Vote slightly increases election latency in the common case and dramatically improves stability under partial partitions.

Leadership leases and read scaling

A leader sitting idle still has to send heartbeats to retain leadership, and a follower will assume it's gone the moment heartbeats stop. Some Raft variants (CockroachDB's leases, etcd's leader lease) extend this with a time-bounded leadership lease — a leader, once elected, holds the leadership exclusively until the lease expires, even without explicit heartbeats. The benefit: leader-local reads can return without consulting a quorum, because the lease guarantees no other leader exists. Cost: tighter clock synchronization assumptions. Spanner takes the same idea further with TrueTime.

The Joint Consensus protocol for membership changes

Adding or removing a server from the cluster is non-trivial: changing N changes the majority, and the transition window can have two overlapping majorities that disagree. Raft's joint consensus (Section 6 of the Raft paper) uses a transitional configuration that requires majorities of both old and new configurations during the change. Production implementations more often use the simpler one-server-at-a-time approach, where each addition or removal can never produce two disjoint majorities.

Why Multi-Paxos uses a "stable leader" optimization

Pure Paxos has no leader concept; every consensus instance starts fresh. Multi-Paxos (the production form) recognizes that having a stable leader who runs phase-2 directly avoids the phase-1 round, halving message complexity. The "stable leader" in Multi-Paxos is conceptually the same as Raft's leader, but phrased differently: in Paxos, the leader is the proposer who has won phase-1 for a range of slots; in Raft, the leader is the unique server elected for a term. The mechanics differ; the bottleneck-throughput analysis is similar.

Where this leads next

The election layer is now a black box: at any moment, either there is one elected leader or the cluster is between leaders, and elections converge in milliseconds. With a leader in place, chapter 101 (log replication and commit index) turns to what the leader actually does: take client operations, append them to the log, replicate to followers, and decide when an entry is committed.

Two safety properties from this chapter feed forward into log replication:

  1. At most one leader per term — proved by at-most-one-vote-per-term plus majority quorum.
  2. Elected leader has every committed entry — established informally by the up-to-date-log rule; chapter 102 proves it formally as the Leader Completeness Property.

Together they mean the leader's log is, by construction, a safe extension of the cluster's committed log. Chapter 101 shows how the leader maintains and extends that log; chapter 102 walks through the partition and crash scenarios where these invariants are tested.

The single sentence to carry forward: terms partition history into disjoint epochs; each term has at most one leader; elected leaders have every committed entry — and randomized timeouts make achieving all three a routine, not an art.

References

  1. Ongaro and Ousterhout, In Search of an Understandable Consensus Algorithm (Extended Version), USENIX ATC 2014 — the canonical Raft paper. Sections 5.1-5.2 cover election and terms in detail; this chapter follows their structure.
  2. Ongaro, Consensus: Bridging Theory and Practice, Stanford PhD Dissertation 2014 — the deep version. Chapter 4 covers Pre-Vote, leadership transfer, and election edge cases the conference paper underspecified.
  3. Howard, Schwarzkopf, Madhavapeddy, Crowcroft, Raft Refloated: Do We Have Consensus?, ACM Operating Systems Review 49(1), 2015 — independent re-implementation; Section 4 walks through the election edge cases that bit the implementers.
  4. Lamport, Paxos Made Simple, ACM SIGACT News 32(4), 2001 — for comparison, the leaderless and Multi-Paxos formulations of the same problem.
  5. etcd Raft implementation — github.com/etcd-io/raft — production Go implementation with PreVote, leadership transfer, and the membership-change protocol; the canonical reference for "what does this look like at scale".
  6. The Raft visualization at raft.github.io — interactive simulator that animates elections, partitions, and split votes; spend ten minutes here before reading the paper for an unreasonably effective intuition boost.