In short
Raft did not invent consensus. It is the most recent — and by far the most teachable — entry in a 30-year argument that began with Leslie Lamport's "The Part-Time Parliament" (written 1989, published 1998) and ran through Paxos Made Simple (2001), Multi-Paxos folklore in Google's Chubby and Spanner, and ZAB inside Apache ZooKeeper (2008). Raft (Ongaro and Ousterhout, 2014) is best understood not as a new idea but as a specification: it took the same safety properties Paxos provides and rewrote them as a single, end-to-end protocol an engineer can implement from the paper without folklore.
Single-decree Paxos decides one value among proposers, acceptors, and learners using a two-phase protocol: Prepare/Promise (acceptors promise not to accept lower-numbered proposals and report any value they have already accepted) and Accept/Accepted (the proposer picks a value consistent with what acceptors reported and asks them to accept it). Proposal numbers are the entire ordering mechanism — distinct, monotonic, and totally ordered across proposers. Paxos is correct, but reading the original paper and producing a working multi-decree replicated log from it is a multi-year journey; engineers at Google famously wrote a separate paper, Paxos Made Live, just about the gap between the algorithm and a system.
Multi-Paxos is the production form: pick a stable leader, run Phase 1 once for a range of slots, then for each subsequent value just run Phase 2. The leader optimisation halves message complexity and is conceptually identical to Raft's leader. But Multi-Paxos has no canonical specification — every team's leader election, log replication, membership change, and snapshot story is bespoke.
ZAB (ZooKeeper Atomic Broadcast) is a separate algorithm with the same shape: primary-backup with epoch numbers, tightly integrated with ZooKeeper's snapshot and recovery flow. It exists because the ZooKeeper authors found "use Paxos" insufficient as a design.
What Raft simplified: strong leader from the start (no leaderless mode), election and log replication specified jointly, membership change as a designed sub-protocol (joint consensus), snapshots and log compaction in the spec, and one mental model — terms, logs, leaders — applied uniformly to all sub-problems. The safety properties are the same as Paxos. The pedagogical and engineering experience is dramatically different.
You have learned Raft. You can implement leader election, log replication, the safety invariants, snapshots, and membership change from the previous chapters in this part. Before you build chapter 106's Multi-Raft sharding layer or chapter 107's etcd-vs-ZooKeeper service comparison, you owe yourself a tour of the algorithms Raft replaced. Two reasons. First, etcd and Consul speak Raft, but ZooKeeper speaks ZAB and Spanner speaks Multi-Paxos — if you operate or read the source of distributed systems built before 2014, you will encounter Paxos terminology constantly. Second, the design choices Raft made stop looking arbitrary once you see the alternatives. "Strong leader" is not just one valid choice; it is the choice that fixed the problem the field had been quietly working around for two decades.
This chapter walks the historical arc. Single-decree Paxos first, with a worked example — three acceptors and two competing proposers — that shows how the protocol survives the worst case the original paper concerns itself with. Then Multi-Paxos as the production optimisation. Then ZAB. Then a precise account of what Raft simplified, with a mapping table that names each Paxos concept and gives its Raft equivalent.
The 30-year context
Leslie Lamport submitted "The Part-Time Parliament" to ACM TOCS in 1990. He framed the algorithm as the legislative protocol of the lost Greek island of Paxos, with archaeologists piecing together how priests had agreed on legal decrees despite frequent absences. Reviewers asked him to remove the parable. He refused. The paper was rejected. It was finally published in 1998 with the parable intact, which makes the claim that "people found Paxos hard to understand" somewhat circular.
In 2001 Lamport published Paxos Made Simple — a five-page restatement without the parable — and the algorithm started spreading. Google built Chubby (Burrows, 2006), a coordination service used internally for everything from GFS leader election to Bigtable's lock service, on top of Multi-Paxos. The Chubby team then published Paxos Made Live (Chandra, Griesemer, Redstone, 2007), which is essentially a list of every gap between the theory and a working system: log compaction, membership change, master leases, disk corruption, the works. The paper is a reasoned plea for a better specification.
In 2008, the ZooKeeper authors at Yahoo (Hunt, Konar, Junqueira, Reed) needed a consensus protocol for their coordination service. Rather than implement Multi-Paxos and re-derive Paxos Made Live on their own, they designed ZAB — ZooKeeper Atomic Broadcast — from scratch around their actual requirement: a primary that totally orders state changes and reliably broadcasts them to backups, with crash recovery that preserves order across primary changes. ZAB and Multi-Paxos solve overlapping problems with different vocabulary; Junqueira's 2011 paper formally compares them.
In 2014, Diego Ongaro and John Ousterhout published In Search of an Understandable Consensus Algorithm, the Raft paper. Their stated design constraint was understandability. They ran user studies measuring how well students learned Raft versus Paxos (Raft won by a wide margin), and they specified election, replication, membership change, and snapshots end to end, in a single paper. Within five years etcd, Consul, CockroachDB, TiKV, and dozens of other production systems were running Raft. The paper changed the conversation: consensus stopped being a cited primitive and became an implementable one.
This chapter assumes you know Raft. It walks back through Paxos with that knowledge in hand, which is a far easier path than the historical one.
Single-decree Paxos
Paxos in its original form decides exactly one value. Not a log, not a sequence — one immutable decision. Multi-decree consensus (a replicated log) is built by running many independent Paxos instances, one per log slot, with optimisations.
Three roles participate. A given server can play any combination, and in production all three are typically co-located, but the algorithm separates them.
- Proposer — wants to get a value chosen. Initiates rounds.
- Acceptor — votes on proposals. The persistent state of the system lives in acceptors. A majority of acceptors must accept a value for it to be chosen.
- Learner — observes which value was chosen and acts on it (e.g. applies it to a state machine).
The protocol uses proposal numbers as the ordering mechanism. A proposal number is a globally unique, totally ordered identifier — typically a (round, proposer_id) pair compared lexicographically, so two proposers cannot collide. Higher proposal numbers always supersede lower ones in an acceptor's view.
The full protocol in five steps:
Phase 1a — Prepare. A proposer picks a proposal number n higher than any it has used before and broadcasts Prepare(n) to (at least) a majority of acceptors.
Phase 1b — Promise. An acceptor receiving Prepare(n) checks whether n is greater than any prepare number it has previously promised. If yes, it promises never to accept any proposal numbered less than n, persists n to disk, and replies Promise(n, prevN, prevV), where (prevN, prevV) is the highest-numbered proposal it has previously accepted (or null if none). If n is not higher, the acceptor ignores the prepare (or replies with the higher number it has seen, as an optimisation).
Phase 2a — Accept. The proposer waits for promises from a majority of acceptors. If any of them reported a previously-accepted (prevN, prevV), the proposer must propose the prevV from the response with the highest prevN. Only if all promises reported null may the proposer choose its own value v. It then broadcasts Accept(n, v) to a majority.
Phase 2b — Accepted. An acceptor receiving Accept(n, v) checks whether it has promised any number greater than n. If not, it accepts: persists (n, v) to disk and replies Accepted(n, v) to the proposer and (typically) to the learners.
Chosen. Once a majority of acceptors have accepted (n, v), the value v is chosen. This is a global property — no single participant can unilaterally observe it, but learners can detect it by counting Accepted messages. Once chosen, v is the decree of this Paxos instance forever; no future round can change it.
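The acceptor side of the five steps can be sketched in a few lines of Python. This is an illustrative model, not any production implementation: the class and method names are invented for this sketch, and persistence is reduced to in-memory fields (a real acceptor must fsync its promised number and accepted proposal before replying).

```python
# Minimal single-decree Paxos acceptor sketch. Proposal numbers are
# (round, proposer_id) tuples compared lexicographically.

class Acceptor:
    def __init__(self):
        self.promised_n = None            # highest Prepare(n) promised
        self.accepted_n = None            # highest proposal accepted
        self.accepted_v = None            # value of that proposal

    def handle_prepare(self, n):
        """Phase 1b: promise n if it exceeds every prior promise."""
        if self.promised_n is None or n > self.promised_n:
            self.promised_n = n           # must persist before replying
            return ("promise", n, self.accepted_n, self.accepted_v)
        return ("nack", self.promised_n, None, None)

    def handle_accept(self, n, v):
        """Phase 2b: accept unless we have promised a higher number."""
        if self.promised_n is None or n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, v   # must persist
            return ("accepted", n, v)
        return ("nack", self.promised_n, None)
```

Note the asymmetry the text describes: Phase 1b requires n strictly greater than the promise, while Phase 2b accepts n equal to the promise — the proposer that won Phase 1 must be able to complete Phase 2 with the same number.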
Why Phase 1 exists at all: Without it, a proposer that just broadcasts Accept(n, v) could overwrite a value that was already chosen. The Promise phase forces the proposer to learn whether any value might have already been accepted by some acceptors, and to re-propose that value if so. Phase 1 is how Paxos preserves previously-chosen values across competing proposers without needing a leader.
Why proposal numbers must be globally unique: If two proposers could pick the same n, an acceptor could promise both, and the safety argument (that a higher-numbered proposer learns about lower-numbered accepted values) collapses. The standard construction is n = (round_counter, proposer_id) compared lexicographically — proposer_id breaks ties, and each proposer increments its round_counter past any it has heard about.
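The standard construction is small enough to show directly. A sketch, with an invented helper name (`next_number`); Python's lexicographic tuple comparison gives exactly the ordering the text requires:

```python
# (round_counter, proposer_id) proposal numbers. next_number advances
# the round past anything this proposer has observed, so its next
# proposal supersedes every number it has seen.

def next_number(highest_seen, my_id):
    round_counter = highest_seen[0] + 1 if highest_seen else 1
    return (round_counter, my_id)

# Two proposers in the same round cannot collide: the id breaks the tie.
n1 = next_number(None, "P1")       # (1, "P1")
n2 = next_number(None, "P2")       # (1, "P2")
assert n1 != n2 and n2 > n1        # "P2" > "P1" lexicographically

# After hearing n2, P1 jumps past it.
n3 = next_number(n2, "P1")         # (2, "P1")
assert n3 > n2
```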
The protocol's correctness hinges on a single non-obvious fact: any two majorities of N acceptors intersect in at least one acceptor (each majority has at least ⌊N/2⌋ + 1 members, and 2(⌊N/2⌋ + 1) > N, so by pigeonhole they must share a member). If value v was chosen with proposal number n1 by majority M1, and a later proposer with n2 > n1 collects promises from majority M2, then M1 ∩ M2 contains at least one acceptor that accepted (n1, v) and is now promising the n2 proposer — and that acceptor's Promise message will report (n1, v), forcing the new proposer to re-propose v. So once chosen, a value cannot be replaced.
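The intersection fact is easy to verify exhaustively for small clusters. A brute-force sketch:

```python
# Check the pigeonhole fact the safety argument rests on: any two
# majorities of N acceptors share at least one member.
from itertools import combinations

def majorities(n):
    quorum = n // 2 + 1
    return [set(c) for c in combinations(range(n), quorum)]

for n in (3, 5, 7):
    for m1 in majorities(n):
        for m2 in majorities(n):
            assert m1 & m2, f"disjoint majorities for N={n}"
print("every pair of majorities intersects for N = 3, 5, 7")
```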
Single-decree Paxos: three acceptors, two competing proposers
The setup. Three acceptors A1, A2, A3. Two proposers P1 and P2 racing to propose different values: P1 wants to propose "X", P2 wants "Y". Majority is 2 of 3.
Proposal numbers will be tuples (round, proposer_id) compared lexicographically. P1 uses round 1, so its number is n1 = (1, P1). P2 uses round 1, so n2 = (1, P2). We will say P2 > P1 for tie-breaking, so n2 > n1. (In practice the round is the load-bearing piece; the proposer id just breaks ties.)
Round 1 — P1 starts first.
P1 broadcasts Prepare((1, P1)) to A1, A2, A3.
A1 has never promised before. It promises (1, P1), persists, replies Promise((1, P1), prev=null).
A2 same — promises, persists, replies Promise((1, P1), prev=null).
A3's network is slow; the Prepare hasn't arrived yet.
P1 has two promises (a majority). Both report prev=null, so P1 is free to propose its own value "X". P1 broadcasts Accept((1, P1), "X") to A1, A2, A3.
A1 receives Accept((1, P1), "X"). It has promised (1, P1), and (1, P1) is not less than its promise, so it accepts. Persists ((1, P1), "X") to disk. Replies Accepted.
Round 1 — P2 interleaves before A2's Accept arrives.
Before the Accept reaches A2, P2 broadcasts Prepare((1, P2)) to A1, A2, A3.
A1 sees (1, P2) > (1, P1). It updates its promise to (1, P2), persists, and replies Promise((1, P2), prev=((1, P1), "X")) — it must report the value it has already accepted.
A2 sees (1, P2) > (1, P1). A2 has not yet accepted anything (the Accept from P1 is still in flight). A2 promises (1, P2), persists, replies Promise((1, P2), prev=null).
A3 finally receives both Prepares; the higher one wins. A3 promises (1, P2), replies Promise((1, P2), prev=null).
P2 has three promises. One of them — from A1 — reports a previously-accepted value "X" with proposal number (1, P1). The Paxos rule is: P2 must propose "X" (the value reported by the highest-numbered prior accept), not its own preferred value "Y".
P2 broadcasts Accept((1, P2), "X") (not "Y"!) to A1, A2, A3.
Meanwhile P1's Accept((1, P1), "X") arrives at A2. A2 has now promised (1, P2), and (1, P1) < (1, P2), so A2 rejects the late Accept. P1's round fails.
Round 1 — P2's accepts complete.
A1 sees Accept((1, P2), "X"). Its promise is (1, P2), the proposal number matches, accepts. Replies Accepted.
A2 same — accepts. Replies Accepted.
A3 same — accepts. Replies Accepted.
The value "X" is now chosen by majority (in fact unanimously). Notice what happened: P2 wanted "Y", but the protocol forced P2 to propose "X" because A1 had already accepted it from P1. Even though P1 never collected a majority of Accepts for "X", the single A1 accept was enough — once P2's Promises included A1's report, P2 was bound to propose "X". The value chosen is unique even when proposers race.
If P2 had started before any acceptor accepted P1's value, P2 would have collected three null-prev Promises and been free to propose "Y". The race would resolve to whichever proposer "got there first" in the sense of getting a value accepted by at least one acceptor whose Promise then reaches the other proposer.
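The worked trace above can be replayed as an executable sketch. This is a toy model (all names invented; no real network, persistence, or failure handling): P1's Accept reaches only A1 before P2's higher-numbered Prepare sweeps all three acceptors, and the Phase-2a rule then binds P2 to "X" instead of "Y".

```python
# Replay of the two-proposer race from the worked example.

class Acceptor:
    def __init__(self):
        self.promised = None
        self.accepted = None               # (n, v) or None

    def prepare(self, n):                  # Phase 1b
        if self.promised is None or n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, n, v):                # Phase 2b
        if self.promised is None or n >= self.promised:
            self.promised = n
            self.accepted = (n, v)
            return "accepted"
        return "nack"

def choose_value(prevs, my_value):
    """Phase 2a: re-propose the value with the highest prior n, if any."""
    prevs = [p for p in prevs if p is not None]
    return max(prevs)[1] if prevs else my_value

a1, a2, a3 = Acceptor(), Acceptor(), Acceptor()
n1, n2 = (1, "P1"), (1, "P2")

# P1's Prepare reaches A1 and A2 (A3 is slow); both report prev=None.
assert a1.prepare(n1) == ("promise", None)
assert a2.prepare(n1) == ("promise", None)

# P1's Accept reaches only A1 before P2 interleaves.
assert a1.accept(n1, "X") == "accepted"

# P2's higher-numbered Prepare reaches all three; A1 reports (n1, "X").
prevs = [a.prepare(n2)[1] for a in (a1, a2, a3)]

# The rule binds P2 to "X" even though it wanted "Y".
v = choose_value(prevs, "Y")
assert v == "X"

# P1's late Accept at A2 is rejected; P2's accepts complete.
assert a2.accept(n1, "X") == "nack"
assert all(a.accept(n2, v) == "accepted" for a in (a1, a2, a3))
```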
Why Paxos is hard to implement
Single-decree Paxos chooses one value. A real system needs a log — a sequence of values, applied in order to a state machine. The naive approach is to run a separate Paxos instance per log slot. This works in principle but raises a long list of practical issues that the original paper does not address:
- How does a proposer learn which slot to propose into? (You need a separate gap-detection mechanism.)
- What if two proposers propose to the same slot — does that wedge the slot forever? (No — safety is preserved, but dueling proposers can livelock until one backs off, and the latency hit is real.)
- Where do proposal numbers come from across instances? (Separate per instance? Shared? The paper is silent.)
- What about membership changes? (The paper has a footnote saying "use Paxos to choose the new membership" which is famously unimplementable as written.)
- How do you compact the log? (Not in the paper.)
- What about leader leases for fast reads? (Not in the paper.)
- How do learners discover the chosen value efficiently? (Not in the paper.)
This is why Google's Chubby team wrote Paxos Made Live — a paper-length list of the things they had to invent to ship Paxos. Many of those inventions were folklore that did not make it back into a unified specification, and every team that built on Paxos rebuilt the same wheels in slightly different shapes.
Multi-Paxos
The single most important optimisation is stable leadership. Observation: if one proposer keeps winning Phase 1 round after round, the Phase 1 traffic is wasted — every Promise just says "no prior accepted value" (because no other proposer has accepted anything in the meantime). So have one proposer run Phase 1 once, for an entire range of future log slots, and then for each new client value just run Phase 2 (Accept/Accepted).
This proposer is the leader. It is elected by some external mechanism — sometimes another instance of Paxos, often a simpler bully algorithm or a lease scheme. Once elected, the leader runs a single Phase 1 with a high proposal number that covers all future slots, then for each client request just sends Accept and waits for a majority.
Multi-Paxos with a stable leader is operationally identical to Raft: leader appends to its log, replicates to followers (acceptors), commits when a majority acks. The differences are presentational and definitional, not algorithmic. Where Raft says "term", Multi-Paxos says "proposal number range". Where Raft says "AppendEntries", Multi-Paxos says "Accept". Where Raft says "leader election", Multi-Paxos says "Phase 1 with a new high proposal number".
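The amortisation can be sketched end to end. A toy model (names invented, no failure handling), assuming the leader has already won a single Phase 1 with a ballot covering all future slots: each client value then costs one Phase-2 round trip to a majority.

```python
# Stable-leader Multi-Paxos sketch: Phase 1 once, Phase 2 per value.

class Follower:                    # a Paxos acceptor; Raft says "follower"
    def __init__(self):
        self.promised = None
        self.log = {}              # slot -> (ballot, value)

    def prepare(self, ballot):     # Phase 1, run once per leadership
        if self.promised is None or ballot > self.promised:
            self.promised = ballot
            return True
        return False

    def accept(self, ballot, slot, value):   # Phase 2, once per value
        if self.promised is None or ballot >= self.promised:
            self.log[slot] = (ballot, value)
            return True
        return False

followers = [Follower() for _ in range(3)]
ballot = (1, "L")                  # the leader's high proposal number

# Phase 1 once, covering every future slot.
assert sum(f.prepare(ballot) for f in followers) >= 2

# Each client value is now a single Accept round (Raft: AppendEntries).
for slot, value in enumerate(["set x=1", "set y=2", "del x"]):
    acks = sum(f.accept(ballot, slot, value) for f in followers)
    assert acks >= 2               # majority acked -> committed
```

A competing leader would have to run Phase 1 with a higher ballot, which invalidates this leader's promises — exactly the role a higher term plays in Raft.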
What Multi-Paxos does not specify: how to elect the leader, how to handle log holes when the leader changes, how to do membership changes, how to take snapshots, how to recover after a leader crash. Every implementation invents these. Paxos Made Moderately Complex (van Renesse and Altinbuken, 2015) is a roughly 50-page attempt to specify Multi-Paxos completely; it is enlightening reading, and it is also several times the length of the Raft paper while covering less ground.
ZAB — ZooKeeper Atomic Broadcast
ZooKeeper, written at Yahoo in 2008 to be Hadoop's coordination service, needed reliable total-order broadcast: a primary serializes all writes, totally orders them, and broadcasts them to backups. Reed and Junqueira designed ZAB for this purpose rather than adapt Multi-Paxos.
ZAB has two modes: broadcast (steady state — primary ships transactions to backups, who ack) and recovery (after a primary crash — backups elect a new primary, which then synchronizes with backups so they all share the same prefix before broadcast resumes).
The vocabulary differs from Paxos and Raft. Each transaction carries a zxid (ZooKeeper transaction id), a 64-bit identifier with two fields: the upper 32 bits are the epoch number (incremented at each new primary election), the lower 32 bits are the counter within that epoch. Total ordering of transactions is just numerical ordering of zxids. The epoch is ZAB's analogue of Paxos's proposal number and Raft's term.
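The zxid layout described above fits in a few lines. A sketch (helper names invented): epoch in the upper 32 bits, per-epoch counter in the lower 32, so numerical comparison of the packed value is exactly (epoch, counter) lexicographic order.

```python
# zxid = 64-bit id: high 32 bits epoch, low 32 bits per-epoch counter.

def make_zxid(epoch, counter):
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def epoch_of(zxid):
    return zxid >> 32

def counter_of(zxid):
    return zxid & 0xFFFFFFFF

z1 = make_zxid(epoch=4, counter=1000)
z2 = make_zxid(epoch=5, counter=0)   # first txn after a new election

assert epoch_of(z1) == 4 and counter_of(z1) == 1000
# Any zxid of a later epoch orders after every zxid of an earlier one.
assert z2 > z1
```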
ZAB's recovery is conceptually simpler than Multi-Paxos's because it is synchronous: before the new primary accepts any client write, it waits for backups to converge to the same prefix. This is more rigid than Raft's approach (Raft lets the new leader resume client writes immediately and asynchronously catches up lagging followers) but easier to reason about. In return ZAB has slightly higher recovery latency.
ZAB is tightly coupled to ZooKeeper's data model (the znode tree) and snapshot format. It is not generally extracted as a reusable library — if you want ZAB, you run ZooKeeper. This is the opposite of Raft, which the etcd team explicitly designed as a reusable Go library (github.com/etcd-io/raft) that other systems embed.
What Raft simplified
Raft's contribution is not a new safety property. Paxos's safety properties are sound; ZAB's are sound; Raft's are sound; they all guarantee the same thing — agreement on a totally ordered log despite minority failures and arbitrary network reordering. What Raft simplified was the specification and the mental model.
Five concrete simplifications, in order of impact:
1. Strong leader from the start. Paxos's original presentation is leaderless — any proposer can start a round at any time. The "stable leader" is an optimisation bolted on top, with no canonical specification. Raft baked the leader in from the first sentence: at any moment there is at most one leader; all client writes flow through the leader; if no leader exists, the cluster is unavailable until election completes. This single decision eliminates entire classes of edge cases (concurrent proposers stepping on each other, log holes from skipped proposers, fancy phase-1 amortisation tricks).
2. Election and log replication specified jointly. Multi-Paxos and ZAB both leave the leader-election protocol underspecified or in a separate document. Raft specifies them as one protocol with shared state — the term integer that election produces is the same term integer that AppendEntries carries. The up-to-date-log restriction on votes is what guarantees Leader Completeness; election and replication are not separate, they are interlocking.
3. Membership changes as a designed sub-protocol. Adding or removing a server from the cluster is a problem every consensus system must solve, but Paxos and ZAB treat it as an exercise. Raft introduced joint consensus (a transition configuration that requires majorities of both old and new memberships), and later production simplifications (one-server-at-a-time changes) are still derived from the joint-consensus framework. The point is not that Raft's membership-change protocol is novel — it is that the Raft paper treats it as part of consensus, not a footnote.
4. Snapshots and log compaction in the spec. Real systems cannot retain logs forever; they snapshot the state machine periodically and discard log prefixes older than the snapshot. Paxos's original paper does not address this; Paxos Made Live spends pages on it. Raft includes InstallSnapshot as a protocol RPC and specifies snapshot semantics, including how a leader sends a snapshot to a lagging follower whose required log prefix has already been compacted.
5. One mental model for everything. Raft uses three concepts — terms, logs, leaders — and applies them uniformly to election, replication, safety, membership change, snapshots, and client interaction. Paxos uses different vocabularies for different problems (proposal numbers for safety, leases for read scaling, separate Paxos instances for membership). ZAB uses transaction-log vocabulary for replication and a separate election vocabulary for recovery. The cognitive load of holding the algorithm in your head is dramatically lower with Raft.
The Raft paper measured this: students learned Raft to a higher degree of confidence in less time than they learned Paxos. The user study results are not subtle.
The mapping table
The same concept appears in all four algorithms under different names. The table below is the Rosetta stone — read across to translate between any two.

| Concept | (Multi-)Paxos | ZAB | Raft |
|---|---|---|---|
| Logical clock that fences stale leaders | proposal number (ballot) | epoch | term |
| Node that serializes writes | stable proposer (leader) | primary | leader |
| Replication message | Phase 2 Accept(n, v) | transaction broadcast | AppendEntries |
| Taking over leadership | Phase 1 with a new high proposal number | recovery (election plus new epoch) | RequestVote election |
| Position in the log | slot (instance number) | zxid counter (low 32 bits) | log index |
| Commit point | majority Accepted | majority of backup acks | majority replication |
Why this mapping matters operationally: When you read the ZooKeeper source and see "epoch", you can think "Raft term" and the algorithm reads cleanly. When you read Spanner's design and see "Paxos group", you can think "Raft cluster running Multi-Paxos as the consensus algorithm". The vocabulary differences are accidents of history; the underlying algorithm is the same.
Where each is used in production
Paxos (and Multi-Paxos): Google's Chubby (the original Multi-Paxos production system) is the foundation for most of Google's coordination needs — used by GFS for primary election, by Bigtable for tablet locking, by every major Google service for leader election. Spanner uses Multi-Paxos as the consensus algorithm for its Paxos groups (one per shard). Megastore (Google's pre-Spanner replicated data store) used Paxos directly. Apache Cassandra's lightweight transactions use a single-decree Paxos. Outside Google, Microsoft's Azure Cosmos DB uses a Paxos variant. Most large cloud providers have an internal Multi-Paxos implementation somewhere.
ZAB: Apache ZooKeeper is the canonical user. Apache Kafka used ZooKeeper for cluster coordination from launch through 2021, when KIP-500 (the "Kafka Raft Metadata mode", or KRaft) replaced ZooKeeper with an internal Raft implementation — a notable migration in the field's algorithm preferences. HBase, Solr, NiFi, and many other Apache projects depend on ZooKeeper and therefore ZAB. India's National Stock Exchange (NSE) has used ZooKeeper for service discovery in some of its trading-system back-ends.
Raft: etcd (the Kubernetes control plane store), HashiCorp Consul, CockroachDB (one Raft group per range, hundreds of thousands per cluster), TiKV (the storage layer of TiDB; used heavily in Indian fintech for OLTP scale-out), MongoDB (replication protocol since 3.2 is "Raft-like"), RethinkDB, Hazelcast, RabbitMQ's quorum queues, Apache Kafka's KRaft mode. The Raft library etcd-io/raft is embedded in dozens of production systems. If you encounter consensus in a system written after 2015, the default assumption should be Raft.
The migration trajectory is unmistakable: new systems pick Raft; older systems with Paxos or ZAB stay on what they have unless a major rewrite forces a re-evaluation. Kafka's KRaft migration is the most prominent example of an existing system moving from ZAB to Raft.
Going deeper
EPaxos and leaderless consensus
Egalitarian Paxos (Moraru, Andersen, Kaminsky, 2013) is one of the more interesting post-Paxos designs. It eliminates the leader bottleneck by letting any replica accept commands, then uses a dependency-tracking mechanism to determine commit order across non-conflicting commands in parallel. EPaxos achieves lower latency than leader-based protocols when the workload has many independent commands, at the cost of significant protocol complexity. It has not seen wide production adoption — the complexity tax is real and the leader bottleneck is rarely the actual constraint.
Flexible Paxos and grid quorums
Flexible Paxos (Howard, Malkhi, Spiegelman, 2016) observes that Paxos's two phases can use different quorum systems, as long as Phase-1 and Phase-2 quorums always intersect. This allows asymmetric configurations — for example, fast Phase-2 (commit) using small quorums in exchange for slower Phase-1 (leader change) using larger quorums. The technique generalises to grid quorums for very large clusters and underlies some Spanner deployment optimisations.
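The Flexible Paxos condition can be checked directly. A sketch (function name invented): Phase-1 and Phase-2 quorum sizes q1 and q2 need only satisfy q1 + q2 > n, which guarantees every Phase-1 quorum intersects every Phase-2 quorum; neither needs to be a majority on its own.

```python
# Flexible Paxos quorum condition: q1 + q2 > n.

def valid_config(n, q1, q2):
    return q1 + q2 > n

# Classic Paxos: both phases use majorities.
assert valid_config(n=5, q1=3, q2=3)

# Flexible: commit (Phase 2) with 2 of 5, at the cost of needing
# 4 of 5 for leader change (Phase 1).
assert valid_config(n=5, q1=4, q2=2)

# Invalid: 2 + 3 = 5 is not > 5, so disjoint quorums are possible
# and a new leader could miss a committed value.
assert not valid_config(n=5, q1=2, q2=3)
```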
Why Raft does not formally subsume Multi-Paxos
A common claim is "Raft is just Multi-Paxos with a strong leader." This is approximately true but not exactly. Raft has a stricter log-matching property than Multi-Paxos: a Raft follower's log is always a prefix of the leader's, while a Multi-Paxos acceptor's log can have holes (slots accepted out of order). Raft's stronger property simplifies recovery and reasoning but constrains the leader from accepting client requests faster than the slowest follower; Multi-Paxos can pipeline more aggressively. The performance difference is small in practice; the simplicity difference is large.
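The log-shape difference can be illustrated with a toy check (slot contents here are hypothetical, chosen only to show the shapes): a Multi-Paxos acceptor may hold accepted slots out of order, leaving holes, while a Raft follower's log is always a gapless prefix of the leader's.

```python
# Illustration of Multi-Paxos log holes vs Raft's prefix property.

leader_log = ["a", "b", "c", "d"]

# Multi-Paxos acceptor: slots 0 and 2 accepted, slot 1 still in flight.
paxos_acceptor = {0: "a", 2: "c"}
has_hole = any(s not in paxos_acceptor
               for s in range(max(paxos_acceptor) + 1))
assert has_hole            # recovery must fill slot 1 before applying 2

# Raft follower: always a contiguous prefix of the leader's log.
raft_follower = leader_log[:2]
assert raft_follower == leader_log[:len(raft_follower)]
```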
The Heidi Howard formalisation
Heidi Howard's PhD thesis and subsequent papers, particularly Distributed Consensus Revised (2019), formalise the relationship between Paxos, Raft, and other consensus algorithms in a unified framework. The result is that all of these algorithms are instances of a more general scheme parameterised by quorum systems, leader stability, and log structure. If you want to stop arguing about algorithm choice and start reasoning about consensus from first principles, Howard's framework is the cleanest entry point.
Where this leads next
You now hold the full historical context: Raft is one point in a 30-year design space, distinguished by its specification quality and pedagogical clarity rather than by novel safety properties. The next chapter (Multi-Raft: sharding consensus, CockroachDB style) takes Raft and asks the production-scale question: what if you have a million pieces of data and want to spread the consensus load across hundreds of Raft groups, each replicating a slice of the data independently? Chapter 107 (etcd and ZooKeeper as services) puts the ZAB-vs-Raft choice into operational terms: which one do you actually run, and why.
The single sentence to carry forward: Paxos established the safety properties; Multi-Paxos and ZAB hardened them for production; Raft made them teachable — and the field has voted with its repositories.
References
- Lamport, Paxos Made Simple, ACM SIGACT News 32(4), 2001 — the five-page restatement that finally made the algorithm spread. Read this first; the original "Part-Time Parliament" only after.
- Lamport, The Part-Time Parliament, ACM TOCS 16(2), 1998 — the original Paxos paper, with the parable of the Greek island. Of mostly historical interest now, but a famous read.
- Reed and Junqueira, A simple totally ordered broadcast protocol, LADIS 2008 — the original ZAB paper that describes ZooKeeper's atomic broadcast protocol. The 2011 follow-up Zab: High-performance broadcast for primary-backup systems (DSN 2011) by Junqueira, Reed, and Serafini formalises the algorithm and compares it to Multi-Paxos.
- Ongaro and Ousterhout, In Search of an Understandable Consensus Algorithm (Extended Version), USENIX ATC 2014 — the canonical Raft paper. Section 10 includes the user study comparing Raft and Paxos comprehension; Section 11 discusses the design philosophy and the explicit goal of understandability.
- Lampson, How to Build a Highly Available System Using Consensus, Distributed Algorithms (Springer LNCS 1151), 1996 — Butler Lampson's overview of why every distributed system needs consensus and how Paxos fits in. A classic systems-engineering perspective from before the algorithm wars.
- Howard and Mortier, Paxos vs Raft: Have we reached consensus on distributed consensus?, PaPoC 2020 — a careful comparison from the Cambridge group that argues Raft and Multi-Paxos are algorithmically equivalent modulo presentation, with worked examples. Read this after the Raft paper for the post-game analysis.