Paxos (and why people struggle with it)

It is 03:47 IST during a PaySetu year-end settlement window and Aditi, the on-call SRE, is reading the Paxos paper for the third time this year because the payment-state replicator has logged proposal n=4711 rejected: promise for higher n=4719 already issued 84 times in 6 seconds. She knows the cluster is doing the right thing — refusing a stale proposer's writes — but she cannot map the log line back to what the protocol is trying to achieve right now. The Paxos paper opens with a parliament on a Greek island; the original "Part-Time Parliament" is so dressed up in metaphor that Lamport rewrote it as "Paxos Made Simple" eleven years later, and engineers still find that one confusing. The struggle is structural, not personal. Paxos is built around a single round of two phases (prepare/promise, then accept/accepted) for one slot of the replicated log, and the protocol's invariants are about what proposers and acceptors must remember across rounds — but the pedagogy almost universally fails to lead with "here is the slot, here is the value being chosen for it, here is the invariant that protects against two values being chosen". Once that frame is in place, the prepare/promise/accept dance is the minimum machinery that enforces the invariant under the FLP escape hatches (partial synchrony plus a failure detector). The protocol asks you to hold three things in mind simultaneously: the proposer's view, the acceptor's view, and the safety property that ties them together. Most expositions present them serially.

Paxos is a single-decree consensus protocol: a cluster of acceptors agrees on one value, proposed by one or more proposers, despite asynchronous messages and crash failures. It uses a two-phase round — prepare/promise to claim a ballot number and read prior state, then accept/accepted to commit a value — with the invariant that any value chosen in a higher ballot must be the same as any value already chosen in a lower one. Multi-Paxos chains single-decree rounds into a replicated log; Raft is Multi-Paxos rebranded for understandability. Most engineers struggle because the protocol's invariant is taught after the mechanism, instead of being the lens through which the mechanism is read.

The single-decree protocol — three roles, two phases, one invariant

Paxos has three roles. Proposers propose values they want the cluster to agree on. Acceptors vote on proposals; they are the persistent state of the protocol. Learners observe what was decided and act on it. In real implementations, a single node typically plays all three roles — a Paxos cluster of five physical machines is five proposer-acceptor-learner trios. The roles exist to separate concerns in the safety proof, not because the deployment topology cares.

The protocol runs in rounds indexed by a ballot number n (sometimes called a proposal number or epoch). Ballot numbers are globally unique and totally ordered — the standard implementation pairs a monotonically-increasing local counter with the proposer's node id, so two proposers cannot pick the same n. Each round has two phases.
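
A minimal sketch of that numbering scheme (the class name and the (counter, node_id) tuple encoding are illustrative; any scheme that yields unique, totally ordered, per-proposer-increasing ballots works):

# ballot_numbers.py: one common way to make ballot numbers globally unique and
# totally ordered. Tuples compare lexicographically, so ties on the counter are
# broken by node id and two nodes can never mint the same ballot.
class BallotClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def next_ballot(self):
        """Return a ballot strictly higher than any this node has issued."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, ballot):
        """After seeing a rival's ballot, jump past it so our next ballot is higher."""
        self.counter = max(self.counter, ballot[0])

a, b = BallotClock(node_id=1), BallotClock(node_id=2)
print(a.next_ballot(), b.next_ballot())   # (1, 1) and (1, 2): distinct and ordered
b.observe((7, 1))
print(b.next_ballot())                    # (8, 2): higher than anything seen so far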

Phase 1 — prepare/promise. A proposer picks a ballot number n higher than any it has previously used and sends Prepare(n) to a majority of acceptors. An acceptor, on receiving Prepare(n), checks whether it has previously promised any ballot ≥ n. If yes, it rejects. If no, it (1) records n as its highest promised ballot, and (2) replies Promise(n, prev_accepted) where prev_accepted is the highest-ballot value the acceptor has already accepted (or null if none). The promise is a binding contract: the acceptor will not accept any proposal numbered below n from now on.

Phase 2 — accept/accepted. Once the proposer has Promises from a majority, it picks the value to propose. The rule: if any Promise carried a prev_accepted value, the proposer must propose the value with the highest ballot number among them. Only if all Promises carry null may the proposer propose its own desired value. Why this rule is the entire safety property: an acceptor's prev_accepted field tells the new proposer "in some earlier ballot, a value v may have been chosen by a quorum that overlapped with me; if it was, I am one of the witnesses". By picking the highest-ballot prev_accepted value, the proposer guarantees that if any value was already chosen in a lower ballot, the new ballot will choose the same value. Choosing a fresh value would risk the cluster having two different chosen values across two ballots — the violation Paxos exists to prevent. The proposer then sends Accept(n, value) to a majority of acceptors. An acceptor, on receiving Accept(n, value), accepts it if and only if it has not promised a higher ballot than n since (i.e. its current promise is still n). On acceptance, it records (n, value) as its prev_accepted. Once a majority has accepted (n, value), the value is chosen. Learners observe enough Accepted messages to conclude this and act.

The single invariant — once a value v is chosen at ballot n, every higher ballot will also choose v — is what the prepare/promise dance enforces. Any two majorities of the same acceptor set must overlap in at least one acceptor (odd cluster sizes are used because they give the most fault tolerance per machine, not because the overlap needs them); that overlap acceptor will have witnessed the accept of v at ballot n and will report it in its Promise to the new proposer at ballot n' > n. The new proposer, by the adoption rule, must therefore propose v.
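
That overlap claim is cheap to check exhaustively for a small cluster. A minimal sketch (five acceptors, purely illustrative) that enumerates every pair of majorities and confirms they intersect:

# quorum_overlap.py: exhaustively verify that any two majorities of a
# five-acceptor set share at least one acceptor (the witness Paxos relies on).
from itertools import combinations

acceptors = set(range(5))
majority_size = len(acceptors) // 2 + 1      # 3 of 5

majorities = [set(q)
              for k in range(majority_size, len(acceptors) + 1)
              for q in combinations(acceptors, k)]

assert all(q1 & q2 for q1 in majorities for q2 in majorities)
print(f"checked {len(majorities)**2} pairs of majorities: every pair overlaps")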

[Figure: Paxos single-decree round — one proposer P, three acceptors A1–A3, ballot n=5, time flowing downward. P sends Prepare(n=5) to all three; A1 and A2 reply Promise(n=5, prev=null), while A3 is silently partitioned and never replies. With a 2-of-3 majority of promises, P sends Accept(n=5, v); A1 and A2 reply Accepted(n=5, v), and v is chosen because {A1, A2} is a majority of {A1, A2, A3}.]
Illustrative — single-decree Paxos with one proposer, three acceptors, one silently-partitioned acceptor. The {A1, A2} majority is sufficient to choose `v`. A3 will reconcile its state when the partition heals via subsequent learn / catch-up messages.

A measurable demonstration — single-decree Paxos in Python

The script below implements single-decree Paxos with five acceptors and two competing proposers. There is no real network: messages are plain synchronous function calls, so every run produces the same trace. The point is to show two things: (1) when one proposer wins, the chosen value is locked in; (2) when two proposers race, one is forced — by the adoption rule applied to its promises — to adopt the other's value if any acceptor has already accepted.

# paxos_single_decree.py — single-decree Paxos with 5 acceptors and 2 competing proposers.
import collections, itertools

class Acceptor:
    def __init__(self, aid):
        self.aid = aid
        self.promised_n = -1          # highest ballot promised
        self.accepted_n = -1          # highest ballot accepted
        self.accepted_v = None        # value at accepted_n

    def on_prepare(self, n):
        if n <= self.promised_n: return ("nack", n, self.promised_n)
        self.promised_n = n
        return ("promise", n, self.accepted_n, self.accepted_v)

    def on_accept(self, n, v):
        if n < self.promised_n: return ("nack", n, self.promised_n)
        self.promised_n = n
        self.accepted_n, self.accepted_v = n, v
        return ("accepted", n, v)

class Proposer:
    def __init__(self, pid, value, ballot_seed):
        self.pid, self.value = pid, value
        self.ballot_counter = ballot_seed   # ensures unique ballots per proposer

    def next_ballot(self):
        self.ballot_counter += 5            # leave gaps so the other proposer can interleave
        return self.ballot_counter

    def run(self, acceptors):
        n = self.next_ballot()
        promises = [a.on_prepare(n) for a in acceptors]
        ok = [p for p in promises if p[0] == "promise"]
        if len(ok) <= len(acceptors) // 2: return ("phase1_failed", n)
        # adoption rule: pick the highest-ballot prev-accepted value, else our own
        prev = [(p[2], p[3]) for p in ok if p[3] is not None]
        chosen_v = max(prev)[1] if prev else self.value
        accepts = [a.on_accept(n, chosen_v) for a in acceptors]
        ok2 = [a for a in accepts if a[0] == "accepted"]
        if len(ok2) <= len(acceptors) // 2: return ("phase2_failed", n)
        return ("chosen", n, chosen_v)

acceptors = [Acceptor(i) for i in range(5)]
P_paysetu  = Proposer(pid="paysetu",  value="settlement_v1", ballot_seed=0)
P_kapital  = Proposer(pid="kapital",  value="settlement_v2", ballot_seed=2)

print("Round 1: PaySetu's proposer goes first.")
print(" ", P_paysetu.run(acceptors))
print("Round 2: KapitalKite's proposer races; ballot is higher than PaySetu's.")
print(" ", P_kapital.run(acceptors))
print("Round 3: PaySetu retries with a new ballot.")
print(" ", P_paysetu.run(acceptors))
print("Final acceptor state:")
for a in acceptors:
    print(f"  A{a.aid}: promised_n={a.promised_n}  accepted={(a.accepted_n, a.accepted_v)}")

Sample run on Python 3.11:

Round 1: PaySetu's proposer goes first.
  ('chosen', 5, 'settlement_v1')
Round 2: KapitalKite's proposer races; ballot is higher than PaySetu's.
  ('chosen', 7, 'settlement_v1')
Round 3: PaySetu retries with a new ballot.
  ('chosen', 10, 'settlement_v1')
Final acceptor state:
  A0: promised_n=10  accepted=(10, 'settlement_v1')
  A1: promised_n=10  accepted=(10, 'settlement_v1')
  A2: promised_n=10  accepted=(10, 'settlement_v1')
  A3: promised_n=10  accepted=(10, 'settlement_v1')
  A4: promised_n=10  accepted=(10, 'settlement_v1')

Per-line walkthrough. Acceptor.on_prepare is the safety gate: any ballot n that does not strictly exceed promised_n is rejected. The acceptor then locks in n and reports its prior accepted value, if any. Proposer.run is the round driver: it computes a fresh ballot, gathers promises, applies the adoption rule, and proceeds to phase 2. The adoption rule lives in the prev = [(p[2], p[3]) for p in ok if p[3] is not None] line: the proposer collects every promise that carries a previously-accepted value, then picks the one with the highest ballot via max(prev). This is the single line that enforces "once v is chosen, every later ballot also chooses v". The output's middle line is the lesson: KapitalKite's proposer, despite intending to push settlement_v2, gets forced to propose settlement_v1 because Round 1's promises now report (5, settlement_v1) as their prev_accepted. Why this is what makes Paxos correct: the cluster never has two different chosen values across rounds. Even though KapitalKite's proposer started with a different intent, the protocol coerces it into agreement with the value already locked in. The "competing proposer" is not a threat to safety; it is, at worst, a threat to liveness — if both proposers keep stealing each other's prepares with higher ballots, neither one's accept reaches a majority, and the system livelocks. This is the dueling-proposer pathology that Multi-Paxos solves with a stable leader.

The dueling-proposer livelock is real. If PaySetu's and KapitalKite's proposers each start a fresh prepare every time their previous accept fails, and their ballot increments interleave, the system can spin forever without choosing anything. FLP is showing through. The standard fix is to elect one proposer (the leader) and let it run all rounds; failover happens only when the leader is suspected dead. That is the core of Multi-Paxos.
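
A minimal sketch of the pathology (the Acceptor here is a stripped-down copy of the one in paxos_single_decree.py so the snippet stands alone, and the interleaving is forced by hand rather than by unlucky network timing):

# dueling_proposers.py: hand-forced interleaving that produces the livelock.
# Each proposer's fresh Prepare invalidates the other proposer's in-flight Accept,
# so no Accept ever reaches a majority and nothing is ever chosen.
class Acceptor:
    def __init__(self):
        self.promised_n = -1
    def on_prepare(self, n):
        if n <= self.promised_n:
            return False                  # nack
        self.promised_n = n
        return True                       # promise (prev_accepted omitted: nothing is ever accepted here)
    def on_accept(self, n, v):
        if n < self.promised_n:
            return False                  # nack: a higher Prepare got there first
        self.promised_n = n
        return True

acceptors = [Acceptor() for _ in range(3)]
ballot = 0
pending = None                            # (proposer, ballot, value) still waiting to run its Accept phase

for step in range(6):
    proposer, value = ("P1", "v1") if step % 2 == 0 else ("P2", "v2")
    ballot += 1
    assert all(a.on_prepare(ballot) for a in acceptors)   # my Prepare succeeds...
    if pending:                                           # ...and dooms the rival's pending Accept
        rival, n, v = pending
        wins = sum(a.on_accept(n, v) for a in acceptors)
        print(f"{rival}: Accept(n={n}, {v!r}) accepted by {wins}/3 acceptors; no majority, retry")
    pending = (proposer, ballot, value)

print("six steps, nothing chosen: the dueling-proposer livelock")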

Multi-Paxos — chaining single decrees into a replicated log

A single Paxos decree chooses one value. A replicated state machine needs to choose a sequence of values — slot 0, slot 1, slot 2, ... — each representing one client command (write x=5, increment counter, commit transaction). Multi-Paxos is the obvious extension: one Paxos instance per slot. The optimisation that makes it efficient is that the prepare phase, once a stable leader is established, can be batched across all future slots — the leader runs prepare once with a high ballot number for "all slots from index ≥ k", and all subsequent commands on that leader's term need only the accept phase. Phase 1 amortises to nearly zero; the steady state is one round-trip per command.

The leader is itself elected via a Paxos round (or a separate leader-election protocol). Once elected at ballot b, the leader assumes it owns all slots and issues Accept(b, slot=i, value=v) for each new client command. Acceptors honour the accepts as long as their promised_n is still b. If a new leader is elected at b' > b, acceptors will start rejecting the old leader's accepts, and the old leader either re-runs phase 1 at a higher ballot or (more commonly) steps down.

The leader change is where Paxos gets subtle. A new leader at b', before issuing any accepts, must run phase 1 to discover what was already accepted in slots that may not yet be chosen. For each slot i, the new leader collects promises from a quorum and applies the adoption rule per slot: if any promise reports a prev_accepted value for slot i, the new leader must re-propose that value at b' for slot i. This is how leader-change recovery works in Multi-Paxos. Raft gets the equivalent guarantee a different way: its voting rule rejects any candidate whose log is less up to date than the voter's, so the elected leader already holds every committed entry and never needs per-slot adoption, a simplification that relies on Raft's log-contiguity invariant.
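
A minimal sketch of this shape, not a faithful implementation: the class names, the synchronous calls, and the decision to have the new leader re-prepare from slot 0 are all simplifications (a real leader prepares only from its first slot not known to be chosen):

# multipaxos_sketch.py: the Multi-Paxos shape in miniature. Phase 1 runs once per
# leader term and covers every slot from first_slot onward; each command then costs
# one accept round-trip; a new leader re-proposes whatever earlier ballots left behind.
class Acceptor:
    def __init__(self):
        self.promised_n = -1
        self.log = {}                                  # slot -> (ballot, value) highest accepted

    def on_prepare(self, n, first_slot):
        """Promise ballot n for all slots >= first_slot; report what is already accepted there."""
        if n <= self.promised_n:
            return None                                # nack
        self.promised_n = n
        return {s: bv for s, bv in self.log.items() if s >= first_slot}

    def on_accept(self, n, slot, value):
        if n < self.promised_n:
            return False                               # a newer leader has taken over
        self.promised_n = n
        self.log[slot] = (n, value)
        return True

class Leader:
    def __init__(self, ballot, acceptors):
        self.n, self.acceptors = ballot, acceptors
        self.next_slot = 0

    def take_over(self, first_slot=0):
        """Phase 1, once: claim the ballot for every slot >= first_slot, then
        re-propose the highest-ballot accepted value found for each half-done slot."""
        replies = [a.on_prepare(self.n, first_slot) for a in self.acceptors]
        promised = [r for r in replies if r is not None]
        assert len(promised) > len(self.acceptors) // 2, "no quorum, cannot lead"
        recovered = {}                                 # per-slot adoption rule
        for reply in promised:
            for slot, (b, v) in reply.items():
                if slot not in recovered or b > recovered[slot][0]:
                    recovered[slot] = (b, v)
        for slot, (_, v) in sorted(recovered.items()):
            self.commit(slot, v)                       # re-propose at our own ballot
        self.next_slot = max(recovered, default=first_slot - 1) + 1

    def commit(self, slot, value):
        """Phase 2 only: the steady-state cost of one command is this single round-trip."""
        acks = sum(a.on_accept(self.n, slot, value) for a in self.acceptors)
        assert acks > len(self.acceptors) // 2, "lost leadership mid-command"

    def propose(self, value):
        slot, self.next_slot = self.next_slot, self.next_slot + 1
        self.commit(slot, value)
        return slot

acceptors = [Acceptor() for _ in range(3)]

l1 = Leader(ballot=7, acceptors=acceptors)
l1.take_over()                                         # prepare once, covers all future slots
for cmd in ["SET x=5", "DEL y", "INCR z", "CAS w"]:
    l1.propose(cmd)                                    # slots 0-3, one round-trip each

l2 = Leader(ballot=12, acceptors=acceptors)            # failover: new leader, higher ballot
l2.take_over()                                         # re-proposes slots 0-3 at b=12, then owns the log
l2.propose("PUT a=1"); l2.propose("PUT b=2")           # slots 4-5
print({s: v for s, (b, v) in sorted(acceptors[0].log.items())})

The final print shows the whole log owned at ballot 12: slots 0 through 3 carry the values the first leader chose, now re-accepted at the higher ballot, and slots 4 and 5 carry the new leader's commands.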

[Figure: Multi-Paxos — one Paxos instance per log slot, batched prepare. A horizontal log of slots 0–5: leader L1 (ballot b=7) ran prepare(b=7) once for slots ≥ 0 and chose slots 0–3 (SET x=5, DEL y, INCR z, CAS w); after a failover, leader L2 ran prepare(b=12) once for slots ≥ 4 and chose slots 4–5 (PUT a=1, PUT b=2). Steady-state cost per command: one accept round-trip, leader to majority and back, with the prepare phase amortised across all future slots of the leader's term. A leader change costs one prepare phase plus log catch-up, re-proposing any prev_accepted values for half-chosen slots.]
Illustrative — Multi-Paxos log with leader failover between slots 3 and 4. The new leader ran prepare(b=12) once for "all slots ≥ 4"; its accepts henceforth cost one round-trip each. Real implementations (Chubby, Spanner's Paxos groups, Megastore) all use this batched-prepare optimisation.

Why people struggle — and the lens that makes it click

Engineers who bounce off Paxos almost always bounce off the same three things. Naming them out loud is the fastest path to fluency.

Struggle 1 — the parliament metaphor. Lamport's "The Part-Time Parliament" (written around 1990, finally published in TOCS in 1998) presents the protocol as the legislative process of a Greek island where parliamentarians come and go. The metaphor obscures rather than illuminates: the reader spends mental cycles translating "decree" to "value" and "ledger" to "log" instead of internalising the safety invariant. Skip the metaphor. Read "Paxos Made Simple" (2001) for prose, then the original "The Part-Time Parliament" only if you want the paxology jokes. Or skip Lamport's writeups entirely and read Lampson's "How to Build a Highly Available System Using Consensus" (1996) — the cleanest non-Lamport exposition.

Struggle 2 — the proposer's "adoption rule" feels arbitrary. "Why must the new proposer propose someone else's value?" is the question every reader asks at some point. The answer is the safety invariant: if a value v was chosen at ballot n, a quorum of acceptors accepted it, and any later ballot's quorum must overlap that earlier quorum in at least one acceptor — that overlap acceptor is required to report v in its Promise. The adoption rule is not arbitrary; it is the only way to preserve "once chosen, always chosen" without coordinating across rounds. If you skip it, you can construct a short trace in which two ballots produce two different chosen values, and the entire protocol collapses.
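
That trace is easy to reproduce in code. A minimal sketch (three acceptors, a deliberately broken proposer that ignores prev_accepted; everything here is hypothetical and exists only to exhibit the violation):

# broken_adoption.py: what goes wrong when a proposer ignores prev_accepted.
# Two different values end up chosen at two different ballots, which is exactly
# the outcome the adoption rule exists to prevent.
class Acceptor:
    def __init__(self):
        self.promised_n, self.accepted = -1, None
    def on_prepare(self, n):
        if n <= self.promised_n:
            return None                                  # nack
        self.promised_n = n
        return self.accepted                             # prev_accepted, or None
    def on_accept(self, n, v):
        if n < self.promised_n:
            return False
        self.promised_n, self.accepted = n, (n, v)
        return True

a0, a1, a2 = Acceptor(), Acceptor(), Acceptor()

# Ballot 1: proposer P1 reaches only the {a0, a1} majority and gets "v1" chosen.
for a in (a0, a1):
    a.on_prepare(1)
assert all(a.on_accept(1, "v1") for a in (a0, a1))
print("ballot 1: 'v1' chosen by majority {a0, a1}")

# Ballot 2: proposer P2 reaches the {a1, a2} majority. a1 dutifully reports (1, 'v1')
# in its promise, but this broken proposer ignores it and pushes its own value anyway.
promises = [a.on_prepare(2) for a in (a1, a2)]
print("promises carried:", promises)                     # the witness (1, 'v1') is right there
assert all(a.on_accept(2, "v2") for a in (a1, a2))
print("ballot 2: 'v2' chosen by majority {a1, a2}  <- two different chosen values, safety gone")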

Struggle 3 — confusing single-decree Paxos with Multi-Paxos. The original Paxos paper is single-decree: it agrees on one value. Production systems use Multi-Paxos: a sequence of slots, each with its own decree, optimised by a stable leader who batches phase 1. Most readings of "Paxos in production" actually mean "Multi-Paxos with a stable leader". When the introduction says "Paxos chooses a value" and the production discussion says "Paxos replicates a log", the reader is right to be confused. The protocols are related but not the same.

The lens that makes it click: read Paxos invariant-first. The protocol exists to enforce the property "once a value is chosen at any ballot, every higher ballot also chooses the same value". The prepare/promise dance is the minimum mechanism that enforces this property under the FLP escape hatches (partial synchrony, randomised retry backoff to break dueling-proposer livelock, and a failure detector to suspect a dead leader). Every part of the protocol — ballot uniqueness, majority overlap, the adoption rule, the persistence requirement — exists to maintain that invariant under crash failures and async messages. Once you have that lens, the prose stops fighting you.

Going deeper

The safety proof — why majority overlap is the magic

The Paxos safety proof is the cleanest impossibility-evasion proof in distributed systems. The theorem: if value v is chosen at ballot n, then at every ballot n' > n, any value proposed in phase 2 (and hence any value chosen) is also v. The proof is by induction on n'. Base case: at n' = n + 1, a new proposer runs phase 1. Its prepare quorum Q' and the original accept quorum Q (the one that chose v at n) are both majorities of the same acceptor set, so they intersect in at least one acceptor a. Acceptor a accepted (n, v) before it promised n + 1 (had it promised first, it would have rejected the accept at n and could not be part of Q), so when the new proposer collects promises from Q', at least one promise (from a) reports prev_accepted = (n, v), and no promise can report anything higher. The adoption rule forces the new proposer to propose v at n + 1. Inductive step: suppose every value proposed in phase 2 at any ballot in (n, n') is v; then anything accepted at those ballots, chosen or not, is also v. The proposer at n' collects promises from a majority. Either some promise reports a prev_accepted value from a ballot in (n, n'), in which case the highest such report carries v by the hypothesis and the adoption rule chooses v; or no promise reports anything from (n, n'), but at least one must report (n, v) (since the original accept quorum overlaps any majority), and the adoption rule still chooses v. The proof's elegance is that it converts a temporal property (chosen forever) into a structural property (quorum overlap forces a witness to be present). Lamport's "The Part-Time Parliament" carries the argument out in full rigour, and Lamport later published formal TLA+ specifications of the protocol; the original 1998 paper is dense but worth reading once after you understand the protocol mechanically.

Disk persistence — why acceptors must fsync before responding

Both promised_n and (accepted_n, accepted_v) must be persisted to durable storage before the acceptor responds. If an acceptor promises n=5, replies, and then crashes without persisting the promise, on recovery it might promise or accept an earlier ballot such as n=4 and break the safety invariant. This is the "write-ahead log + fsync" cost that every Paxos implementation pays. In production, acceptor disk fsync is the dominant latency cost: each acceptor fsyncs once for the promise and once again for the acceptance, and even a sub-millisecond NVMe fsync adds up at that rate. Spanner uses dedicated SSD-backed log volumes; etcd fsyncs its write-ahead log before acknowledging an entry; CockroachDB writes through its storage engine's WAL (RocksDB originally, Pebble today). The cost can be amortised by batching multiple commands per Paxos round, which is what every production system does — Multi-Paxos's leader bundles a few hundred client commands into a single accept.
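
A sketch of the persistence discipline (the JSON file, the atomic-rename scheme, and the class name are all illustrative; the only load-bearing detail is the ordering, write and fsync before replying):

# durable_acceptor.py: persist acceptor state before replying, reload it on restart.
# Illustrative file format; a real implementation would use a proper write-ahead log
# and would also fsync the containing directory after the rename.
import json, os

class DurableAcceptor:
    def __init__(self, path):
        self.path = path
        self.state = {"promised_n": -1, "accepted_n": -1, "accepted_v": None}
        if os.path.exists(path):                       # crash recovery: last fsynced state wins
            with open(path) as f:
                self.state = json.load(f)

    def _persist(self):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())                        # on disk before we answer anyone
        os.replace(tmp, self.path)                      # atomic swap into place

    def on_prepare(self, n):
        if n <= self.state["promised_n"]:
            return None                                 # nack
        self.state["promised_n"] = n
        self._persist()                                 # fsync #1: the promise itself
        return (self.state["accepted_n"], self.state["accepted_v"])

    def on_accept(self, n, v):
        if n < self.state["promised_n"]:
            return False
        self.state.update(promised_n=n, accepted_n=n, accepted_v=v)
        self._persist()                                 # fsync #2: the accepted value
        return True

acc = DurableAcceptor("/tmp/acceptor_a0.json")
print(acc.on_prepare(5), acc.on_accept(5, "settlement_v1"))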

Cheap Paxos, Fast Paxos, EPaxos — the variant landscape

Lamport himself produced several follow-up variants. Cheap Paxos (2004) reduces the number of acceptors needed for fault tolerance by introducing "auxiliary" acceptors that activate only during failures. Fast Paxos (2006) cuts steady-state latency by letting clients send proposals directly to acceptors, skipping the proposer round-trip — at the cost of larger quorums (3f+1 acceptors to tolerate f failures instead of 2f+1). Generalised Paxos (2005) allows commutative commands to commit in different orders on different replicas as long as the user-defined commute relation holds. EPaxos (Egalitarian Paxos, Moraru et al., 2013) eliminates the leader entirely: any replica can propose, and dependencies between commands are tracked explicitly. EPaxos's dependency analysis makes it harder to reason about than Multi-Paxos, but it has the lowest tail latency in geo-distributed deployments because no single replica is a bottleneck. Flexible Paxos (Howard et al., 2016) shows that the prepare and accept quorums can be different sizes as long as every phase-1 quorum intersects every phase-2 quorum — for size-based quorums, |Q1| + |Q2| > N is the only requirement. This unlocks deployments where, in a five-node group, each command's accept phase needs only two acceptors as long as a new leader's prepare phase reaches four. The variant landscape is rich; for production, Multi-Paxos and Raft dominate, but for research and specialist deployments, the variants matter.
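
The intersection condition is a one-line arithmetic check. A quick sketch (five nodes, purely illustrative) that lists every legal pair of phase-1 and phase-2 quorum sizes:

# flexible_quorums.py: Flexible Paxos lets the prepare (phase-1) and accept (phase-2)
# quorums differ in size, provided every phase-1 quorum intersects every phase-2
# quorum; for size-based quorums that is simply q1 + q2 > n.
n = 5
legal = [(q1, q2) for q1 in range(1, n + 1) for q2 in range(1, n + 1) if q1 + q2 > n]
for q1, q2 in legal:
    print(f"phase-1 quorum {q1}/{n}, phase-2 quorum {q2}/{n}: "
          f"guaranteed overlap of at least {q1 + q2 - n} acceptor(s)")
# Classic Paxos is the symmetric point (3, 3). The (4, 2) point buys cheap per-command
# accepts at the price of a larger quorum whenever a new leader runs phase 1.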

The Chubby lock service — Paxos in production at Google

Burrows's 2006 paper "The Chubby Lock Service for Loosely-Coupled Distributed Systems" is the canonical real-world Paxos deployment. Chubby runs five replicas using Multi-Paxos to maintain a small filesystem-like namespace used for locks, leader election, and configuration. The design choices Burrows describes are the production-grade lessons: clients send all reads and writes to the master (so reads stay linearisable without a Paxos round every time, thanks to a master lease); the cluster runs five replicas because two-failure tolerance is the production requirement at Google's scale; the master changes only once every few weeks under steady state, which means phase 1 is amortised effectively to zero; and failure detection rides on the service's own KeepAlive traffic rather than on Paxos itself (in effect an informal eventually-weak failure detector). Chubby's API is small: a file-tree namespace with open, close, read, write, and lock acquire/release operations. ZooKeeper fills the same niche with a similar, though not identical, API and its own broadcast protocol (Zab) underneath; etcd is a Go reimplementation of the same shape with Raft. The Chubby paper, more than the Paxos papers themselves, taught the industry how to ship a Paxos system.

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
# save paxos_single_decree.py from the article body
python3 paxos_single_decree.py

# Then run a real Paxos system — etcd uses Raft, but ZooKeeper uses Zab (Paxos variant):
brew install zookeeper
zkServer start
# in another terminal, ask the server for its status:
echo srvr | nc 127.0.0.1 2181
# a single local node reports "Mode: standalone"; a three-node ensemble reports
# "Mode: leader" or "Mode: follower" once the election completes

Where this leads next

Paxos is the load-bearing chapter of Part 8. Every consensus protocol that follows builds on it.

Part 9 covers leader election and leases — the operational machinery on top of consensus. The Paxos cluster you build for your control plane will spend most of its time not running Paxos rounds, because a stable leader is in steady state; the leader's lease is the optimisation that lets reads skip the consensus tax. Read Part 9 next.

References

  1. Lamport, "The Part-Time Parliament" — TOCS 1998 — the original Paxos paper, with the parliament metaphor.
  2. Lamport, "Paxos Made Simple" — 2001 — Lamport's own re-explanation, much shorter and clearer.
  3. Lampson, "How to Build a Highly Available System Using Consensus" — WDAG 1996 — the cleanest non-Lamport exposition; teaches Paxos as engineering.
  4. Burrows, "The Chubby Lock Service for Loosely-Coupled Distributed Systems" — OSDI 2006 — the canonical production Paxos deployment, with operational lessons.
  5. Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm" — USENIX ATC 2014 — the Raft paper; Raft is explicitly designed as a more understandable equivalent of (Multi-)Paxos.
  6. Howard, Malkhi, Spiegelman, "Flexible Paxos: Quorum intersection revisited" — OPODIS 2016 — generalises the quorum constraint: phase-1 and phase-2 quorums may differ in size as long as they intersect.
  7. Howard & Mortier, "Paxos vs Raft: have we reached consensus on distributed consensus?" — PaPoC 2020 — formalises the equivalence between Multi-Paxos and Raft.
  8. FLP impossibility — what it forbids — chapter 50; the result Paxos navigates around via partial synchrony and stable leadership.