EPaxos and Flexible Paxos
It is 19:42 IST on a Friday, and PaySetu's settlement service is paying a 110 ms tail penalty on every cross-region write because the Multi-Raft leader for shard 042 happens to live in Mumbai while the write originated in Singapore. Aditi, the platform-tech lead, is staring at a flame graph that says: 9 ms application logic, 12 ms local fsync, 89 ms WAN round-trip to the leader. The cluster is correct. The cluster is fast on paper. The cluster is wasting 80% of every cross-region write on a leader that did not need to be there. Why this exact frame is the entry point to EPaxos and Flexible Paxos: classical Multi-Paxos and Raft both bake in the assumption that one node is the leader, and every write goes through it. For a single-region deployment that assumption is fine — the leader-RTT is 0.5 ms and you do not notice. For a geo-distributed deployment the leader-RTT is 80–250 ms across continents, and the leader becomes the dominant cost. The two protocols in this chapter are the two clean answers: kill the leader entirely (EPaxos), or change the quorum shape so the leader matters less (Flexible Paxos).
EPaxos is a leaderless consensus protocol where any replica can propose a command; commands that do not interfere with each other commit in one round-trip (a "fast quorum" of ⌈3F/2⌉ + 1 for F failures), and conflicting commands fall back to a two-round Paxos. Flexible Paxos generalises Paxos's "majority quorum" to two intersecting quorum sets Q1 (used in phase 1, leader election) and Q2 (used in phase 2, accepting writes); the only requirement is Q1 ∩ Q2 ≠ ∅. Together they break the assumption that consensus needs a single fixed leader and a single quorum size — the building blocks for systems where the write originator is geographically far from "the leader".
Why a single leader is the wrong shape for geo-distribution
Multi-Paxos and Raft both run a stable leader; every write goes through it. In a single region this is the right call — the leader's local-quorum RTT is sub-millisecond, and steering every write through one node simplifies the protocol enormously (no concurrency, no command-dependency tracking, no fast-path/slow-path split). The cost only becomes visible when you stretch the cluster across regions. If your replicas live in Mumbai, Singapore, and Frankfurt, and the leader is in Mumbai, then a write originating in Frankfurt pays a Frankfurt↔Mumbai round-trip (roughly 200 ms) on top of the leader's own quorum round — half of it just getting the write to the leader before it can even start. Why this is structural and not a tuning problem: the leader is a serialisation point. Even if you make the network faster, the leader still has to receive the write, append it locally, replicate to a majority, and ack — and the receive step is what kills geo-latency, because the originator is by definition not the leader most of the time. You cannot tune your way out of "the leader is in the wrong AZ for this client".
The two protocols in this chapter take orthogonal approaches:
- EPaxos (Moraru, Andersen, Kaminsky, SOSP 2013) drops the leader entirely. Any replica can propose. The protocol pays for this by tracking command interference graphs and running a two-tier quorum (fast quorum for non-conflicting commands, slow quorum for conflicting ones).
- Flexible Paxos (Howard, Malkhi, Spiegelman, OPODIS 2016) keeps the single leader but generalises the quorum. The classical Paxos rule says both phases use majority quorums; FPaxos shows that any two quorum sets Q1 (phase 1, prepare) and Q2 (phase 2, accept) work as long as they intersect. This lets you trade off election latency against steady-state write latency, or shape quorums geographically.
EPaxos — leaderless commits via interference graphs
EPaxos's central observation: most pairs of commands in a real workload do not interfere with each other. A PUT key=A and a PUT key=B commute — the order in which they apply does not matter. Only commands that touch the same key (or, more generally, share a serialisable resource) need to agree on an order. So instead of forcing every command through one leader to establish a global total order, EPaxos lets each replica be the command leader for its own proposals, and tracks per-command dependency sets that record "this command must apply after these other commands".
The protocol has two paths:
- Fast path (one round-trip). The command leader sends a PreAccept to a fast quorum — ⌈3F/2⌉ + 1 replicas in a 2F+1-node cluster (3 of 3 for F=1; 4 of 5 for F=2). Each replica replies with the dependencies it sees for this command (the set of currently-known commands that conflict). If all responses agree on the same dependency set, the command leader commits in one round-trip.
- Slow path (two round-trips). If responses disagree on dependencies, the command leader runs a Paxos-Accept round to fix the dependency set, then commits.
The fast quorum is larger than a majority because EPaxos needs a stronger property: every two fast quorums for conflicting commands must share a node that has seen both commands, so that conflicting commands cannot both take the fast path with disagreeing dependencies. The exact size is derived from the protocol's safety proof; the take-away is that EPaxos's fast quorum is 4 of 5 in a 5-node cluster versus Paxos's 3-of-5 majority.
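A quick arithmetic check of the sizes quoted throughout this chapter — nothing protocol-specific, just the formula applied for a few values of F:

# fast_quorum_sizes.py — sizes only: majority vs EPaxos fast quorum for N = 2F + 1.
import math

for F in (1, 2, 3, 10):
    N = 2 * F + 1
    majority = N // 2 + 1
    fast = math.ceil(3 * F / 2) + 1
    print(f"F={F:2d}  N={N:2d}  majority={majority:2d}  fast quorum={fast:2d}  ({fast / N:.0%} of the cluster)")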
In production-faithful workloads (Moraru's SOSP 2013 paper measures Spanner-style key-value workloads with 1–10% conflict rates), 95–98% of commands take the fast path. The slow path is for hot keys and contention, and is exactly the cost Multi-Paxos pays on every command. The amortised cost is 1.05 round-trips per command vs Multi-Paxos's leader-hop + 1 round-trip. Why the dependency-tracking machinery is the price of leaderlessness: without a global serialisation point, you cannot rely on a leader's log order to define "before". EPaxos replaces total order with a partial order — a DAG of commands, where the only edges are between commands that genuinely interfere. The DAG is the equivalent of Multi-Paxos's log, but it grows as a graph rather than a sequence, and the per-command leader records its incoming edges into the DAG via the dependency set.
A working EPaxos fast-path simulator
# epaxos_fast_path.py — minimal EPaxos fast-path simulator: 5 replicas, simulated
# inter-region latencies, two clients submitting non-conflicting and conflicting writes.
import collections, heapq

REPLICAS = ["mumbai", "singapore", "frankfurt", "tokyo", "sydney"]
RTT = {  # one-way ms between replicas
    ("mumbai", "singapore"): 30, ("mumbai", "frankfurt"): 100, ("mumbai", "tokyo"): 60, ("mumbai", "sydney"): 75,
    ("singapore", "frankfurt"): 110, ("singapore", "tokyo"): 35, ("singapore", "sydney"): 50,
    ("frankfurt", "tokyo"): 130, ("frankfurt", "sydney"): 145, ("tokyo", "sydney"): 55,
}

def lat(a, b):
    if a == b:
        return 0
    return RTT.get((a, b)) or RTT.get((b, a))

events, log = [], []

def schedule(t, kind, *args):
    heapq.heappush(events, (t, kind, *args))

class Replica:
    def __init__(self, nid):
        self.nid = nid
        self.deps = collections.defaultdict(set)  # key -> set of command_ids that touched it

    def pre_accept(self, cmd_id, key, leader, t):
        seen_deps = set(self.deps[key])  # conflicting commands this replica already knows about
        self.deps[key].add(cmd_id)
        schedule(t + lat(self.nid, leader), "pre_accept_resp", leader, cmd_id, frozenset(seen_deps))

replicas = {n: Replica(n) for n in REPLICAS}
fast_quorum = 4  # ceil(3*F/2)+1 with F=2 in 5 nodes; the command leader's own zero-latency reply counts

# Two writes from frankfurt: c1 to key=A and c2 to key=B (they do not conflict with each other).
# Then a third write to key=A from singapore — it conflicts with c1, so replicas report different
# dependency sets depending on which command's PreAccept reached them first, and one of the two
# conflicting commands falls back to the slow path.
def submit(cmd_id, key, leader, t):
    log.append((t, f"{leader} proposes {cmd_id} key={key}"))
    for r in REPLICAS:
        schedule(t + lat(leader, r), "pre_accept", r, cmd_id, key, leader)

# Track responses
responses = collections.defaultdict(list)
committed_at = {}

def step(now, kind, *args):
    if kind == "pre_accept":
        replica, cmd_id, key, leader = args
        replicas[replica].pre_accept(cmd_id, key, leader, now)
    elif kind == "pre_accept_resp":
        leader, cmd_id, deps = args
        responses[cmd_id].append(deps)
        if cmd_id in committed_at:
            return
        if len(responses[cmd_id]) == fast_quorum:
            unique = set(responses[cmd_id])
            if len(unique) == 1:
                committed_at[cmd_id] = now
                log.append((now, f"{leader} fast-commit {cmd_id} deps={set(next(iter(unique)))}"))
            else:
                log.append((now, f"{leader} slow-path for {cmd_id} (deps disagree: {unique})"))

submit("c1", "A", "frankfurt", 0)
submit("c2", "B", "frankfurt", 5)
submit("c3", "A", "singapore", 30)  # conflicts with c1

while events:
    t, kind, *args = heapq.heappop(events)
    step(t, kind, *args)

for ts, msg in sorted(log):
    print(f"t={ts:6.1f}ms {msg}")
Sample run (Python 3.11; the timings are deterministic given the latency table, though the display order inside the "deps disagree" set can vary between runs):
t=   0.0ms frankfurt proposes c1 key=A
t=   5.0ms frankfurt proposes c2 key=B
t=  30.0ms singapore proposes c3 key=A
t= 130.0ms singapore fast-commit c3 deps=set()
t= 260.0ms frankfurt slow-path for c1 (deps disagree: {frozenset(), frozenset({'c3'})})
t= 265.0ms frankfurt fast-commit c2 deps=set()
Per-line walkthrough. fast_quorum = 4 is the specific size for F=2 in a 5-node cluster: ⌈3·2/2⌉ + 1 = 4. replicas[replica].pre_accept is the per-replica handler that records the command's dependency set for the touched key — the dependency set is what every replica reports back to the command leader. if len(unique) == 1: fast-commit is the fast-path gate: all responding replicas agreed on the same dependencies. c2 touches key B, conflicts with nothing, and fast-commits with empty deps. c3 conflicts with c1, but its four nearest replicas (singapore, mumbai, tokyo, sydney) all process c3's PreAccept before c1's has reached them, so they agree on an empty dependency set and c3 fast-commits too. c1 is the command that pays: its own replica in frankfurt saw no prior conflicting command, while mumbai, singapore, and tokyo had already recorded c3 — the responses disagree, and the else: slow-path branch fires. Why disagreement on deps forces the slow path: if two command leaders both fast-committed conflicting commands with different dependency sets, two replicas applying the DAG could derive different total orders during execution. The slow path runs an explicit Paxos-Accept round on the dependency set, forcing all replicas to agree on the same set (here, c1 depending on c3) before the command is allowed to commit. The slow path costs 2 RTTs vs the fast path's 1 RTT.
EPaxos's fast-path probability depends entirely on conflict rate. A workload with 0% conflicts (every write touches a different key) takes the fast path 100% of the time; a workload with 50% conflicts on a hot key takes the fast path roughly 50% of the time. PaySetu measured its settlement-service workload at a 3.2% conflict rate (the fraction of commands touching a key that another in-flight command also touches); the fast-path rate at that conflict level was 96.4%, with the geo-distributed write p99 dropping from 480 ms (Multi-Raft with leader in Mumbai) to 134 ms (EPaxos with command leader at the originating region).
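The amortised figures above (and the ~1.7 RTT number quoted later for hot keys) follow from a one-line expectation — a back-of-envelope model that assumes 1 RTT on the fast path and 2 RTTs on the slow path, ignoring the leader-hop question entirely:

# amortised_rtts.py — expected round-trips per command as a function of the fast-path rate.
def amortised_rtts(fast_path_rate):
    return fast_path_rate * 1 + (1 - fast_path_rate) * 2

for rate in (1.00, 0.964, 0.95, 0.50, 0.30):
    print(f"fast-path rate {rate:6.1%} -> {amortised_rtts(rate):.2f} RTTs/command")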
Flexible Paxos — the quorum theorem and grid quorums
Flexible Paxos starts from a different observation. Classical Paxos says both phases use majority quorums of size ⌈(N+1)/2⌉. Howard, Malkhi, and Spiegelman (OPODIS 2016) proved a stronger theorem: the only requirement is that any phase-1 quorum (Q1) intersects any phase-2 quorum (Q2). They are not required to be majorities, and they are not required to be the same size.
The intuition. Phase 1 (prepare) is run when a new leader takes over — it asks a quorum of replicas to promise not to accept proposals from older ballots, and it returns the highest-numbered accepted value. Phase 2 (accept) is run on every command — the leader asks a quorum to accept the value. The safety-critical interaction is: when a new leader runs phase 1, it must observe any value that a previous leader could have committed in phase 2. As long as Q1 and Q2 intersect, the new leader sees at least one replica that participated in the previous leader's accept quorum, and that replica reports back the previously-accepted value.
This unlocks several useful quorum shapes:
- Read-heavy workloads: pick Q1 = N (election quorum is everyone) and Q2 = 1 (write quorum is one replica). Trivially Q1 ∩ Q2 ≠ ∅. Writes commit on a single replica's ack; elections require everyone, but elections are rare. This is the "primary-backup with synchronous replication to one backup" pattern recast as Flexible Paxos.
- Write-heavy workloads: pick Q1 = majority (3 of 5) and Q2 = majority (3 of 5). This is classical Paxos.
- Geo-quorum: 9 replicas across 3 regions of 3 each. Pick Q1 = 7 (any election needs 7 acks) and Q2 = 3 (any write needs 3 acks), placed so a single region's three replicas form a Q2 quorum — writes can complete entirely within the originating region. Q1 ∩ Q2 ≠ ∅ because any 7 nodes intersect any region's 3 nodes (at least 1 of the region's 3 must be in any 7-node set in a 9-node cluster, since 9 − 7 = 2 < 3). The sketch after this list checks all three shapes programmatically.
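A minimal brute-force check of those three shapes — a sketch under the assumption that a quorum is any subset of the stated size, verifying only the intersection condition, not the protocol itself. The general rule it demonstrates: intersection is guaranteed iff |Q1| + |Q2| > N.

# fpaxos_quorum_check.py — verify "every Q1 intersects every Q2" by brute force over subsets.
from itertools import combinations

def always_intersect(n, q1_size, q2_size):
    nodes = range(n)
    return all(set(q1) & set(q2)
               for q1 in combinations(nodes, q1_size)
               for q2 in combinations(nodes, q2_size))

print(always_intersect(5, 5, 1))  # read-heavy: Q1 = all 5, Q2 = any single replica -> True
print(always_intersect(5, 3, 3))  # classical Paxos: majority x majority -> True
print(always_intersect(9, 7, 3))  # geo-quorum: any 7 of 9 vs any 3 of 9 -> True
print(always_intersect(9, 6, 3))  # 6 + 3 = 9 is not > 9, so disjoint pairs exist -> False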
The third shape is the production sweet spot for geo-distributed deployments: Spanner-style "every region writes locally, leader changes are rare and globally coordinated". Cosmos DB's Bounded Staleness mode uses this; CockroachDB's geo-partitioned cluster mode uses a similar grid-quorum pattern.
The trade-off is operational. Smaller Q2 means smaller write blast-radius (fewer replicas to crash before writes block), but the cluster is more vulnerable to losing the right combination of nodes. The grid quorum tolerates F = min(rows-1, cols-1) failures; with a 3×3 grid, you survive 2 failures, same as a 5-node majority cluster, but you have 9 nodes to operate. The win is geo-locality, not fault tolerance.
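A sketch of the 3×3 grid construction itself (the region and node names are made up for illustration): phase-2 quorums are full rows — one region each, so writes stay local — and phase-1 quorums are full columns — one node from every region — so every pair intersects in exactly one replica, which is all FPaxos requires.

# grid_quorum.py — 3x3 grid: Q2 = a full row (one region), Q1 = a full column (one node per region).
regions = [["mum-1", "mum-2", "mum-3"],
           ["sgp-1", "sgp-2", "sgp-3"],
           ["fra-1", "fra-2", "fra-3"]]

q2_quorums = [set(row) for row in regions]        # a write completes inside one region
q1_quorums = [set(col) for col in zip(*regions)]  # an election hears from every region

for q1 in q1_quorums:
    for q2 in q2_quorums:
        assert q1 & q2                            # every column meets every row in exactly one node
print("all", len(q1_quorums) * len(q2_quorums), "Q1/Q2 pairs intersect")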
Where each protocol breaks — failure modes you must enumerate
Both protocols add complexity that classical Paxos / Raft do not have, and complexity is where production bugs live.
EPaxos failure modes. (a) High conflict rate degrades to two-RTT path. A workload that accidentally creates a hot key (think: BharatBazaar's flash-sale, every write touches inventory:item-7892) drops EPaxos's fast-path rate from 96% to under 30%, and the protocol's amortised cost balloons from 1.05 RTTs to ~1.7 RTTs. (b) Recovery from a failed command leader requires querying the entire dependency graph. When a command leader crashes mid-PreAccept, the recovery quorum must reconstruct the command's dependency set by examining what other replicas have seen — this is more complex than Raft's "just elect a new leader and replicate the leader's log". (c) Execution stalls on dependency cycles. EPaxos commits commands in DAG order, but if the DAG has a long dependency chain, command execution latency can be much higher than commit latency — the user-visible result of a write is delayed until all its dependencies execute. The reference implementation (efficient/epaxos on GitHub) handles this with strongly-connected-component detection in the DAG.
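A small illustration of failure mode (c), using Python's standard graphlib rather than the reference implementation's SCC machinery — the dependency map here is hypothetical: acyclic dependencies sort cleanly into an execution order, while two conflicting commands that each recorded the other stall a plain topological sort.

# dep_graph_execution.py — committed commands execute in dependency order; cycles stall a naive sort.
from graphlib import TopologicalSorter, CycleError

acyclic = {"c1": set(), "c2": set(), "c3": {"c1"}}   # c3 conflicts with c1, so it executes after it
print(list(TopologicalSorter(acyclic).static_order()))

cyclic = {"c4": {"c5"}, "c5": {"c4"}}                # two conflicting commands that each depend on the other
try:
    list(TopologicalSorter(cyclic).static_order())
except CycleError as exc:
    # Real EPaxos collapses such a cycle into one strongly-connected component and
    # executes its members in a deterministic (sequence-number) order.
    print("cycle detected:", exc.args[1])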
Flexible Paxos failure modes. (a) Asymmetric quorum sizes change failure tolerance asymmetrically. With Q1 = 5, Q2 = 1 in a 5-node cluster, you survive 4 failures for writes but only 0 failures for elections — losing one node makes elections impossible. The classic majority-majority configuration is the only one that gives symmetric F = (N-1)/2 tolerance. (b) Region-affinity write quorums create stale reads in other regions. A grid-quorum cluster where Mumbai writes to its own row sees that write replicate to Singapore and Frankfurt only after the next election or asynchronous replication — cross-region reads can be 30 seconds stale unless the read path explicitly waits for global propagation. (c) The grid layout assumes uniform replica counts per row/column. A real geo-cluster usually has 3+1+1 (one region with 3 replicas, two regions with 1 each) for cost reasons; the FPaxos quorum theorem still applies but the elegant grid intuition breaks, and you have to compute custom quorum sets that satisfy Q1 ∩ Q2 ≠ ∅.
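The asymmetric-tolerance arithmetic in (a), in one place — the sizes are the examples from the text, and the assertion encodes the intersection requirement: writes survive as long as some Q2 can still be formed, elections as long as some Q1 can.

# quorum_tolerance.py — failures survivable for writes (need a Q2) vs elections (need a Q1).
def tolerance(n, q1_size, q2_size):
    assert q1_size + q2_size > n, "Q1 and Q2 would not be guaranteed to intersect"
    return {"writes survive": n - q2_size, "elections survive": n - q1_size}

print(tolerance(5, 5, 1))   # {'writes survive': 4, 'elections survive': 0}
print(tolerance(5, 3, 3))   # classical majorities: 2 and 2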
A 2024 PaySetu post-incident review found that the team's first attempt to roll out EPaxos for cross-region settlement writes reverted to Multi-Raft after three weeks. The blocking issue was not consensus latency — that was as predicted, ~134 ms p99 — but the operational complexity of the dependency-DAG-aware backup and snapshot tooling. The team built a hybrid: Flexible Paxos with Q1 = 5, Q2 = 3 and region-affine Q2 placement, which gave 88 ms p99 cross-region writes without the DAG complexity. The lesson: leaderless protocols have real performance wins, but the operational tooling lags 5–10 years behind leader-based protocols, and you should budget for that gap.
Common confusions
- "EPaxos has no leader, so it has no single point of failure." EPaxos has no static leader, but each command has a command leader — the replica that initiated that command. If the command leader crashes mid-PreAccept, recovery is more involved than Raft's leader election: peers must query the entire dependency set to determine whether the command was committed, and the recovery protocol involves running an explicit-prepare on the command. The "no SPOF" claim is about steady-state liveness, not crash-recovery.
- "Flexible Paxos lets you pick any two quorums." The constraint is Q1 ∩ Q2 ≠ ∅. Picking Q1 = {nodes 1, 2} and Q2 = {nodes 3, 4, 5} in a 5-node cluster violates the rule — they do not intersect — and the protocol is unsafe. Standard FPaxos quorum constructions enforce intersection by construction (majority×majority, grid, weighted).
- "EPaxos is faster than Raft." EPaxos is faster on workloads with a low conflict rate. On a hot-key workload it can be slower, because the slow path costs 2 RTTs vs Multi-Paxos's 1 RTT plus leader-hop. The break-even depends on the leader-hop latency. In a single-region deployment, Multi-Paxos almost always wins. In a geo-distributed deployment with low conflict, EPaxos wins decisively.
- "Flexible Paxos and EPaxos solve the same problem." They both attack the leader bottleneck, but the mechanisms are orthogonal. EPaxos eliminates the leader entirely; FPaxos reshapes the quorums while keeping the leader. You can compose them: a leaderless protocol can also use asymmetric quorums for its rounds (this is what the WPaxos and Mencius papers do).
- "⌈3F/2⌉ + 1 is just majority + 1." It is not. For F=1 (3 nodes), majority is 2 and the EPaxos fast quorum is ⌈3/2⌉ + 1 = 3 — every node. For F=2 (5 nodes), majority is 3 and the fast quorum is ⌈6/2⌉ + 1 = 4 — one more than majority. The asymmetry grows: for F=10 (21 nodes), majority is 11 and the fast quorum is 16. EPaxos's fast quorum is roughly 75% of the cluster, not 50% + 1.
- "Asymmetric partitions affect EPaxos and FPaxos equally." Both can be hurt by asymmetric partitions, but EPaxos is more vulnerable: a unidirectional link failure between two replicas can cause one to think the other is up (its PreAccept arrives) while the other never responds, leading to slow-path fallback even when no real conflict exists. PreVote-style probes that detect bidirectional reachability help, but the original EPaxos paper does not specify them; production implementations layer them on.
Going deeper
The fast-quorum size proof — why ⌈3F/2⌉ + 1 and not majority
EPaxos's fast quorum is larger than a Paxos majority because the protocol needs an additional intersection property. Consider two commands c1 and c2 that conflict (touch the same key). Their command leaders run independent fast-path PreAccepts. Suppose both fast-paths succeed — each receives identical responses from its own fast quorum. Safety requires that, between the two dependency sets the leaders settle on, at least one of the two commands records the other as a dependency (otherwise the execution order between them is undefined). That, in turn, requires that the two fast quorums share a node that has seen both commands — and the order of arrival at that shared node defines the dependency direction.
The proof is in Moraru's PhD thesis (2014) and the SOSP 2013 paper. The result: any two fast quorums for conflicting commands must intersect at a ⌈F/2⌉ + 1 set, and a fast quorum size that guarantees this in a 2F+1 cluster is ⌈3F/2⌉ + 1. The classical Paxos majority F+1 is not enough because two majorities can intersect at exactly one node, and that node might not have seen both commands' PreAccepts. The fast quorum's larger size guarantees a thicker intersection.
Mencius — round-robin leadership as a halfway point
Mencius (Mao, Junqueira, Marzullo, OSDI 2008) is a third design point between Multi-Paxos and EPaxos. It assigns each consensus instance to a fixed leader in round-robin order — instance 0 to node 0, instance 1 to node 1, etc. — so the leader role is distributed, but each individual decision still has a designated leader. This avoids EPaxos's dependency-DAG complexity while still distributing the originator-to-leader hop cost. The cost is that progress on instance i requires hearing from node i % N, so a slow node throttles the entire log. Mencius is rarely deployed in production today (the "skip" mechanism it uses to handle slow leaders is hard to get right), but its design hints at why production systems gravitate toward Multi-Raft + sharding (one leader per shard, naturally distributing the leader role) instead of leaderless protocols.
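A toy sketch of the round-robin ownership rule and the head-of-line blocking it creates — an illustration of the idea only, not the OSDI 2008 protocol with its skip messages:

# mencius_round_robin.py — instance i is owned by node i % N; execution is in instance order.
N = 3
decided = {0: "a", 1: "b", 3: "d", 4: "e"}   # node 2 is slow: instance 2 is still undecided

executed = []
for i in range(6):
    if i not in decided:
        print(f"stalled at instance {i} (owner node {i % N} has not decided or skipped it)")
        break
    executed.append(decided[i])
print("executed so far:", executed)          # ['a', 'b'] — instances 3 and 4 wait behind instance 2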
WPaxos — weighted quorums from FPaxos
WPaxos (Ailijiang et al., DSN 2017) extends Flexible Paxos with weighted quorums: each replica has a weight, and a quorum is a set whose total weight exceeds a threshold. This lets you place 3 replicas in your primary AZ each with weight 3, and 2 replicas in a backup AZ with weight 1, and define Q2 as "total weight ≥ 4" — a single primary-AZ replica's weight is 3, so paired with one other replica it forms a Q2. The mathematical trick is the same intersection theorem; the operational benefit is fine-grained control over where writes can complete. Production deployments (Microsoft's Service Fabric ring, some CockroachDB internal-replication features) use weighted-quorum logic without naming it as WPaxos.
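A weighted-quorum sketch following the example above — the weights and thresholds are this chapter's illustration, not parameters from the WPaxos paper. A quorum is any replica set whose total weight meets the threshold, and every Q1 is guaranteed to meet every Q2 iff the two thresholds sum to more than the total weight.

# weighted_quorums.py — brute-force check of weighted-quorum intersection.
from itertools import chain, combinations

weights = {"p1": 3, "p2": 3, "p3": 3, "b1": 1, "b2": 1}  # primary AZ x3 (weight 3), backup AZ x2 (weight 1)
total = sum(weights.values())                             # 11

def quorums(threshold):
    nodes = list(weights)
    subsets = chain.from_iterable(combinations(nodes, k) for k in range(1, len(nodes) + 1))
    return [set(s) for s in subsets if sum(weights[n] for n in s) >= threshold]

q2 = quorums(4)   # e.g. one primary replica (weight 3) plus one backup (weight 1)
q1 = quorums(8)   # 8 + 4 > 11, so every Q1 meets every Q2
print(all(a & b for a in q1 for b in q2))          # True
print(all(a & b for a in quorums(7) for b in q2))  # False: 7 + 4 = 11 is not > 11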
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
# save epaxos_fast_path.py from the article body
python3 epaxos_fast_path.py
# Then explore a real EPaxos implementation:
git clone https://github.com/efficient/epaxos
cd epaxos && go build ./...
# The reference implementation has a fast-path / slow-path switch in src/epaxos/epaxos.go
# Read the PreAcceptHandler function — it is the protocol's heart in ~120 lines.
Where this leads next
EPaxos and Flexible Paxos are the leading "post-Raft" consensus protocols, but neither is a drop-in replacement for production-quality Raft libraries. The practical pattern most modern systems converge on is sharding (one Raft group per shard, distributing the leader role across the cluster) plus geo-aware placement. The next chapters explore both directions.
- Multi-Raft and sharding consensus — chapter 54; how thousands of Raft groups are run in parallel, the heartbeat-batching trick, and how MultiRaft sidesteps the single-leader bottleneck without changing the underlying protocol.
- Byzantine consensus — PBFT, HotStuff — chapter 55; consensus when nodes can lie, not just crash. The leaderless / asymmetric-quorum themes recur, with stronger adversarial assumptions.
- Geo-distribution and write-local-read-anywhere — Part 17; how Spanner, Cosmos DB, and CockroachDB shape their replication around the WAN-RTT problem.
- Failure detection — phi-accrual — Part 10; the heartbeat-quality estimator that makes any consensus protocol's election-timeout tunable.
Part 9 (leader election and leases) follows directly. The fencing-token machinery there is what production systems use to prevent the split-brain problem that asymmetric quorums and leaderless protocols both make harder.
References
- Moraru, Andersen, Kaminsky, "There Is More Consensus in Egalitarian Parliaments" — SOSP 2013 — the original EPaxos paper. Sections 3 (protocol) and 4 (correctness) are the load-bearing reads.
- Howard, Malkhi, Spiegelman, "Flexible Paxos: Quorum intersection revisited" — OPODIS 2016 — the FPaxos theorem and grid-quorum construction.
- Moraru, "Egalitarian Distributed Consensus" — Carnegie Mellon PhD thesis 2014 — long-form treatment of EPaxos, including the dependency-recovery protocol.
- Ailijiang, Charapko, Demirbas, "WPaxos: Wide-Area Network Flexible Consensus" — DSN 2017 — weighted quorums on top of FPaxos.
- Mao, Junqueira, Marzullo, "Mencius: Building Efficient Replicated State Machines for WANs" — OSDI 2008 — round-robin leader rotation, the design midpoint between Multi-Paxos and EPaxos.
- efficient/epaxos — reference Go implementation — the SOSP 2013 paper's authors' implementation, used as a benchmark in many later papers.
- Raft in detail — chapter 52; the leader-based baseline that EPaxos and Flexible Paxos are reacting against.
- Paxos and why people struggle with it — chapter 51; classical Paxos as the foundation FPaxos generalises.