EPaxos and Flexible Paxos

It is 19:42 IST on a Friday, and PaySetu's settlement service is paying a 110 ms tail penalty on every cross-region write because the Multi-Raft leader for shard 042 happens to live in Mumbai while the write originated in Singapore. Aditi, the platform-tech lead, is staring at a flame graph that says: 9 ms application logic, 12 ms local fsync, 89 ms WAN round-trip to the leader. The cluster is correct. The cluster is fast on paper. The cluster is wasting 80% of every cross-region write on a leader that did not need to be there. Why this exact frame is the entry point to EPaxos and Flexible Paxos: classical Multi-Paxos and Raft both bake in the assumption that one node is the leader, and every write goes through it. For a single-region deployment that assumption is fine — the leader-RTT is 0.5 ms and you do not notice. For a geo-distributed deployment the leader-RTT is 80–250 ms across continents, and the leader becomes the dominant cost. The two protocols in this chapter are the two clean answers: kill the leader entirely (EPaxos), or change the quorum shape so the leader matters less (Flexible Paxos).

EPaxos is a leaderless consensus protocol where any replica can propose a command; commands that do not interfere with each other commit in one round-trip (a "fast quorum" larger than a simple majority: 2F of the 2F + 1 replicas in the simple variant, F + ⌊(F+1)/2⌋ in the optimized one), and conflicting commands fall back to a two-round Paxos. Flexible Paxos generalises Paxos's "majority quorum" to two intersecting quorum sets Q1 (used in phase 1, leader election) and Q2 (used in phase 2, accepting writes); the only requirement is Q1 ∩ Q2 ≠ ∅. Together they break the assumption that consensus needs a single fixed leader and a single quorum size — the building blocks for systems where the write originator is geographically far from "the leader".

Why a single leader is the wrong shape for geo-distribution

Multi-Paxos and Raft both run a stable leader; every write goes through it. In a single region this is the right call — the leader's local-quorum RTT is sub-millisecond, and steering every write through one node simplifies the protocol enormously (no concurrency, no command-dependency tracking, no fast-path/slow-path split). The cost only becomes visible when you stretch the cluster across regions. If your replicas live in Mumbai, Singapore, and Frankfurt, and the leader is in Mumbai, then a write originated in Frankfurt pays one Frankfurt→Mumbai hop (200 ms) before the leader can even start its quorum round. Why this is structural and not a tuning problem: the leader is a serialisation point. Even if you make the network faster, the leader still has to receive the write, append it locally, replicate to a majority, and ack — and the receive step is what kills geo-latency, because the originator is by definition not the leader most of the time. You cannot tune your way out of "the leader is in the wrong AZ for this client".
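The leader-hop arithmetic is worth making concrete. A minimal sketch, assuming a hypothetical three-replica topology and made-up one-way latencies (every name and number below is illustrative, not measured):

```python
# Hypothetical one-way latencies in ms between regions; illustrative only.
OWD = {("frankfurt", "mumbai"): 100, ("frankfurt", "singapore"): 110,
       ("mumbai", "singapore"): 30}

def owd(a, b):
    return 0 if a == b else OWD.get((a, b), OWD.get((b, a)))

def leader_write_rtt(origin, leader, replicas):
    # Originator -> leader hop (request + response), then the leader replicates
    # in parallel and waits for the nearest acks that complete a majority.
    to_leader = 2 * owd(origin, leader)
    ack_rtts = sorted(2 * owd(leader, r) for r in replicas if r != leader)
    majority_acks = len(replicas) // 2   # leader itself counts toward the majority
    return to_leader + ack_rtts[majority_acks - 1]

def leaderless_write_rtt(origin, replicas, quorum):
    # The originator runs the quorum round itself; it counts toward the
    # quorum, so it waits for only the (quorum - 1) nearest acks.
    ack_rtts = sorted(2 * owd(origin, r) for r in replicas if r != origin)
    return ack_rtts[quorum - 2]

replicas = ["mumbai", "singapore", "frankfurt"]
print(leader_write_rtt("frankfurt", "mumbai", replicas))   # leader-hop dominates: 260 ms
print(leaderless_write_rtt("frankfurt", replicas, 2))      # no leader-hop: 200 ms
```

Even granting the leader a generous nearest-majority assumption, the originator-to-leader hop is pure overhead that the leaderless path never pays.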

The two protocols in this chapter take orthogonal approaches:

  - EPaxos removes the leader entirely: any replica can propose, and only genuinely conflicting commands pay extra coordination.
  - Flexible Paxos keeps the leader but reshapes the quorums, so the expensive cross-region quorum is paid on rare leader elections rather than on every write.

[Figure: Leader-based vs leaderless write paths under geo-distribution. Left panel: a leader-based protocol (Multi-Paxos / Raft) with the leader in Mumbai — a write originating in Frankfurt pays a 200 ms hop to the leader before the quorum round even starts, for a total write RTT of roughly 480 ms; the leader-hop dominates. Right panel: EPaxos, where the Frankfurt replica acts as command leader and runs the fast-quorum round directly in parallel, for a total write RTT of roughly 110 ms — no leader-hop. For non-conflicting commands, EPaxos pays one fast-quorum RTT instead of leader-hop + quorum RTT.]
Illustrative — leader-based vs leaderless geo-distributed write path. The exact numbers depend on inter-region RTT; the structural difference is whether the originator pays a "to the leader" hop before the quorum round even starts.

EPaxos — leaderless commits via interference graphs

EPaxos's central observation: most pairs of commands in a real workload do not interfere with each other. A PUT key=A and a PUT key=B commute — the order in which they apply does not matter. Only commands that touch the same key (or, more generally, share a serialisable resource) need to agree on an order. So instead of forcing every command through one leader to establish a global total order, EPaxos lets each replica be the command leader for its own proposals, and tracks per-command dependency sets that record "this command must apply after these other commands".

The protocol has two paths:

  - Fast path (1 round-trip): the command leader sends PreAccept, carrying the command and the dependencies it knows about, to a fast quorum. If every response reports the same dependency set, the command commits immediately.
  - Slow path (2 round-trips): if the responses disagree, the command leader unions the reported dependencies and runs a classic Paxos-Accept round on the result before committing.

The fast quorum is larger than a majority because EPaxos needs a stronger property: every two fast quorums for conflicting commands must intersect in enough nodes that the two commands cannot both take the fast path with disagreeing dependencies, and that a recovery quorum can later reconstruct what was fast-committed. The exact size is derived from the protocol's safety proof; the take-away is that EPaxos's fast quorum is a 4-of-5 in a 5-node cluster vs Paxos's 3-of-5 majority.
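The relationship between the quorum sizes can be tabulated directly — a small sketch, using the sizes from the EPaxos paper (2F for the simple variant, F + ⌊(F+1)/2⌋ for the optimized one):

```python
def quorum_sizes(f):
    """Quorum sizes in an N = 2F + 1 cluster."""
    n = 2 * f + 1
    return {
        "n": n,
        "paxos_majority": f + 1,
        "epaxos_fast_simple": 2 * f,               # every replica but one
        "epaxos_fast_optimized": f + (f + 1) // 2, # the paper's optimized variant
    }

for f in (1, 2, 3):
    print(quorum_sizes(f))
```

At F = 2 this gives the 4-of-5 (simple) vs 3-of-5 (majority) comparison used throughout this chapter.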

In production-faithful workloads (Moraru's SOSP 2013 paper measures Spanner-style key-value workloads with 1–10% conflict rates), 95–98% of commands take the fast path. The slow path is for hot keys and contention, and is exactly the cost Multi-Paxos pays on every command. The amortised cost is 1.05 round-trips per command vs Multi-Paxos's leader-hop + 1 round-trip. Why the dependency-tracking machinery is the price of leaderlessness: without a global serialisation point, you cannot rely on a leader's log order to define "before". EPaxos replaces total order with a partial order — a DAG of commands, where the only edges are between commands that genuinely interfere. The DAG is the equivalent of Multi-Paxos's log, but it grows as a graph rather than a sequence, and the per-command leader records its incoming edges into the DAG via the dependency set.
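The DAG-as-log idea is easy to sketch. Assuming hypothetical committed commands with their agreed dependency sets, execution is any topological order of the DAG (the cycle/SCC case the real protocol must handle is ignored here for brevity):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical committed commands: id -> dependency set it must execute after.
committed_deps = {
    "c1": set(),    # PUT A=1
    "c2": set(),    # PUT B=1 — commutes with c1, no edge between them
    "c3": {"c1"},   # PUT A=2 — interferes with c1, so it carries a dependency
}

# Any topological order of the dependency DAG is a valid execution order;
# replicas may interleave c1 and c2 differently and still converge.
order = list(TopologicalSorter(committed_deps).static_order())
print(order)  # c1 appears before c3; c2 can land anywhere
```

The point of the partial order is exactly this freedom: non-interfering commands need no agreement on relative order, only interfering ones do.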

A working EPaxos fast-path simulator

# epaxos_fast_path.py — minimal EPaxos fast-path simulator: 5 replicas, simulated
# inter-region latencies, two clients submitting non-conflicting and conflicting writes.
import collections, heapq, random

REPLICAS = ["mumbai", "singapore", "frankfurt", "tokyo", "sydney"]
RTT = {  # one-way ms between replicas
    ("mumbai","singapore"): 30, ("mumbai","frankfurt"): 100, ("mumbai","tokyo"): 60, ("mumbai","sydney"): 75,
    ("singapore","frankfurt"): 110, ("singapore","tokyo"): 35, ("singapore","sydney"): 50,
    ("frankfurt","tokyo"): 130, ("frankfurt","sydney"): 145, ("tokyo","sydney"): 55,
}
def lat(a, b):
    if a == b: return 0
    return RTT.get((a,b)) or RTT.get((b,a))

events, log = [], []
def schedule(t, kind, *args): heapq.heappush(events, (t, kind, *args))

class Replica:
    def __init__(self, nid):
        self.nid = nid
        self.deps = collections.defaultdict(set)  # key -> set of command_ids that touched it
    def pre_accept(self, cmd_id, key, leader, t):
        seen_deps = set(self.deps[key])
        self.deps[key].add(cmd_id)
        schedule(t + lat(self.nid, leader), "pre_accept_resp", leader, cmd_id, frozenset(seen_deps))

replicas = {n: Replica(n) for n in REPLICAS}
fast_quorum = 4   # simple EPaxos fast quorum: 2F = N - 1, with F=2 in 5 nodes

# Two writes from frankfurt: one to key=A, one to key=B — different keys, so they never conflict.
# A third write to key=A from singapore conflicts with c1; per-replica arrival order decides
# which of the two fast-commits and which falls back to the slow path.
def submit(cmd_id, key, leader, t):
    log.append((t, f"{leader} proposes {cmd_id} key={key}"))
    for r in REPLICAS:
        schedule(t + lat(leader, r), "pre_accept", r, cmd_id, key, leader)

# Track responses
responses = collections.defaultdict(list)
committed_at = {}

def step(now, kind, *args):
    if kind == "pre_accept":
        replica, cmd_id, key, leader = args
        replicas[replica].pre_accept(cmd_id, key, leader, now)
    elif kind == "pre_accept_resp":
        leader, cmd_id, deps = args
        responses[cmd_id].append(deps)
        if cmd_id in committed_at: return
        if len(responses[cmd_id]) == fast_quorum:
            unique = set(responses[cmd_id])
            if len(unique) == 1:
                committed_at[cmd_id] = now
                log.append((now, f"{leader} fast-commit {cmd_id} deps={set(next(iter(unique)))}"))
            else:
                log.append((now, f"{leader} slow-path for {cmd_id} (deps disagree: {unique})"))

submit("c1", "A", "frankfurt", 0)
submit("c2", "B", "frankfurt", 5)
submit("c3", "A", "singapore", 30)   # conflicts with c1; depending on ordering may fast or slow path

while events:
    t, kind, *args = heapq.heappop(events); step(t, kind, *args)

for ts, msg in sorted(log): print(f"t={ts:6.1f}ms  {msg}")

Sample run (Python 3.11; the timings are deterministic given the latency table, though the printed ordering inside the deps-disagree set may vary between runs):

t=   0.0ms  frankfurt proposes c1 key=A
t=   5.0ms  frankfurt proposes c2 key=B
t=  30.0ms  singapore proposes c3 key=A
t= 130.0ms  singapore fast-commit c3 deps=set()
t= 260.0ms  frankfurt slow-path for c1 (deps disagree: {frozenset(), frozenset({'c3'})})
t= 265.0ms  frankfurt fast-commit c2 deps=set()
Per-line walkthrough. fast_quorum = 4 is the simple EPaxos fast quorum for F=2 in a 5-node cluster: 2F = 4. replicas[replica].pre_accept is the per-replica handler that records the command's dependency set for the touched key — the dependency set is what every replica reports back to the command leader. if len(unique) == 1: fast-commit is the fast-path gate: all responding replicas agreed on the same dependencies. else: slow-path is the Paxos-Accept fallback when responses disagree. Note what the run shows: although c1 was proposed first, c3's PreAccepts reach mumbai, singapore, tokyo, and sydney before c1's, so c3's first four responses all report empty dependencies and c3 fast-commits at t=130. c1's responders split — frankfurt saw c1 first and reports no dependencies, while the others saw c3 first and report {c3} — so c1 falls to the slow path. Proposal time does not decide the fast path; per-replica arrival order does. Why disagreement on deps forces the slow path: if two command leaders both fast-committed conflicting commands with different dependency sets, two replicas applying the DAG could derive different total orders during execution. The slow path runs an explicit Paxos-Accept round on the unioned dependency set, forcing all replicas to agree on the same set before the command is allowed to commit. The slow path costs 2 RTTs vs the fast path's 1 RTT.

EPaxos's fast-path probability depends entirely on conflict rate. A workload with 0% conflicts (every write touches a different key) takes the fast path 100% of the time; a workload with 50% conflicts on a hot key takes the fast path roughly 50% of the time. PaySetu measured its settlement-service workload at 3.2% inter-key conflict rate; the fast-path rate at that conflict level was 96.4%, with the geo-distributed write p99 dropping from 480 ms (Multi-Raft with leader in Mumbai) to 134 ms (EPaxos with command leader at the originating region).
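The amortised-cost arithmetic behind these numbers is one line — a sketch, assuming the fast path costs exactly 1 RTT and the slow path exactly 2:

```python
def amortized_rtts(fast_path_rate):
    # Fast path commits in 1 round-trip; the slow path adds a Paxos-Accept round.
    return fast_path_rate * 1 + (1 - fast_path_rate) * 2

print(round(amortized_rtts(0.95), 3))  # ~1.05, the figure quoted above
print(round(amortized_rtts(0.30), 3))  # ~1.7 on a hot-key workload
```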

Flexible Paxos — the quorum theorem and grid quorums

Flexible Paxos starts from a different observation. Classical Paxos says both phases use majority quorums of size ⌈(N+1)/2⌉. Howard, Malkhi, and Spiegelman (OPODIS 2016) proved a stronger theorem: the only requirement is that any phase-1 quorum (Q1) intersects any phase-2 quorum (Q2). They are not required to be majorities, and they are not required to be the same size.

The intuition. Phase 1 (prepare) is run when a new leader takes over — it asks a quorum of replicas to promise not to accept proposals from older ballots, and it returns the highest-numbered accepted value. Phase 2 (accept) is run on every command — the leader asks a quorum to accept the value. The safety-critical interaction is: when a new leader runs phase 1, it must observe any value that a previous leader could have committed in phase 2. As long as Q1 and Q2 intersect, the new leader sees at least one replica that participated in the previous leader's accept quorum, and that replica reports back the previously-accepted value.
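The intersection condition can be checked mechanically. A minimal sketch over size-based quorum families, where any two families intersect iff |Q1| + |Q2| > N:

```python
from itertools import combinations

def intersects(q1_family, q2_family):
    """FPaxos safety: every phase-1 quorum must intersect every phase-2 quorum."""
    return all(q1 & q2 for q1 in q1_family for q2 in q2_family)

nodes = set(range(5))

def size_quorums(size):
    # All subsets of a given size — the "any q nodes" quorum family.
    return [set(c) for c in combinations(nodes, size)]

print(intersects(size_quorums(3), size_quorums(3)))  # classic majorities: True
print(intersects(size_quorums(4), size_quorums(2)))  # asymmetric Q1=4, Q2=2: True
print(intersects(size_quorums(2), size_quorums(2)))  # 2 + 2 <= 5: False
```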

This unlocks several useful quorum shapes:

  - Majority / majority — classical Paxos, the symmetric baseline: Q1 = Q2 = ⌊N/2⌋ + 1.
  - Small Q2, large Q1 — e.g. Q2 = 2 and Q1 = 4 in a 5-node cluster. Every write touches only two replicas; the price is paid on leader election, which is rare.
  - Grid quorums — arrange N = rows × cols replicas in a grid, with Q1 = any full column and Q2 = any full row; every row intersects every column, so the safety condition holds by construction.

The third shape is the production sweet spot for geo-distributed deployments: Spanner-style "every region writes locally, leader changes are rare and globally coordinated". Cosmos DB's Bounded Staleness mode uses this; CockroachDB's geo-partitioned cluster mode uses a similar grid-quorum pattern.

[Figure: Flexible Paxos grid quorum on a 3×3 cluster — nine replicas N1–N9 arranged by region (rows: Mumbai, Singapore, Frankfurt) and AZ (columns: AZ-a, AZ-b, AZ-c). Q2 (write) is one full row, e.g. N1, N2, N3: Mumbai writes locally. Q1 (election) is one full column, e.g. N1, N4, N7: one node per region. Any column intersects any row at exactly one node, so Q1 ∩ Q2 ≠ ∅ always holds, and the cluster tolerates F = 2 failures.]
Illustrative — Flexible Paxos grid-quorum layout for a 9-node geo-distributed cluster. The grid lets a region commit writes locally (Q2 = one row) while elections still cross regions (Q1 = one column). Q1 ∩ Q2 = exactly one node, satisfying the FPaxos safety condition.
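The grid construction can be verified in a few lines — a sketch assuming a 3×3 layout with rows as write quorums and columns as election quorums:

```python
# 3x3 grid: node (region, az); Q2 = any full row, Q1 = any full column.
ROWS, COLS = 3, 3
rows = [{(r, c) for c in range(COLS)} for r in range(ROWS)]  # write quorums
cols = [{(r, c) for r in range(ROWS)} for c in range(COLS)]  # election quorums

# Every column meets every row in exactly one node, so Q1 ∩ Q2 ≠ ∅ by construction.
assert all(len(q1 & q2) == 1 for q1 in cols for q2 in rows)

# A write touches 3 of 9 nodes (one region); a plain majority would need 5 of 9.
print(len(rows[0]), 9 // 2 + 1)  # 3 5
```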

The trade-off is operational. A smaller Q2 makes writes cheaper and more available — fewer replicas must be up and reachable to commit — but the corresponding Q1 must grow, and the cluster becomes more vulnerable to losing the right combination of nodes. The grid quorum tolerates F = min(rows − 1, cols − 1) failures; with a 3×3 grid, you survive 2 failures, same as a 5-node majority cluster, but you have 9 nodes to operate. The win is geo-locality, not fault tolerance.

Where each protocol breaks — failure modes you must enumerate

Both protocols add complexity that classical Paxos / Raft do not have, and complexity is where production bugs live.

EPaxos failure modes. (a) High conflict rate degrades to two-RTT path. A workload that accidentally creates a hot key (think: BharatBazaar's flash-sale, every write touches inventory:item-7892) drops EPaxos's fast-path rate from 96% to under 30%, and the protocol's amortised cost balloons from 1.05 RTTs to ~1.7 RTTs. (b) Recovery from a failed command leader requires querying the entire dependency graph. When a command leader crashes mid-PreAccept, the recovery quorum must reconstruct the command's dependency set by examining what other replicas have seen — this is more complex than Raft's "just elect a new leader and replicate the leader's log". (c) Execution stalls on dependency cycles. EPaxos commits commands in DAG order, but if the DAG has a long dependency chain, command execution latency can be much higher than commit latency — the user-visible result of a write is delayed until all its dependencies execute. The reference implementation (efficient/epaxos on GitHub) handles this with strongly-connected-component detection in the DAG.

Flexible Paxos failure modes. (a) Asymmetric quorum sizes change failure tolerance asymmetrically. With Q1 = 5, Q2 = 1 in a 5-node cluster, you survive 4 failures for writes but only 0 failures for elections — losing one node makes elections impossible. The classic majority-majority configuration is the only one that gives symmetric F = (N-1)/2 tolerance. (b) Region-affinity write quorums create stale reads in other regions. A grid-quorum cluster where Mumbai writes to its own row sees that write replicate to Singapore and Frankfurt only after the next election or asynchronous replication — cross-region reads can be 30 seconds stale unless the read path explicitly waits for global propagation. (c) The grid layout assumes uniform replica counts per row/column. A real geo-cluster usually has 3+1+1 (one region with 3 replicas, two regions with 1 each) for cost reasons; the FPaxos quorum theorem still applies but the elegant grid intuition breaks, and you have to compute custom quorum sets that satisfy Q1 ∩ Q2 ≠ ∅.
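Failure mode (a) is easy to quantify for size-based quorums: with quorum size q in an n-node cluster, any n − q failures still leave a quorum, and the (n − q + 1)-th can block the phase. A sketch:

```python
def max_tolerated(n, q_size):
    # Largest failure count that still leaves enough live nodes to form the quorum.
    return n - q_size

n = 5
for q1, q2 in [(3, 3), (5, 1), (4, 2)]:
    print(f"Q1={q1}/Q2={q2}: elections survive {max_tolerated(n, q1)}, "
          f"writes survive {max_tolerated(n, q2)} failures")
```

The Q1 = 5, Q2 = 1 row reproduces the pathology above: four failures for writes, zero for elections.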

A 2024 PaySetu post-incident review found that the team's first attempt to roll out EPaxos for cross-region settlement writes reverted to Multi-Raft after three weeks. The blocking issue was not consensus latency — that was as predicted, ~134 ms p99 — but the operational complexity of the dependency-DAG-aware backup and snapshot tooling. The team built a hybrid: Flexible Paxos with Q1 = 5, Q2 = 3 and region-affine Q2 placement, which gave 88 ms p99 cross-region writes without the DAG complexity. The lesson: leaderless protocols have real performance wins, but the operational tooling lags 5–10 years behind leader-based protocols, and you should budget for that gap.

Common confusions

  - "EPaxos has no consensus round on the fast path." It does — PreAccept to a fast quorum is a quorum round; what disappears is the designated-leader hop, not quorum communication.
  - "Commit latency equals execution latency in EPaxos." No: a command commits once its dependency set is agreed, but executes only after all its dependencies execute — a long dependency chain delays the user-visible result well past commit.
  - "Flexible Paxos buys extra fault tolerance." It does not; it redistributes the tolerance you already have between the write path and the election path, and asymmetric configurations can make elections dramatically more fragile.

Going deeper

The fast-quorum size proof — why larger than a majority

EPaxos's fast quorum is larger than a Paxos majority because the protocol needs an additional intersection property. Consider two commands c1 and c2 that conflict (touch the same key). Their command leaders run independent fast-path PreAccepts. Suppose both fast-paths succeed — each receives identical responses from its own fast quorum. Safety requires: the dependency set both leaders settle on must include at least one of the two commands depending on the other (otherwise execution order is undefined). That, in turn, requires that the two fast quorums share a node that has seen both commands — and the order of arrival at that shared node defines the dependency direction.

The proof is in Moraru's PhD thesis (2014) and the SOSP 2013 paper. The result: the simple protocol uses fast quorums of 2F (every replica but one) in a 2F + 1 cluster, and the optimized protocol shrinks this to F + ⌊(F+1)/2⌋ — in both cases larger than a majority. The classical Paxos majority of F + 1 is not enough because two majorities can intersect at exactly one node, and that node might not have seen both commands' PreAccepts. The larger fast quorum also serves recovery: after F failures, a recovery quorum must still be able to tell whether a command fast-committed, which requires a thicker intersection between the fast quorum and any recovery quorum.

Mencius — round-robin leadership as a halfway point

Mencius (Mao, Junqueira, Marzullo, OSDI 2008) is a third design point between Multi-Paxos and EPaxos. It assigns each consensus instance to a fixed leader in round-robin order — instance 0 to node 0, instance 1 to node 1, etc. — so the leader role is distributed, but each individual decision still has a designated leader. This avoids EPaxos's dependency-DAG complexity while still distributing the originator-to-leader hop cost. The cost is that progress on instance i requires hearing from node i % N, so a slow node throttles the entire log. Mencius is rarely deployed in production today (the "skip" mechanism it uses to handle slow leaders is hard to get right), but its design hints at why production systems gravitate toward Multi-Raft + sharding (one leader per shard, naturally distributing the leader role) instead of leaderless protocols.

WPaxos — weighted quorums from FPaxos

WPaxos (Ailijiang et al., DSN 2017) extends Flexible Paxos with weighted quorums: each replica has a weight, and a quorum is a set whose total weight exceeds a threshold. This lets you place 3 replicas in your primary AZ each with weight 3, and 2 replicas in a backup AZ with weight 1, and define Q2 as "total weight ≥ 4" — a single primary-AZ replica's weight is 3, so paired with one other replica it forms a Q2. The mathematical trick is the same intersection theorem; the operational benefit is fine-grained control over where writes can complete. Production deployments (Microsoft's Service Fabric ring, some CockroachDB internal-replication features) use weighted-quorum logic without naming it as WPaxos.
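The weighted-intersection rule can be sketched directly. The weights below mirror the hypothetical example above (three primary-AZ replicas at weight 3, two backups at weight 1); the replica names are made up:

```python
# Hypothetical weights: three primary-AZ replicas at weight 3, two backups at weight 1.
weights = {"p1": 3, "p2": 3, "p3": 3, "b1": 1, "b2": 1}
total = sum(weights.values())  # 11

def is_quorum(members, threshold):
    return sum(weights[m] for m in members) >= threshold

T2 = 4                   # write threshold from the example above
T1 = total - T2 + 1      # smallest election threshold guaranteeing overlap: 8

print(is_quorum({"p1", "b1"}, T2))   # one primary + one backup: True
print(is_quorum({"b1", "b2"}, T2))   # backups alone: False
print(T1 + T2 > total)               # weighted intersection condition: True
```

The intersection argument is the same as FPaxos's size-based one, with node counts replaced by weight sums: two quorums with thresholds T1 and T2 must overlap whenever T1 + T2 exceeds the total weight.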

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
# save epaxos_fast_path.py from the article body
python3 epaxos_fast_path.py

# Then explore a real EPaxos implementation:
git clone https://github.com/efficient/epaxos
cd epaxos && go build ./...
# The reference implementation has a fast-path / slow-path switch in src/epaxos/epaxos.go
# Read the PreAcceptHandler function — it is the protocol's heart in ~120 lines.

Where this leads next

EPaxos and Flexible Paxos are the leading "post-Raft" consensus protocols, but neither is a drop-in replacement for production-quality Raft libraries. The practical pattern most modern systems converge on is sharding (one Raft group per shard, distributing the leader role across the cluster) plus geo-aware placement. The next chapters explore both directions.

Part 9 (leader election and leases) follows directly. The fencing-token machinery there is what production systems use to prevent the split-brain problem that asymmetric quorums and leaderless protocols both make harder.

References

  1. Moraru, Andersen, Kaminsky, "There Is More Consensus in Egalitarian Parliaments" — SOSP 2013 — the original EPaxos paper. Sections 3 (protocol) and 4 (correctness) are the load-bearing reads.
  2. Howard, Malkhi, Spiegelman, "Flexible Paxos: Quorum intersection revisited" — OPODIS 2016 — the FPaxos theorem and grid-quorum construction.
  3. Moraru, "Egalitarian Distributed Consensus" — Carnegie Mellon PhD thesis 2014 — long-form treatment of EPaxos, including the dependency-recovery protocol.
  4. Ailijiang, Charapko, Demirbas, "WPaxos: Wide-Area Network Flexible Consensus" — DSN 2017 — weighted quorums on top of FPaxos.
  5. Mao, Junqueira, Marzullo, "Mencius: Building Efficient Replicated State Machines for WANs" — OSDI 2008 — round-robin leader rotation, the design midpoint between Multi-Paxos and EPaxos.
  6. efficient/epaxos — reference Go implementation — the SOSP 2013 paper's authors' implementation, used as a benchmark in many later papers.
  7. Raft in detail — chapter 52; the leader-based baseline that EPaxos and Flexible Paxos are reacting against.
  8. Paxos and why people struggle with it — chapter 51; classical Paxos as the foundation FPaxos generalises.