Wall: coordination is sometimes necessary
KapitalKite's order-router is everything Part 7 promised. Every outbound RPC is wrapped in a 250 ms timeout. Every retry has exponential backoff with full jitter and a 10% retry-budget cap. Each downstream lives behind a circuit breaker that opens at five consecutive failures, and the trade-engine pool is bulk-headed away from the market-data pool so a slow ticker cannot drink the order thread budget. Token-bucket rate limiters fence the upstream broker; load shedding kicks in at 92% CPU; the degradation ladder slides from "personalised order suggestions" to "static top-100 list" in 50 ms.
On the morning of 19 March 2026 at 09:17:53 IST — three minutes after market open — order-router-leader-3 enters a 9-second IO stall because the EBS volume holding its WAL has degraded to 12 ms p99 fsync latency. The breaker opens. The shedding kicks in. The degraded mode serves 1.4 lakh users a flat ranker. Beautiful. Meanwhile, in another AZ, order-router-leader-7 — which was supposed to be a follower — has noticed leader-3 missing two heartbeats, declared itself leader, and started accepting writes against the same kapitalkite-orders shard. For 11 seconds, two leaders accept the same trades. Riya at desk 4 places a buy of 200 shares of an IT large-cap; the order is committed twice with two different sequence numbers; the broker accepts both.
Why no Part 7 pattern saves this: timeouts, retries, breakers, bulkheads, rate limits, shedding, and degradation are all local — each is a defence the caller runs against an uncooperative callee. Two replicas disagreeing about which of them is the leader is not a caller-vs-callee problem; it is a peer-vs-peer problem, and no defensive code on the caller's side can settle it. The settlement requires the two replicas to agree — and agreement, in a distributed system, has a name and a price.
Eight chapters of reliability patterns taught your service to defend itself against unreliable peers. They cannot teach two replicas to agree on a leader, a sequence number, or a committed write. When state must be consistent across nodes, the answer is consensus — Paxos, Raft, leases with fencing — not more retries. Part 7 keeps the request alive; Part 8 makes the system agree about what happened.
Why this is a wall, not a chapter
The eight chapters of Part 7 — retries with backoff, circuit breakers, bulkheads, timeouts and deadline propagation, rate limiting, load shedding, degradation modes, and the patterns that compose them — were about one stance: the caller defends itself against the callee. Every algorithm assumed that the caller could decide unilaterally how to react when the callee misbehaved. Drop the request. Try a different replica. Open the breaker. Fall back to a cache. The caller is in charge; the callee's failure is the caller's problem to route around.
This stance breaks down the moment two callees must agree with each other. Specifically:
- Two replicas, both alive, must decide which one is the leader. A timeout on the caller does nothing — both replicas accept connections; both think they are leader.
- A sequence number must be unique across N nodes. If two nodes mint sequence number 4319 in the same millisecond, no retry rescues you — the duplicate is a correctness violation, not a transient failure.
- A write must be either committed everywhere or nowhere. The caller cannot retry into "committed on 2 of 3 replicas" — that is a state the protocol has to define, and the protocol has to make every replica agree on which final state the system landed in.
- A leader's lease must expire before another leader takes over. "Wait long enough" is not a guarantee when clocks drift and GC pauses last 14 seconds (chapter 41's lesson, now compounded).
- A schema migration must be visible everywhere or nowhere. A degraded fallback to "ignore the new column" is wrong — half your callers see one schema, half see another, and the database is suddenly answering the same query two different ways.
None of these is a reliability problem. The caller is healthy; the network is fine; every breaker is closed. The system still produces a wrong answer, because two nodes are doing the right local thing and the combination is incoherent. The coordination layer — consensus protocols, leader election with fencing tokens, atomic commit, and the lease-and-expiry primitives that anchor them — is the part of the system that survives this disagreement. It is the topic of Part 8 (consensus), Part 9 (leader election and leases), and the foundation everything from Part 12 (consistency) to Part 14 (distributed transactions) builds on.
Why "single-caller decisions cannot settle peer disagreement" is not hand-waving: every Part 7 pattern is a function of one variable — the caller's local state. A retry decision uses the caller's failure history. A breaker uses the caller's recent success rate. A shed decision uses the caller's CPU. None of them ever asks another node "what do you think?" — and none of them can, because asking would be a round-trip with the same unreliability that motivated Part 7 in the first place. The moment the question is "do you and I agree on which of us is the leader?", you must do the round-trip and you must define what "agreement" means under partial reachability. That is what Paxos, Raft, leases, and quorums solve, and it is why Part 8 is structurally separate from Part 7 in the curriculum.
A measurable demonstration — perfect Part 7, broken state
The script below is honest about the limits of defensive coding. We run a 3-replica order-router cluster. Every replica has perfect Part 7 hygiene — 250 ms timeouts, retries with jitter, breakers, bulkheads, the lot. We then partition the network so the leader (r0) cannot reach r1 and r2, but is still reachable from clients. We measure two things: how the caller sees the system (does the request succeed?) and what state the system actually contains (which replica thinks it is leader, what sequence number got assigned).
# wall_coord_breaks.py — perfect Part 7 hygiene, no consensus, split-brain happens
import asyncio
import time

class Replica:
    def __init__(self, name, peers):
        self.name, self.peers = name, peers
        self.role = "follower"
        self.term = 0
        self.last_heartbeat = time.monotonic()
        self.committed = []               # local log
        self.partitioned_from = set()     # peers we cannot reach

    async def heartbeat(self):
        while True:
            await asyncio.sleep(0.05)
            now = time.monotonic()
            if self.role == "leader":
                for p in self.peers:
                    if p.name in self.partitioned_from:
                        continue
                    p.last_heartbeat = now          # heartbeat propagates
            elif now - self.last_heartbeat > 0.3:   # election timeout
                self.role = "leader"   # NO consensus — anyone can self-promote
                self.term += 1         # local term, not agreed
                print(f"  {self.name} self-promoted at term={self.term}")

    async def write(self, payload):
        # Every Part 7 hygiene present: timeout, retry, breaker (omitted for clarity)
        if self.role != "leader":
            return ("not_leader", None)
        seq = len(self.committed) + 1
        self.committed.append((self.term, seq, payload))
        return ("ok", (self.term, seq))

async def main():
    r0, r1, r2 = Replica("r0", []), Replica("r1", []), Replica("r2", [])
    for r in (r0, r1, r2):
        r.peers = [p for p in (r0, r1, r2) if p is not r]
    r0.role = "leader"; r0.term = 1          # r0 starts as leader
    hb = [asyncio.create_task(r.heartbeat()) for r in (r0, r1, r2)]
    await asyncio.sleep(0.1)

    print("PHASE 1 — healthy network, r0 is leader")
    print(await r0.write({"user": "riya", "buy": 200, "symbol": "ALPHA"}))
    print(await r1.write({"user": "rahul", "buy": 50, "symbol": "BETA"}))   # rejected

    print("PHASE 2 — partition r0 from r1, r2 (r0 still reachable from clients)")
    r0.partitioned_from = {"r1", "r2"}
    r1.partitioned_from = {"r0"}
    r2.partitioned_from = {"r0"}
    await asyncio.sleep(0.5)                 # election timeout fires on r1 and r2

    print("PHASE 3 — clients still write to both 'leaders'")
    print("  r0:", await r0.write({"user": "riya", "buy": 200, "symbol": "ALPHA"}))
    print("  r1:", await r1.write({"user": "asha", "buy": 100, "symbol": "GAMMA"}))
    print("  r2:", await r2.write({"user": "kiran", "buy": 75, "symbol": "DELTA"}))

    print("PHASE 4 — final state across replicas (note divergent term + duplicate seq)")
    for r in (r0, r1, r2):
        print(f"  {r.name}: role={r.role} term={r.term} log={r.committed}")
    for t in hb:
        t.cancel()

asyncio.run(main())
Sample run on Python 3.11:

PHASE 1 — healthy network, r0 is leader
('ok', (1, 1))
('not_leader', None)
PHASE 2 — partition r0 from r1, r2 (r0 still reachable from clients)
  r1 self-promoted at term=1
  r2 self-promoted at term=1
PHASE 3 — clients still write to both 'leaders'
  r0: ('ok', (1, 2))
  r1: ('ok', (1, 1))
  r2: ('ok', (1, 1))
PHASE 4 — final state across replicas (note divergent term + duplicate seq)
  r0: role=leader term=1 log=[(1, 1, ...riya...), (1, 2, ...riya...)]
  r1: role=leader term=1 log=[(1, 1, ...asha...)]
  r2: role=leader term=1 log=[(1, 1, ...kiran...)]
Walkthrough. heartbeat() runs the same fail-detector pattern Part 7 used for breakers — miss N heartbeats, decide the peer is dead. The crucial line is self.role = "leader": when the timeout fires, the replica self-promotes without asking anyone. There is no quorum vote, no term increment that requires majority approval, no fencing token. write() is the production endpoint. It has every Part 7 hygiene we want, but it asks one local question — self.role == "leader" — and that question's answer can be True on three replicas at once. partitioned_from simulates an asymmetric reachability failure: clients can reach all three replicas, but the replicas cannot reach each other.
Why this is the canonical split-brain shape: the partition is asymmetric — clients see all three replicas as healthy because the client-facing path works, while the replicas cannot exchange heartbeats because the replica-to-replica path is down. This is the most common production form of split-brain (the cross-AZ replication path goes down before the client-facing load balancer notices anything wrong), and Part 7 patterns are blind to it because they only inspect the client-facing path.
The output shows the catastrophe: three replicas, three "leaders", three logs with the same sequence number (1, 1) mapped to three different writes. The caller's view is perfect — every write returned ok in 5 ms. The system's state is incoherent. Riya's buy is committed twice, on r0 only. Asha's and Kiran's writes exist on one replica each. When the partition heals, the log-merge logic has no principled way to reconcile — three different writes claim seq=1. No retry, no breaker, no bulkhead, no rate limiter, no degradation mode rescues this. The only fix is at a different layer: a leader-election protocol where promotion requires a majority quorum of peers to agree, and a fencing token that makes the old leader's writes rejectable once the new leader is elected. That is the work of chapters 50–57.
What the next part actually fixes — preview
Part 8 (chapters 50–57) takes the failure shapes Part 7 is blind to and gives you the consensus primitives that handle them. The order is principled — each chapter introduces a mechanism the previous one's failure mode demands.
The interactions matter as much as the mechanisms. Election without fencing tokens does not prevent zombie writes — the demoted leader keeps accepting writes for a while because it has not yet noticed it lost the lease, and the new leader has no way to reject those writes. PaySetu learned this in 2024: a 7-second zombie window after a failover allowed 2 800 duplicate UPI debit attempts to be queued; without the fencing token in the storage layer rejecting term < current_term, the duplicates would have hit the issuing bank. Consensus without leases for the read path makes every read pay the cost of a quorum round-trip. KapitalKite measured 1.4 ms p99 for lease-anchored reads vs 8.2 ms p99 for quorum reads at the same load — at 12 000 reads/second per shard, that difference is the gap between making the SLO and missing it. And leases without bounded clock skew are unsafe. Spanner's TrueTime exists because Google measured that, without TrueTime, NTP-driven clock skew of up to 7 ms in their datacentres could let a lease be considered valid on the new leader before the old leader's lease had expired locally — opening a window where both nodes thought they held the lease.
Why "fencing token" is not just "version number": a version number is what the storage layer increments on every write. A fencing token is what the coordination layer hands the new leader, with the contract that storage must reject any write tagged with a token less than the highest token it has ever seen. This is a one-line change in the storage write path — if request.token < self.max_seen_token: return reject — but it is the difference between safe failover and zombie corruption. Martin Kleppmann's 2017 post on this is the canonical reference; the failure mode it describes is exactly what KapitalKite hit in our chapter-lead scenario: a stale leader holding a lock generation g=4 continues to issue writes after a new leader is elected with g=5, and unless the storage layer rejects g=4 writes, the new leader's state diverges from the old leader's still-incoming traffic.
Common confusions
- "If my retries are good enough, I don't need consensus." Wrong in the same way "if my locks are tight, I don't need transactions" is wrong. Retries handle transient failures of one operation against one peer. Consensus is what makes the agreement about whether that operation happened at all. A perfectly-retried write that ends up committed on r0 and r1 but not r2 still leaves the system in an undefined state — only consensus can pick the canonical answer.
- "Leader election is just heartbeat-and-timeout." That is what the broken script in the demonstration above does, and it produces three leaders. Real leader election is heartbeat-and-timeout plus a quorum vote on promotion plus a monotonic term/epoch number that prevents a slower-network candidate from "winning" against a real elected leader plus a fencing token in storage (see the promotion sketch after this list). Removing any of those four pieces brings split-brain back. Heartbeats alone are a failure detector, not an election.
- "Consensus is for databases." Consensus is for any multi-node decision that must be unique. Leader election in Kubernetes is consensus (etcd's Raft). Stream processing exactly-once is consensus (Kafka's controller, Flink's checkpoint coordinator). Distributed locks are consensus (etcd, Consul, ZooKeeper). Service-discovery membership is consensus or gossip with eventual consistency — and the choice of which is itself a consensus-vs-eventual trade-off. If your system has any state that two nodes can disagree about and the disagreement is wrong, you need consensus somewhere — even if you do not run Raft yourself, you depend on something that does.
- "Quorum is the same as consensus." A quorum is a counting rule: with N replicas, the read set R and write set W must overlap (W + R > N). That overlap gives you at most single-key linearisability, and only when every operation is acked by a quorum and read-write races are resolved correctly. Consensus is the protocol that makes the quorum operate correctly under failure — handling the case where the quorum that acks the write is not the same quorum that handles the next read, the case where the network partitions a minority off, the case where the leader fails mid-write. Quorum is a primitive; consensus is a protocol that uses quorum.
- "Eventual consistency means I do not need consensus." Eventual consistency means the system converges given enough time and no more failures. The convergence still needs a deterministic merge function (CRDTs in Part 13) or an authoritative tie-breaker. Most "eventually consistent" stores (Dynamo, Cassandra) still use consensus internally for membership, schema, and metadata — they just do not pay the cost on every data write. The cost is moved, not eliminated.
- "Consensus is too expensive — that's why we don't use it." Consensus is expensive per decision — typically 1–4 ms within a region for Raft, 50–200 ms across regions. The cost is amortised by batching (one consensus round commits a batch of operations) and by leases (the leader holds a lease for many seconds and serves reads locally). A well-configured Raft cluster supports 50 000–200 000 operations per second per shard. The honest framing is: consensus is expensive enough that you should be selective about what you put through it, not avoid it entirely.
Going deeper
What FLP actually says — and why it is in chapter 50, not buried
The Fischer-Lynch-Paterson impossibility (JACM 1985) proves that in a fully asynchronous system with even one faulty process, no deterministic consensus protocol can guarantee both safety and liveness. The result is foundational because it tells you the protocol you are about to learn (Paxos, Raft) is not "perfect" — it is a deliberate trade-off. Paxos and Raft give up liveness during persistent partitions: under a partition, the protocol prefers to stall rather than risk safety violations. This is why a 5-node etcd cluster split 2-3 by an AZ failure goes half-dark: the 2-node minority cannot make progress and stops accepting writes until the partition heals or an operator intervenes, while only the 3-node majority carries on. That stalled minority is FLP made visible. The chapter places FLP first because every subsequent chapter (Paxos, Raft, EPaxos, multi-Raft) is a particular point in the trade-off space FLP defines, and reading them without the FLP frame turns them into trivia.
The order-router incident — what the post-incident report actually said
KapitalKite's post-incident report on the 19 March 2026 split-brain (the chapter-lead scenario) asked one question: "Where did the team mistake reliability for coordination?" The answer was three places. (1) The replica heartbeat loop used the same Part 7 fail-detector logic as the client-facing breaker — miss a run of heartbeats, declare dead — but treated "dead" as a unilateral signal. The fix was a quorum vote: a candidate may declare itself leader only if a majority of peers concur. (2) The storage layer wrote sequence numbers without a fencing token; both leaders could allocate seq=1 because neither had a coordination-layer token to scope it. The fix was per-shard generation numbers stored in a Raft-replicated metadata service. (3) The team had treated "consensus is expensive" as a reason to avoid Raft and instead built a "lighter" custom protocol on top of timeouts. The custom protocol was correct for the failures they had imagined; it was incorrect for the asymmetric partition they had not. The lesson — written into the report's "change in engineering practice" section — was: if your system has more than one node that can mutate the same state, the question is not whether you have consensus, but where it lives and who maintains it. Often the right answer is "we use etcd" — a small dependency that is someone else's problem to keep correct.
Why "leases are cheap, consensus is expensive" is the next-chapter title
The wall after Part 8 (chapter 57) is "consensus is expensive — leases are cheap". That wall closes Part 8 and opens Part 9. The framing is operational: consensus done naively is one Raft round per decision (1–4 ms in-region), and at 200 000 decisions per second per shard, you cannot afford a Raft round on every read. Leases solve this by letting the leader serve reads locally for a duration T after a single Raft round establishes the lease. As long as T is discounted by the worst-case clock-skew envelope — the leader stops serving before the earliest moment another node could believe the lease has expired — and the renewal overhead is amortised, reads cost almost nothing. The full coordination story is consensus for writes plus leases for reads, not consensus alone. Spanner's read-locality story, Kubernetes' control-plane reads-from-leader, and KapitalKite's order-router post-fix architecture all use this composition. The reader who skips chapter 57 will think Raft makes every operation pay 4 ms — which would make it unusable at trading-system scale — and will not understand why people use it anyway.
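A sketch of that read path. The constants (MAX_CLOCK_SKEW, the 10-second lease) are assumptions standing in for values a real deployment derives from its clock-sync budget, and a real lease comes from a Raft round rather than a constructor:

    # lease_reads.py — "consensus for writes, leases for reads" in miniature
    import time

    MAX_CLOCK_SKEW = 0.25   # assumed worst-case skew bound between any two nodes

    class LeaseHolder:
        def __init__(self, lease_duration=10.0):
            # One consensus round establishes the lease; we just record the result.
            self.lease_expiry = time.monotonic() + lease_duration

        def read(self, key, local_store):
            # Serve locally only while the lease is safely valid: discount the
            # expiry by the worst-case skew, so no new leader can exist yet
            # anywhere in the cluster while we answer from local state.
            if time.monotonic() < self.lease_expiry - MAX_CLOCK_SKEW:
                return local_store.get(key)       # ~microseconds, no round-trip
            raise RuntimeError("lease expired — fall back to a quorum read")

    holder = LeaseHolder()
    print(holder.read("riya:positions", {"riya:positions": 200}))

The skew discount is the whole safety argument: shrink the window you trust by the amount your clocks can lie, and the old leader provably stops serving before a new one can start.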
When coordination is the wrong answer — push the agreement to the data type
Consensus makes the system slower per operation. Sometimes the right move is to choose a data structure that does not need agreement. CRDTs (Part 13) are commutative-associative-idempotent merge structures — G-Counters, OR-Sets, LWW-Registers — where any two replica states can be merged into a third state that is the "join" of both, deterministically. No consensus needed; no leader needed; no quorum needed. The trade-off is that the data structure has to fit a CRDT shape, and many things (uniqueness constraints, sequence numbers, transactional invariants) do not fit. The decision tree at design time: can this data type be a CRDT? if yes, you can avoid the consensus cost entirely; if no, you pay the Raft tax. CricStream's chat-message store is a CRDT (LWW-Set on message-ID). Their "currently-watching" counter is a G-Counter. Their "leader-of-the-stream-shard" is not a CRDT — that is a Raft cluster, because exactly-one is not a property CRDTs can enforce.
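The G-Counter is small enough to show whole — a minimal sketch of the merge rule, not a production CRDT library:

    # g_counter.py — one slot per replica; increment your own slot; merge by
    # element-wise max. Merge is commutative, associative, and idempotent, so
    # replicas converge in any order with no leader, quorum, or consensus.
    class GCounter:
        def __init__(self, replica_id, n_replicas):
            self.id, self.slots = replica_id, [0] * n_replicas

        def increment(self):
            self.slots[self.id] += 1         # only ever touch your own slot

        def merge(self, other):
            self.slots = [max(a, b) for a, b in zip(self.slots, other.slots)]

        def value(self):
            return sum(self.slots)

    # Two replicas count watchers independently, then merge in either order.
    a, b = GCounter(0, 2), GCounter(1, 2)
    a.increment(); a.increment(); b.increment()
    a.merge(b); b.merge(a)
    print(a.value(), b.value())   # 3 3 — both converge without coordination

Note what the shape buys and what it forbids: max-merge can never lose an increment, but it also can never express "exactly one of us" — which is why the stream-shard leader above still needs Raft.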
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip # asyncio is stdlib in 3.11+
# save wall_coord_breaks.py from the article body
python3 wall_coord_breaks.py
# Expected: phase 4 shows three replicas all at role=leader, three logs with
# duplicate seq=1 mapped to different writes — the canonical split-brain pattern.
# Then verify the fix preview from Part 8 — install etcd and run an election:
brew install etcd # or: apt install etcd-server etcd-client
etcd & # start a local single-node cluster
etcdctl put /lock/order-router "r0" --lease=$(etcdctl lease grant 10 | awk '{print $2}')
# Note: a plain put never fails — a second put simply overwrites. For real mutual
# exclusion use: etcdctl lock order-router — the second caller blocks until the
# first holder releases the lock or its lease expires.
Where this leads next
Parts 8 and 9 (chapters 50–67) are where the system stops trusting that local defences are enough and starts paying for agreement. Each chapter takes one of the failure shapes catalogued above and gives you the mechanism, with a runnable Python artefact:
- FLP impossibility — what it forbids — chapter 50; the boundary every consensus protocol negotiates.
- Paxos and why people struggle with it — chapter 51; the original consensus protocol, derived from first principles.
- Raft in detail — chapter 52; the understandable consensus algorithm; production-default for etcd, Consul, CockroachDB.
- EPaxos and Flexible Paxos — chapter 53; leader-less and reduced-quorum variants for geo-distributed deployments.
- Multi-Raft and sharding consensus — chapter 54; how production systems run thousands of Raft groups in parallel.
- Byzantine consensus — PBFT, HotStuff — chapter 55; consensus when nodes may lie.
- When not to use consensus — chapter 56; the design discipline of pushing decisions to data types or eventual consistency.
- Wall: consensus is expensive — leases are cheap — chapter 57; the closing wall that opens Part 9 (leader election and leases).
After Parts 8 and 9, the system has a coherent foundation: defensive coding (Part 7) plus agreement (Parts 8–9). Part 10 then asks the question both layers depend on — how do you decide a peer has actually failed? — because every retry, every breaker, every election timeout is anchored on a failure-detector decision, and getting that wrong corrupts everything that depends on it. Part 11 (gossip) then handles the membership and metadata problems where consensus is too expensive and eventual is good enough. The progression: defend (7) → agree (8–9) → detect (10) → propagate (11). Each layer assumes the one above is solved; each layer's failures motivate the next.
References
- Lamport, "The Part-Time Parliament" — TOCS 1998 — the original Paxos paper; chapter 51 is built on this.
- Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm" — USENIX ATC 2014 — the Raft paper; explicitly written for "understandable", and chapter 52 builds the simulation from this.
- Fischer, Lynch, Paterson, "Impossibility of Distributed Consensus with One Faulty Process" — JACM 1985 — the FLP result; required reading before any consensus chapter.
- Kleppmann, "How to do distributed locking" — 2016 — the canonical fencing-token explanation; if you read one post on coordination, read this one.
- Corbett et al., "Spanner: Google's Globally-Distributed Database" — OSDI 2012 — TrueTime-anchored leases plus Paxos; the production reference for how leases and consensus compose.
- Brooker, "Leader election in distributed systems" — AWS Builder's Library — the practitioner reference for what production leader election looks like.
- Degradation modes — chapter 48; the immediately previous chapter, the last defensive pattern before the wall.
- Crash, omission, timing, byzantine — Part 2 chapter; the failure-model taxonomy that Part 8 protocols negotiate against.