Wall: coordination is sometimes necessary

KapitalKite's order-router is everything Part 7 promised. Every outbound RPC is wrapped in a 250 ms timeout. Every retry has exponential backoff with full jitter and a 10% retry-budget cap. Each downstream lives behind a circuit breaker that opens at five consecutive failures, and the trade-engine pool is bulk-headed away from the market-data pool so a slow ticker cannot drain the order thread budget. Token-bucket rate limiters fence the upstream broker; load shedding kicks in at 92% CPU; the degradation ladder slides from "personalised order suggestions" to "static top-100 list" in 50 ms.

On the morning of 19 March 2026 at 09:14:53 IST — three minutes after market open — order-router-leader-3 enters a 9-second IO stall because the EBS volume holding its WAL has degraded to 12 ms p99 fsync latency. The breaker opens. The shedding kicks in. The degraded mode serves 1.4 lakh users a flat ranker. Beautiful. Meanwhile, in another AZ, order-router-leader-7 — which was supposed to be a follower — has noticed leader-3 missing two heartbeats, declared itself leader, and started accepting writes against the same kapitalkite-orders shard. For 11 seconds, two leaders accept the same trades. Riya at desk 4 places a buy of 200 shares of an IT large-cap; the order is committed twice with two different sequence numbers; the broker accepts both.

Why no Part 7 pattern saves this: timeouts, retries, breakers, bulkheads, rate limits, shedding, and degradation are all local — each is a defence the caller runs against an uncooperative callee. Two replicas disagreeing about which of them is the leader is not a caller-vs-callee problem; it is a peer-vs-peer problem, and no defensive code on the caller's side can settle it. The settlement requires the two replicas to agree — and agreement, in a distributed system, has a name and a price.

Eight chapters of reliability patterns taught your service to defend itself against unreliable peers. They cannot teach two replicas to agree on a leader, a sequence number, or a committed write. When state must be consistent across nodes, the answer is consensus — Paxos, Raft, leases with fencing — not more retries. Part 7 keeps the request alive; Part 8 makes the system agree about what happened.

Why this is a wall, not a chapter

The eight chapters of Part 7 — retries with backoff, circuit breakers, bulkheads, timeouts and deadline propagation, rate limiting, load shedding, degradation modes, and the patterns that compose them — were about one stance: the caller defends itself against the callee. Every algorithm assumed that the caller could decide unilaterally how to react when the callee misbehaved. Drop the request. Try a different replica. Open the breaker. Fall back to a cache. The caller is in charge; the callee's failure is the caller's problem to route around.

This stance breaks down the moment two callees must agree with each other. Specifically: two replicas can both believe they are the leader of the same shard; two nodes can allocate the same sequence number to different writes; a commit can land on two of three replicas and leave the third divergent; a demoted leader can keep writing after a fail-over because nothing fences it out.

None of these is a reliability problem. The caller is healthy; the network is fine; every breaker is closed. The system still produces a wrong answer, because two nodes are doing the right local thing and the combination is incoherent. The coordination layer — consensus protocols, leader election with fencing tokens, atomic commit, and the lease-and-expiry primitives that anchor them — is the part of the system that survives this disagreement. It is the topic of Part 8 (consensus), Part 9 (leader election and leases), and the foundation everything from Part 12 (consistency) to Part 14 (distributed transactions) builds on.

Two stances — defending alone vs agreeing together. Two-column diagram. Left column: "Part 7 — Reliability Patterns. The caller defends itself" — timeouts that match the SLO, retries with backoff and jitter, idempotency keys, circuit breakers (closed/open/half-open), bulkheads (per-dependency pool), deadline propagation, rate limiting (token/leaky bucket), load shedding and degradation modes; the caller decides alone, no peer round-trips required, cost O(0) extra messages. Right column: "Parts 8 + 9 — Coordination. Peers agree with each other" — leader election (term/epoch), fencing tokens (monotonic generation), Paxos/Raft consensus, quorum reads and writes (W + R > N), atomic commit (2PC, Paxos commit), leases with bounded skew, log replication and view changes, failure detector with suspicion; peers agree together, at least one round-trip per decision, cost O(quorum) messages and O(RTT) latency. A vertical accent-coloured dashed line between the columns is labelled "the wall — local defence ends here"; below it: "single-caller decisions cannot settle peer-vs-peer disagreement; consensus has a different cost model".
Illustrative — the boundary between Part 7 and Parts 8–9. Left: defensive patterns the caller runs alone, no extra round-trips. Right: agreement protocols that require peers to talk to each other, paid for in latency and messages.

Why "single-caller decisions cannot settle peer disagreement" is not hand-waving: every Part 7 pattern is a function of one variable — the caller's local state. A retry decision uses the caller's failure history. A breaker uses the caller's recent success rate. A shed decision uses the caller's CPU. None of them ever asks another node "what do you think?" — and none of them can, because asking would be a round-trip with the same unreliability that motivated Part 7 in the first place. The moment the question is "do you and I agree on which of us is the leader?", you must do the round-trip and you must define what "agreement" means under partial reachability. That is what Paxos, Raft, leases, and quorums solve, and it is why Part 8 is structurally separate from Part 7 in the curriculum.
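The contrast is visible even in toy form. Below is a hedged sketch (function names are illustrative, not from the chapter's code): the Part 7 decisions close over nothing but the caller's own counters, while the coordination question cannot even be phrased without other nodes' answers, each of which costs a round-trip.

```python
# Part 7 decisions: pure functions of the caller's local state, zero round-trips.
def breaker_should_open(consecutive_failures: int, threshold: int = 5) -> bool:
    return consecutive_failures >= threshold      # local counter, local answer

def should_shed(cpu_fraction: float, limit: float = 0.92) -> bool:
    return cpu_fraction >= limit                  # local gauge, local answer

# Coordination: unanswerable from local state alone. Every entry in peer_votes
# had to cross the same unreliable network that motivated Part 7.
def wins_election(candidate: str, peer_votes: list[str]) -> bool:
    votes_for_me = 1 + peer_votes.count(candidate)   # own vote plus peers'
    cluster_size = 1 + len(peer_votes)
    return votes_for_me > cluster_size // 2          # strict majority required
```

The first two functions take no network input at all; the third is meaningless without it. That one-line difference is the wall.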

A measurable demonstration — perfect Part 7, broken state

The script below is honest about the limits of defensive coding. We run a 3-replica order-router cluster. Every replica has perfect Part 7 hygiene — 250 ms timeouts, retries with jitter, breakers, bulkheads, the lot. We then partition the network so the leader (r0) cannot reach r1 and r2, but is still reachable from clients. We measure two things: how the caller sees the system (does the request succeed?) and what state the system actually contains (which replica thinks it is leader, what sequence number got assigned).

# wall_coord_breaks.py — perfect Part 7 hygiene, no consensus, split-brain happens
import asyncio, time

class Replica:
    def __init__(self, name, peers):
        self.name, self.peers = name, peers
        self.role = "follower"
        self.term = 0
        self.last_heartbeat = time.monotonic()
        self.committed = []        # local log
        self.partitioned_from = set()  # peers we cannot reach

    async def heartbeat(self):
        while True:
            await asyncio.sleep(0.05)
            now = time.monotonic()
            if self.role == "leader":
                for p in self.peers:
                    if p.name in self.partitioned_from: continue
                    p.last_heartbeat = now  # heartbeat propagates
            else:
                if now - self.last_heartbeat > 0.3:   # election timeout
                    self.role = "leader"               # NO consensus — anyone can self-promote
                    self.term += 1                     # local term, not agreed
                    print(f"  {self.name} self-promoted at term={self.term}")

    async def write(self, payload):
        # Every Part 7 hygiene present: timeout, retry, breaker (omitted for clarity)
        if self.role != "leader":
            return ("not_leader", None)
        seq = len(self.committed) + 1
        self.committed.append((self.term, seq, payload))
        return ("ok", (self.term, seq))

async def main():
    r0, r1, r2 = Replica("r0", []), Replica("r1", []), Replica("r2", [])
    for r in (r0, r1, r2): r.peers = [p for p in (r0, r1, r2) if p is not r]
    r0.role = "leader"; r0.term = 1   # r0 starts as leader

    hb = [asyncio.create_task(r.heartbeat()) for r in (r0, r1, r2)]
    await asyncio.sleep(0.1)

    print("PHASE 1 — healthy network, r0 is leader")
    print(await r0.write({"user": "riya", "buy": 200, "symbol": "ALPHA"}))
    print(await r1.write({"user": "rahul", "buy": 50, "symbol": "BETA"}))  # rejected

    print("PHASE 2 — partition r0 from r1, r2 (r0 still reachable from clients)")
    r0.partitioned_from = {"r1", "r2"}
    r1.partitioned_from = {"r0"}; r2.partitioned_from = {"r0"}
    await asyncio.sleep(0.5)   # election timeout fires on r1 and r2

    print("PHASE 3 — clients still write to both 'leaders'")
    print("  r0:", await r0.write({"user": "riya", "buy": 200, "symbol": "ALPHA"}))
    print("  r1:", await r1.write({"user": "asha", "buy": 100, "symbol": "GAMMA"}))
    print("  r2:", await r2.write({"user": "kiran", "buy": 75, "symbol": "DELTA"}))

    print("PHASE 4 — final state across replicas (note divergent term + duplicate seq)")
    for r in (r0, r1, r2):
        print(f"  {r.name}: role={r.role} term={r.term} log={r.committed}")
    for t in hb: t.cancel()

asyncio.run(main())

Sample run on Python 3.11:

PHASE 1 — healthy network, r0 is leader
('ok', (1, 1))
('not_leader', None)
PHASE 2 — partition r0 from r1, r2 (r0 still reachable from clients)
  r1 self-promoted at term=1
  r2 self-promoted at term=1
PHASE 3 — clients still write to both 'leaders'
  r0: ('ok', (1, 2))
  r1: ('ok', (1, 1))
  r2: ('ok', (1, 1))
PHASE 4 — final state across replicas (note divergent term + duplicate seq)
  r0: role=leader term=1 log=[(1,1,...riya...), (1,2,...riya...)]
  r1: role=leader term=1 log=[(1,1,...asha...)]
  r2: role=leader term=1 log=[(1,1,...kiran...)]

Per-line walkthrough. heartbeat() runs the same fail-detector pattern Part 7 used for breakers — miss N heartbeats, decide the peer is dead. The crucial line is self.role = "leader": when the timeout fires, the replica self-promotes without asking anyone. There is no quorum vote, no term increment that requires majority approval, no fencing token. write() is the production endpoint. It has every Part 7 hygiene we want, but it asks one local question — self.role == "leader" — and that question's answer can be True on three replicas at once. partitioned_from simulates an asymmetric reachability failure: clients can reach all three replicas, the replicas cannot reach each other.

Why this is the canonical split-brain shape: the partition is asymmetric — clients see all three replicas as healthy because the client-facing path works, while the replicas cannot exchange heartbeats because the replica-to-replica path is down. This is the most common production form of split-brain (cross-AZ replica replication path goes down before the client-facing load balancer notices anything wrong), and Part 7 patterns are blind to it because they only inspect the client-facing path.

The output shows the catastrophe: three replicas, three "leaders", three logs with the same sequence number (1, 1) mapped to three different writes. The caller's view is perfect — every write returned ok in 5 ms. The system's state is incoherent. Riya's buy is committed twice, on r0 only. Asha's and Kiran's writes exist on one replica each. When the partition heals, the log-merge logic has no principled way to reconcile — three different writes claim seq=1. No retry, no breaker, no bulkhead, no rate limiter, no degradation mode rescues this. The only fix is at a different layer: a leader-election protocol where promotion requires a majority quorum of peers to agree, and a fencing token that makes the old leader's writes rejectable once the new leader is elected. That is the work of chapters 50–57.
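To show the shape of that fix, here is a minimal quorum-vote sketch against the same three-replica topology. It is illustrative only — real Raft elections (chapter 52) also persist votes, compare log completeness, and randomise timeouts, all elided here:

```python
class VotingReplica:
    """Self-promotion replaced by a majority vote: a replica cut off from its
    peers can no longer reach quorum, so at most one leader exists per term."""
    def __init__(self, name):
        self.name = name
        self.peers = []                 # wired up after construction
        self.partitioned_from = set()   # peers we cannot reach
        self.term = 0
        self.voted_in_term = 0          # at most one vote granted per term
        self.role = "follower"

    def grant_vote(self, candidate_term: int) -> bool:
        if candidate_term > self.voted_in_term:
            self.voted_in_term = candidate_term
            return True
        return False                    # already voted in this term

    def try_promote(self) -> bool:
        next_term = self.term + 1
        self.voted_in_term = next_term  # vote for self, once
        votes = 1
        for p in self.peers:
            if p.name in self.partitioned_from:
                continue                # unreachable peer grants nothing
            if p.grant_vote(next_term):
                votes += 1
        if votes > (len(self.peers) + 1) // 2:   # strict majority of cluster
            self.term, self.role = next_term, "leader"
            return True
        return False
```

With the chapter's partition applied (r0 isolated, r1 and r2 still connected), r0's attempt fails with a single vote, r1 wins with two, and r2 then fails because r1 already consumed its term-1 vote: exactly one leader, exactly one owner of seq=1.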

What the next part actually fixes — preview

Part 8 (chapters 50–57) takes the failure shapes Part 7 is blind to and gives you the consensus primitives that handle them. The order is principled — each chapter introduces a mechanism the previous one's failure mode demands.

Failure modes Part 7 cannot fix and Part 8 mechanisms that fix them. A two-column table headed "Parts 8 + 9 — eight chapters that establish agreement"; the left column lists failure modes the reliability layer cannot detect or repair, the right column the mechanism that addresses each. Rows: split-brain (two replicas both think they are leader) → leader election with quorum vote (term/epoch); old leader keeps writing after fail-over → fencing token (monotonic generation number); duplicate sequence numbers across nodes → Paxos/Raft single-decree consensus; partial commit (2 of 3 replicas applied) → atomic commit (2PC or Paxos commit); stale read after failover → quorum read with W + R > N; lease expires before reissue under skew → bounded skew plus overlap, TrueTime-style; consensus stalls under partial synchrony → FLP boundary, pick safety or liveness; Byzantine sender (lying or corrupted) → PBFT or HotStuff (3f + 1 replicas). Each row demands the next: leader election needs fencing to prevent zombie writes; consensus needs leases for the common case so every read does not pay quorum cost; leases need bounded skew or they are unsafe. Each chapter exposes the next.
Illustrative — the eight failure shapes Part 7 leaves on the floor and the Part 8/9 mechanisms that pick them up. Each mechanism enables the next: election needs fencing, consensus needs leases for cheap reads, leases need bounded skew. The composition is the system.

The interactions matter as much as the mechanisms. Election without fencing tokens does not prevent zombie writes — the demoted leader keeps accepting writes for a while because it has not noticed it lost the lease yet, and the new leader has no way to reject those writes. PaySetu learned this in 2024: a 7-second zombie window after a failover allowed 2 800 duplicate UPI debit attempts to be queued; without the fencing token in the storage layer rejecting term < current_term, the duplicates would have hit the issuing bank. Consensus without leases for the read path makes every read pay the cost of a quorum round-trip. KapitalKite measured 1.4 ms p99 for lease-anchored reads vs 8.2 ms p99 for quorum reads at the same load — for 12 000 reads/second per shard, the difference is the gap between making the SLO and missing it. Leases without bounded clock skew are unsafe. Spanner's TrueTime exists because Google measured that in their datacentres without TrueTime, NTP-driven skew of up to 7 ms could cause a lease to be considered valid on the new leader before the old leader's lease had expired locally — opening a window where both nodes thought they had the lease.

Why "fencing token" is not just "version number": a version number is what the storage layer increments on every write. A fencing token is what the coordination layer hands the new leader, with the contract that storage must reject any write tagged with a token less than the highest token it has ever seen. This is a one-line change in the storage write path — if request.token < self.max_seen_token: return reject — but it is the difference between safe failover and zombie corruption. Martin Kleppmann's 2016 post on this is the canonical reference; the failure mode it describes is exactly what KapitalKite hit in our chapter-lead scenario: a stale leader holding a lock generation g=4 continues to issue writes after a new leader is elected with g=5, and unless the storage layer rejects g=4 writes, the new leader's state diverges from the old leader's still-incoming traffic.
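The storage-side contract fits in a dozen lines. A minimal sketch, with hypothetical class and field names rather than any real storage engine's API:

```python
class FencedStore:
    """Storage write path that honours fencing tokens: remember the highest
    token ever seen, reject anything older. Equal tokens are accepted so the
    current leader can keep writing under its own generation."""
    def __init__(self):
        self.max_seen_token = 0
        self.log = []

    def write(self, token: int, payload) -> bool:
        if token < self.max_seen_token:
            return False                    # zombie leader: reject the write
        self.max_seen_token = token         # the token only ratchets upward
        self.log.append((token, payload))
        return True
```

Replaying the scenario above: the stale leader writes under generation 4, failover issues generation 5, and every later g=4 write bounces off the store instead of corrupting it.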

Common confusions

Retrying harder is not agreement: a retry against a self-promoted leader just commits the wrong state faster. A fencing token is not a version number: storage increments a version on every write, but a fencing token is issued by the coordination layer and ratchets only on leadership change. Consensus is not needed for every data type: CRDTs sidestep it when merges commute, but exactly-one properties (leadership, sequence numbers, uniqueness) cannot be expressed that way.

Going deeper

What FLP actually says — and why it is in chapter 50, not buried

The Fischer-Lynch-Paterson impossibility (JACM 1985) proves that in a fully asynchronous system with even one faulty process, no deterministic consensus protocol can guarantee both safety and liveness: any protocol that never decides wrongly can be driven into executions where it never decides at all. The result is foundational because it tells you the protocol you are about to learn (Paxos, Raft) is not "perfect" — it is a deliberate trade-off. Paxos and Raft give up liveness during persistent partitions: under a partition, the protocol prefers to stall rather than risk safety violations. This is why the minority side of an etcd cluster split 2-3 across an AZ failure stops accepting writes — two of five nodes cannot form a quorum, so that side waits for the partition to heal or for an operator to intervene, while the three-node majority side keeps making progress. That stall is FLP made visible. The chapter places FLP first because every subsequent chapter (Paxos, Raft, EPaxos, multi-Raft) is a particular point in the trade-off space FLP defines, and reading them without the FLP frame turns them into trivia.
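A toy quorum-write gate makes the safety-over-liveness choice concrete. This is a sketch of the decision rule only, not etcd's actual API:

```python
def quorum_write(reachable: int, cluster_size: int, payload):
    """Acknowledge a write only if a strict majority of the cluster can
    persist it. On the minority side of a partition this returns
    'unavailable' -- the protocol gives up liveness rather than risk two
    divergent committed histories, which is the safety half of the trade."""
    quorum = cluster_size // 2 + 1
    if reachable < quorum:
        return ("unavailable", None)    # stall: safety preserved
    return ("ok", payload)              # majority side keeps making progress
```

With a 5-node cluster split 2-3, the 2-node side returns unavailable on every write while the 3-node side keeps acknowledging; the stall is policy, not failure.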

The order-router incident — what the post-incident report actually said

KapitalKite's post-incident report on the 19 March 2026 split-brain (the chapter-lead scenario) asked one question: "Where did the team mistake reliability for coordination?" The answer was three places. (1) The replica heartbeat loop used the same Part 7 fail-detector logic as the client-facing breaker — five missed heartbeats, declare dead — but treated "dead" as a unilateral signal. The fix was a quorum vote: a candidate may declare itself leader only if a majority of peers concur. (2) The storage layer wrote sequence numbers without a fencing token; both leaders could allocate seq=1 because neither had a coordination-layer token to scope it. The fix was per-shard generation numbers stored in a Raft-replicated metadata service. (3) The team had treated "consensus is expensive" as a reason to avoid Raft and instead built a "lighter" custom protocol on top of timeouts. The custom protocol was correct for the failures they had imagined; it was incorrect for the asymmetric partition they had not. The lesson — written into the post-incident's change in engineering practice section — was: if your system has more than one node that can mutate the same state, the question is not whether you have consensus, but where it lives and who maintains it. Often the right answer is "we use etcd" — a small dependency that is someone else's problem to keep correct.

Why "leases are cheap, consensus is expensive" is the next-chapter title

The wall after Part 8 (chapter 57) is "consensus is expensive — leases are cheap". That wall closes Part 8 and opens Part 9. The framing is operational: consensus done naively is one Raft round per decision (1–4 ms in-region), and at 200 000 decisions per second per shard, you cannot afford a Raft round on every read. Leases solve this by letting the leader serve reads locally for a duration T after a single Raft round establishes the lease. As long as T is bounded by the minimum clock-skew envelope and the lease-renewal overhead is amortised, reads cost almost nothing. The full coordination story is consensus for writes + leases for reads, not consensus alone. Spanner's read-locality story, Kubernetes' control-plane reads-from-leader, and KapitalKite's order-router post-fix architecture all use this composition. The reader who skips chapter 57 will think Raft makes every operation pay 4 ms — which would make it unusable at trading-system scale — and will not understand why people use it anyway.
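The composition can be sketched in a few lines. Names and numbers here are illustrative; real implementations tie the lease to the Raft term and renew it in the background:

```python
import time

class LeaseholderReadPath:
    """Writes always pay the consensus round; reads are served locally while
    the lease is live and fall back to a quorum round once it lapses. The
    lease is shortened by the clock-skew bound so it expires here before any
    peer could believe it has already expired elsewhere."""
    def __init__(self, lease_ttl_s: float, skew_bound_s: float):
        self.safe_ttl = lease_ttl_s - skew_bound_s
        self.acquired_at = None

    def acquire_lease(self):
        # in production: one Raft round (1-4 ms in-region) establishes this,
        # then safe_ttl seconds of reads amortise that cost
        self.acquired_at = time.monotonic()

    def read(self, key: str):
        if self.acquired_at is not None and \
           time.monotonic() - self.acquired_at < self.safe_ttl:
            return ("local", key)       # no peer round-trip
        return ("quorum", key)          # O(quorum) messages, O(RTT) latency
```

One acquire buys a whole window of local reads; without it, every read in that window would pay the quorum round-trip the chapter prices at roughly 6x the latency.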

When coordination is the wrong answer — push the agreement to the data type

Consensus makes the system slower per operation. Sometimes the right move is to choose a data structure that does not need agreement. CRDTs (Part 13) are commutative-associative-idempotent merge structures — G-Counters, OR-Sets, LWW-Registers — where any two replica states can be merged into a third state that is the "join" of both, deterministically. No consensus needed; no leader needed; no quorum needed. The trade-off is that the data structure has to fit a CRDT shape, and many things (uniqueness constraints, sequence numbers, transactional invariants) do not fit. The decision tree at design time: can this data type be a CRDT? if yes, you can avoid the consensus cost entirely; if no, you pay the Raft tax. CricStream's chat-message store is a CRDT (LWW-Set on message-ID). Their "currently-watching" counter is a G-Counter. Their "leader-of-the-stream-shard" is not a CRDT — that is a Raft cluster, because exactly-one is not a property CRDTs can enforce.
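A G-Counter in a dozen lines shows the merge-without-agreement property: merge is commutative, associative, and idempotent, so replicas converge in any order. A sketch of the standard construction:

```python
class GCounter:
    """Grow-only counter CRDT: each replica increments only its own slot;
    merge takes the per-slot max. Replicas converge without a leader, a
    quorum, or a consensus round. Note what is missing: there is no way to
    express 'exactly one of us', which is why elections still need Raft."""
    def __init__(self, replica_id: str):
        self.id = replica_id
        self.slots: dict[str, int] = {}

    def incr(self, n: int = 1):
        self.slots[self.id] = self.slots.get(self.id, 0) + n

    def merge(self, other: "GCounter"):
        for rid, count in other.slots.items():
            self.slots[rid] = max(self.slots.get(rid, 0), count)

    def value(self) -> int:
        return sum(self.slots.values())
```

Two replicas that incremented independently merge to the same total in either direction, and re-merging changes nothing: that idempotence is what lets the counter ride gossip instead of consensus.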

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip   # no third-party packages needed; asyncio is in the stdlib
# save wall_coord_breaks.py from the article body
python3 wall_coord_breaks.py
# Expected: phase 4 shows three replicas all at role=leader, three logs with
# duplicate seq=1 mapped to different writes — the canonical split-brain pattern.

# Then verify the fix preview from Part 8 — install etcd and run an election:
brew install etcd  # or: apt install etcd-server etcd-client
etcd &              # start a local single-node cluster
etcdctl lock order-router &   # first holder acquires the lock and holds it
etcdctl lock order-router     # a second attempt blocks until the first releases
# Note: a plain "etcdctl put --lease=..." is not mutual exclusion, since any
# client can still overwrite the key; "etcdctl lock" wraps the lease-plus-txn
# pattern correctly so there is only ever one holder.

Where this leads next

Parts 8 and 9 (chapters 50–67) are where the system stops trusting that local defences are enough and starts paying for agreement. Each chapter takes one of the failure modes from the table above and gives you the mechanism that handles it, each backed by a runnable Python artefact.

After Parts 8 and 9, the system has a coherent foundation: defensive coding (Part 7) plus agreement (Parts 8–9). Part 10 then asks the question both layers depend on — how do you decide a peer has actually failed? — because every retry, every breaker, every election timeout is anchored on a failure-detector decision, and getting that wrong corrupts everything that depends on it. Part 11 (gossip) then handles the membership and metadata problems where consensus is too expensive and eventual is good enough. The progression: defend (7) → agree (8–9) → detect (10) → propagate (11). Each layer assumes the one above is solved; each layer's failures motivate the next.

References

  1. Lamport, "The Part-Time Parliament" — TOCS 1998 — the original Paxos paper; chapter 51 is built on this.
  2. Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm" — USENIX ATC 2014 — the Raft paper; explicitly written for "understandable", and chapter 52 builds the simulation from this.
  3. Fischer, Lynch, Paterson, "Impossibility of Distributed Consensus with One Faulty Process" — JACM 1985 — the FLP result; required reading before any consensus chapter.
  4. Kleppmann, "How to do distributed locking" — 2016 — the canonical fencing-token explanation; if you read one post on coordination, read this one.
  5. Corbett et al., "Spanner: Google's Globally-Distributed Database" — OSDI 2012 — TrueTime-anchored leases plus Paxos; the production reference for how leases and consensus compose.
  6. Brooker, "Leader election in distributed systems" — AWS Builder's Library — the practitioner reference for what production leader election looks like.
  7. Degradation modes — chapter 48; the immediately previous chapter, the last defensive pattern before the wall.
  8. Crash, omission, timing, byzantine — Part 2 chapter; the failure-model taxonomy that Part 8 protocols negotiate against.