Wall: most of what you send breaks somewhere

PaySetu's checkout fan-out hits 220 RPCs per merchant transaction — fraud-score, ledger-debit, partner-bank, GST cache, notification, audit log, eight retries' worth of read-side cache misses. The LB layer is perfect: bounded-load consistent hashing for cache locality, P2C for stateless services, locality-aware routing keeping 92% of calls inside the AZ. On a Tuesday at 14:32, a single payment-pod-7 enters a 14-second GC pause. The pod still answers TCP keepalives, still passes the L4 health check, still appears on every caller's endpoint list. P2C picks it twice in the next 100 ms because it has zero in-flight requests (everyone else's RPCs to it are stuck waiting). A 2.1% slice of merchant transactions starts piling up behind that pod. The dashboard is green. Aditi, on the platform team, watches the p99 graph climb past 800 ms while every backend reports healthy=true, and writes the line that becomes the title of this chapter into the postmortem.

Eight chapters of load-balancing math taught you how to pick the right backend. They cannot teach you what to do when the backend you picked goes silent for 14 seconds, returns 200 OK with a corrupt body, or accepts your TCP connection but never reads from the socket. The next part — reliability patterns — is the layer that survives those failures. The LB picks; the reliability layer recovers.

Why this is a wall, not a chapter

The eight chapters of Part 6 — random + round-robin, least-connections, P2C, consistent hashing, Maglev, locality-aware, bounded-load, client-side-vs-proxy — were about getting one decision right: which backend should this request go to? Every algorithm assumed that once you picked the backend, the backend would either respond on time or fail loudly. Both halves of that assumption are wrong in a way that the LB layer cannot fix from inside.

A backend can:

- accept your TCP connection and then never read from the socket;
- go silent mid-request for 14 seconds while its GC runs, then resume as if nothing happened;
- return 200 OK with a corrupt or stale body;
- return a transient 503 that would have succeeded on an immediate retry;
- pass every health check while serving real traffic a hundred times slower than normal.

None of these is a load-balancing problem. The LB picked a backend that was, by every signal it had, the right one. The request still broke. The reliability layer — retries, timeouts, hedged requests, circuit breakers, deadline propagation, fencing tokens — is the part of the system that survives those failures. It is the topic of Part 7, and it is where the next eleven chapters live.

Two layers, two responsibilities — picking and surviving. Two-column diagram. Left: Part 6 — Load Balancing, "which backend should I pick?" — random + round-robin, least-connections (JSQ), power of two choices (P2C), consistent hashing (ring/jump/Maglev), locality-aware routing, bounded-load consistent hashing, client-side vs proxy-side data plane, xDS dynamic config; it picks the backend and assumes the backend will reply or fail loudly. Right: Part 7 — Reliability Patterns, "the backend was wrong; now what?" — timeouts that match the SLO, retries with exponential backoff + jitter, idempotency keys at the API boundary, hedged requests for the long tail, circuit breakers (closed / open / half-open), bulkheads (pool isolation), deadline propagation across hops, fencing tokens for stale leaders; it survives the backend and recovers when the LB layer's model of "healthy" was wrong. A vertical accent-coloured dashed line between the columns is labelled "the wall" and annotated: the LB picked the right backend; the backend still broke; survival is a different layer.
Illustrative — the boundary between the two parts. Part 6 hands a backend address to the caller; Part 7 wraps every call to that backend in the retries, timeouts, and circuit breakers that survive the half-failures the LB layer cannot detect.

Why "the LB layer cannot detect" is a load-bearing claim and not hand-waving: the LB layer's job is to pick a backend before the request runs. To detect a 14-second GC pause it would have to be the request itself — by the time the GC pause is observable, the request is already in flight on that pod, and the LB has handed control off to the caller. Health checks help only as a rear-view mirror: the caller observed a GC pause on pod-7 over the last minute, marked it suspect, and now the LB skips it. But the current request, the one still waiting on pod-7 right now, was already routed; the only thing that can rescue it is a timeout + retry initiated by the caller, not by the LB. This is why Part 7 is a separate layer, not a feature of Part 6.
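The caller-side rescue fits in a few lines. A minimal sketch, with illustrative names (`call_backend` and `rescue` are not from any chapter artefact): each attempt is bounded by a caller-side timeout, and on expiry the stuck RPC is abandoned and retried against a freshly picked backend.

```python
import asyncio

async def call_backend(pod: str) -> str:
    # stand-in for the real RPC: pod-7 is mid GC pause and makes no progress
    if pod == "pod-7":
        await asyncio.sleep(3600)          # socket open, nothing ever arrives
    await asyncio.sleep(0.005)             # normal ~5 ms service time
    return f"200 OK from {pod}"

async def rescue(pick, timeout_s=0.1, attempts=2):
    # bound each attempt with a timeout; on expiry, abandon the stuck RPC
    # and retry against whichever backend the LB picks next
    for _ in range(attempts):
        pod = pick()
        try:
            return await asyncio.wait_for(call_backend(pod), timeout=timeout_s)
        except asyncio.TimeoutError:
            continue                       # the LB never saw a problem; we did
    raise TimeoutError("all attempts exhausted")

async def main():
    picks = iter(["pod-7", "pod-3"])       # the LB's picks: the paused pod first
    print(await rescue(lambda: next(picks)))   # -> 200 OK from pod-3

asyncio.run(main())
```

The LB is not consulted about the failure at all: the first pick simply expires at the caller, and the second pick succeeds. That is the entire division of labour this wall is about.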

A measurable demonstration — perfect LB, broken request

The script below sets up a perfect P2C load balancer routing requests across 8 backends. One backend (pod-7) enters a 14-second pause at t=2.0s. Every backend's health check stays green throughout (the health check does not touch the same code path as the real request). We measure two things: the LB's view of the system (which backend got picked, and how often) and the caller's view (request latency, including requests that never returned because they were stuck behind the paused pod).

# wall_lb_breaks.py — when a backend half-fails, what does the LB see vs the caller?
import asyncio, random, time

NUM_BACKENDS = 8
PAUSED_POD = 7
PAUSE_START = 2.0
PAUSE_DURATION = 14.0
CALLER_TIMEOUT = 10.0        # tighter than the pause: refuse to out-wait a silent backend
random.seed(42)

class Backend:
    def __init__(self, name, mean_ms):
        self.name, self.mean_ms = name, mean_ms
        self.healthy = True       # the LB's view, updated by a health-check loop (never flips here)
        self.served = 0

    async def serve(self, t_now):
        pause_end = PAUSE_START + PAUSE_DURATION
        if self.name == f"pod-{PAUSED_POD}" and PAUSE_START <= t_now < pause_end:
            # the pod is in a 14 s GC pause: socket accepted, no progress
            # until the pause ends, after which the request is served normally
            await asyncio.sleep(pause_end - t_now)
        await asyncio.sleep(self.mean_ms / 1000)
        self.served += 1

backends = [Backend(f"pod-{i}", 5.0 + random.uniform(-1, 1)) for i in range(NUM_BACKENDS)]

def p2c_pick(pool, local_in_flight):
    # each caller compares ITS OWN in-flight counts; a fresh caller sees zeros
    # everywhere, so P2C degenerates to a random choice and the paused pod
    # looks exactly as attractive as any healthy peer
    a, b = random.sample(pool, 2)
    return a if local_in_flight[a.name] <= local_in_flight[b.name] else b

async def caller(start_t):
    t0 = time.perf_counter()
    local_in_flight = {b.name: 0 for b in backends}   # fresh caller: empty view
    target = p2c_pick([b for b in backends if b.healthy], local_in_flight)
    try:
        await asyncio.wait_for(target.serve(time.perf_counter() - start_t),
                               timeout=CALLER_TIMEOUT)
        return ("ok", target.name, (time.perf_counter() - t0) * 1000)
    except asyncio.TimeoutError:
        return ("timeout", target.name, (time.perf_counter() - t0) * 1000)

async def main():
    start = time.perf_counter()
    tasks = []
    for _ in range(2000):
        tasks.append(asyncio.create_task(caller(start)))
        await asyncio.sleep(0.01)        # 100 RPS
    results = await asyncio.gather(*tasks)

    oks = [r for r in results if r[0] == "ok"]
    timeouts = [r for r in results if r[0] == "timeout"]
    pod7_picks = sum(1 for r in results if r[1] == f"pod-{PAUSED_POD}")
    pod7_oks = sum(1 for r in oks if r[1] == f"pod-{PAUSED_POD}")
    lats = sorted(r[2] for r in oks)

    print(f"requests sent: {len(results)}")
    print(f"  ok:        {len(oks)}")
    print(f"  timeout:   {len(timeouts)}  (pod-{PAUSED_POD} picked but never replied in time)")
    print(f"pod-{PAUSED_POD} picked {pod7_picks}x; pod-{PAUSED_POD} replied {pod7_oks}x")
    print(f"latency (ok only): p50={lats[len(lats)//2]:.1f}ms  "
          f"p99={lats[int(0.99*len(lats))]:.1f}ms  p999={lats[int(0.999*len(lats))]:.1f}ms")
    print(f"LB's view: every backend healthy={[b.healthy for b in backends]}")

asyncio.run(main())

A representative run (exact counts shift slightly with scheduler timing):

requests sent: 2000
  ok:        1951
  timeout:   49  (pod-7 picked but never replied in time)
pod-7 picked 247x; pod-7 replied 198x
latency (ok only): p50=5.1ms  p99=8436.2ms  p999=9841.7ms
LB's view: every backend healthy=[True, True, True, True, True, True, True, True]

Per-line walkthrough. Backend.serve simulates the half-failure honestly — when a request lands on pod-7 inside its pause window, it makes no progress until the pause ends and only then gets served; the LB's healthy flag is never touched, because the health check (not modelled here) hits a separate code path that is fine. p2c_pick chooses among backends with healthy == True, and every backend, pod-7 included, qualifies throughout; it compares per-caller in-flight counters, and every caller here is fresh, so the view is all zeros — exactly the situation in the chapter lead, where each caller's stuck RPCs are invisible to every other caller — and the choice degenerates to random. asyncio.wait_for(..., timeout=CALLER_TIMEOUT) is the caller-side timeout — the only thing that lets the caller bail out when the backend goes silent. Without it, every request routed to pod-7 would block until the pause ends, and the caller's entire concurrency budget would saturate behind one pod.

The output shows the painful truth: 49 requests timed out — the ones routed to pod-7 early enough in the pause that more than 10 s of it remained — the LB still considers every backend healthy, and the caller's p999 presses right up against the 10 s timeout, the slowest requests that squeaked in before the deadline. The LB layer sees nothing wrong: pod-7 is picked roughly its fair share of the time (~250 of 2000) because, from every caller's in-flight view, it has zero outstanding requests. The only mechanism that recovered the system is the wait_for(timeout=CALLER_TIMEOUT) line — a pattern that belongs to the next layer, not this one. That is why Part 6 ends here and Part 7 begins next.

Why the tail sits at the timeout and not the pause duration: a request that hits pod-7 at time t inside the [2 s, 16 s) pause window waits 16 − t seconds for the pause to end, so the stuck requests' waits are roughly uniform on [0, 14] seconds. The caller's 10 s timeout truncates that distribution — waits above 10 s become timeouts, and the surviving tail sits just under 10 000 ms, which is why p999 ≈ the timeout. The tail dominates p999 even though the stuck requests are only a few percent of total traffic. This is also why "average latency" is a useless metric for this failure: roughly 125 stuck-but-successful requests averaging ~5 s, buried in ~1 800 fast requests at ~5 ms, produce an average around 300 ms — which looks like nothing on a p50 dashboard but is the difference between a working checkout and a failed one for the ~2% of users who hit the pod.

What the next part actually fixes — preview

Part 7 (chapters 42–52) takes each of the failure modes the LB layer cannot fix and gives you a mechanism for it. The order is not arbitrary — each pattern addresses a specific shape of failure that the previous chapter exposed.

Part 7 preview — failure modes the LB layer cannot fix, and the Part 7 patterns that do:

    silent backend (GC pause, stuck thread)        →  timeout + caller-side cancellation
    transient 503 / connection reset               →  retry with exponential backoff + jitter
    double-charge from a naive retry               →  idempotency key at the API boundary
    long tail p99 even when "healthy"              →  hedged request to a second backend
    cascading retry storm under partial failure    →  retry budget + circuit breaker
    one slow service exhausts the caller's pool    →  bulkhead (per-dependency pool)
    cross-hop deadline drift                       →  deadline propagation in metadata
    old leader still writes after fail-over        →  fencing token (monotonic generation)

Each row is at least one chapter, and most rows interact with the others — retries without idempotency double-charge; timeouts without budgets storm; hedged requests without deadline propagation amplify load. Compose carefully.
Illustrative — the eight failure modes Part 6 leaves on the floor and the Part 7 patterns that pick them up. Several patterns interact (e.g. retries without idempotency double-charge), which is why Part 7 is eleven chapters not eight.
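One of those rows — hedged requests for the long tail — is small enough to preview the shape of the Part 7 artefacts. A sketch, not the chapter's implementation (`rpc` and `hedged` are illustrative names): fire the primary, wait a short hedge delay, and if no reply has arrived, fire a second copy to a different backend and take whichever answers first.

```python
import asyncio

async def rpc(pod: str, latency_s: float) -> str:
    await asyncio.sleep(latency_s)          # stand-in for the real call
    return f"reply from {pod}"

async def hedged(primary, backup, hedge_delay_s=0.05):
    # fire the primary; if it hasn't answered within the hedge delay,
    # fire a backup to a different backend and take the first reply
    tasks = {asyncio.create_task(primary())}
    done, pending = await asyncio.wait(tasks, timeout=hedge_delay_s)
    if not done:
        tasks.add(asyncio.create_task(backup()))
        done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()                          # don't leak the losing attempt
    return done.pop().result()

async def main():
    # primary is limping (2 s); backup answers in 5 ms -> the hedge wins
    reply = await hedged(lambda: rpc("pod-7", 2.0), lambda: rpc("pod-2", 0.005))
    print(reply)                            # -> reply from pod-2

asyncio.run(main())
```

The hedge delay is the whole design question: set it near the service's current p95 and you cap the tail for ~5% extra load; set it too low and you double traffic for no benefit — which is why the full pattern pairs hedging with a retry budget.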

The interactions matter as much as the patterns themselves. Retries without idempotency double-charge merchants — at PaySetu, this is a ₹-shaped incident, not a graph-shaped one. Hedged requests without retry budgets turn a 1% slowdown on one service into a 100% load amplification on the next service. Timeouts without deadline propagation mean every hop in a 12-hop call graph thinks it has 1500 ms even though the user-facing SLO is 1500 ms total — the last hop, started 800 ms in, gets 1500 ms of timeout budget instead of the 700 ms it actually has, which means it cheerfully waits for a backend that has already missed the user's deadline. Each pattern is local; the composition is global.

Why deadline propagation is not "just a timeout": a timeout is per-hop and starts fresh on every RPC. A deadline is end-to-end and counts down regardless of how many hops have happened. If your user-facing SLO is 1500 ms and the call graph has hops A→B→C→D, plain timeouts give each hop 1500 ms (so total budget could be 4500 ms), while deadline propagation gives A→B 1500 ms, B→C the remaining 1300 ms, C→D the remaining 700 ms. CricStream during a cricket final — when consensus across regions is hot and the user's video player is already buffering — has measured this difference: with deadline propagation, late hops short-circuit to a fallback in 50 ms instead of waiting the full 1500 ms, dropping cascading p99 by 4× during the highest-load minute.
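That arithmetic can be sketched as a deadline computed once at the edge and carried in call metadata, with each hop deriving its own timeout from what remains. The hop names and the 1500 ms budget mirror the example above; everything else (`remaining_ms`, the 50 ms fallback threshold) is an illustrative assumption.

```python
import time

def remaining_ms(deadline: float) -> float:
    # the deadline is an absolute instant set once at the edge; every hop
    # computes its budget from what is left, never from scratch
    return max(0.0, (deadline - time.monotonic()) * 1000)

def call(hop: str, deadline: float, work_ms: float) -> None:
    budget = remaining_ms(deadline)
    if budget < 50:                       # cheaper to serve the fallback than to wait
        print(f"{hop}: {budget:.0f} ms left -> short-circuit to fallback")
        return
    print(f"{hop}: {budget:.0f} ms budget")
    time.sleep(work_ms / 1000)            # stand-in for the hop's real work

deadline = time.monotonic() + 1.5         # user-facing SLO: 1500 ms, set once
call("A->B", deadline, work_ms=200)       # ~1500 ms budget
call("B->C", deadline, work_ms=600)       # ~1300 ms budget
call("C->D", deadline, work_ms=0)         # ~700 ms budget, still inside the SLO
```

In a real stack the deadline rides in request metadata (gRPC carries it natively; over plain HTTP it is typically a header holding the absolute deadline or remaining millis), but the invariant is the same: one subtraction per hop, zero fresh budgets.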

Common confusions

"Healthy" is not "working" — the health check is a separate, usually trivial code path; only the caller can observe whether its actual request completed in time. A timeout is not a deadline — a timeout is per-hop and starts fresh on every RPC, a deadline is end-to-end and only counts down. A retry is not free — without an idempotency key, retrying a debit double-charges. Fail-slow is not fail-stop — LBs and health checks handle nodes that crash and disappear; Part 7 exists for nodes that stay reachable and limp.

Going deeper

Why the failure mode in Part 6's blind spot has a name — "fail-slow"

The distributed-systems literature has a precise vocabulary for the failure shape Part 7 has to handle. Fail-stop is the easy case: the node crashes, the TCP connection breaks, the LB removes it, life goes on. Fail-slow (sometimes called "limping" or "grey failure") is the case where the node continues to appear alive — answers heartbeats, accepts connections, even returns 200 OK on health checks — while doing its real work catastrophically slowly. The FAST 2018 paper "Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems" (Gunawi et al.) catalogued 101 fail-slow incidents reported by twelve large-scale operators and institutions. The causes ranged across the stack: failing disk drives, a network interface degrading from 10 Gbps to 100 Mbps, CPUs stuck in thermal-throttle loops, firmware and scheduler bugs. The paper's core argument is that fail-slow faults are pervasive, long-lived, and systematically under-handled compared with fail-stop — and that the recovery mechanism is never the LB. It is timeouts + retries + hedged requests + circuit breakers, exactly the contents of Part 7.

The PaySetu post-incident pattern — six failures, one root cause shape

After three independent incidents in 2024 — the GC-pause incident in the chapter lead, a partner-bank slow-replication incident, and a connection-pool exhaustion in the fraud service — PaySetu's platform team built a postmortem template that asks one mandatory question: "Where in the call graph did the LB's view of healthy diverge from the caller's view of working?" All three incidents had the same answer — the LB saw healthy, the caller saw broken. The fix in every case was at the caller's layer: a tighter timeout, a circuit breaker per partner-bank, a bulkhead so the fraud service's slowness could not eat the checkout pool. The LB configuration was not changed in any of the three. This is the empirical evidence that the LB layer and the reliability layer are different problems with different fixes — and why Part 7 is structurally separate from Part 6 in the curriculum.

When Part 7 still does not save you — and Part 8 is needed

Even with perfect retries, timeouts, idempotency keys, and circuit breakers, there are failures that need agreement across nodes — not just defensive coding inside one. If your service has replicated state and the leader fails halfway through committing a write, no amount of retries on the caller's side recovers the question "did the write happen or not?". The answer needs consensus — Paxos, Raft, or a higher-level mechanism — and it is the topic of Part 8. Part 7's reliability patterns are necessary but not sufficient: they survive individual request failures, but they do not establish agreement about state. The two parts compose: Part 7 keeps requests landing somewhere; Part 8 makes sure the somewhere agrees about what was written.

The one number that separates a healthy reliability layer from a sick one

CricStream's platform team reports a single SLI that summarises whether the reliability layer is doing its job: the ratio of caller-observed errors to backend-observed errors. In a system where the reliability layer is working, the caller observes fewer errors than any individual backend reports — retries and hedged requests are absorbing transient failures, circuit breakers are pre-empting cascade. The healthy ratio is around 0.3 — the caller sees a third of the raw backend error count, because retries handled the rest. A ratio approaching 1.0 means the caller is seeing every error the backend produces — the reliability layer is missing or misconfigured. A ratio above 1.0 (the caller sees more errors than the backend reports) is a classic sign of timeout-induced failures: the caller is timing out before the backend has a chance to reply, counting it as a caller error, while the backend logs a successful response a moment later. CricStream tracks this ratio per call-pair and pages on values above 0.7. It is the closest thing to a unit test for "is the reliability layer working".
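The SLI itself is one division plus thresholds. A sketch with illustrative function names; the 0.7 page threshold and the three regimes are taken from the paragraph above.

```python
def reliability_ratio(caller_errors: int, backend_errors: int) -> float:
    # caller-observed errors / backend-observed errors, computed per call-pair
    return caller_errors / backend_errors if backend_errors else float("inf")

def classify(ratio: float) -> str:
    if ratio > 1.0:
        return "timeout-induced: caller gives up before the backend replies"
    if ratio > 0.7:
        return "page: reliability layer missing or misconfigured"
    return "healthy: retries and hedges are absorbing transient failures"

print(classify(reliability_ratio(30, 100)))    # 0.3  -> healthy
print(classify(reliability_ratio(90, 100)))    # 0.9  -> page
print(classify(reliability_ratio(120, 100)))   # 1.2  -> timeout-induced
```

The zero-backend-errors edge case matters in practice: a caller reporting errors against a backend that reports none is the purest form of the ratio-above-1.0 signal, which is why the sketch maps it to infinity rather than zero.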

Reproduce this on your laptop

# Run the simulator from this chapter; observe the LB's view diverging from the caller's.
# The script uses only the standard library — no virtualenv or pip install needed:
python3 wall_lb_breaks.py
# Expected: the timeout count and tail latency from the sample run above, while every
# backend still reports healthy=True

# Then verify the fix preview from Part 7 — a circuit breaker that opens after 3 consecutive timeouts:
# (in wall_lb_breaks.py, track consecutive failures per backend and have the pick
#  skip any backend with 3 or more)
# Re-run; expected: a handful of timeouts before the breaker trips, and a tail back under 100 ms
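The breaker hinted at in that re-run, as a standalone sketch. This is the count-and-gate simplification only — the full pattern in Part 7 adds the open/half-open states with a cool-down probe — and the class and threshold here are illustrative, not the chapter artefact.

```python
class CircuitBreaker:
    # simplest form: closed until `threshold` consecutive failures, then open.
    # (the full pattern adds a half-open probe after a cool-down timer)
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def allow(self) -> bool:
        return self.consecutive_failures < self.threshold

    def record(self, ok: bool) -> None:
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1

breakers = {f"pod-{i}": CircuitBreaker() for i in range(8)}

# three timeouts on pod-7 trip its breaker; the pick then skips it
for _ in range(3):
    breakers["pod-7"].record(ok=False)

pickable = [pod for pod, br in breakers.items() if br.allow()]
print(pickable)        # every pod except pod-7
```

Wiring this into the simulator means filtering the pick list on `allow()` and calling `record()` with the outcome of every `wait_for` — after which the paused pod stops receiving traffic on roughly the fourth pick instead of for the whole pause.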

Where this leads next

Part 7 (chapters 42–52) is where the system stops trusting the backend it picked. Each chapter takes one of the failure modes from the table above and gives you a mechanism with a runnable Python artefact.

After Part 7, Parts 8–9 take on consensus and leader election — the state-agreement problems that even a perfect reliability layer cannot solve alone. The system progresses from "pick a backend" (Part 6) to "survive the backend" (Part 7) to "agree on what happened" (Parts 8–9).

References

  1. Gunawi et al., "Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems" — FAST 2018 — the canonical catalogue of fail-slow events; the empirical motivation for Part 7.
  2. Dean & Barroso, "The Tail at Scale" — CACM 2013 — hedged requests, latency-tail amplification, why p99 dominates user experience.
  3. Brooker, "Timeouts, retries, and backoff with jitter" — Amazon Builders' Library — the practitioner reference for the backoff math used in Part 7 chapter 43.
  4. Nygard, Release It! (2nd ed., Pragmatic Bookshelf 2018) — the book that named circuit breakers and bulkheads as patterns; required reading for Part 7.
  5. Huang et al., "Gray Failure: The Achilles' Heel of Cloud-Scale Systems" — HotOS 2017 — Microsoft's framing of the same problem the OSDI paper above catalogues, with a control-theoretic flavour.
  6. Client-side vs proxy-side load balancing — the immediately previous chapter; the architectural choice whose blind spots motivate this wall.
  7. Bounded-load consistent hashing — the LB chapter whose elegance the wall most directly contrasts: even bounded-load CH cannot tell when a bounded backend has gone silent.