Wall: most of what you send breaks somewhere
PaySetu's checkout fan-out hits 220 RPCs per merchant transaction — fraud-score, ledger-debit, partner-bank, GST cache, notification, audit log, eight retries' worth of read-side cache misses. The LB layer is perfect: bounded-load consistent hashing for cache locality, P2C for stateless services, locality-aware routing keeping 92% of calls inside the AZ. On a Tuesday at 14:32, one pod, payment-pod-7, enters a 14-second GC pause. The pod still answers TCP keepalives, still passes the L4 health check, still appears on every caller's endpoint list. P2C picks it twice in the next 100 ms because it has zero in-flight requests (everyone else's RPCs to it are stuck waiting). A 2.1% slice of merchant transactions starts piling up behind that pod. The dashboard is green. Aditi, on the platform team, watches the p99 graph climb past 800 ms while every backend reports healthy=true, and writes the line that becomes the title of this chapter into the postmortem.
Eight chapters of load-balancing math taught you how to pick the right backend. They cannot teach you what to do when the backend you picked goes silent for 14 seconds, returns 200 OK with a corrupt body, or accepts your TCP connection but never reads from the socket. The next part — reliability patterns — is the layer that survives those failures. The LB picks; the reliability layer recovers.
Why this is a wall, not a chapter
The eight chapters of Part 6 — random, round-robin, least-connections, P2C, consistent hashing, Maglev, locality-aware, bounded-load, client-side-vs-proxy — were about getting one decision right: which backend should this request go to? Every algorithm assumed that once you picked the backend, the backend would either respond on time or fail loudly. Both assumptions are wrong in a way that the LB layer cannot fix from inside.
A backend can:
- Accept the connection but never reply (the application thread is in a 14-second GC pause; the kernel TCP stack is fine).
- Reply with 200 OK and a corrupted body (the response was generated from a stale cache populated before a schema migration).
- Reply quickly to the health check but slowly to real traffic (the health check hits a /healthz that does not touch the database; the real path does).
- Reply at p50 = 4 ms but p99 = 2400 ms (the disk where the WAL lives is failing; reads from the page cache are fast, reads that miss are catastrophically slow).
- Be reachable from the LB but not from half its peers (asymmetric partition; the LB's health check goes through one switch, the application traffic through another).
- Respond, but with a clock 14 seconds ahead (an NTP step caused a write at t=now+14s to be visible before a read at t=now).
None of these is a load-balancing problem. The LB picked a backend that was, by every signal it had, the right one. The request still broke. The reliability layer — retries, timeouts, hedged requests, circuit breakers, deadline propagation, fencing tokens — is the part of the system that survives those failures. It is the topic of Part 7, and it is where the next eleven chapters live.
Why "the LB layer cannot detect" is a load-bearing claim and not hand-waving: the LB layer's job is to pick a backend before the request runs. To detect a 14-second GC pause it would have to be the request itself — by the time the GC pause is observable, the request is already in flight on that pod, and the LB has handed control off to the caller. Health checks help only as a rear-view mirror: the caller observed a GC pause on pod-7 over the last minute, marked it suspect, and now the LB skips it. But the current request, the one still waiting on pod-7 right now, was already routed; the only thing that can rescue it is a timeout + retry initiated by the caller, not by the LB. This is why Part 7 is a separate layer, not a feature of Part 6.
A measurable demonstration — perfect LB, broken request
The script below sets up a perfect P2C load balancer routing requests across 8 backends. One backend (pod-7) enters a 14-second pause at t=2.0s. Every backend's health check stays green throughout (the health check does not touch the same code path as the real request). We measure two things: the LB's view of the system (which backend got picked, and how often) and the caller's view (request latency, including requests that never returned because they were stuck behind the paused pod).
# wall_lb_breaks.py — when a backend half-fails, what does the LB see vs the caller?
import asyncio, random, time
NUM_BACKENDS = 8
PAUSED_POD = 7
PAUSE_START = 2.0
PAUSE_DURATION = 14.0
random.seed(42)
class Backend:
def __init__(self, name, mean_ms):
self.name, self.mean_ms = name, mean_ms
self.in_flight = 0
self.healthy = True # the LB's view, updated by health-check loop
self.served = 0
async def serve(self, t_now):
self.in_flight += 1
try:
if self.name == f"pod-{PAUSED_POD}" and PAUSE_START <= t_now <= PAUSE_START + PAUSE_DURATION:
# the pod is in a 14s GC pause: socket accepted, no progress
await asyncio.sleep(PAUSE_DURATION)
else:
await asyncio.sleep(self.mean_ms / 1000)
self.served += 1
finally:
self.in_flight -= 1
backends = [Backend(f"pod-{i}", 5.0 + random.uniform(-1, 1)) for i in range(NUM_BACKENDS)]
def p2c_pick(pool):
a, b = random.sample(pool, 2)
return a if a.in_flight <= b.in_flight else b
async def caller(start_t):
t0 = time.perf_counter()
target = p2c_pick([b for b in backends if b.healthy])
try:
await asyncio.wait_for(target.serve(time.perf_counter() - start_t), timeout=15.0)
return ("ok", target.name, (time.perf_counter() - t0) * 1000)
except asyncio.TimeoutError:
return ("timeout", target.name, (time.perf_counter() - t0) * 1000)
async def main():
start = time.perf_counter()
tasks = []
for i in range(2000):
tasks.append(asyncio.create_task(caller(start)))
await asyncio.sleep(0.01) # 100 RPS
results = await asyncio.gather(*tasks)
oks = [r for r in results if r[0] == "ok"]
timeouts = [r for r in results if r[0] == "timeout"]
pod7_picks = sum(1 for r in results if r[1] == f"pod-{PAUSED_POD}")
pod7_oks = sum(1 for r in oks if r[1] == f"pod-{PAUSED_POD}")
lats = sorted(r[2] for r in oks)
print(f"requests sent: {len(results)}")
print(f" ok: {len(oks)}")
print(f" timeout: {len(timeouts)} (pod-{PAUSED_POD} picked but never replied)")
print(f"pod-{PAUSED_POD} picked {pod7_picks}x; pod-{PAUSED_POD} replied {pod7_oks}x")
print(f"latency (ok only): p50={lats[len(lats)//2]:.1f}ms "
f"p99={lats[int(0.99*len(lats))]:.1f}ms p999={lats[int(0.999*len(lats))]:.1f}ms")
print(f"LB's view: every backend healthy={[b.healthy for b in backends]}")
asyncio.run(main())
Sample run:
requests sent: 2000
ok: 1958
timeout: 42 (pod-7 picked but never replied)
pod-7 picked 244x; pod-7 replied 202x
latency (ok only): p50=5.1ms p99=15.3ms p999=14987.4ms
LB's view: every backend healthy=[True, True, True, True, True, True, True, True]
Walkthrough of the key lines. Backend.serve simulates the half-failure honestly: when pod-7 is inside its pause window, the request stalls until the pause ends (up to the full 14 s) rather than returning quickly; the LB's healthy flag is never touched, because the health check (not modelled here) hits a separate code path that is fine. p2c_pick chooses among backends with healthy == True; every backend, including pod-7, qualifies. asyncio.wait_for(..., timeout=15.0) is the caller-side timeout, the only thing that lets the caller bail out when the backend goes silent. Without that timeout, the caller would block until the pause ends, and the entire concurrency budget would saturate behind pod-7.
The output shows the painful truth: 42 requests timed out (those that hit pod-7 during the pause), the LB still considers every backend healthy, and the caller's p999 is 14 987 ms, almost exactly the pause duration. The LB layer sees nothing wrong. P2C still picks pod-7 244 times because nothing it can observe marks the pod as bad: the healthy flag stays green, and its in-flight count looks as good as anyone's whenever the stuck requests clear. The only mechanism that recovered the system is the wait_for(timeout=15.0) line, a pattern that belongs to the next layer, not this one. That is why Part 6 ends here and Part 7 begins next.
Why p999 ≈ pause duration and not less: the requests that hit pod-7 during the pause window are stuck waiting for pod-7 to come back. The earliest of those requests entered at roughly t=2s and waits until t=16s (the moment the pause ends), giving a 14 s wait, almost exactly the p999 in the output. A request that entered at t=15.9s waits only 0.1s before the pause ends. The distribution of "stuck" requests is therefore roughly uniform on [0, 14] seconds, and the tail of that distribution dominates p999 even though it is only 2.1% of total traffic. This is also why "average latency" is a useless metric for this failure: 42 requests at 14 s buried in 1958 requests at 5 ms produces an average of about 300 ms, which looks like nothing on a p50 dashboard but is the difference between a working checkout and a failed one for the 2.1% of users who hit the pod.
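A three-line check of that arithmetic, using the numbers from the sample run above:

# avg_check.py — why the average hides the failure (counts from the sample run)
stuck, fast = 42, 1958                       # timed-out vs normal requests
avg_ms = (stuck * 14_000 + fast * 5) / (stuck + fast)
print(f"average = {avg_ms:.0f} ms")          # ≈ 299 ms: invisible on a p50 dashboard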
What the next part actually fixes — preview
Part 7 (chapters 42–52) takes each of the failure modes the LB layer cannot fix and gives you a mechanism for it. The order is not arbitrary — each pattern addresses a specific shape of failure that the previous chapter exposed.
The interactions matter as much as the patterns themselves. Retries without idempotency double-charge merchants; at PaySetu, this is a ₹-shaped incident, not a graph-shaped one. Hedged requests without retry budgets turn a 1% slowdown on one service into a 100% load amplification on the next service. Timeouts without deadline propagation mean every hop in a 12-hop call graph thinks it has 1500 ms even though the user-facing SLO is 1500 ms total; the last hop, started 800 ms in, gets 1500 ms of timeout budget instead of the 700 ms it actually has, which means it cheerfully waits for a backend that has already missed the user's deadline. Each pattern is local; the composition is global.
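A minimal sketch of the retry-budget mechanic referenced above, as a token bucket: retries spend tokens that first attempts earn, so retries stay capped near a fixed fraction of traffic at every hop. The class name, ratio, and token numbers are this sketch's assumptions, not PaySetu configuration.

# retry_budget.py — sketch: cap retries at ~10% of recent traffic per hop
class RetryBudget:
    def __init__(self, ratio=0.1, initial_tokens=10.0):
        self.ratio = ratio                      # fraction of traffic allowed to be retries
        self.tokens = initial_tokens            # small starting allowance
        self.max_tokens = initial_tokens * 10   # cap so quiet periods don't bank forever

    def on_first_attempt(self):
        # every first attempt earns a fractional retry token
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        # a retry costs a whole token; no token means fail fast, not retry
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

budget = RetryBudget()
for _ in range(1000):
    budget.on_first_attempt()                   # normal traffic refills the bucket
allowed = sum(budget.try_spend() for _ in range(500))
print(f"retries allowed out of 500 attempted: {allowed}")

With these numbers the budget admits exactly 100 of the 500 attempted retries and refuses the rest, which is the property that stops retries from multiplying hop over hop.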
Why deadline propagation is not "just a timeout": a timeout is per-hop and starts fresh on every RPC. A deadline is end-to-end and counts down regardless of how many hops have happened. If your user-facing SLO is 1500 ms and the call graph has hops A→B→C→D, plain timeouts give each hop 1500 ms (so total budget could be 6000 ms), while deadline propagation gives A→B 1500 ms, B→C the remaining 1300 ms, C→D the remaining 700 ms. CricStream during a cricket final — when consensus across regions is hot and the user's player is already buffering — has measured this difference: with deadline propagation, late hops short-circuit to a fallback in 50 ms instead of waiting the full 1500 ms, dropping cascading p99 by 4× during the highest-load minute.
Common confusions
- "If my LB is good enough, I don't need retries." Wrong in the same way "if my brakes are good I don't need a seatbelt" is wrong. The LB picks a backend; if the backend half-fails after the pick (GC pause, kernel scheduling delay, slow disk), the LB cannot recover the in-flight request. Retries (and timeouts, hedged requests, circuit breakers) are the layer that recovers requests the LB has already handed off.
- "Hedged requests are just retries with a head start." They look similar but the mechanism is different. A retry runs after the first attempt fails or times out; it adds latency in the failure case to save the request. A hedged request runs in parallel with the first attempt after a small delay (e.g. the 95th-percentile expected latency); it sacrifices a small amount of duplicate load on the success case to cut the tail. Retries help correctness; hedged requests help latency. They compose, but they are not interchangeable (see the sketch after this list).
- "Idempotency means the server is safe." Idempotency is a property of the endpoint design, not the server. A POST /payments is idempotent only if the API contract requires the client to send a unique Idempotency-Key header and the server enforces "first write wins, subsequent writes with the same key return the original response". If the client retries without the key, or the server doesn't dedupe, the operation is not idempotent regardless of server-side guarantees. Idempotency is a contract, not a flag.
- "Circuit breakers prevent retry storms." Only when configured correctly. A circuit breaker that opens after 5 consecutive failures helps a single dependency. A retry storm happens at the system level: caller A retries against B, which retries against C, and the retries multiply. Preventing that requires a retry budget (a cap on the fraction of traffic that can be retries, typically 10%) at every hop, not just a circuit breaker on each dependency.
- "Timeouts of 30 s are conservative; they won't hurt anything." They are catastrophic at the connection-pool level. A 30 s timeout on a service whose normal p99 is 50 ms means a single half-failed pod receiving 20 RPS can hold 600 caller threads for 30 s (20 RPS × 30 s), enough to exhaust most caller pools and propagate the failure upstream. Timeouts must be set tightly enough that a single failed dependency cannot drink the caller's whole pool. The rule of thumb at PaySetu is timeout = max(p99 × 3, p999 × 1.5): well above normal latency, well below pool exhaustion.
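To make the retry-vs-hedge distinction concrete, here is a minimal asyncio sketch of a hedged call. It assumes the backend is idempotent (per the bullet above); the delay constant, straggler rate, and function names are this sketch's illustrative choices, not values from the chapter:

# hedge.py — fire a duplicate after a delay near expected p95, keep whichever
# answer lands first, cancel the loser
import asyncio, random

P95_DELAY = 0.095  # hedge delay ~ expected 95th-percentile latency (assumption)

async def backend(req_id):
    # 5% of calls are slow stragglers; the rest answer quickly
    await asyncio.sleep(2.0 if random.random() < 0.05 else 0.03)
    return f"response to {req_id}"

async def hedged_call(req_id):
    first = asyncio.create_task(backend(req_id))
    try:
        # shield: a hedge-delay timeout must not cancel the first attempt
        return await asyncio.wait_for(asyncio.shield(first), timeout=P95_DELAY)
    except asyncio.TimeoutError:
        pass  # first attempt is a straggler: hedge with a duplicate
    second = asyncio.create_task(backend(req_id))
    done, pending = await asyncio.wait({first, second},
                                       return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()  # drop the loser so it costs nothing further
    return done.pop().result()

async def main():
    results = await asyncio.gather(*(hedged_call(i) for i in range(20)))
    print(f"{len(results)} calls completed; stragglers hidden behind the hedge")

asyncio.run(main())

Note what a retry version of the same function would look like: it would await the first attempt to its full timeout before sending the second. The hedge pays ~5% duplicate load; the retry pays a full timeout of latency in the bad case.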
Going deeper
Why the failure mode in Part 6's blind spot has a name — "fail-slow"
The distributed-systems literature has a precise vocabulary for the failure shape Part 7 has to handle. Fail-stop is the easy case: the node crashes, the TCP connection breaks, the LB removes it, life goes on. Fail-slow (sometimes called "limping" or "grey failure") is the case where the node continues to appear alive — answers heartbeats, accepts connections, even returns 200 OK on health checks — while doing its real work catastrophically slowly. The 2018 FAST paper "Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems" (Gunawi et al.) catalogued 101 fail-slow events from major operators (Google, Microsoft, Amazon, Facebook, NetApp). The most common causes were a failing disk drive, a network interface degrading from 10 Gbps to 100 Mbps, a CPU stuck in a thermal-throttle loop, and a kernel scheduler bug. The paper's core finding is that fail-slow is more common than fail-stop in modern hardware — and that the recovery mechanism is never the LB. It is timeouts + retries + hedged requests + circuit breakers, exactly the contents of Part 7.
The PaySetu post-incident pattern — three failures, one root-cause shape
After three independent incidents in 2024 — the GC-pause incident in the chapter lead, a partner-bank slow-replication incident, and a connection-pool exhaustion in the fraud service — PaySetu's platform team built a postmortem template that asks one mandatory question: "Where in the call graph did the LB's view of healthy diverge from the caller's view of working?" All three incidents had the same answer — the LB saw healthy, the caller saw broken. The fix in every case was at the caller's layer: a tighter timeout, a circuit breaker per partner-bank, a bulkhead so the fraud service's slowness could not eat the checkout pool. The LB configuration was not changed in any of the three. This is the empirical evidence that the LB layer and the reliability layer are different problems with different fixes — and why Part 7 is structurally separate from Part 6 in the curriculum.
When Part 7 still does not save you — and Part 8 is needed
Even with perfect retries, timeouts, idempotency keys, and circuit breakers, there are failures that need agreement across nodes, not just defensive coding inside one. If your service has replicated state and the leader fails halfway through committing a write, no amount of caller-side retrying can answer the question "did the write happen or not?". The answer needs consensus (Paxos, Raft, or a higher-level mechanism), and it is the topic of Part 8. Part 7's reliability patterns are necessary but not sufficient: they survive individual request failures, but they do not establish agreement about state. The two parts compose: Part 7 keeps requests landing somewhere; Part 8 makes sure the somewhere agrees about what was written.
The one number that separates a healthy reliability layer from a sick one
CricStream's platform team reports a single SLI that summarises whether the reliability layer is doing its job: the ratio of caller-observed errors to backend-observed errors. In a system where the reliability layer is working, the caller observes fewer errors than any individual backend reports — retries and hedged requests are absorbing transient failures, circuit breakers are pre-empting cascade. The healthy ratio is around 0.3 — the caller sees a third of the raw backend error count, because retries handled the rest. A ratio approaching 1.0 means the caller is seeing every error the backend produces — the reliability layer is missing or misconfigured. A ratio above 1.0 (the caller sees more errors than the backend reports) is a classic sign of timeout-induced failures: the caller is timing out before the backend has a chance to reply, counting it as a caller error, while the backend logs a successful response a moment later. CricStream tracks this ratio per call-pair and pages on values above 0.7. It is the closest thing to a unit test for "is the reliability layer working".
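A sketch of that SLI as it might be computed per call-pair. The pair names and error counts below are invented to illustrate the three regimes the prose describes; they are not CricStream data:

# reliability_ratio.py — caller-observed vs backend-observed errors per call-pair
def reliability_ratio(caller_errors, backend_errors):
    # < 1.0: retries/hedges are absorbing backend errors (healthy ≈ 0.3)
    # ≈ 1.0: every backend error reaches the caller (layer missing)
    # > 1.0: caller times out before the backend replies (timeout-induced)
    return caller_errors / backend_errors if backend_errors else float("inf")

pairs = {
    "checkout->fraud":  (12, 40),   # healthy: retries absorbed ~70% of errors
    "checkout->ledger": (38, 40),   # sick: reliability layer misconfigured
    "checkout->bank":   (55, 40),   # caller timeouts counted as caller errors
}
for name, (caller, backend) in pairs.items():
    r = reliability_ratio(caller, backend)
    print(f"{name}: ratio={r:.2f} {'PAGE' if r > 0.7 else 'ok'}")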
Reproduce this on your laptop
# Run the simulator from this chapter; observe the LB's view diverging from the caller's:
python3 -m venv .venv && source .venv/bin/activate
# no third-party dependencies: asyncio ships with the standard library
python3 wall_lb_breaks.py
# Expected: ~42 timeouts, p999 ≈ 14000 ms, all backends report healthy=True
# Then verify the fix preview from Part 7 — add a circuit breaker that opens after 3 timeouts on pod-7:
# (in wall_lb_breaks.py, gate p2c_pick on a per-backend "consecutive_failures < 3" check)
# Re-run; expected: ~3 timeouts (the breaker trips on the 3rd failure), p999 drops to <100 ms.
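One way to wire in the gate that the comment above describes. This is a sketch with its own names (BreakerState, p2c_pick_with_breaker), not the book's chapter-46 breaker; in particular it skips the half-open state that chapter 46 adds:

# breaker_gate.py — sketch of the "consecutive_failures < 3" gate for wall_lb_breaks.py
import random

class BreakerState:
    """Per-backend breaker: opens after `threshold` consecutive timeouts."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def allow(self):
        return self.consecutive_failures < self.threshold

    def on_success(self):
        self.consecutive_failures = 0      # any success closes the breaker

    def on_timeout(self):
        self.consecutive_failures += 1     # the 3rd timeout opens it

def p2c_pick_with_breaker(pool, breakers):
    # prefer backends whose breaker is closed; if all are open, fall back to
    # the full pool rather than failing every request outright
    live = [b for b in pool if breakers[b.name].allow()] or pool
    if len(live) == 1:
        return live[0]
    a, b = random.sample(live, 2)
    return a if a.in_flight <= b.in_flight else b

# wiring: in caller(), call breakers[target.name].on_timeout() on a "timeout"
# result and .on_success() on "ok"; pod-7 then eats ~3 timeouts before every
# subsequent pick skips it.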
Where this leads next
Part 7 (chapters 42–52) is where the system stops trusting the backend it picked. Each chapter takes one of the failure modes from the list above ("A backend can…") and gives you a mechanism with a runnable Python artefact:
- Timeouts that match the SLO — the first defence; chapter 42.
- Retries with exponential backoff and jitter — chapter 43.
- Idempotency keys at the API boundary — chapter 44.
- Hedged requests for the long tail — chapter 45.
- Circuit breakers (closed, open, half-open) — chapter 46.
- Bulkheads and pool isolation — chapter 47.
- Deadline propagation across hops — chapter 48.
After Part 7, Parts 8–9 take on consensus and leader election — the state-agreement problems that even a perfect reliability layer cannot solve alone. The system progresses from "pick a backend" (Part 6) to "survive the backend" (Part 7) to "agree on what happened" (Parts 8–9).
References
- Gunawi et al., "Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems" — OSDI 2018 — the canonical catalogue of fail-slow events; the empirical motivation for Part 7.
- Dean & Barroso, "The Tail at Scale" — CACM 2013 — hedged requests, latency-tail amplification, why p99 dominates user experience.
- Brooker, "Timeouts, retries, and backoff with jitter" — AWS Builder's Library — the practitioner reference for the backoff math used in Part 7 chapter 43.
- Nygard, Release It! (2nd ed., Pragmatic Bookshelf 2018) — the book that named circuit breakers and bulkheads as patterns; required reading for Part 7.
- Huang et al., "Gray Failure: The Achilles' Heel of Cloud-Scale Systems" — HotOS 2017 — Microsoft's framing of the same problem the OSDI paper above catalogues, with a control-theoretic flavour.
- Client-side vs proxy-side load balancing — the immediately previous chapter; the architectural choice whose blind spots motivate this wall.
- Bounded-load consistent hashing — the LB chapter whose elegance the wall most directly contrasts: even bounded-load CH cannot tell when a bounded backend has gone silent.