Partial failures and why they're the worst
It is 19:42 on a Friday. Riya is on call for PaySetu's payments tier, staring at a dashboard where every panel is green except the one that matters: end-to-end success rate, which has slumped from 99.97% to 96.4% over the last eight minutes. CPU is fine. Memory is fine. The leader is up, the followers are up, replication lag p99 is 18 ms, every health check on every replica is returning 200 OK in under 30 ms. And yet 1 in 28 customer payments is stuck at "PROCESSING" for over a minute, and the support queue is filling with screenshots of the spinner. Somewhere in the cluster a node is partly working — accepting connections, answering health checks, advancing the WAL, and silently dropping or deferring a fraction of real customer requests. Riya's job for the next two hours is to find which node, find which subsystem inside that node, and convince it to either fully succeed or fully fail. Until she does, the cluster is in the worst possible state — alive enough to accept work, dead enough to fail it. This chapter is about that state.
Partial failure is when a process is alive on every metric you measure but failing on the metric you forgot to measure. It is worse than a clean crash because the failure detector cannot fire, the load balancer cannot drain, the protocol cannot re-elect, and the customer cannot retry — every safety mechanism in the stack is conditioned on "the node is dead", and a partially-failed node is, by construction, not dead.
What "partial failure" actually means
A clean crash is the easiest failure mode to reason about. The process is gone, the TCP connection resets, the heartbeat stops, the failure detector fires within one timeout, the load balancer drains the endpoint, the protocol re-elects, and life proceeds with one fewer replica until the operator brings the node back. Every step in that recovery sequence is well-defined, well-tested, and runs without human intervention.
A partial failure is the opposite. The process is still there. It still answers TCP. It still emits heartbeats — at the right cadence, with the right payload. It still processes some requests correctly. But for some subset of work — defined by request type, by hash of the user-id, by which thread happens to be unlucky, by which connection pool entry got assigned — the work either fails outright, succeeds with corrupted output, or hangs forever. From outside, every observation says "alive". From the perspective of the failed work, the node is dead. The cluster's recovery machinery, which is keyed on "alive vs dead", produces no recovery — because, at the level it observes, there is nothing to recover from.
The taxonomy of partial-failure modes that production engineers encounter, ordered by how badly they break the failure detector:
| Mode | What's happening | What the detector sees | Detection difficulty |
|---|---|---|---|
| Fail-slow | Disk / network / GC introduces 100×–1000× latency on a fraction of requests | Heartbeats arriving on time, p50 latency normal | Hard — only p99/p99.9 shifts |
| Fail-stutter | Node alternates 50ms windows of work and 50ms of silence | Heartbeats arrive in bursts; misses look like packet loss | Hard — averages out |
| Half-open connection | TCP socket alive, application thread blocked / deadlocked | Connection accept works, request handler hangs | Hard — timeout-only |
| Resource starvation | One thread pool exhausted, others fine | Most requests OK; one specific endpoint times out | Medium — per-endpoint metrics |
| Silent corruption | Bad RAM / disk returns wrong bytes; CRC catches some, not all | Everything returns 200 OK with wrong data | Very hard — needs end-to-end checks |
| Bug-in-one-codepath | New version's payment code path hits a NullPointerException; rest fine | Handler returns 500 on 1/N requests | Medium — error-rate per route |
| GC pause masquerading | 9-second STW pause; heartbeat misses; node returns; cluster has re-elected | Failure detector fires, but the "dead" node is actually fine 9s later | Medium — but creates split-brain |
Notice the pattern: every row's detection-difficulty column is at least "Medium". Clean crashes are the easy mode. Partial failures are the modes that make distributed-systems debugging hard, that fill post-mortem documents, that drove the entire field from "is this node up" (heartbeats) to "what fraction of this node's work is succeeding right now" (synthetic probes, per-endpoint metrics, percentile tracking, end-to-end correctness checks). The history of distributed-systems observability for the last 15 years is, in one sentence, the migration from binary alive/dead detection to continuous fraction-of-work-succeeding detection, driven by the realisation that partial failures dominate.
Why "alive" is not a single fact about a process: a Linux process is a collection of threads, file descriptors, lock-protected data structures, kernel buffers, and JIT'd machine code. Each of those can fail independently — one thread can deadlock while others run, one file descriptor can wedge while others serve, the GC can stall the JVM heap while the kernel-mode networking stack continues acking TCP keepalives. Asking "is this process alive" is asking about the average state of dozens of concurrent subsystems, and a partial failure is, almost by definition, when the average looks fine but a specific subsystem is broken.
Walking through fail-slow in code
The cleanest way to feel partial failure is to simulate it. The Python script below sets up a four-replica service where one replica injects a 1-in-8 fail-slow rate (500 ms latency on 1 of 8 calls, 12 ms otherwise). The client sends 1000 requests round-robin across the four replicas, computes p50 / p95 / p99 / p99.9 latency, and shows what the customer experiences. The point is to watch the aggregate metrics stay healthy while the tail metrics scream.
```python
# fail_slow.py — simulate one replica out of four with a 1-in-8 fail-slow injection
import random, statistics

class Replica:
    def __init__(self, idx, slow_rate=0.0, slow_ms=500.0, fast_ms=12.0):
        self.idx = idx
        self.slow_rate = slow_rate
        self.slow_ms = slow_ms
        self.fast_ms = fast_ms
        self.served = 0
        self.slow_count = 0

    def handle(self):
        self.served += 1
        if random.random() < self.slow_rate:
            self.slow_count += 1
            return self.slow_ms + random.gauss(0, 30)  # noisy slow path
        return self.fast_ms + random.gauss(0, 2)       # noisy fast path

def run(n_requests=1000):
    replicas = [
        Replica(0, slow_rate=0.0),
        Replica(1, slow_rate=0.0),
        Replica(2, slow_rate=1/8),  # the fail-slow node
        Replica(3, slow_rate=0.0),
    ]
    latencies = []
    for i in range(n_requests):
        r = replicas[i % 4]  # naive round-robin
        latencies.append(r.handle())
    latencies.sort()

    def pct(p):
        return latencies[int(p * len(latencies)) - 1]

    print(f"served per replica: {[r.served for r in replicas]}")
    print(f"slow events on r2: {replicas[2].slow_count}")
    print(f"p50 = {pct(0.50):6.1f} ms")
    print(f"p95 = {pct(0.95):6.1f} ms")
    print(f"p99 = {pct(0.99):6.1f} ms")
    print(f"p99.9 = {pct(0.999):6.1f} ms")
    print(f"mean = {statistics.mean(latencies):6.1f} ms")
    print(f"requests > 200ms: {sum(1 for x in latencies if x > 200)}")

random.seed(42); run()
```
Sample run:
```text
served per replica: [250, 250, 250, 250]
slow events on r2: 31
p50 = 12.4 ms
p95 = 13.5 ms
p99 = 519.4 ms
p99.9 = 566.7 ms
mean = 27.1 ms
requests > 200ms: 31
```
Read what just happened. The mean latency moved from 12 ms to 27 ms — you would barely notice on a coarse dashboard. The p50 is unchanged. The p95 is unchanged. The p99 jumped from ~14 ms to 519 ms — a 37× regression — because roughly 3% of requests (25% of traffic × a 1-in-8 slow rate) land on the bad replica's slow path: more than the 1% the p99 looks past, but less than the 5% the p95 looks past. From the load balancer's perspective, all four replicas are serving the same number of requests with the same 200 OK rate. Heartbeats are landing. The /health endpoint, served by a separate hot path, is fine. The fail-slow is invisible to every detector that aggregates by mean and visible only to detectors that track tails. This is the headline lesson: partial failures hide in tails, and any system whose alerting averages over the population of requests will systematically miss them.
Why round-robin makes this worse than random: with 4 replicas and round-robin, exactly 25% of traffic is bound for the bad replica every minute, deterministically. With random load balancing, the same 25% expected share applies in the long run, but over short windows the proportion fluctuates, sometimes giving the bad replica less. Round-robin's uniformity is virtuous for healthy clusters and pathological for partially-failed ones — it concentrates the impact rather than spreading it. The fix is power-of-two-choices (P2C) or similar tail-aware load balancing, which we cover in Part 6: P2C reads two replicas' queue depths and picks the shorter, automatically draining traffic away from a queueing replica without needing to declare it failed.
Why slow_ms = 500 was chosen rather than 50 ms: at 50 ms the regression on p99 would be roughly 4× — visible but not screaming. At 500 ms the regression is 37×, well past any reasonable SLO. The empirical fail-slow data collected by Gunawi et al. (FAST 2018) from production systems showed slow events typically 100×–1000× the median latency — disk fsync hangs, GC pauses, network retransmits. Choosing 500 ms in the simulation matches the lower end of that empirical range. A real fail-slow disk in a JBOD shelf can sit at multi-second latencies for hours before failing fully or recovering — and during those hours, the cluster's tail latency is degraded by exactly the fraction of writes that land on that disk.
Why p99.9 in the output is 566.7 ms rather than far higher: the simulation has only 1000 requests, so the p99.9 bucket holds exactly one sample (the 999th sorted latency). Real production systems compute percentiles over millions of samples per minute and the p99.9 tail is dominated by tens of slow events; in the simulation the tail thins out because the sample size is tiny. This is why empirical practice uses HDR Histograms (hdrh) at scale — they retain accurate percentile information across orders of magnitude of latency without needing to keep every sample.
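A sketch of the production-scale alternative, assuming the `hdrhistogram` package from PyPI (which imports as `hdrh`): each latency is recorded into a compact histogram and percentiles are read from it, instead of keeping and sorting every raw sample. The 1-in-8 / 25%-of-traffic injection mirrors the simulator above; the sample count and histogram bounds are illustrative.

```python
# percentiles via HDR Histogram — assumes `pip install hdrhistogram` (imports as hdrh)
import random
from hdrh.histogram import HdrHistogram

hist = HdrHistogram(1, 60_000, 3)   # track 1 ms .. 60 s at 3 significant digits

random.seed(42)
for i in range(1_000_000):          # a million samples costs kilobytes of state, not megabytes
    slow = (i % 4 == 2) and random.random() < 1 / 8   # same 1-in-8 fail-slow on replica 2
    latency_ms = random.gauss(500, 30) if slow else random.gauss(12, 2)
    hist.record_value(max(1, int(latency_ms)))

for p in (50, 95, 99, 99.9):
    print(f"p{p:<5} = {hist.get_value_at_percentile(p)} ms")
```

With a million samples the p99.9 estimate stops being a single noisy sample and settles firmly inside the slow-path distribution, which is the behaviour the chapter's argument relies on.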
Production stories — partial failure in the wild
The three stories below are different mechanisms (disk, GC, configuration) producing the same observable: an alive node failing a fraction of work while every aggregate metric stays green.
Fail-slow disk — PaySetu's wallet replication, ap-south-1, March 2024. An NVMe drive on wallet-replica-3 developed a controller-level fault that caused 1-in-200 fsync calls to take 4–9 seconds instead of the usual 250 µs. Replication's commit path waits on fsync, so 1-in-200 wallet writes saw replication-lag spikes from 18 ms to 4–9 seconds. Customer impact: ~0.5% of payment confirmations took >5 s, surfacing as the "stuck spinner" UX. Detection took 47 minutes because the dashboards averaged replication lag over 1-minute windows — the spikes were diluted to a 30 ms p99 bump that read as normal. The fix was a per-replica p99.9 alert and a dmesg correlation: SMART logs showed the controller was issuing READAHEAD-fail retries that the kernel timed out at 9 s. Drive replaced; cluster restored.
GC pause masquerading as failure — KapitalKite's order-matching cluster, October 2024. The market-open spike doubled allocation rate on om-2, triggering a 9-second old-generation collection. During the pause, heartbeats from om-2 stopped; the cluster's failure detector fired at 4 s and re-elected with om-1 as the new leader. Then the GC finished, om-2 returned, looked at its term and saw it was now stale, attempted to fail back, was rejected by fencing — but in the 9-second pause om-2 had been holding a connection from the FIX gateway and silently dropped 240 incoming orders. Postmortem found the orders had not been ack'd on the wire, so the FIX client's retry timer kicked in; 90% of the orders made it through within 30 seconds. The 10% that didn't became reconciliation cases the next morning. The fix was three-layered — bigger heap to reduce GC frequency, a shorter election timeout (3 s → 1.5 s) so re-election kicked in before clients gave up retrying, and a sticky-FIX-session policy so the gateway re-bound to the new leader within one second. The lesson: a long GC pause is a partial failure during the pause and a phantom node during the recovery; the cluster has to handle both halves.
Resource-starvation partial failure — CricStream's recommendation service, IPL final 2024. Traffic was 14× normal. The recommendation service had two thread pools — one for personalised feeds (expensive, 200 ms p50) and one for trending feeds (cheap, 12 ms p50). Default pool sizing was 200 personalised + 50 trending. At 14× load, the personalised pool saturated, request queue overflowed, and the service started returning 504 gateway timeouts on personalised requests — but trending requests were still served by their separate pool with no contention. From the load balancer, the service answered 95% of requests fine. Personalised customers (~5% of traffic, the high-engagement segment) saw mostly errors. The fix during the incident was to increase the personalised pool to 500; the post-incident fix was bulkhead pattern — separate connection pools per request class, with per-pool tail-latency alerts. The lesson: a single process with a shared resource is one partial-failure incident waiting to happen — partition resources by request class to localise the blast radius.
The signature shared by all three stories is the failure is real, the failure is partial, and every recovery mechanism in the stack assumed it was binary. The disk wasn't dead — fsync was slow on 0.5% of writes. The JVM wasn't dead — it was paused for 9 seconds. The recommendation service wasn't dead — it was saturated on one request class. Each mechanism the cluster had — health checks, heartbeats, replication acks, election timeouts — was binary: alive or dead. Each mechanism therefore failed to recognise the partial state, and the cluster sat in degraded operation until either a human intervened or the failure escalated to a clean crash that the binary mechanisms could finally see.
A second pattern: partial failures' detection cost is paid in latency. A binary "is the heartbeat arriving" check costs O(1) per node per second. A continuous "what fraction of requests on this node hit the slow path" check requires actual workload, sample-based inference, percentile tracking, and per-route segmentation — at minimum O(routes × percentiles × samples) per minute. The gap between cheap binary detection and rich continuous detection is the gap between the problem the textbook covers and the problem production engineers actually have, and it is why every observability vendor sells the same thing — high-resolution percentile-aware metrics — for steeply more money than basic up/down monitoring.
How the stack defends — and where it still leaks
Defences against partial failure stack from operationally-cheap to operationally-expensive. None alone is sufficient; the production deployment is a layered combination.
Tail-aware load balancing (P2C). The Power of Two Choices algorithm (Mitzenmacher 2001, Vamanan et al. 2013) makes load-balancing decisions based on observed queue depth or recent latency rather than round-robin or random. When a replica becomes slow, P2C automatically routes traffic away from it because its queue is longer. The replica is never declared "dead" — it just stops getting work. This is the cheapest, fastest mitigation against fail-slow because it does not require detection of the failure mode, only observation of the latency.
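A minimal sketch of the idea in the same style as fail_slow.py. The EWMA load signal, the persistently slow replica, and the request counts are illustrative assumptions; a real balancer also periodically re-probes a drained replica so it can recover once it is healthy again.

```python
# p2c.py — power-of-two-choices vs round-robin when one replica turns persistently slow
import random

class Replica:
    def __init__(self, base_ms):
        self.base_ms = base_ms
        self.ewma = 0.0                    # load signal: exponentially weighted recent latency

    def serve(self):
        ms = max(0.0, random.gauss(self.base_ms, self.base_ms * 0.1))
        self.ewma = 0.8 * self.ewma + 0.2 * ms
        return ms

def simulate(pick, n=50_000):
    replicas = [Replica(12), Replica(12), Replica(500), Replica(12)]   # r2 is fail-slow
    latencies = sorted(pick(replicas, i).serve() for i in range(n))
    return latencies[int(0.99 * n) - 1]

def round_robin(rs, i):
    return rs[i % len(rs)]

def p2c(rs, i):
    a, b = random.sample(rs, 2)            # sample two replicas, send to the less-loaded one
    return a if a.ewma <= b.ewma else b

random.seed(7)
print(f"p99 with round-robin: {simulate(round_robin):6.1f} ms")   # ~ the slow path
print(f"p99 with P2C:         {simulate(p2c):6.1f} ms")           # ~ the fast path
```

The slow replica is never declared dead; it simply loses every two-way comparison and stops receiving work, which is exactly the behaviour described above.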
Hedged requests. When a request to one replica is taking longer than the p95 of recent requests, fire a duplicate to a second replica and use whichever returns first. The Tail at Scale paper (Dean & Barroso 2013) showed this drops p99 latency by 30–50% on Google services with negligible additional load (only the slow tail gets duplicated). Hedging works because partial failures show up as latency outliers; firing a second request bypasses the slow node without needing to know which node is slow. The cost is the duplicate request load, and the requirement that the request is idempotent (it must be safe to execute twice).
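A minimal asyncio sketch of hedging. The fixed 20 ms hedge delay stands in for "the p95 of recent requests", and the replica names and latencies are invented for the demo; the request is assumed idempotent, as the paragraph requires.

```python
# hedged_request.py — fire a backup request if the primary hasn't answered by the hedge delay
import asyncio, random

async def call_replica(name: str) -> str:
    # stand-in for an RPC; "r2" plays the fail-slow replica
    delay = 0.5 if name == "r2" else 0.012
    await asyncio.sleep(random.gauss(delay, delay * 0.1))
    return f"response from {name}"

async def hedged_call(primary: str, backup: str, hedge_after_s: float = 0.020) -> str:
    first = asyncio.create_task(call_replica(primary))
    try:
        # give the primary roughly the recent p95 before hedging
        return await asyncio.wait_for(asyncio.shield(first), timeout=hedge_after_s)
    except asyncio.TimeoutError:
        second = asyncio.create_task(call_replica(backup))
        done, pending = await asyncio.wait({first, second}, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()                 # drop the loser; safe only because the call is idempotent
        return done.pop().result()

async def main():
    random.seed(1)
    # the primary happens to be the fail-slow replica; the hedge rescues the request
    print(await hedged_call(primary="r2", backup="r0"))

asyncio.run(main())
```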
Per-route, per-percentile alerting. Instead of "is the cluster's mean latency under 50 ms", track p99 per route per replica. The fail-slow on wallet-replica-3 shows up as wallet.commit.p99{replica=3} = 4500ms while wallet.commit.p99{replica=*} = 30ms for the others. This requires the metrics infrastructure to support high-cardinality percentile tracking — Prometheus's histogram type, OpenTelemetry's exponential histograms, or HDR Histograms streamed to a backend. The cost is metrics-storage capacity and alerting-rule complexity.
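A toy version of that alert, computed from raw samples rather than a metrics backend. The 1-in-50 slow rate (chosen so the slow path crosses the p99) and the 5× peer-median threshold are illustrative, not drawn from the PaySetu incident.

```python
# per_route_p99.py — flag a replica whose p99 on one route is far above its peers
import random
from collections import defaultdict

def pct(samples, p):
    s = sorted(samples)
    return s[max(0, int(p * len(s)) - 1)]

random.seed(3)
samples = defaultdict(list)                      # keyed by (route, replica)
for i in range(40_000):
    replica = i % 4
    slow = replica == 3 and random.random() < 1 / 50   # replica 3 is fail-slow on this route
    latency = random.gauss(4500, 300) if slow else random.gauss(30, 5)
    samples[("wallet.commit", replica)].append(latency)

route = "wallet.commit"
per_replica = {r: pct(v, 0.99) for (rt, r), v in samples.items() if rt == route}
peer_median = sorted(per_replica.values())[len(per_replica) // 2]
for replica, p99 in sorted(per_replica.items()):
    flag = "  <-- ALERT" if p99 > 5 * peer_median else ""
    print(f"{route}.p99{{replica={replica}}} = {p99:7.1f} ms{flag}")
```

The cluster-wide p99 over the same samples would barely move; only the per-replica breakdown isolates the bad node.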
Synthetic end-to-end probes. A probe that exercises the full request path — auth, validation, persistence, replication, response — and asserts on the output (not just status code). For PaySetu, a synthetic ₹1 transfer between two test wallets every 5 seconds, with the success criterion being "balance updated correctly on both wallets within 200 ms". Real partial failures fail this probe; /health does not.
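A sketch of the shape of such a probe. The `wallet_client` interface and its `balance()` / `transfer()` methods are hypothetical placeholders, and the in-memory fake exists only so the example runs end to end.

```python
# synthetic_probe.py — a probe that asserts end-to-end correctness, not just a status code
import time, uuid

def run_probe(wallet_client, src="probe-wallet-a", dst="probe-wallet-b", amount=1):
    before_src, before_dst = wallet_client.balance(src), wallet_client.balance(dst)
    started = time.monotonic()
    wallet_client.transfer(src, dst, amount, idempotency_key=str(uuid.uuid4()))
    elapsed_ms = (time.monotonic() - started) * 1000
    ok = (
        elapsed_ms < 200                                         # latency SLO, not just "returned"
        and wallet_client.balance(src) == before_src - amount    # correctness on both sides
        and wallet_client.balance(dst) == before_dst + amount
    )
    return ok, elapsed_ms

class FakeWalletClient:                     # in-memory stand-in so the sketch is runnable
    def __init__(self):
        self.balances = {"probe-wallet-a": 100, "probe-wallet-b": 100}
    def balance(self, wallet):
        return self.balances[wallet]
    def transfer(self, src, dst, amount, idempotency_key):
        self.balances[src] -= amount
        self.balances[dst] += amount

print(run_probe(FakeWalletClient()))        # scheduled every few seconds in production
```

A failing probe pages even while /health is still returning 200 OK, which is the whole point.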
Bulkhead pattern (resource isolation). Partition shared resources — thread pools, connection pools, memory regions — by request class, so that one class's saturation cannot starve another. Hystrix, the Netflix library, popularised the per-dependency thread pool. Modern frameworks (Java's Project Loom virtual threads, Go's goroutines) make the abstraction cheaper but the pattern is unchanged: do not let one class's partial failure spill into another's.
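A minimal sketch of the pattern with semaphores standing in for the per-class thread pools; the pool sizes echo the CricStream numbers, but the code is illustrative rather than any framework's API.

```python
# bulkhead.py — per-request-class capacity so one class's saturation cannot starve another
import threading

class Bulkhead:
    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)
    def try_acquire(self):
        return self._slots.acquire(blocking=False)   # shed load instead of queueing unboundedly
    def release(self):
        self._slots.release()

pools = {"personalised": Bulkhead(200), "trending": Bulkhead(50)}

def handle(request_class):
    pool = pools[request_class]
    if not pool.try_acquire():
        return 503                                   # only this class degrades
    try:
        return 200                                   # ... the real work would happen here ...
    finally:
        pool.release()

# Saturate the personalised pool and note that trending is unaffected:
for _ in range(200):
    pools["personalised"].try_acquire()              # simulate 200 in-flight personalised requests
print("personalised:", handle("personalised"))       # 503 — its bulkhead is full
print("trending:    ", handle("trending"))           # 200 — separate capacity, untouched
```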
Circuit breakers. When a replica's recent error rate exceeds a threshold, mark it open and refuse to send requests for a cooling-off period. Half-open after the period, send a single probe; if it succeeds, close (resume routing); if it fails, re-open. Circuit breakers convert partial failure (some requests fail, others don't, all routed) into clean failure (no requests routed) at the cost of artificial unavailability — they intentionally drop traffic on a half-broken replica to prevent the broken half from poisoning the whole cluster.
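A minimal sketch of that state machine. A production breaker would also limit half-open to a single in-flight probe and track error rate over a sliding window rather than a raw counter; the thresholds here are illustrative.

```python
# circuit_breaker.py — convert a half-broken replica into a cleanly skipped one
import time

class CircuitBreaker:
    def __init__(self, error_threshold=5, cooloff_s=30.0):
        self.state = "closed"                 # closed -> open -> half-open -> closed/open
        self.errors = 0
        self.error_threshold = error_threshold
        self.cooloff_s = cooloff_s
        self.opened_at = 0.0

    def allow_request(self):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.cooloff_s:
                self.state = "half-open"      # let a probe through after the cool-off
                return True
            return False
        return True                           # closed and half-open both allow requests

    def record_success(self):
        self.state, self.errors = "closed", 0

    def record_failure(self):
        self.errors += 1
        if self.state == "half-open" or self.errors >= self.error_threshold:
            self.state, self.opened_at = "open", time.monotonic()

breaker = CircuitBreaker(error_threshold=3, cooloff_s=5.0)
for _ in range(4):                            # the replica starts failing every request
    if breaker.allow_request():
        breaker.record_failure()
print(breaker.state, "- traffic to this replica is refused until the cool-off expires")
```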
Retries with deadlines and idempotency. When a request to one replica fails or times out, retry on a different replica. The retry needs a propagated deadline (otherwise an unbounded retry chain keeps amplifying latency) and the operation needs an idempotency key (otherwise a retry of an operation the original replica did in fact receive commits it twice). Deadline propagation is the underrated half of this — without it, retries cause retry storms during partial-failure incidents, where the cluster's load doubles or triples just as it loses capacity.
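A sketch of the shape of such a retry loop; the replica behaviour, per-try timeout, and overall deadline are invented for the demo.

```python
# retry_with_deadline.py — retries that carry a deadline and an idempotency key across replicas
import time, uuid

class Unavailable(Exception):
    pass

def call_replica(replica, payload, idempotency_key, timeout_s):
    # stand-in RPC: replica 0 eats its whole budget and fails; the others answer quickly
    if replica == 0:
        time.sleep(timeout_s)
        raise Unavailable("timed out")
    return f"ok from replica {replica} (key={idempotency_key[:8]})"

def call_with_retries(payload, replicas, deadline_s=0.5):
    key = str(uuid.uuid4())                   # same key on every attempt -> safe to execute twice
    deadline = time.monotonic() + deadline_s
    last_error = None
    for replica in replicas:                  # each retry targets a different replica
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                             # propagate the deadline instead of retrying forever
        try:
            return call_replica(replica, payload, key, min(remaining, 0.2))
        except Unavailable as e:
            last_error = e
    raise TimeoutError(f"deadline exhausted across {replicas}: {last_error}")

print(call_with_retries({"amount": 1}, replicas=[0, 1, 2]))
```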
Why none of these alone is sufficient: P2C avoids slow replicas but cannot prevent silent corruption (the fast path is wrong, not slow). Hedged requests bypass slow nodes but not always-failing nodes (both copies fail). Synthetic probes catch the workload-level failures the probe simulates, missing those it doesn't. Bulkheads bound blast radius but don't recover. Circuit breakers convert partial to clean failure at the cost of throwing away potentially-good capacity. Retries fix transient failure but amplify load during incidents. Each defence catches a slice of the partial-failure space; the production system is the layered union of all of them, and even then a long-tail of incidents escapes every layer to surface as customer-visible degradation.
What remains genuinely hard: silent data corruption that passes every CRC and every probe. If the bad RAM corrupts a single bit in a payment amount field after the CRC was computed, the bit-flipped record is replicated, persisted, and read back without any check firing. End-to-end checksums (covered in Part 18 on observability) catch some of this, application-layer invariant checks catch some more, but the long tail of "the answer was wrong and nothing in the stack caught it" is the residual partial-failure surface that every distributed system inherits. The current research frontier — Google's ECC at every layer, Amazon's fingerprint-based correctness probes, Microsoft's Pingmesh — is all about closing this gap, and it is incomplete by design.
Common confusions
- "Partial failure is just a slow node." No. Slow is one mode (fail-slow); silent corruption, half-open connections, resource starvation, and bug-in-one-codepath are all partial failures in which latency is fine and something else is broken. Treating partial failure as a synonym for fail-slow leaves the other modes invisible.
- "If the health check says the node is up, the node is up." Health checks check the health-check codepath. Most production health-check endpoints are served by a thin servlet that touches none of the real workload's critical resources. A `/health` returning `200 OK` is a statement about the `/health` codepath, not about the application's ability to serve real requests.
- "Retries fix partial failures." Retries fix transient failures — the request was lost in flight, the connection blipped, the GC paused for 200 ms. Retries against a fail-slow node send the second request to the same slow node; without P2C or hedging across replicas, the retry chain accumulates latency rather than escaping the failure. Retries also amplify load during incidents, often making the partial failure worse.
- "A node is either alive or dead." A node is a collection of subsystems — process, thread pools, queues, file descriptors, kernel buffers, JIT'd code paths — and any subset can fail while the others keep serving. The binary alive/dead frame is a simplification suited to the failure-detection layer but fundamentally incomplete for understanding what is going wrong on the actual hardware.
- "Adding more replicas reduces partial-failure impact." Sometimes. More replicas reduce the fraction of requests routed to a partially-failed one (1/N of traffic instead of 1/3), which reduces customer-visible impact. But more replicas also increase the probability that some replica is partially failed at any given time (each replica has its own failure rate, and the union grows). Whether more replicas help on net depends on whether your routing layer is partial-failure aware — with P2C and hedging, more replicas help; with naive round-robin, more replicas make the tail worse because more failure events are integrated into the population.
- "Partial failures are rare." They are the most common production failure mode. Clean crashes are easier to handle, easier to detect, and easier to recover from precisely because they are rarer than the gradual, resource-driven, software-bug-driven, hardware-degradation-driven partial failures that dominate the empirical incident distribution.
Going deeper
The Gunawi et al. fail-slow study
The USENIX FAST 2018 paper "Fail-Slow at Scale" (Gunawi et al.) collected 101 fail-slow incident reports from 12 institutions operating large-scale systems, spanning operators and deployments such as Google, Microsoft, Cassandra, ScyllaDB, and Hadoop. The headline findings: fail-slow is a distinct failure class from fail-stop, the median fail-slow event lasts hours (not seconds), and fail-slow events frequently cascade — one slow node causes upstream queues to back up, which cause upstream-of-upstream queues to back up, which produces a cluster-wide latency degradation that bears no obvious relationship to the original disk that started the cascade. The paper's recommendation, lifted directly into modern SRE practice, is that every distributed system needs explicit fail-slow detection separate from fail-stop detection — the same heartbeat-based mechanisms that catch fail-stop will systematically miss fail-slow because the heartbeats are still arriving, just delayed by amounts that look like noise on aggregate.
The fail-slow taxonomy from that paper — disk slowness from JBOD controller faults, network slowness from buggy NICs, CPU slowness from thermal throttling, memory slowness from swap or fragmentation — maps directly onto the modes a production engineer encounters. Reading the paper is worth two hours; the case studies are exactly the production scars the rest of this curriculum will reference.
GC pauses and the lease-fencing trap
A specific subclass of partial failure deserves its own treatment. When a JVM service experiences a long old-gen GC pause (typically multi-second), it is in a partial-failure state in a peculiar sense: the process is paused, not dead. From outside, every connection times out. From inside, the process is making no progress, but it has not crashed.
If the cluster has a leader-election protocol with a lease, the lease may expire during the GC pause. A new leader is elected. The GC pause finishes. The old leader resumes and — critically — has no idea it was paused. From the old leader's perspective, no time passed. It continues issuing writes against the storage backend, with stale lease in hand. Without a fencing token, the storage may accept those writes, producing split-brain corruption.
This is the textbook motivation for fencing tokens (covered in detail in Part 9): every lease comes with a monotonically-increasing token, the storage layer compare-and-swaps on the token, and a stale leader's writes are rejected even when the stale leader is unaware it is stale. The 2017 Discord post on their voice infrastructure migration documented exactly this scenario — a GC pause on a Go service caused a stale leader to issue writes that were silently accepted by the storage layer, producing duplicate state. The fix was to add fencing-token verification on every storage write.
The deeper observation is that a process that experiences a long pause is indistinguishable from a process that crashed and was replaced, and the cluster's protocol must handle both possibilities identically — assume the paused-then-resumed process is now stale, reject its writes, force it to re-join via the normal "I am a follower again" pathway. Lease + fencing token is the mechanism that makes "long pause" safe.
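A minimal sketch of the storage-side check; the token values and key names are illustrative, and a real store would perform the comparison atomically (compare-and-swap) rather than in plain Python.

```python
# fencing.py — storage that rejects writes carrying a stale lease's token
class FencedStore:
    def __init__(self):
        self.highest_token_seen = 0
        self.data = {}

    def write(self, key, value, fencing_token):
        # accept only tokens at least as new as any token already seen
        if fencing_token < self.highest_token_seen:
            raise PermissionError(f"stale token {fencing_token} < {self.highest_token_seen}")
        self.highest_token_seen = fencing_token
        self.data[key] = value

store = FencedStore()
store.write("balance:riya", 100, fencing_token=33)    # old leader writes under lease token 33
# ... old leader pauses for a 9-second GC; its lease expires; a new leader gets token 34 ...
store.write("balance:riya", 90, fencing_token=34)     # new leader's write is accepted
try:
    store.write("balance:riya", 250, fencing_token=33)  # old leader resumes, unaware it is stale
except PermissionError as e:
    print("rejected:", e)
```

The stale leader never has to detect its own staleness; the storage layer's token check does it on its behalf, which is exactly why the pause becomes safe.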
Fail-slow vs fail-fast — a deliberate engineering choice
There is a school of thought in distributed-systems design that says: prefer fail-fast over fail-slow. If a node is going to fail, make it fail crisply — crash early, raise loud errors, exit on the first sign of corruption — rather than degrading gracefully into a partially-broken state. The Erlang/OTP "let it crash" philosophy is the canonical example. Joe Armstrong's argument was that in a distributed system, the failure detector + supervisor can recover from a clean crash in milliseconds, but cannot recover from a partial failure for hours. Therefore, given the choice, choose to crash.
This is structurally why Erlang processes are extremely lightweight (so a crash is cheap), why supervision trees define restart policies declaratively, and why production Erlang systems run for years with crashes-per-process being routine. Discord's BEAM-then-Rust voice infrastructure (Part 20 case study) is a real-world adoption of this philosophy at scale. The transition to Rust did not abandon fail-fast; it kept the semantics by panicking on impossible conditions and letting the supervisor restart the process.
The counter-argument, mostly from systems with stateful in-memory caches that take minutes to warm: a crash is expensive when warmup is slow. For those systems, graceful degradation (drop expensive features, keep cheap ones working) is preferred to a clean crash. The choice is workload-specific. What is universal is that whichever you choose, you must choose deliberately and deploy the matching observability — fail-fast systems need fast supervisor restart and clean panic semantics; graceful-degradation systems need rich tail-latency monitoring and per-feature health metrics.
The detection-lag economic argument
The single most leveraged investment a distributed-systems team can make in reliability is reducing detection lag: the time between failure onset and the cluster realising something is wrong. For clean crashes, detection lag is one heartbeat-timeout interval (typically 1–3 seconds). For partial failures, detection lag is minutes to hours — typically until either a synthetic probe catches the issue, a percentile alert fires after the metric crosses a threshold, or a customer-support ticket surfaces the user-visible impact.
The cost of a failure is roughly proportional to the detection lag — every minute of lag is a minute of customer-visible impact, eroded trust, and accumulated incident hours. So the dollar value of reducing detection lag from 30 minutes to 3 minutes on a service serving ₹100 crore/month of revenue is approximately the difference in cumulative incident-impact integrated over the year. Even a small reduction (say 10% on a small fraction of incidents) typically pays for the entire observability budget many times over.
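A back-of-envelope rendering of that proportionality. Every number below is an assumed illustration, and the calculation counts only revenue exposed during the detection lag — not eroded trust or incident-hours.

```python
# detection_lag_value.py — illustrative arithmetic only; all inputs are assumptions
incidents_per_year = 24                        # assumed: two partial-failure incidents a month
revenue_per_minute = 100e7 / (30 * 24 * 60)    # ₹100 crore/month spread evenly over the month
impacted_fraction = 0.03                       # assumed: ~3% of requests degraded per incident

for detection_lag_min in (30, 3):
    exposure = incidents_per_year * detection_lag_min * revenue_per_minute * impacted_fraction
    print(f"lag {detection_lag_min:>2} min -> ~₹{exposure:,.0f}/year of revenue exposed")
```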
This is why the conventional wisdom in mature SRE organisations is "spend the observability budget on tail-latency tracking, synthetic probes, and per-percentile alerting before spending it on prettier dashboards" — the marginal incident-detection-lag reduction from a tail probe is far higher than from a dashboard improvement, and the dollar-impact of detection lag dominates the operational cost equation.
Reproduce this on your laptop
```bash
# Reproduce the fail-slow simulator
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
python3 fail_slow.py
# Vary slow_rate (try 1/16, 1/4) and slow_ms (try 100, 1000) and watch
# how p50 stays nearly constant while p99 / p99.9 explode.

# To inject real fail-slow on Linux for live cluster testing, use tc qdisc:
sudo tc qdisc add dev eth0 root netem delay 500ms 30ms 25%
# This delays every packet by 500 ms ± 30 ms of jitter (the trailing 25% is the
# correlation between successive delays, not a fraction of packets) — observe the
# cluster's percentile metrics and watch which detectors fire (and which don't).
sudo tc qdisc del dev eth0 root
```
Where this leads next
The partial-failure modes catalogued in this chapter — fail-slow, fail-stutter, half-open connections, resource starvation, silent corruption, single-codepath bugs, and GC pauses masquerading as death — define the failure surface every later chapter's protocol must survive. The next chapters develop the operational consequences:
- Fail-stop, fail-slow, fail-silent — the failure-mode taxonomy formalised, with detector requirements per mode.
- How real networks actually fail (studies) — empirical Microsoft / Google / Facebook studies, with real partial-failure frequencies.
- The asynchrony assumption FLP — why partial failure is the impossibility result's actual content, not an exotic edge case.
- Phi-accrual failure detector — Part 10 — the principled answer to "is this node dead", with continuous suspicion replacing binary verdicts.
- Lease + fencing token — Part 9 — how the storage layer rejects stale leaders' writes when the protocol layer cannot detect them as stale.
- The tail at scale — Part 7 — hedged requests and tail-tolerant techniques for fail-slow.
- Bulkheads and circuit breakers — Part 7 — resource-isolation patterns to bound partial-failure blast radius.
The lesson to carry forward is that every distributed protocol's safety analysis assumes a binary failure model — a node is correct or it is faulty — and partial failure is the gap between that model and the production reality. Every design choice in later chapters can be evaluated by asking: what does this protocol do when a node is partly working? What does it do during a 9-second GC pause? What does it do when 1-in-200 fsyncs take 4 seconds? As you read the rest of this curriculum, notice that every protocol's "behaviour under failure" section is implicitly answering that question. The skill of distributed-systems engineering is learning to ask — and answer — it fluently.
A small drill: take your most recent production incident and answer four questions. Was the failure binary or partial? Which observation channels (TCP, heartbeat, /health, p50, p99, end-to-end correctness) reported red, and which stayed green? What was your detection lag from failure onset to first page? Which of P2C, hedging, per-route p99 alerts, synthetic probes, bulkheads, circuit breakers would have shortened it? Doing this for ten incidents in a row teaches the partial-failure modes better than any survey paper. The modes then stop being theoretical categories and start being the diagnostic shortcuts you reach for during the next 19:42-on-a-Friday page.
The deeper observation, looking ahead: partial failure is not an edge case the textbook protocols handle imperfectly; it is the central case the textbook protocols barely handle at all. Most of the engineering complexity in production distributed systems — fencing tokens, phi-accrual, P2C, hedged requests, bulkheads, circuit breakers, synthetic probes, end-to-end checksums — is mitigation for partial failure. The textbook gives you the consensus protocol; the production deployment gives you the partial-failure mitigations stacked on top of it. The skill of building reliable distributed systems is recognising that the protocol is the easy half and the mitigations are the hard half, and budgeting your engineering effort accordingly.
References
- Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems — Gunawi et al., FAST 2018. The empirical fail-slow study across 12 large-scale systems; the calibration baseline for fail-slow as a distinct failure class.
- Gray Failure: The Achilles' Heel of Cloud-Scale Systems — Huang et al., HotOS 2017. The paper that named gray failure and articulated its observability problem.
- The Tail at Scale — Dean & Barroso, CACM 2013. Hedged requests, tied requests, and the broader thesis that tail latency dominates user-visible performance under partial failure.
- Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis — Guo et al., SIGCOMM 2015. Microsoft's per-server probe mesh — a large-scale synthetic-probe deployment exactly aimed at partial failures.
- The network: partitions, asymmetric reachability, gray failures — internal cross-link to the previous chapter on the network-layer specialisation of partial failure.
- Designing Data-Intensive Applications — Kleppmann, O'Reilly 2017. Chapter 8's section on unreliable networks and partial failure is the practitioner's introduction to the modes catalogued here.
- Erlang and OTP in Action — Logan, Merritt, Carlsson 2010. The "let it crash" philosophy translated to production patterns; the canonical fail-fast counter-position to graceful degradation.
- Power of Two Random Choices — Mitzenmacher 2001. The mathematical foundation for tail-aware load balancing as a partial-failure mitigation.