Fail-stop, fail-slow, fail-silent

It is 02:14 on a Sunday and KapitalKite's overnight reconciliation job has flagged a ₹4.2 lakh discrepancy between the orders ledger and the trades ledger. Aditi pulls up the cluster: every node is up, every heartbeat is landing, every health check is green. The leader's commit index is 18,442,019 and the followers all match within 12 ms of replication lag. Yet at some point during the previous trading day, 14 trade confirmations were written with quantity=100 on the leader and two of the followers — and quantity=10 on the third. There is no error in any log. The problem is not that a node is down; the problem is that a node is lying, with full conviction, in a way that no aggregate metric will ever catch. This chapter is about the three failure modes that demand three different detectors — and about why the casual word "down" is the most expensive shortcut in distributed-systems vocabulary.

Fail-stop is a node that crashes cleanly and stops responding. Fail-slow is a node that responds correctly but late. Fail-silent is a node that responds on time but with corrupted answers. Each mode escapes a different detector — and a system that conflates them deploys one detector and assumes coverage. The production cost is paid in incidents that no dashboard predicted.

The three modes — what each one looks like on the wire

A fail-stop node is the easy case. The process exits, the kernel reaps it, the TCP stack sends RST on the open sockets, the load-balancer health check fails, the heartbeat thread stops emitting, and within one detection interval the cluster's failure detector marks the node as failed. Every observable is consistent: TCP says down, heartbeat says missing, /health returns nothing because nothing is listening, and the process table on the host shows no pid for the service. The recovery sequence — drain, re-elect, restart, re-join — runs end-to-end without ambiguity. This is the failure mode every textbook protocol assumes; it is also the rarest one in production.

A fail-slow node is the hard case for the detector. The process is alive. Sockets accept connections. Heartbeats arrive on schedule. /health returns 200 OK. The application is doing real work — but for some fraction of requests, that work takes 100×–1000× longer than the median. A SATA SSD with a failing controller might fsync 1-in-200 writes in 8 seconds instead of 250 µs; a JVM service might pause for 9 seconds during an old-gen GC; a thread pool might exhaust on one request class while serving another fine. From the outside, the aggregate metrics — mean latency, error rate, heartbeat liveness — all look normal. The only signal is in the tail: the p99.9 of the affected request class spikes, and the customer-visible impact is concentrated in the slow tail.

A fail-silent node is the hard case for recovery. The process is alive. Sockets accept connections. Heartbeats arrive on schedule. Latency is normal. The application is doing real work and returning answers within the expected time — but the answers are wrong. A bad RAM module flips a bit after every checksum in the path has already been verified; a buggy code path miscomputes a checksum; a misconfigured serialiser writes the wrong byte order to disk; clock skew causes a stale read to be served as fresh. The detector cannot help — by every observable an external monitor can collect, the node is healthy. The only signal is the content of the response, which the detector does not parse, does not understand, and would have no ground truth to compare against even if it did.

The headline distinction: fail-stop fails the liveness check; fail-slow fails the timing check; fail-silent fails the correctness check. Three different checks, three different detectors. A cluster that runs only the liveness detector — one heartbeat thread, one /health endpoint, one binary alive/dead verdict — is detecting only the easiest of the three failure modes, and is functionally blind to the other two. The harder modes are also the more common modes, which is the heart of the operational pain.

Figure: fail-stop vs fail-slow vs fail-silent across six observation channels (TCP, heartbeat, /health, p50 latency, p99 latency, response correctness), one column per failure mode. Each mode flips a different subset of channels, so each needs its own detector; for example, fail-slow shows a normal p50 of 12 ms but a p99 of 4200 ms, while fail-silent shows normal latency on every channel and only the correctness channel firing.
Illustrative — fail-stop trips every channel; fail-slow only the tail-latency channel; fail-silent only the correctness channel. The detector you do not have is the failure mode you cannot see.

Why "fail-silent" is the worst of the three to recover from: in fail-stop the cluster knows exactly what to do (drain, re-elect, restart). In fail-slow the cluster has options (P2C around the slow node, hedged requests, drain after threshold). In fail-silent the cluster does not know there is anything to recover from — the bad data is replicated, persisted, served, and read back with full ack chains. By the time end-to-end reconciliation surfaces the discrepancy, the corrupt state has spread; recovery is not a node operation, it is a data-recovery operation, and that is orders of magnitude more expensive.

Walking through the three modes in one simulation

The cleanest way to feel the difference is to simulate all three side-by-side and watch which detectors fire. The script below sets up four replicas: one stays healthy, one fails-stop at request 200, one fail-slows on 1-in-8 calls, and one fail-silents by corrupting 2% of payment-amount fields. The harness routes 1000 requests round-robin and reports which signal each detector raises.

# three_modes.py — how fail-stop, fail-slow, fail-silent show up to detectors
import random, time

class Replica:
    def __init__(self, name, mode="ok"):
        self.name = name
        self.mode = mode
        self.alive = True
        self.calls = 0

    def handle(self, req_id, amount):
        self.calls += 1
        if self.mode == "fail_stop" and self.calls >= 200:
            self.alive = False                              # crashed; raises
            raise ConnectionError(f"{self.name}: dead")
        if self.mode == "fail_slow" and random.random() < 1/8:
            time.sleep(0.0005)                              # 500 us in test; pretend 500ms
            return ("ok", amount, 500.0)                    # latency_ms reported
        if self.mode == "fail_silent" and random.random() < 0.02:
            return ("ok", amount * 10, 12.0)                # CORRUPTED amount, reported ok
        return ("ok", amount, 12.0)

def run():
    replicas = [Replica("r0"), Replica("r1", "fail_stop"),
                Replica("r2", "fail_slow"), Replica("r3", "fail_silent")]
    heartbeat_misses = {r.name: 0 for r in replicas}
    p99_buckets       = {r.name: [] for r in replicas}
    correctness_fail  = {r.name: 0 for r in replicas}
    for i in range(1000):
        r = replicas[i % 4]
        amount = 100  # the canonical correct answer
        try:
            status, returned_amount, lat = r.handle(i, amount)
            p99_buckets[r.name].append(lat)
            if returned_amount != amount:
                correctness_fail[r.name] += 1
        except ConnectionError:
            heartbeat_misses[r.name] += 1                   # detector fires here
    for r in replicas:
        lats = sorted(p99_buckets[r.name]) or [0]
        p99 = lats[int(0.99 * len(lats)) - 1] if lats else 0
        print(f"{r.name:3s} mode={r.mode:11s} hb_miss={heartbeat_misses[r.name]:3d} "
              f"p99={p99:6.1f}ms corrupt={correctness_fail[r.name]:3d}")

random.seed(42); run()

Sample run:

r0  mode=ok          hb_miss=  0 p99=  12.0ms corrupt=  0
r1  mode=fail_stop   hb_miss= 51 p99=  12.0ms corrupt=  0
r2  mode=fail_slow   hb_miss=  0 p99= 500.0ms corrupt=  0
r3  mode=fail_silent hb_miss=  0 p99=  12.0ms corrupt=  6

Read the four rows. The healthy r0 is silent on every detector — as it should be. The fail-stop r1 is screaming on the heartbeat channel: 51 connection errors, the cluster's failure detector trips on the first one. The fail-slow r2 is silent on heartbeat (zero misses) but the p99 is 41× normal; only a percentile-aware detector catches it. The fail-silent r3 is silent on every channel except correctness — heartbeat is fine, p99 is fine, but six payment amounts came back wrong, ten times the input. The simulation makes the headline concrete: a system that runs only the heartbeat detector misses two of three failure modes, and the missed modes are the most damaging ones.

Why the fail-silent rate of 2% looks "low" but is catastrophic: at 2% of payment requests being corrupted, on a service handling 1000 transactions per second, that is 20 corrupt records per second, 1.7 million corrupt records per day. Even if a reconciliation job catches 99% of them within 24 hours, the 17,000 surviving corrupt records compound — they are read by downstream services, written to derived tables, sent to customer statements, used to compute risk scores. By the time the original bad RAM is replaced, the corrupt-data blast radius has spread across dozens of systems, and recovery is a data-engineering project measured in weeks.

Why fail-slow's p99 of 500 ms is invisible to a "mean latency" detector: with 250 calls to r2, 31 of which are slow at 500 ms and 219 of which are fast at 12 ms, the mean is (31×500 + 219×12)/250 = 72.5 ms. That is a 6× regression from baseline 12 ms — visible on a fine-grained dashboard, easily missed on a noisy one with a 200 ms threshold. The p99, by contrast, lands deep in the slow tail at exactly 500 ms — a 41× regression, impossible to miss on any sensible alert. The shift from mean-based alerting to percentile-based alerting is the single highest-leverage SRE practice change of the last decade, and fail-slow is the failure mode that forced it.
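
A quick check of that arithmetic, using the same 31-slow / 219-fast split the simulation produced:

# mean_vs_p99.py: why the mean hides what the percentile exposes
latencies = [500.0] * 31 + [12.0] * 219                    # r2's 250 calls: 31 slow, 219 fast
mean = sum(latencies) / len(latencies)                     # 72.5 ms: a 6x regression, easy to miss
p99 = sorted(latencies)[int(0.99 * len(latencies)) - 1]    # 500.0 ms: a 41x regression
print(f"mean={mean:.1f}ms p99={p99:.1f}ms")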

Why detection of fail-silent demands an "expected answer" — not just an HTTP status: the canonical mistake in microservice health-checking is to grade a response by its status code (200 OK ⇒ healthy). Fail-silent specifically returns 200 OK with wrong content; status-grading is fundamentally inadequate. The fix is end-to-end synthetic probes that assert on the response body: send a known input, expect a known output, alert on mismatch. This requires test data that does not pollute production state — synthetic tenants, sentinel transactions, dry-run flags — and is more operationally expensive than ping-style probes. That cost is why most production deployments do not have it, which is why most fail-silent incidents take days to surface.
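
A toy contrast between the two grading rules; the response shape and the expected balance here are invented for illustration:

# status_vs_content.py: a 200 OK says nothing about the payload
def healthy_by_status(resp):
    return resp["status"] == 200                      # what most /health checks grade on

def healthy_by_content(resp, expected_balance):
    return resp["status"] == 200 and resp["balance"] == expected_balance

fail_silent_resp = {"status": 200, "balance": 1000}   # ground truth for this probe input is 100
print(healthy_by_status(fail_silent_resp))            # True: the replica looks healthy
print(healthy_by_content(fail_silent_resp, 100))      # False: only the content check fires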

Production stories — one war story per mode

Each of these stories is a different mode failing under realistic load on a fictional but production-faithful cluster.

Fail-stop — RailWala's seat-locking service, Tatkal hour, June 2024. At 10:00:14, a network-partition event isolated lock-3 from the rest of the cluster. The container's TCP stack went silent; heartbeats stopped within 200 ms. The cluster's phi-accrual detector raised suspicion past threshold at 800 ms; the leader marked lock-3 as failed at 10:00:15.1 and the load balancer drained it. Total impact: 1 second of degraded capacity (4-of-5 instead of 5-of-5 replicas), zero customer-visible errors because the retry layer transparently resent in-flight locks to the remaining four replicas. This is fail-stop working perfectly — the detector caught it, the protocol routed around it, the customer did not notice. The textbook protocols handle this case correctly because the textbook protocols are designed for this case.

Fail-slow — MealRush's restaurant-search service, lunch hour, August 2024. At 12:18, one of 12 search-index replicas began returning 1-in-12 queries with a 6-second latency instead of the median 18 ms. Investigation later showed the underlying cause was a degraded SATA SSD whose SMART attribute Reallocated_Sector_Ct had been creeping up for two weeks; the controller was issuing internal retries for sectors that had silently developed soft errors. From the outside, the replica looked healthy: heartbeats fine, /health returning 200, query-success rate 99.99%. The signal that surfaced the failure was the customer-support queue: a thread of "search is so slow today" tickets correlated to one shard. Detection lag: 47 minutes. The fix was a per-shard p99 alert (not just per-cluster) that would have fired in 90 seconds, plus replacement of the drive. Cost of the lag: 47 minutes × 12 lakh queries/min × ~6% slow rate × ~₹3 customer-recovery cost = approximately ₹1 crore in promo-credit handouts to placate customers, plus three engineering weeks of follow-up incident review and dashboard rebuild.

Fail-silent — KapitalKite's order-matching daemon, October 2024. Sometime between 14:30 and 14:42, a cosmic-ray bit-flip on an ECC-disabled DIMM corrupted a single byte in the quantity field of an in-flight order before the application's CRC was computed. The corrupted order — quantity=100 instead of quantity=10 — was written to the leader's log and replicated cleanly to two of three followers; the third follower had already received the entry before the flip and retained the correct value. Every consensus check (commit_index, term, each replica's own log checksum up to that entry) was satisfied — the corruption happened before the entry was checksummed and logged, so every log was internally consistent. The discrepancy surfaced 11 hours later when the overnight reconciliation job compared the orders ledger against the trades ledger and found ₹4.2 lakh of "phantom" trades. Recovery took 6 days: identify the affected orders, reverse the trades, restore the customer balances, refund the brokerage fees, send written explanations to 47 retail customers, file a regulatory disclosure. The fix, deployed three weeks later, was end-to-end CRCs computed at the API ingress — before the in-process memory operations a bit-flip can corrupt — and verified at the storage layer, plus ECC on every DIMM in the cluster.

Figure: detection lag by failure mode, plotted on a log scale in seconds. Fail-stop: about 1 second via the heartbeat detector. Fail-slow: 90 seconds via a per-shard p99 alert, or about 45 minutes via a mean-only dashboard. Fail-silent: 11 hours via the reconciliation job, or never without an end-to-end probe. Numbers from the three war stories above.
Illustrative — detection lag stretches over four orders of magnitude depending on mode and detector. The y-axis is the failure mode; the bar length is "how long until you know". Spend your observability budget proportionally.

The pattern across the three stories is that detection lag is a function of mode × detector quality, not of failure severity. Fail-stop with a good detector is one second. Fail-slow with a good detector is 90 seconds; with only a mean-latency dashboard, it is 47 minutes. Fail-silent with end-to-end reconciliation is 11 hours; without it, the failure can persist indefinitely until a customer complains or a quarterly audit catches it. The cost of an incident scales roughly linearly with detection lag, so the economic value of investing in the right detector for each mode is enormous — the budget should not be allocated equally across detectors but proportional to (mode frequency × cost per minute of lag).

What each detector actually does

A detector is a function from observable signals to a verdict. The three detectors that match the three modes are structurally different programs.

Heartbeat / liveness detector — for fail-stop. A thread on each node sends a small probe to every peer every few hundred milliseconds. If responses stop arriving for T consecutive intervals, the peer is suspected; if suspicion crosses a threshold, the peer is marked failed. The cost is O(N²) probes per interval (every node probes every other), but the per-probe cost is tiny. The verdict is binary: alive or failed. Phi-accrual (Hayashibara et al. 2004) is the principled refinement — instead of binary, it outputs a continuous suspicion value calibrated to historical inter-arrival times — but the underlying signal is the same heartbeat stream. This detector's blind spot: any failure mode where the heartbeat path is uncorrelated with the workload path. Fail-slow doesn't slow heartbeats; fail-silent doesn't corrupt heartbeats; both escape entirely.
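
A minimal sketch of the phi-accrual idea, assuming heartbeat inter-arrival times are roughly normal; the paper's detector uses a windowed estimate of the inter-arrival distribution, but the shape of the computation is the same:

# phi_accrual.py: a continuous suspicion value instead of a binary alive/dead verdict
import math, statistics

class PhiAccrual:
    def __init__(self, window=100):
        self.intervals = []              # recent heartbeat inter-arrival times (seconds)
        self.last_at = None
        self.window = window

    def heartbeat(self, now):
        if self.last_at is not None:
            self.intervals = (self.intervals + [now - self.last_at])[-self.window:]
        self.last_at = now

    def phi(self, now):
        # Suspicion: roughly -log10 of the probability the heartbeat is merely late.
        if len(self.intervals) < 2:
            return 0.0
        mean = statistics.mean(self.intervals)
        std = max(statistics.stdev(self.intervals), 1e-3)
        elapsed = now - self.last_at
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-12))

detector = PhiAccrual()
for t in [0.0, 0.5, 1.0, 1.5, 2.0]:       # heartbeats every 500 ms
    detector.heartbeat(t)
print(round(detector.phi(2.4), 1))         # slightly late: suspicion stays low
print(round(detector.phi(5.0), 1))         # 3 s of silence: suspicion far past any sane threshold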

Percentile-tracking / tail-latency detector — for fail-slow. Every request through the system is sampled into a per-replica, per-route latency histogram. Periodic alerts fire when any (replica, route) combination's p99 or p99.9 exceeds a calibrated threshold. The cost is O(routes × replicas × percentiles) of metric storage and CPU per minute, plus an alerting backend that can evaluate millions of time-series. Implementations: Prometheus's histogram type, OpenTelemetry's exponential histograms, HDR Histograms streamed to a backend. The detector's signal is the shape of the latency distribution, not the mean. This detector's blind spot: latency-correct corruption — fail-silent passes through the percentile detector unscathed because the response time is normal.
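
A sketch of the evaluation such a detector performs, assuming latency samples arrive already tagged with (replica, route); in a real deployment this lives in the metrics backend rather than in application code, and the thresholds below are invented:

# p99_alert.py: per-(replica, route) tail-latency alerting, in-memory sketch
from collections import defaultdict

THRESHOLD_MS = {"search": 100.0, "checkout": 250.0}     # calibrated per route (illustrative)
samples = defaultdict(list)                              # (replica, route) -> recent latencies, ms

def record(replica, route, latency_ms):
    samples[(replica, route)].append(latency_ms)

def p99(values):
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

def evaluate():
    alerts = []
    for (replica, route), values in samples.items():
        if len(values) >= 100 and p99(values) > THRESHOLD_MS[route]:
            alerts.append((replica, route, p99(values)))
    return alerts

# one degraded replica out of three, on one route: only its p99 trips the threshold
for i in range(1000):
    record(f"r{i % 3}", "search", 6000.0 if (i % 3 == 2 and i % 8 == 0) else 18.0)
print(evaluate())          # [('r2', 'search', 6000.0)] -- the cluster-wide mean stays quiet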

End-to-end synthetic probe / correctness detector — for fail-silent. A small fleet of probe clients periodically issues real requests against the system with known inputs and known expected outputs, then asserts on the response content. A canonical PaySetu probe: every 5 seconds, transfer ₹1 between two test wallets and verify both balances updated correctly within 200 ms. A KapitalKite probe: every 10 seconds, place a synthetic order at a price guaranteed not to match, verify the order appears on the book with the exact quantity submitted. The cost is high — synthetic transactions consume real cluster capacity, require careful isolation from production data, and the assertions are workload-specific. The signal is correctness of behaviour, the only signal that catches fail-silent. This detector's blind spot: failure modes the probe doesn't simulate. A probe that exercises payment transfers does not catch a failure in the order-book engine.
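
A sketch of the wallet-transfer probe described above; the client object, wallet names, and alert hook are hypothetical stand-ins for whatever your deployment provides, and the probe wallets must be isolated from production balances:

# e2e_probe.py: correctness probe -- known input, known expected output, assert on content
import time

def probe_once(client, alert):
    # Transfer Rs 1 between two dedicated test wallets and verify both balances moved.
    before_a = client.balance("probe-wallet-a")
    before_b = client.balance("probe-wallet-b")
    client.transfer("probe-wallet-a", "probe-wallet-b", amount=1)
    deadline = time.time() + 0.2                         # 200 ms correctness budget
    while time.time() < deadline:
        if (client.balance("probe-wallet-a") == before_a - 1 and
                client.balance("probe-wallet-b") == before_b + 1):
            return True
        time.sleep(0.01)
    alert("fail-silent suspected: transfer acked but balances never matched expectation")
    return False

def probe_loop(client, alert, interval_s=5):
    while True:                                          # one long-lived probe worker
        probe_once(client, alert)
        time.sleep(interval_s)

class FakeWalletClient:
    # In-memory stand-in so the sketch runs without a real cluster.
    def __init__(self):
        self.balances = {"probe-wallet-a": 500, "probe-wallet-b": 500}
    def balance(self, wallet):
        return self.balances[wallet]
    def transfer(self, src, dst, amount):
        self.balances[src] -= amount
        self.balances[dst] += amount

print(probe_once(FakeWalletClient(), alert=print))       # True: balances match expectation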

The deeper pattern is that detector cost grows with the mode's subtlety. Heartbeat detection is nearly free — a few bytes per second per peer. Percentile tracking costs real metric infrastructure. End-to-end probes cost workload-specific engineering, dedicated test data, and ongoing maintenance as the API surface grows. The cost trajectory is exactly the inverse of the mode's frequency: fail-stop is rare and cheap to detect; fail-silent is common and expensive to detect. The economic gravity of distributed-systems observability is therefore not "monitor everything" but "decide which modes you care about, then budget the detector for each one separately".

Why a single detector for all three modes does not work in practice: each detector's input signal is on a different channel — TCP-state-of-connection for fail-stop, latency-distribution-over-many-requests for fail-slow, response-content-against-ground-truth for fail-silent. A unified detector would need to consume all three channels simultaneously and compose verdicts. In practice, the engineering effort to do this well is more than the effort to deploy three specialised detectors with separate alerting paths, which is why every production observability stack ends up looking like a heartbeat plane plus a metrics plane plus a synthetic-probe plane stacked on top of each other.

Going deeper

The Schneider classification — fail-stop in the formal model

Fred Schneider's 1984 paper "Byzantine Generals in Action" formalised the fail-stop processor model: a process that either executes correctly or stops, and whose stopped state is detectable by other processes. This is not the same as the practical fail-stop mode. Schneider's fail-stop is a theoretical abstraction — it assumes detection is reliable, which in real networks it is not. The asynchrony of real networks means a stopped node is indistinguishable from a slow node, which is the entire content of the FLP impossibility result (Part 8). The practical "fail-stop" used in this chapter is a behavioural mode, not Schneider's strict model — a node that crashes cleanly enough that detection works in practice, conditional on appropriate timeouts and a phi-accrual-style suspicion mechanism.

The gap between Schneider's idealised fail-stop and the production reality is exactly the gap between the textbook protocols (which assume reliable detection) and the production deployments (which require fencing tokens, lease epochs, and PreVote checks to handle the cases where detection is wrong). Reading Schneider's paper is worth two hours; the modelling moves are exactly the moves the rest of this curriculum will build on.

Fail-stutter — the mode that doesn't fit any of the three names

A fourth mode worth naming, even though the three-name taxonomy doesn't include it: fail-stutter. Arpaci-Dusseau and Arpaci-Dusseau coined this in 2001 to describe nodes whose performance varies wildly between requests but each request is correct and not extreme — a node that is sometimes 12 ms and sometimes 200 ms, where neither value is "fail-slow" but the variability itself is pathological. The signal is the coefficient of variation of latency, not the percentile; a node with high CoV is fail-stuttering even if its p99 is acceptable.

Fail-stutter shows up most often as a result of GC pauses, kernel scheduler decisions, NUMA imbalance, and shared-resource contention. Detection requires looking at latency-stability metrics (e.g. CoV per replica per minute) rather than percentile thresholds, as sketched below. It is also part of the pressure behind lighter-weight concurrency runtimes (virtual threads in Project Loom, goroutines in Go), user-space schedulers, and CPU pinning in latency-sensitive services. The mode does not have a single sharp signature like the other three, which is why the three-name taxonomy survives despite missing it — but a serious production engineer treats it as the fourth mode and budgets observability for it.
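
A sketch of that latency-stability check, assuming per-replica latency samples are gathered once a minute; the 0.5 threshold on the coefficient of variation is illustrative, not a standard:

# fail_stutter.py: coefficient of variation as the fail-stutter signal
import statistics

def cov(latencies_ms):
    mean = statistics.mean(latencies_ms)
    return statistics.stdev(latencies_ms) / mean if mean else 0.0

steady  = [12.0, 13.0, 11.5, 12.5, 12.0, 13.5] * 10
stutter = [12.0, 200.0, 14.0, 180.0, 11.0, 190.0] * 10   # no value extreme enough to look fail-slow

for name, window in [("steady", steady), ("stutter", stutter)]:
    flag = "FAIL-STUTTER" if cov(window) > 0.5 else "ok"
    print(f"{name:8s} cov={cov(window):.2f} {flag}")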

Why ECC matters more than you think

The KapitalKite war story above is not invented. Cosmic-ray bit-flips on non-ECC RAM are a well-documented production failure mode at scale. Google's DRAM error study (Schroeder, Pinheiro, Weber, SIGMETRICS 2009), drawn from years of production fleet data, found that roughly a third of servers experienced at least one correctable memory error per year; on non-ECC hardware those errors become silent corruption, and at those rates a fleet of 1000+ machines without ECC would see silent bit-flips daily. ECC RAM detects single-bit errors and corrects them transparently; non-ECC silently propagates them. The cost of ECC is modest — typically a 10–15% memory price premium — and the cost of one fail-silent incident on a payments cluster easily dominates years of ECC premium across the fleet.

The deeper observation is that fail-silent is a memory-system property as much as a software-protocol property. End-to-end checksums at the application layer catch many fail-silent modes, but they catch them after the corruption has already entered the system. ECC, kernel-level memory scrubbing, end-to-end CRC on storage paths, and TCP checksums on the wire all push the detection earlier, before the corrupt byte is acted on. The defence-in-depth strategy is to push checks as close to the corruption source as possible.
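
A sketch of the ingress-to-storage checksum idea using Python's zlib.crc32; the record layout and field names here are invented:

# e2e_crc.py: checksum computed at API ingress, verified again at the storage layer
import json, zlib

def seal(order):
    # At ingress: compute a CRC over the canonical encoding, before any further processing.
    payload = json.dumps(order, sort_keys=True).encode()
    return {"order": order, "crc32": zlib.crc32(payload)}

def verify(sealed):
    # At the storage layer (and again on read): recompute and compare.
    payload = json.dumps(sealed["order"], sort_keys=True).encode()
    return zlib.crc32(payload) == sealed["crc32"]

sealed = seal({"order_id": "ord-1", "quantity": 10, "price": 101.5})
print(verify(sealed))                       # True: intact

sealed["order"]["quantity"] = 100           # simulate an in-memory bit-flip after sealing
print(verify(sealed))                       # False: corruption caught before it is persisted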

How real systems pick which detectors to run

Production deployments do not run all detectors at full fidelity — they choose. The choice follows from the cost of a missed mode. A read-mostly content-delivery service can afford to skip end-to-end probes because fail-silent on a CDN edge node is a stale-cache problem with bounded blast radius. A payments cluster cannot afford to skip them because a fail-silent on a wallet replica writes wrong balances. A consensus log cannot afford to skip percentile detection because fail-slow on the leader cascades to all clients. The detector mix is workload-specific.

A useful heuristic: for every distributed system you operate, ask "what does it cost me if a node is fail-silent for an hour?". If the answer is "₹50 in support credits", run heartbeat-only and call it done. If the answer is "₹1 crore in customer balance restoration plus a regulatory filing", run all three detectors with sub-minute thresholds. The intermediate cases — a recommendation service, a fraud-detection model — usually warrant the heartbeat plus percentile detector, with end-to-end probes deferred until incident frequency justifies the engineering cost. The decision is always economic, never aesthetic.

Reproduce this on your laptop

# Reproduce the three-mode simulator
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip   # nothing else to install: the simulator uses only the stdlib
python3 three_modes.py
# Vary the fail rates: change 1/8 to 1/4 (more fail-slow), 0.02 to 0.10 (more fail-silent)
# and observe how the per-replica detectors react.

# Inject a real fail-slow on Linux for live cluster testing:
sudo tc qdisc add dev eth0 root netem delay 500ms 30ms 25%
# every egress packet on eth0 gets 500 +/- 30 ms of delay (the 25% is jitter correlation, not a packet fraction);
# this delays all traffic on the interface, including your own SSH session. Observe which detectors fire.
sudo tc qdisc del dev eth0 root

# Verify ECC is enabled on your hardware:
sudo dmidecode -t memory | grep -i 'error correction'
# 'Multi-bit ECC' or 'Single-bit ECC' means ECC is active; 'None' is a fail-silent risk

Where this leads next

The three-mode taxonomy is the framing every later chapter uses when discussing what a protocol survives.

The lesson to carry is that a protocol's "behaviour under failure" specification must enumerate which modes it survives, not merely "node failure". Raft survives fail-stop with elegance, fail-slow with caveats (heartbeat-cadence sensitivity), and fail-silent only when paired with end-to-end checksums at the application layer (Raft itself does not protect against bit-flips inside the state machine). When you read a paper claiming "fault tolerance", the next question is always: against which mode? Tolerating fail-stop and fail-slow but not fail-silent is a meaningfully different guarantee from tolerating all three.

A drill: take your most recent production incident and classify the failure mode. Was the node fully crashed, or alive-but-late, or alive-on-time-but-wrong? Which detector would have caught it fastest, given perfect tuning? Which detector did catch it, and how much lag did the wrong-detector choice cost? The answers compound across incidents into a sharp intuition for which detectors your service most needs to invest in next quarter. The taxonomy stops being a textbook category the moment you start using it as the first question in every postmortem.

The deeper observation, looking ahead: most distributed-systems textbooks, including the foundational papers, are written in the fail-stop model because it is the simplest model in which the protocols can be specified and their progress proved; even there, FLP shows that asynchrony alone makes deterministic consensus impossible without extra assumptions. Real production engineering is the discipline of stretching the fail-stop machinery to handle fail-slow and fail-silent, using mechanisms (fencing tokens, end-to-end checksums, percentile alerting) that are not in the original protocol but are mandatory for the protocol to survive in production. Every later chapter's "production hardening" section is, in effect, an answer to "how do we make this fail-stop algorithm robust against the modes the algorithm itself doesn't model".

References

  1. Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems — Gunawi et al., FAST 2018. The empirical taxonomy and frequency baseline for the fail-slow mode.
  2. Why Do Computers Stop and What Can Be Done About It? — Jim Gray, 1985. The classic paper that distinguished Bohrbugs (deterministic, reliably reproducible) from Heisenbugs (transient, timing-dependent, the kind behind most fail-slow and fail-silent incidents); foundational vocabulary.
  3. DRAM Errors in the Wild: A Large-Scale Field Study — Schroeder, Pinheiro, Weber, SIGMETRICS 2009. Google's bit-flip frequencies; the empirical case for ECC against fail-silent.
  4. The φ Accrual Failure Detector — Hayashibara et al., SRDS 2004. The principled liveness detector for fail-stop, calibrated to inter-arrival distributions.
  5. The Tail at Scale — Dean & Barroso, CACM 2013. The fail-slow mitigation playbook (hedged requests, tied requests).
  6. Partial failures and why they're the worst — internal cross-link to the operational deep-dive on these modes inside one service.
  7. Designing Data-Intensive Applications — Kleppmann, O'Reilly 2017. Chapter 8's failure-model section is the canonical practitioner introduction to the three modes.
  8. Byzantine Generals in Action: Implementing Fail-Stop Processors — Schneider, ACM TOCS 1984. The fail-stop processor model in its formal guise; the theoretical scaffolding under "fail-stop" as a term of art.