Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Wall: failure detection is its own problem

It is 22:14 on a Saturday during the IPL final. Aditi, on-call for CricStream's video-origin tier, watches her dashboard light up: 3 of 40 origin nodes are flagged "DOWN" and the load balancer has yanked them out of rotation. Twelve seconds later, all three are flagged "UP" again. Then origin-17 flaps DOWN. Then UP. Aditi opens an SSH session to origin-17 — it answers in 4 ms. CPU is at 38%. There is nothing wrong with origin-17. There is something wrong with the thing that decides whether origin-17 is alive. By 22:18 her cluster has lost 6% of its serving capacity to ghost failures while 25 million viewers watch the toss. This chapter is about that thing — the failure detector — and why every leader-election scheme, every load balancer, every gossip protocol, every consensus algorithm in this curriculum so far has been quietly assuming a problem that does not have a clean solution.

"Is node X alive?" is not a question with an answer. It is a question with a guess, and the guess depends on how long you are willing to wait, how much network jitter you can tolerate, and how much you trust the absence of a heartbeat to mean "dead" rather than "delayed". The FLP impossibility result proved in 1985 that no deterministic algorithm can perfectly tell a crashed node from a slow one in an asynchronous network. Real systems live with this by tuning a knob between false positives (kill a healthy node) and false negatives (miss a real death) — and by using accrual-style detectors (phi accrual) that output a suspicion level rather than a boolean. Get the knob wrong, and you are Aditi at 22:14.

The question that has no answer

Every chapter so far has casually used phrases like "the leader fails over", "the lease expires", "the follower is detected dead". Each of these hides the same buried assumption: that there is a reliable mechanism, somewhere, that turns "I haven't heard from node X" into "node X is dead". In an asynchronous network — which every IP network is — that mechanism cannot be perfect. The 1985 paper by Fischer, Lynch, and Paterson (FLP) proved that no deterministic algorithm can solve consensus in the presence of even one crash failure if the network is asynchronous, and the technical core of the proof is that you cannot distinguish a crashed process from a very slow process by message-timing alone.

Concretely: you sent node X a heartbeat 800 ms ago. You haven't heard back. Is X dead? You don't know. Maybe X is dead. Maybe X is alive but its kernel is paging because someone ran a find / on it. Maybe X is alive but the network between you and X has 950 ms of jitter right now. Maybe X is alive and the heartbeat reply is queued behind a 200 MB log flush. You can wait longer to be more sure — but waiting longer means if X really did crash, every other node that depended on X (every client routed to X, every Raft follower waiting for X's vote, every consumer reading from X's queue) is stalled for that wait. The choice is not between getting it right and getting it wrong; the choice is between getting it wrong quickly (false positive: kill the healthy slow node) and getting it wrong slowly (false negative: miss the death and stall the cluster).

Why this is fundamentally a tuning problem and not a correctness problem: the detector's output is a binary classification of an underlying continuous quantity (time since last heartbeat). Any boolean threshold on a continuous signal trades off false positives against false negatives — that is the receiver-operating-characteristic (ROC) curve from signal-detection theory, applied to distributed systems. There is no threshold that gives zero of both, ever. The mathematics of detection rules out the possibility before any code is written.

[Figure: Failure-detector tuning trades false positives against false negatives. A two-axis plot: the horizontal axis is the timeout threshold (heartbeats missed before declaring DOWN), the vertical axis is rate. The false-positive (flap) rate falls as the timeout grows; false-negative latency (cluster stall) rises. A shaded operating zone in the middle marks the practical, workload-dependent tuning region. Labelled points: A (1× heartbeat: flapping disaster), B (~3–5× heartbeats: phi-accrual sweet spot), C (30 s timeout: slow failover). There is no perfect timeout, only a trade-off.]
Illustrative — not measured data. The two curves move in opposite directions and never meet. Most production systems live in zone B, where neither curve is at its minimum but the sum of damage is. Aditi's CricStream cluster at 22:14 was operating at point A.

Heartbeats, timeouts, and why the obvious thing flaps

The obvious failure detector — and the one almost every production system starts with — is the fixed-timeout heartbeat. Every node sends a heartbeat to its peers every H seconds. If a peer does not receive a heartbeat for more than T seconds, it declares the sender dead. Pick T = 3H and you tolerate two missed heartbeats; pick T = 5H and you tolerate four. This is what etcd, Consul, and most service-discovery systems start with.
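
As a point of comparison with the phi-accrual code later in the chapter, here is a minimal sketch of the fixed-timeout scheme (hypothetical class and parameter names, not taken from any of the systems above): every heartbeat refreshes last_seen, and a peer is DOWN the instant the silence exceeds T.

# fixed_timeout.py — minimal sketch of the fixed-timeout detector described above
# (hypothetical helper, not lifted from any real system): T = timeout in seconds.
import time

class FixedTimeoutDetector:
    def __init__(self, timeout_s=3.0):            # e.g. T = 3H with 1-second heartbeats
        self.timeout_s = timeout_s
        self.last_seen = {}                        # peer -> monotonic seconds of last heartbeat

    def heartbeat(self, peer):
        self.last_seen[peer] = time.monotonic()

    def is_alive(self, peer):
        seen = self.last_seen.get(peer)
        # a peer never heard from is treated as dead; otherwise compare the silence to T
        return seen is not None and time.monotonic() - seen <= self.timeout_s

The single timeout_s constant is the knob the rest of this chapter is about replacing with something adaptive.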

It works until it doesn't, and what makes it stop working is bursty network jitter. Real production networks do not have Gaussian jitter; they have heavy-tailed delay distributions. p50 RTT between two nodes in the same AZ might be 0.4 ms, but p99 can be 8 ms (20× the median) and p99.9 can be 80 ms. When a top-of-rack switch buffer fills during a traffic spike, jitter goes from "almost none" to "hundreds of milliseconds" in a few seconds. With a fixed 3-second timeout and 1-second heartbeats, a 4-second jitter spike causes every node behind that switch to be declared dead — for the duration of the spike. The moment the spike clears, heartbeats arrive again and the nodes pop back UP. That is flapping.

Why fixed timeouts always flap on real networks: a timeout is a hard cliff in a distribution that does not have a hard cliff. The tail of network-delay distributions is fat — power-law in many production environments. Every fat-tail event past your threshold is a false positive. The only way to get fewer false positives with a fixed threshold is to push the threshold further into the tail, which strictly increases false-negative latency. There is no win available without changing the detector itself, only the threshold.

CricStream's 22:14 incident was exactly this. A burst of 4K video uploads from a partner studio filled the inter-AZ trunk for 6 seconds. Heartbeats between the load balancer and origin-17, origin-23, origin-31 queued behind the upload traffic. The load balancer's 3-second timeout fired, marked the three origins DOWN, and routed traffic away. When the burst cleared, the heartbeats arrived. The origins popped back UP, traffic returned, and the next burst (8 seconds later) repeated the cycle. The fix was not "make the network better" — the network was working as designed, with QoS prioritising bulk upload over heartbeats during congestion. The fix was to switch off fixed-timeout detection and adopt an accrual-style detector that adapts to recent jitter.

Phi-accrual — output a suspicion level, not a boolean

The 2004 paper by Hayashibara, Défago, Yared, and Katayama proposed the phi-accrual failure detector (often called the φ accrual detector), and it is what Akka, Cassandra, ScyllaDB, and most modern systems actually use. The core idea is to output a continuous suspicion level phi that expresses how improbable the current silence is, measured against a moving window of recent heartbeat inter-arrival times: the longer the gap relative to recent history, the higher phi climbs.

The maths: maintain a sliding window of the last 1000 inter-arrival intervals. Estimate the mean and standard deviation. When you ask "is X alive at time t?", compute the probability P(arrival > t - last_seen) under the empirical distribution, and report phi = -log10(P). A phi of 1 means "this delay would happen 10% of the time given recent jitter — probably fine". A phi of 8 means "this delay would happen 1 in 100 million times given recent jitter — almost certainly dead". The application picks a threshold (phi >= 8 is the classic Cassandra default), and the threshold can be the same number across heterogeneous network environments — because the detector adapts the underlying distribution to whatever network it is sitting on.
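
A worked example with assumed numbers (heartbeats averaging 1,000 ms apart with 100 ms of jitter, under a Normal model) shows how sharply phi climbs as the silence grows:

# Worked example (assumed numbers, Normal model): how suspicious are two silences?
import math

def phi(delay_ms, mean_ms=1000.0, std_ms=100.0):
    z = (delay_ms - mean_ms) / std_ms
    p_later = 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail probability of arriving later
    return -math.log10(max(p_later, 1e-300))

print(round(phi(1300), 1))   # ~2.9  -> 3 sigma late: mildly suspicious, below 8
print(round(phi(1800), 1))   # ~15.2 -> 8 sigma late: far past the classic threshold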

[Figure: Phi-accrual: suspicion rises with an overdue heartbeat, normalised to local jitter. A timeline of heartbeats arriving from a peer: most arrive on time, some slightly late. The phi curve climbs sharply while a heartbeat is overdue and falls back to zero when one arrives. A horizontal threshold at phi = 8 marks the kill line; the curve crosses it once, at the right of the diagram, where heartbeats stop entirely. Healthy: phi stays near 0 and small jitters cause small bumps that decay. Crash: phi grows monotonically with no incoming heartbeats.]
Illustrative — not measured data. The phi curve dips back to zero whenever a heartbeat arrives. It only crosses the kill threshold when heartbeats *stop*. Network jitter that does not stop heartbeats — only delays them — produces a small phi bump that immediately decays.

The big practical win: phi-accrual stops flapping on jittery networks. A jitter spike that pushes origin-17's heartbeat from "every 1.0 s" to "every 5.0 s" produces a phi bump of a few units, well under the kill threshold of 8: partly because a sensible detector refuses to fully trust a near-zero variance estimate (the min_std_ms floor in the code below), and partly because, once the spike enters the sliding window, the recent-jitter estimate grows and the delay stops looking anomalous. Fixed-timeout detection treats the spike as a binary failure; phi-accrual treats it as "slightly more jitter than usual" and waits.

A runnable phi-accrual detector

The Python below implements the phi-accrual detector against a simulated stream of heartbeats: a healthy warm-up, a jitter burst in which heartbeats arrive 4–6 seconds apart (no flapping), and then a hard crash (correctly detected). It samples phi at the awkward moments, just before each late heartbeat lands and every two seconds after the crash, so you can see the kill threshold being crossed exactly once.

# phi_accrual.py — minimal phi-accrual failure detector
import math, random, time
from collections import deque

class PhiDetector:
    def __init__(self, window_size=200, min_std_ms=50.0):
        self.intervals = deque(maxlen=window_size)  # ms between arrivals
        self.last_arrival_ms = None
        self.min_std_ms = min_std_ms                 # std floor: avoids div-by-zero and overconfidence early on

    def heartbeat(self, now_ms):
        if self.last_arrival_ms is not None:
            self.intervals.append(now_ms - self.last_arrival_ms)
        self.last_arrival_ms = now_ms

    def phi(self, now_ms):
        if not self.intervals or self.last_arrival_ms is None:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), self.min_std_ms)
        delay = now_ms - self.last_arrival_ms
        # P(delay > observed) under Normal(mean, std)
        z = (delay - mean) / std
        # complementary CDF tail; clamp to avoid log(0)
        p_later = 0.5 * math.erfc(z / math.sqrt(2))
        return -math.log10(max(p_later, 1e-300))

def simulate():
    # Floor the std at one heartbeat interval (same idea as Akka's
    # acceptable-heartbeat-pause); with a tiny floor, the first late heartbeat
    # of a burst would spike phi past any threshold before the window adapts.
    fd = PhiDetector(min_std_ms=1000.0)
    t = 0.0
    random.seed(7)
    # Phase 1: 30 healthy heartbeats, jitter ~ N(1000ms, 30ms)
    for _ in range(30):
        t += max(50, random.gauss(1000, 30))
        fd.heartbeat(t)
    print(f"  after warmup, phi(now) = {fd.phi(t):.2f}  (should be ~0)")
    # Phase 2: jitter burst -- heartbeats still arrive, but 4-6 s apart.
    # Sample phi just BEFORE each late heartbeat lands: the moment a fixed
    # 3-second timeout would already have flapped.
    for _ in range(5):
        gap = random.uniform(4000, 6000)
        print(f"  burst: phi just before late heartbeat = {fd.phi(t + gap):.2f}"
              "  (under 8 -> no flap)")
        t += gap
        fd.heartbeat(t)
    # Phase 3: crash — no more heartbeats. Sample phi every 2s for 30s.
    for dt in range(2, 32, 2):
        print(f"  crash + {dt:>2}s: phi = {fd.phi(t + dt * 1000):.2f}")

if __name__ == "__main__":
    simulate()

Illustrative output (approximate; exact values vary with the random seed):

  after warmup, phi(now) = 0.08  (should be ~0)
  burst: phi just before late heartbeat = 5.93  (under 8 -> no flap)
  burst: phi just before late heartbeat = 4.78  (under 8 -> no flap)
  burst: phi just before late heartbeat = 3.61  (under 8 -> no flap)
  burst: phi just before late heartbeat = 4.15  (under 8 -> no flap)
  burst: phi just before late heartbeat = 2.74  (under 8 -> no flap)
  crash +  2s: phi = 0.41
  crash +  4s: phi = 1.35
  crash +  6s: phi = 3.03
  crash +  8s: phi = 5.53
  crash + 10s: phi = 8.84
  crash + 12s: phi = 12.98

The load-bearing lines. self.intervals = deque(maxlen=window_size) keeps a rolling history: old samples drop off as new ones arrive, so the detector follows a network whose jitter regime changes (a new noisy neighbour appearing at 22:14, say). min_std_ms floors the standard deviation so the detector is never absurdly confident; with only 2 or 3 samples, or a run of perfectly regular heartbeats, a tiny empirical std would make every slightly late heartbeat produce phi=20, which is why the simulation floors it at one full heartbeat interval (the same idea as Akka's acceptable-heartbeat-pause). p_later = 0.5 * math.erfc(z / sqrt(2)) computes the upper-tail probability of a Normal — that is the model assumption that breaks for very heavy-tailed networks; in production, Cassandra uses an exponential-distribution model, which fits better for inter-AZ traffic. -math.log10(p_later) turns the probability into a log-scale "how surprised should I be"; a phi of 8 means "this would happen one in 10⁸ times".

Notice how, during the burst, phi sampled just before each late heartbeat stays under the kill threshold even though the delays are 5× the mean: the std floor absorbs the first spike, and once the burst enters the window the empirical std climbs to ~1,500 ms, so the threshold effectively rides up with it. Notice how, during the crash, phi grows monotonically and crosses 8 at around 10 seconds. That is the whole game. Why phi-accrual's adaptive jitter window is what saves you: a fixed-threshold detector with std=30ms tuned at deploy time would fire a false positive on every burst; the phi detector's variance is re-estimated from the last 200 samples. This is the central design move: the detector is not a fixed predicate, it is an estimator of the workload's own variability.
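
The exponential model mentioned above has a convenient closed form. A minimal sketch under that assumption (the distribution shape the article attributes to Cassandra's detector; the formula is derived here, not copied from the Cassandra source):

# Exponential-model phi: P(heartbeat arrives later than delay) = exp(-delay / mean),
# so phi = -log10(exp(-delay / mean)) = delay / (mean * ln 10).
import math

def phi_exponential(delay_ms, mean_ms):
    return delay_ms / (mean_ms * math.log(10))

print(round(phi_exponential(5000, 1000), 2))    # ~2.17: a 5x-mean silence is mild suspicion
print(round(phi_exponential(18400, 1000), 2))   # ~7.99: roughly 18x the mean crosses phi = 8

The exponential tail decays far more slowly than the Normal one, so a single late heartbeat moves phi much less; the price is slower growth of suspicion during a real crash.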

Where this lives in the stack — and where it bites

Phi-accrual or some accrual variant lives in: Cassandra and ScyllaDB (gossip-based intra-cluster failure detection); Akka cluster (the Erlang-style supervision tree applied to the JVM, with phi as the membership signal); Hazelcast (in-memory data grid); most modern service meshes (Istio's outlier detection is an accrual-style detector dressed up). The fixed-timeout siblings still live in: HAProxy / nginx upstream health checks (typically 3 missed checks), Kubernetes liveness/readiness probes (configurable; by default a probe every 10 seconds with a 1-second timeout, and 3 consecutive failures = DOWN), etcd's lease keepalive (fixed TTL, but the lease itself is the failure detector).

The bite spots are predictable:

PaySetu's payment-gateway detector misfiring under fsync stalls. PaySetu's gateway nodes occasionally took 4-second fsync calls when the underlying RAID controller was rebuilding. During those 4 seconds, the gateway could not send heartbeats. The fixed 3-second timeout fired, the load balancer marked the gateway DOWN, the in-flight bank-API responses got 502s, ₹4.2 crore of payments queued. The fix: phi-accrual on the LB-to-gateway path, with a kill threshold tuned to ride out fsync stalls (phi=10 instead of 8 — slightly slower failover, dramatically fewer false positives). Failover latency went from "~3 seconds and wrong" to "~12 seconds and right". That trade was good.

MealRush's per-rider GPS-tracker detection. MealRush's rider apps send GPS pings every 4 seconds. The dispatch service used a 30-second timeout to declare a rider "offline". This is a correct tuning for that workload because rider connectivity is genuinely flaky (entering a basement, switching between 4G cells), and false positives cost real money — a wrongly-offline rider gets de-prioritised for assignments. MealRush sits at point C in the operating-zone diagram on purpose. The slow failover is the right answer because the consequence of a false positive is worse than the consequence of a false negative.

CricStream's post-incident fix. After 22:14, CricStream replaced fixed-timeout LB health checks with phi-accrual gossip (similar to Cassandra's, since CricStream's metadata layer was Cassandra anyway). The flap rate dropped from "several per hour during contention" to "roughly zero". They paid for it with a slightly slower true-failure detection (around 14 seconds vs the previous 6) but in cricket-final scale that was strictly preferable. The two-line lesson Aditi wrote in her post-mortem: fixed-timeout failure detection is a bug waiting for the network to misbehave, and the question 'is X dead' has no boolean answer.

Common confusions

  • "A long enough timeout solves it." It postpones the false positive at the cost of slower failover. There is no fixed timeout that survives both quiet networks and bursty ones — a 30-second timeout that is fine on a quiet day is too short during a 60-second congestion event. The detector itself has to be adaptive.
  • "Failure detection is the same as health checking." Health checks (HTTP /health) verify the application is responsive on a peer-initiated request. Failure detection verifies the peer is alive at all via heartbeats the peer sends. A node can pass health checks while its membership protocol declares it dead (because heartbeats are stuck behind a slow flush) — and vice versa, an application can be deadlocked while heartbeats still go out from a watchdog thread. They measure different things.
  • "Phi-accrual gives a perfect detector." No detector in an asynchronous network is perfect — FLP rules that out. Phi-accrual gives an adaptive detector whose false-positive rate is bounded by a probability you choose (the threshold). It does not give zero false positives; it gives, say, 10⁻⁸ per query. At one query per second, that is one false positive every 3 years. That is good enough for production but is not zero.
  • "The leader knows when the follower has died." No. The leader knows when it last heard from the follower. The two facts are not the same. The follower may have died and a heartbeat may still be in a network buffer about to deliver. Or the follower may be alive and the heartbeat lost. The leader can only act on local observation, never on remote state.
  • "Crash failures and partial failures are detected the same way." They are not. A crash failure (process gone, kernel panic, hardware reset) stops all heartbeats simultaneously. A partial failure (process alive but stuck on a deadlocked thread, or alive but disk-full) stops application work while heartbeats — sent by the runtime — keep going. Heartbeat-based detection misses partial failures entirely; that is why production systems layer in application-level health probes on top.
  • "You can detect failures faster by sending more heartbeats." Up to a point — and then the heartbeat traffic itself causes the network congestion that makes detection slower. SWIM-style protocols (used by Hashicorp Memberlist, Consul) gossip suspicion rather than heartbeating every peer-to-peer pair, specifically because pure heartbeats scale O(N²) in cluster size and saturate the network at a few hundred nodes.

Going deeper

The Chandra–Toueg classification — what failure detectors can be

Tushar Chandra and Sam Toueg's 1996 paper "Unreliable Failure Detectors for Reliable Distributed Systems" (JACM 1996) is the formal foundation. They classify failure detectors by two properties: completeness (eventually every dead node is suspected by every live node) and accuracy (live nodes are not wrongly suspected: strong accuracy means no live node is ever suspected, weak accuracy means at least one live node is never suspected, and the eventual variants only require the property to hold after some unknown time). The combinations give a hierarchy: P (perfect), S (strong), and their eventual counterparts ♦P and ♦S. The headline result: consensus is solvable with a ♦S detector, the weakest class strong enough to break FLP. This is the theoretical reason every Paxos / Raft implementation uses an eventually-strong detector (heartbeats with timeouts that you eventually tune correctly), not a perfect one. Read Chandra–Toueg first; everything practical descends from that small grid of completeness crossed with accuracy.

Why GC pauses are the canonical attack on every failure detector

A stop-the-world GC pause in a JVM-based system can block the heartbeat thread for seconds. From the outside, this looks identical to a crash — heartbeats stop. From the inside, the JVM is alive, just frozen. Two failure modes follow. First, the cluster declares the GC-pausing node dead, removes it from membership, and when the GC finishes the node tries to act on stale state (this is the same scenario that fencing tokens fix — see the previous chapter). Second, the GC-pausing node is the detector itself, and during the pause it stops processing incoming heartbeats — the entire cluster looks dead from its perspective when GC ends. Cassandra mitigates with G1 tuning targeting <200ms pauses; ScyllaDB sidesteps the problem by being C++; Discord's Go→Rust migration was driven partly by exactly this concern. The general lesson: a failure detector running on a runtime that pauses is itself unreliable; the detector and the application must either share fate (run on a runtime without unpredictable pauses) or be split (the detector runs in a tiny dedicated process or thread that is pinned and never paged).
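
To make the pause-looks-like-a-crash point concrete, here is a small sketch (hypothetical file name, reusing the PhiDetector from the runnable example above): freeze the heartbeat sender for a few seconds and look at what a peer's phi does.

# gc_pause_demo.py — hypothetical sketch reusing the PhiDetector defined earlier:
# a stop-the-world pause on the sender looks, to the peer, like heartbeats stopping.
from phi_accrual import PhiDetector

fd = PhiDetector(min_std_ms=1000.0)        # same floor as the earlier simulation
t = 0.0
for _ in range(30):                        # healthy node, heartbeats every ~1 s
    t += 1000.0
    fd.heartbeat(t)
for pause_s in (2, 5, 8, 15):              # the sender freezes: no heartbeats at all
    print(f"pause {pause_s:>2}s -> phi = {fd.phi(t + pause_s * 1000):.1f}")
# Roughly: a 2-5 s pause stays under a threshold of 8; an 8-15 s pause is
# indistinguishable from a crash, which is exactly the stale-lease scenario
# that fencing tokens exist to contain.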

SWIM and the gossip-based detector — when N² is too expensive

The SWIM paper (Scalable Weakly-consistent Infection-style Membership, Das, Gupta, Motivala, DSN 2002) replaces all-pairs heartbeating with direct + indirect probing. Each node periodically picks a random peer, sends a ping, and if no ack arrives within a timeout, asks k random other nodes to ping the suspect on its behalf. If none of them get a response either, the suspect is gossiped as dead. The clever part is the indirect probing: a single node's view of "X is dead" is corrected by k other vantage points before being broadcast, dramatically reducing false positives. Hashicorp's Memberlist (used by Consul, Nomad, Serf) is the canonical SWIM implementation; it routinely runs clusters of 5,000+ nodes that would melt under all-pairs heartbeating. The trade-off: SWIM detects failures more slowly (tens of seconds for 1,000-node clusters) than direct heartbeating, but it scales.
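
A minimal sketch of one SWIM probe round, with a hypothetical direct_ping callback standing in for the network layer (real implementations such as Memberlist add suspicion timeouts, piggybacked gossip, and retransmission on top of this skeleton):

# swim_probe.py — sketch of one SWIM probe round (hypothetical helpers, not
# Memberlist's actual API). direct_ping(src, dst, timeout_s) -> bool is assumed
# to send a ping on behalf of src and report whether an ack came back in time.
import random

def probe_round(self_node, members, direct_ping, k=3, timeout_s=0.5):
    target = random.choice([m for m in members if m != self_node])
    if direct_ping(self_node, target, timeout_s):
        return target, "alive"
    # Direct ping failed: ask k other vantage points to ping the target for us,
    # so one broken link does not become a cluster-wide false positive.
    candidates = [m for m in members if m not in (self_node, target)]
    for relay in random.sample(candidates, k=min(k, len(candidates))):
        if direct_ping(relay, target, timeout_s):
            return target, "alive"            # someone else can still reach it
    return target, "suspect"                  # gossip suspicion; don't declare dead yet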

Failure detection vs. failure containment

Detection is half the problem; what you do once you have detected is the other half. A premature detection that triggers leader election when the failure was transient is worse than no detection at all — you have just executed a 30-second consensus protocol for no reason, and during that 30 seconds the cluster cannot make progress. Most production systems layer a suspicion delay on top of detection: the detector says "I think X is dead", but no action is taken until the suspicion is confirmed by k other nodes (SWIM-style) or sustained for M more seconds (Akka's acceptable-heartbeat-pause). This is where the orchestration around the detector matters more than the detector itself.

Asymmetric detection — when only some nodes think X is dead

The most insidious detection bug is asymmetric failure: node A thinks X is alive, node B thinks X is dead. This happens whenever the network is partitioned in a non-symmetric way — A and X share a switch, B does not, and B's path to X is the one that broke. The membership protocol now has divergent state, and depending on the protocol's tie-breaking, the cluster may flip-flop — B reports X dead, A reports X alive, the gossip merges both, suspicion oscillates, and dependents of X get inconsistent routing decisions. SWIM handles this with the indirect-probing detour (B asks A to confirm before deciding), but raw heartbeat detectors are vulnerable. A detector running independently on every node, with no cross-validation, will produce as many opinions of X's state as there are observers — and those opinions will not always agree.

Reproduce on your laptop

# Run the single-process simulation first, then chaos-engineer a real link:
python3 phi_accrual.py
# Inject controlled jitter via tc:
sudo tc qdisc add dev lo root netem delay 100ms 50ms
# Re-run with two real Python processes ping-ponging heartbeats over a TCP socket;
# watch phi stay flat under jitter, climb when one process is killed.
sudo tc qdisc del dev lo root netem
# Now drop 10% of packets to simulate a flaky NIC:
sudo tc qdisc add dev lo root netem loss 10%
# Compare phi behaviour between "high jitter" and "high loss" — the two failure
# modes look different to the detector, and the threshold needs to ride both.
sudo tc qdisc del dev lo root netem

The single-machine simulation is enough to internalise the detector's behaviour. The tc netem overlay is the easy way to feel why fixed-timeout detection breaks under realistic delay distributions. Tune the min_std_ms floor and watch the false-positive rate explode when you set it too low — the detector becomes overconfident about its variance estimate during the warmup window, and one late heartbeat is enough to push phi past the threshold.
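
One minimal way to run the two-process exercise described in the comments above, sketched here over UDP for simplicity (hypothetical script name and port, reusing the PhiDetector from phi_accrual.py):

# heartbeat_pair.py — hypothetical companion script: "python3 heartbeat_pair.py send"
# in one terminal, "python3 heartbeat_pair.py watch" in another, then apply the tc
# netem rules above and watch phi; Ctrl-C the sender to simulate a crash.
import socket, sys, time
from phi_accrual import PhiDetector            # the detector defined earlier

ADDR = ("127.0.0.1", 9999)                     # assumed-free local UDP port

def send():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:                                # one heartbeat per second
        s.sendto(b"hb", ADDR)
        time.sleep(1.0)

def watch():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(ADDR)
    s.settimeout(1.0)
    fd = PhiDetector(min_std_ms=1000.0)        # same floor as the simulation
    while True:
        try:
            s.recvfrom(16)
            fd.heartbeat(time.monotonic() * 1000)
        except socket.timeout:
            pass                               # nothing arrived this second; phi climbs
        print(f"phi = {fd.phi(time.monotonic() * 1000):.2f}")

if __name__ == "__main__":
    send() if sys.argv[1:] == ["send"] else watch()

Kill the sender and watch phi climb past 8; restart it and watch it collapse back toward zero.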

Where this leads next

The next chapter, the SWIM membership protocol, walks through Hashicorp's production gossip-based detector — how indirect probing eliminates the O(N²) traffic of all-pairs heartbeating, and how suspicion is gossiped instead of broadcast. After that, accrual failure detectors in Cassandra is the deep dive on Cassandra's specific implementation choices: exponential vs Normal distribution model, window sizing, threshold tuning under multi-DC topology.

The bigger arc: every chapter from here on quietly assumes a failure detector. Leader election assumes the lease holder is detected dead. Replication assumes the failed follower is detected. Consensus assumes the partitioned candidate is suspected. None of those assumptions hold if the detector is tuned wrong, or running on a paused runtime, or sitting on a network whose jitter distribution it does not adapt to. Failure detection is its own problem — and, in production, it is the problem most teams underspend on until the first 22:14.

References

  • Fischer, M., Lynch, N., Paterson, M. — "Impossibility of Distributed Consensus with One Faulty Process" (JACM 1985) — the FLP result; §3 is the core proof that a slow process is indistinguishable from a crashed one.
  • Chandra, T., Toueg, S. — "Unreliable Failure Detectors for Reliable Distributed Systems" (JACM 1996) — the formal taxonomy (P, S, ♦P, ♦S) and the proof that ♦S is the weakest detector that solves consensus.
  • Hayashibara, N., Défago, X., Yared, R., Katayama, T. — "The φ Accrual Failure Detector" (SRDS 2004) — the paper that introduced phi-accrual; Cassandra's detector descends directly from this.
  • Das, A., Gupta, I., Motivala, A. — "SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol" (DSN 2002) — gossip-based failure detection at scale.
  • Apache Cassandra source — o.a.c.gms.FailureDetector and o.a.c.gms.Gossiper — production phi-accrual with exponential-distribution model; read this if the maths-paper version feels abstract.
  • Hashicorp — "Memberlist" — the canonical SWIM implementation in Go; the README is a good plain-language summary of indirect probing.
  • Kleppmann, M. — Designing Data-Intensive Applications (2017) — Chapter 8 on troubles with distributed systems, especially the GC-pause-as-network-partition discussion.
  • Leader election without consensus — the chapter that quietly assumed the detector worked; this chapter is the footnote.
  • Lamport, L. — Paxos Made Simple (2001) — the practical Paxos write-up; the failure-detector assumption is implicit in the progress discussion ("a leader is eventually chosen") and is exactly the ♦S detector from Chandra–Toueg.
  • Discord Engineering — "Why Discord is switching from Go to Rust" (2020) — concrete production story of GC pauses interacting badly with a heartbeat-based scheduler; the detector tuning was downstream of the runtime choice.