Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Heartbeats: the naive approach
It is 03:42 and Rahul, on-call for PaySetu's UPI authoriser cluster, is staring at a 23-line bash script someone committed in 2021. It runs every two seconds on every node, sends a UDP packet to every peer, and writes the timestamp to a file. A second cron job reads the file every five seconds, and if any peer's timestamp is older than 10 seconds, it removes that peer from the membership list. This is PaySetu's failure detector. It has been running for four years. It has triggered 38 outages, all of them at exactly the wrong moment — Diwali sale traffic, end-of-quarter settlement, the morning after a kernel upgrade — and yet nobody has replaced it, because every replacement proposal hits the same wall: the existing one is so simple it cannot be wrong, only mistuned. Rahul is about to learn that this belief is wrong in three ways, each of which has its own production scar.
A heartbeat is a periodic "I am alive" message; a timeout is a deadline past which absence implies death. The design is correct in spirit and wrong in detail: the heartbeat interval H, the timeout T, and the choice between push and pull each interact with network jitter, GC pauses, and clock skew in ways that the naive scheme does not see. Every production heartbeat detector starts as four lines of Python and grows into a 400-line state machine with adaptive timeouts, suspicion gossip, and indirect probing. This chapter is about the four lines and the three specific reasons they are not enough.
The four-line failure detector
The simplest possible failure detector is four lines of pseudocode. Every node runs two threads. The first thread, every H seconds, sends a HEARTBEAT(self, now()) message to every peer. The second thread, every H seconds, scans the table of last-seen timestamps and marks any peer whose last_seen + T < now() as DOWN. That is it. There is no probability theory, no sliding window, no quorum. H = 1s and T = 3s are the defaults you will find in 80% of self-rolled detectors — etcd's --heartbeat-interval defaults to 100 ms with --election-timeout of 1000 ms (a 1:10 ratio rather than 1:3, but the same shape); Consul's serf defaults are similar; PostgreSQL streaming replication's wal_sender_timeout defaults to 60 s with no explicit heartbeat (it piggybacks on the WAL stream). The pattern is everywhere because it is the minimum viable detector — it is the answer to "what is the dumbest thing that could possibly work".
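Spelled out, the four lines look like this (Python-flavoured pseudocode, not yet runnable: send(), now(), sleep(), peers, status, and last_seen are assumed to exist; the fully runnable version appears later in the chapter):

def heartbeat_loop():                 # thread 1: push "I am alive" to every peer, every H seconds
    while True:
        for p in peers: send(p, ('HEARTBEAT', self_id, now()))
        sleep(H)

def scan_loop():                      # thread 2: mark any peer silent for longer than T as DOWN
    while True:
        for p in peers: status[p] = 'DOWN' if last_seen[p] + T < now() else 'UP'
        sleep(H)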
The reason it is the right starting point: the FLP impossibility result (covered in the wall chapter) tells you that in an asynchronous system you cannot reliably tell a crashed peer from a slow one, so no detector is perfect. You might as well start with the dumbest one and add complexity only where you can point to a specific production failure that the dumb one missed. Why simplicity is a feature here and not a bug: every detector is a state machine that runs on every node in the cluster, and every line of code in it is a line that can have a bug, a race, or a memory leak. The detector is in the critical path of every leader election, every membership change, every replica failover. A buggy detector takes the cluster down in ways that are hard to debug because the detector is itself the thing you would normally use to debug a cluster. Four lines of code is auditable in five minutes; 400 lines is not. Every team that has migrated from "naive heartbeat + timeout" to "phi-accrual + indirect probing + suspicion gossip" has, at some point, wished they could go back, because the new detector had a subtle bug and the old one didn't.
The state on each node is a single dictionary last_seen: peer_id -> timestamp. The state grows linearly with cluster size. The traffic grows quadratically — N nodes each send N-1 heartbeats every H seconds, so the network carries N(N-1)/H heartbeats per second. At N=10 and H=1s that is 90 packets per second across the cluster — fine. At N=200 it is 39,800 packets per second — still fine on a 10G fabric, but you can feel it on a small switch. At N=2000 it is 4 million packets per second of heartbeats alone, and your detector is now causing the congestion that makes its own heartbeats late. Pure all-pairs heartbeating runs out of network somewhere between 500 and 2000 nodes; this is the threshold at which gossip-based detectors (covered in SWIM protocol) become not-optional. For most clusters — every Raft group, every five-replica Postgres, every nine-node etcd — N is small enough that the all-pairs design is correct.
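The arithmetic behind those numbers is worth keeping as a function you can rerun for your own cluster size; a minimal sketch, with the judgment about what counts as "fine" left to you:

def heartbeat_load(n_nodes, h_seconds):
    # cluster-wide heartbeat packets per second for all-pairs push: N * (N - 1) / H
    return n_nodes * (n_nodes - 1) / h_seconds

for n in (10, 200, 2000):
    print(f'N={n:4d}, H=1s -> {heartbeat_load(n, 1.0):,.0f} heartbeats/s cluster-wide')
# N=  10, H=1s -> 90 heartbeats/s cluster-wide
# N= 200, H=1s -> 39,800 heartbeats/s cluster-wide
# N=2000, H=1s -> 3,998,000 heartbeats/s cluster-wide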
Push vs pull — the choice that nobody documents
There are two ways to do heartbeats and the choice matters more than the textbooks suggest. Push (the design above): every node periodically sends "I am alive" to every peer; receivers maintain last_seen. Pull: every node periodically asks every peer "are you alive?"; senders maintain last_responded. Push is what etcd, Consul, Cassandra, and Kafka use. Pull is what HTTP load-balancers, Kubernetes liveness probes, and AWS ELB health-checks use. They look identical from a distance and they have different failure modes.
Push fails silently from the receiver's perspective. If the sender is alive but its heartbeat-sending thread has crashed (a bug in the runtime, an unhandled exception that killed only that thread), the sender keeps doing application work but stops sending heartbeats. Every receiver thinks the sender is dead. The sender thinks it is fine. This is the zombie sender failure mode: the application is healthy, the detector says it is dead, the cluster has already failed it over. Pull fails the opposite way. If the receiver's response thread is broken but the application threads are still alive, the prober declares the receiver dead. The receiver is also still doing application work, but it is now answering requests that the load balancer thinks should go elsewhere — split-brain by liveness probe. Why this matters more than the choice of H or T: the failure mode of the detector itself defines what kind of partial failures you can survive. A push-based detector is fragile to bugs in the sender's heartbeat thread; a pull-based detector is fragile to bugs in the receiver's response thread. Production systems often combine both: Cassandra uses push for membership but layers TCP-level keepalive (which is a pull from the OS kernel's perspective) for connection liveness. The combination catches both failure modes that either one alone misses.
The choice also has a quieter property: push is fanout-asymmetric and pull is fanin-asymmetric. With push, the sender pays the network cost of (N-1) outbound packets every H seconds; the receiver only handles (N-1) inbound packets — usually cheaper because receive is interrupt-driven and less CPU per packet. With pull, the prober pays both the outbound probe and the inbound response. If your monitoring is centralised — one machine probes all 5,000 backend instances — the prober becomes a bottleneck and a single point of failure. If your monitoring is decentralised, push scales to higher node counts before the senders become the bottleneck. This is why most large-cluster systems converge on push: the cost is amortised across all nodes rather than concentrated in a prober.
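For contrast with the push implementation later in this chapter, here is the pull shape: a minimal sketch of a centralised prober in the style of a load-balancer health check. The /healthz URLs, the 2-second interval, and the 3×-interval DOWN rule are illustrative assumptions, not any particular product's defaults:

import time
import urllib.request

def probe(url, timeout=2.0):
    # pull-style liveness: the prober pays for both the outbound probe and the inbound response
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def probe_loop(backends, interval=2.0):
    # centralised prober: all of the detector's state (last_responded) lives on this one machine,
    # which is exactly the fanin concentration described above
    last_responded = {}
    while True:
        for b in backends:
            if probe(b):
                last_responded[b] = time.time()
        down = [b for b in backends if time.time() - last_responded.get(b, 0.0) > 3 * interval]
        print('DOWN:', down)
        time.sleep(interval)

# probe_loop(['http://10.0.0.1:8080/healthz', 'http://10.0.0.2:8080/healthz'])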
Why the obvious thing fails — three specific scars
The four-line detector breaks in production in three distinct ways. Each has its own war story, its own fix, and its own production scar across the industry.
Scar 1: Network jitter under load. A naive timeout — T = 3H — is a hard cliff in a distribution that doesn't have one. Production network delays follow a heavy-tailed distribution: median 0.4 ms within an AZ, p99 of 8 ms, p99.9 of 80 ms, p99.99 of 800 ms or worse during congestion. With H=1s and T=3s, a 4-second spike is enough to falsely declare every peer behind a congested switch as DOWN. CricStream had this exact failure on the night of an IPL final: a burst of 4K video uploads filled the inter-AZ trunk for 6 seconds, and 7 origin nodes flapped DOWN-UP-DOWN-UP for 90 seconds. Why "increase the timeout" is not the answer: if heartbeat delays are heavy-tailed (which they are on every real network — bursty TCP, bufferbloat at routers, retransmits under loss), pushing T further into the tail strictly trades false-positive rate for false-negative latency. At T=10s you survive a 6-second jitter spike, but you also take 10 seconds to detect a real crash; multiply that by every cluster-state-change downstream of detection and your real-failure recovery time goes from 3s to 30s+. The shape of the distribution does not change — only your operating point on the ROC curve. The structural fix is to abandon fixed timeouts and use an adaptive detector that learns the local jitter distribution. That is what phi-accrual does.
Scar 2: GC pauses and stop-the-world freezes. A JVM stop-the-world GC pause can block all threads for hundreds of milliseconds — sometimes seconds. From the outside, this looks identical to a crash: heartbeats stop. From the inside, the JVM is alive and will resume in a moment. With T=3s, a 4-second GC pause kicks the node out of the cluster; when GC finishes, the node tries to act on stale state. KapitalKite had this in 2024: a JVM-based market-data fan-out service ran a major GC during open-of-market, missed three heartbeats, was failed over, and when the GC finished the original node started serving stale price quotes for 12 seconds because it had not yet learned it was no longer the leader. The fix is fencing tokens (covered in split-brain and fencing) plus a detector that does not penalise short pauses — phi-accrual's adaptive variance helps here, but not enough; you also need lease-based ownership so the failed-over node loses its grant before it wakes up.
Scar 3: Asymmetric partitions. The third scar is the meanest. The network breaks in a lopsided way: packets from A still reach B, but packets from B no longer reach A. With independent push detectors the two views diverge. B keeps receiving A's heartbeats and thinks the cluster is healthy; A hears nothing from B, declares it dead, and fails it over, while B carries on serving with no idea anything has changed. The milder version is asymmetric loss rather than a clean break: the congested direction still passes most of the tiny heartbeat packets but mangles the larger application payloads, so every detector stays green while requests time out. This is the gray failure mode: the detector says everything is fine, the application is failing, and there is no alarm. The classic cure is bilateral heartbeating — A's heartbeat to B includes A's view of B, and B notices that "A says I am dead, but I am alive, so the path B-to-A must be broken". SWIM's indirect probing handles this elegantly; raw all-pairs heartbeating does not. BharatBazaar saw this during a Big Billion Days sale: one of two redundant network paths between availability zones developed asymmetric loss, both detectors said "all green", checkout latency p99 went from 80 ms to 4.2 s, and the team spent 23 minutes finding the problem because the dashboards were lying.
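A sketch of what the bilateral cure adds to the heartbeat payload. The field names are hypothetical, not any particular system's wire format; the point is only that each heartbeat carries the sender's opinion of its peers:

import json, time

def make_heartbeat(self_id, alive_view):
    # push heartbeat that also carries the sender's current view of every peer's liveness
    return json.dumps({'from': self_id, 'ts': time.time(),
                       'peer_view': alive_view}).encode()

def on_heartbeat(data, my_id):
    m = json.loads(data)
    # normal last_seen bookkeeping happens elsewhere; the bilateral check is:
    if m['peer_view'].get(str(my_id)) is False:
        # the sender can reach me but has stopped hearing from me,
        # so my outbound path to the sender is probably broken: alarm, do not stay silent
        print(f"[{my_id}] peer {m['from']} thinks I am dead; path {my_id}->{m['from']} suspect")

# example: node 0 can hear node 2, but node 2's packets are not reaching node 0
on_heartbeat(make_heartbeat(0, {1: True, 2: False}), my_id=2)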
Building it from scratch — the runnable version
Here is the four-line detector turned into runnable Python. Two threads, an in-memory last_seen table, UDP heartbeats over the loopback. This is the minimum viable detector; tune it and break it on your laptop before reading the next chapter on phi-accrual.
import json
import multiprocessing
import socket
import threading
import time
from collections import defaultdict

class NaiveHeartbeat:
    def __init__(self, self_id, peers, port=9100, H=1.0, T=3.0):
        self.self_id, self.peers, self.port = self_id, peers, port
        self.H, self.T = H, T                          # heartbeat interval, timeout
        # unseen peers count as "seen just now": a slow startup is not a dead peer
        self.last_seen = defaultdict(lambda: time.time())
        # peers start unconfirmed; the first heartbeat logs an UP transition
        self.alive = {p: False for p in peers if p != self_id}
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.bind(('127.0.0.1', port + self_id))

    def sender(self):
        # push: every H seconds, tell every peer "I am alive"
        while True:
            msg = json.dumps({'from': self.self_id, 'ts': time.time()}).encode()
            for p in self.peers:
                if p == self.self_id:
                    continue
                self.sock.sendto(msg, ('127.0.0.1', self.port + p))
            time.sleep(self.H)

    def receiver(self):
        # record receive-side arrival time; the sender's ts is deliberately ignored
        while True:
            data, _ = self.sock.recvfrom(1024)
            m = json.loads(data)
            self.last_seen[m['from']] = time.time()

    def scanner(self):
        # every H seconds, mark any peer silent for longer than T as DOWN
        while True:
            now = time.time()
            for p in self.peers:
                if p == self.self_id:
                    continue
                t = self.last_seen[p]
                was = self.alive.get(p, False)
                is_now = (now - t) <= self.T
                if was != is_now:                      # log state transitions only
                    state = 'UP' if is_now else 'DOWN'
                    print(f'[{self.self_id}] peer {p}: {state} (gap={now - t:.2f}s)')
                self.alive[p] = is_now
            time.sleep(self.H)

    def start(self):
        for fn in (self.sender, self.receiver, self.scanner):
            threading.Thread(target=fn, daemon=True).start()

def run_node(i):
    # one OS process per node, so kill -STOP <pid> can freeze a single node later
    NaiveHeartbeat(self_id=i, peers=[0, 1, 2, 3]).start()
    time.sleep(600)                                    # daemon threads die with the process

if __name__ == '__main__':
    # launch a 4-node cluster on loopback
    procs = [multiprocessing.Process(target=run_node, args=(i,)) for i in range(4)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
Run it. Each node logs its peers' first transition to UP and then goes quiet; the steady-state output looks like:
[0] peer 1: UP (gap=0.04s)
[1] peer 2: UP (gap=0.05s)
[2] peer 3: UP (gap=0.06s)
[3] peer 0: UP (gap=0.04s)
Now break it. In another terminal: sudo tc qdisc add dev lo root netem delay 0ms 4000ms distribution normal. The loopback now has Gaussian jitter with σ=4s. Within seconds:
[0] peer 1: DOWN (gap=4.21s)
[2] peer 1: DOWN (gap=4.18s)
[1] peer 0: DOWN (gap=4.07s)
[0] peer 1: UP (gap=0.61s)
[2] peer 1: UP (gap=0.59s)
This is scar 1, live on your laptop. Reverting the jitter (sudo tc qdisc del dev lo root netem) restores the steady state. The interesting tuning lever is T: increase it to 10 seconds and the flapping stops, but now run kill -STOP <pid> on one of the processes to simulate a GC pause — the detector takes 10 seconds to notice. There is no value of T that survives both jitter and real failure simultaneously. That is the lesson the four-line detector teaches before you discard it. The line self.last_seen = defaultdict(lambda: time.time()) initialises new peers as alive, which is the right default — otherwise a slow startup looks like a dead peer. The was != is_now check ensures only state transitions are logged, not steady-state — without it the scanner would scream at every tick.
Common confusions
- "A heartbeat is the same thing as a TCP keepalive." They are different. TCP keepalive is the OS kernel's check that a connection-level peer is reachable; it triggers at the socket layer with default intervals of 2 hours on Linux (yes, hours —
tcp_keepalive_time = 7200). Application-level heartbeats run at 0.1–10 s intervals, are visible to the application, and survive across socket reconnects. TCP keepalive catches "the network is gone"; heartbeats catch "the application is gone". Most production systems use both, layered. - "If the timeout T is bigger than the heartbeat interval H, you are safe." Only if T is bigger than
H + max_jitter + max_GC_pause, and "max" is unbounded in the tail. The folk ruleT = 3Hcovers two consecutive lost heartbeats, which is enough for clean networks and useless on congested ones. The right framing is "what fraction of heartbeats can be lost or delayed before I declare DOWN?" — and the answer depends on the local jitter distribution, which fixed timeouts cannot adapt to. - "More frequent heartbeats give faster detection." Up to a point — and then the heartbeat traffic itself causes the network congestion that delays heartbeats. With all-pairs heartbeating at
N=500andH=100ms, you are sending 2.5 million heartbeats per second cluster-wide. The detector is now a denial-of-service against itself. - "A heartbeat detector and a load-balancer health check do the same thing." They are different problems. A heartbeat detector decides cluster membership — who is part of the consensus group, who can vote, who holds shards. A health check decides traffic routing — should this instance receive requests right now. Membership transitions are expensive (rebalance, leader election); routing transitions are cheap (one connection less). Conflating them gives you cluster-wide rebalances every time one instance has a slow garbage collection.
- "Detection latency and detection accuracy are independent." They are not — they are the two axes of the detector's ROC curve. You buy more accuracy with more latency, and vice versa. Every detector design is a point on that curve. The four-line detector sits at one specific point that is rarely the right one for production.
Going deeper
The folk rule T = 3H and where it comes from
Three is the smallest integer larger than 2, and 2 is the number of consecutive missed heartbeats that you can blame on transient packet loss without committing to "the peer is dead". The folk rule is T = (k+1) × H where k is the number of missed heartbeats you tolerate. k=2 (so T=3H) is the unwritten default. Cassandra's phi_convict_threshold=8 corresponds to roughly k=4 in moderately jittery networks. Akka's acceptable-heartbeat-pause is 3 seconds with a 1-second interval — k=2. The default is rarely the right answer for a specific workload, but it is rarely catastrophically wrong either. Where the folk rule breaks: the jitter distribution has a heavy tail, so the probability of k consecutive late arrivals is much higher than (probability of one late arrival)^k. The events are not independent — they are correlated by whatever caused the network event in the first place.
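A back-of-the-envelope sketch of why the independence assumption flatters the folk rule. The 1% per-heartbeat late probability is an illustrative made-up number, not a measurement:

def folk_timeout(h, k=2):
    # T = (k+1) * H: tolerate k consecutive missed heartbeats before declaring DOWN
    return (k + 1) * h

p_late = 0.01        # hypothetical: 1% of heartbeats arrive later than H
for k in (2, 3, 4):
    # if late arrivals were independent, a false DOWN would need k+1 misses in a row
    print(f'k={k}  T={folk_timeout(1.0, k):.0f}s  naive false-DOWN odds ~ {p_late ** (k + 1):.0e}')
# in reality the misses are correlated (one congestion event delays them all),
# so the real false-positive rate sits far above these numbers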
Why receive-side and send-side timestamps disagree
A heartbeat carries the sender's timestamp ts. The receiver records its own arrival time. The two clocks drift; in cloud VMs, drift between AZs can be 5–50 ms; without NTP/PTP it can be seconds. The naive detector uses receive-side timestamps (the last_seen table), which is correct — what matters for liveness is "when did I last see them", not "when did they say they sent it". But many teams instinctively log the sender-side timestamp in their detector and end up with a clock-skew bug: a sender whose clock is fast looks "stale" on receivers whose clocks are slow. The discipline: heartbeats can carry sender timestamps for diagnostic purposes (round-trip-time estimation), but the detector's decision must run on receive-side time, exclusively.
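The discipline in code, following the naming of the runnable example earlier in the chapter; the skew_estimate dictionary is an added illustrative diagnostic, not part of the detector's decision:

import json, time

def on_heartbeat(data, last_seen, skew_estimate):
    m = json.loads(data)
    arrival = time.time()                  # receive-side clock: the only input to liveness
    last_seen[m['from']] = arrival         # the detector's decision runs on this, exclusively
    # the sender's ts is diagnostic only: it measures one-way delay plus clock skew
    skew_estimate[m['from']] = arrival - m['ts']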
Why network keepalive at the OS layer is not enough
Linux has TCP keepalive (SO_KEEPALIVE), which the kernel uses to probe the peer of an established TCP connection. The defaults are absurd for distributed systems — tcp_keepalive_time = 7200 (2 hours) before the first probe, and 9 probes 75 seconds apart before declaring dead. You can tune these per-socket, but the deeper problem is that TCP keepalive only tells you about the transport, not the application. A peer can have a healthy TCP connection while its application thread is deadlocked. Application-layer heartbeats are necessary precisely because liveness is an application concept, not a transport concept. The two layers catch different failures.
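Per-socket tuning looks like this on Linux; the 10/5/3 values are illustrative, not recommendations:

import socket

def aggressive_keepalive(sock):
    # Linux-only socket options: first probe after 10 s idle, then every 5 s, dead after 3 failures
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 10)    # per-socket tcp_keepalive_time
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5)    # per-socket tcp_keepalive_intvl
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)      # per-socket tcp_keepalive_probes
    return sock

conn = aggressive_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))

Even tuned this aggressively, the peer's kernel will keep answering keepalive probes while an application thread on that peer sits deadlocked, which is exactly the gap the paragraph above describes.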
Heartbeat piggybacking and why most production systems do it
Sending a separate HEARTBEAT packet wastes a packet when there is already application traffic between the peers. Production systems often piggyback heartbeats onto existing traffic: a Raft leader's AppendEntries carries an implicit heartbeat (an empty AppendEntries is the heartbeat); a Cassandra gossip exchange carries cluster-state updates that double as liveness signals; a Kafka follower's Fetch requests to the leader double as its liveness signal for in-sync-replica tracking. The benefit is reduced packet count (~50% in busy clusters); the cost is that during application-traffic lulls, heartbeats may be artificially delayed and the detector gets confused. Most piggyback designs add a fallback: if no application traffic has been sent for H seconds, send an explicit empty heartbeat. This hybrid is what makes most production systems fast in steady state and still detect failures during application-quiet periods.
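The fallback logic is small. A sketch with hypothetical names, assuming the application layer calls note_app_send() whenever it transmits anything to the peer and a timer calls tick() every H seconds:

import time

class PiggybackHeartbeat:
    # send an explicit heartbeat only when application traffic to this peer has gone quiet
    def __init__(self, H, send_heartbeat):
        self.H = H
        self.send_heartbeat = send_heartbeat        # callback: emit an empty explicit heartbeat
        self.last_sent = time.time()

    def note_app_send(self):
        # any application packet to the peer counts as a heartbeat
        self.last_sent = time.time()

    def tick(self):
        # called every H seconds by a timer thread
        if time.time() - self.last_sent >= self.H:
            self.send_heartbeat()
            self.last_sent = time.time()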
Detector self-test — every production detector needs one
A heartbeat detector is a state machine that is expected to keep running for years without restart. It is therefore the canonical place where bugs that only manifest after billions of iterations live. Cassandra's failure detector had a bug, fixed in 2018, where the sliding window of arrival intervals could become corrupted after a clock-jump event (NTP slew larger than the heartbeat interval) and the detector would output phi=NaN permanently. The lesson: every production detector needs an internal self-test — a watchdog that verifies the detector is itself producing well-formed output. If phi is NaN or last_seen is in the future or the variance estimate has gone negative, the detector should fail loud (declare itself broken and refuse to participate in elections) rather than silently producing nonsense. The naive four-line detector has none of this; the production version of it has 200 lines, 80% of which are self-tests.
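A sketch of what such a watchdog checks; the invariants mirror the failure modes above (timestamps from the future, NaN suspicion values, negative variance), and the names are ours rather than any particular detector's:

import math, time

def detector_self_test(last_seen, phi_values, variances):
    # returns the list of violated invariants; empty means the detector may keep participating
    problems = []
    now = time.time()
    if any(t > now + 1.0 for t in last_seen.values()):
        problems.append('last_seen timestamp in the future (clock jump?)')
    if any(math.isnan(phi) or math.isinf(phi) for phi in phi_values.values()):
        problems.append('phi is NaN or infinite (corrupted arrival window)')
    if any(v < 0 for v in variances.values()):
        problems.append('negative variance estimate')
    return problems

# fail loud: a detector that cannot vouch for its own output must not vote in elections
problems = detector_self_test({'b': time.time()}, {'b': 0.3}, {'b': 0.04})
if problems:
    raise SystemExit(f'failure detector self-test failed: {problems}')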
What about MealRush — the hybrid push-pull case
MealRush's order-fan-out service runs 1,200 instances behind a service mesh. The mesh's load balancer pulls health checks every 2 seconds; the consensus layer (a Raft group of 5 instances coordinating shard ownership) pushes heartbeats every 100 ms. The two detectors run independently and disagree about 0.3% of the time — for a few seconds, the mesh routes traffic away from an instance the Raft group still considers a member, or vice versa. The team's resolution rule: the mesh's view is authoritative for traffic (don't send orders to an instance that just GC'd), the Raft group's view is authoritative for state (don't reshard until consensus says so). The two detectors are tuned for different failure modes — the mesh tolerates short pauses (because re-routing is cheap), the Raft group does not (because reshards are expensive). The lesson: one detector per cluster is a simplification that breaks down in any system with both stateful and stateless services. Each domain wants its own detector, tuned for its own cost of false positives and false negatives.
Where this leads next
The next chapter, phi-accrual failure detector, replaces the fixed timeout T with a continuous suspicion level that adapts to the local jitter distribution. The chapter after, SWIM protocol, replaces all-pairs heartbeating with gossip and indirect probing — the fix for the O(N²) traffic and the asymmetric-partition scar in one design. Gossip-based membership (Serf) is the production walkthrough: HashiCorp's memberlist, the canonical SWIM implementation, used at 5,000-node scale.
The deeper arc: this entire part of the curriculum is one long answer to the question "what is the right failure detector?". The naive heartbeat is the starting point. Phi-accrual fixes the timeout. SWIM fixes the scaling. Lifeguard and Rapid (covered later in this part) fix the asymmetric-partition and false-positive cascades that even SWIM can produce. By the end of part 10 you will have a detector that survives all three scars from this chapter — and the design will look nothing like a 23-line bash script. That journey is worth taking only if you remember why you started: because a 23-line bash script has been running PaySetu's authoriser cluster for four years and has triggered 38 outages, every one at exactly the wrong moment.
References
- Chandra, T., Toueg, S. — "Unreliable Failure Detectors for Reliable Distributed Systems" (JACM 1996) — the foundational paper that classifies detectors by completeness and accuracy; the naive heartbeat sits in the weakest box (◇P at best, ◇S realistically).
- Gupta, I., Chandra, T., Goldszmidt, G. — "On Scalable and Efficient Distributed Failure Detectors" (PODC 2001) — the paper that quantifies the O(N²) traffic problem and motivates gossip-based detection.
- etcd documentation — --heartbeat-interval and --election-timeout configuration — production defaults and their rationale; the 1:10 ratio (100 ms : 1000 ms) is the etcd team's calibrated answer to "which T/H survives gigabit-LAN jitter without flapping".
- Cassandra source — o.a.c.gms.FailureDetector.java — production heartbeat detector with phi-accrual on top; read the comments around interArrivalTime for the variance-estimation discipline.
- Kleppmann, M. — Designing Data-Intensive Applications (2017) — Chapter 8 §"Truth defined by the majority" is the practical read on why heartbeat detectors and quorums interact.
- Linux tc-netem(8) — the canonical tool for chaos-engineering network jitter and packet loss on Linux; what the runnable example uses.
- Wall: failure detection is its own problem — the chapter that frames why no detector is perfect; this chapter builds the simplest detector whose ceiling the wall describes.
- Hayashibara, N., Défago, X., Yared, R., Katayama, T. — "The φ Accrual Failure Detector" (SRDS 2004) — the next chapter's foundation; preview reading.
- Discord Engineering — "Why Discord is switching from Go to Rust" (2020) — concrete production story where heartbeat-style detection interacted badly with a runtime's GC pauses; relevant to scar 2.