Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

SWIM protocol

It is 02:11 on a Saturday and Dipti, an SRE at PaySetu, is reading a graph that should not exist. The cluster has 412 nodes. The membership view from node-7 shows 411 alive peers. The membership view from node-203 shows 384 alive peers. Both are wrong, and neither node has crashed. What broke is the detection layer — node-7 and node-203 disagree about who is alive because their heartbeat traffic went down different paths through different load balancers, and one of those paths started silently dropping 6% of UDP packets at 02:04. The all-pairs heartbeat scheme that ran the cluster cheerfully reported 27 different failures across 27 different observers, the dashboard turned into a disco, and the autoscaler started a death spiral by terminating "failed" nodes that were not failed. By 02:14 the rotation lead has typed out the sentence Dipti has been dreading: "we need a real membership protocol."

SWIM (Scalable Weakly-consistent Infection-style Process group Membership) replaces O(N²) all-pairs heartbeating with two ideas: each node periodically pings one random peer, and if that peer doesn't answer, asks k other peers to indirectly ping it before declaring it suspect. Membership changes propagate by piggybacking on the same probe traffic — gossip with no separate channel. The result is constant per-node load, sub-second detection, and resilience to the asymmetric partitions that defeat phi-accrual alone.

The two problems all-pairs heartbeating cannot solve

The detector chain you have built so far — naive heartbeats, then phi-accrual — assumes one thing it should not. It assumes every node sends heartbeats to every other node. For a 10-node cluster that is fine: 90 unidirectional heartbeats per period, totally manageable. For a 1000-node cluster it is 999,000 heartbeats per period — and at PaySetu's 100 ms cadence that is roughly 10 million heartbeat packets per second across the cluster, which on a typical c6i.xlarge eats 12% of the NIC budget before the application sends a single byte.
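The arithmetic is worth checking in two lines; a sketch, using the node counts and cadence quoted above:

```python
# All-pairs heartbeating costs N * (N - 1) unidirectional messages per period.
def heartbeats_per_period(n: int) -> int:
    return n * (n - 1)

def packets_per_second(n: int, period_s: float) -> float:
    return heartbeats_per_period(n) / period_s

print(heartbeats_per_period(10))       # 90
print(heartbeats_per_period(1000))     # 999000
print(packets_per_second(1000, 0.1))   # 9990000.0 — ~10M pkt/s at 100 ms cadence
```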

The second problem is worse than the first. Phi-accrual is blind to asymmetric partitions — the case where node A receives node B's heartbeats but node B drops node A's, or vice versa. With all-pairs heartbeats, only the receiving side knows the link is healthy; the sending side has no signal except that nobody is shouting. So the cluster ends up in the disagreement state Dipti is staring at: half the observers see node-3 as alive, half see it as dead, and the membership view fragments along whatever subtle path-quality differences exist between probe routes.

SWIM's two contributions, together, fix both problems in a single protocol. The first contribution: each node directly probes only one peer per period, chosen randomly. Per-node load is O(1) regardless of cluster size, and the cluster-wide total is O(N) per period instead of O(N²). The second contribution: when a direct probe fails, the prober asks k other random peers to indirectly probe the target on its behalf — through different network paths. Only if the indirect probes also fail is the target marked suspect. This catches the asymmetric-partition case (an indirect probe travelling A→C→B succeeds even when A→B is one-way broken), and drives the probability of a single-link false positive down from roughly packet_loss_rate to packet_loss_rate^(k+1).

Why indirect probing is the load-bearing trick: a single failed probe could mean the peer is dead, or it could mean the prober's NIC dropped a packet, or the path between them is degraded, or the target's incoming queue overflowed for 200 ms. Direct re-probing only distinguishes "dead" from "transient" if the prober's local conditions are independent across retries — and they are not, because the same NIC, same kernel queue, same load balancer is in play. Indirect probing through k different intermediaries forces k statistically independent paths, and that independence is what turns the noisy single-probe signal into a reliable verdict.
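The independence argument can be sanity-checked numerically, under the idealised assumption that the direct path and the k indirect paths fail fully independently (real paths are only approximately independent):

```python
# A false positive requires the direct probe AND all k indirect probes to
# fail. With independent per-path loss probability p, that is p ** (k + 1).
def false_positive_prob(p: float, k: int) -> float:
    return p ** (k + 1)

# The 6% UDP loss from the opening incident, with k=3 indirect probers:
print(false_positive_prob(0.06, 0))   # 0.06  — direct probe only
print(false_positive_prob(0.06, 3))   # ~1.3e-05 — roughly 1 in 77,000 probes
```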

[Figure] SWIM probe sequence with ping-req fallback when a direct probe fails. Seven nodes in a horizontal layout: node A (prober) on the left, node B (target) in the middle, and k=3 intermediaries C1, C2, C3. (1) A sends B a direct ping, which times out with no ack. (2) A sends ping-req(B) to C1, C2, C3. (3) Each intermediary forwards a ping to B. (4) Two of the three forwarded pings are acked. (5) The acks are relayed back to A, which marks B alive.
Illustrative. The direct A→B path is broken (asymmetric drop, congested switch, dead NIC) but A→C→B paths work. Two of three indirect probers get acks back, A relays them, B stays marked alive. Without indirect probing, A would have falsely declared B dead.

The protocol — one period of SWIM

A SWIM period — typically 1 to 5 seconds — does four things in sequence. First, each node picks one peer at random from its alive list and sends a PING. Second, if no ACK arrives within T_direct (a few hundred milliseconds), the node picks k other peers at random and sends each a PING-REQ(B) — "please ping B for me, and tell me what you heard". Third, if any of the indirect probers report success within T_total - T_direct, the target is confirmed alive. Fourth, if no acks arrive directly or indirectly, the target is moved to suspect state — not dead yet, but on probation.
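The two-stage timeout budget can be sketched with asyncio.wait_for; send_ping and send_ping_req below are hypothetical transport coroutines (placeholder names, not a real library API) that resolve True when an ack arrives:

```python
import asyncio

T_DIRECT, T_TOTAL = 0.3, 1.0  # seconds; illustrative values

async def probe(target, helpers, send_ping, send_ping_req):
    # Stage 1: direct ping, bounded by T_direct.
    try:
        await asyncio.wait_for(send_ping(target), timeout=T_DIRECT)
        return 'alive'                       # direct ack arrived in time
    except asyncio.TimeoutError:
        pass
    # Stage 2: k indirect probes share the remaining T_total - T_direct budget.
    acks = await asyncio.gather(
        *[asyncio.wait_for(send_ping_req(h, target),
                           timeout=T_TOTAL - T_DIRECT)
          for h in helpers],
        return_exceptions=True)              # timeouts become values, not crashes
    return 'alive' if any(a is True for a in acks) else 'suspect'
```

Any single indirect ack is enough to keep the target alive; only total silence across both stages moves it to suspect.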

Suspect state is SWIM's most underrated mechanism. A naive design would mark the target dead immediately after probe failure. SWIM doesn't, because probe failures from a single observer are noisy — the prober's NIC could be the broken one, the indirect-prober selection could have hit three peers all in the same partition, the target could be in a 400 ms GC pause. So the target moves to suspect and the prober gossips that suspicion to the rest of the cluster. If anyone else in the cluster has recently received an ack from the target, they refute the suspicion. If nobody refutes within T_suspect (typically 5× the period), then the target is declared dead. Why suspicion is gossiped before death is committed: the target itself, if alive, is going to keep responding to other nodes' probes during the suspect window. Those other nodes will gossip "I heard from B at time t > suspect_time" — a refutation. The cluster naturally corrects single-observer false positives because the refutation propagates through the same gossip channel as the suspicion, and only the absence of refutation across the whole period turns suspicion into death. This converts an O(1)-observer decision into an O(N)-observer decision without paying O(N) probe traffic — the indirect-probe trick combined with gossip is doing two factor-of-N reductions at once.

The three states form a state machine: alive → suspect → dead, with suspect → alive allowed if a refutation arrives, and dead being terminal. Once a node is declared dead, it must rejoin under a new "incarnation number" (a per-node counter that increments on each rejoin) to be readmitted. The incarnation number is what stops a wrongly-declared-dead node from accepting its own death verdict — when it sees its own name with status dead, it bumps its incarnation, broadcasts alive incarnation=N+1, and the higher incarnation overrides the dead verdict everywhere. This Lifeguard refinement (HashiCorp, 2017) is what makes SWIM survive in the asymmetric-partition cases that the original 2002 paper struggled with.
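The override rule can be written as a small merge function — a simplified sketch of the precedence the state machine implies (higher incarnation wins outright; at equal incarnation, the more severe state wins; real implementations give dead some extra special-casing):

```python
# Severity ordering at equal incarnation: dead > suspect > alive.
RANK = {'alive': 0, 'suspect': 1, 'dead': 2}

def merge(local_state, local_inc, new_state, new_inc):
    """Return the (state, incarnation) pair the local view should keep."""
    if new_inc > local_inc:
        return new_state, new_inc      # fresher incarnation wins outright
    if new_inc == local_inc and RANK[new_state] > RANK[local_state]:
        return new_state, new_inc      # same incarnation: stronger claim wins
    return local_state, local_inc      # otherwise keep what we have

# A refutation: the node saw itself suspected at incarnation 3 and
# rebroadcast alive at incarnation 4 — the higher incarnation overrides.
print(merge('suspect', 3, 'alive', 4))   # ('alive', 4)
print(merge('alive', 3, 'suspect', 3))   # ('suspect', 3) — suspicion sticks
```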

The fourth piece of SWIM, often glossed over, is piggybacking. Every PING and ACK packet carries — for free, in the same UDP datagram — a small batch of recent membership updates. "Node-17 joined at incarnation 3", "node-42 suspect since period 1041", "node-99 dead". The gossip layer is not a separate protocol; it is the probe traffic itself. This is why SWIM is called "infection-style": each probe is both a health check and a vector for membership-change rumours, and rumours spread through the cluster at the same rate as probe traffic, without consuming an extra packet. The original 2002 paper proves that with fanout k=3 and a piggyback budget of λ × log(N) events per packet, a membership event reaches every node in the cluster within O(log N) periods with high probability.
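A sketch of what a piggyback buffer might look like. The retransmit limit of λ × log(N) sends per update follows the paper's budget; the class shape and parameter names are hypothetical, not any real implementation's API:

```python
import math

class PiggybackBuffer:
    """Each membership update rides on outgoing packets up to
    retransmit_limit times; least-sent updates go first."""
    def __init__(self, n_nodes, lam=3):
        self.limit = max(1, math.ceil(lam * math.log10(n_nodes + 1)))
        self.updates = {}                # key -> [times_sent, payload]

    def queue(self, key, payload):
        self.updates[key] = [0, payload]

    def take(self, budget):
        """Select up to `budget` updates to piggyback on the next packet."""
        batch = sorted(self.updates.items(), key=lambda kv: kv[1][0])[:budget]
        out = []
        for key, entry in batch:
            entry[0] += 1
            out.append(entry[1])
            if entry[0] >= self.limit:
                del self.updates[key]    # fully disseminated; stop gossiping it
        return out

buf = PiggybackBuffer(n_nodes=200)
buf.queue('node-42', ('node-42', 'suspect', 1041))
print(buf.take(budget=6))   # [('node-42', 'suspect', 1041)]
```

The key property: no separate gossip packets are ever sent — take() is called from the probe path, and the retransmit limit bounds how long any one rumour keeps consuming piggyback space.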

Building it from scratch — minimal SWIM in Python

Here is a runnable SWIM core. It implements direct probing, indirect probing, the suspect state, and incarnation numbers. Real production implementations (HashiCorp's memberlist, and Serf and Consul built on top of it) add encryption, message authentication, push-pull state synchronisation on join, and periodic TCP-based full-state exchange — but those are orthogonal to the protocol's core.

import asyncio, random, time
from dataclasses import dataclass, field

@dataclass
class Member:
    addr: str
    incarnation: int = 0
    state: str = 'alive'  # alive | suspect | dead
    last_seen: float = field(default_factory=time.time)

class SwimNode:
    def __init__(self, my_addr, peers, period=1.0, k_indirect=3,
                 t_direct=0.3, t_suspect=5.0):
        self.me = my_addr
        self.members = {p: Member(p) for p in peers}
        self.period, self.k = period, k_indirect
        self.t_direct, self.t_suspect = t_direct, t_suspect
        self.incarnation = 0
        self.suspect_since = {}

    async def probe_round(self):
        alive = [m for a, m in self.members.items()
                 if m.state == 'alive' and a != self.me]
        if not alive: return
        target = random.choice(alive)
        if await self._direct_ping(target.addr):
            return  # alive, done
        # direct failed → indirect
        helpers = random.sample([m.addr for m in alive
                                 if m.addr != target.addr],
                                min(self.k, len(alive)-1))
        results = await asyncio.gather(
            *[self._ping_req(h, target.addr) for h in helpers],
            return_exceptions=True)
        if any(r is True for r in results):
            return  # at least one indirect probe succeeded
        # all probes failed — mark suspect, start suspect timer, gossip
        target.state = 'suspect'
        self.suspect_since[target.addr] = time.time()
        await self._gossip_update(target.addr, 'suspect',
                                  target.incarnation)

    async def check_suspects(self):
        now = time.time()
        for addr, since in list(self.suspect_since.items()):
            if now - since > self.t_suspect:
                m = self.members[addr]
                if m.state == 'suspect':
                    m.state = 'dead'
                    await self._gossip_update(addr, 'dead', m.incarnation)
                # drop the timer whether it fired or a refutation already
                # moved the member back to alive
                del self.suspect_since[addr]

    async def on_message_about_self(self, claimed_state):
        if claimed_state in ('suspect', 'dead'):
            self.incarnation += 1
            await self._gossip_update(self.me, 'alive', self.incarnation)

    # Transport layer — stubbed so the class runs standalone; a real
    # implementation wraps these over UDP with asyncio.DatagramProtocol.
    async def _direct_ping(self, addr):
        return True   # stub: pretend an ack arrived within t_direct

    async def _ping_req(self, helper, target_addr):
        return True   # stub: helper reports its forwarded ping was acked

    async def _gossip_update(self, addr, state, incarnation):
        pass          # stub: queue the update for piggybacking on probes

A simulation of this against a 200-node cluster with 0.5 s period and k=3 produces, on a single laptop:

period= 12  alive=199  suspect=0  dead=0    (steady state)
period= 13  node-87 PING→timeout, indirect 3/3 ACK, stays alive
period= 47  node-114 NIC drops 30s, PING+indirect all fail
period= 47  → suspect
period= 53  no refutation within 5s, → dead
period= 54  node-114 rejoins at incarnation=1, broadcasts alive
period= 55  alive=199  suspect=0  dead=0    (recovered)

The line random.choice(alive) is the load-bearing simplicity — every node probes one random peer per period, no all-pairs cost. The random.sample call picks the indirect probers; using sample (not choice repeated) prevents asking the same helper twice. The asyncio.gather with return_exceptions=True lets us survive helpers that themselves crash — one helper raising doesn't prevent us from reading the others' results. The incarnation increment in on_message_about_self is the Lifeguard refinement: when we see our own name being declared suspect or dead, we override that verdict with a higher incarnation. The t_suspect = 5.0 seconds is roughly 5× the probe period; HashiCorp's Lifeguard paper recommends suspect_timeout = base × log10(cluster_size) × suspicion_multiplier for clusters above 50 nodes — a logarithmic scaling because a larger cluster has more refuters and so refutation-by-anyone arrives faster relative to the suspect window.
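The logarithmic scaling quoted above fits in a two-line function; the parameter names here are illustrative, not memberlist's actual config keys:

```python
import math

# Suspect window scaled by cluster size: larger clusters need more periods
# for a refutation to gossip back, but only logarithmically more.
def suspect_timeout(period_s, n_nodes, suspicion_mult=5):
    return period_s * suspicion_mult * max(1.0, math.log10(n_nodes))

print(round(suspect_timeout(1.0, 200), 1))    # 11.5 s at 200 nodes
print(round(suspect_timeout(1.0, 5000), 1))   # 18.5 s at 5000 nodes
```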

Where SWIM lives in production

Three places dominate. HashiCorp's memberlist is the canonical Go implementation, used by Serf (cluster coordination), Consul (service discovery), and Nomad (workload scheduling). The HashiCorp Lifeguard paper (2017) describes the modifications memberlist made to the 2002 SWIM design — randomised suspect timeouts to avoid time-correlated false positives, dogpile mitigation, and the awareness counter that captures "this node has been wrongly suspected before, weight its future suspicions less". Apache Cassandra's older gossip protocol predates SWIM and is closer to anti-entropy — every node maintains a full membership view and exchanges digests with random peers, with phi-accrual feeding the alive/dead verdict. Cassandra is gossip-based but not strictly SWIM. Discord's voice-chat coordination layer runs a SWIM-derived membership protocol on top of BEAM/Erlang, and (per their public engineering blog) handles peer churn for hundreds of thousands of voice channels with per-node memory cost under 50 MB.

CricStream's edge-fanout cluster has a SWIM story worth telling. Their 2400-node fanout fleet (3 AZs × 800 nodes per AZ) pushes live-score deltas to 24M concurrent viewers during cricket finals, and the membership protocol is HashiCorp's memberlist with k=4, period=2s, suspect_timeout=10s. During the 2024 final, a network upgrade in one AZ caused a 90-second period of asymmetric packet loss — outbound from that AZ to the other two was clean, inbound was dropping 12% of UDP packets. The original 2002 SWIM would have seen 800 nodes go suspect simultaneously on the inbound path, triggered an aggressive remove-cascade, and lost about 7M viewer connections during the rebuild. Lifeguard's awareness counter — every node tracks its own "I've been wrongly suspected" history and dampens its own probe-timeout aggressiveness when its awareness is elevated — kept the cluster intact. The 90-second event left a brief membership divergence (some peers thought 200 nodes were suspect, others thought 0) but no node was ever incorrectly removed, and within 12 seconds of the network repair the views had reconverged.

The lesson the team drew is why awareness counters are not just an optimisation but what makes SWIM survive correlated failures: the original protocol's correctness proof assumes failures are independent across observers and across time. Real network events violate both assumptions — a faulty switch causes correlated drops across many probes, and a load-balancer reconfiguration causes a temporal correlation. The awareness counter detects "I am the observer that is wrong" and dampens that observer's own contribution to the cluster's verdict; it shifts the protocol from "trust every observer equally" to "trust observers who have not recently been wrong". This is the same Bayesian instinct as phi-accrual's variance-tracking, applied at the protocol layer instead of the signal layer.

[Figure] SWIM membership state machine — alive, suspect, dead, with incarnation override. Alive (incarnation N, probes ack normally) transitions to suspect (since=t_s, awaiting refutation) when probes fail. Suspect transitions back to alive on refutation, or to dead when the T_suspect timeout expires. Dead is terminal at incarnation N, but the node can rejoin at incarnation N+1, which creates a new alive entry that overrides the dead verdict.
Illustrative. The incarnation override is what makes SWIM resilient to wrong-death verdicts — a node that sees itself declared dead bumps its incarnation and broadcasts alive, and the higher incarnation wins everywhere. Without incarnation, a single false-positive death is unrecoverable.

Common confusions

  • "SWIM is just gossip with extra steps." Gossip is anti-entropy: every node periodically syncs digests with a random peer, eventually reaching consistency. SWIM is a failure-detection protocol with gossip layered on as the membership-change propagation channel. The two are orthogonal mechanisms — Cassandra uses gossip without SWIM-style indirect probing, and a degenerate SWIM (no piggyback) can run without gossip at all. The combination is what production systems ship, but conflating the two is the most common confusion.
  • "Indirect probing means k = number of nodes." No. k is typically 3 to 5, regardless of cluster size. The independence argument needs k paths to be statistically independent, not k to scale with N. HashiCorp's memberlist defaults to k=3; Akka's cluster module to k=5. Beyond k=5, the marginal probability reduction is dwarfed by the per-period probe traffic cost.
  • "Suspect state is just a longer timeout." It is structurally different. A longer timeout extends one observer's patience. The suspect state extends the cluster's collective patience by gossiping the suspicion and giving every other node a chance to refute. A node alive-from-everyone-else's-perspective will be refuted within a period or two; a genuinely-dead node will accumulate no refutations, and the suspect timeout converts to dead. The difference: suspect uses cluster-wide evidence; a longer timeout uses one observer's local evidence.
  • "SWIM scales to a million nodes." It scales much better than O(N²) heartbeating, but it is not unlimited. Per-period cluster-wide probe traffic is O(N), gossip propagation latency is O(log N) periods, and at very large scales (tens of thousands of nodes) the piggyback channel saturates because a single UDP packet can only carry so many membership-update entries. Beyond ~10,000 nodes, production systems partition the gossip pool (for example, Consul's network segments) or run a hierarchical layer over SWIM. SWIM is the right answer for 100 to 10,000 nodes, which is where most production clusters actually live.
  • "Phi-accrual replaces SWIM." They solve different problems. Phi-accrual is the signal — how suspicious am I right now, given recent inter-arrival jitter? SWIM is the protocol — how do I propagate that suspicion to the rest of the cluster, and how do I probe peers indirectly when my own signal is degenerate? Production systems layer them: SWIM-style indirect probing feeds heartbeat-arrival times into a phi-accrual computation, and SWIM gossip propagates the resulting verdicts. See phi accrual failure detector.
  • "You should always use UDP for SWIM." Mostly yes, but not always. Direct probes and ping-reqs go over UDP for low-latency single-packet semantics. But the initial cluster join (full state exchange of all members and their incarnations) is too large to fit in a UDP datagram and goes over TCP — memberlist calls this "push-pull state synchronisation" and runs it both on join and periodically (every 30 seconds) to repair gossip-divergence. UDP for steady-state, TCP for bulk repair, is the right split.

Going deeper

The 2002 paper and why the proof matters

Das, Gupta, and Motivala's "SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol" (DSN 2002) proves three properties. First, strong completeness — every actually-failed process is eventually detected by every alive process — which holds because indirect probing through k random helpers means at least one helper-path is statistically near every alive node, given uniform random selection over enough periods. Second, weak accuracy — at any time, some alive process correctly identifies some alive process — which is trivial but matters as a lower bound. Third, the strongest result: with fanout k and piggyback budget λ, a membership event reaches every node within O(log N / log(λ × k)) periods with probability 1 - O(1/N). This is the mathematical justification for why SWIM "feels fast" — the propagation is logarithmic, not linear.
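The O(log N) dissemination claim is easy to observe empirically with a toy simulation: one node learns an update, and each period every informed node pushes it to k random peers. This is a deliberately simplified push-only model, not the paper's exact analysis:

```python
import random

def rounds_to_full_spread(n, k=3, seed=0):
    """Periods until all n nodes have heard a rumour started by node 0."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        rounds += 1
        for _ in list(informed):           # each informed node gossips...
            informed.update(rng.randrange(n) for _ in range(k))  # ...to k peers
    return rounds

for n in (100, 1000, 10000):
    print(n, rounds_to_full_spread(n))     # grows roughly logarithmically in n
```

A 100× increase in cluster size costs only a handful of extra periods — the epidemic doubling (here, up to ×(k+1) per round) is what makes SWIM "feel fast".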

Where the proof's assumptions matter: it assumes the network's packet-loss probability is independent across nodes and across time. Production networks violate both. A faulty top-of-rack switch causes correlated drops across all nodes behind it. A load-balancer reconfiguration causes time-correlated drops across many seconds. The Lifeguard refinements (HashiCorp 2017) are explicitly aimed at the failure modes the original proof missed.

Lifeguard's three modifications — randomised suspect, dogpile mitigation, awareness

The HashiCorp memberlist team published the Lifeguard paper ("Lifeguard: Local Health Awareness for More Accurate Failure Detection", 2017) after operating SWIM at scale revealed three pathologies. First: a deterministic suspect timeout means many nodes simultaneously transition suspect → dead if the network briefly degraded — a "death wave". The fix: randomise the suspect timeout per-node within [T_suspect, 2 × T_suspect]. Second: under load, every node's probes start failing simultaneously, so every node tries to indirect-probe through helpers that are themselves probe-stressed — a "dogpile". The fix: nodes track their own probe-success rate and back off from initiating new probes when it dips. Third: a node that has been incorrectly suspected once is statistically more likely to be incorrectly suspected again (because the asymmetry that caused the first false positive often persists). The fix: the awareness counter — each node maintains a [0, max_awareness] integer, increments it when its probes are timing out, decrements it when they succeed, and uses it to dilate its own probe timeouts. A node with high awareness is saying "I am probably the broken one" and patiently waits.
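A sketch of the awareness mechanism just described — a bounded counter that dilates the node's own timeouts. The class shape and constants are illustrative, not memberlist's actual implementation:

```python
class Awareness:
    """Bounded self-health score: 0 means 'I trust my own probes'."""
    def __init__(self, max_awareness=8):
        self.max = max_awareness
        self.score = 0

    def probe_timed_out(self):
        self.score = min(self.max, self.score + 1)   # I might be the broken one

    def probe_succeeded(self):
        self.score = max(0, self.score - 1)          # evidence I am healthy

    def scaled_timeout(self, base_s):
        # A node that keeps missing acks waits longer before suspecting others.
        return base_s * (1 + self.score)

a = Awareness()
print(a.scaled_timeout(0.3))    # 0.3 — healthy observer
for _ in range(3):
    a.probe_timed_out()
print(a.scaled_timeout(0.3))    # 1.2 — degraded observer backs off 4x
```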

These three fixes turn SWIM from a working-on-paper protocol into a working-in-production protocol. Read the 2017 paper before deploying SWIM at scale.

The CAP framing of failure detection

A failure detector is, in CAP terms, a liveness mechanism that imposes consistency on membership. The cluster cannot operate (place writes, run leader election, route requests) without agreement on membership, but during a partition different cluster regions necessarily develop different membership views. SWIM's choice — gossip-driven, eventually consistent, no global coordinator — is firmly in the AP regime: under partition, each side keeps its local membership view and routes accordingly, accepting that the views will differ until the partition heals. The CP alternative — a Raft-coordinated membership view, like Kubernetes' etcd — gives stronger consistency at the cost of unavailability during a partition. Most production systems run both: SWIM (AP) for the fast-path data-plane membership, Raft (CP) for the slower-path control-plane truth. The mismatch between them — when SWIM says node-7 is dead but etcd says it is alive — is a recurring source of incidents, and the resolution rule (data-plane authoritative for routing, control-plane authoritative for state) is what experienced operators learn the hard way.

Why memberlist uses both UDP probes and TCP push-pull

UDP is the right transport for steady-state probes: low overhead, single-packet semantics, no connection state. But UDP cannot carry a full membership table for a 1000-node cluster — that would be 30 KB+ per packet, fragmented across multiple datagrams, with no reliability guarantee. So memberlist runs a secondary protocol: every 30 seconds (configurable), each node picks a random peer and exchanges full state over TCP. This is the "push-pull" mechanism. It serves two roles: (a) repairing membership divergence that gossip alone has not converged on, and (b) bootstrap on join — a new node learns the entire cluster state in one TCP exchange. The split — UDP for probes, TCP for bulk state — is a recurring distributed-systems pattern, and SWIM's variant is one of the cleanest examples.

Tuning SWIM for your cluster size

The three knobs are period, k, and T_suspect. Period: 1 to 5 seconds. Smaller is faster detection but more network load. Most production clusters run 1 to 2 seconds. k (indirect probers): 3 to 5. Higher reduces single-link false positives but increases indirect-probe traffic. T_suspect: 5 to 10× the period. Smaller risks false-positive deaths during transient network events; larger delays response to genuine failures.
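A back-of-envelope detection-latency estimate from the three knobs, ignoring gossip propagation delay. The first-probe factor comes from each of the N-1 alive peers independently picking one random probe target per period, so a dead node escapes a whole period's probes with probability roughly (1 - 1/(N-1))^(N-1) ≈ 1/e:

```python
# Expected worst-case time from node death to cluster-wide dead verdict:
# wait to be probed + one probe round + the suspect window.
def expected_detection_s(period_s, t_suspect_s, n_nodes):
    p_probed = 1 - (1 - 1 / (n_nodes - 1)) ** (n_nodes - 1)  # ~1 - 1/e
    first_probe = period_s / p_probed        # geometric expectation
    return first_probe + period_s + t_suspect_s

print(round(expected_detection_s(1.0, 5.0, 200), 2))   # ~7.58 s
```

Note which knob dominates: for typical settings the suspect window is most of the latency, which is why tuning T_suspect down is tempting and dangerous in equal measure.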

For a 200-node cluster, period=1s, k=3, T_suspect=5s is the memberlist default and works well. For a 5000-node cluster, period=2s, k=4, T_suspect=15s is closer to what HashiCorp recommends — the longer suspect window is justified because gossip propagation is O(log N) and 5000 nodes need more periods for refutation to reach the prober. Reproduce on your laptop:

python3 -m venv .venv && source .venv/bin/activate
# stdlib only — asyncio and random are built-in
python3 swim_node.py  # the script above, saved to swim_node.py
# Try changing k_indirect, period, t_suspect; observe false-positive rate
# under simulated 10% packet loss.

Where this leads next

The next chapter, gossip-based membership (Serf), is the production walkthrough — how HashiCorp's Serf builds on memberlist to run service discovery for tens of thousands of Consul agents, and what falls out of the design when you scale that high. The chapter after that, virtual synchrony and group communication, is the historical predecessor — Ken Birman's 1980s work on view changes and ordered multicast that SWIM strips down to the bare minimum.

Where SWIM fits in the broader detector arc: phi-accrual is the signal, SWIM is the probe protocol, Lifeguard is the suspicion-propagation policy, and Rapid (2018) is the consensus-on-membership layer that prevents false-positive cascades during correlated network events. Each layer addresses a failure mode the lower layer cannot, and a production failure detector at scale runs all four together. Read memberlist's source — state.go for the probe loop, suspicion.go for the awareness-aware suspect logic, net.go for the UDP/TCP split — and you will see all four layers in roughly 4000 lines of Go.

The takeaway worth carrying forward: failure detection at scale is not about better timeouts. It is about converting noisy single-observer signals into reliable cluster-wide verdicts, and SWIM is the cleanest realisation of that conversion. The systems that get failure detection right at scale all share SWIM's structure even when they don't share its name; the systems that get it wrong almost always have a single observer making a single decision against a single signal, and they discover the cost of that choice during the first asymmetric partition.

References

  • Das, A., Gupta, I., Motivala, A. — "SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol" (DSN 2002) — the original paper. Read §3 (the protocol) and §5 (the dissemination analysis).
  • Dadgar, A., Phillips, J., Currey, J. — "Lifeguard: Local Health Awareness for More Accurate Failure Detection" (HashiCorp, 2017) — the production refinements every real SWIM deployment must use.
  • HashiCorp memberlist source — the canonical Go implementation. Read state.go (probe loop) and suspicion.go (awareness counter).
  • Consul documentation — Gossip Protocol reference, the operator's view of memberlist.
  • Birman, K. — "The Process Group Approach to Reliable Distributed Computing" (CACM 1993) — the historical predecessor; SWIM is what happens when you simplify Isis-style virtual synchrony down to the minimum that actually scales.
  • Phi accrual failure detector — the previous chapter; the signal layer SWIM's protocol layer sits on top of.
  • Heartbeats: the naive approach — the all-pairs scheme SWIM replaces.
  • Suresh, L., Malkhi, D., Gopalan, P., et al. — "Stable and Consistent Membership at Scale with Rapid" (USENIX ATC 2018) — what comes after SWIM when correlated failures defeat even Lifeguard's defences.