The network: partitions, asymmetric reachability, gray failures

At 03:11 on a Tuesday, Riya is on-call for PaySetu's settlement cluster. Her phone vibrates with a page that has no precedent in the runbook: wallet-leader in ap-south-1a reports the cluster is healthy, every follower's heartbeat is green, and the replication lag dashboard shows 4 ms p99. Meanwhile wallet-follower-2 in ap-south-1b reports the leader has been unreachable for 47 seconds, has just held an election, and is now serving reads from itself with a stale commit-index. Two replicas, both convinced the cluster is fine, both convinced the other one is the problem. Neither of them is wrong about what they observe. The network between them is doing the one thing the textbook never quite covers — it is letting traffic flow in one direction and dropping it in the other, while the load-balancer's /health probe happily reports both nodes as 200 OK. This chapter is what you do when the network has stopped being a binary on/off and started being a graph that disagrees with itself.

Network failure is rarely a clean partition. Real production networks fail asymmetrically — A reaches B but B does not reach A — and they fail partially — heartbeats survive but data-plane traffic is dropped. The protocol you build must treat reachability as a per-direction, per-flow probability, not a cluster-wide truth, or it will produce split-brains and stuck quorums on every borderline incident.

What "partition" actually means in production

The textbook picture of a network partition is a clean cut: cluster on the left, cluster on the right, no traffic flows across the gap, the partition heals at some later time and the two halves reconcile. Real production networks almost never do this. The actual taxonomy of network failure modes, ranked by how thoroughly they violate the textbook intuition, is the following.

| Mode | What you see | What's actually happening | Rough frequency in cloud datacentres |
|---|---|---|---|
| Complete partition | All traffic between two sets of nodes drops | Switch port down, ACL misconfiguration, fibre cut | ~5–10% of all incidents |
| Asymmetric partition | A→B traffic flows, B→A traffic drops (or vice versa) | One-way ACL, asymmetric routing, NAT timeout | ~20–25% |
| Partial partition | Some flows between A and B drop, others succeed | Buggy NIC, MTU mismatch, ECMP hash collision dropping one flow-tuple | ~15–20% |
| Gray failure | Health-checks succeed, real workload fails | TCP keepalive survives but epoll queue is wedged; /health cached but /api blocked | ~30–40% — most common |
| Flapping | Connectivity oscillates on second-to-minute timescales | Spanning-tree convergence, BGP route flap, congestion-induced retransmit | ~10–15% |

The headline finding of the empirical studies (Microsoft 2011, Google 2016, Facebook 2020) is that complete, symmetric partitions are the rarest mode — they are the easiest to detect, the easiest to handle, and the least common. The frequent mode in production is gray failure: a node is partly working, in a way that some observers see as healthy and some see as dead. The reason this matters is that every consensus protocol's safety/liveness analysis is written for the symmetric, total partition model — a node is in cohort A or cohort B, the cohorts cannot talk, the protocol elects within each cohort, the smaller cohort blocks. That analysis does not survive contact with asymmetric or gray failure, and the production behaviours you observe under those modes — dual leaders, stuck followers, read-your-writes violations on a "healthy" cluster — are exactly the protocol's behaviour outside its assumed model.

[Figure: Four modes of network failure in a five-node cluster — a 2×2 grid of small node diagrams. Top-left: complete (symmetric) partition, all traffic between the two sides dropped. Top-right: asymmetric partition, A→B ok while B→A drops. Bottom-left: partial, flow-specific partition, A→B ok while A→C drops because one ECMP path is bad. Bottom-right: gray failure, the load balancer's /health probe gets 200 OK while the application's /api call times out.]
Illustrative — the four canonical modes drawn on a small cluster. The bottom-right (gray failure) is the most common in production and the hardest for naive failure detectors to catch.

Why asymmetric reachability is so common in cloud: every cloud network is a stack of NAT gateways, security groups, route tables, and ECMP-hashed multipath links. Each layer is a direction-specific policy — an ACL allowing inbound TCP/443 from 10.0.0.0/16 says nothing about outbound traffic, which is filtered by a separate egress ACL. A misconfiguration on either side produces an asymmetric partition. The 2017 AWS S3 outage's cascading failures included an asymmetric reachability incident traced to a stale NAT gateway entry — packets from the storage tier could reach the metadata tier, but replies could not return until the entry expired 90 seconds later.

Walking through asymmetric partition in code

Reading about asymmetric partitions is one thing; watching what a consensus protocol does when one direction of communication fails is another. The Python simulation below builds a five-node Raft-style cluster, lets it elect a leader, then injects an asymmetric partition (the followers can still reach the leader and each other, but the leader's heartbeats no longer reach the followers). The point is to watch the protocol's failure-detection and re-election logic decide what is happening — and to notice that, from the leader's perspective, everything looks fine until it does not.

# asymmetric_partition.py — simulate a Raft-style cluster under asymmetric partition
import random, time
from collections import defaultdict

class Network:
    def __init__(self, n):
        self.n = n
        self.allow = defaultdict(lambda: True)  # (src,dst) -> bool

    def can_send(self, src, dst):
        return self.allow[(src, dst)]

    def break_link(self, src, dst):
        self.allow[(src, dst)] = False

class Node:
    def __init__(self, idx, net, n):
        self.idx, self.net, self.n = idx, net, n
        self.term, self.role = 0, "follower"
        self.last_heartbeat = time.time()
        self.election_timeout = random.uniform(0.30, 0.45)
        self.votes_for_me = 0

    def send_heartbeat(self, peers, t):
        delivered = 0
        for p in peers:
            if p.idx != self.idx and self.net.can_send(self.idx, p.idx):
                p.last_heartbeat = t; delivered += 1
        return delivered

    def tick(self, peers, t):
        if self.role == "leader":
            self.send_heartbeat(peers, t)
            return
        if t - self.last_heartbeat > self.election_timeout:
            # no heartbeat for a full election timeout: become candidate, solicit votes
            self.term += 1; self.role = "candidate"; self.votes_for_me = 1
            for p in peers:
                # a vote only counts if the request AND the reply can both get through
                if p.idx != self.idx and self.net.can_send(self.idx, p.idx) and self.net.can_send(p.idx, self.idx):
                    if p.term < self.term:
                        # the voter adopts the higher term, reverts to follower, resets its timer
                        p.term = self.term; p.role = "follower"; p.last_heartbeat = t
                        self.votes_for_me += 1
            if self.votes_for_me > self.n // 2:
                self.role = "leader"; self.last_heartbeat = t

def run(steps=120, partition_at=40):
    net = Network(5); nodes = [Node(i, net, 5) for i in range(5)]
    nodes[0].role = "leader"; nodes[0].term = 1
    t = 0.0
    for s in range(steps):
        t += 0.05
        if s == partition_at:
            # leader 0's heartbeats to the followers are DROPPED from here on;
            # follower->leader and follower<->follower traffic still flows
            for j in range(1, 5): net.break_link(0, j)
            print(f"--- t={t:.2f} ASYMMETRIC PARTITION INJECTED ---")
        for nd in nodes: nd.tick(nodes, t)
        if s % 10 == 0 or s in (partition_at, partition_at+5, partition_at+15):
            roles = "".join(("L" if n.role=="leader" else ("C" if n.role=="candidate" else "f")) for n in nodes)
            terms = ",".join(str(n.term) for n in nodes)
            print(f"t={t:5.2f}  roles={roles}  terms=[{terms}]")

run()

Sample run (which follower wins the election, and exactly when, varies from run to run with the random timeouts):

t= 0.05  roles=Lffff  terms=[1,1,1,1,1]
t= 0.55  roles=Lffff  terms=[1,1,1,1,1]
t= 1.05  roles=Lffff  terms=[1,1,1,1,1]
t= 1.55  roles=Lffff  terms=[1,1,1,1,1]
--- t=2.05 ASYMMETRIC PARTITION INJECTED ---
t= 2.05  roles=Lffff  terms=[1,1,1,1,1]
t= 2.30  roles=Lffff  terms=[1,1,1,1,1]
t= 2.55  roles=LLfff  terms=[1,2,2,2,2]
t= 3.05  roles=LLfff  terms=[1,2,2,2,2]
t= 3.55  roles=LLfff  terms=[1,2,2,2,2]
t= 5.05  roles=LLfff  terms=[1,2,2,2,2]

Read what just happened. At t=2.05 the asymmetric partition is injected — the followers can still send to leader 0 (and to each other), but leader 0's heartbeats never arrive. The leader keeps firing heartbeats every 50 ms and, because this simple model never waits for acknowledgements, has no way to know they are being dropped; from its perspective nothing has changed. The followers, meanwhile, stop hearing from the leader. Follower 1 happens to have the tightest election timeout: a few hundred milliseconds after the last delivered heartbeat it declares the leader dead, increments its term, and requests votes. Followers 2, 3, and 4 grant them — the vote round-trip only needs the follower-to-follower links, which are intact, and node 0 is never consulted because the simulator only counts votes over links that work in both directions — and by the t=2.55 snapshot there are two leaders: node 0 still believes it is leader at term 1, and node 1 is leader at term 2. The "split-brain" in this run is not a bug in the simulator; it is exactly what asymmetric partition does to a protocol that assumes symmetric reachability. In a real Raft, the term number acts as a fencing token — node 0 cannot commit new writes without an ack quorum, and any write that conflicts with the higher-term leader is eventually rejected — but until that conflict is observed, both leaders accept reads, and the dual-leader window can extend for the full duration of the asymmetric partition — minutes, sometimes hours.

Why this is worse than a symmetric partition: in a symmetric partition, both sides know they cannot reach each other; the minority side blocks (it cannot form quorum), the majority side proceeds, and on heal the minority gets re-replicated from the majority. In an asymmetric partition, the isolated side does not necessarily know it has been isolated — its sends complete without error, it just never learns whether they arrived — and so it does not block. Both sides may accept writes simultaneously, and reconciling is hard because the two write streams may overlap in time and key-space. Fencing tokens (Part 9) close this gap by making the storage layer reject the stale leader's writes; phi-accrual failure detectors (Part 10) close it by treating "I sent a heartbeat but got no ack" as escalating suspicion.

Why the election timeout in the simulation has a random.uniform(0.30, 0.45) jitter: Raft requires randomised election timeouts to break ties when multiple followers detect leader silence simultaneously. Without the jitter, all followers would time out at the same instant, all become candidates, all split the vote, and no one wins — the cluster would oscillate. The random.uniform is the same randomised-backoff trick Ethernet uses for collision avoidance, applied to leader election. Raft's guidance is to draw timeouts uniformly from an interval like [T, 2T], with T an order of magnitude larger than the heartbeat interval (the paper's worked example is 150–300 ms); the simulation uses 50 ms heartbeats and timeouts in [300, 450] ms, giving the leader ~6 heartbeat opportunities before any follower considers an election.

Production stories — recognising network pathology in your dashboard

Each of the modes above has a distinctive signature on the dashboards a senior on-call engineer reads. The skill is matching dashboard pattern → failure mode → mitigation, fluently, without re-reading the spec.

Asymmetric reachability — PaySetu's wallet cluster, ap-south-1, August 2024. A security-group change pushed at 14:22 inadvertently dropped outbound TCP/9100 (the consensus port) from one wallet replica towards one specific peer's CIDR — every other peer, and all inbound traffic, was unaffected. The replica continued sending heartbeats and AppendEntries acks happily; they were silently dropped on the way out. The peer it could no longer reach happened to be the leader. For 23 minutes the leader treated the replica as dead (no acks coming back), the replica treated the leader as healthy (heartbeats still arriving), and the cluster ran with reduced redundancy (4 of 5 replicas acking instead of 5) without anyone noticing, because user-facing latency was unaffected. The breakage was found only when a second, unrelated rolling upgrade left the cluster at 3-of-5, and writes paused. Postmortem: ₹0 of customer impact, but the security-group review process now requires a synthetic data-plane probe, not just /health, before any change is considered rolled out.

Partial partition (flow-specific) — CricStream's edge cache during the 2024 World Cup final. The CDN's anycast routing gave each viewer one of 12 ECMP paths from the edge POP to the origin cluster. One specific path through one specific ASN had a buggy router dropping packets with a particular flow-tuple hash. Approximately 1 in 12 viewers experienced 3-second buffering events every ~40 seconds while the other 11 in 12 saw flawless playback. The global success-rate panel showed 99.97% of requests succeeding — the failures of the affected ~8% of viewers were hidden in the noise of normal connection churn. The detection signal was a Twitter spike from the affected viewers, not the dashboards. Mitigation: per-ECMP-path active probing was added so a single bad path is dropped from the rotation within 30 seconds.

Gray failure — KapitalKite's order-matching cluster, market-open spike October 2024. The order-matching node om-3 had a kernel epoll queue that wedged after 2 million events — the bug was a known kernel regression but the patch had not yet been rolled out. The node's /health endpoint, served by a separate lightweight thread, continued returning 200 OK. The matching engine itself stopped accepting orders. The load balancer kept routing 1/5 of orders to om-3 for 14 minutes. The detection signal was that order-confirmation latency p99 went from 12 ms to 9 seconds for a slice of traffic, while every health-check stayed green. Postmortem: synthetic order-placement probes with end-to-end correctness checks now run every 250 ms; the load balancer treats those signals as authoritative, not /health.

[Figure: Gray failure at KapitalKite — two observers, same node, different verdicts. The load balancer's /health probe gets 200 OK in 12 ms from om-3, whose lightweight health thread is fine; the order client's place_order RPC times out after 9 s because om-3's epoll queue is wedged. The load balancer keeps sending traffic to om-3 for 14 minutes and orders fail with 504.]
Illustrative — the gray-failure pattern is two observers disagreeing about the same node. The load balancer's `/health` probe says healthy, the actual application's RPC says dead. Mitigations always come down to "make the load balancer's probe look more like the real workload".

The pattern across all three stories is the same — the failure detector and the workload disagreed, and the failure detector won. The detector said "healthy" because its observation path was structurally narrower than the workload's path. Fixing this requires either making the detector's path wider (synthetic transactions that exercise the real workload) or making the workload's failure feedback faster (deadlines and circuit breakers that classify a slow node as failed without waiting for the detector to catch up). Most mature production systems do both.

A second pattern, less obvious: the detection lag — the time between failure onset and dashboard red — is itself the operational cost of the failure mode. For complete partitions, detection lag is one heartbeat-timeout interval, typically 1–3 seconds. For asymmetric partitions, it is until something forces the discrepancy to surface (a write that requires both directions to ack, a peer-to-peer probe, a rolling restart) — easily minutes to hours. For gray failures, detection lag is the time until either a synthetic probe detects the wedged data plane or a customer-support ticket surfaces the user-visible impact — typically tens of minutes. The protocol's "self-healing" capacity is bounded above by the detection lag — the cluster cannot start re-electing or re-routing until it knows there is a problem, and during the lag window any writes accepted are at risk of becoming the corrupting writes the post-mortem will discuss. Optimising detection lag — by adding probe diversity, lowering thresholds where false-positives are tolerable, paying for synthetic transactions — is the single most leveraged investment in distributed-systems reliability, and it is what most modern SRE teams spend their reliability-engineering budget on.

How protocols defend against this — and what is still unsolved

The mitigations stack from cheap-and-partial to expensive-and-complete. You almost never deploy all of them; you deploy the cheapest combination that keeps your specific failure budget below your specific tolerance.

Bidirectional probes. Heartbeats sent in both directions, every interval, with separate counters for inbound and outbound success rates. A node tracks "I sent N heartbeats and received M acks; I received P heartbeats from the peer and acknowledged Q of them". Asymmetric reachability shows up as the two directions diverging — M falls far below N while P keeps arriving at the normal rate, or P dries up while the peer's acks show your own heartbeats are still landing — and the cluster's failure detector escalates suspicion accordingly. The cost is doubled heartbeat traffic and two-counter bookkeeping per peer; the benefit is that asymmetric partitions become visible, which is most of what a senior on-call engineer wants from the failure detector.
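
A minimal sketch of that bookkeeping — the PeerStats class, its field names, and the 0.7 healthy-fraction threshold are illustrative choices for this chapter, not a particular library's API:

# Sketch: per-peer directional counters for bidirectional probing.
# PeerStats, its fields, and the 0.7 threshold are illustrative, not a standard API.
from dataclasses import dataclass

@dataclass
class PeerStats:
    sent: int = 0       # heartbeats I sent to this peer
    acked: int = 0      # acks I received for them
    received: int = 0   # heartbeats this peer sent me
    replied: int = 0    # acks I sent back

    def outbound_ok(self) -> float:
        return self.acked / self.sent if self.sent else 1.0

    def inbound_ok(self) -> float:
        # assumes peers heartbeat at the same rate, so `sent` is the expected count
        return self.received / self.sent if self.sent else 1.0

    def verdict(self, threshold: float = 0.7) -> str:
        out_ok, in_ok = self.outbound_ok() >= threshold, self.inbound_ok() >= threshold
        if out_ok and in_ok:
            return "healthy"
        if out_ok != in_ok:
            return "asymmetric"      # one direction works, the other is dropping
        return "unreachable"         # both directions failing: treat as partitioned

# PeerStats(sent=100, acked=3, received=98, replied=98).verdict()  -> "asymmetric"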

Synthetic data-plane probes. Instead of /health, the failure detector executes a synthetic version of the real workload — for KapitalKite, a synthetic 1-share order in a test ISIN; for PaySetu, a synthetic ₹1 transfer between two test wallets; for CricStream, a synthetic playback request fetching the first segment of a known asset. The probe goes through every layer the real request would, hits every queue, every cache, every database. A wedged epoll queue fails the probe; a stuck WAL-fsync fails the probe; a misconfigured ACL fails the probe. The cost is non-trivial workload (synthetic transactions cost real CPU and real database writes), the benefit is gray failures become observable.
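
A sketch of such a probe loop for the KapitalKite example, assuming a hypothetical order-API client with place_order/cancel_order methods and a test instrument that never touches real order books — the names, cadence, and failure budget are all placeholders:

# Sketch: synthetic data-plane probe that exercises the real order path.
# place_order/cancel_order, TEST_ISIN, and the 250 ms cadence are illustrative placeholders.
import time

TEST_ISIN = "TEST0000001"   # synthetic instrument, never matched against real books
PROBE_DEADLINE_S = 0.5
PROBE_INTERVAL_S = 0.25

def probe_once(client) -> bool:
    """Place and cancel a 1-share synthetic order through the full data plane."""
    start = time.monotonic()
    try:
        ack = client.place_order(isin=TEST_ISIN, qty=1, deadline=PROBE_DEADLINE_S)
        client.cancel_order(ack.order_id, deadline=PROBE_DEADLINE_S)
    except Exception:
        return False
    return (time.monotonic() - start) <= PROBE_DEADLINE_S

def probe_loop(client, mark_unhealthy, mark_healthy, fail_budget=3):
    consecutive_failures = 0
    while True:
        if probe_once(client):
            consecutive_failures = 0
            mark_healthy()
        else:
            consecutive_failures += 1
            if consecutive_failures >= fail_budget:
                mark_unhealthy()   # the load balancer should trust this, not /health
        time.sleep(PROBE_INTERVAL_S)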

Indirect probes (SWIM-style). When node A suspects node B is dead, instead of just marking B dead, A asks K other nodes "can you reach B right now?". If at least one of them succeeds, A's verdict downgrades from "dead" to "I cannot reach B but the cluster can". This is the SWIM protocol's contribution (Das, Gupta, Motivala 2002) and it converts asymmetric partitions from "node B looks dead from A's view" into "node B is unreachable from A specifically — investigate routing". Hashicorp's Memberlist library (used by Consul, Nomad) ships SWIM out of the box.
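
A sketch of the indirect-probe step — direct_ping and ping_req stand in for the protocol's ping and ping-req messages, and K=3 is a typical small fan-out; real implementations such as memberlist add piggybacked gossip and a suspicion timeout before declaring death:

# Sketch: SWIM-style indirect probing. direct_ping/ping_req are assumed RPC helpers.
import random

def check_peer(self_node, target, members, k=3, timeout=0.5):
    if self_node.direct_ping(target, timeout=timeout):
        return "alive"
    # I cannot reach the target directly — ask K other members to try on my behalf
    candidates = [m for m in members if m not in (self_node, target)]
    relays = random.sample(candidates, k=min(k, len(candidates)))
    for relay in relays:
        if self_node.ping_req(via=relay, target=target, timeout=timeout):
            # the cluster can reach the target; the problem is my path to it
            return "unreachable-from-me"
    return "suspect"   # nobody reached it; escalate to suspicion, not straight to dead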

Fencing tokens. When the protocol cannot prevent two nodes from believing they are the leader, the storage layer rejects writes from any leader holding a stale token. The token is a monotonically-increasing number assigned at lease renewal; the storage compare-and-swaps on the token. A stale leader, even one that does not know it is stale, cannot persist writes — its writes get a token-conflict rejection at commit time. The cost is a token-aware storage layer; the benefit is the asymmetric-partition dual-leader window cannot corrupt durable state. Part 9 of this curriculum is dedicated to fencing tokens.
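
A sketch of the storage-side check — a single highest_token_seen counter guarded by a lock stands in for what a real store would keep durably, per key-range or per lease; Part 9 develops the full protocol:

# Sketch: token-fenced write path. highest_token_seen would live in durable storage.
import threading

class FencedStore:
    def __init__(self):
        self._lock = threading.Lock()
        self.highest_token_seen = 0
        self.data = {}

    def write(self, key, value, fencing_token: int) -> bool:
        with self._lock:
            if fencing_token < self.highest_token_seen:
                return False              # stale leader: reject at commit time
            self.highest_token_seen = fencing_token
            self.data[key] = value
            return True

# A stale leader holding token 7 is rejected once any leader with token 8 has written:
store = FencedStore()
assert store.write("balance:42", 100, fencing_token=8)
assert not store.write("balance:42", 90, fencing_token=7)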

Phi-accrual failure detection. Replace binary "alive / dead" with a continuous probability of failure based on inter-arrival time of heartbeats. Suspicion (phi) rises smoothly when heartbeats are late, falls smoothly when they resume. The cluster's protocol layer reads phi and decides — at threshold 8, treat as failed; at threshold 16, expedite re-election; at threshold 4, no action. This decouples the detector's job (estimate the probability) from the protocol's job (decide what to do). Hayashibara's 2004 paper formalised this; Cassandra and Akka use it in production. Part 10 is dedicated to phi-accrual.
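
A minimal sketch of the phi computation, assuming heartbeat inter-arrival times are roughly normal — Hayashibara's paper and Cassandra's implementation estimate the tail differently, and production versions handle cold start and clamp the window; the thresholds in the text are what the protocol layer would compare phi against:

# Sketch: phi-accrual suspicion from heartbeat inter-arrival times.
# Assumes a normal model of inter-arrival times; real implementations use
# an exponential-tail estimate and a bounded sliding window.
import math, statistics, time
from collections import deque

class PhiAccrual:
    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)
        self.last_heartbeat = None

    def heartbeat(self, now=None):
        now = time.monotonic() if now is None else now
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now=None) -> float:
        if self.last_heartbeat is None or len(self.intervals) < 2:
            return 0.0
        now = time.monotonic() if now is None else now
        elapsed = now - self.last_heartbeat
        mean = statistics.mean(self.intervals)
        stdev = max(statistics.stdev(self.intervals), 1e-6)
        # P(heartbeat arrives later than `elapsed`) under the normal model
        z = (elapsed - mean) / stdev
        p_later = 0.5 * math.erfc(z / math.sqrt(2))
        return -math.log10(max(p_later, 1e-12))

# The protocol layer decides what to do with phi: e.g. suspect at 8, re-elect at 16.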

Why combining multiple defences is non-optional rather than just nice-to-have: each defence catches a different mode. Bidirectional probes catch asymmetric partitions but not gray failures (the /health thread still answers). Synthetic data-plane probes catch gray failures but not flow-specific partial partitions (the probe might hash to a working ECMP path while real workload hashes to the broken one). Indirect probes catch asymmetric partitions but assume the K relayers themselves are not partitioned. Fencing tokens prevent split-brain damage but do not detect it. The deployed defence-in-depth stack is the union of these — each catches the modes the others miss, and the failure modes that escape every layer are the ones that produce the rare, headline incidents.

What is still unsolved: gray failures in which the probe's path is structurally identical to the real workload's path and the workload still fails some of the time anyway — the situation a heavily ECMP-fanned-out distributed system produces routinely. The 2017 Microsoft "Gray Failure" paper's core observation is that no probe can detect every gray failure: the probe is one observation of a node, the workload is another, and the two may legitimately disagree about reachability when the network is doing something subtle. The pragmatic answer is to combine probe-based detection with circuit-breaker-style client-side detection — the workload itself is the most accurate probe of the workload — and to accept that a small fraction of gray failures will surface as user-visible incidents that the failure detector did not catch. The reliability budget for these is part of every modern SRE conversation.

Going deeper

The Bailis-Kingsbury "Network is Reliable" survey

Peter Bailis and Kyle Kingsbury's 2014 ACM Queue article "The Network is Reliable" is the single most important survey of empirical network-failure data. They aggregate failure reports from Microsoft, Google, Facebook, and Amazon datacentres along with the Jepsen test suite's per-database findings, and the headline result is that production networks fail constantly — the question is not whether a partition will happen but whether your protocol survives the specific kind that does. The paper's per-system breakdown of how MongoDB, ZooKeeper, Cassandra, and others behave under partition is the calibration baseline every distributed-systems engineer should know. Kingsbury's blog (aphyr.com) continues this work as the Jepsen series, with each post subjecting one database to rigorous partition testing and publishing the linearisability violations found.

The deeper insight is that the Jepsen tests do not cause failures the protocol designer didn't think about — they cause exactly the failures the designer's informal mental model didn't cover. Every Jepsen-found bug was, after the fact, classifiable into "the protocol was correct under the assumed model, but the model didn't match the partition mode the test produced". The recurring fix pattern is to broaden the model — switch from symmetric to asymmetric, add fencing tokens, add bidirectional health probes — and re-prove safety under the broader model. This is exactly the move Part 9 will perform when it introduces fencing tokens for leader leases.

The Microsoft datacentre failure study (Gill, Jain, Nagappan 2011)

The 2011 SIGCOMM paper "Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications" analysed a year of network failures across multiple Microsoft datacentres. The headline finding most relevant to this chapter: load balancers were the most failure-prone class of device — top-of-rack switches and core routers failed less often than the load balancers in front of the application tier. This is counter-intuitive (you would expect cheap commodity ToRs to fail more than expensive enterprise LBs) and operationally critical: the device most likely to be the cause of the partition you are debugging is the one you trust to tell you what is reachable. The fix is to never use the load balancer's view as the authoritative truth for cluster membership — always cross-check with peer-to-peer heartbeats and synthetic application probes.

The paper also quantified the fraction of failures resolved by simply waiting (the natural time-to-heal distribution) versus operator intervention — a substantial fraction healed without action within 5 minutes, motivating the modern conservative approach of "wait one minute before paging an SRE, except for revenue-affecting alerts". The decision of how long the protocol should hold its breath before re-electing is a function of this distribution.

A practitioner-relevant finding: the same Microsoft data showed that most production network failures last less than 30 seconds. This shapes the design of leader-election timeouts — too short and you re-elect on every transient blip (re-election is itself disruptive, dropping in-flight writes), too long and the cluster remains unavailable longer than necessary. The Raft default of a 150–300 ms election timeout reflects this trade-off: the empirical failure-duration distribution has a sharp knee around the 1-second mark — failures shorter than that usually heal themselves, failures longer than that usually require intervention — and a timeout below the knee lets TCP retransmission absorb the very short blips while re-electing promptly for anything that persists.

Asymmetric partition formal model

Formally, an asymmetric partition is a relation R ⊆ Nodes × Nodes where R(a,b) = "a can deliver messages to b". For symmetric networks, R(a,b) ⇔ R(b,a); for asymmetric networks, this biconditional fails. The consequence for consensus protocols is that quorum-based reasoning, which implicitly assumes that "a is in the same connected component as b" is a transitive equivalence relation, no longer holds. Two nodes a and b may both be able to communicate with c but not with each other, leading to a non-transitive reachability graph in which any two-out-of-three majority vote is ambiguous.

The formal fix is to define "a sees b alive at time t" as a directional, time-stamped predicate, and to require that quorum decisions be based on bidirectional, recent sightings — a quorum of nodes such that each pair of nodes in the quorum has recently exchanged messages in both directions. This is more expensive (every pair must heartbeat both ways) but it is the only correct definition under asymmetric reachability. Most production protocols approximate this with "leader-driven heartbeating" (only the leader needs bidirectional confirmation with each follower), which is structurally weaker but operationally cheaper. The trade-off is the entire surface area of the asymmetric-partition problem in real systems.
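
A small sketch of what the bidirectional, recent-sighting quorum check looks like in code — here sightings is an illustrative map from (observer, observed) to the last time the observer received any message from the observed node:

# Sketch: quorum check under directional, time-stamped sightings.
# sightings[(a, b)] = last time node a received a message from node b.
def bidirectional_quorum(candidates, sightings, now, freshness=1.0, cluster_size=5):
    """True if `candidates` form a majority whose every pair has exchanged
    messages in BOTH directions within the last `freshness` seconds."""
    if len(candidates) <= cluster_size // 2:
        return False
    for a in candidates:
        for b in candidates:
            if a == b:
                continue
            seen = sightings.get((a, b))
            if seen is None or now - seen > freshness:
                return False
    return True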

Partial-partition and ECMP hash collisions

A network using equal-cost multipath routing (ECMP) hashes each flow-tuple (src-IP, dst-IP, src-port, dst-port) onto one of K parallel paths. If one of those K paths has a hardware fault, then 1/K of all flows between any given pair of nodes drops — silently, with no error indication, just packet loss. The mode is flow-specific, deterministic per flow (the same tuple always hashes to the same path), and randomised across pairs (different src-port choices give different hashes). The failure mode is "intermittent connectivity that depends on which port the client picked", and it is one of the harder ones to debug — ping works (different flow-tuple), tcpdump shows the SYN going out and never returning, traceroute works (yet another tuple).

The mitigation is per-flow active probing — every long-lived RPC connection is paired with a short probe connection on a different ephemeral port; if probes succeed but the main connection is dropping, the application closes and reopens the main connection (forcing a new ephemeral port and a new ECMP hash). Twitter Finagle's "Aperture" load balancer and Netflix's "Concurrency Limits" library both do versions of this. The cost is connection churn; the benefit is partial-partition survival.
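
A sketch of the pairing trick, with open_connection, is_healthy, and probe standing in for whatever RPC client is actually in use — the point is only that the probe opens a fresh connection (new ephemeral port, hence a new ECMP hash) each time, so it samples a different path from the long-lived connection:

# Sketch: detect a bad ECMP path by comparing a paired probe flow with the main flow,
# then force a new ephemeral port (and a new ECMP hash) by reconnecting.
# open_connection()/is_healthy()/probe() are assumed helpers, not a specific library's API.
import time

def maintain_connection(host, port, open_connection, probe, window=10, interval=1.0):
    main = open_connection(host, port)          # long-lived RPC connection
    main_failures = probe_successes = samples = 0
    while True:
        main_failures += (not main.is_healthy(timeout=0.2))   # in-band application ping
        probe_successes += bool(probe(host, port, timeout=0.2))  # short-lived, new src port
        samples += 1
        if samples >= window:
            if main_failures > window // 2 and probe_successes > window // 2:
                # probes keep working while the main flow keeps failing:
                # the main flow's ECMP path is bad — rehash by reconnecting
                main.close()
                main = open_connection(host, port)
            main_failures = probe_successes = samples = 0
        time.sleep(interval)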

A subtlety worth surfacing: ECMP-induced partial partitions are deterministic per flow-tuple but random across the population, which means the bug presents as a fraction of users seeing the symptom. The fraction is bounded above by 1/K where K is the number of paths, and below by zero (a given set of client flows may all happen to hash to the working paths). For any given customer, the symptom is either "always broken" (if their flow-tuple consistently hashes to the bad path) or "always fine" (if it does not). This is what makes the customer-support ticket pattern so confusing — the most-vocal users are the ones whose hash happened to land on the bad path, and from their perspective the service is completely down; from the dashboard's perspective the service is at 99.something%. The fix is to monitor tail-percentile error rates per flow-tuple-class, not just aggregate, and to alert when any tuple-class crosses a threshold even if the aggregate looks fine.
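
A sketch of that per-class alerting, under the assumption that the client source port modulo the path count is a usable proxy for the ECMP hash input — k_paths and the threshold are illustrative:

# Sketch: alert on per-flow-tuple-class error rates, not just the aggregate.
# The class key (src_port % k_paths) is an illustrative stand-in for however
# your network actually maps flows onto ECMP paths.
from collections import defaultdict

def per_class_alerts(requests, k_paths=12, threshold=0.05):
    """requests: iterable of (src_port, ok) samples from the last window."""
    totals, errors = defaultdict(int), defaultdict(int)
    for src_port, ok in requests:
        cls = src_port % k_paths
        totals[cls] += 1
        errors[cls] += (not ok)
    alerts = []
    for cls in totals:
        rate = errors[cls] / totals[cls]
        if rate > threshold:
            alerts.append((cls, rate))
    aggregate = sum(errors.values()) / max(sum(totals.values()), 1)
    return aggregate, alerts   # aggregate can look fine while one class is on fire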

The role of TCP's reliability illusion

A subtle point that catches every junior engineer: TCP's reliability guarantees apply at the segment level, not the application level. TCP guarantees that if a connection eventually delivers bytes, those bytes are in order and uncorrupted; it does not guarantee that any particular write will eventually be delivered. Under partial partition, TCP retries packets that fail and eventually surfaces a RST or a timeout — but the application sees this as an RPC error, the same way it would see a complete-partition error. The mistake junior engineers make is assuming "if my TCP write returned 0 errors, my message was delivered". TCP write returning 0 means the message was queued in the kernel send buffer; the kernel may still drop it on connection failure, the network may still drop it before the peer acks it, and the application has no way to distinguish "delivered to the peer's kernel" from "delivered to the peer's application" from "lost in transit and the connection silently failed". This is why every distributed protocol layers application-level idempotency keys, sequence numbers, and explicit acks on top of TCP — TCP's reliability is necessary but not sufficient, and the application protocol must close the gap.

A second subtlety, often missed: TCP keepalives have nothing to do with detecting application failures. Keepalives detect that the peer's kernel is reachable on the configured interval (typically 2 hours by default on Linux — far too long for any production system). They do not detect that the peer's application has wedged, deadlocked, or stopped servicing the socket; the kernel happily acks keepalives even while the application's accept() loop is blocked. Application-level heartbeats — RPCs the application explicitly handles, with deadlines on both sides — are the only way to detect application-level liveness. Every protocol that relies on TCP keepalives for liveness is silently exposed to gray failures, because gray failure is defined as the case where the kernel is healthy but the application is not. If your service relies on TCP keepalives to detect peer death, your service has a gray-failure latent bug — finding the latent bug means rolling out application-level heartbeats with sub-second intervals before the latent bug becomes a customer-impacting incident.
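
A sketch of an application-level heartbeat loop with deadlines on both sides — rpc.ping is an assumed stub whose handler runs on the peer's application event loop, which is exactly what lets it detect a wedged application where a TCP keepalive would not:

# Sketch: application-level heartbeat with explicit deadlines on both sides.
# `rpc` is an assumed stub whose ping() is handled by the peer's application
# thread (not its kernel), so a wedged application fails the heartbeat.
import time

def heartbeat_loop(rpc, on_suspect, interval=0.2, deadline=0.5, misses_allowed=3):
    misses = 0
    while True:
        start = time.monotonic()
        try:
            rpc.ping(deadline=deadline)   # must be served by the app's event loop
            misses = 0
        except Exception:                 # timeout, connection reset, etc.
            misses += 1
            if misses >= misses_allowed:
                on_suspect()              # escalate to the failure detector / protocol
        # keep the cadence steady even when the RPC takes a while
        time.sleep(max(0.0, interval - (time.monotonic() - start)))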

Reproduce this on your laptop

# Reproduce the asymmetric-partition simulator
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
python3 asymmetric_partition.py
# Try varying partition_at, the cluster size, and the election timeout window.
# For real partition testing on a real distributed database, see Jepsen below.

To inject a real asymmetric partition on Linux and observe a real cluster's behaviour, iptables is the simplest tool:

# Drop only inbound traffic from peer 10.0.0.5 to this host on consensus port 9100
sudo iptables -A INPUT -s 10.0.0.5 -p tcp --dport 9100 -j DROP
# Outbound traffic to 10.0.0.5 still flows — this is the asymmetric mode.
# Watch your cluster's failure detector for ~60 seconds, then:
sudo iptables -D INPUT -s 10.0.0.5 -p tcp --dport 9100 -j DROP
# For finer-grained partial-partition simulation use `tc qdisc add ... netem loss`
# to drop a fraction of packets, or a proxy such as Shopify's `toxiproxy` for
# higher-level chaos injection.

Where this leads next

The four modes catalogued in this chapter — complete, asymmetric, partial, gray — set the failure surface that every later chapter's protocol must survive. The next chapters develop the operational consequences.

The lesson to carry into Part 8 (consensus) is that every consensus protocol's safety proof is conditional on a model of network reachability, and the model is almost always symmetric, eventually-synchronous, with bounded message loss. Real production networks violate every clause of that model occasionally and one clause of it constantly. The art of building a production consensus system is building one whose behaviour outside the model is graceful — degrading to "writes blocked, no data loss" rather than "two leaders, silent corruption". The ingredients are everything in this chapter: bidirectional probes, synthetic data-plane probes, fencing tokens, phi-accrual, indirect probes. Each costs something; each catches a different mode; the production deployment is some weighted combination of all of them, tuned to the specific failure budget of the specific service.

A small concrete drill to internalise the modes: take the most recent network-related incident your team handled and answer four questions. Which of the four modes (complete / asymmetric / partial / gray) was the underlying network behaviour? Which mode did your failure detector assume? If the two don't match, was the gap in the detector's design or in the deployment of the detector (probe path too narrow, threshold too loose)? What new probe or threshold change would close the gap, and what is the false-positive cost? Doing this for ten incidents in a row teaches the modes better than any survey paper. The modes then stop being theoretical categories and start being diagnostic shortcuts during the next incident.

The deeper observation, looking ahead, is that the network's failure modes shape every protocol layer above it. Part 8's consensus protocols (Raft, Paxos) need to be analysed under asymmetric and partial partition, not just symmetric. Part 10's failure detectors need to be tuned for the specific gray-failure rate of your environment, not the textbook rate. Part 17's geo-distribution adds long-haul WAN partitions whose duration distributions are heavier-tailed than intra-datacentre ones. Reading the rest of this curriculum, every protocol's behaviour-under-partition section is implicitly answering the question this chapter raised: which of the four modes did the designer assume, and which one will actually hit you. The skill of distributed systems engineering is reading that question fluently — turning the abstract "what does this protocol do under partition" into the concrete "what does this protocol do under the specific 23% asymmetric, 35% gray failure mix that my datacentre exhibits", and accepting the design trade-offs that mix forces.

References

  1. The Network is Reliable — Bailis & Kingsbury, ACM Queue 2014. The empirical-failure-rate survey across Microsoft, Google, Facebook, Amazon datacentres; the calibration baseline for every conversation about partition frequency.
  2. Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications — Gill, Jain, Nagappan, SIGCOMM 2011. The Microsoft year-long datacentre failure study.
  3. Gray Failure: The Achilles' Heel of Cloud-Scale Systems — Huang et al., HotOS 2017. The paper that named the gray-failure mode and articulated its observability problem.
  4. SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol — Das, Gupta, Motivala, DSN 2002. Indirect probes as the answer to asymmetric reachability.
  5. Jepsen analyses — Kyle Kingsbury, ongoing. Per-database partition testing with linearisability checks; the operational ground truth for "what does this database actually do under partition".
  6. The φ Accrual Failure Detector — Hayashibara et al., SRDS 2004. The continuous-suspicion failure detector that decouples detection from decision.
  7. Crash, omission, timing, Byzantine — internal cross-link to the previous chapter, where the four failure models are catalogued; this chapter is the network-layer specialisation.
  8. Designing Data-Intensive Applications — Kleppmann, O'Reilly 2017. Chapter 8's section "Unreliable Networks" is the practitioner's introduction to the modes catalogued in this chapter.