How real networks actually fail (studies)
It is 19:42 on a Wednesday and PaySetu's payment-confirmation service is queueing requests. The dashboard says every node is healthy. The dashboard says every link in the data centre is up. The dashboard says every BGP session is established. Karan, on call, runs ping between two replicas and gets 0% loss; he runs iperf and the throughput is line rate. Yet the application's tail latency has climbed from 80 ms to 4 seconds in the last seven minutes, and the customer-support queue has filled with "transaction timed out" tickets. Three hours later the team finds it: a single top-of-rack switch is silently dropping 0.3% of packets in one direction only, on flows whose source-port hash lands in a specific bucket. No alarm fired. No interface counter incremented. The textbook word for this is "the network failed" — but the textbook's mental model of "failed" was a clean partition with both sides knowing they were partitioned. Real networks do not fail like that. This chapter is about what three large-scale field studies tell us about the actual shape of network failure in production, and why every distributed-systems protocol is calibrated against the wrong threat model.
Field studies of Microsoft data centres, Google's B4 WAN, and university campus networks consistently report three things textbooks underweight: partial / asymmetric partitions are far more common than clean splits; silent packet loss in the 0.1–1% range is endemic and invisible to standard counters; and most outages are caused by configuration changes and software bugs, not hardware. A protocol designed for "the network is fine or it is partitioned" survives the textbook threat model and breaks in production.
What the textbook assumes vs what production does
Open any distributed-systems textbook and the network model reads like this: messages may be dropped, delayed, reordered, or duplicated, but the network is otherwise symmetric (if A can reach B, B can reach A) and a partition is a binary state (a set of nodes can either talk to each other or they cannot). This model is mathematically clean — it admits proofs about consensus, replication, and convergence — and it is what every introductory course teaches. It is also wrong about real networks in three specific, measured ways.
First, asymmetry is normal. Bailis & Kingsbury's 2014 survey "The Network is Reliable" found that a substantial fraction of reported partition incidents involved a node A that could send packets to B while B's packets back to A were dropped — typically because of a stateful firewall rule, a misrouted return path through a degraded link, or a NIC offload bug that affected outbound but not inbound traffic. Heartbeat protocols designed for symmetric loss misbehave in this regime: A stops hearing B over the broken B→A direction and declares B dead, while B keeps hearing A over the working A→B direction and considers A alive. The cluster fragments along disagreement, not along physical topology.
Second, gray failures are the median. Huang et al.'s 2017 paper "Gray Failure: The Achilles' Heel of Cloud-Scale Systems" documented hundreds of Microsoft Azure incidents where the network was failing for the workload but not for the monitoring. A switch with a broken ASIC entry that drops packets on a single 5-tuple hash bucket; a fibre with intermittent CRC errors whose frames are retried at the link layer (so the loss is invisible above L2) but which add latency on that one path; a leaf-spine network where the ECMP rebalancing decision sends 30% of one tenant's traffic over a saturated uplink. Each of these is "the network failing" in any practical sense, and none of them trips the standard interface up/down alarm that the cluster's failure detector watches.
Third, most outages are not hardware. Govindan et al.'s 2016 study of 100+ post-mortems across Google's B4 WAN, Jupiter data-centre fabric, and managed-network products found that 80% of high-impact incidents had a configuration change or a software defect as their root cause, not a fibre cut or an ASIC failure. The mental model "fix it by replacing hardware" is wrong; the on-call playbook for the most common failures is "find which config push deployed in the last hour and roll it back".
The headline reframe: a protocol that handles "clean partition" handles roughly the easy 20% of network incidents. The hard 80% — asymmetric reachability, gray failures, config-change cascades — escapes the textbook threat model. The rest of this chapter is what the field studies say about each.
Why asymmetric loss breaks heartbeat detectors specifically: a heartbeat detector treats "no heartbeat from a peer for T intervals" as evidence the peer is dead. If A→B works and B→A is broken, B's heartbeats never reach A, so A marks B as failed; A's heartbeats still arrive at B over the working direction, so B sees A as alive. The cluster splits not because of a topology partition but because of a protocol disagreement — A has expelled B from its membership view, while B still considers itself a member. Any state replicated through A excludes B; any state replicated through B treats A as authoritative. Convergence is only possible after the asymmetry is repaired; until then the system runs in two parallel timelines and fences against itself only by accident.
Measuring it: a Python harness for the three modes
The cleanest way to internalise the studies is to simulate the three failure modes against a small replicated service and see which detectors fire under each. The script below stands up four "nodes" that each send a one-way heartbeat to every peer on every tick, then injects (1) clean partition, (2) asymmetric loss, (3) gray-failure 0.5% silent drop, and reports the membership view each node holds at the end.
# net_failures.py — three failure modes, four nodes, observe membership disagreement
import collections
import random

NODES = ["n0", "n1", "n2", "n3"]
SUSPECT_AFTER = 3  # consecutive missed heartbeats before a peer is suspected


def link_drop(src, dst, scenario, t):
    """Return True if a packet sent from src to dst at tick t is dropped."""
    if scenario == "clean_partition":
        # n0,n1 vs n2,n3 — symmetric block from t >= 10
        if t >= 10 and ((src in {"n0", "n1"}) != (dst in {"n0", "n1"})):
            return True
    if scenario == "asymmetric":
        # only n2 -> n0 drops; n0 -> n2 still works
        if t >= 10 and src == "n2" and dst == "n0":
            return True
    if scenario == "gray_05pct":
        # 0.5% loss on every link, both ways, all the time
        if random.random() < 0.005:
            return True
    return False


def run(scenario, ticks=80):
    # misses[receiver][sender]: consecutive heartbeats from sender that receiver failed to see
    misses = {n: collections.Counter() for n in NODES}
    suspected = {n: set() for n in NODES}
    for t in range(ticks):
        for src in NODES:
            for dst in NODES:  # src sends one one-way heartbeat to every peer per tick
                if src == dst:
                    continue
                if link_drop(src, dst, scenario, t):
                    misses[dst][src] += 1
                    if misses[dst][src] >= SUSPECT_AFTER:
                        suspected[dst].add(src)
                else:
                    misses[dst][src] = 0
                    suspected[dst].discard(src)
    print(f"\n--- scenario={scenario} ---")
    for n in NODES:
        print(f" {n} suspects {sorted(suspected[n]) or '[]'}")


random.seed(7)
run("clean_partition")
run("asymmetric")
run("gray_05pct", ticks=400)
Sample run:
--- scenario=clean_partition ---
n0 suspects ['n2', 'n3']
n1 suspects ['n2', 'n3']
n2 suspects ['n0', 'n1']
n3 suspects ['n0', 'n1']
--- scenario=asymmetric ---
n0 suspects ['n2']
n1 suspects []
n2 suspects []
n3 suspects []
--- scenario=gray_05pct ---
n0 suspects []
n1 suspects []
n2 suspects []
n3 suspects []
Read the three blocks. Clean partition is the textbook case: each side sees the other side as failed; the membership views are mirror images; a quorum-based protocol on a 4-node cluster cannot make progress because neither side has 3-of-4. Asymmetric is the dangerous case: only n0 notices anything, because only n2's heartbeats to n0 are dropped. n2 itself, and n1/n3, see no failure — n2 is fully alive in their views, and n0 is fully alive in n2's. The cluster's membership state is inconsistent across nodes, which breaks the assumption every consensus protocol leans on: that "alive" is an agreed-upon predicate. Gray failure at 0.5% is the slow-burn case: three consecutive drops on one link happen with probability of about 0.005^3 ≈ 1 in 10 million per tick, so no node ever crosses the SUSPECT_AFTER=3 threshold — yet a request whose flow spans a couple of hundred packets has a roughly two-in-three chance of losing at least one (1 - 0.995^200 ≈ 0.63), and each loss that hits a TCP retransmission timeout adds 200 ms or more. The detector says "fine"; the workload sees the floor falling out from under it.
Why gray failure at 0.5% is more dangerous than 5%: at 5% loss the heartbeat detector trips reliably, the link gets removed from BGP / ECMP, traffic reroutes, and the cluster degrades visibly — operators are paged, the on-call investigates, the link is replaced. At 0.5%, no detector trips; TCP retransmits paper over the loss for short flows; the only visible signal is a tail-latency regression on long-lived connections. Operators look at the dashboard, see green, and conclude the workload itself is the problem. Investigation drifts toward application code or database tuning, while the actual root cause sits two layers below in the network. Time-to-diagnose grows from minutes to days.
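To put rough numbers on that contrast, the short sketch below uses the same SUSPECT_AFTER=3 threshold as the harness above plus two simplifying assumptions that are mine, not the studies': packet losses are independent, and one heartbeat crosses the link per second. It compares how often the detector trips with how often a 200-packet request hits at least one drop, at 0.5% and 5% loss.
# loss_math.py — illustrative model only: independent loss, one heartbeat per second
SUSPECT_AFTER = 3            # consecutive misses before a peer is suspected
HEARTBEATS_PER_DAY = 86_400  # one heartbeat per second on the link (assumed)
REQUEST_PACKETS = 200        # packets exchanged by one application request (assumed)

for loss in (0.005, 0.05):
    # P(some run of SUSPECT_AFTER consecutive misses in a day) (a union bound, good enough here)
    p_detector_day = min(1.0, HEARTBEATS_PER_DAY * loss ** SUSPECT_AFTER)
    # P(a single request loses at least one of its packets)
    p_request = 1 - (1 - loss) ** REQUEST_PACKETS
    print(f"loss={loss:.1%}: detector trips on ~{p_detector_day:.1%} of days, "
          f"~{p_request:.0%} of requests hit a drop")
Under these toy assumptions the 0.5% link damages roughly two requests in three while waking the detector about once every three months; the 5% link trips it essentially every day. The louder failure is the cheaper one.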
Why the asymmetric case yields a single-node-suspects-one membership state: heartbeat misses accumulate only on the receiving side of the broken direction. n0 is the only node that stops receiving n2's heartbeats, so n0's miss counter for n2 climbs past 3 and n0 adds n2 to its suspect set. n0's heartbeats to n2 travel the working n0→n2 direction and keep arriving, so n2's miss counter for n0 stays at zero, as do n1's and n3's counters for every peer, since none of their links are affected. The cluster ends up with one node holding a one-element suspect set while every other node holds an empty one — a configuration the textbook protocols simply do not name, because the textbook assumes a global "alive" predicate, not per-node membership views.
What each field study actually found
The three studies referenced earlier each reported a different slice of the same underlying truth. Reading them together is what makes the picture crisp.
Bailis & Kingsbury, "The Network is Reliable" (CACM 2014) is a meta-survey of partition incidents reported by operators of large distributed systems (Cassandra, MongoDB, Riak, Elasticsearch, others). The headline findings: clean partitions exist but are uncommon; the typical incident involves partial connectivity (a subset of nodes can reach a different subset, with overlapping or asymmetric edges); incidents commonly persist for hours to days, not seconds; and the most damaging incidents combine network failure with a config push or an unrelated software bug that exposed a previously-hidden corner case. Their core operational claim: every protocol claiming partition-tolerance must specify which kind of partition, because the partition zoo is much wider than "two halves disconnected".
Huang et al., "Gray Failure: The Achilles' Heel of Cloud-Scale Systems" (HotOS 2017) is the inside-Microsoft view. They define gray failure formally: a state where the system's monitoring believes the component is healthy while the workload experiences degraded behaviour. They taxonomise gray failures into temporal (intermittent), spatial (affects a subset of flows / tenants), and correctness (returns wrong content) categories. The paper presents incident data from Azure: a single ToR switch with a broken hash entry corrupting one tenant's traffic for 41 days before detection; a NIC firmware bug that dropped packets at a 0.7% rate on 5% of VMs for two weeks; a fibre with degraded SNR causing CRC retries that masked themselves at L2 but added 4 ms of jitter for months. The pattern: gray failures are slow, narrow, and survive every standard alarm.
Govindan et al., "Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure" (SIGCOMM 2016) is the post-mortem catalogue from Google's own networks (Jupiter data-centre fabric, B4 WAN, B2 backbone, edge networks). Of 100+ analysed incidents: only ~20% were caused by hardware (fibre cut, line-card failure, optical degradation); the remaining 80% split between configuration changes (~35%, frequently BGP / ACL pushes), software defects (~25%, switch firmware, controller bugs), operational error (~10%), and miscellaneous (~10%). The paper's recommendations — defence in depth, blast-radius isolation, gradual rollout, automated rollback — are framed against this empirical distribution, not against a hardware-failure threat model.
The deeper synthesis of the three: the network is not a layer underneath your distributed system; it is a distributed system itself, with its own failure modes, its own configuration management, its own software defects, and its own gradual-failure dynamics. Treating it as a black box that is "either up or partitioned" discards exactly the information that explains the incidents you actually have.
Production stories — three modes hitting fictional Indian platforms
The lifted lessons read more clearly with a worked example for each mode.
Asymmetric partition — CricStream's chat service during the IPL final, March 2024. At 19:42 IST during the second innings, with 31 million concurrent viewers and chat throughput at 480k messages per second, the chat cluster's shard-7 started showing odd behaviour: half its replicas marked the leader as failed, the other half saw it as healthy. Investigation revealed a stateful-firewall rule, pushed 40 minutes earlier as part of a routine ACL update, that affected only return traffic on TCP port 7401 from one specific source-IP range. Outbound writes from the leader to the followers worked fine; the followers' replies (replication acknowledgements travelling back over the same connections) were silently dropped. The followers' TCP stacks retransmitted, the retransmits were also dropped, their sends to the leader timed out after 8 seconds, and three followers marked the leader as failed. The leader, receiving no failure signal because its outbound probes still got replies (the probes used a different port range that the firewall rule did not match), continued operating. Result: a 14-minute window where the cluster ran in a split mind, one half believing the leader was alive and the other half having elected a new candidate. Recovery: roll back the ACL push (the on-call followed the Govindan et al. playbook by instinct — "the last config change is the suspect"), allow the network state to settle, manually reconcile two divergent log tails. Cost: 0.4% of messages during the window were either lost or duplicated; one match-event notification was delivered seven minutes late, generating a wave of social-media complaints.
Gray failure — PaySetu's payment-status service, monsoon season 2024. A leaf switch in the Mumbai DC began experiencing intermittent CRC errors on a single QSFP+ optical transceiver, caused by a slow degradation in the physical fibre's SNR (water ingress at a junction box during heavy monsoon, suspected). The link's interface counters showed up/up, throughput at line rate, and 0 errors at the IP layer because the link was retransmitting frames at L2 — every CRC failure triggered a frame resend, which succeeded on the second or third attempt. The L2 retransmits added 200 µs of jitter on average, with occasional 4 ms outliers. Above L2, this was invisible. Below the workload, this was a 0.3% packet-jitter floor on a specific path. The symptom: PaySetu's payment-status RPC saw its p99 latency creep from 80 ms to 320 ms over five days, with no obvious correlation to traffic level, time of day, or recent deploys. The team initially blamed a recent JVM upgrade, then a recent database-config change. The actual fix — replacing the QSFP+ — was identified after running a per-physical-link latency probe (a 1KB packet every 100 ms, source-routed through specific links, recording per-link RTT) for 18 hours. Cost: five days of degraded p99, ~2% of payment confirmations slower than the 200 ms SLA, and a stretched on-call team chasing the wrong layer. The post-incident change: deploy continuous L1/L2 probing as a first-class signal, on par with /health.
Config-push cascade — KapitalKite's order-matching cluster, April 2024. A scheduled BGP-policy update rolled out across the broker's 12-rack data centre at 08:55 IST, five minutes before the equity-market open. The new policy correctly added a route prefix for a new analytics service but, due to a YAML indentation typo, also withdrew a prefix carrying the order-matching cluster's intra-cluster traffic. For three minutes, traffic between the order-matching racks was forced through the inter-rack core layer, raising its RTT from 80 µs to 1.4 ms. The order-matching engine's leader-election timeout was tuned to 200 ms (calibrated against ~80 µs RTTs); at 1.4 ms every heartbeat round took ~5 ms instead of <1 ms, and the detour path, sharing the core uplinks with the pre-open traffic surge, added queueing spikes that intermittently pushed rounds past the 200 ms timeout. The followers began suspecting the leader, and three rapid leader re-elections occurred between 08:55 and 08:58. The market opened at 09:00 with the cluster mid-recovery; the first 47 seconds of order matching saw a backlog of ~2300 unmatched orders. Recovery: BGP rollback at 08:58 (under three minutes — the team had a one-button rollback for any policy push), cluster stabilised within 30 seconds, market open completed without customer-visible failure. Cost: the SRE team's 09:00 stand-up was replaced by a 90-minute incident review that resulted in two changes — adding a pre-deploy intra-cluster RTT probe to the policy-push pipeline, and bumping the leader-election timeout to 800 ms with PreVote enabled.
The three stories share a structural feature: the network's failure mode was not what the cluster's failure detector was watching for. CricStream's heartbeat detector watched a port-range that the firewall did not affect; PaySetu's interface counters watched IP-layer counters that L2 retransmits hid; KapitalKite's leader-election timeout was calibrated for one RTT regime and an unrelated config push moved it to another. The detector was correct for the threat model it was built against; the threat model was incomplete. This is the production form of the three studies' meta-finding.
Common confusions
- "Network reliability has improved over the last 20 years, so these studies are out of date." Hardware reliability has improved (modern fibres, ASICs, and NIC firmware are better than 2005-era kit). Configuration churn has increased: modern data centres push config changes hundreds of times per day, where 2005-era networks pushed weekly. The Govindan et al. 2016 finding that 80% of outages are config / software is not historic; the fraction has held steady or grown as the industry shifted from manual change windows to continuous deployment.
- "Asymmetric partitions are a corner case; my cluster has never seen one." Either you have and didn't notice (the common case — they often resolve before a human looks at the logs), or your cluster is small enough that the rate × size product gives you fewer than one per year. The Bailis-Kingsbury survey aggregated across hundreds of operators, and partial / asymmetric partitions appeared in every operator's incident log given enough time. Even for a single small cluster the threat model is worth treating as real, because the cost when one does happen is high enough to dominate even a low base rate.
- "Gray failures will be caught by ping." ping between two endpoints sees the loss rate that affects ICMP traffic on the specific 5-tuple ping produces, and it averages across the path's full length. If the gray failure is a single-bucket ECMP problem, ping lands in the good bucket on most runs and reports zero loss. If the failure is a single-flow application-layer slowdown caused by a TCP-options interaction with one middlebox, ping does not exercise that flow at all. ICMP is one packet shape; production workloads are dozens. Probing must match the workload's flow profile to be a useful signal.
- "BGP errors only matter for ISPs." Modern data-centre networks use BGP internally — every leaf-spine fabric runs BGP between the ToRs and the spines, with hundreds of route updates per minute. Cloud-tenant routing (VPC peering, transit gateways) is BGP-driven. The 35% config-change figure in Govindan et al. is dominated by intra-DC BGP changes, not WAN-peering ones. Treating BGP as "an ISP problem" misses the largest source of intra-DC outages.
- "Adding redundant links removes single-link failure modes." It removes single-physical-link failure modes when those failures are detected and ECMP / failover routes around them. It does not remove correlated failure modes: a config push affects all links because the config is shared; a switch firmware bug affects all switches running that firmware; a controller bug propagates to every device the controller manages. The 80%-not-hardware finding implies that link redundancy addresses the smaller fraction, not the larger.
- "Gray failure is just slow performance." It is a specific slow-performance pattern: visible to the workload, invisible to the monitoring. Generic "the system is slow" can come from many causes (overload, GC, lock contention, slow disk). Gray failure specifically describes the case where every monitoring signal says "healthy" while the workload disagrees. The diagnostic question is not "is it slow?" but "does the workload's slowness correlate with any signal that any detector raises?" When the answer is no, you are in gray-failure territory and you need a workload-shaped probe, not more generic dashboards (a minimal sketch of such a probe follows this list).
Going deeper
What the BGP misconfiguration paper actually proves
Mahajan, Wetherall, Anderson's 2002 SIGCOMM paper "Understanding BGP Misconfiguration" sampled BGP route updates over the public internet for three weeks and found that about 0.2–1% of all advertised routes at any moment were the result of a misconfiguration somewhere. Two-thirds of the misconfigurations self-corrected within 10 minutes (an operator noticed and rolled back); a third persisted longer. The misconfigurations they catalogued split between origin misconfiguration (advertising a prefix you don't own) and export misconfiguration (leaking a prefix to a peer you shouldn't), and the paper's enduring contribution is the observation that a non-trivial fraction of BGP routes at any time are wrong — the protocol tolerates this because the routes the wrong updates affect are usually not on the path your traffic takes, but the base rate is high.
That paper's framing — misconfiguration as a continuous low-level background — informs the Govindan et al. finding that the same dynamic plays out inside data centres at a higher frequency. The implication for distributed-systems design is direct: any protocol that depends on "the network is up unless the alarm fires" is depending on a base rate of correctness that the data does not support.
The ECMP hash-bucket pathology
Modern leaf-spine networks use Equal-Cost Multi-Path routing to distribute traffic across multiple uplinks. ECMP hashes a flow's 5-tuple (src IP, src port, dst IP, dst port, protocol) and selects an uplink based on the hash. When one uplink fails, the hash-to-uplink mapping is recomputed, and a fraction of flows are remapped to surviving uplinks. The pathology: if the hash function has poor mixing (some implementations effectively use only the lower bits of the source port), and the source-port distribution from the workload is non-uniform (TCP often picks ports in narrow ranges), traffic concentrates on a subset of uplinks. Under load, those uplinks saturate while others sit idle.
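A toy model makes the mixing problem concrete. Everything below is invented for illustration (the "weak" hash that keys on the low bits of the source port, the "better" multiplicative hash, and a client that allocates source ports in steps of eight); none of it is taken from a real switch or client library.
# ecmp_skew.py — toy model of ECMP hash mixing; hash functions and workload are assumed
import collections

UPLINKS = 8

def weak_hash(src_port):
    # Pathological: effectively uses only the low 3 bits of the source port.
    return src_port & 0b111

def better_hash(src_port):
    # Stand-in for a hash with decent mixing (Knuth multiplicative hashing).
    x = (src_port * 2654435761) & 0xFFFFFFFF
    return x >> 29  # top 3 bits select one of 8 uplinks

# A client that happens to allocate source ports in steps of 8 (assumed, non-uniform).
flows = [32768 + 8 * i for i in range(4000)]

for name, h in (("weak", weak_hash), ("better", better_hash)):
    load = collections.Counter(h(p) for p in flows)
    print(f"{name:>6} hash, flows per uplink: {[load.get(u, 0) for u in range(UPLINKS)]}")
With the weak hash every one of the 4000 flows lands on uplink 0 while seven uplinks sit idle; the better hash spreads them nearly evenly, around 500 per uplink. Randomising the client's source-port selection (one of the mitigations discussed below) removes the skew even under the weak hash, because the low bits stop being constant.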
This is gray failure at the network layer: every uplink is "up", every flow is moving, but the effective bandwidth is a fraction of the nominal capacity, and the affected flows experience tail latency disproportionate to the average. Detection requires per-uplink utilisation visibility plus the workload-shaped probing described above. Mitigation involves using hash functions with better mixing properties, randomising the source-port selection in client libraries, or moving to flowlet-aware ECMP variants. The full story is a distributed-systems book of its own; the point here is that the network layer has gray failures of its own that propagate up.
Why the failure-recovery time matters as much as the failure rate
A subtle finding in the Bailis-Kingsbury survey: incident duration matters more than incident frequency for the practical impact on a distributed system. A network that has a 30-second partition every hour is mostly available; a network that has one 6-hour partition per month accumulates the same downtime in a single, far more damaging window. Long-duration partitions are operationally worse because the longer the partition, the more divergence accumulates between the disconnected sides — more writes per side, more state to reconcile, more chance of an irreversible action (a payment captured, an order matched, an SMS sent) that cannot be rolled back when the partition heals.
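A toy model shows the shape. Every number below is an assumption chosen to make the curve visible, not a measurement: each side of the partition keeps accepting writes at a fixed rate, keys are accessed uniformly, a key written on both sides needs semantic reconciliation when the partition heals, and a small fraction of writes trigger an irreversible side effect.
# divergence_toy.py — toy model; all rates and fractions below are assumptions
WRITE_RATE = 50                 # writes per second accepted by each side (assumed)
KEYS = 100_000                  # distinct keys, accessed uniformly (assumed)
IRREVERSIBLE_FRACTION = 0.001   # writes with an un-undoable side effect (assumed)

for duration_s in (30, 600, 6 * 3600):
    writes_per_side = WRITE_RATE * duration_s
    # Probability a given key was written at least once on one side
    p_written = 1 - (1 - 1 / KEYS) ** writes_per_side
    # Expected keys written on both sides -> conflicts needing a semantic merge
    conflicts = KEYS * p_written ** 2
    irreversible = 2 * writes_per_side * IRREVERSIBLE_FRACTION
    print(f"{duration_s:>6} s: {writes_per_side:>7} writes/side, "
          f"~{conflicts:>6.0f} conflicting keys, ~{irreversible:>5.0f} irreversible actions")
In this model the 600-second partition is twenty times longer than the 30-second one but produces roughly three hundred times the conflicting keys, and the 6-hour partition has touched essentially every key. That is the super-linearity the mechanisms in the next paragraph exist to cap.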
Protocols that bound the divergence — leader leases that expire, fencing tokens that prevent stale leaders from acting, write throttling under suspected partition — exist precisely because the recovery cost is super-linear in the partition duration. The CricStream story above had a 14-minute window; KapitalKite's was three minutes; PaySetu's gray failure spanned five days but did not produce a hard partition. The cost ranking of the three is not in proportion to the duration in seconds because each had a different divergence rate and a different reconciliation cost per unit of divergence.
Reproduce this on your laptop
# Reproduce the network-failure simulation
python3 -m venv .venv && source .venv/bin/activate
python3 net_failures.py
# Vary SUSPECT_AFTER and the gray-failure loss rate to see the detector boundary
# Try 0.001 vs 0.01 vs 0.05 — when does the detector start to trip?
# Inject real packet loss on Linux to exercise the asymmetric case:
sudo tc qdisc add dev lo root netem loss 0.5% # 0.5% loss, both directions
ping -c 1000 127.0.0.1 | tail -3 # observe the loss rate
sudo tc qdisc del dev lo root
# Inject one-way loss for true asymmetry (requires two interfaces or a netns pair)
# See https://man7.org/linux/man-pages/man8/tc-netem.8.html for direction-specific shaping
Where this leads next
The three modes catalogued here — asymmetric partition, gray failure, config-push cascade — recur as motivating examples in every later chapter of this curriculum.
- Partial failures and why they're the worst — the operational consequences when only some replicas fail; this chapter is its empirical companion.
- Network partitions, asymmetric reachability, gray failures — the formal definitions of the modes the studies catalogue.
- Fail-stop, fail-slow, fail-silent — the node-side failure taxonomy that pairs with this chapter's network-side one.
- Phi-accrual failure detector — Part 10 — the principled detector calibrated against the loss patterns the studies report.
- Lease and fencing token — Part 9 — the protocol mechanism that bounds divergence during the long partitions Bailis-Kingsbury report.
The lesson to carry: when reading any distributed-systems paper that says "in the presence of network failure, the protocol does X", the next question is which network-failure mode. A protocol that survives clean partition is not the same as a protocol that survives asymmetric loss is not the same as a protocol that survives gray failure under a config push. The studies do not let you treat "the network failed" as one event; they force the threat model to be specific, and the protocol's fault-tolerance claim to be specific against it.
The deeper observation, looking ahead: the textbook simplification of the network ("messages may be lost") is the same kind of simplification that the early consensus papers made about node failures (assuming fail-stop) — useful for proving the algorithm correct, dangerous when shipped to production. Every later chapter on consensus, replication, and consistency lives at the boundary between what the protocol proves (under the simplified model) and what the protocol survives (under the empirical model), and the studies in this chapter are the empirical ground truth that boundary is built on.
References
- The Network is Reliable — Bailis & Kingsbury, CACM 2014. The meta-survey of partition incidents from operators of major distributed systems; defined "partial partition" as a category.
- Gray Failure: The Achilles' Heel of Cloud-Scale Systems — Huang et al., HotOS 2017. The Microsoft Azure perspective; gives gray failure its name and its taxonomy.
- Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure — Govindan et al., SIGCOMM 2016. The 80% config / software finding from 100+ Google network post-mortems.
- Understanding BGP Misconfiguration — Mahajan, Wetherall, Anderson, SIGCOMM 2002. The base-rate finding for BGP errors on the public internet.
- Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications — Gill, Jain, Nagappan, SIGCOMM 2011. The earlier large-scale data-centre network failure study from Microsoft; complements Govindan et al.
- Fail-stop, fail-slow, fail-silent — internal cross-link to the node-side companion taxonomy.
- Designing Data-Intensive Applications — Kleppmann, O'Reilly 2017. Chapter 8 surveys the same studies for a practitioner audience.
- Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network — Singh et al., SIGCOMM 2015. The companion paper for Google's data-centre fabric architecture; useful background for understanding which layers Govindan's failure analysis spans.