The physics argument: latency and failure
Aditi runs the order-routing service for KapitalKite. Her dashboard alerts at 09:14:23 IST: cross-region replication lag from Mumbai to Singapore has spiked from a usual 38 ms to 74 ms for thirty seconds. She opens the runbook expecting a noisy-neighbour incident on the leader. The on-call from the network provider replies in three minutes: a submarine cable repair has rerouted Mumbai-Singapore traffic via Chennai, adding 3,600 km of fibre. Aditi cannot fix this with code, with a bigger box, or with a circuit breaker. The universe added 36 ms to her round trip — 18 ms of extra propagation in each direction — and she will live with it until the cable is spliced.
This is the physics argument. The previous chapter showed that cost forces distribution at scale; this one shows that two physical facts force distribution before cost ever enters the conversation — the speed of light is finite, and components fail at a rate set by the materials they are made of. Every primitive in the next 137 chapters is a response to one of these two facts.
Light traverses fibre at roughly 200,000 km/s — about 5 µs per kilometre — and round-trip latencies are bounded below by twice that geometric distance. Hardware fails at rates measured in failures-per-billion-device-hours that, once you multiply by fleet size, guarantee something is broken right now. Distribution is the only honest response: locality solves the speed-of-light bound, redundancy solves the failure rate. Caching is just locality with a different name; replication is just redundancy with a protocol attached.
The speed-of-light floor — what fibre actually buys you
The first physical fact: information cannot travel faster than light, and light in optical fibre travels at about two-thirds of c — roughly 200,000 km/s, or 5 microseconds per kilometre, one way. There is no engineering trick that beats this. You can shorten the cable, you can reduce serialisation overhead, you can pipeline more aggressively — but the geometric distance between two cities, divided by 200,000 km/s, multiplied by two for the round trip, is the lower bound on any RPC's latency, and your application sees that floor whether you like it or not.
The numbers in the figure compound brutally for synchronous designs. PaySetu replicates UPI transactions from Mumbai to a hot-standby in Singapore for disaster recovery. If they choose synchronous replication — wait for the Singapore replica to ack before returning success to the merchant — every payment now pays a 38 ms minimum penalty that no infrastructure choice can remove. At 8,000 TPS, that 38 ms is 304 thread-seconds of latency burned per real-time second; you need 304 threads parked just to absorb the speed of light.
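The 304-thread figure is Little's law — in-flight requests equal throughput times latency. A minimal sketch of the arithmetic, using the PaySetu numbers from the paragraph above (the function name is illustrative):

```python
def parked_threads(tps: float, added_latency_ms: float) -> float:
    """Little's law: in-flight work = arrival rate x time each request waits.

    Every request that blocks on the cross-region ack occupies a thread
    (or coroutine slot) for the full added latency, so the steady-state
    number of occupied slots is throughput x latency.
    """
    return tps * added_latency_ms / 1000.0

# PaySetu: 8,000 TPS each paying a 38 ms synchronous-replication penalty
print(parked_threads(8_000, 38))  # 304.0 threads parked, absorbing the speed of light
```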
# rtt_floor.py — measure the irreducible round-trip and the practical RTT
import socket
import statistics
import time

# Replace with hosts you control in two regions; this is a measurement skeleton.
TARGETS = [
    ("ap-south-1.local", "10.0.1.5", 443),    # same VPC, same AZ
    ("ap-south-1b.local", "10.0.2.7", 443),   # same VPC, other AZ
    ("ap-southeast-1.dr", "10.20.1.3", 443),  # Singapore DR replica
]

def tcp_rtt(host_ip, port, samples=200):
    rtts = []
    for _ in range(samples):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        t0 = time.perf_counter_ns()
        s.connect((host_ip, port))  # connect() returns after SYN / SYN-ACK: ~1 RTT
        t1 = time.perf_counter_ns()
        s.close()
        rtts.append((t1 - t0) / 1e6)  # ns -> ms; one full round trip
    return rtts

def fibre_floor_ms(km):
    return 2 * km / 200_000 * 1000  # 200,000 km/s, doubled for round trip

print(f"{'target':24} {'p50_ms':>8} {'p99_ms':>8} {'fibre_floor_ms':>16}")
for name, ip, port in TARGETS:
    samples = tcp_rtt(ip, port)
    p50 = statistics.median(samples)
    p99 = sorted(samples)[int(0.99 * len(samples))]
    floor = {"ap-south-1.local": 0.04, "ap-south-1b.local": 0.30,
             "ap-southeast-1.dr": fibre_floor_ms(3800)}[name]
    print(f"{name:24} {p50:>8.2f} {p99:>8.2f} {floor:>16.2f}")
Sample run:
target p50_ms p99_ms fibre_floor_ms
ap-south-1.local 0.18 0.42 0.04
ap-south-1b.local 1.38 2.10 0.30
ap-southeast-1.dr 52.40 74.80 38.00
The same-AZ p50 of 0.18 ms is 4.5× the speed-of-light floor of 0.04 ms — that 4.5× is overhead from NIC queues, kernel scheduling, and TCP handshake state machines, and it is the best you will ever do. The cross-AZ p50 of 1.38 ms is also about 4–5× its 0.30 ms floor, the same overhead ratio. The Mumbai-Singapore p50 of 52.40 ms is 1.4× its 38 ms floor — a tighter ratio because long-distance paths amortise the per-hop overhead over more kilometres of pure fibre. fibre_floor_ms is the universe's contribution; everything above it is your system's contribution. The point of the script is to measure the gap between the two — when the gap is large, you have engineering work to do; when the gap is small, you have already extracted everything fibre will give you and the only remaining lever is to move the work closer to the user.
Why the fibre-floor calculation uses 200,000 km/s and not 300,000 km/s: the speed of light in vacuum is c ≈ 300,000 km/s, but light in glass moves at c/n where n ≈ 1.46 for typical single-mode optical fibre. So the propagation speed is ~205,000 km/s, conventionally rounded to 200,000 for back-of-envelope work. This is why no fibre RTT will ever beat the floor in the script — the constant is a property of glass, not a property of the equipment at the endpoints.
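The derivation fits in two lines. A sketch using the rounded constants from the paragraph above:

```python
C_VACUUM_KM_S = 300_000  # speed of light in vacuum, rounded
N_SILICA = 1.46          # refractive index of single-mode fibre core

v_fibre = C_VACUUM_KM_S / N_SILICA   # propagation speed in glass, ~205,000 km/s
us_per_km = 1 / v_fibre * 1e6        # one-way microseconds per kilometre

print(f"{v_fibre:,.0f} km/s in fibre, {us_per_km:.2f} us/km one way")
```

Rounding 205,000 down to 200,000 km/s gives the convenient 5 µs/km used throughout this chapter, with a small built-in safety margin.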
The speed-of-light floor is what makes locality — caching, edge computing, regional replicas, CDNs — non-negotiable for global services. It is the entire economic justification for Cloudflare, Akamai, and the regional-replica patterns in Part 17. You cannot make Mumbai close to Virginia; you can put a copy of Mumbai's data in Virginia.
The failure-rate floor — components fail by the laws of materials
The second physical fact is uglier than the first because it cannot be averaged away. Every component in your fleet has a non-zero probability of failing at any instant, and once you multiply that probability by the size of your fleet, something is failing right now. The exact failure modes vary — a memory module hits a single-bit error from cosmic rays, an SSD's flash cells wear out after their rated write count, a fan bearing seizes, a power supply's electrolytic capacitors dry out — but the aggregate behaviour follows a small handful of distributions, and once you internalise those distributions, redundancy stops feeling like an option and starts feeling like accounting.
The numbers, calibrated against published Backblaze drive-failure reports, Google's DRAM error studies, and the standard data-centre AFR (annualised failure rate) tables for 2024–2026:
- Spinning disks: 1.5–2.5% AFR. In a 1,000-disk fleet, you replace 15–25 disks per year — about one every two weeks.
- Enterprise SSDs: 0.5–1.0% AFR (excluding wear-out). The wear-out curve is steep; a flash cell rated for 3,000 P/E cycles begins erroring at 3,500–4,000 in production.
- DRAM: 8% of DIMMs see at least one correctable error per year; ~1% see an uncorrectable error. Cosmic rays do not respect your SLA.
- Whole machines: 2–5% AFR including all non-disk causes (power supply, fan, motherboard, CPU). In a 10,000-machine fleet, roughly one machine fails every day.
- Network links: link flaps and submarine-cable cuts. The Indian Ocean cable cluster (the SEA-ME-WE family of submarine cables that lands at Mumbai and Chennai) has averaged 1.8 cable cuts per year over the past decade — fishing trawlers, anchor drags, and the occasional underwater landslide.
- Datacentre power: AFR around 0.05% for a tier-3 facility. A 20-DC fleet therefore expects 0.05% × 5 years × 20 DCs = 0.05 DC-level outages in 5 years — roughly a 5% chance of at least one.
# fleet_failure_math.py — what is broken right now in your fleet?
import math

COMPONENTS = [
    # (name, count, annual_failure_rate, mttr_hours)
    ("spinning disks",        1000, 0.020,  72),  # 3 days to swap and rebuild
    ("enterprise SSDs",       2000, 0.008,  24),  # 1 day to swap
    ("DRAM DIMMs",            6400, 0.010,  48),  # 2 days incl burn-in
    ("whole VMs (host fail)", 1500, 0.030,   8),  # auto-recovery on cloud
    ("rack switches",           60, 0.015,  12),  # half-day to RMA
    ("AZ-level outage",          3, 0.002, 240),  # 10 days incremental + DR test
]

print(f"{'component':24} {'count':>6} {'AFR':>6} {'expected/year':>14} "
      f"{'broken_now':>12} {'P(any broken now)':>18}")
print("-" * 86)
hours_per_year = 8760
for name, count, afr, mttr in COMPONENTS:
    expected = count * afr                         # expected failures per year
    broken_now = expected * mttr / hours_per_year  # steady-state failed units
    # P(at least one broken now): 1 - (1 - p_each)^count, p_each = afr * mttr / year
    p_each = afr * mttr / hours_per_year
    p_any = 1 - math.pow(1 - p_each, count)
    print(f"{name:24} {count:>6} {afr:>6.1%} {expected:>14.1f} "
          f"{broken_now:>12.4f} {p_any:>17.2%}")
Sample run:
component                  count    AFR  expected/year   broken_now  P(any broken now)
--------------------------------------------------------------------------------------
spinning disks              1000   2.0%           20.0       0.1644            15.16%
enterprise SSDs             2000   0.8%           16.0       0.0438             4.29%
DRAM DIMMs                  6400   1.0%           64.0       0.3507            29.58%
whole VMs (host fail)       1500   3.0%           45.0       0.0411             4.03%
rack switches                 60   1.5%            0.9       0.0012             0.12%
AZ-level outage                3   0.2%            0.0       0.0002             0.02%
The most important column is broken_now — the steady-state expected number of failed components, computed as count × AFR × MTTR / 8760. For a fleet that looks like a mid-size Indian fintech (1,500 VMs, 6,400 DIMMs, 1,000 disks), the steady-state expectation is 0.16 disks broken right now and 0.35 DIMMs broken right now. The probability that at least one DRAM DIMM somewhere in the fleet is currently throwing correctable errors is about 30% at any given moment. The numbers do not say "failure may happen" — they say it is happening; the question is which one and where. P(any broken now) quantifies the probability that at least one component of that type is in a failed state right now — for DIMMs it is essentially "always" at scale.
Why MTTR (mean time to recovery) matters as much as MTBF (mean time between failures): a component with 1% AFR and 8-hour MTTR contributes 1% × 8/8760 = 0.0009% to the steady-state broken-now probability per unit. A component with the same 1% AFR and 240-hour MTTR contributes 30× more. At fleet scale, optimising MTTR — better automation, hot-swap hardware, faster RMA — is usually a larger lever than reducing AFR, because AFR is set by the materials and you cannot move it much, while MTTR is set by your operations team and you can move it a lot.
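The 30× claim checks out directly — the per-unit contribution is the same AFR × MTTR / year product the fleet script uses. A two-line sketch:

```python
HOURS_PER_YEAR = 8760

def broken_now_per_unit(afr: float, mttr_hours: float) -> float:
    """Steady-state probability one unit is down: AFR x (MTTR as fraction of a year)."""
    return afr * mttr_hours / HOURS_PER_YEAR

fast_rma = broken_now_per_unit(0.01, 8)    # 1% AFR, 8 h to recover
slow_rma = broken_now_per_unit(0.01, 240)  # same AFR, 240 h to recover

# Same hardware, same failure rate; the operations lever changes the answer 30x.
print(f"{fast_rma:.6%} vs {slow_rma:.6%} -> {slow_rma / fast_rma:.0f}x")
```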
The takeaway is the failure-rate analogue of the speed-of-light argument: redundancy is not a tool you reach for when something breaks — it is the operating point of any fleet at scale. A "well-designed" service running on hardware that will fail according to the table above must already be designed for that failure when it runs nominally. The next 14 chapters on consensus, replication, and failure detection are all consequences of this single fact.
How locality and redundancy compose — and where they fight
The two physical facts pull in opposite directions when you try to satisfy them at the same time, and that tension is the source of most of the interesting trade-offs in this curriculum.
- Locality wants the data close to the user, which means many copies in many regions.
- Redundancy wants multiple copies in different failure domains, which means many copies you must keep in agreement.
- Both want copies. Neither tells you how to reconcile them when they diverge.
That reconciliation problem is what consensus protocols, replication strategies, CRDTs, and consistency models exist to solve. The combinatorics are unforgiving: if you want 99.99% availability (Cliff 3 from chapter 1) and sub-50 ms latency for a global user base and strong consistency, you need replicas in multiple regions, you need every write to wait for cross-region coordination, and the speed-of-light floor immediately pushes your write latency above 50 ms. You cannot have all three; this is the PACELC generalisation of CAP that Part 12 makes formal.
PaySetu's architecture is the hybrid. Payment authorisation — where a duplicate write would charge a customer twice — runs strongly consistent across three Mumbai AZs at 1.5 ms RTT. Account-balance reads for the customer dashboard hit a regional read replica with up to 4 seconds of staleness during peak load, because the customer noticing their dashboard is 4 seconds stale is recoverable, while a duplicate ₹4,000 charge is not. The architecture is a direct compromise between the two physics arguments — Mumbai for the latency-and-correctness-critical path, regional replicas for the latency-but-not-correctness-critical path, and a clear written rule about which reads are which.
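A "clear written rule about which reads are which" can be encoded rather than merely documented. A minimal sketch of such a routing rule — the names `Criticality` and `pick_replica` and the 4-second budget are illustrative, not PaySetu's actual API:

```python
from enum import Enum

class Criticality(Enum):
    CORRECTNESS = "correctness"  # duplicate-charge territory: strong reads only
    LATENCY = "latency"          # dashboard territory: bounded staleness is fine

def pick_replica(criticality: Criticality, staleness_budget_s: float = 4.0) -> dict:
    """Route a read to the replica tier its criticality class allows."""
    if criticality is Criticality.CORRECTNESS:
        # Strongly consistent read inside the 3-AZ regional quorum.
        return {"tier": "regional-quorum", "max_staleness_s": 0.0}
    # Nearest read replica, staleness bounded by the written rule.
    return {"tier": "nearest-replica", "max_staleness_s": staleness_budget_s}

print(pick_replica(Criticality.CORRECTNESS))
print(pick_replica(Criticality.LATENCY))
```

The value of the code form is that new read paths must declare a criticality class to compile, which is how the written rule survives team turnover.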
Correlated failure — when the failure-rate math lies
The component-by-component AFR table at the start of this chapter assumes failures are independent. They are not. The failures that take services down are usually correlated: a power feed tripping kills every machine on that feed at the same instant; a kernel bug rolled out by a deployment process kills every replica that runs the new kernel within minutes; a TLS-certificate expiry kills every dependent service at the same wall-clock second. Independence is a flattering assumption that the math has no problem with — and that physical reality does not care about.
A useful exercise: for any given replica set, write down what would have to fail simultaneously to take all replicas down. If three replicas live in three AZs of the same region, the answer includes "the region" — and AWS, Google Cloud, and Azure all have multi-AZ regional outages on record at roughly once-per-three-years per region. If three replicas live in three regions of the same provider, the answer includes "the provider's global control plane" — and the major providers have taken global outages on record at the once-per-five-years per provider rate. The replication count buys you availability against the failures it covers; it buys nothing against the ones it doesn't.
# correlated_failure.py — what is the actual availability with correlation?
def availability(p_indep_failure, n_replicas, p_correlated_failure):
    """Crude two-mode model: P(all down) = P(common cause) + P(all individual failures | no common cause)."""
    p_indiv = p_indep_failure ** n_replicas * (1 - p_correlated_failure)
    return 1 - (p_correlated_failure + p_indiv)

scenarios = [
    # name, p_per_replica, n, p_correlated
    ("3 VMs in same rack", 0.005, 3, 0.020),           # rack-level events common
    ("3 VMs in 3 AZs same region", 0.005, 3, 0.0005),  # AZ correlation low
    ("3 VMs in 3 regions", 0.005, 3, 0.00005),         # region correlation tiny
]

print(f"{'topology':32} {'naive':>10} {'realistic':>10}")
for name, p, n, c in scenarios:
    naive = 1 - p**n  # availability if failures were truly independent
    real = availability(p, n, c)
    print(f"{name:32} {naive:>10.8f} {real:>10.5f}")
Sample run:
topology                              naive  realistic
3 VMs in same rack               0.99999988    0.98000
3 VMs in 3 AZs same region       0.99999988    0.99950
3 VMs in 3 regions               0.99999988    0.99995
The naive column is what a textbook says — "three nines per replica gives nine nines combined". The realistic column is what production looks like once correlation enters. Three replicas in one rack are about five orders of magnitude worse than the textbook claim (2 × 10⁻² unavailability instead of 1.2 × 10⁻⁷); three replicas across regions are still more than two orders of magnitude worse. The fix is failure-domain diversity — the operational discipline that says "no two replicas may share a rack, a power feed, a kernel version, a deployment cohort, or a certificate."
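That discipline is mechanisable: before accepting a placement, check that no two replicas share any failure domain. A minimal sketch — the domain names and replica attributes are illustrative:

```python
from itertools import combinations

# Failure domains that no two replicas may share (illustrative list).
DOMAINS = ("rack", "power_feed", "kernel", "deploy_cohort", "cert")

def shared_domains(replicas: list) -> list:
    """Return (replica_i, replica_j, domain) triples that violate diversity."""
    violations = []
    for (i, a), (j, b) in combinations(enumerate(replicas), 2):
        for d in DOMAINS:
            if a.get(d) is not None and a.get(d) == b.get(d):
                violations.append((i, j, d))
    return violations

replicas = [
    {"rack": "r12", "power_feed": "A", "kernel": "6.1.0", "deploy_cohort": "c1", "cert": "x1"},
    {"rack": "r13", "power_feed": "B", "kernel": "6.1.0", "deploy_cohort": "c2", "cert": "x2"},
    {"rack": "r14", "power_feed": "C", "kernel": "6.2.1", "deploy_cohort": "c3", "cert": "x3"},
]
print(shared_domains(replicas))  # [(0, 1, 'kernel')] — two replicas share a kernel version
```

Note that kernel version and deployment cohort sit beside rack and power feed: a bad rollout is a correlated-failure event exactly like a tripped breaker.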
What the physics argument forbids — and what it does not
The physics argument is sometimes used as an excuse for designs it does not in fact justify. Two patterns to be specific about:
- It does not forbid synchronous writes within a region. A 3-AZ Mumbai cluster can do strongly consistent synchronous writes with a leader-and-followers Raft setup at p99 of about 4–6 ms — well within the latency budget of any UPI transaction (the regulatory budget is 30 seconds end-to-end). Engineers sometimes claim "we cannot do strong consistency because of latency" when what they actually mean is "we have not measured our budget", and the physics argument gets misused as cover.
- It does forbid synchronous writes across continents on the hot path. A 135 ms Mumbai-Virginia round-trip means at least 270 ms for a two-phase commit (one round trip for prepare, one for commit) before you even start serialisation overhead. No clever protocol gets you below the geometric distance divided by 200,000 km/s. The systems that look like they do — Spanner's external consistency claim, for instance — pay for it elsewhere (TrueTime atomic clocks in every datacentre, expensive cross-region commit-wait pauses).
The general rule: when latency budgets are claimed to be the reason for a design choice, demand the actual measurement. If the speed-of-light floor between the two endpoints is 38 ms and the team reports a 320 ms p99 they cannot afford, the physics is responsible for 38 ms; the other 282 ms is engineering they have not done. Never blame physics for what is in fact an unprofiled service.
Why this distinction matters operationally: every team has limited engineering capacity. Spending it on speed-of-light-bound problems (CDN, edge caching, regional replicas) when the actual problem is a thread-pool tuning issue, or an unindexed query, is a misallocation that costs months and produces no improvement. The first thing to do when a latency complaint appears is to compute the speed-of-light floor for the path involved and compare it to the measured p99. If the gap is large, fix the gap. If the gap is small, only then start designing distribution.
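The triage rule in the last two paragraphs fits in a few lines. A sketch — the function name and the 3× threshold for "go profile your service first" are illustrative choices, not a standard:

```python
def latency_triage(km_one_way: float, measured_p99_ms: float) -> dict:
    """Split a measured p99 into the physics floor and the engineering gap."""
    floor_ms = 2 * km_one_way / 200_000 * 1000  # round-trip fibre floor
    gap_ms = measured_p99_ms - floor_ms
    return {
        "floor_ms": round(floor_ms, 1),
        "gap_ms": round(gap_ms, 1),
        # Well above the floor: profile and fix before designing distribution.
        "verdict": "engineering" if measured_p99_ms > 3 * floor_ms else "physics",
    }

# The Mumbai-Singapore example from the text: 38 ms is the universe's share,
# the remaining 282 ms is unprofiled service.
print(latency_triage(3_800, 320.0))
```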
Common confusions
- "Latency is the same as bandwidth." Latency is how long the first byte takes to arrive; bandwidth is how many bytes per second arrive after that. The speed-of-light floor only governs latency. You can keep adding bandwidth — fatter pipes, more parallel TCP streams, jumbo frames — and the latency floor does not move at all. Teams that buy bigger pipes to fix a latency problem are mismatching the lever to the cliff. "Latency is hard, bandwidth is cheap" is the rule that follows from the physics.
- "Failure rates can be engineered away with better hardware." Premium hardware moves AFR by maybe 2–3× — from 2.5% to 0.8% on disks, say — never to zero. A 1,000-disk fleet of premium hardware still loses 8 disks a year. Once your fleet is large enough that the expected number of failures per year is greater than 1, redundancy is mandatory, and there is no enterprise-tier component that takes that expected number below 1 at fleet scale. The vendor that promises "five nines on a single component" is selling you ad copy.
- "Speed of light is a problem for global services, not for me." It is a problem for any service whose users span more than 500 km from the leader. Mumbai to Bengaluru is 850 km — 8.5 ms of pure fibre RTT, and your application is using that. Teams running "single-region" services on AWS Mumbai routinely see 4–6 ms additional latency for users in Chennai or Kolkata, and dismiss it as noise; it is not noise, it is the universe.
- "More replicas always improves availability." Up to a point. The mathematics of correlated failure means that N replicas in the same rack is barely better than 1 replica — they all fail together when the rack's top-of-rack switch fails. Three replicas in three failure domains (different racks, different AZs, different power supplies) is qualitatively different from three replicas in one. Backblaze's storage-pod design and Google's "rack diversity" placement constraints exist because correlated failure is real, and it eats naive redundancy arguments alive.
- "Cosmic-ray bit-flips are too rare to worry about." They are about 1 per GB-year at sea level, more at altitude. A 1 TB DRAM fleet sees roughly 1,000 bit-flips per year. Most are caught by ECC; the uncorrectable rate is still around 1 per fleet per week for a mid-size fintech. Banking systems have processed exactly-zero-rupee transactions whose root cause turned out to be a flipped bit in the amount field of an in-flight RPC. The Cassini-Huygens probe used radiation-hardened DRAM for this reason, and so does every aircraft autopilot. Your laptop does not, which is why your laptop occasionally crashes.
- "The physics argument is the same as the economic argument." It is not. The economic argument (chapter 1) says that distribution becomes cheaper than vertical scaling above a certain SKU threshold — it is a price-curve argument. The physics argument says that distribution becomes the only possible answer above a certain latency budget or a certain failure-rate-tolerance threshold — it is a geometric-and-statistical argument. The two arguments often agree on the same architectural decision but for completely different reasons, and confusing them leads to designs optimised for the wrong thing.
Going deeper
The Dean numbers — latency at every layer of a real system
Jeff Dean's "Numbers Every Programmer Should Know", published as part of an internal Google talk and now canonical, lays out latencies at every layer: an L1 cache hit is 0.5 ns, a main-memory access is 100 ns, a same-rack TCP round-trip is 0.5 ms, a same-DC round-trip is 1 ms, an internet round-trip from California to Netherlands is 150 ms. The interesting structure is that these latencies span more than eight orders of magnitude — 0.5 ns to 150 ms is a factor of 3 × 10^8. No other engineering discipline routinely deals with that range; it is why distributed-systems performance reasoning has its own vocabulary (microbenchmark, fanout, tail at scale) that does not appear in single-machine performance reasoning. The arc of distributed-systems thinking from 1995 to 2026 is partly a story of slowly absorbing how big this range is and how brittle naive reasoning about "milliseconds" becomes when one operation hides 100,000× the cost of another.
Cosmic-ray bit-flips, ECC, and the silent corruption you do not notice
A cosmic ray flips a bit in a DRAM cell roughly once per gigabyte-year at sea level, more at altitude (the rate scales with neutron flux). Single-error-correcting ECC catches single-bit flips in a 64-bit word; double-error-detecting catches the rest as uncorrectable, halting the machine. No ECC catches a flip that occurs after the value leaves DRAM — in a CPU register, on the network wire, in an unprotected NIC buffer. Google's 2009 study found that DRAM error rates are 30× higher than vendor-spec sheets in production, and that 8% of DIMMs see at least one error per year. The implication for distributed systems: every node in your fleet is occasionally executing arithmetic on corrupted data, and the only protection at the system level is end-to-end checksums on every payload — which is why TCP includes a checksum, why TLS includes a MAC, why Cassandra includes per-row CRCs. The "trust no byte you did not validate" rule is a response to physics, not paranoia.
Latency budgets and the queueing tax — why p99 looks worse than physics predicts
Even when the speed-of-light floor is low, real-world p99 latency is dominated by queueing, not by propagation. A worker handling 400 requests/second with an average service time of 1 ms looks lightly loaded — utilisation is ρ = 400/s × 1 ms = 0.4 — but at p99 the queue length is non-trivial because Poisson arrivals cluster. A useful M/M/1-style approximation: p99 wait time at utilisation ρ is roughly (ρ / (1-ρ)) × service_time × ln(100) — at ρ = 0.4 that is about 3 ms of queueing on a 1 ms service. As ρ climbs past 0.7, the tax explodes — at ρ = 0.9 the mean wait alone is 9× the service time and the p99 wait by this approximation is over 40×. This is why "the tail at scale" — Dean and Barroso's 2013 paper — recommends running fleets at modest utilisation (40–60%) rather than wringing out every drop. The physics floor sets the theoretical lower bound; the queueing tax sets the practical one. Part 6 (load balancing) and Part 7 (reliability patterns, including hedged requests) build on this.
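The approximation above, tabulated. A sketch using the stated formula — an approximation for back-of-envelope work, not an exact queueing-theory derivation:

```python
import math

def p99_wait_ms(rho: float, service_ms: float = 1.0) -> float:
    """Approximate M/M/1 p99 queueing wait: (rho / (1 - rho)) * S * ln(100)."""
    return rho / (1 - rho) * service_ms * math.log(100)

# The queueing tax grows non-linearly as utilisation rises.
for rho in (0.4, 0.7, 0.9):
    print(f"rho={rho}: p99 wait ~ {p99_wait_ms(rho):.1f} ms on a 1 ms service")
```

Running it shows roughly 3 ms at ρ = 0.4, 11 ms at ρ = 0.7, and 41 ms at ρ = 0.9 — the curve behind the "run at 40–60%" recommendation.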
Reproduce this on your laptop
You can confirm the speed-of-light floor and the failure-rate math without leaving your room.
# Reproduce the physics-floor measurement on your laptop
# 1. Measure RTT to a local server vs a distant one
ping -c 200 1.1.1.1 | tail -1                # Cloudflare anycast — should be ~5-15 ms
ping -c 200 8.8.8.8 | tail -1                # Google anycast — varies by region
mtr --report --report-cycles=100 google.com  # see hop-by-hop where the time goes
# 2. Run the fleet-failure math to confirm the steady-state numbers
python3 fleet_failure_math.py
# 3. Look for ECC events: on a Linux host that has been up for more than a month
#    you will almost always find at least one corrected memory error.
sudo dmesg | grep -iE 'edac|mce|corrected'
The interesting reproduction is the third one: an idle Linux server running for 30 days will, on commodity DDR4 with ECC, accumulate one or two corrected single-bit errors in dmesg. They are routine; nobody sees them because nobody looks. Every distributed system you run is silently absorbing them all the time.
Where this leads next
The next three chapters move from the physics arguments to the formal vocabulary the rest of the curriculum needs.
- Latency, throughput, jitter — the working units of distribution — chapter 3, where the units this chapter used informally (RTT, p99, AFR) get defined precisely with measurement protocols.
- The fallacies of distributed computing — chapter 4, where L. Peter Deutsch's eight fallacies (the network is reliable, latency is zero, bandwidth is infinite, the topology doesn't change, …) get unpacked one by one with a real production failure for each.
- Why availability is a distributed problem — chapter 5, which formalises the failure-rate floor of this chapter into the math of replication for redundancy and the limits imposed by correlated failure.
By the end of Part 1, you will have the precise vocabulary to describe why a distributed system is needed in any specific case. From Part 2 onwards the question shifts: given that you are distributed, what failure modes does the network and the hardware actually present, and how does each protocol respond?
References
- The Tail at Scale — Jeff Dean, Luiz André Barroso, CACM 2013. The canonical paper on why latency tails dominate fanout-system performance and why physics-derived lower bounds drive design.
- DRAM Errors in the Wild: A Large-Scale Field Study — Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber, SIGMETRICS 2009. The Google study that showed DRAM error rates are 30× vendor-spec and 8% of DIMMs see an error per year.
- Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? — Bianca Schroeder, Garth A. Gibson, FAST 2007. Established that AFR specs from disk vendors understate field rates by 2–4×.
- Backblaze Drive Stats Q4 2024 — quarterly published AFR data on 250,000+ disks across many models. The freshest source of real-world AFR numbers.
- Speed of Light in Glass — fundamentals — Wikipedia. The 200,000 km/s figure for light in single-mode fibre is c divided by the refractive index of silica glass (~1.46).
- Latency Numbers Every Programmer Should Know — Jeff Dean's canonical latency-by-layer list, kept updated by the community as hardware evolves.
- The economic argument: scale and cost — internal cross-link. The companion argument to this chapter — when distribution is forced by cost rather than by physics.