The physics argument: latency and failure

Aditi runs the order-routing service for KapitalKite. Her dashboard alerts at 09:14:23 IST: cross-region replication lag from Mumbai to Singapore has spiked from a usual 38 ms to 74 ms for thirty seconds. She opens the runbook expecting a noisy-neighbour incident on the leader. The on-call from the network provider replies in three minutes: a submarine cable repair has rerouted Mumbai-Singapore traffic via Chennai, adding 3,600 km of fibre. Aditi cannot fix this with code, with a bigger box, or with a circuit breaker. The universe added 36 ms to her round-trip and she will live with it until the cable is spliced.

This is the physics argument. The previous chapter showed that cost forces distribution at scale; this one shows that two physical facts force distribution before cost ever enters the conversation — the speed of light is finite, and components fail at a rate set by the materials they are made of. Every primitive in the next 137 chapters is a response to one of these two facts.

Light traverses fibre at roughly 200,000 km/s — about 5 µs per kilometre — so the round-trip latency between two points is bounded below by twice the geometric distance between them divided by that speed. Hardware fails at rates measured in failures-per-billion-device-hours that, once you multiply by fleet size, guarantee something is broken right now. Distribution is the only honest response: locality solves the speed-of-light bound, redundancy solves the failure rate. Caching is just locality with a different name; replication is just redundancy with a protocol attached.

The speed-of-light floor — what fibre actually buys you

The first physical fact: information cannot travel faster than light, and light in optical fibre travels at about two-thirds of c — roughly 200,000 km/s, or 5 microseconds per kilometre, one way. There is no engineering trick that beats this. You can shorten the cable, you can reduce serialisation overhead, you can pipeline more aggressively — but the geometric distance between two cities, divided by 200,000 km/s, multiplied by two for the round trip, is the lower bound on any RPC's latency, and your application sees that floor whether you like it or not.

[Figure: Speed-of-light RTT floors between Indian and global regions — fibre at 200,000 km/s, doubled for the round trip. Mumbai (leader) to Bengaluru: 850 km, 8.5 ms RTT floor; to Singapore: 3,800 km, 38 ms; to Frankfurt: 6,300 km, 63 ms; to Virginia: 13,500 km, 135 ms. For comparison, same-AZ RTT is ~0.2 ms and cross-AZ Mumbai RTT is ~1.5 ms.]
Illustrative — distances are great-circle, real fibre paths are 20–40% longer because cables route around landmasses and through cable-landing stations. Actual measured RTTs sit ~1.4× the figures shown.

The numbers in the figure compound brutally for synchronous designs. PaySetu replicates UPI transactions from Mumbai to a hot-standby in Singapore for disaster recovery. If they choose synchronous replication — wait for the Singapore replica to ack before returning success to the merchant — every payment now pays a 38 ms minimum penalty that no infrastructure choice can remove. At 8,000 TPS, that 38 ms is 304 thread-seconds of latency burned per real-time second; you need 304 threads parked just to absorb the speed of light.
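That 304 figure is just Little's law (concurrency equals throughput times latency), and the arithmetic fits in two lines; the filename is illustrative:

# little_law.py — the concurrency cost of a synchronous cross-region hop
tps, added_latency_s = 8_000, 0.038       # figures from the PaySetu example above
print(f"threads parked absorbing the speed of light: {tps * added_latency_s:.0f}")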

# rtt_floor.py — measure the irreducible round-trip and the practical RTT
import time, socket, statistics

# Replace with hosts you control in two regions; this is a measurement skeleton.
TARGETS = [
    ("ap-south-1.local",   "10.0.1.5",  443),  # same VPC, same AZ
    ("ap-south-1b.local",  "10.0.2.7",  443),  # same VPC, other AZ
    ("ap-southeast-1.dr",  "10.20.1.3", 443),  # Singapore DR replica
]

def tcp_rtt(host_ip, port, samples=200):
    rtts = []
    for _ in range(samples):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        t0 = time.perf_counter_ns()
        s.connect((host_ip, port))     # connect() returns after ~1 RTT (SYN out, SYN-ACK back)
        t1 = time.perf_counter_ns()
        s.close()
        rtts.append((t1 - t0) / 1e6)   # ns -> ms; the connect time is already ~one round trip
    return rtts

def fibre_floor_ms(km):
    return 2 * km / 200_000 * 1000  # 200,000 km/s, doubled for round trip

print(f"{'target':24} {'p50_ms':>8} {'p99_ms':>8} {'fibre_floor_ms':>16}")
for name, ip, port in TARGETS:
    samples = tcp_rtt(ip, port)
    p50, p99 = statistics.median(samples), sorted(samples)[int(0.99 * len(samples))]
    floor = {"ap-south-1.local": 0.04, "ap-south-1b.local": 0.30,
             "ap-southeast-1.dr": fibre_floor_ms(3800)}[name]
    print(f"{name:24} {p50:>8.2f} {p99:>8.2f} {floor:>16.2f}")

Sample run:

target                   p50_ms   p99_ms   fibre_floor_ms
ap-south-1.local           0.18     0.42             0.04
ap-south-1b.local          1.38     2.10             0.30
ap-southeast-1.dr         52.40    74.80            38.00

The same-AZ p50 of 0.18 ms is 4.5× the speed-of-light floor of 0.04 ms — that 4.5× is overhead from NIC queues, kernel scheduling, and TCP handshake state machines, and it is the best you will ever do. The cross-AZ p50 of 1.38 ms is also about 4–5× its 0.30 ms floor, the same overhead ratio. The Mumbai-Singapore p50 of 52.40 ms is 1.4× its 38 ms floor — a tighter ratio because long-distance paths amortise the per-hop overhead over more kilometres of pure fibre. fibre_floor_ms is the universe's contribution; everything above it is your system's contribution. The point of the script is to measure the gap between the two — when the gap is large, you have engineering work to do; when the gap is small, you have already extracted everything fibre will give you and the only remaining lever is to move the work closer to the user.

Why the fibre-floor calculation uses 200,000 km/s and not 300,000 km/s: the speed of light in vacuum is c ≈ 300,000 km/s, but light in glass moves at c/n where n ≈ 1.46 for typical single-mode optical fibre. So the propagation speed is ~205,000 km/s, conventionally rounded to 200,000 for back-of-envelope work. This is why no fibre RTT will ever beat the floor in the script — the constant is a property of glass, not a property of the equipment at the endpoints.
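A two-line check of where the constant comes from, using c ≈ 299,792 km/s and the n ≈ 1.46 figure above (filename illustrative):

# fibre_speed_check.py — the origin of the 200,000 km/s back-of-envelope constant
c_vacuum_km_s = 299_792           # speed of light in vacuum
n_fibre = 1.46                    # refractive index of silica single-mode fibre
print(c_vacuum_km_s / n_fibre)    # ~205,337 km/s, rounded down to 200,000 for estimates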

The speed-of-light floor is what makes locality — caching, edge computing, regional replicas, CDNs — non-negotiable for global services. It is the entire economic justification for Cloudflare, Akamai, and the regional-replica patterns in Part 17. You cannot make Mumbai close to Virginia; you can put a copy of Mumbai's data in Virginia.

The failure-rate floor — components fail by the laws of materials

The second physical fact is uglier than the first because it cannot be averaged away. Every component in your fleet has a non-zero probability of failing at any instant, and once you multiply that probability by the size of your fleet, something is failing right now. The exact failure modes vary — a memory module hits a single-bit error from cosmic rays, an SSD's flash cells wear out after their rated write count, a fan bearing seizes, a power supply's electrolytic capacitors dry out — but the aggregate behaviour follows a small handful of distributions, and once you internalise those distributions, redundancy stops feeling like an option and starts feeling like accounting.

The numbers, calibrated against published Backblaze drive-failure reports, Google's DRAM error studies, and the standard data-centre AFR (annualised failure rate) tables for 2024–2026:

# fleet_failure_math.py — what is broken right now in your fleet?
import math

COMPONENTS = [
    # (name, count, annual_failure_rate, mttr_hours)
    ("spinning disks",        1000,  0.020,  72),    # 3 days to swap and rebuild
    ("enterprise SSDs",       2000,  0.008,  24),    # 1 day to swap
    ("DRAM DIMMs",            6400,  0.010,  48),    # 2 days incl burn-in
    ("whole VMs (host fail)", 1500,  0.030,   8),    # auto-recovery on cloud
    ("rack switches",           60,  0.015,  12),    # half-day to RMA
    ("AZ-level outage",          3,  0.002, 240),    # 10 days incremental + DR test
]

print(f"{'component':24} {'count':>6} {'AFR':>6} {'expected/year':>14} "
      f"{'broken_now':>12} {'P(any broken now)':>18}")
print("-" * 86)
hours_per_year = 8760
for name, count, afr, mttr in COMPONENTS:
    expected = count * afr
    broken_now = expected * mttr / hours_per_year
    # P at least one broken: 1 - (1 - p_each)^count, with p_each = afr * mttr/year
    p_each = afr * mttr / hours_per_year
    p_any = 1 - math.pow(1 - p_each, count)
    print(f"{name:24} {count:>6} {afr:>6.1%} {expected:>14.1f} "
          f"{broken_now:>12.2f} {p_any:>17.2%}")

Sample run:

component                  count    AFR  expected/year   broken_now  P(any broken now)
--------------------------------------------------------------------------------------
spinning disks              1000   2.0%         20.000       0.1644            15.16%
enterprise SSDs             2000   0.8%         16.000       0.0438             4.29%
DRAM DIMMs                  6400   1.0%         64.000       0.3507            29.58%
whole VMs (host fail)       1500   3.0%         45.000       0.0411             4.03%
rack switches                 60   1.5%          0.900       0.0012             0.12%
AZ-level outage                3   0.2%          0.006       0.0002             0.02%

The most important column is broken_now — the steady-state expected number of failed components, computed as count × AFR × MTTR / 8760. For a fleet that looks like a mid-size Indian fintech (1,500 VMs, 6,400 DIMMs, 1,000 disks), the steady-state expectation is 0.16 disks broken right now and 0.35 DIMMs broken right now. P(any broken now) is the companion figure: the probability that at least one component of that type is in a failed state at this instant — for DRAM DIMMs it is 29.58% at any given moment, and at larger fleet sizes it rounds to "always". The numbers do not say "failure may happen" — they say it is happening; the only questions are which one and where.

Why MTTR (mean time to recovery) matters as much as MTBF (mean time between failures): a component with 1% AFR and 8-hour MTTR contributes 1% × 8/8760 = 0.0009% to the steady-state broken-now probability per unit. A component with the same 1% AFR and 240-hour MTTR contributes 30× more. At fleet scale, optimising MTTR — better automation, hot-swap hardware, faster RMA — is usually a larger lever than reducing AFR, because AFR is set by the materials and you cannot move it much, while MTTR is set by your operations team and you can move it a lot.
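The same arithmetic as a function you can point at your own fleet's numbers; the AFR and MTTR values below are the ones from the paragraph above, and the filename is illustrative:

# mttr_lever.py — broken-now contribution per unit: AFR × MTTR / hours-in-year
def broken_now_fraction(afr, mttr_hours, hours_per_year=8760):
    return afr * mttr_hours / hours_per_year

fast = broken_now_fraction(0.01, 8)      # 1% AFR,   8 h MTTR -> broken ~0.0009% of the time
slow = broken_now_fraction(0.01, 240)    # 1% AFR, 240 h MTTR -> broken ~0.0274% of the time
print(f"{fast:.4%}  {slow:.4%}  ratio = {slow / fast:.0f}x")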

The takeaway is the failure-rate analogue of the speed-of-light argument: redundancy is not a tool you reach for when something breaks — it is the operating point of any fleet at scale. A "well-designed" service running on hardware that will fail according to the table above must already be designed for that failure when it runs nominally. The next 14 chapters on consensus, replication, and failure detection are all consequences of this single fact.

How locality and redundancy compose — and where they fight

The two physical facts pull in opposite directions when you try to satisfy them at the same time. Locality wants copies of the data close to every user, which means copies scattered across continents; redundancy wants those copies to agree with each other, which means reconciling them across exactly the distances locality was trying to stop paying for. That tension is the source of most of the interesting trade-offs in this curriculum.

That reconciliation problem is what consensus protocols, replication strategies, CRDTs, and consistency models exist to solve. The combinatorics are unforgiving: if you want 99.99% availability (Cliff 3 from chapter 1) and sub-50 ms latency for a global user base and strong consistency, you need replicas in multiple regions, you need every write to wait for cross-region coordination, and the speed-of-light floor immediately pushes your write latency above 50 ms. You cannot have all three; this is the PACELC generalisation of CAP that Part 12 makes formal.
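To make the combinatorics concrete, a small sketch: the distances and the ~1.4× fibre-to-measured ratio are the ones quoted earlier in this chapter, and the three-replica topology is illustrative.

# quorum_budget.py — can a Mumbai-led, multi-region quorum write fit a 50 ms budget?
RTT_FLOOR_MS = {"Singapore": 38, "Frankfurt": 63}   # fibre floors from the figure above
MEASURED_RATIO = 1.4        # measured RTTs run ~1.4x the great-circle floor

# Three replicas: Mumbai (leader), Singapore, Frankfurt. A majority is 2 of 3, so the
# leader needs at least one cross-region ack before acknowledging the write; the
# cheapest possible ack is Singapore.
floor_ms = min(RTT_FLOOR_MS.values())
print(f"write-latency floor: {floor_ms} ms, realistic: ~{floor_ms * MEASURED_RATIO:.0f} ms")
# 38 ms of pure propagation, ~53 ms measured -- the 50 ms budget is gone before the
# request has touched a queue, a disk, or a TLS handshake.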

[Figure: Locality versus redundancy — what each costs, and the usual compromise. Locality first: replicas in Mumbai, Singapore and Frankfurt give local users an 8 ms RTT, but cross-region sync is ~60 ms, replicas may diverge, and losing a region can lose data. Redundancy first: three replicas across AZs in one region synchronise in 1.5 ms with strong consistency, but far-away users pay ~135 ms RTT. Hybrid (most real systems): a 3-AZ leader in Mumbai with async read replicas — strong reads in region, stale reads at the edge.]
The hybrid is what almost every real production system at scale looks like — strong consistency in the home region, eventual consistency at the edge. The choice of which reads can tolerate staleness is the application-level cost of the speed-of-light floor.

PaySetu's architecture is the hybrid. Payment authorisation — where a duplicate write would charge a customer twice — runs strongly consistent across three Mumbai AZs at 1.5 ms RTT. Account-balance reads for the customer dashboard hit a regional read replica with up to 4 seconds of staleness during peak load, because the customer noticing their dashboard is 4 seconds stale is recoverable, while a duplicate ₹4,000 charge is not. The architecture is a direct compromise between the two physics arguments — Mumbai for the latency-and-correctness-critical path, regional replicas for the latency-but-not-correctness-critical path, and a clear written rule about which reads are which.

Correlated failure — when the failure-rate math lies

The component-by-component AFR table earlier in this chapter assumes failures are independent. They are not. The failures that take services down are usually correlated: a power feed tripping kills every machine on that feed at the same instant; a kernel bug rolled out by a deployment process kills every replica that runs the new kernel within minutes; a TLS-certificate expiry kills every dependent service at the same wall-clock second. Independence is a flattering assumption that the math has no problem with — and that physical reality does not care about.

A useful exercise: for any given replica set, write down what would have to fail simultaneously to take all replicas down. If three replicas live in three AZs of the same region, the answer includes "the region" — and AWS, Google Cloud, and Azure all have multi-AZ regional outages on record at roughly once-per-three-years per region. If three replicas live in three regions of the same provider, the answer includes "the provider's global control plane" — and the major providers have taken global outages on record at the once-per-five-years per provider rate. The replication count buys you availability against the failures it covers; it buys nothing against the ones it doesn't.

# correlated_failure.py — what is the actual availability with correlation?
def availability(p_indep_failure, n_replicas, p_correlated_failure):
    """Crude two-mode model: P(all down) = P(common cause) + P(all individual failures | no common cause)."""
    p_indiv = p_indep_failure ** n_replicas * (1 - p_correlated_failure)
    return 1 - (p_correlated_failure + p_indiv)

scenarios = [
    # name, p_per_replica, n, p_correlated
    ("3 VMs in same rack",     0.005, 3, 0.020),    # rack-level events common
    ("3 VMs in 3 AZs same region", 0.005, 3, 0.0005),   # AZ correlation low
    ("3 VMs in 3 regions",     0.005, 3, 0.00005),  # region correlation tiny
]
print(f"{'topology':32} {'naive':>10} {'realistic':>10}")
for name, p, n, c in scenarios:
    naive = 1 - p**n            # if independent
    real  = availability(p, n, c)
    print(f"{name:32} {naive:>10.5f} {real:>10.5f}")

Sample run:

topology                                naive    realistic
3 VMs in same rack                 0.99999988   0.97999988
3 VMs in 3 AZs same region         0.99999988   0.99949988
3 VMs in 3 regions                 0.99999988   0.99994988

The naive column is what a textbook says — 99.5% availability per replica composes to nearly seven nines when the three failures are independent. The realistic column is what production looks like once correlation enters. Three replicas in one rack are about five orders of magnitude worse than the textbook claim; three replicas across regions are still more than two orders of magnitude worse. The fix is failure-domain diversity — the operational discipline that says "no two replicas may share a rack, a power feed, a kernel version, a deployment cohort, or a certificate."
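The discipline is mechanisable. A minimal placement-audit sketch; the inventory fields and values here are hypothetical, not any provider's API:

# diversity_audit.py — flag replica pairs that share a failure domain (illustrative)
from itertools import combinations

replicas = [   # hypothetical inventory
    {"name": "r1", "rack": "A3", "power_feed": "PF-1", "kernel": "6.1.0-18", "cert": "edge-2026-03"},
    {"name": "r2", "rack": "B1", "power_feed": "PF-2", "kernel": "6.1.0-18", "cert": "edge-2026-03"},
    {"name": "r3", "rack": "B1", "power_feed": "PF-2", "kernel": "6.1.0-21", "cert": "edge-2026-03"},
]
DOMAINS = ["rack", "power_feed", "kernel", "cert"]

for a, b in combinations(replicas, 2):
    shared = [d for d in DOMAINS if a[d] == b[d]]
    if shared:
        print(f"{a['name']} and {b['name']} share: {', '.join(shared)}")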

What the physics argument forbids — and what it does not

The physics argument is sometimes used as an excuse for designs it does not in fact justify: most often by blaming the speed of light for latency that is really unprofiled engineering overhead, or by reaching for distribution when the real bottleneck is local.

The general rule: when latency budgets are claimed to be the reason for a design choice, demand the actual measurement. If the speed-of-light floor between the two endpoints is 38 ms and the team reports a 320 ms p99 they cannot afford, the physics is responsible for 38 ms; the other 282 ms is engineering they have not done. Never blame physics for what is in fact an unprofiled service.

Why this distinction matters operationally: every team has limited engineering capacity. Spending it on speed-of-light-bound problems (CDN, edge caching, regional replicas) when the actual problem is a thread-pool tuning issue, or an unindexed query, is a misallocation that costs months and produces no improvement. The first thing to do when a latency complaint appears is to compute the speed-of-light floor for the path involved and compare it to the measured p99. If the gap is large, fix the gap. If the gap is small, only then start designing distribution.
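The triage fits in a few lines. A sketch of the comparison, reusing the numbers from the example above (the path and the p99 are illustrative):

# latency_triage.py — is this complaint physics or engineering?
def fibre_floor_ms(km):
    return 2 * km / 200_000 * 1000   # fibre at 200,000 km/s, doubled for the round trip

measured_p99_ms = 320     # the number in the complaint
path_km = 3_800           # Mumbai -> Singapore great-circle distance
floor = fibre_floor_ms(path_km)
print(f"physics owns {floor:.0f} ms; the remaining {measured_p99_ms - floor:.0f} ms is engineering")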

Common confusions

Going deeper

The Dean numbers — latency at every layer of a real system

Jeff Dean's "Numbers Every Programmer Should Know", published as part of an internal Google talk and now canonical, lays out latencies at every layer: an L1 cache hit is 0.5 ns, a main-memory access is 100 ns, a same-rack TCP round-trip is 0.5 ms, a same-DC round-trip is 1 ms, an internet round-trip from California to Netherlands is 150 ms. The interesting structure is that these latencies span nine orders of magnitude — 0.5 ns to 150 ms is a factor of 3 × 10^8. No other engineering discipline routinely deals with that range; it is why distributed-systems performance reasoning has its own vocabulary (microbenchmark, fanout, tail at scale) that does not appear in single-machine performance reasoning. The arc of distributed-systems thinking from 1995 to 2026 is partly a story of slowly absorbing how big this range is and how brittle naive reasoning about "milliseconds" becomes when one operation hides 100,000× the cost of another.

Cosmic-ray bit-flips, ECC, and the silent corruption you do not notice

A cosmic ray strikes a DRAM cell roughly once per gigabyte-day at sea level, and more often at altitude (the rate scales with neutron flux). Single-error-correcting ECC catches single-bit flips in a 64-bit word; double-error detection flags two-bit flips as uncorrectable and halts the machine. No ECC catches a flip that occurs after the value leaves DRAM — in a CPU register, on the network wire, in an unprotected NIC buffer. Google's 2009 study found that DRAM error rates in production are 30× higher than vendor spec sheets suggest, and that 8% of DIMMs see at least one error per year. The implication for distributed systems: every node in your fleet is occasionally executing arithmetic on corrupted data, and the only protection at the system level is end-to-end checksums on every payload — which is why TCP includes a checksum, why TLS includes a MAC, why Cassandra includes per-row CRCs. The "trust no byte you did not validate" rule is a response to physics, not paranoia.
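What "trust no byte you did not validate" looks like at the application layer: a minimal sketch built on a CRC, not any particular system's wire format, just the shape of an end-to-end check.

# e2e_checksum.py — validate the payload end to end, not just hop by hop
import zlib

def wrap(payload: bytes) -> bytes:
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def unwrap(frame: bytes) -> bytes:
    crc, payload = int.from_bytes(frame[:4], "big"), frame[4:]
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch: corrupted in flight, in a buffer, or in DRAM")
    return payload

frame = wrap(b"debit account 42 by 4000")
print(unwrap(frame))                                  # round-trips cleanly
flipped = bytearray(frame); flipped[10] ^= 0x01       # simulate a single-bit flip
try:
    unwrap(bytes(flipped))
except ValueError as err:
    print("caught:", err)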

Latency budgets and the queueing tax — why p99 looks worse than physics predicts

Even when the speed-of-light floor is low, real-world p99 latency is dominated by queueing, not by propagation. A server handling 400 requests/second with an average service time of 1 ms looks lightly loaded — 0.4 ms of work arrives per ms of clock, so utilisation ρ = 0.4 — but at p99 the queue length is non-trivial because Poisson arrivals cluster. The classic M/M/1 approximation: p99 wait time at utilisation ρ is roughly (ρ / (1-ρ)) × service_time × ln(100) — at ρ = 0.4 that is about 3 ms of queueing on a 1 ms service. As ρ climbs past 0.7 the tax explodes: at ρ = 0.9 the mean wait alone is 9× the service time and the p99 wait is more than 40×. This is why "the tail at scale" — Dean and Barroso's 2013 paper — recommends running fleets at modest utilisation (40–60%) rather than wringing out every drop. The physics floor sets the theoretical lower bound; the queueing tax sets the practical one. Part 6 (load balancing) and Part 7 (reliability patterns, including hedged requests) build on this.
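The approximation in one loop, with the service time fixed at 1 ms (a sketch of the heuristic above, not an exact M/M/1 percentile):

# queueing_tax.py — p99-wait approximation across utilisations
import math

def p99_wait_ms(rho, service_ms=1.0):
    # mean wait (rho / (1 - rho)) * S, scaled by ln(100) for the 99th percentile
    return (rho / (1 - rho)) * service_ms * math.log(100)

for rho in (0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"rho = {rho:.1f}: p99 queueing ≈ {p99_wait_ms(rho):5.1f} ms on a 1 ms service")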

Reproduce this on your laptop

You can confirm the speed-of-light floor and the failure-rate math without leaving your room.

# Reproduce the physics-floor measurement on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install simpy

# 1. Measure RTT to a local server vs a distant one
ping -c 200 1.1.1.1 | tail -1            # Cloudflare anycast — should be ~5-15 ms
ping -c 200 8.8.8.8 | tail -1            # Google anycast — varies by region
mtr --report-cycles=100 google.com       # see hop-by-hop where the time goes

# 2. Run the closed-form fleet-failure arithmetic, then the simpy cross-check
#    sketched below, to confirm the steady-state numbers
python3 fleet_failure_math.py

# 3. Test cosmic-ray ECC events: dmesg | grep -i 'edac\|mce' on a Linux host that runs
# for more than a month — you will almost always find at least one corrected memory error.
sudo dmesg | grep -iE 'edac|mce|corrected'
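Step 2's simpy cross-check is a discrete-event version of the closed-form arithmetic. A minimal sketch for the DIMM row; the filename is illustrative and it assumes exponentially distributed times between failures:

# fleet_failure_sim.py — Monte-Carlo cross-check of the DIMM row
import random
import simpy

N, AFR, MTTR_H, YEARS, HOURS_PER_YEAR = 6400, 0.01, 48, 10, 8760

def component(env, tally):
    while True:
        yield env.timeout(random.expovariate(AFR / HOURS_PER_YEAR))  # hours until next failure
        tally["failures"] += 1
        yield env.timeout(MTTR_H)                                    # down for the repair window
        tally["down_hours"] += MTTR_H

env = simpy.Environment()
tally = {"failures": 0, "down_hours": 0.0}
for _ in range(N):
    env.process(component(env, tally))
env.run(until=YEARS * HOURS_PER_YEAR)

print(f"failures/year: simulated {tally['failures'] / YEARS:.1f}  vs closed form {N * AFR:.1f}")
print(f"broken now:    simulated {tally['down_hours'] / (YEARS * HOURS_PER_YEAR):.2f}"
      f"  vs closed form {N * AFR * MTTR_H / HOURS_PER_YEAR:.2f}")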

The interesting reproduction is the third one: an idle Linux server running for 30 days will, on commodity DDR4 with ECC, accumulate one or two corrected single-bit errors in dmesg. They are routine; nobody sees them because nobody looks. Every distributed system you run is silently absorbing them all the time.

Where this leads next

The next three chapters move from the physics arguments to the formal vocabulary the rest of the curriculum needs.

By the end of Part 1, you will have the precise vocabulary to describe why a distributed system is needed in any specific case. From Part 2 onwards the question shifts: given that you are distributed, what failure modes do the network and the hardware actually present, and how does each protocol respond?

References

  1. The Tail at Scale — Jeff Dean, Luiz André Barroso, CACM 2013. The canonical paper on why latency tails dominate fanout-system performance and why physics-derived lower bounds drive design.
  2. DRAM Errors in the Wild: A Large-Scale Field Study — Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber, SIGMETRICS 2009. The Google study that showed DRAM error rates are 30× vendor-spec and 8% of DIMMs see an error per year.
  3. Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? — Bianca Schroeder, Garth A. Gibson, FAST 2007. Established that AFR specs from disk vendors understate field rates by 2–4×.
  4. Backblaze Drive Stats Q4 2024 — quarterly published AFR data on 250,000+ disks across many models. The freshest source of real-world AFR numbers.
  5. Speed of Light in Glass — fundamentals — Wikipedia. The 200,000 km/s figure for light in single-mode fibre is c divided by the refractive index of silica glass (~1.46).
  6. Latency Numbers Every Programmer Should Know — Jeff Dean's canonical latency-by-layer list, kept updated by the community as hardware evolves.
  7. The economic argument: scale and cost — internal cross-link. The companion argument to this chapter — when distribution is forced by cost rather than by physics.