The fallacies of distributed computing (revisited)

It is 21:14 IST on a Tuesday and Karan, on-call for PaySetu's payments-router, is staring at a Grafana panel that says the merchant-balance service is "healthy" while a separate panel says 18% of payouts are timing out at 30 s. The router calls the balance service over gRPC; the balance service responds in 4 ms p99 according to its own histograms. The two services live in the same Mumbai region, in the same Kubernetes cluster, behind the same envoy mesh. Karan has spent the last forty minutes assuming the balance service is broken — it is not. The network between them is dropping 0.4% of TCP segments because a noisy-neighbour pod on a shared t3.large is exhausting the host's conntrack table. The router's gRPC channel is silently re-establishing every 90 s. The balance service is healthy in isolation; the call is failing. Karan is debugging the wrong layer because the codepath looks like a function call, and function calls don't lose 0.4% of segments.

Peter Deutsch and James Gosling at Sun Microsystems wrote down eight assumptions in 1994–1997 that programmers make about the network without realising — that it is reliable, latency is zero, bandwidth is infinite, the network is secure, the topology never changes, there is one administrator, transport cost is zero, and the network is homogeneous. Each one is wrong, and every wrong one is the load-bearing assumption behind a class of production outages. Part 4 of this curriculum exists because the fallacies are not historical: every chapter that follows — RPC semantics, idempotency keys, message ordering, gRPC internals, wire protocols — is an explicit, deliberate response to one or more of them.

What the eight fallacies actually claim

The eight fallacies are not a manifesto; they are an accusation. Deutsch's framing is that programmers, when they write rpc.call(...), behave as if the network were reliable, instantaneous, infinite, secure, fixed, single-administered, free, and uniform. They do not necessarily believe this when asked directly. They write code that would only be correct if it were true. The bug is not the belief; the bug is the code.

[Figure] The eight fallacies as a 4-by-2 grid, each card naming one fallacy and the production failure mode it produces when assumed true:

  1. The network is reliable → silent retries → duplicate writes → split-brain (Karan's gRPC outage)
  2. Latency is zero → N+1 fan-out → chatty interfaces → deadline starvation (CricStream timeline)
  3. Bandwidth is infinite → wide rows over the wire → video CDN hot-spot → payload bloat
  4. The network is secure → trust your own VPC → unauthenticated RPCs → lateral movement
  5. Topology never changes → pinned IPs in config → stale DNS cache → broken on rollout
  6. There is one administrator → cross-team incidents → silent policy changes → DNS / firewall drift
  7. Transport cost is zero → ignored egress bills → cross-AZ chatter → serialisation CPU
  8. The network is homogeneous → MTU mismatch → TLS cipher gap → IPv4/IPv6 split
Illustrative — the eight Sun Microsystems fallacies, each annotated with the failure mode it produces in modern systems. The accent-bordered cards (1, 2, 4, 7) are the four that most often surface as outages on a 2 a.m. pager call.

Why "fallacy" and not "trade-off": a trade-off implies the engineer chose. A fallacy is a defaulted assumption — the engineer never noticed they were assuming. The eight items are deliberately chosen as the assumptions that language-level RPC abstractions (Java RMI in 1994, gRPC stubs in 2026) paper over by making the call site look identical to a local one. The fallacy is in the abstraction, then it propagates into the calling code.

The original list — Deutsch wrote down items 1–7 around 1994 as a Sun Fellow, and James Gosling appended item 8 a few years later — began as an internal Sun memo. By 1997 it had circulated widely enough that Bill Joy referenced it; by the early 2000s it was the standard slide-deck opener for any "introduction to distributed systems" talk. Twenty-nine years on, the fallacies are not historical curiosities. They are the hidden assumptions in every gRPC stub, every Lambda invocation, every cross-region database call you write today. The medium changed (Sun RPC → CORBA → SOAP → REST → gRPC); the assumptions did not.

The four that hurt the most: 1, 2, 4, 7

If you are debugging at 2 a.m., it is almost always one of these four.

Fallacy 1 — "the network is reliable"

This is the fallacy the local-function abstraction lies about most aggressively. A local function call cannot fail half-way; an RPC can succeed at the network layer, fail at the application layer, fail and then succeed on retry, succeed and then have the response dropped, or take 60 seconds and then succeed when nobody is listening any more. The probability that a long-lived TCP connection inside a data centre survives any given hour is not 1 — it is closer to 1 − 10⁻⁴ on cleanly-managed infrastructure, and worse on cloud overlay networks. Once you make 10,000 RPCs per second, a 0.01% failure rate is one failure per second, every second, forever.

The implications cascade: retries duplicate writes unless an idempotency key travels with every attempt, exactly-once delivery is impossible without consensus, leader-election protocols treat slow-but-up nodes as dead, and replication lag is a function of TCP retransmission timing. Part 4 — RPC semantics, idempotency keys, message ordering — is the response. Part 7 (reliability patterns) is the next layer: timeouts, retries, circuit breakers, hedging.
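The first item in that cascade is worth seeing in ten lines. A minimal sketch — the payout server, its in-memory key store, and the key scheme are hypothetical stand-ins, not PaySetu's actual API; a real server persists the key alongside the stored response:

# idempotent_retry_sketch.py — why naive retries duplicate writes, and how an idempotency key prevents it
import uuid

class PayoutServer:
    """Toy in-memory server; a real one persists the key with the result."""
    def __init__(self):
        self.ledger = []   # every payout actually applied
        self.seen = {}     # idempotency key -> stored response

    def payout(self, merchant, amount, idempotency_key=None):
        if idempotency_key is not None and idempotency_key in self.seen:
            return self.seen[idempotency_key]      # replay the stored response, apply nothing
        result = {"merchant": merchant, "amount": amount, "status": "applied"}
        self.ledger.append(result)
        if idempotency_key is not None:
            self.seen[idempotency_key] = result
        return result

server = PayoutServer()
server.payout("m-42", 5000)                         # response lost on the wire (fallacy 1)...
server.payout("m-42", 5000)                         # ...so the client retries: a second ledger entry
print("without key:", len(server.ledger), "payouts applied")    # 2 — duplicate write

server = PayoutServer()
key = str(uuid.uuid4())                             # one key per logical operation, chosen by the caller
server.payout("m-42", 5000, idempotency_key=key)    # response lost on the wire...
server.payout("m-42", 5000, idempotency_key=key)    # ...retry replays the stored response
print("with key:   ", len(server.ledger), "payouts applied")    # 1 — the retry is safe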

Fallacy 2 — "latency is zero"

Latency is not zero. It is not negligible. It is not asymptotically falling. The speed of light through fibre is about 200,000 km/s — Mumbai to Singapore (~3,900 km) is a 19.5 ms one-way floor before any network equipment; the actual measured TCP RTT is 60–80 ms. A request that fans out to ten back-end services serially over that link spends at least 600 ms in flight — ten round trips at 60 ms each — even with zero processing time at every hop.

The fallacy's lethal form is N+1 fan-out: a synchronous service that calls another service, which calls another, which calls another, each adding 8 ms p50 / 80 ms p99. Six calls deep, the sum of the per-leg p99s is already 480 ms, and the tails compound: with six independent legs the chance that at least one hits its p99 is 1 − 0.99⁶ ≈ 5.9%, not 1% (the effect Jeff Dean's The Tail at Scale quantifies), so the root's tail stretches well past any single leg's budget. CricStream's live-cricket scoreboard, which fans out to a player-stats service, a match-state service, a commentary service, and an ads service, has to bound this with strict deadline propagation or the user sees a blank score for two seconds. Part 7 (timeouts, deadlines, hedged requests) is the response.
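A quick simulation makes the compounding concrete. The two-point latency model below — 99% of calls at 8 ms, 1% at 80 ms — is an illustrative assumption, not a measured profile:

# tail_compounding_sketch.py — six serial legs, each 8 ms p50 / 80 ms p99 (crude two-point model)
import random

random.seed(7)

def leg_latency_ms():
    # 99% of calls take 8 ms, 1% take 80 ms — the simplest distribution with that p50/p99
    return 80.0 if random.random() < 0.01 else 8.0

def chain_latency_ms(depth):
    return sum(leg_latency_ms() for _ in range(depth))

samples = sorted(chain_latency_ms(6) for _ in range(100_000))
p50 = samples[len(samples) // 2]
p99 = samples[int(len(samples) * 0.99)]
slow = sum(1 for s in samples if s > 6 * 8) / len(samples)

print(f"p50 of the 6-deep chain: {p50:.0f} ms")          # ~48 ms — six fast legs
print(f"p99 of the 6-deep chain: {p99:.0f} ms")          # ~120 ms — at least one slow leg
print(f"requests with a slow leg: {slow:.2%}")           # ~5.9%, against 1% per individual call

The per-leg p99 never changes; what changes is how often a single request meets it.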

Fallacy 4 — "the network is secure"

The local-function abstraction tells you that calling account.balance(user_id) is the same operation whether the call never leaves the process's memory or crosses 80 km of fibre. It is not. The network call traverses devices owned by people who are not you — switches, load balancers, ISPs, possibly state actors. Anything you do not encrypt is observable, anything you do not authenticate is forgeable. The 2014 Heartbleed bug exposed this at the TLS layer; the 2018 BGP hijack of the national UPI switch's announcement (briefly mis-routed to a small ISP for 23 minutes) exposed it at the routing layer.

In modern infrastructure the response is mTLS-everywhere (every service-to-service call presents a client certificate), zero-trust architecture (no network is "internal"), and fine-grained authorisation at every RPC. Part 4 (gRPC internals, ch.22) covers how channel credentials work; Part 19 (chaos engineering) covers how to test the absence of authentication-bypass paths.
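A minimal sketch of the channel-credentials half with the Python grpc package — the certificate paths and target below are placeholders (real deployments usually get both from the mesh or a secrets manager), and per-RPC authorisation still happens on the server:

# mtls_channel_sketch.py — every service-to-service call presents a client certificate
import grpc

def mutual_tls_channel(target):
    """Build a channel that verifies the server and authenticates this client.
    The paths are placeholders for wherever your platform mounts certificates."""
    with open("/etc/certs/ca.pem", "rb") as f:
        trusted_ca = f.read()        # CA that signed the server's certificate
    with open("/etc/certs/client.key", "rb") as f:
        client_key = f.read()        # this service's private key
    with open("/etc/certs/client.pem", "rb") as f:
        client_cert = f.read()       # this service's certificate — its identity on the wire

    creds = grpc.ssl_channel_credentials(
        root_certificates=trusted_ca,
        private_key=client_key,
        certificate_chain=client_cert,
    )
    return grpc.secure_channel(target, creds)

# channel = mutual_tls_channel("merchant-balance.internal:443")
# stub = BalanceServiceStub(channel)   # generated stub; the server still authorises each RPC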

Fallacy 7 — "transport cost is zero"

Cloud egress is the line item your CFO notices. Cross-AZ traffic on AWS is currently $0.01–0.02/GB; cross-region is $0.02–0.09/GB depending on direction. A service that ships 1 TB/day cross-region is $20–90/day, $7,000–32,000/year. PaySetu's first multi-region rollout had a chatty merchant_lookup cache that fanned out to peer regions on every miss; the egress bill alone added ₹18 lakh/month before the architecture review caught it. The fallacy is not that engineers don't know egress costs money — it is that they write code as if a cross-region call costs what a same-host call costs (which is roughly zero, modulo CPU). The codepath looks identical; the bill is not.

The two costs are the egress bill — money per gigabyte crossing an AZ, region, or internet boundary — and the marshalling cost: the CPU both ends burn serialising and deserialising payloads that a same-host call would have passed by reference.
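The first of those is easy to put numbers on — a back-of-the-envelope sketch with illustrative per-GB rates (check your provider's current price sheet):

# egress_cost_sketch.py — what "the codepath looks identical" costs once it crosses a region boundary
GB_PER_DAY = 1_000                        # 1 TB/day of cross-region chatter
CROSS_REGION_USD_PER_GB = (0.02, 0.09)    # illustrative range; varies by provider and direction

for rate in CROSS_REGION_USD_PER_GB:
    per_day = GB_PER_DAY * rate
    print(f"at ${rate:.2f}/GB: ${per_day:,.0f}/day, ${per_day * 365:,.0f}/year")
# at $0.02/GB: $20/day, $7,300/year
# at $0.09/GB: $90/day, $32,850/year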

The four that hurt slowly: 3, 5, 6, 8

These don't usually take you down at 2 a.m. They erode capacity, performance, or maintainability over months.

Fallacy 3 — bandwidth is infinite. Modern data-centre links are 25–400 Gbps, which feels infinite per-flow but is shared. CricStream pushes 8 Tbps during a cricket final across its CDN; if 5% of that hits origin (the CDN miss rate during a wicket-burst), origin sees 400 Gbps — exactly one network card's worth. Wide rows on the wire (sending a 200-column row to a service that needs 3 columns) are the SQL-side version. Per-RPC bandwidth budgets are the discipline.
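A back-of-the-envelope sketch of the wide-row version — the column count, per-column size, and request rate below are illustrative assumptions:

# wide_rows_sketch.py — the same query, two payload shapes (illustrative numbers)
AVG_COLUMN_BYTES = 40          # rough serialised size per column
ROWS_PER_SECOND = 50_000       # aggregate request rate hitting the service

for columns in (200, 3):
    bytes_per_row = columns * AVG_COLUMN_BYTES
    gbps = bytes_per_row * ROWS_PER_SECOND * 8 / 1e9
    print(f"{columns:3d} columns -> {bytes_per_row:5,d} B/row -> {gbps:.2f} Gbps on the wire")
# 200 columns -> 8,000 B/row -> 3.20 Gbps
#   3 columns ->   120 B/row -> 0.05 Gbps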

Fallacy 5 — topology never changes. Pods get rescheduled. AZs go down. DNS records flip. A pinned IP in application.yaml is a 4 a.m. pager call waiting for a routine cluster autoscaling event. Service discovery (Part 5) is the response.

Fallacy 6 — there is one administrator. Your service depends on someone else's database, which depends on someone else's network, which depends on a cloud provider whose interpretation of "us-east-1" is different from yours. Cross-team / cross-org coordination is a distributed-systems problem with humans as the nodes; the FLP impossibility result still applies, only slower.

Fallacy 8 — the network is homogeneous. Two regions on the same cloud provider can have different MTUs, different TLS cipher suites available, different IPv4/IPv6 dual-stack behaviour, different conntrack defaults. Most of the time it doesn't matter; one Tuesday afternoon it does, and your 9 KB Protobuf payload silently disappears because the sending host believes the path MTU is 9000 while one switch on the spine still has jumbo frames disabled (MTU 1500), and the ICMP "fragmentation needed" reply that would have reported it never makes it back.

Why this 4-and-4 split is durable: fallacies 1, 2, 4, 7 are about per-call invariants — every call has a probability of failure, a non-zero latency, a non-zero security risk, a non-zero cost. They scale with your call rate. Fallacies 3, 5, 6, 8 are about environment invariants — they describe what is or isn't true about your infrastructure as a whole, and don't compound per call. The first set you hit early; the second set you hit at scale or during change events.

Quantifying fallacy 1 — a runnable demonstration

The most common reaction to "the network is unreliable" is "sure, but my network is fine." It isn't. The following script demonstrates how a 0.4% per-segment loss rate produces a ~5% first-attempt failure rate on a typical 14-segment RPC — and nearly one in five on larger payloads — the kind of multiplication that explains why gRPC channels mysteriously close on services that "look healthy".

# fallacy1_packet_loss_compounds.py — tiny model of how per-segment loss compounds
import random

def rpc_succeeds(per_segment_loss: float, segments_per_rpc: int) -> bool:
    """An RPC succeeds iff every TCP segment in the request and response delivers.
    TCP retransmits, but each retransmit adds 200ms to the RTT under typical
    retransmit timeouts. We model only the *first-attempt* success here."""
    for _ in range(segments_per_rpc):
        if random.random() < per_segment_loss:
            return False
    return True

def measure(per_segment_loss: float, segments_per_rpc: int, rpcs: int) -> dict:
    successes = sum(rpc_succeeds(per_segment_loss, segments_per_rpc) for _ in range(rpcs))
    return {
        "per_segment_loss": per_segment_loss,
        "segments_per_rpc": segments_per_rpc,
        "rpcs": rpcs,
        "first_attempt_success_rate": successes / rpcs,
        "first_attempt_failure_rate": 1 - successes / rpcs,
    }

# 10 KB request + 10 KB response = ~14 segments at 1500-byte MTU
# Conservative cloud per-segment loss rate: 0.001 (one in a thousand)
# Karan's noisy-neighbour incident: 0.004 (one in 250)
random.seed(42)
for loss in [0.0001, 0.001, 0.004, 0.01]:
    for segments in [1, 14, 50, 200]:
        r = measure(loss, segments, 100_000)
        print(
            f"loss={loss:.4f}  segments={segments:3d}  "
            f"first_attempt_failure={r['first_attempt_failure_rate']:.4%}"
        )
    print("---")

Sample run:

loss=0.0001  segments=  1  first_attempt_failure=0.0090%
loss=0.0001  segments= 14  first_attempt_failure=0.1370%
loss=0.0001  segments= 50  first_attempt_failure=0.4920%
loss=0.0001  segments=200  first_attempt_failure=2.0150%
---
loss=0.0010  segments=  1  first_attempt_failure=0.0980%
loss=0.0010  segments= 14  first_attempt_failure=1.3810%
loss=0.0010  segments= 50  first_attempt_failure=4.8770%
loss=0.0010  segments=200  first_attempt_failure=18.060%
---
loss=0.0040  segments=  1  first_attempt_failure=0.4030%
loss=0.0040  segments= 14  first_attempt_failure=5.4490%
loss=0.0040  segments= 50  first_attempt_failure=18.110%
loss=0.0040  segments=200  first_attempt_failure=55.230%
---
loss=0.0100  segments=  1  first_attempt_failure=0.9970%
loss=0.0100  segments= 14  first_attempt_failure=13.140%
loss=0.0100  segments= 50  first_attempt_failure=39.500%
loss=0.0100  segments=200  first_attempt_failure=86.600%

The load-bearing line is if random.random() < per_segment_loss: return False — every segment is an independent Bernoulli trial, so success probability is (1 - p) ^ N and failure probability is 1 - (1 - p) ^ N. The numbers tell the story Karan was missing: at a per-segment loss of 0.4% (his actual incident rate, measured later via tcpdump on the host) and a 14-segment RPC (typical gRPC request + response with TLS overhead), 5.4% of first-attempt RPCs fail. With TCP retransmit kicking in at 200 ms RTO, the user-visible tail latency p99 jumps from 4 ms to 204 ms — exactly the symptom on the dashboard. The balance service was fine. The call had a 5.4% chance of needing a retransmit, and the gRPC client's 30 s deadline absorbed most of them, but enough fell off the deadline to produce the 18% timeout rate (because some RPCs had multiple lost segments and exhausted multiple retransmit windows).

Why per-segment and not per-RPC loss is the right primitive: TCP loss is observed at the segment level (1500-byte MSS by default), and a single RPC payload spans many segments. A 0.001 per-segment loss rate sounds reassuring; multiplied across 50 segments per RPC it becomes a 4.9% per-RPC first-attempt failure. The kernel's TCP stack hides the retransmits from your application code, so the symptom shows up as latency spikes, not as packet-loss alarms.
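The closed form gives the same numbers without the simulation — useful as a sanity check against whatever loss rate you measure on your own path:

# closed_form_check.py — first-attempt RPC failure = 1 - (1 - p)^N
for p in (0.0001, 0.001, 0.004, 0.01):
    for n in (1, 14, 50, 200):
        print(f"loss={p:.4f}  segments={n:3d}  analytic_failure={1 - (1 - p) ** n:.4%}")
    print("---")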

A second production tale — CricStream's IPL final, fallacy 2 in concrete form

CricStream's live-cricket player carries video plus a real-time scoreboard widget. The scoreboard renders six elements: current score, last-six balls, batter strike rates, bowler economy, run-rate graph, win-probability percentage. Each element is served by a different microservice — the score-state service, the ball-by-ball service, the player-stats service, the bowling-stats service, the run-rate service, the win-probability service. The original architecture, written in 2021 when CricStream had 4 million peak viewers, called these services sequentially from a Node.js gateway, each call adding 6 ms p50 / 60 ms p99.

For the IPL final on 26 May 2024, peak concurrent viewers hit 27 million. The scoreboard's p99 latency went from 360 ms (acceptable) to 4,200 ms (catastrophic — viewers saw a frozen scoreboard for four seconds after every wicket, exactly when they wanted information). The post-mortem found three things at once. First, fallacy 2: latency is not zero, and serial fan-out compounds tail latencies multiplicatively (the famous "tail at scale" effect — if each leaf has 1% chance of being slow, six in series have a 5.85% chance of some being slow). Second, fallacy 1: the rate of TCP retransmits inside the service mesh went from 0.001 (normal) to 0.012 (under load) because the kernel's conntrack table on each node was 87% full — the network was less reliable under load. Third, fallacy 7: the gateway and the back-ends had been deployed across two AZs because the cost of cross-AZ traffic (₹0.85/GB at the Indian-region rate) had been deemed acceptable in 2021, but the chatty fan-out meant that 60% of every scoreboard render now crossed an AZ boundary, contributing both latency and money.

The fix was to merge the six calls into a single composite-RPC handled by a thin aggregator service co-located with the back-ends, parallelise within that aggregator, and apply a 250 ms hard deadline with default values for any leg that missed it. p99 fell from 4,200 ms to 380 ms — still longer than a single back-end call, but bounded. The architecture now treats fallacies 1, 2, and 7 as first-class design constraints, not afterthoughts. The fix is not interesting; the fact that the system was designed without those constraints in 2021 — by smart engineers who knew the fallacies in the abstract but defaulted to the local-function model — is the durable lesson.
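A minimal sketch of the "after" shape — parallel fan-out under a hard deadline, with a default for any leg that misses it. The service names, latencies, and fetchers below are hypothetical stand-ins, not CricStream's code:

# aggregator_sketch.py — parallel fan-out, 250 ms deadline, per-leg defaults
import asyncio
import random

DEADLINE_S = 0.250
DEFAULTS = {"score": "-", "balls": [], "bat": {}, "bowl": {}, "run_rate": None, "win_prob": None}

async def fetch(name):
    # Stand-in for an RPC to one back-end: usually fast, occasionally slower than the deadline.
    await asyncio.sleep(random.choice([0.006, 0.006, 0.006, 0.060, 0.400]))
    return f"{name}-payload"

async def fetch_or_default(name):
    try:
        return name, await asyncio.wait_for(fetch(name), timeout=DEADLINE_S)
    except asyncio.TimeoutError:
        return name, DEFAULTS[name]            # a missed leg degrades one widget, not the whole page

async def render_scoreboard():
    legs = [fetch_or_default(name) for name in DEFAULTS]
    return dict(await asyncio.gather(*legs))   # bounded by the slowest leg, capped at 250 ms

print(asyncio.run(render_scoreboard()))

The deadline is the design decision: a leg that cannot answer within 250 ms is answered with a default, so the scoreboard stays bounded and is only ever as stale as its slowest widget.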

[Figure] CricStream — serial fan-out vs parallel aggregator. Before: the gateway calls six services sequentially (score, balls, bat-stats, bowl-stats, run-rate, win-prob), 60 ms p99 each, 360 ms minimum, ballooning to 4,200 ms p99 under load as retransmits magnify the compounding tails. After: one composite call to an aggregator that fans out in parallel under a 250 ms deadline, trimming the long tail with defaults — 380 ms p99, an 11× drop.
Illustrative — CricStream's scoreboard before and after the IPL-final retrofit. The serial chain is the natural shape if you write the gateway as if calling a local function six times; the parallel aggregator is the shape that respects fallacy 2.

Going deeper

Where the eight came from — Sun, 1994

Peter Deutsch was at Sun Microsystems' Fellow programme in 1991–1996 working on Self and Smalltalk; the fallacies started as a list of seven that he circulated internally as part of pushback against Sun's NFS / RPC architecture. The argument was that NFS pretended remote files were local and every layer of NFS — its caching, its idempotency model, its locking — was an attempt to paper over a fallacy. James Gosling, working on Java RMI, added fallacy 8 ("the network is homogeneous") around 1996 as Java was being deployed in environments with mixed JVM versions, mixed network stacks, and mixed security policies. The list was first published externally in Bill Joy's 1997 talk and gained wide circulation through Arnon Rotem-Gal-Oz's 2006 paper "Fallacies of Distributed Computing Explained", which is still the canonical extended treatment. The original is a Sun memo; the durable form is Rotem-Gal-Oz's elaboration.

What the fallacies miss in 2026 — Sundberg's ninth

Subsequent writers have proposed additional fallacies. The most useful is what Niclas Sundberg called fallacy 9: the system is observable. In 1994, the assumption that you could just "log in to the box and tail the log" was largely true. In 2026, with 1,200 pods across three regions behind an envoy mesh routing through a service-discovery layer that updates every 5 seconds, the assumption that you can observe what your distributed system is doing is itself a fallacy. The observability fallacy is the load-bearing assumption behind every "we'll add a metric for that later" decision; OpenTelemetry, distributed tracing, and structured logging (Part 18 of this curriculum) are the response. The reason fallacy 9 deserves a place: in production debugging, the gap between what happened and what your tooling can show you is often larger than the gap between any other pair of fallacies.

The fallacies as a debugging checklist

When a production incident lands and the symptom is "service A is slow / failing / inconsistent", the fallacies are a structured first-pass checklist. Walk down the eight: is the network reliable on this path right now (run ping -c 100, look at the packet-loss percentage)? Is latency in line with the topology (run tracepath, check the RTT against great-circle distance × 10 µs/km — 5 µs/km of fibre each way)? Is bandwidth saturated (check NIC drop counters, ethtool stats)? Is the network secure on this path (any cert errors in logs)? Has topology changed (recent deploys, autoscaling events, AZ failures)? Has any administrator made a change (recent firewall, route-table, DNS edits)? Are you crossing an AZ / region boundary (egress logs)? Is the path homogeneous (MTU mismatches, TLS cipher mismatches)? Going through this list adds 8 minutes to the debugging session and saves 80% of the time in cases where the symptom is mis-attributed to the application.

Reproduce this on your laptop

# Reproduce the fallacy-1 simulation
python3 -m venv .venv && source .venv/bin/activate
python3 fallacy1_packet_loss_compounds.py

# Measure your own path's per-segment loss
sudo tcpdump -i any -nn 'tcp and host <peer-ip>' -c 10000 -w /tmp/cap.pcap
# Then run the kernel-side counters:
ss -tin | grep -E 'retrans|lost'  # retransmit and lost-segment counts per socket

# Demonstrate fallacy 2 with a synthetic fan-out
pip install httpx  # asyncio ships with the standard library
python3 -c "
import asyncio, httpx, time
async def call(c, i): r = await c.get(f'http://httpbin.org/delay/0.06'); return r.status_code
async def serial():
    async with httpx.AsyncClient() as c:
        t0 = time.time()
        for i in range(6): await call(c, i)
        return time.time() - t0
async def parallel():
    async with httpx.AsyncClient() as c:
        t0 = time.time()
        await asyncio.gather(*[call(c, i) for i in range(6)])
        return time.time() - t0
print('serial:  ', asyncio.run(serial()))
print('parallel:', asyncio.run(parallel()))
"

Where this leads next

Part 4 is the curriculum's response to the fallacies. Each chapter that follows is shaped by one or more of them.

Beyond Part 4, the fallacies recur. Part 7's reliability patterns (retries, circuit breakers, hedging) are the systematic response to fallacy 1. Part 10's failure detection is fallacy 1 quantified — phi-accrual converts "is the network down?" from a yes/no into a continuous suspicion score. Part 17's geo-distribution chapters are fallacy 2 + fallacy 7 at planet-scale.

The shorter version: every distributed-systems primitive in this curriculum exists because at least one of the eight fallacies bit somebody, hard, in production. The fallacies are the theory; everything that follows is the engineering response.

References

  1. Fallacies of Distributed Computing Explained — Arnon Rotem-Gal-Oz, 2006. The canonical extended treatment of all eight fallacies; the elaboration that turned the Sun memo into curriculum material.
  2. The Tail at Scale — Dean & Barroso, CACM 2013. The load-bearing argument for fallacy 2 — why latency tails compound non-linearly under fan-out.
  3. A Note on Distributed Computing — Waldo, Wyant, Wollrath, Kendall, Sun Microsystems Laboratories TR-94-29, 1994. The internal Sun paper from the same period that argues — independently of Deutsch's list — that local and remote calls are not the same.
  4. Latency Numbers Every Programmer Should Know — Jeff Dean's commonly-cited table; quantifies fallacy 2 across the modern memory and network hierarchy.
  5. The Network is Reliable — Kyle Kingsbury (Aphyr), 2014. A catalogue of real-world network-partition incidents from the major cloud providers; the empirical case for fallacy 1.
  6. BGP Hijacks: A Brief Guide — Cloudflare, 2018. Concrete production case for fallacy 4: the network you do not control still routes your packets.
  7. Wall: without time you still need order — internal cross-link to Part 3's closing chapter, which sets up the agreement problems that Part 4 begins to answer.
  8. Designing Data-Intensive Applications — Kleppmann, O'Reilly 2017. Chapter 8 ("The Trouble with Distributed Systems") is the modern textbook treatment that revisits the fallacies through 2010s-era systems.