The wall: distributed code is a failure-first design

Aditi joined PaySetu six months ago from a desktop-software shop where her last big bug was a memory leak that grew over 40 hours. On her first on-call rotation she gets paged at 23:47: the merchant settlement job has been running for four hours and 18% of the rows are stuck in state=PROCESSING with no terminal status. She SSHes into the worker, finds the process alive, finds the database row for one stuck merchant, and finds that the gateway service has logged a 200 OK for the disbursement at 19:51 while the worker has logged a requests.exceptions.ReadTimeout at 19:52. Both are correct. The disbursement happened; the worker does not know. There is no exception to catch, no traceback to read, no test that would have failed in CI. The bug is not in any line of code anyone has written — the bug is the gap between two lines of code on two different machines, and the gap is where the network lived for one second too long. Aditi is, at 23:47, hitting the wall this chapter is named after.

Distributed code is not single-machine code with retries bolted on. It is code where every operation has a third outcome — neither success nor failure but uncertainty — and where failure is a steady-state probability rather than an exception. Designing failure-first means assuming each call will partially complete, designing the receiver to be idempotent, and writing the recovery path before the happy path.

The third outcome — every operation can return UNCERTAIN

A function call on a single machine has two outcomes: it returns a value or it raises an exception. Both are observable, both are local, and both are honest — if you got a value, the function ran; if you got an exception, it didn't complete, and the stack trace tells you why.

A network call has three outcomes. The first two look familiar: a response (the operation completed and you have its result) or a definite failure (connection refused, DNS resolution failed, the kernel returned ECONNREFUSED before any byte left the box). The third is the one that breaks every assumption single-machine code is built on: a timeout, a retryable error, or a socket reset after the request was sent. The receiver may have processed it, may not have; the sender does not know, will not know without an out-of-band probe, and the system has to make a decision anyway.

This is not a rare edge case. On a healthy LAN with 0.01% TCP retransmit rate, an RPC service handling 10,000 calls per second sees ~1 timeout per second across the fleet — 86,400 per day. Most are resolved by retry-with-idempotency, but every one of them passes through the UNCERTAIN state for some bounded interval, and every one of them must be designed for. The most common failure mode of mid-career engineers moving from monoliths to distributed systems is forgetting that the third outcome exists, writing code that assumes "no exception means success", and discovering at 02:30 that they have shipped a service which, on timeout, retries a non-idempotent payment debit.

The frequency rises sharply during incidents. During a partial AZ failure, a slow GC pause, or a noisy-neighbour CPU spike, the timeout rate on a service can move from 0.01% to 5%+ within seconds — meaning a service that handled 1 timeout-per-second in steady state now handles 500. Every one of those is an UNCERTAIN outcome that the system must handle correctly while the underlying cause is still fluctuating, while the on-call engineer is still reading the dashboard, while upstream callers are also retrying their portion of the work. Failure-first design is the assumption that the system will spend non-trivial time in this regime — not occasionally, but as a regular part of operations, especially during deploys, scaling events, and dependency upgrades. The design rule that follows: every code path that runs at 0.01% in normal traffic must have been mentally executed at 5%, because that is the rate it will run at during your worst-case Diwali-evening incident.

[Figure: The three outcomes of a network call vs the two of a function call. A two-panel diagram. Left: a function call — f(x) on one CPU — with two outcome arrows, return value and exception. Right: a network call — an RPC across the wire — with three outcome arrows: response (committed), definite failure (refused), and UNCERTAIN (timeout / connection reset after send), with the uncertain branch highlighted as the new outcome.]
Illustrative — the third outcome is what makes distributed code different. Designing the system to make `UNCERTAIN` safe to retry is the entire game of failure-first design.

Why the sender cannot distinguish "lost request" from "lost reply": both look identical from the sender's vantage point — the socket sat there, no bytes came back, the deadline expired. Only the receiver knows whether it processed the request, and asking the receiver requires another network call which is itself subject to the same three-outcome rule. There is no fixed point that resolves this; the only resolution is to make the operation idempotent and retry, or to accept the bounded uncertainty and reconcile later.
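The ambiguity is easy to reproduce on one laptop. In the sketch below — ports, sleeps, and handler behaviour are all invented for illustration — one toy server drops the request and another drops the reply; the client's exception is identical in both cases, and only the receiver-side side_effects list reveals which server actually committed the operation.

# lost_request_vs_lost_reply.py — both failures look the same to the sender
import asyncio

side_effects = []

async def lost_request(reader, writer):
    await asyncio.sleep(10)            # the request never reaches the handler logic

async def lost_reply(reader, writer):
    await reader.read(100)
    side_effects.append("debited")     # the operation DID happen...
    await asyncio.sleep(10)            # ...but the reply never comes back

async def call(port):
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"withdraw 100")
    await writer.drain()
    try:
        return await asyncio.wait_for(reader.read(100), timeout=0.2)
    except asyncio.TimeoutError:
        return "UNCERTAIN"             # same outcome whichever side lost the message
    finally:
        writer.close()

async def main():
    s1 = await asyncio.start_server(lost_request, "127.0.0.1", 9001)
    s2 = await asyncio.start_server(lost_reply, "127.0.0.1", 9002)
    print("lost request ->", await call(9001))      # UNCERTAIN
    print("lost reply   ->", await call(9002))      # UNCERTAIN
    print("receiver-side effects:", side_effects)   # ['debited'] — only one committed
    s1.close(); s2.close()

asyncio.run(main())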

What "failure-first" actually means in code

Failure-first does not mean writing more try / except blocks. It means inverting the control structure of the function: the failure path is the trunk of the tree, the success path is a branch. The single-machine version of withdraw(account, amount) is roughly:

1. read balance
2. check sufficient
3. write new balance
4. return success

The failure-first version is:

1. generate idempotency_key (or accept one from caller)
2. check whether this idempotency_key has been seen — if yes, return prior result
3. read balance with explicit consistency level (leader / quorum / follower)
4. attempt write with explicit acknowledgement requirement (W=quorum)
5. on partial-success, decide: block the client until reconciled, or return UNCERTAIN
6. record the (idempotency_key, result, timestamp) so step 2 works on retry
7. on UNCERTAIN, schedule a background reconciliation job that will eventually resolve
   the truth — by reading the receiver, by checking the WAL, by waiting for replication
8. emit a structured log event with trace_id so the auditor can reconstruct what happened

Steps 1, 2, 5, 6, 7, 8 do not exist in the single-machine version because they have no analogue — there is no idempotency-key concept when the function only runs once, no partial-success when there is one place state lives, no background reconciliation when there is no other replica that might hold the truth. The body of a distributed function is mostly the failure-handling and the bookkeeping; the actual business logic is a small fraction of the line count. Engineers who measure productivity by lines-of-business-logic-per-day get whiplash when they cross into distributed work and discover their throughput dropped by 4×. The lines did not disappear; they migrated into the receiver, the message broker, the deduplication store, the reconciliation job, and the trace-id propagation library.
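To make the shape concrete, here is a minimal sketch of the steps the runnable artefact later in this section does not cover — explicit consistency levels (steps 3–4), partial-success handling (step 5), reconciliation (step 7), and structured logging (step 8). quorum_read, quorum_write, and the in-memory stores are hypothetical stubs standing in for a real datastore and job queue.

# failure_first_skeleton.py — the eight steps as a skeleton (stores are stubs)
import json, time, uuid

seen = {}                 # steps 2/6: idempotency records (stub for a real table)
reconcile_queue = []      # step 7: reconciliation jobs (stub for a real job queue)

def quorum_read(account):                  # step 3 stub: a real store takes a consistency level
    return {"balance": 10_000}

def quorum_write(account, new_balance):    # step 4 stub: reports how many replicas acked
    return {"acked": 2, "required": 2}

def withdraw(account, amount, idempotency_key=None):
    key = idempotency_key or str(uuid.uuid4())          # step 1
    if key in seen:                                     # step 2: replay prior result
        return seen[key]
    balance = quorum_read(account)["balance"]           # step 3
    if balance < amount:
        result = {"status": "REJECT", "balance": balance}
    else:
        ack = quorum_write(account, balance - amount)   # step 4
        if ack["acked"] >= ack["required"]:
            result = {"status": "OK", "balance": balance - amount}
        else:                                           # step 5: partial success
            result = {"status": "UNCERTAIN", "balance": None}
            reconcile_queue.append(key)                 # step 7: resolve the truth later
    seen[key] = result                                  # step 6: make step 2 work on retry
    print(json.dumps({"event": "withdraw", "trace_id": key,   # step 8
                      "ts": time.time(), **result}))
    return result

withdraw("acct-1", 100, idempotency_key="req-42")
withdraw("acct-1", 100, idempotency_key="req-42")       # retry: step 2 replays, no double debit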

The phrase "write the recovery path before the happy path" is more than a slogan. It means: when you design a new distributed operation, the first artefact you produce is not the API contract or the database schema — it is the answer to the question "if this operation fails halfway through, what does the system look like, and what brings it back to a clean state?" If the answer is "I'll add a reconciliation script later", you are deferring the design problem to your future on-call self at 02:30. If the answer is "I'll add a manual-fix runbook", you are coupling system correctness to human availability — which is acceptable for low-volume operations and disastrous for high-volume ones. The honest answer is one of: (a) the operation is idempotent and the client retries until it succeeds, (b) the operation is part of a saga with explicit compensating actions if any step fails, (c) the operation has a bounded UNCERTAIN window during which the client can poll for resolution, (d) the operation is rejected entirely until preconditions are guaranteed (e.g. the system holds a fencing-token lease). Anything else — "we'll figure it out", "it's so unlikely we don't need to handle it", "it's the database's problem" — is the residue of single-machine thinking, and the operations team will pay for it.

The runnable artefact below shows the smallest non-trivial failure-first call I can fit into one Python file. It uses asyncio to model concurrent client requests, simulates a flaky receiver, and demonstrates how at-least-once delivery plus an idempotency cache becomes effectively-once at the application layer. Critically, it includes the code path for the race where the same idempotency-key arrives from a retry while the original request is still in flight — the kind of race nobody draws on a whiteboard but production hits constantly.

# failure_first_withdraw.py — at-least-once + idempotency = "effectively-once"
import asyncio, random, time, hashlib

class Server:
    def __init__(self, fail_rate=0.30):
        self.balance = 10_000
        self.idempotency_cache = {}        # key -> (result, balance_after, ts)
        self.in_flight = {}                # key -> asyncio.Event for de-duping in-flight retries
        self.fail_rate = fail_rate
        self.processed = 0

    async def withdraw(self, key: str, amount: int):
        # Step 1: dedupe by idempotency key — instant return on retry of completed work
        if key in self.idempotency_cache:
            res, bal, _ = self.idempotency_cache[key]
            return ("REPLAY", res, bal)
        # Step 2: dedupe of in-flight retries (same key, original still running)
        if key in self.in_flight:
            await self.in_flight[key].wait()
            res, bal, _ = self.idempotency_cache[key]
            return ("JOINED-INFLIGHT", res, bal)
        evt = asyncio.Event(); self.in_flight[key] = evt
        try:
            await asyncio.sleep(random.uniform(0.02, 0.08))    # processing time
            if random.random() < self.fail_rate:               # simulate post-commit reply drop
                if self.balance >= amount:
                    self.balance -= amount                     # we DID commit, but reply will be lost
                    self.processed += 1                        # the debit is real even though the reply dies
                    self.idempotency_cache[key] = ("OK", self.balance, time.time())
                raise asyncio.TimeoutError("network ate the reply")
            if self.balance < amount:
                self.idempotency_cache[key] = ("REJECT", self.balance, time.time())
                return ("OK", "REJECT", self.balance)
            self.balance -= amount
            self.processed += 1
            self.idempotency_cache[key] = ("OK", self.balance, time.time())
            return ("OK", "OK", self.balance)
        finally:
            evt.set(); del self.in_flight[key]

async def client(server, name, amount, max_retries=4):
    key = hashlib.sha1(f"{name}:{amount}".encode()).hexdigest()[:12]
    for i in range(max_retries):
        try:
            outcome, status, bal = await asyncio.wait_for(server.withdraw(key, amount), timeout=0.15)
            return (i, outcome, status, bal)
        except asyncio.TimeoutError:
            await asyncio.sleep(0.05 * (2 ** i))               # exponential backoff
    return (max_retries, "EXHAUSTED", None, None)

async def main():
    random.seed(11)
    s = Server(fail_rate=0.40)
    results = await asyncio.gather(*[client(s, f"merch-{i}", 100) for i in range(8)])
    for i, (retries, outcome, status, bal) in enumerate(results):
        print(f"merch-{i}: retries={retries} outcome={outcome:18} status={status} bal_after={bal}")
    print(f"\nfinal balance={s.balance}  processed={s.processed}  cache_size={len(s.idempotency_cache)}")

asyncio.run(main())

Sample run:

merch-0: retries=2 outcome=REPLAY             status=OK   bal_after=9900
merch-1: retries=1 outcome=REPLAY             status=OK   bal_after=9800
merch-2: retries=0 outcome=OK                 status=OK   bal_after=9700
merch-3: retries=3 outcome=REPLAY             status=OK   bal_after=9600
merch-4: retries=0 outcome=OK                 status=OK   bal_after=9500
merch-5: retries=1 outcome=REPLAY             status=OK   bal_after=9400
merch-6: retries=0 outcome=OK                 status=OK   bal_after=9300
merch-7: retries=2 outcome=REPLAY             status=OK   bal_after=9200

final balance=9200  processed=8  cache_size=8

The interesting line is processed=8 while five of the eight clients retried 1–3 times. Eight merchants were each debited exactly ₹100; the fleet made 17 network attempts in total (8 originals plus 9 retries). That ratio — 17 attempts producing 8 effects — is the fingerprint of at-least-once-plus-idempotency working correctly. The REPLAY outcomes are not failures; they are the system telling the client "I already did your work, here is the result I committed." Without the idempotency_cache lookup at the top of withdraw, those retries would have been fresh debits — up to nine extra ₹100 debits in this run — and the merchant balances would read as low as ₹8300 instead of ₹9200. The dozen-odd lines of cache-and-dedupe logic in Server.withdraw are not error-handling decoration; they are the substance of the function. A version of this code without them would pass tests, ship, and silently double-debit during the next network glitch.

Why the in_flight event matters separately from the cache: when the same idempotency-key arrives twice in quick succession (client retries before the first request even completes), a pure cache check would miss — neither call has cached a result yet — and both would proceed, executing the side-effect twice. The in_flight map makes the second arrival wait for the first, then read its result from the cache. This is the most subtle dedup case and the one inexperienced engineers leave out; it shows up as the rare "every other retry double-debits" production bug.

Why exponential backoff (0.05 * (2 ** i)) and not constant retry: when a downstream is overloaded, every client retrying at fixed intervals creates a thundering herd that prolongs the overload. Exponential backoff with jitter spreads the retry pressure across time, giving the downstream a chance to drain its queue. Without it, the retry storm itself becomes the outage. Part 7 covers the math (Dean & Barroso 2013); for now, the rule is: never retry without backoff.
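The demo's backoff is deterministic, so clients that fail at the same instant retry at the same instant — synchronized waves can survive pure exponential backoff. Jitter breaks the synchrony. A minimal sketch of the "full jitter" scheme (the base and cap values are illustrative):

import random

def backoff_with_jitter(attempt, base=0.05, cap=2.0):
    # Full jitter: sleep a uniform-random time in [0, min(cap, base * 2^attempt)].
    # The cap bounds the worst-case wait; the randomness spreads a synchronized
    # retry wave across the whole window instead of one instant.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# drop-in for the demo's client: await asyncio.sleep(backoff_with_jitter(i))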

A subtler failure-first move that the snippet above does not show, but every production system grows: retry budgets. Even with exponential backoff, an unbounded retry loop will, under sustained downstream failure, consume client resources (open file descriptors, in-flight goroutines, memory holding the un-acked request) until the client itself crashes. The discipline is to set a per-call retry budget (e.g. ≤3 retries) and a per-second fleet-wide retry-budget (e.g. retries may not exceed 10% of original requests), enforced by a token-bucket at the client. When the budget is exhausted, the client returns UNCERTAIN upstream and lets its caller decide what to do — the unsafe-to-retry signal propagates up the call chain instead of consuming the system. Google's gRPC retry-budget implementation is the canonical reference; Envoy and Linkerd ship similar primitives. Without retry budgets, a small downstream blip turns into a system-wide retry amplification — service A sees its calls to B fail, retries 3×; service Z calls A and now sees its A-calls fail (because A is busy retrying), retries Z→A 3×; the original 1 failure turns into 9 fleet-wide attempts, the network saturates, the partial outage becomes total. Part 7 quantifies the threshold; the design rule is unambiguous — every retry path needs a budget upstream of it.
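A minimal sketch of the token-bucket idea — the ratio and burst numbers are illustrative, and real implementations (gRPC's, Envoy's) differ in detail:

class RetryBudget:
    # Every original request deposits `ratio` tokens (capped at `burst`);
    # every retry withdraws one. Sustained retries above `ratio` of original
    # traffic drain the bucket and are refused.
    def __init__(self, ratio=0.1, burst=10):
        self.ratio, self.burst = ratio, burst
        self.tokens = float(burst)

    def on_request(self):
        self.tokens = min(self.burst, self.tokens + self.ratio)

    def can_retry(self):
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False        # budget exhausted: return UNCERTAIN upstream instead

budget = RetryBudget()
allowed = 0
for _ in range(200):        # every original request fails and asks for one retry
    budget.on_request()
    allowed += budget.can_retry()
print(allowed)              # ≈ 30: the initial burst plus ~10% of subsequent traffic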

A companion technique worth naming explicitly is deadline propagation. When the user's mobile request to PaySetu's gateway has a 2-second deadline (because at 2 seconds the customer's mobile app shows a loading spinner that has crossed the threshold of patience), every downstream call the gateway makes must inherit a shrinking deadline — if 1.4 seconds have already been spent on auth and risk-scoring, the call to the wallet service must have a deadline of ≤0.6 seconds. The wallet's call to the ledger gets ≤0.4 seconds, and so on. Without propagation, a partial failure deep in the stack burns 8 seconds of work that the caller already gave up on after 2 — pure waste, and worse, the side effect of that work commits long after the user has retried, producing a double-debit. gRPC propagates deadlines across hops natively (surfaced in Go as a context.Context); plain-HTTP services typically carry the deadline in an application-defined header such as X-Request-Deadline. The discipline is to never call a downstream without a deadline, never use a fixed deadline when a propagated one is available, and always check the remaining budget before starting expensive work. Failure-first design is not just about handling failure when it arrives; it is about not generating failure by doing work nobody is waiting for.
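A minimal sketch of the shrinking-budget discipline — the service names and per-call costs mirror the numbers above; Deadline is a hypothetical helper, not a library API:

import time

class Deadline:
    # Carries the user's absolute deadline; every hop reads what remains.
    def __init__(self, seconds):
        self.expires_at = time.monotonic() + seconds

    def remaining(self):
        return self.expires_at - time.monotonic()

def call_downstream(name, deadline, cost):
    budget = deadline.remaining()
    if budget < cost:                 # check the remaining budget BEFORE working
        raise TimeoutError(f"{name}: needs {cost:.1f}s, only {budget:.2f}s left")
    time.sleep(cost)                  # stand-in for the actual RPC
    print(f"{name}: done, {deadline.remaining():.2f}s of the user's budget left")

d = Deadline(2.0)                     # the user's 2-second patience budget
call_downstream("auth+risk", d, 1.4)  # leaves ~0.6s
call_downstream("wallet", d, 0.4)     # leaves ~0.2s
try:
    call_downstream("ledger", d, 0.4) # refused up front: the work would be pure waste
except TimeoutError as e:
    print(e)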

Designing the receiver — make the inbox the system of record

A second consequence of failure-first design is that the receiver, not the caller, becomes responsible for ordering and de-duplication. On a single machine, the caller's stack frame is the order; on a network, the caller cannot enforce order because messages can be lost, reordered by routers, or delivered out-of-order from a partitioned queue. The receiver must accept that messages arrive in any order and multiple times, and design its state machine so the steady-state result is correct under both.

The pattern is sometimes called idempotent inbox + at-least-once delivery. The MealRush order pipeline learned this the hard way: every kitchen acknowledgement was sent over a Kafka topic, the consumer was a stateless service that decremented inventory on each KITCHEN_ACK event, and during a brief broker rebalance the consumer's offset slipped backwards by 200 messages. Two hundred kitchen acknowledgements were re-played; two hundred inventory rows decremented twice; the next morning's prep list was wrong by exactly that much. The fix was three lines — store the last_processed_event_id per kitchen partition in the same DB transaction as the inventory write, reject events with event_id <= last_processed. The bug's existence was a clue that the receiver had been designed assuming each event arrives exactly once, when the broker contract was at-least-once. The lesson is universal: whenever the consumer's state changes based on an event, store the event-id alongside the state change in the same atomic write, and reject duplicates.
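A sketch of that fix, with sqlite3 standing in for the real database — table and column names are invented, and the watermark approach assumes event-ids are monotonic per partition, which Kafka offsets are:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE inventory (kitchen_id TEXT PRIMARY KEY, qty INT);
    CREATE TABLE watermark (partition_id TEXT PRIMARY KEY, last_event_id INT);
    INSERT INTO inventory VALUES ('k1', 100);
    INSERT INTO watermark VALUES ('p0', 0);
""")

def on_kitchen_ack(partition_id, event_id, kitchen_id):
    with db:   # one atomic transaction covers the state change AND the watermark
        (last,) = db.execute("SELECT last_event_id FROM watermark WHERE partition_id=?",
                             (partition_id,)).fetchone()
        if event_id <= last:
            return "DUPLICATE-DROPPED"        # broker replay: no state change
        db.execute("UPDATE inventory SET qty = qty - 1 WHERE kitchen_id=?", (kitchen_id,))
        db.execute("UPDATE watermark SET last_event_id=? WHERE partition_id=?",
                   (event_id, partition_id))
        return "APPLIED"

print(on_kitchen_ack("p0", 41, "k1"))                      # APPLIED
print(on_kitchen_ack("p0", 41, "k1"))                      # DUPLICATE-DROPPED
print(db.execute("SELECT qty FROM inventory").fetchone())  # (99,) — decremented once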

[Figure: Caller-led ordering vs receiver-led idempotent inbox. Two stacked panels. Top — single-machine: the caller's stack frame is the order, f(x) runs once and deterministically, state lives in one place, no de-dup needed (call ⇒ effect). Bottom — distributed: the caller sends at-least-once retries of msg(id=k7); the receiver checks "if id ∈ inbox: drop, else: apply + record", with state and inbox updated in one transaction — de-dup is the receiver's job, in the same atomic write as the state change.]
Illustrative — the design move that makes at-least-once delivery survivable. The inbox table is just (idempotency_key PRIMARY KEY, processed_at, result), updated in the same transaction as the side effect.

A common follow-up question: how big should the inbox grow? The naive answer is "unbounded — store every event-id forever". The production answer is "as large as your worst-case retry window plus one safety margin". A consumer that retries with deadlines of 60 seconds plus a broker that retains 24 hours of history needs an inbox covering at least 24 hours plus the retry window — anything older can be safely garbage-collected because no replay will reference it. Many teams choose 7 days as a comfortable upper bound; the storage cost is small (a row per event-id is 32–64 bytes) and the rare disaster-recovery replay from a 5-day-old broker snapshot is covered. The inbox is itself a distributed-systems artefact and has its own failure modes — if the inbox table is lost, every event becomes a possible duplicate; if the inbox is on a different database than the state, the two-database problem (atomicity across stores) re-appears (Part 14). The simplest safe pattern is: same database, same transaction, idempotency-key column with a PRIMARY KEY constraint, and let the database's ON CONFLICT DO NOTHING clause do the deduplication for you.

Why "same database, same transaction" matters more than it sounds: if the state-write and the inbox-write are on different stores, you have two-phase-commit semantics again — the inbox might commit while the state-write fails, or vice versa, leaving you with a phantom dedup record (state never changed, but next retry is silently dropped) or a lost dedup record (state changed but next retry will re-apply it). The single-database constraint reduces the problem to "either both commit or neither does", which the database's transaction guarantee gives you for free. Cross-store atomicity is one of the hardest sub-problems in distributed systems; co-locate when you can.

The implications spread further than payments. CricStream's recommendation service consumes a WATCH_EVENT topic whose Kafka cluster occasionally rebalances and re-delivers the last 30 seconds of events. The recommendation model treats each event as an interaction signal — early in CricStream's life this caused a feedback loop where every rebalance amplified the most recent watch events 2–3×, drowning out historical signal and creating "stuck" recommendations that just kept showing whichever match someone happened to be watching when the broker last rebalanced. The fix was to dedupe events by (user_id, content_id, ts_floor_to_minute) in a Redis set with TTL of 90 seconds — small change, three days to design, two hours to ship, immediate measurable improvement in recommendation diversity. Once again, the bug was the residue of an at-most-once mental model surviving into an at-least-once world.
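A sketch of the CricStream dedupe with redis-py — it assumes a Redis on localhost, the key shape and 90-second TTL mirror the prose, and record_interaction is a hypothetical stand-in for the model update. One key per (user_id, content_id, minute) plus SET NX EX is an atomic way to implement the TTL'd-set idea:

import redis

r = redis.Redis()                      # assumes localhost:6379

def record_interaction(evt):           # hypothetical model-update stub
    print("signal accepted:", evt)

def on_watch_event(evt):
    key = f"watch:{evt['user_id']}:{evt['content_id']}:{int(evt['ts'] // 60)}"
    # SET NX EX is atomic: only the first writer of this key within the
    # 90-second TTL window succeeds; replays see None and are dropped.
    if r.set(key, 1, nx=True, ex=90) is None:
        return                         # broker replay: drop before the model sees it
    record_interaction(evt)

on_watch_event({"user_id": 7, "content_id": "match-42", "ts": 1700000000})
on_watch_event({"user_id": 7, "content_id": "match-42", "ts": 1700000012})  # same minute: dropped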

The cost of forgetting — five real failure modes you will hit

The framing of "the third outcome" and "idempotent inbox" can feel abstract until you see what not designing failure-first looks like in production. Each of the five failure modes below is something a real engineering team has reproduced verbatim — not the exact team, but a structurally identical one — within the last two years on the Indian fintech / e-commerce / streaming-platform circuit. The point of listing them is not voyeurism. The point is that each failure mode has a single root cause: a single-machine assumption was preserved across the network boundary, and that assumption broke in production at the worst possible moment. Reading the list with four questions in mind — is this idempotent? what happens on retry? what happens on partial failure? where is the deadline? — turns each story into a small concrete drill.

It is worth enumerating, concretely, the failure modes that hit teams who build single-machine code and then "distribute it":

The double-debit. A payment service does not store idempotency keys; on a network timeout, the client retries; the original request had succeeded; the retry succeeds again. A customer is charged twice. The fix is the inbox pattern above; the cost of the lesson is real money and a refund process. Every Bengaluru fintech has hit this exactly once on the way to maturity.

PaySetu's first version of merchant settlements had this exact bug. A network glitch between the gateway and the wallet pushed calls past their 200-millisecond timeout on Saturday afternoons during a routine LB rebalance. The client retried, the wallet's debit had already committed, and 1,400 merchants were debited twice for a total of ₹47 lakh before the dashboard alarm caught the anomaly. The fix was an idempotency_key column on the wallet's transaction-log, indexed and unique-constrained, with a thirty-line PR. The post-mortem ran six pages — most of it was the refund-mechanics, the legal-team review, and the customer-comms script. The fix itself was small; the lesson was that the cost of a single missing failure-first primitive is not measured in code lines, it is measured in customer trust and refund-team hours, and both are vastly more expensive than the engineering it would have taken to do it right the first time.

The lost write. A worker service consumes events from a queue with manual ack; it processes the event, updates the database, then crashes before sending the ack; the queue redelivers the event; the worker reprocesses (good). But sometimes the database update fails after the ack was sent (crashed mid-flight); the queue does not redeliver; the write is lost. The fix is to commit the ack and the write in the same transaction — or, if that's not possible, to use a write-ahead log inside the worker. The cost of the lesson is silent data corruption; you find out weeks later when a reconciliation runs.

The stuck-pending state. A workflow has steps A, B, C; step B times out; the system marks the workflow PENDING and waits for human intervention; the human never sees the alert because the alert was routed to a Slack channel that nobody monitors. Three months later someone runs a SQL query and finds 47,000 stuck workflows. The fix is to design every PENDING state with a bounded timeout and an automated path forward (auto-cancel after 24h, auto-retry with escalation). The cost of the lesson is operational debt that compounds silently.

The cascading retry storm. Service A retries calls to service B with no backoff; B is overloaded; A's retries make B's queue grow; B's response time grows; A's deadline expires more often; A retries more; B's queue fills; B falls over; A's retries now go to no one; A also falls over; the system discovers it has accidentally implemented a denial-of-service attack against itself. The fix is exponential backoff with jitter, and circuit breakers. The cost of the lesson is a multi-hour outage that the post-mortem will diplomatically describe as "an unfortunate amplifying interaction".
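Backoff fixes the cadence of retries; a circuit breaker fixes the decision to send them at all. A minimal sketch — thresholds are illustrative, and production breakers (Envoy's, resilience4j's) add failure-rate windows and richer half-open behaviour:

import time

class CircuitBreaker:
    # Open after `threshold` consecutive failures; after `cooldown` seconds,
    # let a single probe through (half-open) to test whether B has recovered.
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def allow(self):
        if self.opened_at is None:
            return True                        # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = time.monotonic()  # half-open: admit one probe per cooldown
            return True
        return False                           # open: fail fast, no retry pressure on B

    def record(self, ok):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

# usage: if breaker.allow(): ok = call_b(); breaker.record(ok)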

The split-brain. A network partition isolates the leader from a quorum of followers; the followers elect a new leader; the network heals; the system now has two nodes that both believe they are the leader, both accepting writes; when the discrepancy is detected, the operations team must decide which timeline to keep and which to throw away. The fix is fencing tokens (Part 9) and lease-based leadership; the cost of the lesson is data loss and an incident report. Foreign companies famously hit this — GitHub's 2018 24-hour outage was a partition-induced split-brain — but a young Indian fintech named PaisaCard hit a smaller version on its 2024 Diwali sale, and the engineering blog post that came out of it is required reading for the next hire.

The point of the list is not to memorise five bugs. The point is that each one is the residue of a single-machine assumption that the engineer did not realise they were making until it broke in production at 02:30. Failure-first design is the discipline of identifying those assumptions during design, not during the post-mortem.

Going deeper

The fallacies of distributed computing — Peter Deutsch's eight

The Sun Microsystems engineers — Bill Joy and Tom Lyon, whose original four assumptions Peter Deutsch codified and extended to seven in 1994, with James Gosling later adding the eighth — produced a list of eight assumptions that crash-land single-machine engineers into distributed bugs. Read in order, they are: the network is reliable; latency is zero; bandwidth is infinite; the network is secure; topology doesn't change; there is one administrator; transport cost is zero; the network is homogeneous. The list is older than this curriculum's reader by a generation, and yet the bugs it predicts are the bugs that show up in every cohort of new engineers within their first six months on a distributed team. The "wall" this chapter names is the cumulative weight of the eight fallacies hitting at once. Part 2 unpacks each fallacy with a war story; the punch-line of this chapter is that failure-first design is the architectural response to all eight at the same time — you cannot pick which fallacy to defend against, you must build the system assuming all of them are wrong.

FLP impossibility and why "just use consensus" is harder than it sounds

Fischer, Lynch, and Paterson's 1985 result — that no deterministic consensus algorithm can guarantee termination in an asynchronous network with even one faulty process — is the formal underpinning of why consensus protocols (Paxos, Raft, Zab, Part 8) build in randomised timeouts. The takeaway for a Part 1 reader is not the proof; it is that even the most rigorous tool we have for "failure-first" design — running a consensus protocol over a quorum — is bounded by an impossibility result, and the protocols you will read about in Part 8 are engineering compromises around that boundary, not magic. The compromise is: assume a partial-synchrony model (eventually messages arrive within a known bound), use timeouts and randomisation, accept that under prolonged partition the system pauses rather than violates safety. Failure-first design at the smallest scale (one RPC) and the largest (a 5-region consensus group) shares the same shape.

Why production-faithful chaos testing changes engineering culture

Netflix's Chaos Monkey, and the Chaos Engineering practice it seeded (Basiri et al., IEEE Software 2016), proved a point that has reshaped how mature distributed teams operate: failures the system has not experienced in production are failures it does not actually handle. A retry-on-timeout code path that has never been exercised in real conditions is, statistically, broken in some way the test suite did not catch — it has the wrong backoff, it doesn't properly clean up the in-flight state, it leaks a goroutine, it propagates a stale deadline. The discipline of injecting failure into production (in controlled blast radius, during business hours, with a "stop button") is how you turn untested failure paths into tested ones. PaySetu adopted a quarterly "game-day" practice in 2024 where one team injects partition or latency into a single AZ for a 30-minute window, watches the system either heal or hit UNCERTAIN-infested degradation, and writes the post-mortem the same afternoon. Six game days revealed seven bugs that had been latent for years. Part 19 covers this in depth; the implication for failure-first design is that the design phase is where you sketch the failure modes; the chaos-engineering phase is where you discover the ones you missed.

The architecture-shape implication — failure-first reshapes service boundaries

A subtle but important consequence: failure-first design pushes service boundaries toward transactional alignment, not domain alignment. The "domain-driven design" school says you carve services along bounded contexts (the order context, the payment context, the inventory context). Failure-first design says you carve services along consistency boundaries — operations that must be atomic must live in one service, and operations that can tolerate partial completion may be split. When BharatBazaar split its monolith into 47 microservices around domain boundaries in 2021, the order-completion flow crossed five services and four of them had to coordinate via sagas — at every saga step there was a new UNCERTAIN outcome to manage, and the operational complexity dwarfed the engineering benefit of the split. Two years later they consolidated 18 of those services back into 6 transactional cores, leaving 35 leaf services for non-transactional concerns. The lesson: service boundaries are a consequence of where you can tolerate uncertainty, not where the diagram looks neat. Part 14 (distributed transactions) and Part 16 (workflows) are the chapters where this becomes a quantitative argument.

Reproduce this on your laptop

The Python artefact in ## What "failure-first" actually means in code is reproducible without infrastructure:

# Reproduce the at-least-once + idempotency demo
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
python3 failure_first_withdraw.py
# Vary fail_rate from 0.0 to 0.95 and re-run to see the cache_size grow
# while balance always equals 10000 - 100*processed.

To see real at-least-once behaviour over a real broker, install Redpanda (a Kafka-compatible broker that runs in a single binary) and re-run a producer-consumer with enable.idempotence=true:

brew install redpanda-data/tap/redpanda     # or: apt install redpanda
rpk container start                          # spins up a single-broker dev cluster
pip install confluent-kafka
# Run a producer that sends the same key twice; observe the consumer's
# offset commit and verify the consumer dedupes by message-id.
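A sketch of that exercise with confluent-kafka — the broker address is whatever rpk container start prints (localhost:9092 assumed here), and the topic and group names are invented. Note what enable.idempotence does and does not buy: it dedupes the producer's own broker-level retries, but the two deliberate produce calls below are two distinct messages, so the consumer still needs its own inbox:

from confluent_kafka import Consumer, Producer

BROKER = "localhost:9092"     # use the address printed by `rpk container start`
                              # create the topic first if auto-creation is off:
                              #   rpk topic create demo-topic

p = Producer({"bootstrap.servers": BROKER, "enable.idempotence": True})
for _ in range(2):            # same key sent twice — two real messages on the topic
    p.produce("demo-topic", key="merch-1", value=b"debit 100")
p.flush()

c = Consumer({"bootstrap.servers": BROKER, "group.id": "demo-group",
              "auto.offset.reset": "earliest"})
c.subscribe(["demo-topic"])
seen = set()                  # the consumer's (in-memory) idempotent inbox
while True:
    msg = c.poll(5.0)
    if msg is None:
        break
    if msg.error():
        continue
    status = "DUPLICATE-DROPPED" if msg.key() in seen else "APPLIED"
    seen.add(msg.key())
    print(f"offset={msg.offset()} key={msg.key()} -> {status}")
c.close()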

Doing this once with your own hands earns you a more durable mental model than reading any number of "exactly-once" blog posts.

Where this leads next

This chapter closes Part 1. The previous four chapters built an argument with three economic / physical / availability premises and one cost — distribute when you must, pay the bill of consistency / simplicity / debuggability, and design every line of code as though failure runs through it. Part 2 starts the inventory of failure: the fallacies of distributed computing, the failure models (fail-stop / fail-noisy / fail-Byzantine), the difference between a partition and a slow node, and the formal taxonomy that the rest of the curriculum lives inside.

The reader who has finished Part 1 holds a position that is rare and useful: they know why distributed systems exist (the previous three chapters — economics, physics, availability), what the cost is (consistency, simplicity, debuggability — the chapter just before this one), and how to design under the cost (failure-first, three outcomes, idempotent inbox — this chapter). The remaining 134 chapters are the catalogue of techniques. They are interesting on their own and they are deeply consequential for production systems, but the framing established here — failure-first, three outcomes, idempotent inbox, partial-completion-as-a-state — is the lens through which they should be read. Without the lens, each technique looks like a separate trick to memorise; with the lens, they form a coherent toolbox of partial-recovery moves against a common adversary: the network, refusing to be reliable.

A practical exercise before starting Part 2: take the most recent production incident your team handled, and write a one-paragraph post-mortem in this template — "The incident was caused by a single-machine assumption (X) carried into a distributed component. The third outcome (UNCERTAIN) appeared at point (Y) and was handled (or not handled) by mechanism (Z). The fix (or proposed fix) is to add (idempotent inbox / fencing token / deadline propagation / dedup cache / W+R quorum / saga compensation). The class-of-bug this falls under is (one of the five failure modes above)." Doing this for ten incidents in a row across a quarter is the cheapest way to internalise the framing this chapter has tried to install. The framing is more durable than any specific protocol — the protocols change every few years, the framing has held for forty.

A second, sharper exercise: open the source code of the most-called RPC handler in your service and answer four questions. Is the handler idempotent on the data model — i.e. if it ran twice with the same arguments, would the database state be the same as if it ran once? If a client retries this call after a timeout, does the system handle it correctly without manual intervention? What happens if the handler's downstream call times out at exactly the moment the downstream actually committed? Is there a deadline propagated into this handler's downstream calls, or are they using a fixed timeout that ignores how much of the user's patience has already been spent? Most handlers fail at least one of these four questions on first reading. The honest answer to "what fraction of our RPC handlers pass all four" is the most predictive single metric of how often your team will be paged for distributed-systems bugs in the coming year. A team that scores 90% will sleep through the coming Diwali sale; a team at 30% will be debugging at 02:30 on the night BharatBazaar's next mega-sale launches. The scoring is not academic — it is the leading indicator of operational health, and the four questions are the most concise summary of failure-first design that fits on a sticky note above an engineer's monitor.

The chapters that follow are where each of those four questions becomes a concrete protocol or pattern. Part 3 (clocks) answers "how do we order events when machines disagree about time"; Part 4 (RPC) is where idempotency and deadline propagation become rigorous; Parts 8–9 (consensus, leader election) are where W+R quorums and fencing tokens replace ad-hoc retry logic; Part 12 (consistency models) puts a quantitative dial under "how stale is too stale"; Part 14 (distributed transactions) and Part 16 (workflows) are where the multi-step saga patterns from this chapter's ## The cost of forgetting section get formalised. Read the catalogue with the four questions in mind, and the techniques will arrange themselves around the same skeleton you have already internalised. The point of Part 1 is exactly that — to give you a skeleton sturdy enough to hang the rest of the curriculum on.

References

  1. A Note on Distributed Computing — Waldo, Wyant, Wollrath, Kendall, Sun Microsystems 1994. The foundational paper arguing that distributed objects cannot transparently look like local ones — the formal version of "you must design failure-first".
  2. Fallacies of Distributed Computing Explained — Arnon Rotem-Gal-Oz. Annotated walkthrough of Peter Deutsch's eight fallacies with production examples.
  3. Notes on Distributed Systems for Young Bloods — Jeff Hodges 2013. A practitioner-voice essay covering exactly the gotchas this chapter sketches; pairs naturally as the next reading after Part 1.
  4. Impossibility of Distributed Consensus with One Faulty Process — Fischer, Lynch, Paterson, JACM 1985. The formal boundary on what failure-first design can achieve in fully asynchronous networks.
  5. Designing Data-Intensive Applications — Martin Kleppmann, O'Reilly 2017. Chapter 8 ("The Trouble with Distributed Systems") is the most accessible long-form companion to this chapter.
  6. Patterns of Distributed Systems — Unmesh Joshi on Martin Fowler's site. A pattern-language catalogue of failure-first design moves, with running code.
  7. Chaos Engineering — Casey Rosenthal, Aaron Blohowiak et al. The principled argument for why production-tested failure paths are the only ones that work.
  8. What you lose: consistency, simplicity, debuggability — internal cross-link to the immediate predecessor chapter; this article is its operational conclusion.