Crash, omission, timing, Byzantine

Aditi is staring at three dashboards at 02:14. Replica wallet-3 of PaySetu's wallet cluster has gone quiet; its heartbeat to the leader stopped 90 seconds ago, but its TCP socket is still half-open and the kernel on the load-balancer thinks the host is alive. Replica wallet-5 is replying to client requests with 200 OK and a body that contains last week's balance — the data is stale by 11 hours. Replica wallet-7 is acking writes to the leader within 4 ms, but the on-disk WAL position has not advanced for 6 minutes, which means every ack it sent in the last 6 minutes was a lie. Three replicas, three failures, and none of them is the same kind of failure. The runbook page Aditi is on says "if a replica fails, mark it dead and remove it from the quorum" — but applying that to all three at once would either over-react (kill wallet-3 which is recoverable) or under-react (leave wallet-7 lying about acks, which is the most dangerous mode of all). This chapter is the taxonomy that lets you tell which is which.

There are four failure models, ranked by how badly each one violates the protocol contract: crash (node stops cleanly), omission (node drops some messages, processes others), timing (node responds, but past the deadline), and Byzantine (node behaves arbitrarily — possibly maliciously, possibly buggy, possibly corrupted). Each demands a different defence, and assuming the wrong model is how 4-hour outages become 4-day data-loss incidents.

The four models, in increasing order of pain

A failure model is a bound on what a faulty node is allowed to do. The protocol you build is correct under that bound — it makes no promises if the actual failure exceeds the bound. The discipline of distributed-systems engineering is matching the right bound to the right environment, because a tighter bound buys you a simpler protocol but a looser bound is the one that will actually hit you in production.

The four standard models, formalised across forty years of literature (Lamport, Schneider, Cristian, Castro & Liskov), look like this. Read them top-to-bottom: each row is a strict superset of the failures the previous row admits, and a strict superset of the protocol complexity needed to handle them.

Model — what a faulty node is allowed to do, how it is detected, and what it looks like in production:

Crash (fail-stop)
  Allowed behaviour: stops sending messages permanently; never sends a wrong message.
  Detectable by: heartbeat timeout.
  Production examples: process kill -9, machine power-off, kernel panic.

Omission
  Allowed behaviour: drops some messages (inbound or outbound); other messages are correct.
  Detectable by: sequence-number gaps, retry-with-deduplication.
  Production examples: buggy NIC dropping every 1000th packet, half-closed TCP, full kernel send buffer.

Timing
  Allowed behaviour: responds correctly, but late — past the agreed deadline.
  Detectable by: deadline timer.
  Production examples: long GC pause, page-cache thrash, IO stall, CPU stolen by a noisy neighbour.

Byzantine
  Allowed behaviour: anything — corrupted bytes, conflicting answers to different peers, replays of old messages, outright lies.
  Detectable by: cryptographic signatures + voting (BFT).
  Production examples: bit-flip from a cosmic ray on cheap RAM, a malicious actor in an adversarial setting (blockchains), buggy code that returns subtly wrong answers.

The models nest: every crash failure is also an omission failure (it dropped all future messages); every omission failure is also a timing failure (the dropped message has effectively infinite latency); every timing failure can be modelled as Byzantine if you squint hard. The reason we name them separately is that the protocol cost grows by an order of magnitude at each step, so you want the tightest model your environment allows. Inside a single trusted datacentre with ECC RAM, crash + omission is enough — you don't need Byzantine fault tolerance, and paying for it (3f+1 replicas instead of 2f+1, signature verification on every message, roughly 4× the messages and 10× the CPU per consensus round) is wasted capacity. Across an open internet with mutually distrustful parties (a public blockchain), Byzantine is the only honest choice.

[Figure: the four failure models as nested fault classes — crash contained within omission, within timing, within Byzantine. Each ring is labelled with what the faulty node may do and the cost of defending against it: crash — heartbeat + timeout, 2f+1 replicas (Raft, Paxos), the cheapest ring and where most production systems live; omission — sequence numbers, retries, idempotency keys; timing — deadlines, hedged requests, a partial-synchrony assumption; Byzantine — arbitrary, possibly malicious behaviour, defended with 3f+1 replicas, signatures, and voting (PBFT, Tendermint).]
Illustrative — the four models as nested fault classes. The outer rings cost more to defend against because they admit more adversarial behaviour. Inside a trusted datacentre with ECC, the crash ring is usually enough.

Why crash needs only 2f+1 replicas but Byzantine needs 3f+1: under crash, a faulty node is merely silent — after f crashes, the remaining f+1 of the 2f+1 nodes still form a majority and can decide. Under Byzantine, a faulty node may lie, and the protocol must still make progress after hearing from only 3f+1−f = 2f+1 nodes; among those 2f+1 responses up to f may be from liars, so the f+1 honest responses must still be the majority that decides. The arithmetic comes from Pease-Shostak-Lamport 1980; the practical consequence is that Byzantine consensus burns roughly 1.4–1.5× the hardware of crash consensus for the same f, and that ratio shows up in cloud bills.
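
The arithmetic is compact enough to sanity-check in a few lines of Python. The sketch below is illustrative only — it restates the replica and quorum counts for each model so the hardware cost of tolerating one more fault is visible at a glance; the function names are mine, not from any library.

# quorum_math.py — replica and quorum counts for crash vs Byzantine tolerance (illustrative)
def crash_tolerant(f):
    replicas = 2 * f + 1          # f silent nodes still leave a majority behind
    quorum = f + 1                # majority of 2f+1
    return replicas, quorum

def byzantine_tolerant(f):
    replicas = 3 * f + 1          # must make progress with only 2f+1 responses, f of which may lie
    quorum = 2 * f + 1            # any two such quorums intersect in >= f+1 nodes, so >= 1 honest node
    return replicas, quorum

for f in (1, 2, 3):
    c, b = crash_tolerant(f), byzantine_tolerant(f)
    print(f"f={f}: crash needs {c[0]} replicas (quorum {c[1]}), "
          f"Byzantine needs {b[0]} replicas (quorum {b[1]})")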

Walking through what each model actually looks like in code

Talking about failure models in the abstract is how engineers end up applying the wrong defence in production. Each model has a distinctive signature in logs and metrics, and learning to read those signatures is half the skill. The other half is knowing which signature corresponds to which row of the table — and the cheapest way to learn that mapping is to watch the four modes side-by-side in a simulation small enough to fit on one page.

The Python artefact below simulates each of the four failure modes and shows how a naive consensus-style coordinator (one leader, four followers, majority quorum) responds. The point is not that the simulator is rigorous — it is that each failure mode produces a visibly different log pattern, and a real on-call engineer learns to recognise those patterns before opening the protocol spec.

# failure_modes.py — simulate crash / omission / timing / Byzantine and see how a coordinator responds
import asyncio, random, time

class Follower:
    def __init__(self, name, mode="healthy"):
        self.name = name; self.mode = mode
        self.committed = 0; self.dropped_count = 0

    async def append(self, idx, data):
        # Each mode is one of: healthy, crash, omission, timing, byzantine
        if self.mode == "crash":
            raise asyncio.CancelledError("dead")              # never replies, ever
        if self.mode == "omission" and random.random() < 0.4:
            self.dropped_count += 1
            await asyncio.sleep(10)                            # simulates dropped — caller times out
            return None
        if self.mode == "timing":
            await asyncio.sleep(random.uniform(0.5, 1.2))      # late but correct
            self.committed = idx
            return ("ACK", self.name, idx, data)
        if self.mode == "byzantine":
            # The dangerous one: replies fast, on time, but with tampered data
            await asyncio.sleep(0.01)
            return ("ACK", self.name, idx, data + "_TAMPERED")
        await asyncio.sleep(0.01)
        self.committed = idx
        return ("ACK", self.name, idx, data)

async def coordinator(followers, idx, data, deadline=0.2):
    # Leader sends append to all, waits for majority within deadline
    tasks = [asyncio.create_task(f.append(idx, data)) for f in followers]
    done, pending = await asyncio.wait(tasks, timeout=deadline)
    for p in pending: p.cancel()
    acks = [t.result() for t in done if t.result() is not None]
    quorum = (len(followers) // 2) + 1
    # Group acks by payload content — agreement requires a quorum on the same value, not just a count of acks
    by_data = {}
    for ack in acks: by_data[ack[3]] = by_data.get(ack[3], 0) + 1
    winning = max(by_data.items(), key=lambda x: x[1]) if by_data else (None, 0)
    return {"acks": len(acks), "quorum": quorum, "winning_data": winning[0],
            "agreed": winning[1] >= quorum, "tampered_seen": any("TAMPERED" in a[3] for a in acks)}

async def main():
    random.seed(7)
    scenarios = [("healthy",  ["healthy"] * 4),
                 ("crash-1",  ["crash", "healthy", "healthy", "healthy"]),
                 ("crash-2",  ["crash", "crash", "healthy", "healthy"]),
                 ("omission", ["omission", "omission", "healthy", "healthy"]),
                 ("timing",   ["timing", "timing", "healthy", "healthy"]),
                 ("byzantine-1", ["byzantine", "healthy", "healthy", "healthy"])]
    for name, modes in scenarios:
        followers = [Follower(f"f{i}", m) for i, m in enumerate(modes)]
        r = await coordinator(followers, idx=42, data="balance=9000")
        print(f"{name:12} acks={r['acks']}/4 quorum={r['quorum']} agreed={r['agreed']:1} "
              f"tampered_seen={r['tampered_seen']:1} winning_data={r['winning_data']}")

asyncio.run(main())

Sample run:

healthy      acks=4/4 quorum=3 agreed=1 tampered_seen=0 winning_data=balance=9000
crash-1      acks=3/4 quorum=3 agreed=1 tampered_seen=0 winning_data=balance=9000
crash-2      acks=2/4 quorum=3 agreed=0 tampered_seen=0 winning_data=balance=9000
omission     acks=3/4 quorum=3 agreed=1 tampered_seen=0 winning_data=balance=9000
timing       acks=2/4 quorum=3 agreed=0 tampered_seen=0 winning_data=balance=9000
byzantine-1  acks=4/4 quorum=3 agreed=1 tampered_seen=1 winning_data=balance=9000

The interesting rows are the last three.

omission looks identical to crash-1 from the coordinator's perspective — it got 3 acks, hit quorum, committed. The coordinator does not know that followers f0 and f1 dropped 40% of their messages on the floor; it just notices that the slow ones missed the deadline and that quorum was reached without them. This is why omission failures are insidious — they hide inside the "we got a quorum" success path.

timing got only 2 acks within the 200 ms deadline because two followers were 500–1200 ms late — the protocol did not commit, even though all four followers were healthy and would eventually have replied with the correct answer. From the coordinator's viewpoint, a slow node is indistinguishable from a dropped message until the deadline expires; the protocol cannot wait forever (FLP), so it must reject the slow reply, even though that reply, when it eventually arrives, will be correct.

byzantine-1 is the chilling case: 4 of 4 acks arrive on time and the quorum check passes (agreed=1), because the three honest acks still outnumber the single tampered one. tampered_seen=1 is the only signal that something is wrong — and a coordinator that merely counts acks without comparing their content, or that faces two tampered replicas instead of one, will happily commit a corrupted value. Real BFT protocols require 2f+1 acks for the same content — not just 2f+1 acks. That one qualifier — same content — is the difference between Raft and PBFT.

Why the deadline parameter is doing protocol-shaping work, not just timeout-handling: a tighter deadline (50 ms) catches timing faults faster but classifies more healthy slow nodes as failed, increasing false-positives. A looser deadline (1 second) catches fewer timing faults but waits longer to make progress. The deadline is a knob on the synchrony assumption — T_deadline = T_typical_RTT × k where k is the multiplier you trust, typically 4–10. Cassandra's phi-accrual failure detector treats this as a tunable distribution rather than a fixed threshold; Part 10 derives the math. Until then, every timeout in your config file is a guess at this knob.
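
A concrete way to set that knob, assuming you already record per-request RTTs: take a window of recent samples and set the deadline to a high percentile times a safety multiplier. The sketch below is a deliberately crude stand-in for the phi-accrual detector Part 10 derives — the k = 4 multiplier and the simulated latencies are illustrative assumptions, not recommendations.

# deadline_knob.py — derive a failure-detection deadline from observed RTTs (illustrative sketch)
import random
from statistics import quantiles

def suggest_deadline_ms(rtt_samples_ms, k=4):
    """Deadline = k * observed p99 RTT. A crude stand-in for phi-accrual (Part 10)."""
    p99 = quantiles(rtt_samples_ms, n=100)[98]      # 99th-percentile cut point
    return k * p99

# Hypothetical healthy service: ~5 ms typical, with a 5% straggler tail up to 40 ms
random.seed(1)
samples = [random.gauss(5, 1) for _ in range(950)] + [random.uniform(20, 40) for _ in range(50)]
print(f"suggested deadline: {suggest_deadline_ms(samples):.0f} ms")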

Why a follower mode of "byzantine" is the one production engineers underestimate: most production Byzantine faults are not malicious — they are buggy code returning subtly wrong answers, corrupted memory from single-bit flips on commodity hardware (Google's 2009 field study of DRAM errors found them far more common in production than lab estimates had suggested), or stale caches serving 11-hour-old data while reporting 200 OK. Calling them "Byzantine" sounds dramatic but the protocol cost is real, and the fix — checksums on every persisted value, signed acks, content-based quorum — is the difference between catching the bug at write time and catching it 11 hours later in a customer support ticket.
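
The write-time defence can be as small as sealing every value with a digest when it is produced and verifying the digest on every read, so a bit flip or a stale overwrite surfaces as an explicit error instead of a plausible-looking number. A minimal sketch under that assumption — real systems push the digest into the storage engine or the wire format rather than carrying it alongside the value like this:

# checksummed_value.py — convert silent corruption into a detectable read-time error (illustrative)
import hashlib, json

def seal(value: dict) -> dict:
    """Attach a SHA-256 digest computed over a canonical encoding of the value."""
    body = json.dumps(value, sort_keys=True).encode()
    return {"body": value, "sha256": hashlib.sha256(body).hexdigest()}

def unseal(record: dict) -> dict:
    """Re-derive the digest before trusting the value; fail loudly rather than serve garbage."""
    body = json.dumps(record["body"], sort_keys=True).encode()
    if hashlib.sha256(body).hexdigest() != record["sha256"]:
        raise ValueError("checksum mismatch: value corrupted after it was written")
    return record["body"]

rec = seal({"wallet": "wallet-7", "balance": 9000})
rec["body"]["balance"] = 9004                      # simulate a bit flip after the seal
try:
    unseal(rec)
except ValueError as err:
    print(f"caught at read time: {err}")           # instead of a support ticket 11 hours later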

Production stories — recognising the model in your dashboard

The taxonomy is only useful if you can map it onto what you actually see at 02:30. Here is what each of the four models looks like in a production-shaped incident — one story per model, drawn from Indian-scale systems.

Each story below is a structurally-identical version of an incident a real engineering team handled within the last two years, retold with fictional roster names. The point is not the war-story drama — it is to attach a face to the four abstract rows of the table, so that next time you see the signature in your own logs you recognise it without having to re-read the literature.

Crash failure — CricStream's leader rotation. During the 2024 cricket world cup final (25M concurrent viewers on CricStream), one of the leader-cluster nodes kernel-panicked at 21:47 IST when its NVMe driver tripped a bug under sustained 4 GB/s write load. The node went completely silent — heartbeat stopped, the next ack drew a TCP RST, and the kernel logs showed the panic trace. The standby took over in 240 ms, the two surviving nodes still formed a quorum, playback paused for ~600 ms across the affected shard, and the post-mortem was three pages long. Crash is the easy case — the protocol you have probably already handles it, and the only operational concern is that the timeout is short enough to keep user impact under one second.

Omission failure — PaySetu's wallet cluster on Diwali eve 2024. Replica wallet-3 was running on a host whose NIC firmware had a bug that dropped exactly 1 in 4096 IP packets. Heartbeats survived because they were tiny and frequent; large AppendEntries requests for the wallet WAL routinely lost their final TCP segment, forcing kernel retransmissions (visible only as an elevated RetransSegs counter in /proc/net/snmp or netstat -s). The follower acked some entries and silently failed to ack others; the leader's next_index[3] got stuck at a position 47 entries behind the rest. For 18 minutes the cluster was in a state where 4 of 5 replicas were progressing and 1 was stuck. No alarm fired because the leader still had quorum. The breakage surfaced when an unrelated rolling upgrade restarted wallet-2, leaving wallet-3 (the stuck one) as one of the two needed for quorum, and writes paused for 9 minutes. Omission is the failure mode that sneaks past the metrics — the cluster is "healthy" by every standard signal until the topology changes.
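
The operational fix for this class of omission failure is an alert on per-follower replication lag that fires even while the leader still has quorum. A minimal sketch of the check — match_index is Raft's name for the highest log entry known to be replicated on each follower; the threshold and the replica names here are hypothetical:

# lag_alert.py — flag followers that are stuck even while the cluster still has quorum (illustrative)
def stuck_followers(commit_index: int, match_index: dict, max_lag: int = 20) -> list:
    """Return followers whose replicated position trails the commit index by more than max_lag entries."""
    return [name for name, idx in match_index.items() if commit_index - idx > max_lag]

# The PaySetu-shaped scenario: quorum is healthy, wallet-3 is silently 47 entries behind
commit_index = 10_482
match_index = {"wallet-1": 10_482, "wallet-2": 10_481, "wallet-3": 10_435,
               "wallet-4": 10_482, "wallet-5": 10_480}
print(stuck_followers(commit_index, match_index))   # ['wallet-3'] — alert before the next topology change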

Timing failure — KapitalKite's order-matching cluster during the Adani-cycle market open. A JVM full-GC pause of 4.2 seconds froze the leader of the order-matching consensus group. Followers' election timers (3 seconds) elapsed; one became candidate, won, and started accepting orders. Then the original leader's GC finished, it returned to life with the old term and tried to push its in-flight orders — those were rejected, but for ~800 ms there were two leaders both accepting writes (until the old leader saw a higher term and stepped down). The 800 ms split-brain caused 84 orders to be confirmed against now-stale state, then nullified during reconciliation. Timing failures look like crashes for the duration of the pause and like a normal node afterwards — they are the worst kind of single-node fault because they break protocol invariants on both sides of the boundary.

[Figure: timing failure timeline in KapitalKite's leader — t=0 leader healthy; t=1.0 s GC starts, heartbeats stop; t=4.0 s election timeout, a candidate emerges; t=4.4 s new leader elected at term 8; t=5.2 s old leader wakes, still on stale term 7; t=6.0 s old leader sees term 8 and steps down. The highlighted ~800 ms window during which both leaders accept writes is the danger zone; reconciliation later rejects 84 orders.]
Illustrative — the dual-leader window is the price of treating a timing failure as a crash failure. Fencing tokens (Part 9) close this window by making the old leader's writes provably stale at the storage layer.

Byzantine failure — BharatBazaar's cart-counter cache. A single rack of cheap commodity servers in BharatBazaar's recommendation tier started returning cart-counter values that were silently off by 1–4 items, intermittently, during the 2024 Big Billion sale. Investigation traced it to faulty (non-ECC) DRAM in three of the rack's nodes — bit flips were corrupting the in-memory counter before the response was sent. Every value still looked syntactically valid (small integers, plausibly cart-sized), so neither the application nor the load-balancer detected anything wrong. Customers reported "I added 3 items but the icon shows 2" tickets at ~0.05% of sessions over four hours before the shape of the bug clarified. The fix was twofold: replace the rack with ECC-DRAM hardware and add a SHA-256 checksum to every cart-counter response. In commodity-hardware datacentres, Byzantine failures show up as bit-corruption far more often than as malicious behaviour — and the defence (ECC RAM, checksums on hot data, content-comparison across replicas before serving) is the same.

The lesson across all four stories is the same shape: the failure mode you under-defend against is the failure mode that bites you the worst. Crash is cheap; omission and timing fail invisibly inside "healthy" metrics; Byzantine corrupts data without anyone realising until the customer-support tickets surface. The system-design move is to know your environment's expected mix and pay for the protocols that match.

A subtler observation about the simulator output: the time-to-detect differs sharply across modes. Crash is detected at the deadline (200 ms in this run); omission of an individual message is detected at the same deadline but the full follower-is-falling-behind picture only becomes visible after several rounds; timing is detected at the deadline plus the slop the protocol allows for one slow ack; Byzantine is not detected at all by a coordinator that does not cross-check ack content across replicas. The detection lag is the operational cost of each model — it is the time during which the system is degraded but the dashboards have not yet flipped red, and during that time a human operator cannot react because they have nothing to react to. Real production systems shorten this lag with redundant signals (heartbeat plus data-plane probes plus replication-lag plus content checksums), each catching one of the four modes earlier than a single signal would. The pile-up of these signals is what a mature SRE dashboard looks like, and reading the dashboard fluently is exactly the skill of mapping signal patterns onto the four rows of the table.

Choosing a model — the design conversation you must have

When you sit down to design a new distributed component, the first decision is not which database, not which RPC framework, not which language — it is which failure model you are designing under. The decision is usually implicit, encoded in the protocol you reach for ("we'll use Raft" implies crash-only), and that implicit choice is precisely where the model-vs-environment mismatch creeps in. Make the decision explicit, write it in the design doc, and revisit it during every code review. The conversation has three concrete questions.

What is the trust boundary of the deployment? If every node runs in a single datacentre your team controls, on hardware your SRE team chose, behind a firewall, talking only to other nodes you operate — your trust boundary is the datacentre, and crash + omission is usually enough. If nodes run across multiple cloud providers your team does not fully audit, with software you did not write at every layer, in an environment where a node could be compromised by a supply-chain attack — the boundary moves outward and the model that matches is closer to Byzantine. Most production distributed systems live in the first regime; the recent rise of confidential computing (Intel SGX, AMD SEV) and the proliferation of supply-chain compromises have pushed some toward the second.

What is the cost of a missed fault? A messaging system that drops a chat reply costs an annoyed user. A wallet that mis-debits a balance costs ₹47 lakh and a refund-team week. A national rail-booking system that confirms two tickets for the same berth costs reputational damage and possibly regulatory action. The higher the cost-per-fault, the further outward you should pay for fault-tolerance. PaySetu's wallet uses content-checksums on every transaction (a partial-Byzantine defence) precisely because the cost of a single silent bit-flip exceeds the cost of the extra CPU cycles for the SHA-256 by orders of magnitude. CricStream's video-buffer cache uses no checksum on the cached frame because a single corrupted frame produces a one-frame visual glitch and the user never notices.

What is your tolerance for false-positive failure detection? Tighter timing (faster failure detection) catches more genuine failures but classifies more healthy slow nodes as failed; looser timing makes the opposite trade-off. PlayDream's fantasy-sports leaderboard, which updates every 30 seconds, sets failure-detection deadlines of 8 seconds — a healthy node has plenty of slack and a failed one is detected in under 10 seconds. RailWala's Tatkal-hour booking pipeline, which runs at 2-second user-perceived deadlines, must set failure-detection deadlines of ~200 ms — meaning every GC pause longer than 200 ms is a probable fault, and the system must either tolerate frequent re-elections or invest in tail-latency engineering (low-pause GC, off-heap memory, real-time JVMs) to keep pauses below the threshold. The deadline is set by the protocol's worst-case latency budget, not by what feels comfortable.
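
The trade-off becomes quantitative once you have a latency distribution for a healthy node: for any candidate deadline, the fraction of healthy responses slower than the deadline is the false-positive rate you are signing up for. A small sketch with made-up numbers — this is the calculation the PlayDream and RailWala teams are implicitly doing when they choose 8 seconds versus ~200 ms:

# deadline_false_positives.py — how often does a given deadline call a healthy node dead? (illustrative)
import random

def false_positive_rate(healthy_latencies_ms, deadline_ms):
    late = sum(1 for t in healthy_latencies_ms if t > deadline_ms)
    return late / len(healthy_latencies_ms)

random.seed(3)
# Hypothetical healthy node: ~20 ms typical, plus 1% of requests stuck behind a 150-400 ms GC pause
latencies = [random.gauss(20, 5) for _ in range(9900)] + [random.uniform(150, 400) for _ in range(100)]
for deadline in (50, 200, 1000):
    print(f"deadline={deadline:>4} ms -> false-positive rate {false_positive_rate(latencies, deadline):.2%}")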

The output of these three questions is a one-paragraph statement at the top of the design doc: "We design under the crash failure model with omission tolerance via idempotent retries; we do not defend against Byzantine faults at the protocol layer, and we mitigate hardware-induced bit-corruption via ECC RAM + per-page checksums at the storage layer." That paragraph, when honestly written, makes the assumptions visible to every future engineer who reads the doc, and the inevitable on-call conversation becomes "did the failure exceed our model?" instead of "what is happening?". The framing turns out to be cheaper than building the wrong protocol and finding out under load.

Edge cases and failure-mode interactions

The four-model taxonomy treats each fault as if it occurs in isolation, but production rarely cooperates. The ugly cases happen when two failure modes interact, and the protocol's tolerance for each individually does not compose into tolerance for the combination. Before walking through the canonical interactions, one note on process:

The taxonomy is also a checklist for code-review: any new protocol-touching change should be accompanied by an explicit answer to "does this preserve our crash-tolerance, omission-tolerance, timing-tolerance, and Byzantine-tolerance properties?". Most regressions in real distributed systems are caused by a refactor that broke one of these tolerances without anyone noticing during review. A reviewer trained on the four-model lens spots these regressions; a reviewer trained on "does the code compile and the tests pass" does not.

Crash + asymmetric partition. A node crashes, the cluster begins a failover, then the network heals just enough that some of the cluster can reach the crashed node's replacement and some still cannot. The cluster now has two cohorts each making decisions, with one cohort treating the crash as resolved and the other still treating it as in-progress. Gossip-based failure detectors (Cassandra's is the textbook example) have a long history of struggling with exactly this mix — crash plus partition produces a state the detector has no category for. The defence is that failure detectors must distinguish "I cannot reach the node" from "the node has crashed" and the cluster's protocol must resolve disagreements about which it is. Phi-accrual (Part 10) does exactly this by making the suspicion level a continuous probability rather than a binary verdict.

Timing + omission inside a single replica. A node experiences a long GC pause (timing failure) and during the pause its kernel's TCP send-buffer fills up; when the pause ends and the node resumes, the kernel begins dropping packets that overflow the buffer (omission failure). The node is now both late and losing some of its outbound messages — the leader sees occasional acks (so the node looks alive) but with random gaps (so the WAL never catches up). This is the failure mode behind the recurring "follower stuck behind, fixes itself on restart" class of bugs that every Raft implementation has hit at some point. The defence is bounded queue depth with explicit drop-or-block semantics — a queue that silently drops messages is the substrate omission failures need; a queue that blocks the producer makes the failure observable upstream.
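
The difference between the two queue disciplines fits in a few lines of asyncio. The sketch below is illustrative: put_nowait on a bounded queue raises QueueFull, so the producer sees every would-be omission and can retry, shed load, or alarm; the alternative, await q.put(item), blocks the producer instead, pushing the backpressure one hop upstream rather than losing data.

# bounded_queue.py — drop-or-block semantics make omission observable upstream (illustrative)
import asyncio

async def main():
    q = asyncio.Queue(maxsize=3)            # bounded: overload becomes visible, not silent
    observed_drops = 0
    for i in range(10):
        try:
            q.put_nowait(f"append-{i}")     # raises QueueFull instead of silently losing the message
        except asyncio.QueueFull:
            observed_drops += 1             # producer knows: retry, shed load, or page someone
    print(f"enqueued={q.qsize()} observed_drops={observed_drops}")

asyncio.run(main())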

Byzantine + crash camouflage. A buggy node returns subtly-wrong answers for some requests (Byzantine) and then, when its corruption hits an internal invariant check, panics and crashes (crash). The cluster's failure detector reports "node crashed at 14:47" and the on-call engineer treats it as a clean crash failure — but the wrong answers it returned in the 11 minutes before the crash have already propagated to clients and downstream caches. Cleaning up after a Byzantine fault that ends in a crash is harder than either pure mode, because the crash hides the evidence of the Byzantine behaviour. The defence is append-only audit logs with content checksums on every value the node served, so that post-mortem reconciliation can find the corrupted answers even after the producing node is gone.

The interaction patterns are the reason real production systems run protocols designed for one model "up" from their nominal threat model — a Raft cluster that nominally tolerates crash often runs with content checksums and signed RPCs because the marginal cost is small and the marginal robustness against the crash + Byzantine interaction is real.

Going deeper

Lamport's hierarchy and Schneider's 1990 survey

Lamport's 1982 "Byzantine Generals Problem" paper gave the field its dramatic name; Schneider's 1990 ACM Computing Surveys article ("Implementing Fault-Tolerant Services Using the State Machine Approach") provided the formal taxonomy this chapter uses. Schneider's hierarchy is: fail-stop ⊂ crash ⊂ omission ⊂ timing ⊂ Byzantine. The distinction between "fail-stop" and "crash" is subtle and worth knowing: a fail-stop node, when it fails, announces its failure (or at least is detectable instantly); a crash node simply stops, and other nodes must use timeouts to infer the failure. Fail-stop is a useful idealisation for proofs but does not exist in real networks — every real "crash" is a "crash + uncertain detection time", which is why Part 10 (failure detection) is its own multi-chapter topic.

A useful operational corollary: papers from the late 1980s and early 1990s (Cristian, Aghili-Strong, Charron-Bost) explicitly distinguished send-omission, receive-omission, and general omission — a node that drops only the messages it sends is structurally different from a node that drops messages it receives, and the protocol consequences differ. Most modern textbooks collapse the three into "omission" for pedagogy, but real-system debugging often requires the finer distinction — was the leader's AppendEntries lost on the way out (send-omission of the leader), or arrived but was dropped before processing (receive-omission of the follower)? tcpdump on both sides answers this; the protocol's behaviour is the same in either case, but the fix (replace the leader's NIC vs replace the follower's NIC) is different.

The PBFT cost — why Castro & Liskov 1999 is the protocol everyone cites

Castro & Liskov's "Practical Byzantine Fault Tolerance" (OSDI 1999) was the first BFT protocol fast enough to be considered practical — it required 3f+1 replicas, two rounds of all-to-all messaging, and signature verification on every message. The constants are heavy: a single PBFT consensus round is roughly 4× the messages and 10× the CPU of a Raft round in our benchmarks. The more recent Tendermint, HotStuff, and the consensus protocols underlying Diem / Aptos have brought the per-round message complexity down from O(n²) to O(n) using threshold signatures, but the floor is still 3f+1 replicas, and that floor comes from the Pease-Shostak-Lamport 1980 lower bound. If you ever read a paper that claims sub-3f+1 BFT, check whether it is assuming a synchronous or partially-synchronous model — the bound is provable in asynchronous settings only, and partially-synchronous protocols can sometimes do better at the cost of liveness during the asynchrony period.

For a curriculum-relevant calibration: a Bengaluru fintech running consensus over 5 nodes (tolerating 2 crash failures via Raft) would need to grow to 7 nodes to tolerate the same 2 failures under Byzantine, and that 40% hardware-cost increase is the price of admission. The architectural answer most teams reach is "we will not pay PBFT's cost everywhere, but we will pay the partial Byzantine cost (content checksums, signed RPCs, replicated audit logs) at the layers where corruption would be most expensive". This is the pragmatic compromise — Raft for ordering, BFT-style content checks for tamper-resistance — and it is the deployed shape of most modern fintech-grade replicated systems.
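
What "signed RPCs at the layers where corruption would be most expensive" can look like in its cheapest form: an HMAC over the ack payload, verified before the leader counts the ack toward a quorum. A minimal sketch with a hard-coded shared key and made-up payloads — a real deployment would pull per-node keys from a secret store, or use genuine signatures where nodes do not share trust:

# signed_ack.py — cheap tamper-evidence on acks via HMAC (illustrative)
import hmac, hashlib

SHARED_KEY = b"demo-key-not-for-production"        # hypothetical; use per-node keys in practice

def sign(payload: bytes) -> str:
    return hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, tag: str) -> bool:
    return hmac.compare_digest(sign(payload), tag)

payload = b"idx=42;data=balance=9000"
tag = sign(payload)
print(verify(payload, tag))                                 # True  — ack is intact
print(verify(b"idx=42;data=balance=9000_TAMPERED", tag))    # False — tampering is now detectable

Note the limitation: a shared-key MAC catches corruption that happens after signing (in transit, in a buggy proxy, in bad RAM along the path), not a node that lies before it signs — defending against the lying signer needs per-node keys plus content comparison across replicas, which is exactly the BFT machinery above.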

Gray failures — the failure mode that does not fit the taxonomy

Microsoft Research's 2017 paper "Gray Failure: The Achilles' Heel of Cloud-Scale Systems" (HotOS) named a failure pattern that lives uncomfortably between omission and Byzantine: a node is partly working, in a way that some observers see as healthy and others see as broken. The classic example is a node whose health-check endpoint replies fast (reporting healthy) while its actual data-plane endpoint is silently failing. Two observers — the load balancer's health-check probe and the service's actual client — disagree about whether the node is alive. This is not a model in the four-model hierarchy — it is a property of the observation, not the node, and it is the failure mode most likely to bite production teams who religiously monitor health-checks. The fix is to observe what the user observes — check the data-plane path, not a synthetic /health endpoint. Chapter 7 (gray failures) and Part 10 (failure detection) develop this further.

The deeper insight from the gray-failure paper is that failure is a relation between observer and observed, not a property of the observed alone. The four-model taxonomy implicitly treats failures as objective ("is this node faulty"), but real distributed systems have multiple observers (the load-balancer, the failure detector, the application client, the SRE staring at Grafana) and each may classify the same node into a different model at the same wall-clock instant. A protocol that depends on consensus among observers about which model applies — for example, a quorum-based failure detector — handles this correctly; a protocol that depends on a single observer's classification (a single load-balancer's /health probe) is brittle, and the brittleness shows up as the cluster unanimously believing a sick node is healthy. The refinement Part 10 develops is that failure detection itself must be a distributed-systems problem, not a single-observer query, and that move converts gray failures from "exotic" to "expected".

The synchrony spectrum and why the model depends on it

A subtlety the table at the top of the chapter glosses over: timing failures only make sense in a model where there is a deadline. In a fully asynchronous model (no bound on message delay), there is no such thing as a "late" message — every message either arrives eventually or does not, and "timing" collapses into either crash or omission. The reason real protocols can detect timing failures at all is that they implicitly assume partial synchrony — Dwork-Lynch-Stockmeyer's 1988 model, where the network is asynchronous for some unknown initial period and then becomes synchronous (with known message-delay bound). Most production systems make this assumption tacitly: the heartbeat interval is a guess at the synchrony bound, the failure-detection timeout is a guess at how long the asynchrony period might last, and the protocol's correctness depends on these guesses being right "most of the time". When the guesses are wrong (a 4.2-second GC pause in a system designed for 200 ms pauses), the protocol does not crash but its safety/liveness analysis no longer applies — and that is exactly when split-brains and stuck states emerge. The synchrony spectrum is a knob, not a binary, and tuning it is a Part-10 / Part-11 conversation that this chapter only flags.

Why "fail-fast" hardware is a Byzantine defence

Modern server hardware ships with several "fail-fast" features that look like reliability engineering but are actually Byzantine defences in disguise. ECC RAM detects and corrects single-bit errors and panics on multi-bit errors — converting a Byzantine fault (silently corrupted memory) into a crash failure (clean panic), which is much easier to recover from. Intel's and AMD's Machine Check Architecture (MCA) does the same for CPU caches and buses. NVMe drives use end-to-end CRCs on every block, converting silent disk corruption into explicit IO errors. NIC checksum offload verifies every TCP segment on the way in and out. Each of these features is a hardware-level transformation from a higher fault model into a lower one — Byzantine becomes crash, omission becomes detected omission, timing becomes bounded latency. The reason commodity-cloud distributed systems can tolerate as much as they do is that the hardware below them is doing relentless fault-model translation, and the protocol layer only has to handle what leaks through. When hardware is cheap or weakly-spec'd (the BharatBazaar non-ECC rack story), the leak rate spikes and protocols designed for crash-only suddenly need to handle Byzantine; the design failure was not in the protocol but in the hardware procurement decision.

Reproduce this on your laptop

# Reproduce the four-failure-modes simulator
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
python3 failure_modes.py
# Vary the random seed and the deadline parameter to see how the boundary
# between "agreed=1" and "agreed=0" shifts with the protocol's tolerance.

To see real Byzantine failures in action without writing them yourself, install Jepsen and run one of its built-in test scenarios against a misbehaving database — the output is a worked example of a checker that detects content-disagreement across replicas:

# https://jepsen.io/ — Aphyr's test framework
git clone https://github.com/jepsen-io/jepsen
cd jepsen/etcd
lein run test --time-limit 60 --concurrency 10
# Watch the linearisability checker classify each operation; a Byzantine
# fault would show up as "indeterminate" or as a violation of the
# read-your-writes invariant.

Where this leads next

This chapter is the foundation that the next four chapters build on: each subsequent chapter takes one of the four models and develops it in production-grade depth.

The four-model taxonomy is one of the most-used framings in distributed systems — every consensus protocol's safety/liveness analysis explicitly states "we tolerate up to f Byzantine faults" or "up to f crash faults", and the difference is the bound this chapter spelled out. Reading any subsequent paper or documentation, the first question to ask is "what failure model does this protocol assume?", and the second is "does my actual environment match that assumption?". Mismatches between assumed and actual model are the largest single source of distributed-systems bugs that ship through code review and testing — because the code is correct under the assumed model and the tests run under it, the bug only appears when the actual environment violates the assumption, often in production at the worst time.

A small concrete drill to internalise the four-model lens: take the most recent incident your team handled and answer four questions. Which of the four failure modes was the root cause? Which model did the affected protocol assume? If those don't match, was the gap in the protocol's design (model under-spec'd) or the deployment environment (hardware / network worse than assumed)? What would the protocol need to add to handle one model up — say, treating crash as omission or omission as Byzantine? Doing this for ten incidents in a row teaches the four-model lens better than any paper. The lens then becomes the question you ask during design review for every new component, not after the post-mortem.

A second drill, sharper and more useful for an on-call rotation: open your most-paged service's runbook and grep it for the words "restart the node", "remove from the cluster", "wait for healthy", and "force failover". Each of those operational responses corresponds to an implicit failure-model assumption — restart-the-node assumes the failure is crash-recoverable; remove-from-cluster assumes the failure is permanent crash; wait-for-healthy assumes the failure is timing; force-failover assumes the failure detector is wrong. A runbook with a single mitigation step for every alert is a runbook that has been written under a single failure-model assumption, and the production engineer applying it does not get to choose. The richer runbook has a decision tree: what does the node's data-plane look like? what does the WAL position look like? are content checksums consistent across replicas? — each branch corresponding to a different failure-model diagnosis and a different mitigation. The richer runbook is also the one whose first version takes a week to write and saves a year of mis-applied mitigations.

The four-model lens is not a separate skill from on-call response; it is the diagnostic skeleton that turns "the cluster is misbehaving" into "this is omission, the failure detector hasn't caught it because heartbeats are surviving, the fix is to add a data-plane probe". Engineers who internalise the lens move two phases faster through every distributed-systems incident, because they spend less time guessing what kind of failure they're looking at and more time applying the right mitigation. That is the practical pay-off of this chapter, beyond the academic taxonomy — the framing turns into a production skill, and the production skill is the difference between a 4-hour incident and a 4-day one.

References

  1. The Byzantine Generals Problem — Lamport, Shostak, Pease, ACM TOPLAS 1982. The paper that named the fourth model and showed that 3f+1 nodes are required to tolerate f traitors when messages are not signed.
  2. Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial — Schneider, ACM Computing Surveys 1990. The formal hierarchy of failure models this chapter follows.
  3. Practical Byzantine Fault Tolerance — Castro & Liskov, OSDI 1999. The first BFT protocol fast enough for practical use; still the canonical reference.
  4. Reaching Agreement in the Presence of Faults — Pease, Shostak, Lamport, JACM 1980. The original 3f+1 lower bound proof for Byzantine consensus.
  5. Gray Failure: The Achilles' Heel of Cloud-Scale Systems — Huang et al., Microsoft Research, HotOS 2017. The failure mode that lives between omission and Byzantine.
  6. DRAM Errors in the Wild: A Large-Scale Field Study — Schroeder, Pinheiro, Weber, SIGMETRICS 2009. The Google-scale field study showing that memory errors in production are far more common than lab estimates had suggested, motivating ECC.
  7. Wall: distributed is a failure-first design — internal cross-link to the chapter that established why these models matter.
  8. Designing Data-Intensive Applications — Kleppmann, O'Reilly 2017. Chapter 8 ("The Trouble with Distributed Systems") gives the practitioner's view of these four failure modes.