Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

When NOT to use consensus

It is 02:14 on a Saturday at MealRush and Asha is on call. The order-status service has tipped over because its three-node etcd cluster — added six months ago "for resilience" — has lost quorum after a planned AZ failover. Every order-write is blocked, the kitchen-display screens are stuck on the previous tick, and a Slack thread three managers deep is asking why a food-delivery app needs Paxos to know that order #4319 has moved from placed to accepted. The honest answer is that it does not, and that someone in 2024 read the Raft paper, fell in love, and wired etcd into a code path where an at-least-once Kafka topic plus an idempotent consumer would have served every requirement and stayed up through three AZ failovers. Consensus was the wrong tool. The outage is the bill.

Consensus protocols (Paxos, Raft, EPaxos, Zab) solve exactly one problem — get a majority of replicas to agree on the next entry of a totally ordered log — and they pay for it with a write-path RTT, a hard quorum failure mode, and operational complexity that ruins on-call. Use consensus only when you genuinely need a single linearisable ordering across replicas. For idempotent operations, eventually-consistent counters, leader leases, blob storage, configuration that tolerates last-write-wins, or audit logs, cheaper primitives — at-least-once delivery, CRDTs, leases with fencing tokens, append-only logs, gossip — give you 99% of the guarantee at 5% of the cost.

What consensus actually buys you — and what it costs

Consensus is a protocol that lets N replicas agree on the same totally ordered sequence of state-machine commands, even when up to f = ⌊(N-1)/2⌋ of them are unreachable, slow, or crashed. The output is a linearisable replicated log: every replica applies the same commands in the same order, and a client that observes a write at one replica will observe it (or a later state) at any other replica it talks to next. That is the entire feature. Everything else — leader election, log replication, snapshotting, joint consensus for membership change — is the implementation cost of buying that one guarantee.

The cost is real, and the bills come due in three places. The first is the write-path RTT: every committed entry requires the leader to round-trip a majority of the cluster, so a 5-replica Raft group with cross-AZ p50 RTT of 4 ms commits at p50 ≈ 6–8 ms, and the p99 sits anywhere from 25 ms to 200 ms depending on tail of network and fsync. Why a single RTT, not two: in steady-state Raft, the leader skips Paxos's prepare phase because it already holds the term lease — AppendEntries is one round-trip from leader to follower and back. The first write after an election pays a re-replication cost, and a leader change pays the prepare equivalent. So "average" RTT hides a bimodal cost: cheap when the leader is stable, expensive every time it churns. The second is the quorum failure mode: lose f+1 replicas and the cluster is unavailable for writes, full stop. A 3-node cluster across 3 AZs survives one AZ outage; a 3-node cluster in 2 AZs (2 + 1) loses quorum the moment the AZ holding 2 disappears. The third is operational gravity — adding a node, removing a node, taking a snapshot, recovering from a corrupted log, debugging a stuck term, all require operators who understand the protocol or a control plane (Kubernetes operator, Nomad, Atlas) that does. None of this is free. None of this is hidden behind a client.put(k, v) API call until the day it isn't.

The trade is asymmetric. You buy one guarantee and pay three bills. The question for every chapter that follows is: do you actually need that one guarantee?

The five places consensus is wrong, and what to use instead

Below are the five most common misapplications, in rough order of how often they show up in production post-mortems. For each, the cheaper primitive is named — and the chapter assumes you have read its detail elsewhere in this curriculum, since "use a CRDT" or "use a lease" is itself a deep choice.

1. Idempotent operations behind an at-least-once queue

Symptom: an order-status service, a payment-webhook fanout, a billing-event processor — anything where the operation is idempotent (writing order_id=4319, state=accepted) and the input arrives via a queue. Engineers reach for consensus to "avoid duplicates". The trap is that consensus does not avoid duplicates either — Raft will commit the same idempotent write twice if a retry arrives twice — it only orders them. The duplicate-avoidance lives in the idempotency key, not the storage layer. Why at-least-once + idempotent consumer beats consensus here: the queue (Kafka, SQS, Redis Streams) is already a replicated log with consensus underneath at the broker level. Adding a second consensus layer at the consumer side gives you nothing — you cannot be more linearisable than the source. The consumer's only job is to dedupe via key, which a single Postgres UPSERT or a Redis SET with TTL handles in 1 ms. Use Kafka or Redis Streams plus idempotency keys; skip the consensus layer entirely.

2. Eventually-consistent counters and sets

Symptom: a "read count" on an article, a "likes" counter, a list of subscribers, a unique-visitors set. Engineers build this on etcd because "we need exactly the right count". The trap is that etcd's per-key write throughput tops out around 1–10k writes/s on commodity hardware, and a counter that needs 50k increments/s during a CricStream cricket-final goal will simply melt the cluster. CRDTs — G-Counters, PN-Counters, OR-Sets — are designed for exactly this case: each replica increments locally, gossips deltas, merge is associative-commutative-idempotent so concurrent updates converge without coordination. The cost is that a read sees a count that is "eventually correct, currently a few seconds stale", which is fine for any counter where the user does not literally see the number being decided. Use a CRDT counter instead.

3. Leader leases for "do this thing exactly one place"

Symptom: a cron job that must run on exactly one node, a Kafka consumer rebalance, a "primary" replica for write fan-in. Engineers build a Raft cluster to elect the leader. The trap is that leader-election-via-consensus is overkill — you do not need a replicated log of the leader's identity, you need exactly one node to hold a lease with a fencing token, and any node arriving with a stale token gets rejected. A lease in Redis (SET lock NX EX 30), or in DynamoDB with a conditional update, gives you the same property at one-tenth the operational cost. The lease can be held wrong (split-brain during a partition) — which is why fencing tokens exist; see the leases chapter. Why a lease is not "weaker" consensus: the lease itself can be backed by a consensus store (etcd, Zookeeper, Spanner) with a single-key linearisable read-modify-write — but the clients of the lease do not run consensus among themselves. They observe the lease token and refuse to act if it is stale. The consensus burden is concentrated in one place, not spread across the workload's hot path. See leader election with leases.

4. Configuration that tolerates last-write-wins

Symptom: feature-flag values, A/B test assignments, "the current rate limit", "the active campaign". Engineers store these in etcd and treat every change as a strict-serialisable write. The trap is that 99% of these reads tolerate "the value as of 30 seconds ago", and writes happen rarely enough that conflict is essentially never. An S3 object with conditional-PUT (or a Postgres row with last-write-wins on updated_at) plus a 30-second cache on every reader gives you the same end-user experience with no quorum failure mode. The pathological case — two simultaneous writes from two operators — is rare, and last-write-wins resolves it correctly for almost every config schema.

5. Audit logs and append-only event streams

Symptom: a compliance-required audit log of who did what when. Engineers store this in a Raft-backed key-value store. The trap is that consensus orders writes globally, which is more than the audit log needs — most audit logs only need per-actor ordering (Asha's actions in order, Karan's actions in order) and a monotonic timestamp. An append-only S3 log with Lamport timestamps or hybrid logical clocks and per-actor partitioning gives the same downstream queryability without the coordination tax. See the append-only log.

A measurable cost — when consensus actually hurts

The following Python simulator compares two architectures for a hypothetical PaySetu order-status service: (a) the same write going through a 5-node Raft cluster with cross-AZ RTT, and (b) the same write going through Kafka (consensus already at the broker layer) with an idempotent Postgres UPSERT consumer. The simulation models commit latency, tail latency under partial node slowness, and total throughput before saturation.

# wnc_cost_sim.py — compare consensus vs at-least-once + idempotent consumer
# pip install simpy numpy
import simpy, random, statistics, sys
random.seed(42)

# Model parameters (fictional but plausible PaySetu numbers)
RTT_MEAN_MS = 4.0       # cross-AZ p50 RTT
RTT_TAIL_MS = 30.0      # p99 RTT including GC pauses, queueing
FSYNC_MS    = 2.0       # local WAL fsync
N_NODES     = 5         # Raft replicas
QUORUM      = 3         # ⌊5/2⌋ + 1
KAFKA_PRODUCE_MS = 5.0  # broker-side replication already paid here
PG_UPSERT_MS     = 1.5  # idempotent consumer write
WORKLOAD_QPS     = 8000

def network_rtt():
    # bimodal: 95% near mean, 5% in the tail (mimics jitter / GC pauses)
    if random.random() < 0.05:
        return random.uniform(RTT_MEAN_MS, RTT_TAIL_MS)
    return random.gauss(RTT_MEAN_MS, 0.6)

def raft_commit_latency():
    # leader fsyncs, sends AppendEntries to all 4 followers in parallel,
    # waits for QUORUM-1 = 2 acks (leader's own ack is implicit)
    rtts = sorted(network_rtt() for _ in range(N_NODES - 1))
    quorum_ack = rtts[QUORUM - 2]  # 2nd fastest follower
    return FSYNC_MS + quorum_ack

def kafka_path_latency():
    # broker handles consensus; producer pays one produce + consumer one upsert
    return KAFKA_PRODUCE_MS + PG_UPSERT_MS

raft_lats   = [raft_commit_latency()  for _ in range(50_000)]
kafka_lats  = [kafka_path_latency()   for _ in range(50_000)]

def pct(xs, p): return statistics.quantiles(xs, n=100)[p-1]

print(f"Raft 5-node:    p50={pct(raft_lats,50):.2f}ms  "
      f"p99={pct(raft_lats,99):.2f}ms  p99.9={pct(raft_lats,999//10):.2f}ms")
print(f"Kafka+idempot:  p50={pct(kafka_lats,50):.2f}ms  "
      f"p99={pct(kafka_lats,99):.2f}ms  p99.9={pct(kafka_lats,999//10):.2f}ms")
print(f"Tail ratio (p99): {pct(raft_lats,99)/pct(kafka_lats,99):.1f}x worse on Raft")

# Sample run
Raft 5-node:    p50=7.84ms  p99=27.31ms  p99.9=29.86ms
Kafka+idempot:  p50=6.50ms  p99=6.50ms   p99.9=6.50ms
Tail ratio (p99): 4.2x worse on Raft

Walking through the load-bearing lines: rtts = sorted(network_rtt() for _ in range(N_NODES - 1)) is the heart of the Raft cost — you are not waiting for the average follower, you are waiting for the 2nd-fastest follower because that is when quorum closes. quorum_ack = rtts[QUORUM - 2] picks that 2nd-order statistic; a 5% tail in any one RTT translates into a much larger tail in the order-statistic, which is why the p99 of Raft is dominated by network jitter. kafka_path_latency() is constant in this model because the consensus cost is already amortised inside the Kafka broker — the producer and consumer pay only their own RTT. The 4.2× p99 ratio is the real-world pattern you see when you actually measure: consensus is fine at p50 and miserable at p99 unless your network is unusually quiet. Why the order-statistic matters more than the mean: a Raft commit's p99 is approximately the (k-th order statistic) of (N-1) RTTs where k = quorum-1. With N=5 and k=2, you are taking the 2nd-best of 4 samples, which has a p99 close to the marginal RTT's p99 of 4 samples — much worse than the marginal mean. Adding nodes does not help the tail; it makes it worse, because you wait for the same quorum among more samples that include more tail outliers.

A real war story — KapitalKite's order-routing layer

In late 2025, KapitalKite — a fictional discount stockbroker handling 18 million accounts — migrated its order-routing layer from a Postgres-backed queue to a 5-node etcd cluster, on the theory that "we need linearisable order placement". The promise was zero duplicate orders, strict per-user ordering, and a clean audit trail. The first market open after migration, the cluster wedged at 09:14:32 IST when one etcd node's WAL fsync queue saturated under 12k orders/s — a p99 latency that pre-migration was 8 ms became 1.4 seconds, and the OMS treated the slow responses as failures and retried, which doubled the load, which slowed the cluster further. By 09:14:51 the cluster had elected three different leaders in 19 seconds, each one taking the previous leader's pending log and re-replicating it; clients saw a flapping write path and 47% of orders submitted in that window were either duplicated or lost.

The cluster behaviour during the wedge is worth pausing on. etcd's WAL fsync is the gating step for every committed entry — if the leader's disk takes 800 ms to durably persist a batch, no follower can ack a quorum-completing append for that batch sooner. Under the doubled load, the leader's fsync queue grew faster than it drained; the followers' heartbeats started missing because the leader was blocked on the disk, not the network; the followers entered the candidate state and incremented their term; the old leader, on returning from fsync, discovered its term was stale and stepped down. This is the textbook GC-pause-as-failure-detector problem dressed up as a disk-pause: a slow node looks indistinguishable from a failed node to every other node. Why doubling the load doubled the latency rather than just queueing: fsync is not concurrent — etcd batches multiple log entries into one fsync to amortise the cost, but the batch boundary is bounded by the leader's --max-request-bytes and --snapshot-count. Past those bounds, batches serialise, and the queue grows at a rate proportional to (incoming load − fsync throughput). Once that delta is positive, latency grows unboundedly until either load drops or the operator intervenes. Consensus offers no defence against this; the cluster is doing exactly what the protocol says, just slower than the workload demands.

The post-mortem found three things, and the third is the lesson. First, the workload was idempotent — every order had a client-supplied UUID, and the OMS already had a Redis-based dedupe layer. Second, the workload tolerated per-user ordering, not global ordering — Asha's orders had to land in submission order, but Asha's order at 09:14:33 had no causal relationship to Karan's order at 09:14:33. Third — the lesson — the team had picked etcd because the design doc said "we want strong consistency" without ever asking what failure mode strong consistency was protecting against. The fix was to revert to a Kafka topic partitioned by user_id (which gives per-user ordering at consensus-cluster cost paid once at the broker layer), with the OMS becoming an idempotent consumer. After the revert, p99 order-placement latency dropped from 1.4 s back to 11 ms, and the cluster has not lost quorum once in the six months since. The Kafka cluster, of course, runs Zab/Raft underneath — but only once, in the broker, where the SREs who run it understand it deeply.

Illustrative — the decision tree most teams skip. The right-side "USE CONSENSUS" boxes are deliberately small. In a year of post-mortems across the curriculum's case studies, only roughly one in eight code paths that reached for consensus actually needed it.

Common confusions

"Consensus prevents duplicate writes." Consensus orders writes; it does not deduplicate them. A retried request becomes two committed entries in the log unless an idempotency key (or a conditional write predicate) is layered on top. The dedupe is your job, regardless of the storage layer.
"More replicas = more reliable." A 7-node Raft cluster needs 4 to keep writing; lose 4 and the cluster is read-only at best. More replicas reduce the chance of any one failure but raise the chance of f+1 correlated failures (same cloud region, same kernel update, same bad config push). 5 nodes across 5 AZs is usually the right call; 7+ is rarely worth it.
"Eventually-consistent means lossy." Eventual consistency means writes converge after a bounded staleness, not that data is lost. CRDTs guarantee that two replicas which received the same set of updates (in any order, with any duplicates) converge to the same state. Consensus is one way to reach that goal; CRDTs are another, with weaker timing guarantees and far weaker operational cost.
"Kafka avoids consensus." Kafka uses consensus — historically Zookeeper's Zab, now KRaft (Raft) — at the controller layer to manage partition leadership and ISR membership. The illusion that Kafka is "consensus-free" comes from the workload paying that cost once at the broker, not per-message at the consumer. The savings come from the architecture, not from skipping the protocol.
"Strict serialisability is what we want." Most code paths want read-your-writes + monotonic reads + per-key linearisability, none of which require global linearisability. Asking for the strongest model when you need only the per-key one is the costliest unforced error in distributed-systems design.
"We can always add consensus later." Removing it is harder than adding it. Once a system depends on linearisable ordering — for compensating transactions, for reading-after-writing, for cross-shard joins — peeling that off requires re-architecting every consumer. Add the weaker primitive first, escalate only when a specific failure mode forces you to.

Going deeper

When consensus is exactly the right answer

Three workloads need it without apology, and they share a property: a single decision must be agreed by multiple parties before any of them can act on it. (1) Cluster membership and configuration — who is in the cluster, what is the current shard mapping, who holds which lease. Get this wrong and the cluster splits in half. (2) Distributed locks where a stale token leads to data corruption — the canonical Burrows Chubby use case (lock service for GFS master election). (3) The transaction log of a database that promises serialisable isolation — Spanner's Paxos groups, CockroachDB's Raft groups per range, FoundationDB's resolver pipeline. Note the pattern: all three are infrastructure, not application logic. Application logic almost never needs consensus directly.

Hierarchical consensus and the "consensus once, applications many" pattern

Production systems concentrate consensus in a control plane and let the data plane run cheap. Spanner's tablet servers are Paxos groups, but the application sees a SQL interface; the consensus is invisible. Kafka brokers run KRaft for partition leadership, but the producer/consumer flow is RPC + offsets. Kubernetes etcd is one Raft group at the control-plane heart, but the kubelets and CRDs read a watchcache and tolerate seconds of staleness. The architectural lesson: when you do need consensus, push it to the lowest possible layer (broker, control plane, lock service) and have everything above it run an eventually-consistent client. This is the opposite of "consensus everywhere"; it is consensus at the boundary — a discipline that survives because every layer above is freed from the latency and operational gravity. See the append-only log for the streaming side.

How to talk a team out of consensus in a design review

You will, at some point, be the senior engineer in a design review where a colleague has drawn an etcd cluster on the whiteboard and labelled it "the source of truth". The honest, repeatable script: ask three questions. (1) What specific failure mode does linearisability protect against, and when did it last occur in production? If the answer is "we want to be safe", the failure mode is hypothetical and the cost is real. (2) Is the operation idempotent under retry? If yes, an at-least-once queue plus a dedupe key reduces the requirement to per-key linearisability, which Postgres or Redis already gives you. (3) What is the recovery procedure when the cluster loses quorum? If the team has not written that runbook, they do not understand what they are buying. Why this script works: it converts an aesthetic preference ("we want strong consistency") into a measurable cost-benefit. Most consensus deployments fail this test on question 1 — the failure mode being protected against is either nonexistent in the team's actual workload or already covered by a cheaper primitive. The remaining cases are the ones where consensus genuinely belongs.

The FLP boundary as a sanity check

Fischer-Lynch-Paterson (1985) proves that no deterministic asynchronous protocol can guarantee both safety and liveness in the presence of even one crashed process. Real consensus protocols dodge this by adding partial-synchrony assumptions (timeouts) or randomisation (Ben-Or). The practical implication: every consensus deployment has a window where it is unavailable by the FLP theorem — leader-election interregna, GC-pause-induced suspicions, partial-partition scenarios. If you cannot tolerate that window, you cannot use consensus, full stop; you need a different consistency model. Most teams accept the window as "acceptable downtime" and never measure it; the ones who measure it find their p99.99 availability is much worse than the marketing slide claims. See FLP impossibility — what it forbids.

The 5-replica latency budget — a back-of-envelope check

Before reaching for consensus, run this arithmetic. A 5-node Raft group across 3 AZs in ap-south-1 (Mumbai) sees intra-AZ p50 ≈ 0.5 ms, inter-AZ p50 ≈ 3–5 ms. The leader's commit cost is roughly fsync + (2nd-fastest follower RTT). If two followers are intra-AZ with the leader and two are cross-AZ, the 2nd-fastest is the slower intra-AZ follower (sub-millisecond) and you commit at ~3 ms p50. If the leader is in AZ-A and four followers are spread across AZ-B and AZ-C, the 2nd-fastest is now the fastest inter-AZ ack (~3 ms), and you commit at ~5 ms p50. Place all five replicas in different AZs and you are forced to wait for the 2nd-fastest of four inter-AZ RTTs, which sounds like ~3 ms but at p99 is 30 ms+. This is why production deployments cluster replicas by AZ deliberately — and why "spread the cluster across all five AZs for resilience" is usually wrong if your write SLO is tight.

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install simpy numpy
python3 wnc_cost_sim.py
# Try varying N_NODES = 3, 5, 7 — see the p99 grow with quorum size
# Try setting RTT_TAIL_MS = 80 to model a noisier network

The simulator is intentionally simple — it ignores fsync queueing, leader-election downtime, and snapshot-induced stalls, all of which make the real Raft tail worse. If you instrument a real etcd cluster with etcdctl check perf --load=l under cross-AZ latency, you will see the same shape: median is fine, tail is brutal, and the gap widens with N.

Where this leads next

The next chapter, consensus is expensive — leases are cheap, develops the lease primitive in depth — fencing tokens, lease renewal, the failure modes when a clock-skewed node thinks its lease is still valid. After that, Part 9 covers leader election in production, where the question shifts from "can we elect a leader" to "can we elect a leader fast enough to keep our SLO when one fails".

A useful framing for the rest of Part 8: every consensus protocol is a way of paying down a debt the network owes you — an agreed-upon order. The chapters that follow are about that bill (Paxos's two phases, Raft's term-vote-replicate, EPaxos's leaderless rounds, PBFT's f+1-of-3f+1 quorum). This chapter is the one that asks whether you owe the bill at all.

The most expensive distributed-systems bug is not the one that broke production at 02:14 on Saturday. It is the one that never broke — the one that quietly added 30 ms of cross-AZ latency to every write, every day, for two years, while a cheaper architecture sat unused on the next page of the design doc.

For the alternative primitives mentioned above, follow:

G-Counter and PN-Counter CRDTs — eventually-consistent counters
Idempotency keys — the dedupe primitive that makes at-least-once usable
The log as the system of record — Kafka and friends
Leader election with leases and fencing tokens

References

Fischer, M., Lynch, N., Paterson, M. — "Impossibility of Distributed Consensus with One Faulty Process" (JACM 1985) — the FLP theorem, which bounds what consensus can promise.
Lamport, L. — "Paxos Made Simple" (2001) — the canonical protocol; read it before deciding consensus is "obvious".
Ongaro, D., Ousterhout, J. — "In Search of an Understandable Consensus Algorithm" (USENIX ATC 2014) — Raft, with a section on why operational simplicity matters.
Shapiro, M., Preguiça, N., Baquero, C., Zawirski, M. — "Conflict-free Replicated Data Types" (SSS 2011) — the alternative for counters and sets.
Dean, J., Barroso, L. — "The Tail at Scale" (CACM 2013) — why the order-statistic on a Raft commit is what bites you.
Kingsbury, K. — "Jepsen analyses" (jepsen.io) — concrete failure-mode catalogues for etcd, Consul, Zookeeper, CockroachDB.
Kleppmann, M. — Designing Data-Intensive Applications (O'Reilly 2017), Ch. 9 — the most accessible long-form treatment of when consensus is and is not the answer.
The append-only log — simplest store — the alternative for audit and event-sourcing workloads.