Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Wall: consistency at scale needs new models

It is 02:14 at CricStream — a fictional streaming platform — during the final over of a T20 final, and 27 million viewers are watching the same ball delivery on roughly 4,200 edge nodes. Anish, on call, is staring at a graph that does not make sense: the leaderboard shows two scores for the same match-id, the user-profile service is returning a follower count that decreases when refreshed, and the chat replica in the Mumbai region is two minutes behind the one in Hyderabad even though both serve the same channel. Nothing is broken. Every component is doing exactly what it was built to do. The thing that is wrong is the consistency model — the contract between writers and readers about what "the value" of a key means when there are 4,200 copies of it. This article is about why that contract has to change as you scale, and what the menu of replacement contracts looks like.

A single-node database has one copy of every value, so the contract is simple: a read returns the value of the most recent write. As soon as you have replication, that contract — linearisability — costs a quorum round-trip on every operation, and at thousands of nodes across regions the cost becomes prohibitive. The escape is to weaken the contract: drop real-time ordering and you get sequential consistency; drop ordering between unrelated operations and you get causal; drop the cross-replica agreement entirely and you get eventual. Each weakening saves latency or throughput; each costs the application a class of bug it must now defend against. This wall closes Part 11 by handing off to Part 12, which is the ladder of those contracts and what each one buys you.

Why the one-copy fiction breaks

A single-node database is easy to reason about because there is exactly one place where the bytes live, and every reader and writer queues up at that place. The "value" of a key is a fact about a single memory address. When two clients write x=5 and x=7 in that order, the second write overwrites the first, and every subsequent reader sees 7. The contract is implicit and free: the database does not pay anything to enforce it.

Replication is the moment that fiction stops being free. Once x lives on three nodes — A, B, C — the question "what is the value of x?" has no unambiguous answer. If client-1 writes x=7 to A and client-2 reads x from B before B has heard from A, client-2 sees the old value. The only way to reproduce single-node semantics — the property the textbook calls linearisability — is to make every read consult enough replicas to overlap with every prior write's quorum. That is a quorum-round-trip per read.

Why a quorum on every read: linearisability requires that there be a single global order on operations consistent with real-time. The only way to achieve that without a magic synchronised clock is for every read to "intersect" with every write. With N=3, W=2, R=2, any read quorum overlaps any write quorum in at least one replica, and that replica has the latest value. Cut R to 1 and you save the RTT but lose the intersection guarantee — a stale replica can answer the read.
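The intersection argument can be checked by brute force — a minimal sketch (the function name and structure are mine, not from any library), which confirms that the guarantee is exactly the R + W > N condition:

```python
# quorum_overlap.py — check that every read quorum overlaps every write quorum
from itertools import combinations

def quorums_intersect(n, w, r):
    """True iff every write quorum of size w shares at least one
    replica with every read quorum of size r (out of n replicas)."""
    replicas = range(n)
    return all(set(wq) & set(rq)
               for wq in combinations(replicas, w)
               for rq in combinations(replicas, r))

print(quorums_intersect(3, 2, 2))  # True  — R + W > N, every read sees the latest write
print(quorums_intersect(3, 2, 1))  # False — a stale replica can answer the read alone
```

Cutting R to 1 with N=3, W=2 is precisely the case where a read quorum ({C}) can miss the write quorum ({A, B}) entirely.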

At three nodes in one rack, the quorum RTT is 0.4 ms and you do not notice. At twenty-one nodes across three Indian regions (Mumbai ap-south-1, Hyderabad, Chennai), inter-region RTT is 25–55 ms and a quorum read costs you 50 ms minimum — on every read. At 4,200 edge nodes serving a cricket final, a literal Raft log replicating every score update would queue up behind a 6 ms quorum round and saturate at ~165 writes per second — three orders of magnitude below what the workload demands.

The wall is this: the strict contract — linearisability — is affordable for small clusters in one datacenter, and prohibitively expensive for large clusters across regions. You will give up the contract whether you want to or not; the only question is whether you give it up deliberately, with knowledge of what you are losing, or accidentally, by not knowing what your replication topology actually guarantees.

[Figure: linearisable read latency vs cluster size (illustrative, not measured data). Per-read latency (0–100 ms) against cluster size N on a log scale (3 to 10,000 nodes), three curves: single-rack stays flat near 0.4 ms, single-region rises from 1 to 5 ms, multi-region rises from 25 to 80 ms. A horizontal dashed line at 10 ms marks a typical p50 application budget; a vertical dashed line near N ≈ 50 marks the wall — beyond it, multi-region linearisability no longer fits the budget.]
Linearisable reads stay cheap inside one rack and survive a single region with low cost. Once the read quorum spans regions, every operation pays inter-region RTT. Past the dashed line, the linearisable contract no longer fits a typical 10 ms p50 budget — that is the wall.

The lattice — what you negotiate next

Once linearisability is too expensive, you do not "turn off consistency". You move down a lattice of weaker contracts, each one defined by what it stops promising and what it still does. The lattice from strongest to weakest is roughly: linearisable → sequential → causal → read-your-writes → monotonic-read → eventual. Each level removes one promise.

Linearisability promises that there is a single global ordering of operations consistent with real-time wall-clock order. Sequential consistency removes the "consistent with real-time" clause: there is still a single global ordering, but it does not have to match the order in which clients issued the operations. If you write x=5 at 10:00:01 and a friend writes x=7 at 10:00:02, sequential consistency is happy if every reader sees 7 then 5 — the ordering is consistent across readers, just not consistent with wall-clock. Sequential lets a system batch writes by replica without negotiating a global timestamp; you save the per-operation coordination cost.

Causal consistency removes the "single global ordering" clause: it only orders operations that are causally related — operation A happened-before operation B (a la Lamport) iff A is in the past light-cone of B. Operations that are concurrent (no causal link) may be observed in different orders by different replicas. The cost saving is that two writes to unrelated keys can be replicated in parallel without coordination. Most user-facing applications — comments under a post, follower counts, leaderboards — only need causal: as long as my replies arrive after my own comment that triggered them, the system feels right.

Read-your-writes is even weaker: a single client never sees an older version of its own writes. Different clients may disagree about each other. Monotonic-read says a single client never sees time go backward. Eventual says only that, given no new writes, all replicas converge to the same value — eventually, with no time bound. CRDTs (Conflict-free Replicated Data Types) are the implementation pattern that makes "eventual" actually safe: by restricting writes to operations that commute, you get convergence by construction.

Why each level saves something: linearisability requires a quorum read, which costs an inter-replica RTT. Sequential lets writes be ordered by a single timestamp authority but does not require reads to wait — you can save the read RTT. Causal only requires tracking the dependency vector (vector clock) per write — you can replicate two unrelated writes in parallel. Eventual requires only a per-replica local commit and an asynchronous gossip later — no synchronous coordination at all.
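The dependency tracking that causal needs can be sketched with vector clocks — a minimal illustration (replica names and the comment/reply example are hypothetical):

```python
# vector_clock.py — happens-before on dependency vectors (illustrative sketch)
def happened_before(a, b):
    """a -> b iff a <= b component-wise and a != b (Lamport's
    happens-before, encoded as vector-clock dominance)."""
    keys = set(a) | set(b)
    dominated = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    return dominated and a != b

def concurrent(a, b):
    """No causal link in either direction — safe to replicate in parallel."""
    return not happened_before(a, b) and not happened_before(b, a)

comment = {'mumbai': 1}                  # the original comment's write
reply   = {'mumbai': 1, 'hyderabad': 1}  # the reply carries the comment in its past
other   = {'chennai': 1}                 # an unrelated write

print(happened_before(comment, reply))   # True  — must be applied in order
print(concurrent(comment, other))        # True  — may be applied in any order
```

The saving is visible in the two outputs: only the causally linked pair constrains replication order.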

Why each level costs the application a bug class: linearisability hides race conditions from the application. Drop to sequential and a real-time ordering bug appears — process A wrote 5, then asked B and B said 3. Drop to causal and concurrent-write conflicts appear — two replicas update the same field with no causal link, and the merge is application-defined. Drop to eventual and the application must tolerate divergence for an unbounded interval. Picking the level is picking which of those bug classes you want to write defensive code for.

The PACELC theorem (Abadi 2012) is how this menu is taught for production: if there is a Partition, the system trades Availability for Consistency (PA/PC — the CAP face); else, even with no partition, the system trades Latency for Consistency (EL/EC). Most large-scale systems are PA-EL — under partition, they stay available and accept inconsistency; in the normal case they accept inconsistency to stay fast. Some — like DynamoDB — are PA-EL with a user-tunable L/C knob per request.

[Figure: the consistency lattice — strongest at the top, weakest at the bottom (illustrative). Six levels, each implied by the one above: linearisable — global order, real-time; pay a quorum RTT per op (Spanner, etcd, ZooKeeper). Sequential — global order, no real-time; drop the wall-clock requirement (VoltDB, Calvin). Causal — happens-before only; unrelated ops replicate in parallel (COPS, Bayou). Read-your-writes — per-session monotone; session stickiness (DynamoDB sessions, MongoDB sessions). Monotonic-read — no time travel for one reader; replica pinning per session. Eventual — converge, no time bound; async gossip + CRDTs (Riak, Cassandra default, Dynamo). PACELC governs the trade.]
Each level removes one promise from the level above and gains some latency or availability in exchange. Production systems pick a level *per operation type*, not per database — a checkout total wants linearisable, the follower count wants eventual.

Measuring the cost — a simulation of one write across a region

To make the abstract concrete: simulate a write being acknowledged under three contracts on a simple two-region replica set — one Mumbai replica and one Hyderabad replica, roughly 25 ms apart. The simulation reports per-write latency for linearisable (quorum = 2 acks required), causal (single ack required, dependency tracked), and eventual (local ack, async replicate).

# consistency_cost.py — simulate one write under three contracts
import random

random.seed(7)

# RTTs in milliseconds: local replica vs cross-region replica
LOCAL_RTT_MS = (0.4, 1.5)        # local-replica latency range
INTER_REGION_RTT_MS = (22, 35)   # ap-south-1 (Mumbai) ↔ Hyderabad region
LOCAL_COMMIT_MS = (0.1, 0.4)     # local fsync / commit latency
TRIALS = 5000

def latency(lo, hi):
    return random.uniform(lo, hi)

def linearisable_write():
    """Local commit + wait for ack from cross-region replica."""
    local = latency(*LOCAL_COMMIT_MS)
    cross = latency(*INTER_REGION_RTT_MS)   # round-trip to remote replica
    return local + cross                    # max of two would be more accurate; this is W=2 with sequential dispatch

def causal_write():
    """Local commit + dependency-vector update; replication is async."""
    local = latency(*LOCAL_COMMIT_MS)
    vector_update = latency(*LOCAL_RTT_MS)  # update vector clock (in-memory)
    return local + vector_update

def eventual_write():
    """Local commit only. Replication piggybacks on the next gossip round."""
    return latency(*LOCAL_COMMIT_MS)

print(f"{'contract':>14} {'p50 ms':>8} {'p99 ms':>8} {'p99.9 ms':>10}")
for label, fn in [
    ('linearisable', linearisable_write),
    ('causal',       causal_write),
    ('eventual',     eventual_write),
]:
    samples = sorted(fn() for _ in range(TRIALS))
    p50  = samples[TRIALS // 2]
    p99  = samples[int(TRIALS * 0.99)]
    p999 = samples[int(TRIALS * 0.999)]
    print(f"{label:>14} {p50:>8.2f} {p99:>8.2f} {p999:>10.2f}")

Sample output:

      contract   p50 ms   p99 ms   p99.9 ms
  linearisable    28.71    34.69      34.95
        causal     1.26     1.86       1.94
      eventual     0.25     0.40       0.40

The linearisable p50 of 28.7 ms is dominated by the cross-region RTT — every write waits for the remote replica's ack. The causal p50 of 1.3 ms is the local commit plus the local in-memory vector-clock update; replication runs asynchronously in the background and the writer does not wait for it. The eventual p50 of 0.25 ms is just the local commit — the replication is entirely deferred to the next gossip round, which may be 200 ms later. The p99/p99.9 spreads show that the linearisable contract has the longest tail because any cross-region jitter directly hits the writer's latency; causal and eventual are tightly bounded because they never block on the network. Two orders of magnitude separate linearisable from eventual — that is the price of the contract you are buying.

The simulation is illustrative — a real W=2 quorum across regions would dispatch in parallel and take max(local, cross) rather than local + cross, but the conclusion is unchanged: cross-region quorum on every write is roughly 25 ms, local-only is roughly 0.25 ms, and that 100× ratio is the wall.
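For completeness, that parallel-dispatch correction is a one-line change — an illustrative variant reusing the constants from the script above, not a change to the published numbers:

```python
# Parallel-dispatch variant of linearisable_write: both acks are requested at
# once, so the write waits for the slower of the two, not their sum.
import random
random.seed(7)

LOCAL_COMMIT_MS = (0.1, 0.4)
INTER_REGION_RTT_MS = (22, 35)

def linearisable_write_parallel():
    local = random.uniform(*LOCAL_COMMIT_MS)
    cross = random.uniform(*INTER_REGION_RTT_MS)
    return max(local, cross)   # W=2 with both acks in flight simultaneously

samples = sorted(linearisable_write_parallel() for _ in range(5000))
print(f"parallel-dispatch p50: {samples[2500]:.2f} ms")  # ~ one inter-region RTT
```

The p50 barely moves, because the cross-region RTT dominates the local commit either way — the wall is the RTT itself, not the dispatch strategy.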

A war story: BharatBazaar's "follower count is decreasing" outage

BharatBazaar — a fictional e-commerce platform — runs the seller-profile service across 32 nodes in two regions. The service stores follower_count, rating, and last_active_at per seller, and serves both the seller's dashboard and the public profile page. One Friday during a flash sale, sellers started reporting that their follower_count was decreasing between page refreshes — Riya, a kurta seller in Jaipur, was watching her followers go from 1247 to 1239 to 1244 to 1237 across four refreshes within ten seconds.

The team — Vikrant, Aditi, and Jishant — traced it within an hour. The follower-count writes were going to a single primary in the Mumbai region, which then async-replicated to a Hyderabad replica with a typical 200–800 ms lag depending on cluster load. The public profile page was served by the closest replica (read from Hyderabad if the user was in southern India, from Mumbai otherwise). Riya was browsing on her phone at a wedding in Coimbatore, where her requests load-balanced randomly across both replicas — so her refreshes were alternating between a slightly newer view (Mumbai) and a slightly older view (Hyderabad). The "decrease" was not a decrease; it was Riya's session bouncing between replicas at different replication-lag points.

The fix was to add monotonic-read consistency at the edge: the load balancer pinned a session cookie to a specific replica for the duration of a session, so a single user's reads never moved backward in time. The cost was a rebalancing edge case during replica restarts and a 3% increase in single-replica load skew. The benefit was that every reported "follower count is decreasing" ticket dropped to zero overnight. Why monotonic-read fixes this without paying for stronger consistency: the writer's physical lag is unchanged — Hyderabad still trails Mumbai by 200–800 ms. But because Riya is now stuck to one replica per session, every read she sees is from a single trajectory of replica state, and that trajectory is monotonically increasing. She still sees stale data; she just never sees backward jumps. The bug class "value goes down" disappears for the cost of session affinity.
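The load-balancer side of that fix fits in a few lines — a hypothetical sketch of hash-based session pinning (replica names and the hashing choice are mine; a production LB would use a session cookie plus a routing table that survives replica restarts):

```python
# session_pinning.py — monotonic-read via replica stickiness (hypothetical LB logic)
import hashlib

REPLICAS = ['mumbai-1', 'hyderabad-1']

def replica_for(session_id: str) -> str:
    """Deterministically pin a session to one replica, so every read in the
    session follows a single, monotonically advancing replica state."""
    h = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return REPLICAS[h % len(REPLICAS)]

# Same session, same replica, every time — staleness survives, backward jumps do not.
assert replica_for('riya-session-42') == replica_for('riya-session-42')
print(replica_for('riya-session-42'))
```

The guarantee is per-session, not per-user: a second device opens a second session and may land on the other replica, which is exactly the monotonic-read contract and nothing stronger.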

Aditi's postmortem: "We assumed that 'eventually consistent' meant 'sometimes a bit stale, no big deal'. What we missed is that 'eventually' lets two replicas drift in opposite directions from the user's view if the load balancer flips between them. The fix was not to make the system more consistent — it was to make the user's path through the system consistent."

The deeper lesson for Part 12: choosing the consistency model is not a database-wide setting. It is per-operation, per-flow, per-call-site. The seller's edit-profile flow needs read-your-writes (the seller must see their edit reflected on next refresh). The public profile page needs monotonic-read (no backward jumps). The leaderboard needs causal (replies-to-comments must follow the comment). The checkout total needs linearisable (the inventory is single-master). Each of those is a different point on the lattice, served by the same replicated store with different read/write paths.

Common confusions

  • "Eventual consistency means the data is eventually correct." Eventual consistency means all replicas eventually converge to the same value, not that the value matches what the user wanted. If two clients write x=5 and x=7 concurrently to different replicas, eventual consistency tells you the replicas will agree on a final value, but it does not tell you whether that value is 5, 7, or some merge — that is the conflict-resolution policy on top.
  • "Strong consistency and linearisability are the same." "Strong consistency" is marketing — it means whatever the vendor wants. Linearisability is a precise property: every operation appears to take effect at a single point between its invocation and response, and the order is consistent with real-time. Many systems advertised as "strongly consistent" are merely sequential or even read-after-write.
  • "CAP forces a binary choice." CAP says: under a partition, you must give up one of consistency or availability. The PACELC extension (Abadi 2012) says: even without a partition, you give up one of latency or consistency. PACELC is the one to design with — partitions are rare; the latency-consistency trade is paid every operation.
  • "You can recover linearisability with retries and quorum reads on top of an eventually-consistent store." Retries amplify load, and quorum reads on top of an async-replicated store still see stale data unless writes also went through quorum. The contract is end-to-end; bolting it on at the read path leaks the gap somewhere else (typically into write-write conflicts).
  • "Causal consistency requires vector clocks." Causal consistency requires a way to track happens-before. Vector clocks are one implementation; explicit dependency lists (COPS), hybrid logical clocks, and even per-session causal tokens all work. Vector clocks have the worst metadata cost; HLCs and session tokens are cheaper at the cost of additional assumptions.
  • "Picking a weaker model means worse user experience." Often the opposite — a 25 ms linearisable read across regions feels slower than a 1 ms causal read for almost every user-facing flow, and the staleness window of 200 ms is invisible to a human. The bug-class trade is real, but "weaker" does not mean "worse user experience"; it usually means "faster, with the application taking responsibility for the cases where staleness matters".

Going deeper

Bailis & Ghodsi's "Eventual Consistency Today" — what eventual actually buys

Bailis and Ghodsi (CACM 2013) is the survey to read for the modern view of eventual consistency. They introduce Highly Available Transactions (HAT) and prove which transactional guarantees are achievable without coordinating across replicas — read-your-writes, monotonic-read, monotonic-write, and writes-follow-reads are all achievable; snapshot isolation, ANSI-SQL serializable, and strong session guarantees are not. The takeaway: there is a sharp boundary at coordination-free, and Bailis-Ghodsi tells you exactly which side any given guarantee is on. Read sections 3 and 4 for the achievability matrix.

PACELC and what production systems actually pick

Abadi's PACELC (IEEE Computer 2012) extends CAP with the else-clause: under a partition, you trade A for C; else, you trade L for C. The 2x2 grid is: PA-EL (Dynamo, Cassandra, Riak — partition: stay available; else: stay fast), PC-EC (BigTable, HBase, and Spanner — partition: refuse writes to maintain consistency; else: pay latency for consistency), PA-EC (rare — give up consistency only under partition, pay latency for it otherwise), and PC-EL (rarely useful in practice). Most modern production picks are PA-EL with a per-request override knob — DynamoDB lets you set ConsistentRead=true per call, paying the latency only when you need it.
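The per-request knob pattern can be sketched as a toy store — a hypothetical API for illustration, not DynamoDB's actual client:

```python
# A toy replicated store exposing a per-read L/C knob (hypothetical API).
class ReplicatedStore:
    def __init__(self):
        self.primary = {}   # authoritative copy
        self.replica = {}   # async-lagged copy, updated later by replication

    def put(self, key, value):
        self.primary[key] = value   # replication to self.replica is deferred

    def get(self, key, consistent=False):
        # consistent=True pays the latency side of PACELC's else-clause;
        # consistent=False reads the possibly-stale local replica cheaply.
        src = self.primary if consistent else self.replica
        return src.get(key)

store = ReplicatedStore()
store.put('score', 187)                     # replica has not caught up yet
print(store.get('score', consistent=True))  # 187  — reads the primary
print(store.get('score'))                   # None — stale replica, cheap read
```

The point of the knob is that the contract is chosen per call site, not per database — exactly the per-operation framing the war story above lands on.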

The role of CRDTs — why eventual without conflict

Conflict-free Replicated Data Types are how eventual consistency becomes safe. The trick is to restrict the write API to operations whose merge is commutative, associative, and idempotent — max and set-union qualify directly; raw addition commutes but is not idempotent, which is why counter CRDTs keep per-replica partial counts and merge them with max; plain assignment qualifies on none of the three. With a CRDT, two replicas can apply concurrent writes in any order and converge to the same final state without conflict resolution. Shapiro et al. (SSS 2011) classified the CRDT family. The cost is that not every data type fits: LWW-Register loses concurrent writes by design; OR-Set carries metadata proportional to concurrent removes; G-Counter cannot decrement. Picking eventual + CRDT is choosing your conflict policy before the conflict happens, by restricting the write vocabulary.
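A grow-only counter makes the restriction concrete — a minimal G-Counter sketch (replica ids are hypothetical):

```python
# g_counter.py — grow-only counter CRDT (minimal illustrative sketch)
class GCounter:
    def __init__(self, replica_id):
        self.id = replica_id
        self.counts = {}   # per-replica partial counts

    def increment(self, n=1):
        self.counts[self.id] = self.counts.get(self.id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        """Pointwise max — commutative, associative, and idempotent,
        so any merge order (or repeated merges) converges."""
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

a, b = GCounter('mumbai'), GCounter('hyderabad')
a.increment(3); b.increment(2)   # concurrent writes, zero coordination
a.merge(b); b.merge(a)
print(a.value(), b.value())      # 5 5 — both replicas converge
```

Note what the API forbids: there is no `set()` and no `decrement()` — the write vocabulary is restricted until every pair of concurrent writes merges cleanly.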

Linearisable subsets — using consensus only where needed

A common pattern in production is to make most operations eventually consistent, and route the few that need linearisability through a consensus-backed metadata store. Spanner does this with TrueTime; Calvin does this with a deterministic ordering layer; DynamoDB Global Tables do this by routing writes to a per-key primary region. The cost model: 90% of operations pay 1–5 ms (eventual or causal local), 10% pay 25–80 ms (linearisable cross-region). The aggregate median stays inside budget because the linearisable subset is small — only the tail beyond p90 sees the cross-region cost, and shrinking the subset pushes that tail further out.
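The 90/10 blend can be checked in a few lines — an illustrative mix using the latency ranges from the text:

```python
# Mixed workload: 90% of ops take the local path, 10% take the linearisable path.
import random
random.seed(11)

def op_latency():
    if random.random() < 0.10:
        return random.uniform(25, 80)   # linearisable cross-region subset
    return random.uniform(1, 5)         # causal/eventual local commit

samples = sorted(op_latency() for _ in range(100_000))
p50 = samples[len(samples) // 2]
p99 = samples[int(len(samples) * 0.99)]
print(f"p50={p50:.1f} ms  p99={p99:.1f} ms")
# The median stays on the fast path; how far the tail reaches depends on
# how small the linearisable subset is kept.
```

Shrinking the slow fraction from 10% to under 1% moves even the p99 onto the fast path — which is the whole argument for routing only the operations that truly need linearisability through consensus.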

Why the lattice is not a total order

The diagram in this article shows a vertical stack, but the consistency lattice is genuinely a lattice — there are guarantees that are incomparable. Causal and read-your-writes are not on the same line: a system can be causal-but-not-RYW (different sessions of the same user see different states) or RYW-but-not-causal (each session sees its own writes but causal links across sessions are broken). Viotti & Vukolić's 2016 survey maps dozens of named guarantees; only a handful are commonly named in production. The implication: when somebody says "this database is consistent", ask which one they mean.

Reproduce this on your laptop

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate   # optional — the script uses only the standard library
python3 consistency_cost.py

The simulation runs in well under a second. To explore the multi-region cost honestly, change INTER_REGION_RTT_MS to your own measurements (mtr or ping between two of your AWS regions), then re-run — the linearisable p50 should track the measured inter-region RTT to within about 10%.

Where this leads next

Part 11's gossip arc gave you tools to spread state across thousands of nodes at logarithmic cost — but gossip's contract is the weakest one on the lattice. Part 12 walks the lattice deliberately: it formalises linearisability (the next chapter), then sequential consistency, then causal, then the session guarantees, all the way down to eventual. Each chapter gives you the precise property, the cost to implement it, the bug class it prevents, and the production systems that picked it.

The chapter after that is the CAP theorem in its formal Gilbert-Lynch (2002) form, then PACELC in Abadi's (2012) form. Both are tools for picking a level on the lattice given the partition behaviour and latency budget you actually have. By the end of Part 12, you should be able to look at a service's read/write path and say: "this is causal+RYW, with monotonic-read at the edge, and falls back to eventual under partition" — and know exactly what each of those phrases buys and costs.

Beyond Part 12, CRDTs (Part 13) are the implementation that makes eventual consistency safe by construction. Distributed transactions (Part 14) are the implementation that pays the linearisable cost when nothing else will do. The wall this chapter names is what forces those two parts to exist.

References

  • Abadi, D. — "Consistency Tradeoffs in Modern Distributed Database System Design" (IEEE Computer 2012). The PACELC paper.
  • Bailis, P., Ghodsi, A. — "Eventual Consistency Today: Limitations, Extensions, and Beyond" (CACM 2013). The modern survey of what eventual consistency really buys.
  • Brewer, E. — "Towards Robust Distributed Systems" (PODC 2000). The CAP keynote.
  • Gilbert, S., Lynch, N. — "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" (SIGACT 2002). CAP formalised.
  • Vogels, W. — "Eventually Consistent" (CACM 2009). Amazon's framing of session guarantees and the role of eventual in Dynamo.
  • Viotti, P., Vukolić, M. — "Consistency in Non-Transactional Distributed Storage Systems" (ACM CSUR 2016). The 20-guarantee lattice survey.
  • Lloyd, W., Freedman, M., Kaminsky, M., Andersen, D. — "Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS" (SOSP 2011).
  • Shapiro, M. et al. — "Conflict-free Replicated Data Types" (SSS 2011).
  • Convergence-time analysis — the gossip-side bound that frames why eventual is so cheap.
  • The append-only log: simplest store — the substrate beneath every consistency model.