Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

In short

The bumper-sticker CAP — "pick two of consistency, availability, partition tolerance" — is wrong. The honest statement is narrower: during a network partition, a distributed system can guarantee at most one of consistency or availability — and partitions are not optional, so the only real choice is how your code behaves while one is happening. PACELC adds the axis everyone feels day-to-day: even when there is no partition, every system trades Latency against Consistency because strong consistency costs a quorum round-trip. The discipline this chapter installs is per-operation thinking — name what each query gives up under partition, and what it pays in latency when there is none.

CAP is the most-quoted and most-misquoted theorem in distributed systems. The folk version sells it as a buffet — pick two of three — and it has misled a generation of engineers into thinking partition tolerance is something you opt into. It is not. This chapter rebuilds CAP from the Gilbert-Lynch proof, walks through PACELC, and uses a UPI payments app to show why most real architectures are not pure CP or AP but a mixture chosen per-operation.

What CAP actually says

The CAP theorem was a conjecture by Eric Brewer in 2000, proved in a 2002 paper by Seth Gilbert and Nancy Lynch. The three letters in the proof have precise meanings, and they are not the meanings most engineers use.

Consistency (C) in CAP means linearizability — every read sees the result of the most recently completed write, as if the database were a single register that processes one operation at a time. This is the strictest meaning of "consistent" in the distributed systems literature; it is not the C in ACID, which is about preserving application-level invariants.

Availability (A) means every non-failing node responds to every request with a non-error response in finite time. Not "highly available" in the operational sense (99.99% uptime); a stronger, mathematical sense — no node refuses any request, ever.

Partition tolerance (P) means the system continues to operate when the network drops messages between nodes. Not "the system handles partitions gracefully"; the literal property that some messages between honest nodes can be lost without the system crashing.

The theorem is then: no system can guarantee all three of C, A, and P at the same time. Equivalently — and this is the form that matters in practice — when a partition occurs, a system must give up either C or A.

The classic CAP triangle, with the C-A edge blocked because P is non-negotiableA triangle with three vertices labelled C (Consistency) at top, A (Availability) at bottom right, and P (Partition tolerance) at bottom left. Three edges connect the vertices: the C-P edge on the left labelled "CP systems: Spanner, etcd"; the A-P edge on the right labelled "AP systems: Cassandra, DynamoDB classic"; the C-A edge at the bottom is drawn with a heavy red X across it and labelled "CA only — not a distributed system". A note on the side explains that you cannot opt out of P in any real distributed system, so the meaningful choice is along which side of the triangle you sit, CP or AP.CAP triangle: P is forced on you, so the real choice is the C-P edge or the A-P edgeCConsistency(linearizability)AAvailability(every node always responds)PPartition tolerance(survives lost messages)CP edgestays consistent,refuses writes during partitionSpanner, CockroachDB, etcdAP edgestays available,accepts divergent writesCassandra, DynamoDB, RiakCA only — single node, not a distributed system(blocked: you cannot opt out of P in a real cluster)
The folk drawing has all three edges as live options. The honest drawing crosses out the bottom edge: pure CA means a single-node system, where there is no network to partition. Real distributed systems pick one of the two slanted edges — CP (consistent, refuses writes when partitioned) or AP (available, accepts divergent writes when partitioned).

Why the bottom edge is not a real option: a "CA system" would be one that has both consistency and availability, and gives up partition tolerance. But "giving up partition tolerance" means assuming the network never partitions — which means you have one node, or you crash the entire cluster the instant any link flickers. Single-node Postgres is CA in this sense; so is your laptop's SQLite. Neither is what people mean by a distributed database. The instant you have two nodes that need to agree, the network can partition them, and CAP fires.

Why "pick two of three" is wrong

The phrase you have heard a hundred times — "you can have two of CAP, but not all three" — frames CAP as a buffet. It implies that some systems pick C and A, others pick C and P, others pick A and P. This is wrong on its own terms.

The theorem says no system can have all three. It does not say every system must have exactly two. A buggy system can have zero. More importantly, the framing pretends P is a design choice, when it is not. Partitions happen whether or not you plan for them. The only design choice is the response to a partition.

Brewer himself wrote a twelve-years-later reflection in 2012 clarifying that the original "2 of 3" was a simplification that "has been seen as misleading", and that in real systems "the choice between C and A can occur many times within the same system at very fine granularity — not only can subsystems make different choices, but the choice can change according to the operation or even the specific data or user involved." The right mental model is per-operation, not per-system.

Two concrete misuses to retire from your vocabulary:

"NoSQL is AP, SQL is CP." False on every axis. MongoDB since v4 defaults to primary-only writes with majority read concern, which is CP. Cassandra can be configured per-query with consistency levels from ONE (AP-ish) to ALL (CP-ish). MySQL with async replication is closer to AP (you can read stale data from a replica). PostgreSQL with synchronous replication and a single writeable primary is CP. The classification is per-system, per-config, per-operation; "NoSQL vs SQL" is a category error.

"MongoDB chose A in CAP." That was a marketing line from circa 2010, and it has been overtaken by reality. MongoDB v4+ uses Raft for replica-set consensus; writes with writeConcern: majority and reads with readConcern: majority give you linearizable behaviour at the cost of write availability during a partition that loses the majority. It is, by any honest reading, CP for its default workload. You can dial it down per-query if you want, but the default is no longer "available at all costs".

"CAP is dead." Not dead, just often misapplied. The theorem is a worst-case statement about a single failure mode (network partition). Most production systems most of the time are not partitioned, and the interesting trade-off they face minute-to-minute is the one PACELC names: latency vs consistency. CAP is necessary, not sufficient.

PACELC: the axis you actually feel every day

Daniel Abadi noticed in a 2010 blog post and 2012 IEEE Computer paper that CAP misses the day-to-day trade-off. If there is a Partition, you choose A or C. Else (no partition), you choose Latency or Consistency. Hence PACELC: PA/EL, PA/EC, PC/EL, PC/EC.

The reason latency and consistency trade off even without partitions is structural. To guarantee a read sees the latest write, the read must either go to the same node that took the write (a single primary, which then becomes a bottleneck) or to a quorum of nodes that collectively agree on the latest value (a Paxos / Raft round, which costs at least one network RTT, often more). Either way, strong consistency costs network round-trips. Weak consistency lets you serve a read from any local replica, possibly stale, in a few hundred microseconds.

PACELC extends CAP with the latency-versus-consistency trade-off in the no-partition caseTwo side-by-side panels. Left panel labelled "Partition (P)" shows a fork with two branches: PA (choose Availability) and PC (choose Consistency). Right panel labelled "Else — no partition (E)" shows a fork with two branches: EL (choose low Latency) and EC (choose Consistency). A combined classification table below shows four cells: PA/EL with examples Cassandra default, DynamoDB classic; PA/EC with example Riak in some modes; PC/EL with examples MongoDB v4 default, CockroachDB follower reads; PC/EC with examples Spanner, etcd, ZooKeeper, FoundationDB.PACELC: when partitioned, A vs C; when healthy, L vs CIf Partition (P)network split between nodesPAstay AvailablePCstay ConsistentElse — no Partition (E)healthy network, normal operationELlow LatencyECstrong ConsistencyFour combined classes — pick one rowClassDuring partitionNo partitionReal-world examplesPA/ELstay availablelow latencyCassandra default, DynamoDB classic, RiakPA/ECstay availablestrong readsRare — some Riak strong-consistency modesPC/ELrefuse writeslow-latency readsMongoDB v4 default, CockroachDB w/ follower readsPC/ECrefuse writesstrong reads alwaysSpanner, etcd, ZooKeeper, FoundationDB, FaunaDB
PACELC's four classes. PC/EC systems pay latency for strong consistency on every operation — Spanner's commit-wait, etcd's Raft round. PA/EL systems serve reads from the nearest replica and accept eventual divergence. PC/EL is the most popular modern compromise: strict during the (rare) partition, but lean on follower reads or read-your-writes consistency for fast everyday reads.

Why PA/EC is rare: it would mean "stay available during partition (so accept conflicting writes) but somehow give strong consistency in the normal case." That requires you to merge divergent writes when the partition heals AND give linearizable reads when it does not — the merge logic forces you to weaken consistency in the normal case too, because a read might see a value that has not yet been merged with a concurrent partitioned write. Most systems that try this end up at PA/EL in practice. Riak's strong-consistency mode is the closest production example, and it is rarely used.

The map of real systems

The cleanest way to organise actual production databases is two axes: default consistency (eventual ↔ strong) and availability behaviour during partition (CP — sacrifices A vs AP — sacrifices C). Plot the systems and you see clusters.

Real production systems plotted on consistency vs partition behaviour axesA two-dimensional plot. The x-axis runs from "Eventual consistency" on the left to "Strong consistency" on the right. The y-axis runs from "AP — stays available during partition" at the bottom to "CP — sacrifices availability for consistency" at the top. The plot area is divided into four quadrants. Top-right quadrant (CP, strong) contains Spanner, CockroachDB, FoundationDB, etcd, ZooKeeper, FaunaDB, MongoDB v4+ (default). Bottom-left quadrant (AP, eventual) contains DynamoDB classic, Cassandra default, Riak. Top-left quadrant (CP, eventual) is mostly empty. Bottom-right quadrant (AP, strong) is rare. A diagonal cluster in the middle labelled "Tunable per-query" lists Cassandra (per-query CL), DynamoDB global tables, Cosmos DB (5 levels), CockroachDB (AS OF SYSTEM TIME). Below the plot a note reads "MySQL/PostgreSQL with replication: depends on configuration — sync replication ≈ CP, async ≈ AP-ish".Real-world map: consistency × partition behaviourEventualStrong / LinearizableDefault consistency →CP — drops AAP — drops CPartition behaviour →CP / strong (the busy quadrant)SpannerCockroachDBFoundationDBetcdZooKeeperFaunaDBMongoDB v4+ (default)AP / eventualDynamoDB classicCassandra (default CL=ONE)RiakCP / eventual(mostly empty — strange combination)AP / strong (rare — see PA/EC)tries to give both, hedges with mergingTunable per-query / per-tableCassandra (CL per stmt)DynamoDB global tablesCosmos DB (5 levels)CockroachDB (AS OF SYSTEM TIME)MySQL/PostgreSQL with replication: depends on config — sync ≈ CP, async ≈ AP-ish, single primary always serialisable on the primary itself
Most production OLTP databases live in the upper-right (CP/strong) quadrant. Most production analytics-and-feed databases live in the lower-left (AP/eventual) quadrant. The diagonal in the middle holds the increasingly common tunable systems, where the same database can be used in either mode by changing one knob per query. The lower-right (AP/strong) is the awkward quadrant — rare in practice for the structural reason from the PA/EC discussion above.

A few of the entries deserve commentary because their classification is contested:

MongoDB. Pre-v3.2 it was AP-leaning (the old "eventually consistent secondaries, no Raft"). v3.2 introduced Raft-based replica sets; v3.6 added causal-consistency sessions; v4.0 added multi-document ACID transactions; v4.2 extended them across shards. In v4+, the default writeConcern: majority plus readConcern: majority on a replica set is CP — a partition that loses majority will refuse writes. You can still configure it down for weaker guarantees, but the default is CP for primary writes.

DynamoDB. Classic single-region DynamoDB is AP — eventual reads by default, strongly-consistent reads available as an option (which costs 2× the read capacity unit). Global tables (multi-region) are AP because of the cross-region merge. Recent transactional APIs (TransactWriteItems) give CP within a region for the items in the transaction. So DynamoDB is plural, not singular.

Cassandra. Defaults are AP/EL with consistency level = ONE for reads and writes. Set CL = QUORUM on both and it behaves like a CP system on top of a Dynamo-style ring; set CL = ALL and it is even stricter (and even less available). The classification is per-query, not per-cluster.

MySQL / PostgreSQL with replication. A single primary with synchronous replication to one replica is approximately CP — writes block if the replica is unreachable. Asynchronous replication is approximately AP — writes succeed even if the replica is gone, and a failover can lose the unreplicated tail. Group replication / multi-primary modes complicate this further. The honest answer is "it depends on what you set in my.cnf or postgresql.conf."

The pragmatic relaxations no theorem captures

Pure CP / pure AP is mostly fiction in production. Real systems layer in pragmatic guarantees that fall between strict linearizability and pure eventual consistency. Knowing these by name is half of senior distributed-systems engineering.

Bounded staleness. A read sees data that is at most T seconds old. CockroachDB's AS OF SYSTEM TIME '-5s' and DynamoDB's eventual reads are flavours of this. For most user-facing reads — a Chirpline timeline, a product page, a transaction history — five seconds of staleness is invisible to humans and saves you a round-trip to the leaseholder.

Read-your-writes. A session always sees its own previous writes, even if other sessions might see slightly older state. Implemented by routing the same session's reads to the replica that just took the write, or by stamping reads with the session's last write timestamp. Most apps need this for correctness ("after I post a tweet, my own page must show it") but do not need full linearizability.

Causal consistency. If write A causally precedes write B (on any session, anywhere), all readers that see B also see A. Strictly weaker than linearizability, much cheaper, sufficient for most messaging systems and collaborative editors. MongoDB sessions provide this.

Hinted handoff and read-repair. AP systems heal: if a node was down during a write, a hint is left for it on a peer; when it comes back, the hint is replayed. Reads that detect divergence between replicas trigger background repair. The system is eventually consistent in a real, observable sense — usually within seconds.

Pat Helland's Memories, Guesses, and Apologies frames the whole thing brilliantly: most distributed systems give you guesses (your local replica's current best knowledge) and an apology mechanism for when the guess turns out wrong. CAP and PACELC tell you what kind of guess and what kind of apology you have signed up for.

The decision framework

After all the theory, the actionable question reduces to:

During a five-minute network blip between my regions, do my users prefer (a) the app refuses writes, or (b) the app accepts writes and might show inconsistent state?

For most consumer apps, the honest answer is (b). A Chitchat message sent during a partition between two cell towers should still go through, even if the recipient sees it slightly out of order. A BhojanBox order placed while the menu replica was momentarily stale should still succeed; you can refund or re-quote rather than refuse. Hard CP for these would mean a worse user experience for an event your users do not even perceive as a failure.

For ledgers and identity, the answer is (a). A bank cannot debit an account if it cannot also credit the counterparty atomically; an authorisation system cannot grant access if it cannot verify revocation. CP, with the resulting refusals during partition, is the only honest choice.

Most architectures are both, layered. The next worked example shows what that looks like.

**Worked example: a UPI payments app**

You are designing the backend for an Indian UPI payments app — think DigiPaisa, Google Pay, DhanWallet. The app does many things: holds the ledger, shows transaction history, runs search, displays balance, sends notifications, runs fraud detection. Each surface has a different CAP / PACELC profile, and a single-store design will either be too strict (slow everywhere) or too loose (incorrect on the ledger).

The ledger — the actual debit and credit. This is the irreducible CP/EC core. When you tap Pay ₹500 to Ravi, two things must happen atomically: ₹500 leaves your account, ₹500 arrives in Ravi's account. If a partition splits the cluster mid-transaction, the system must refuse the write rather than risk a debit-without-credit (or, almost worse, a credit-without-debit that you would discover only on monthly reconciliation). Every regulator, NPCI included, audits for this. Pick a Spanner-class store: Spanner if you are willing to pay GCP, CockroachDB or YugabyteDB if you want to self-host, FoundationDB if you want the bare KV with transactions and roll your own SQL on top. Default to SERIALIZABLE isolation; do not be tempted by READ COMMITTED here.

Transaction history view. Users open the app to see "what did I pay last week". This is read-heavy, append-only from the ledger's perspective, and a few hundred milliseconds of staleness is fine — if you just paid Ravi five seconds ago, the row appearing in your history a second later is invisible. Pick a Cassandra-class store, populated by an asynchronous CDC stream from the ledger. PA/EL, eventual consistency, served from the nearest region's replica for sub-50 ms p99 reads. If a partition cuts off Mumbai from Bengaluru, both regions keep serving history reads from their local copy.

Balance display. This is the trickiest. The user wants to see their current balance, but "current" can mean different things. Strict CP would route every balance query to the ledger, paying a quorum read every time the user opens the app — expensive, and unnecessary. The pragmatic answer is bounded staleness with read-your-writes: serve from a cache or follower replica that is at most a few seconds behind the ledger, and if this session has written a transaction in the last few seconds, route the read to a replica that has applied it. CockroachDB's follower reads with stickiness, or a Redis cache populated by ledger CDC with a "last write wins for this session" override. This gives the user their actual balance after their own payments, and a slightly stale (but converging) balance otherwise.

Search across counterparties. Type "Ravi" to find Ravi's UPI ID. This is approximate, eventual, and tolerates significant staleness. Pick Elasticsearch / OpenSearch, fed asynchronously from the user table. Pure AP/EL.

Fraud detection. Real-time scoring on each transaction. Needs some consistency (you do not want to skip a check because the model's feature store was stale and missed a flagged account), but not strict linearizability. Bounded staleness on the feature store (a few seconds), with a fallback rule that when the feature store is unreachable, the transaction is held for manual review. PC for the blocking path (the actual hold), EL for the feature read.

Notifications. "Ravi received ₹500 from you." Pure AP/EL — the message can arrive a second late, and if it never arrives, the user can pull-to-refresh the history. A queue (Kafka, SQS) with at-least-once delivery is the right tool; you handle duplicates with idempotency keys.

The picture, then. Your single "UPI app" is, behind the curtain, five or six different stores, each chosen for the consistency / availability / latency profile of one operation. Spanner-class for the ledger, Cassandra-class for history, a Redis or follower-read layer for balance, OpenSearch for search, a feature store for fraud, a queue for notifications. The CAP / PACELC trade-off is made per-operation, not per-app.

A UPI payments app, decomposed by per-operation CAP profileA central app icon labelled "UPI app" with arrows to six backing stores, each labelled with operation and CAP/PACELC profile. Top arrow goes to "Ledger — Spanner / CockroachDB" labelled "PC/EC, serialisable". Upper-right to "Balance — follower reads + Redis" labelled "PC during partition, bounded staleness otherwise". Right to "History — Cassandra" labelled "PA/EL". Lower-right to "Search — OpenSearch" labelled "PA/EL". Lower-left to "Fraud — feature store" labelled "PC for blocking path, EL for reads". Left to "Notifications — Kafka / SQS" labelled "PA/EL, at-least-once".UPI app: per-operation CAP / PACELC profileUPI appphone clientLedger — Spanner / CockroachDBPC/EC · serialisable · debit+credit atomicBalance — follower reads + RedisPC · bounded staleness · RYWHistory — CassandraPA/EL · CDC from ledgerSearch — OpenSearchPA/EL · async indexingFraud — feature storePC blocking · EL feature readsNotifications — Kafka / SQSPA/EL · at-least-once · idempotent
One app, six stores, six different CAP / PACELC profiles. The ledger pays for strict consistency because money requires it; history and notifications pay for low latency and availability because users perceive lag worse than they perceive a stale-by-one-second balance. This is what "designing for CAP" actually looks like in production — not picking one corner of the triangle, but mapping operations to corners.

The pattern generalises: find the smallest core of operations that genuinely require strict consistency, isolate them in a CP store, and let everything else live in cheaper AP-or-tunable stores fed from that core. The ledger is small (a few writes per second per user). The fan-out — history, search, notifications, analytics — is huge. Letting the fan-out run on a CP store would be paying linearizability tax on every query when the user does not need it; letting the ledger run on an AP store would be lying to your users about their money.

A small partition simulator you can run

Theory absorbs better when you watch it bend a real program. Open a Python file and type this in — do not paste, type it. It is a 40-line simulator of a two-replica key-value store with a configurable partition switch. You will run it three times: as a CP system, as an AP system, then with the partition healed, and watch the trade-off in your terminal.

import time, threading, random

class Replica:
    def __init__(self, name):
        self.name = name; self.store = {}; self.last_write_ts = 0

    def write(self, k, v, ts):
        if ts > self.last_write_ts:
            self.store[k] = v; self.last_write_ts = ts

    def read(self, k):
        return self.store.get(k, None), self.last_write_ts

class Cluster:
    def __init__(self, mode):
        self.mode = mode             # "CP" or "AP"
        self.r1 = Replica("mumbai")
        self.r2 = Replica("bengaluru")
        self.partitioned = False

    def client_write(self, k, v):
        ts = time.time_ns()
        self.r1.write(k, v, ts)
        if self.partitioned:
            if self.mode == "CP":
                raise RuntimeError("CP refuses write — quorum unreachable")
            return f"AP accepted on r1, r2 will heal later"
        self.r2.write(k, v, ts)
        return f"replicated to both at ts={ts}"

    def client_read(self, k, replica):
        r = self.r1 if replica == "mumbai" else self.r2
        v, ts = r.read(k)
        return f"{replica} sees {k}={v} (ts={ts})"

for mode in ("CP", "AP"):
    c = Cluster(mode); print(f"\n--- {mode} cluster ---")
    print(c.client_write("balance:dipti", "5000"))
    print(c.client_read("balance:dipti", "bengaluru"))
    c.partitioned = True; print("(network partition between mumbai and bengaluru)")
    try: print(c.client_write("balance:dipti", "4500"))
    except RuntimeError as e: print(f"refused: {e}")
    print(c.client_read("balance:dipti", "bengaluru"))

Run it. Output on the author's machine:

$ python cap_demo.py

--- CP cluster ---
replicated to both at ts=1745560123456789012
bengaluru sees balance:dipti=5000 (ts=1745560123456789012)
(network partition between mumbai and bengaluru)
refused: CP refuses write — quorum unreachable
bengaluru sees balance:dipti=5000 (ts=1745560123456789012)

--- AP cluster ---
replicated to both at ts=1745560123457001234
bengaluru sees balance:dipti=5000 (ts=1745560123457001234)
(network partition between mumbai and bengaluru)
AP accepted on r1, r2 will heal later
bengaluru sees balance:dipti=5000 (ts=1745560123457001234)

Cluster.client_write. The CP branch raises during a partition because a quorum write needs both replicas reachable; the AP branch lets the write land on r1 only and tags the conflict for later. The same call site, two completely different failure semantics — that is what "choosing CP vs AP" means in code.

client_read("balance:dipti", "bengaluru") after the AP write. Bengaluru still shows the pre-partition value of 5000, even though Mumbai has accepted 4500. This is the staleness window: the AP system stays available, but a read in the partitioned region sees a value the system itself knows is out of date. Real systems shrink this window with hinted handoff and read-repair, but they cannot eliminate it without becoming CP.

Why this is not "just a bug": both behaviours are correct for their declared mode. The CP cluster honours its contract by refusing to lie about quorum; the AP cluster honours its contract by staying responsive and accepting that two replicas have temporarily diverged. The job of the architect is to choose which contract each operation gets — the ledger gets the first, the timeline gets the second.

Common confusions

  • "Pick two of CAP, but not all three." The single most-repeated wrong sentence in the field. The theorem says no system can guarantee all three; it does not say you choose two. P is forced on you the instant you have two networked nodes. The actual choice is between C and A during a partition. Brewer himself disowned the "2 of 3" framing in his 2012 reflection.

  • "NoSQL is AP, SQL is CP." A category error. MongoDB v4+ defaults to CP. Cassandra at consistency level = ALL is CP. PostgreSQL with async replication is AP-leaning. MySQL with multi-primary group replication is even messier. Classification is per-system, per-config, per-query — never per-data-model-buzzword.

  • "My system is highly available, so it is AP." "Available" in CAP is a mathematical statement: every non-failing node responds to every request in finite time, even when the network is partitioned. "Available" in your SLO is operational: 99.99% uptime in normal conditions. You can be 99.99% operationally and still be CP — most of the year, no partition is happening, and CP systems respond fine.

  • "Linearizability is the same as serializability." They are different. Linearizability is about real-time order on a single object: a read sees the latest write, by wall-clock. Serializability is about transaction order across multiple objects: the result is equivalent to some serial order, not necessarily the real-time one. Spanner gives you both (external consistency = linearizable + serializable). CockroachDB gives serializability but not strict linearizability across regions without AS OF SYSTEM TIME. Most application bugs that look like CAP violations are actually missing serializability, not missing linearizability.

  • "Consistency in CAP is the C in ACID." Different concepts that share a letter. CAP's C is linearizability — a property of the read/write protocol. ACID's C is consistency-of-application-invariants — a property of the transaction (no negative balance, no orphaned foreign key). A database can be ACID-C (every transaction respects invariants on a single node) without being CAP-C (across replicas, reads can be stale). This collision is one of the worst pieces of historical naming in distributed systems.

  • "Eventual consistency means consistent within a few seconds." No formal bound. "Eventual" means if writes stop, all replicas will converge to the same state in a finite but unbounded time. In practice, Cassandra and DynamoDB heal within seconds because they aggressively read-repair and run anti-entropy; in adversarial conditions (a node down for an hour, a regional partition, a network with high reordering) the convergence window can be much longer. Always ask the operator what the measured p99 staleness is, not what the docs aspire to.

Going deeper

The headline theorems — CAP and PACELC — are coarse. The real classification taxonomy used in research is finer-grained, and the production patterns that matter are not in the original papers. This section maps the rest of the territory.

The CAP impossibility proof, in two paragraphs

Gilbert and Lynch's proof is a one-page argument by contradiction. Suppose a system S guarantees C, A, and P. Take two nodes G_1 and G_2 each holding a copy of object x with initial value v_0. Partition the network between them. A client writes v_1 to G_1. By A, G_1 must respond; by P, the message to G_2 may be lost. A second client now reads x from G_2. By A, G_2 must respond; by C (linearizability), it must return v_1 — but G_2 only knows v_0. Contradiction. Therefore at most two of C, A, P hold simultaneously.

The proof is sharp because it lives in an asynchronous network model with no upper bound on message delay. In a synchronous model (messages arrive within \Delta), you can sometimes detect a partition (a message did not arrive within \Delta), and the trade-off becomes a delicate-but-doable problem. Real networks are partially synchronous — usually messages arrive in 10 ms, occasionally a kernel hangs and they take 30 seconds — which is why production systems use timeouts as their crude partition detector and accept that timeouts produce false positives ("flapping leadership").

Per-operation CAP — the FaunaDB and Cosmos approach

The honest modern view is that CAP is per-operation, not per-system. Cosmos DB exposes five consistency levels — Strong, Bounded staleness, Session, Consistent prefix, Eventual — and each query picks one. A single application can use Strong for the ledger row and Eventual for the analytics counter, both stored in the same Cosmos DB collection. FaunaDB advertises strict serializability for transactions but offers explicit cheaper modes for read-only queries that do not need wall-clock freshness.

This is what the architecture pattern in the UPI worked example is codifying: a UPI app written on Cosmos DB or Fauna would not need six different stores; it would use one store and pick the consistency knob per operation. The data layout is unified; only the quorum logic is per-call. This is where industry is heading; the "different store per operation" pattern is itself a workaround for systems that do not yet expose per-call CAP knobs.

Jepsen — what real partition testing finds

Kyle Kingsbury's Jepsen project has run the same experiment against every major distributed database for a decade: simulate partitions, network reorderings, clock skew, process pauses; check whether the system delivers the consistency guarantees its docs claim. The results are humbling.

  • MongoDB v3.4 lost acknowledged writes during partitions despite claiming linearizability. Fixed in v4 by adopting Raft.
  • Elasticsearch for years offered no usable consistency mode under partition; documented behaviour did not match observed behaviour.
  • etcd, ZooKeeper, and Spanner generally behaved as their docs claim — these are systems with tight test harnesses and engineering teams that take Jepsen seriously.
  • Cassandra at LOCAL_QUORUM does what it says under partitions, but operators routinely misconfigure replication factors and effectively run at ONE.

The lesson is not that distributed databases lie. It is that correctness under partition is hard, and unverified claims are wrong with high probability. A system that has been Jepsen-tested and fixed is safer than one that has not, even if both claim the same letters.

Spanner's TrueTime — buying linearizability with a clock

Spanner achieves PC/EC across continents because it has a hardware-assisted clock. Every datacenter has GPS receivers and atomic clocks; the TrueTime API returns an interval [\text{earliest}, \text{latest}] that is guaranteed to contain the real wall-clock time. To commit a transaction, Spanner picks a timestamp t > \text{latest} and waits until \text{earliest} > t before acknowledging. The wait is typically 1–7 ms — small enough to ignore for OLTP, large enough to guarantee that any later transaction at any datacenter sees a strictly larger timestamp.

This is the only known way to get global linearizability without a single coordinator. CockroachDB, which does not have GPS clocks, uses a software analogue (HLC — hybrid logical clocks) plus uncertainty intervals on reads, and accepts a slightly weaker guarantee (serializability + per-region linearizability rather than global). The hardware lets Spanner do something CockroachDB structurally cannot.

The Harvest-Yield framework — a CAP refinement

Brewer and Fox's Harvest, Yield, and Scalable Tolerant Systems (1999) is older than CAP and arguably more useful. It distinguishes:

  • Yield — the fraction of requests answered (the operational sense of "available").
  • Harvest — the fraction of the data reflected in each answer (e.g. "we returned 9 of 10 shards' results").

Search engines and analytics dashboards trade harvest for yield: rather than fail the query when one shard is unreachable, return what the others have, with a "results may be incomplete" footer. CAP forces you to refuse the request; harvest-yield lets you serve a partial answer. Most production systems use one or both axes, and CAP without harvest-yield is too binary a model for what real apps need.

Why "PA/EC" is structurally rare

The empty-ish quadrant is not an oversight. To be PA, the system must accept divergent writes during a partition; to be EC in the no-partition case, it must give linearizable reads. But the merge logic that resolves divergent writes after a partition heals must reorder events, which means a read in the no-partition case may see a value that has not yet been merged with a concurrent (and potentially partitioned) write somewhere else. The only way out is to use a global wait — like Spanner's TrueTime — but a global wait is itself a synchronous coordination, which forfeits PA. So PA/EC requires either accepting weaker-than-linearizable reads (which makes you EL in disguise) or accepting global coordination (which makes you PC). Riak's strong-consistency mode is the closest production attempt and is documented as "use only for the small subset of buckets that need it"; everyone else stays in PA/EL.

Where this leads next

You now have the right vocabulary for the rest of Build 22 (and most production-systems interviews). The next chapters apply this map to specific design choices.

After this, you will stop seeing "CP" or "AP" as system labels and start seeing them as per-operation choices in a layered architecture. That is the discipline this chapter exists to install.

Stop arguing about CAP

After ten years of "CAP is dead" posts and another ten of "actually it is misunderstood" rebuttals, the productive position is: CAP is a precise theorem about a precise corner case (network partition). It does not tell you what to build. It tells you what trade-off appears at one specific failure mode. PACELC tells you what trade-off appears the rest of the time. Together they give you vocabulary for the actual conversation, which is per-operation: what does this query, this write, this surface need?

The next time someone tells you "X is a CP system" or "Y chose A over C", the right reply is: "Per which operation? At which consistency level? With which read concern? In which configuration?" The answer is almost never one letter. The answer for the system you build will not be one letter either. That is fine — that is the discipline you have learned.

References

  1. Brewer, E. (2000). Towards Robust Distributed Systems. PODC keynote. PDF — the original conjecture.
  2. Gilbert, S. & Lynch, N. (2002). Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. SIGACT News. PDF — the formal proof.
  3. Brewer, E. (2012). CAP Twelve Years Later: How the "Rules" Have Changed. IEEE Computer / InfoQ. Article — Brewer's own retraction of the "2 of 3" framing.
  4. Abadi, D. (2012). Consistency Tradeoffs in Modern Distributed Database System Design. IEEE Computer. PDF — the PACELC paper.
  5. Helland, P. (2015). Memories, Guesses, and Apologies. Blog summary — the "guess and apologise" mental model that frames CAP pragmatically.
  6. Kingsbury, K. Jepsen analyses — a decade of empirical partition-testing of every major distributed database. The single best resource for "does this system actually do what it claims under partition."
  7. Fox, A. & Brewer, E. (1999). Harvest, Yield, and Scalable Tolerant Systems. HotOS — the harvest/yield refinement that predates CAP and is, for many real workloads, more useful.