Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

The CAP theorem (and its misuse)

It is 03:14 on a Friday at PaySetu. The on-call engineer, Aniketh, watches the inter-DC link between Mumbai and Bengaluru flap for the fourth time in twenty minutes. The transaction database is configured as a five-node Raft cluster — three replicas in Mumbai, two in Bengaluru. With the link down, Bengaluru cannot see the leader. Bengaluru's two nodes can keep serving reads from their local stale state, but accepting writes would risk forking the ledger and double-spending ₹4.7 crore of in-flight UPI settlements. The on-call playbook says: refuse writes, return 503 to the API gateway, page the network team. That choice — preserve consistency, sacrifice availability during the partition — is the entire content of CAP. Everything else marketing has wrapped around the theorem in twenty-five years is either trivial or wrong.

CAP says that a distributed system that experiences a network partition (P) must, during the partition, either drop consistency (C) or drop availability (A) — you cannot keep all three. The theorem is a narrow impossibility result about a specific failure mode, not a design taxonomy. "CA databases" are a category error: every real distributed system has partitions, so every real distributed system makes a CP-vs-AP choice. PACELC is the strictly more useful refinement: it adds the latency-vs-consistency trade-off that operates outside a partition, which is what most production tuning is actually about.

What CAP actually says (and what it does not)

Eric Brewer's 2000 PODC keynote stated the conjecture; Seth Gilbert and Nancy Lynch's 2002 SIGACT paper proved it. The formal statement is narrow, and it is worth reading slowly:

A web service that is replicated across asynchronous, possibly-partitioning network links cannot simultaneously guarantee:

  • Consistency (C): every read returns the most recent write or an error (specifically: linearizability — see linearizability).
  • Availability (A): every request to a non-failed node returns a non-error response in finite time.
  • Partition tolerance (P): the system continues operating despite arbitrary message loss between nodes.

The proof fits in a paragraph. Suppose a system claims all three. Place two nodes N1 and N2 in different network partitions — they cannot exchange messages. A client writes v=1 to N1; by A, N1 must accept the write. The same client immediately reads from N2; by A, N2 must respond. By C, N2 must return v=1 — but N2 has not heard from N1 and cannot know v changed. Contradiction.
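The contradiction is mechanical enough to execute. Below is a minimal sketch of the argument, not anyone's real implementation: two in-process registers stand in for N1 and N2, the partition is modelled as replication simply never happening, and availability is modelled as both nodes being obliged to answer from local state. Running it trips the final assertion, which is the theorem.

# cap_proof_sketch.py: the Gilbert-Lynch argument as executable pseudocode.
class Register:
    def __init__(self):
        self.value = 0                # initial value v=0 on both nodes

    def write(self, v):
        # Availability: must accept; the partition means it cannot forward.
        self.value = v
        return "OK"

    def read(self):
        # Availability: must answer, and only local state is visible.
        return self.value

n1, n2 = Register(), Register()       # partitioned: no messages flow between them

assert n1.write(1) == "OK"            # A forces N1 to accept the write
stale = n2.read()                     # A forces N2 to respond
print("N2 returned:", stale)          # prints 0: the write is invisible to N2

# C (linearizability) requires this read to observe the completed write.
assert stale == 1, "C, A and P cannot all hold during the partition"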

Why the theorem is so narrow: the proof assumes (a) asynchronous network — no upper bound on message delivery time — and (b) the strongest form of consistency, linearizability. Relax either assumption and the impossibility weakens. With a synchronous network and a known timeout you can detect partitions deterministically; with a weaker consistency model like causal or eventual, you can keep both A and the weaker-C during partitions. Most production databases sit in the relaxed-consistency space precisely so they don't have to live under the strict CAP impossibility every minute of every day.

The crucial misreading: CAP does not say "pick two of three". The system does not get to "give up P" — partitions are a fact of physical networks, not a design dial. What CAP actually says is: during a partition, the system designer must pick whether reads/writes return stale data (give up C) or refuse to respond (give up A). Outside a partition, both C and A are available simultaneously. CAP is a statement about behaviour during a specific failure mode, not a permanent design tier-list.

[Figure: the CAP triangle, drawn honestly. Vertices: C (consistency, linearizable), A (availability), P (partition tolerance). The CA edge is dashed and struck through, labelled "physically unreachable on real networks". The CP edge lists Spanner, etcd, ZooKeeper, HBase; the AP edge lists Cassandra, Dynamo, Riak. Illustrative; vendor placement reflects default modes, and most systems are tunable across the boundary.]
The triangle is drawn honestly with the CA edge struck through. Real networks partition, so a real distributed system lives on either the CP or the AP edge. The choice is not which two of three to keep — it is which behaviour to exhibit when the partition arrives.

What "give up C" and "give up A" actually look like at runtime

The abstract trade-off becomes concrete only when you watch a system during a partition. There are two production-realistic modes; CAP is the choice between them.

CP mode — refuse to respond on the minority side. The cluster has a quorum (majority) requirement. When a partition isolates a minority of nodes, the minority side stops accepting writes (and possibly reads) — it returns 503 Service Unavailable or blocks until the partition heals. The majority side, which still has quorum, keeps serving. Examples: Raft-based stores like etcd, ZooKeeper, CockroachDB's transaction layer; HBase's region servers behind a quorum'd HMaster; Spanner under Paxos. Cost: a fraction of clients see errors during the partition.

AP mode — keep responding everywhere, with stale or divergent state. Both sides of the partition keep accepting reads and writes. They diverge: the same key can hold different values on each side. Once the partition heals, the system reconciles via last-write-wins, version vectors, CRDTs, or application-level merge logic. Examples: Cassandra with CONSISTENCY_LEVEL=ONE, DynamoDB in eventually-consistent mode, Riak. Cost: clients can read stale values during the partition, and post-healing reconciliation may discard or merge writes in surprising ways.

A nuance the marketing pages always miss: the CP-vs-AP choice is usually per-operation, not per-database. Cassandra at QUORUM is CP (a partition that leaves the coordinator without quorum returns errors); the same cluster at ONE is AP. CockroachDB at default serializable isolation is CP; with follower reads enabled the read path becomes AP-with-bounded-staleness. The database is a knob set, not a fixed coordinate on the triangle.

Why this per-operation framing matters in production: a single database serves many workloads with different consistency needs. PaySetu's transaction ledger and PaySetu's audit-log dashboard share one Cassandra cluster, but the ledger writes go at QUORUM (CP — the ledger must not fork) and the dashboard reads go at ONE (AP — a 200 ms stale graph is cheaper than a 30-second wait during a partition). Treating the cluster as monolithically CP or AP would force the wrong trade-off on one workload or the other. The right design question is "which consistency level for which call?" — a question CAP-the-triangle does not even ask.
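Concretely, here is what the per-call choice looks like with the DataStax Python driver (cassandra-driver). The contact point, keyspace, tables, and columns are hypothetical stand-ins for the PaySetu schema; the per-statement consistency-level API is the driver's real one.

# Per-operation consistency levels: one cluster, two workloads.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["10.0.0.1"]).connect("paysetu")   # hypothetical contact point

# Ledger write: CP-leaning. If a partition leaves the coordinator without a
# quorum of replicas, the driver raises Unavailable instead of forking state.
ledger_write = SimpleStatement(
    "INSERT INTO ledger (txn_id, account_id, amount) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(ledger_write, ("txn-9001", "acct-42", 1500))

# Dashboard read: AP-leaning. Any single replica may answer, possibly with
# stale data, but it answers even while quorum is unreachable.
dashboard_read = SimpleStatement(
    "SELECT views FROM audit_log WHERE day = %s",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(dashboard_read, ("2025-03-14",))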

[Figure: CP vs AP during the same partition window. A timeline from t=0 to t=200 s; a partition begins at t=60 and heals at t=140. CP row: both sides serve, then the majority keeps serving while the minority returns 503, then both serve again. AP row: both sides serve throughout but diverge during the partition; a merge event at t=140 reconciles them. CP loses availability on the minority; AP loses consistency until the merge. Illustrative.]
The same 80-second partition window. CP keeps consistency by refusing the minority side; AP keeps availability by letting both sides diverge and reconciling later. PaySetu's ledger needs CP (double-spend is unrecoverable). CricStream's view-counter needs AP (one missed view is recoverable). Pick per workload, not per company.

A simulator: CP vs AP during a partition

The Python below builds two identical three-node key-value clusters, drives one in CP mode and one in AP mode, isolates the same node in each, and shows what each mode does. The clusters start identical; only the partition behaviour differs.

# cap_sim.py — CP and AP behaviour during a forced partition

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}                # key -> (value, version)
        self.peers = []                # references to other nodes
        self.partitioned_from = set()  # set of peer names we cannot reach

    def can_reach(self, peer):
        return peer.name not in self.partitioned_from

    def quorum_peers(self):
        return [p for p in self.peers if self.can_reach(p)]

class CPCluster:
    def __init__(self, nodes): self.nodes = nodes
    def write(self, coord, k, v):
        reachable = [coord] + coord.quorum_peers()
        if len(reachable) <= len(self.nodes) // 2:
            return ("ERROR_NO_QUORUM", None)
        version = max((n.store.get(k, (None, 0))[1] for n in reachable), default=0) + 1
        for n in reachable: n.store[k] = (v, version)
        return ("OK", version)
    def read(self, coord, k):
        reachable = [coord] + coord.quorum_peers()
        if len(reachable) <= len(self.nodes) // 2:
            return ("ERROR_NO_QUORUM", None)
        return ("OK", max((n.store.get(k, (None, 0)) for n in reachable), key=lambda x: x[1]))

class APCluster:
    def __init__(self, nodes): self.nodes = nodes
    def write(self, coord, k, v):
        version = coord.store.get(k, (None, 0))[1] + 1
        coord.store[k] = (v, version)
        for n in coord.quorum_peers():
            n.store[k] = (v, version)        # best-effort, async in real life
        return ("OK", version)
    def read(self, coord, k):
        return ("OK", coord.store.get(k, (None, 0)))   # local read, possibly stale

if __name__ == "__main__":
    def make_nodes():
        # Fresh nodes per cluster, so the CP and AP runs don't share state.
        nodes = [Node(name) for name in "ABC"]
        for n in nodes:
            n.peers = [p for p in nodes if p is not n]
        return nodes

    cp_nodes, ap_nodes = make_nodes(), make_nodes()
    cp, ap = CPCluster(cp_nodes), APCluster(ap_nodes)
    cpA, _, cpC = cp_nodes
    apA, _, apC = ap_nodes

    # Healthy: both modes accept writes
    print("CP write (healthy):", cp.write(cpA, "balance", 1000))
    print("AP write (healthy):", ap.write(apA, "balance", 1000))

    # Partition: node C is isolated from A and B in both clusters
    for a, b, c in (cp_nodes, ap_nodes):
        c.partitioned_from = {"A", "B"}
        a.partitioned_from = {"C"}
        b.partitioned_from = {"C"}

    print("CP write to majority A:", cp.write(cpA, "balance", 1500))  # OK (A+B = quorum)
    print("CP write to minority C:", cp.write(cpC, "balance", 9999))  # ERROR
    print("AP write to majority A:", ap.write(apA, "balance", 1500))
    print("AP write to minority C:", ap.write(apC, "balance", 9999))  # both succeed → divergence

    print("CP read from minority C:", cp.read(cpC, "balance"))        # ERROR
    print("AP read from minority C:", ap.read(apC, "balance"))        # returns 9999 — stale fork

Sample output (the simulation has no randomness, so every run prints this):

CP write (healthy): ('OK', 1)
AP write (healthy): ('OK', 1)
CP write to majority A: ('OK', 2)
CP write to minority C: ('ERROR_NO_QUORUM', None)
AP write to majority A: ('OK', 2)
AP write to minority C: ('OK', 2)
CP read from minority C: ('ERROR_NO_QUORUM', None)
AP read from minority C: ('OK', (9999, 2))

Why the AP mode looks "wrong" but is actually a deliberate design choice: the minority node C accepted a write of 9999 even though A and B had moved on to 1500. Two version-2 values now exist for the same key. When the partition heals, AP must reconcile — last-write-wins by timestamp, vector-clock merge, or application-level resolution (CRDTs, manual merge). For a banking ledger this is unacceptable; for a social-media like counter it is fine. The CP mode's ERROR_NO_QUORUM is unfriendly to clients but preserves the invariant that there is one true ledger value per key. CAP forces you to pick which kind of unfriendliness your workload tolerates.
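To make the reconciliation step concrete, here is a last-write-wins healer bolted onto the AP cluster from the simulator. The tiebreak rule, highest version then highest node name, is an arbitrary choice for this sketch; real systems use wall-clock timestamps, vector clocks, or application merge functions.

# heal_and_reconcile: LWW merge for the AP cluster after the partition ends.
def heal_and_reconcile(nodes):
    for n in nodes:
        n.partitioned_from = set()              # the partition is over
    for key in {k for n in nodes for k in n.store}:
        # Winner: highest version, node name as a deterministic tiebreak.
        winner = max(
            ((n.store[key], n.name) for n in nodes if key in n.store),
            key=lambda pair: (pair[0][1], pair[1]),
        )[0]
        for n in nodes:
            n.store[key] = winner               # the losing fork is discarded

heal_and_reconcile(ap_nodes)
print("after heal:", [n.store["balance"] for n in ap_nodes])
# [(9999, 2), (9999, 2), (9999, 2)]: both forks were version 2, so the
# tiebreak let the minority's 9999 win and silently dropped the majority's
# 1500. LWW always loses somebody's write; the only question is whose.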

How CAP gets misused — three patterns engineering teams pay for

"We picked a CA database, so partitions can't break us." This claim usually appears in vendor docs for single-machine or single-AZ databases (early MongoDB pre-3.4, MySQL with synchronous replication marketed as "no partition tolerance needed"). The category is incoherent. A CA database is one that assumes the network never partitions. When the network does partition (and it will — switch reboots, AZ outages, BGP misconfiguration), the database has no defined behaviour. PaySetu inherited this exact pattern from a 2019-era stack: the team had been told they ran a "CA" PostgreSQL primary-replica and were "safe from CAP". They were not safe. They were running a CP system whose partition behaviour was "primary stalls, replica diverges silently, automatic failover sometimes promotes the wrong node". Brewer himself wrote in 2012 that "the 2 of 3 formulation is misleading" — this is the pattern he meant.

"NoSQL is AP and SQL is CP." This is a marketing artefact from 2010 that has aged poorly. CockroachDB and Spanner are SQL and CP. Cassandra and DynamoDB are NoSQL but configurable across CP and AP per-query (QUORUM vs ONE, strong vs eventual reads). The CP/AP choice is about replication and quorum design, not query language.

"Our system is fine because we have multi-region replication." Multi-region replication makes partitions more likely (cross-region links are flakier than intra-DC links) and forces CP-vs-AP decisions to actually matter. Spanner solved this with TrueTime so it can be globally CP at a 7 ms commit-latency cost. CockroachDB allows per-table region affinity so most rows are CP within one region and the cross-region traffic is bounded. KapitalKite — a fictional stockbroker — learned this in 2024 when its "geo-replicated for resilience" trading database was actually serving stale prices on its Singapore replica during a 90-second cross-region partition, and three customers placed buy orders at 1.4% off market. The post-mortem read: "We bought multi-region replication and forgot we owed the network a CAP decision per workload."

Common confusions

  • "CAP says you can have any two of C, A, P." No. CAP says that during a partition, you cannot have both C and A. Outside a partition, you have both. P is not a dial — it is the precondition for the impossibility. The "pick two" framing is the most damaging misreading of the theorem.
  • "My single-node database is CA." A single-node database has no partitions because there is no network to partition. CAP does not apply; the system is just consistent and available in the local sense. Calling it "CA" suggests it would survive partitioning, which it would not (a single node failing is not a partition; it is unavailability). See partial failures and why they're the worst.
  • "AP systems are eventually consistent, CP systems are strongly consistent." AP systems are at most eventually consistent during partitions; outside partitions they can be strongly consistent. CP systems are strongly consistent always but unavailable on the minority side during partitions. The model is per-mode, not a permanent label. See eventual consistency.
  • "PACELC is just CAP with two more letters." PACELC adds the case CAP ignores: what does the system do when there is no partition? Latency-vs-consistency is the daily-life trade-off; CP/AP is the partition-day trade-off. PACELC captures both; CAP captures only the partition. PACELC is the more useful framework for production tuning. See "Going deeper" below.
  • "If I use Raft, I am CP forever." Raft gives you CP for the consensus log, but most systems built on Raft (etcd, CockroachDB, TiKV) layer non-Raft reads on top — leaseholder reads, follower reads, stale reads. Each layer has its own CP/AP profile. Calling the whole system "CP" hides the fact that the read path may be AP under specific configurations.
  • "I can't have both consistency and high availability." You can — outside a partition, both are available. The theorem only forces a choice when the partition arrives, and modern partitions are usually short (seconds to minutes). Spanner's measured availability is 99.999% precisely because partitions are rare and short, and during them the unavailable fraction is small. The CAP trade-off is per-partition-second, not per-architectural-decade.

Going deeper

Gilbert and Lynch's 2002 proof — the asynchronous-network corner case

Gilbert and Lynch's SIGACT paper formalised Brewer's conjecture by separating two network models: asynchronous (no upper bound on message delivery) and partially synchronous (messages eventually arrive, but the bound is unknown). In the asynchronous model, the impossibility is sharp: you cannot have all three of CAP. In the partially synchronous model, you can — during periods of synchrony — implement consensus and thus all three; but every period of asynchrony (a partition) re-imposes the trade-off. Most real networks are partially synchronous, which is why production systems can sometimes look like they have all three. They don't; they just don't see partitions for long enough to be forced to choose. The proof technique — partition the world, observe what messages a single node could have received, derive the contradiction — is the template for the FLP impossibility result and for arguments about crash, omission, timing, and Byzantine failure modes too.

PACELC — the framework CAP should have been

Daniel Abadi's 2010 PACELC formulation is the strictly more useful refinement: if Partition (P), choose Availability (A) or Consistency (C); Else (E), choose Latency (L) or Consistency (C). The ELC arm is what most production tuning is actually about — your cross-region replicated store has no partition right now, but every read either waits for a quorum acknowledgement (high latency, strong C) or returns a local stale read (low latency, weak C). DynamoDB defaults to ELC-favouring-L (low-latency stale reads) and lets you opt into ELC-favouring-C with strongly-consistent reads. CockroachDB defaults to ELC-favouring-C and lets you opt into ELC-favouring-L with follower reads. PACELC names the dial that 95% of database operators are turning every day, while CAP names the one that turns once a quarter when a switch fails.
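The ELC dial is often literally one parameter. A sketch with boto3 against DynamoDB; the table and key names are hypothetical, but ConsistentRead is the service's real opt-in to the C side of ELC.

# ELC in one parameter: DynamoDB reads via boto3.
import boto3

table = boto3.resource("dynamodb").Table("orders")   # hypothetical table

# ELC favouring L (the default): an eventually-consistent read. Any replica
# may answer; it is faster and costs half the read capacity, but may be stale.
fast = table.get_item(Key={"order_id": "o-123"})

# ELC favouring C: a strongly consistent read. The response reflects all
# prior successful writes, at higher latency and double the read-capacity cost.
strong = table.get_item(Key={"order_id": "o-123"}, ConsistentRead=True)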

"What is a partition, anyway" — Bailis on operational reality

Peter Bailis's blog series and the 2014 "The Network Is Reliable" study with Kyle Kingsbury measured what real-world partitions look like. The summary: partitions are bursty, short (median tens of seconds, p99 hours), and asymmetric ("A can reach B but B cannot reach A" is common). That asymmetry is what makes "we use a quorum, we're fine" naive — a node that can send to its peers but cannot hear from them may believe it is the leader long after it has been deposed. The CAP theorem is silent about asymmetric partitions; production systems must handle them anyway. Brewer's keynote slide showing a clean two-cluster partition was a teaching simplification; real partitions are partial graphs that change every few seconds.
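The simulator above can already express asymmetry, because each node carries its own partitioned_from set for outbound messages. A short continuation of the same script (reusing make_nodes and APCluster from cap_sim.py) shows the failure shape:

# Asymmetric partition: A still delivers to B, but B cannot deliver to A.
nodes = make_nodes()
ap = APCluster(nodes)
A, B, _ = nodes

B.partitioned_from = {"A"}            # B's messages to A are dropped;
                                      # A.partitioned_from stays empty

ap.write(A, "price", 100)             # replicates to B and C as usual
ap.write(B, "price", 101)             # reaches B and C; A never hears of it
print(A.store["price"], B.store["price"])   # (100, 1) (101, 2)
# A sees all of its own writes succeed and concludes replication is healthy,
# while B has quietly moved past it. This is the "deposed leader that can
# still talk" scenario a clean two-sided partition diagram never shows.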

MealRush's CAP misadventure

MealRush — a fictional food-delivery app — ran its order-state database on a "CA" MySQL primary-replica setup in 2023. The team's CAP intuition was: "single primary, replicas are read-only, no partitions can break us". Then the AZ where the primary lived had a 14-minute network blip. The replica's automatic-failover script promoted itself based on a 30-second heartbeat timeout, but the original primary was still up and accepting writes from clients in its own AZ. Both nodes wrote to the order-status table; for 14 minutes, two MealRush primaries existed. When the partition healed, the replication catch-up tool's last-write-wins resolved several thousand orders to the wrong state — orders marked DELIVERED were silently rolled back to PREPARING; refunds for orders marked REFUNDED were re-issued. The customer-impact bill was ₹1.2 crore. The post-mortem named the failure "CAP-by-accident": the system had been CP all along (the failover was supposed to provide CP), but the heartbeat-based fencing was inadequate, so the actual mode during the partition was undefined-and-divergent. The fix was a Raft-based control plane (etcd) for primary election with a hard fencing token written into the storage path — a textbook CP design.
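The fencing fix is small enough to sketch. This is the generic fencing-token pattern with the control plane reduced to a counter; it is not etcd's actual API, and all names here are hypothetical.

# Fencing tokens: the storage path refuses writes from a deposed primary.
class ControlPlane:
    # Stands in for the Raft-backed election service (etcd in the fix above).
    def __init__(self):
        self.epoch = 0

    def elect_primary(self):
        self.epoch += 1                  # monotonically increasing fencing token
        return self.epoch

class FencedStorage:
    def __init__(self):
        self.highest_token = 0
        self.rows = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return "REJECTED_STALE_PRIMARY"   # the old primary is fenced off
        self.highest_token = token
        self.rows[key] = value
        return "OK"

control, storage = ControlPlane(), FencedStorage()
old_primary = control.elect_primary()    # token 1: the original primary
new_primary = control.elect_primary()    # token 2: failover promotes a replica

print(storage.write(new_primary, "order-17", "DELIVERED"))   # OK
print(storage.write(old_primary, "order-17", "PREPARING"))   # REJECTED_STALE_PRIMARY

With fencing in the storage path, MealRush's 14-minute two-primary window produces rejected writes instead of a silently forked order table.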

When the CAP triangle is genuinely the wrong frame

For some workloads the CAP triangle is not the right axis at all. Stream-processing pipelines care about exactly-once delivery and watermarks, not C/A/P. Eventually-consistent CRDTs (G-counters, OR-sets, see CRDT) deliberately give up linearizability and instead promise strong eventual consistency: every replica that has received the same set of updates is in the same state, regardless of order. CRDTs are AP at the read-write API level but provide a different correctness guarantee that CAP does not name. CAP is a useful frame for read-write KV stores; for richer data types and richer guarantees, use CRDT or session-guarantee language instead.
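A G-counter makes "strong eventual consistency" concrete in a dozen lines. This is the textbook construction, sketched here for CricStream's view counter; the replica count and IDs are illustrative.

# G-counter: each replica increments only its own slot; merge is element-wise
# max, so replicas that have received the same updates converge in any order.
class GCounter:
    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.slots = [0] * n_replicas

    def increment(self):
        self.slots[self.id] += 1

    def value(self):
        return sum(self.slots)

    def merge(self, other):
        self.slots = [max(a, b) for a, b in zip(self.slots, other.slots)]

a, b = GCounter(0, 2), GCounter(1, 2)   # two replicas, currently partitioned
a.increment(); a.increment()            # 2 views land on replica A
b.increment()                           # 1 view lands on replica B

a.merge(b); b.merge(a)                  # partition heals; merge in either order
assert a.value() == b.value() == 3      # converged, no views lost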

Where this leads next

CAP sits at the head of Part 12. The chapters that follow refine its trade-off into actually-implementable models:

Part 13's CRDT chapters tackle the post-partition reconciliation problem CAP's AP side leaves open. Part 14's distributed-transactions chapters revisit CP under the harder constraint of multi-key atomicity.

PACELC will reappear in Part 17's geo-distribution chapters, where the L-vs-C trade-off dominates day-to-day latency budgets, and CAP's partition arm will reappear in Part 10's failure-detection chapters, where the question "is this peer partitioned or just slow?" becomes the operational core of every consensus implementation.

References

  • Brewer, E. — "Towards Robust Distributed Systems" (PODC 2000 keynote). The original conjecture.
  • Gilbert, S. & Lynch, N. — "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" (SIGACT News, 2002). The proof.
  • Brewer, E. — "CAP Twelve Years Later: How the Rules Have Changed" (IEEE Computer, 2012). Brewer's own retraction of the "pick 2 of 3" framing.
  • Abadi, D. — "Consistency Tradeoffs in Modern Distributed Database System Design: CAP is Only Part of the Story" (IEEE Computer, 2012). The PACELC paper.
  • Bailis, P. & Kingsbury, K. — "The Network is Reliable" (ACM Queue, 2014). Empirical study of real-world partitions.
  • Helland, P. — "Standing on Distributed Shoulders of Giants" (ACM Queue, 2016). Architectural lessons after fifteen years of CAP.
  • Kleppmann, M. — Designing Data-Intensive Applications, Chapter 9. The clearest book-length treatment of CAP and its limits.
  • Linearizability, eventual consistency, partial failures and why they're the worst — the consistency models and the failure mode CAP composes.