Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Causal consistency
It is a Friday evening on CricStream — a fictional cricket-streaming service — and 18 million viewers are watching the Mumbai-vs-Chennai final. A user named Rohan posts a comment: "What a six!" Two seconds later he replies to himself: "Actually that was a four, my bad." His friend Kavya, watching from Bengaluru on a different replica, sees the reply appear at the top of the comment thread — without the original post. The thread is incoherent. The replica that served Kavya had received the second comment before the first because they took different paths through the geo-replication network. CricStream is not linearizable, not sequentially consistent — and it does not need to be, because nobody cares whether Rohan's post or some unrelated user's post landed first. What everyone cares about is that Rohan's reply must not appear before Rohan's original. That is exactly what causal consistency promises, and that is the model this article makes precise.
Causal consistency requires that if operation A happens-before operation B (Lamport's → relation: same-process order, or read-of-A precedes B, or transitively through chains), then every replica that sees B must have already seen A. Concurrent operations — those with no happens-before relation — may be observed in different orders by different replicas without violating the model. There is no single global order at all, only a directed acyclic graph of dependencies. This is the strongest model achievable while remaining available under network partition (Mahajan-Alvisi-Dahlin 2011), which is why every "AP-side" system that wants more than eventual consistency converges on it.
What "happens-before" actually means — Lamport 1978
The bedrock is Leslie Lamport's Time, Clocks, and the Ordering of Events in a Distributed System (CACM, 1978). Lamport defined a partial order → (read "happens-before") on events in a distributed execution by three rules:
- Process order. If a and b are events on the same process and a precedes b in that process's local sequence, then a → b.
- Message order. If a is the send of a message and b is the receive of that same message, then a → b.
- Transitivity. If a → b and b → c, then a → c.
Two events a and b are concurrent (written a ‖ b) if neither a → b nor b → a. Concurrency is not a positive claim that the events happened "at the same time" in some wall-clock sense — it is the absence of any causal chain connecting them. A causally consistent system is one where every replica's observed history of operations respects →: for every operation B a replica applies, all operations A with A → B must already be applied.
Why this is genuinely weaker than sequential consistency: sequential consistency demands a single total order S that all replicas agree on. Causal consistency only demands that each replica's local order be a linear extension of the partial order → — and different replicas can pick different linear extensions. For a pair of concurrent writes (W₁ ‖ W₂), replica R1 might apply them in order W₁, W₂ while R2 applies them as W₂, W₁. Both replicas are correct under causal consistency. They would not both be correct under sequential consistency, because no single global S can put W₁ before W₂ and W₂ before W₁ simultaneously.
Figure: three processes A, B, and C. Process order gives A1 → A2; message edges give A1 → B1 and the chain A2 → B2 → C1; A1 → C1 follows by transitivity. But A2 and B1 are concurrent: no chain of process or message edges connects them in either direction. Different replicas may apply this concurrent pair in different orders without violating causal consistency.
How replicas track dependencies — vector clocks and explicit causality
Causal consistency is not free. To enforce it the system must, at every replica, answer: "before applying operation B, have I already applied every operation A such that A → B?" Two implementation strategies dominate practice: vector clocks (track all dependencies) and explicit causality (track only application-named dependencies).
A vector clock for n replicas is an n-dimensional integer vector. Each replica i increments its own component V[i] on every local event. When replica i sends a message, it attaches its current vector. When replica j receives a message with vector V_msg, it sets each V_j[k] := max(V_j[k], V_msg[k]) and then increments V_j[j]. Two operations are causally ordered iff one vector dominates the other component-wise; concurrent iff neither dominates. Vector clocks are exact — they capture happens-before precisely, no false positives, no false negatives — but their size grows with the replica count, which is why they fall over above ~64 replicas.
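A minimal sketch of those update and comparison rules, assuming a fixed replica count n and plain Python lists (the function names are illustrative, not any library's API):
# vector_clock.py: a sketch of the vector-clock update and comparison rules
def vc_new(n):
    # all-zero clock for a system of n replicas
    return [0] * n

def vc_tick(v, i):
    # replica i increments its own component on every local event
    v[i] += 1

def vc_on_receive(v_local, v_msg, i):
    # on receipt: component-wise max with the message's vector, then tick
    for k in range(len(v_local)):
        v_local[k] = max(v_local[k], v_msg[k])
    v_local[i] += 1

def vc_relation(a, b):
    # '->' if a happens-before b, '<-' if b happens-before a, '||' if concurrent
    a_le_b = all(x <= y for x, y in zip(a, b))
    b_le_a = all(y <= x for x, y in zip(a, b))
    if a_le_b and not b_le_a:
        return '->'
    if b_le_a and not a_le_b:
        return '<-'
    return '=' if a == b else '||'

assert vc_relation([1, 0, 0], [2, 1, 0]) == '->'  # one dominates: causally ordered
assert vc_relation([1, 0, 0], [0, 0, 1]) == '||'  # neither dominates: concurrent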
Explicit causality is the technique that scales. Used by COPS (Lloyd-Freedman-Kaminsky-Andersen 2011) and Eiger (Lloyd et al. 2013), it says: don't track every dependency, only the ones the application names. When a client writes B, it tags B with the keys of all prior reads in this session — those are B's nearest causal dependencies. The server stores B's metadata as (key, version, deps=[(k1,v1), (k2,v2), ...]). When replicating B to another datacenter, the receiving replica checks whether all dependencies are present locally before applying B. If not, B is queued. The trick is that the dependency set is bounded by how much work the client did, not how many replicas exist — a write after one read carries one dependency, regardless of cluster size.
Why explicit causality is sound when applications behave: soundness requires that every operation a client causally depends on appear in the dependency set. If a client reads X, then reads Y, then writes Z, the dependency set for Z is {X, Y}. If the client also observed something through a side channel (e.g. they read X, then a friend told them about Y over WhatsApp, then they wrote Z), that out-of-band channel is invisible to the system, and the system cannot enforce the resulting dependency. Causal consistency in real deployments thus depends on the application boundary: every operation that influences a later write must flow through the system's APIs. CricStream's comment thread is safe because Rohan's reply is created through the app and tagged with the original post's id; if Rohan instead screenshotted his post, sent it to Kavya on Telegram, and Kavya replied based on the screenshot, the system has no way to enforce that ordering — and the model legitimately does not promise to.
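To make the boundary concrete, here is a hedged sketch of session-level dependency tagging in the spirit of COPS. InMemoryStore and CausalSession are illustrative names, not COPS's actual interface, and the backend is a toy single replica:
# explicit_deps.py: sketch of explicit-causality tagging at the session layer
class InMemoryStore:
    # toy single-replica backend; versions are per-key counters
    def __init__(self):
        self.rows = {}                        # key -> (val, version, deps)

    def get(self, key):
        val, version, _deps = self.rows[key]
        return val, version

    def put(self, key, val, deps):
        version = self.rows.get(key, (None, 0, None))[1] + 1
        self.rows[key] = (val, version, deps)
        return version

class CausalSession:
    def __init__(self, store):
        self.store = store
        self.seen = set()                     # (key, version) pairs observed via the API

    def read(self, key):
        val, version = self.store.get(key)
        self.seen.add((key, version))         # reads become future dependencies
        return val

    def write(self, key, val):
        # nearest dependencies: everything this session observed through the store
        version = self.store.put(key, val, deps=frozenset(self.seen))
        # the new write transitively covers its deps, so it replaces them
        self.seen = {(key, version)}
        return version

s = CausalSession(InMemoryStore())
s.write('comment:42', 'What a six!')          # carries no deps
s.read('comment:42')
s.write('comment:43', 'Actually a four')      # carries deps = {('comment:42', 1)}
Anything Rohan learns over Telegram never passes through read(), so it never enters seen: that is exactly the application boundary described above.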
A causal-consistency checker that exposes the boundary
The Python below builds a tiny three-replica system, applies a sequence of writes and reads with explicit dependency tags, and verifies that every replica's observed history respects happens-before. It then injects a violation — a replica that applies a dependent write before its dependency — and shows the checker catching it.
# causal_check.py — verify a multi-replica history is causally consistent
from dataclasses import dataclass, field


@dataclass
class Op:
    op_id: str                    # globally unique
    client: str
    kind: str                     # 'write' or 'read'
    key: str
    val: int | None
    deps: set = field(default_factory=set)   # op_ids this op causally depends on


def is_causally_consistent(replica_histories):
    """replica_histories: dict[replica_id -> list[Op]] in apply-order at that replica.
    Returns (True, None) iff every replica applied every op only after all its deps."""
    for r, hist in replica_histories.items():
        applied = set()
        for op in hist:
            missing = op.deps - applied
            if missing:
                return False, f"replica {r}: applied {op.op_id} before deps {missing}"
            applied.add(op.op_id)
    return True, None


# Rohan posts, then replies. Reply causally depends on the post.
post = Op('o1', 'rohan', 'write', 'comment:42', 1, deps=set())
reply = Op('o2', 'rohan', 'write', 'comment:43', 2, deps={'o1'})
# An unrelated user posts concurrently — no dependency on Rohan's thread
other = Op('o3', 'kavya', 'write', 'comment:44', 9, deps=set())

# Replica R1 (Mumbai) applies in arrival order, all deps respected
r1 = [post, other, reply]
# Replica R2 (Bengaluru) applies the *reply* before the *post* — VIOLATION
r2 = [reply, post, other]
# Replica R3 (Hyderabad) reorders concurrent ops differently from R1 — STILL OK
r3 = [other, post, reply]

for label, hist in [('R1', r1), ('R2', r2), ('R3', r3)]:
    ok, err = is_causally_consistent({label: hist})
    print(f"{label}: {'OK' if ok else 'FAIL — ' + err}")
Sample output:
R1: OK
R2: FAIL — replica R2: applied o2 before deps {'o1'}
R3: OK
Why R3 is fine but R2 is not: causal consistency does not require every replica to apply operations in the same order. R3 swaps other and post — but other and post are concurrent (other.deps = ∅, post.deps = ∅, neither depends on the other), so the swap is permitted. R2's failure is structural: reply.deps = {o1}, and R2 applied reply (o2) before post (o1), which is a causal violation. The checker is enforcing exactly the partial order from Lamport 1978 — and crucially, it is only enforcing that. The freedom to reorder concurrent ops is the whole point.
This is the line CricStream's gateway must hold. When the Bengaluru replica receives Rohan's reply via the inter-region replication stream, it inspects the explicit deps={'o1'} field. If o1 has not yet arrived, the reply is queued in a pending-dependency buffer keyed on o1. When o1 finally arrives (typically 50–200 ms later, sometimes seconds during cross-region congestion), the buffer is flushed and the reply is applied. Kavya does see the thread out-of-order for a brief window if she opens it before o1 arrives — but she sees it as empty (no comments yet), not as the reply alone. That is the contract.
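A minimal sketch of that apply path, reusing the Op shape from the checker above (try_apply and the buffer layout are illustrative, not CricStream's actual gateway code):
# pending_buffer.py: sketch of a dependency-wait apply path at one replica
applied = set()        # op_ids already applied at this replica
pending = {}           # missing dep op_id -> list of ops buffered on it

def try_apply(op, apply_fn):
    missing = op.deps - applied
    if missing:
        # park the op under one missing dep; it is re-checked when that dep lands
        pending.setdefault(next(iter(missing)), []).append(op)
        return False
    apply_fn(op)
    applied.add(op.op_id)
    # this op's arrival may unblock ops that were buffered waiting on it
    for waiter in pending.pop(op.op_id, []):
        try_apply(waiter, apply_fn)
    return True
A production version also needs the expiry described in the war story below: a TTL on the buffer, so that an op whose dependency was lost in replication is eventually rejected rather than parked forever.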
A war story: CricStream's "reply before original" bug
The CricStream incident from the opening was not new code. It was the absence of code. The platform's comment service stored every comment with a parent_id field, and the gateway team had assumed for years that parent_id implicitly enforced causal ordering — after all, you couldn't see a child without first seeing the parent, right? In practice, the Bengaluru replica's ingest path was three independent worker pools (one per Kafka partition), and Rohan's post and Rohan's reply were keyed differently enough to land on different partitions. The reply's worker pool was lightly loaded; the post's pool was running 800 ms behind, backed up by a noisy-neighbour user spamming a different thread. The reply showed up on Kavya's screen first.
The fix that Sneha (the comment-service tech lead) pushed was a four-line change: every write's payload was extended with an explicit deps array, populated by the front-end from the user's session view. When a comment was rendered to a client, the client recorded the comment's op_id in a session-local "seen" set. When the client posted a reply, the reply's deps field was set to the post-id it was replying to. The Bengaluru replica's apply path was wrapped in a wait_for_deps() guard that blocked the apply until every op in deps was present locally. The pending-dependency buffer had a 30-second TTL — beyond which the buffered op was rejected with a 503 and the client was asked to retry, on the assumption that the dependency had been lost in cross-region replication.
The postmortem from CricStream's reliability lead Vikrant captured the lesson: "For three years we believed the database was giving us causal consistency because the schema had foreign keys. Foreign keys give you referential integrity at apply time — they say a row can't be inserted if its parent doesn't exist. They do not say the parent will arrive first across replicas. That is a property of the replication protocol, not the schema. Two different layers, two different guarantees."
The post-fix p99 latency for cross-region comment apply went from 60 ms (broken-but-fast) to 180 ms (correct, with dependency-wait), but the replicas-disagree-on-thread-order alert dropped from 4–6 fires per cricket final to zero across the next IPL season.
Common confusions
- "Causal consistency means every replica eventually applies operations in the same order." No. Two concurrent writes — operations with no
→relation between them — may be applied in different orders at different replicas, forever. Causal consistency only constrains the ordering of causally related operations. The replicas converge in the sense that every operation eventually arrives everywhere; they do not converge to a single agreed-upon sequence the way a sequentially consistent system does. - "If I use vector clocks I have causal consistency." Vector clocks are a mechanism for tracking happens-before; they are not a consistency model. A system can use vector clocks and still violate causal consistency if it applies an operation before checking its dependencies. Conversely, COPS uses explicit causality (no vector clocks) and is causally consistent. The model is about what ordering is enforced; the mechanism is how you enforce it.
- "Causal consistency composes." It does not, in the same way sequential consistency does not. Mahajan-Alvisi-Dahlin's 2011 Real-Time Causal Consistency paper showed that two causally consistent objects, used together by a client, can produce histories that violate causal consistency on the client's combined view — the cross-object dependencies are invisible to either object's local mechanism. This is why systems like Eiger introduce write-only transactions on top of causal storage: to bind cross-object dependencies that single-object causal tracking would miss.
- "Causal consistency is the strongest model that survives partition." Almost — but the precise statement is real-time causal consistency is the strongest available + convergent model under partition, per the 2011 result. Plain causal consistency (without the real-time clause) is what most systems implement. The distinction matters for systems making formal claims; in practice, COPS, Eiger, Bayou, and modern AntidoteDB implement causal+ (causal consistency + convergence) and call it causal consistency.
- "Read-your-writes is the same as causal consistency." Read-your-writes is a session guarantee — a single client always sees its own prior writes. Causal consistency is much stronger: it preserves causality across every chain of communication, including writes by client A that client B observed and then wrote based on. Read-your-writes is one of four session guarantees; causal consistency strictly implies all four (and more).
Going deeper
Mahajan-Alvisi-Dahlin 2011 — the optimality result
The paper Consistency, Availability, and Convergence (UT Austin TR-11-22, 2011) by Mahajan, Alvisi, and Dahlin proves that real-time causal consistency is the strongest model that is simultaneously available (every non-failing node responds in finite time) and convergent (replicas eventually agree on a state given enough exchange) under network partition. Anything stronger gives up either availability or convergence; anything weaker is dominated. This is the formal reason every "AP-side" system that aims to do better than eventual consistency converges on causal+: it is the ceiling. The proof is not long but is technically dense — read it after you've internalized the CAP theorem and PRAM consistency, both of which appear as comparison points.
COPS and Eiger — explicit causality at scale
Lloyd, Freedman, Kaminsky, and Andersen's 2011 SOSP paper Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS introduced explicit-causality tracking and showed it could handle 100,000+ ops/sec/node with three-datacenter replication — orders of magnitude better than vector-clock-based predecessors. Eiger (Lloyd et al., NSDI 2013) added column-family transactions on top, supporting causally consistent multi-key reads and writes. The metadata-size argument is the killer one: COPS's deps field grows with what the client did in the session (typically 1–10 entries), not with the cluster size. This is the trick that made causal consistency practical for systems with hundreds of replicas where vector clocks would have collapsed.
CRDT meets causal — convergence for free
Conflict-free Replicated Data Types (CRDTs, Shapiro et al. 2011) make convergence automatic for specific data types: G-counters, OR-sets, LWW-registers. When you compose causal consistency with CRDTs, the convergence-after-partition property comes for free — every concurrent operation pair has a deterministic merge function defined by the lattice. Bayou (Terry et al. 1995) was the first system to do this; modern AntidoteDB and Riak use it heavily. The combination is what most practical "causal+" systems actually deliver: causal ordering for happens-before pairs, lattice-merge for concurrent pairs.
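A hedged sketch of the lattice-merge idea for the simplest such type, an LWW-register; the (timestamp, replica_id) tie-break is one common choice, not the only one:
# lww_register.py: a deterministic merge function for concurrent writes
def lww_merge(a, b):
    # a, b: (value, timestamp, replica_id); the higher (timestamp, replica_id) wins
    return a if (a[1], a[2]) >= (b[1], b[2]) else b

w1 = ('six!', 100, 'mumbai')      # two concurrent writes with equal timestamps
w2 = ('four', 100, 'bengaluru')
assert lww_merge(w1, w2) == lww_merge(w2, w1)   # apply order does not matter
Because the merge is commutative, associative, and idempotent, replicas that apply a concurrent pair in different orders still converge to the same value, which is the convergence half of causal+.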
Why MongoDB and DynamoDB are not causally consistent by default
DynamoDB defaults to eventual consistency; the strongly-consistent-read flag gives linearizable reads on a single key, but does not enforce cross-key causal consistency. MongoDB's causally consistent sessions (introduced in v3.6, 2017) are session-scoped only: they enforce the session guarantees, read-your-writes among them, via cluster-time tokens, but do not propagate causality across clients. To get true cross-client causal consistency on either system you must layer it yourself: tag every write with explicit deps, store them in the same row, and on read, fetch the deps and verify they were applied. This is what most "happens-before" tracking in industry actually looks like — built by application teams on top of eventually consistent primitives.
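A hedged sketch of what that layering can look like on a generic key-value client; kv.get, the row shape, and causal_read are hypothetical, not DynamoDB's or MongoDB's API:
# layered_causal.py: read path that checks stored deps before returning a row
def causal_read(kv, key, observed):
    # observed: dict[key -> highest version this client has already seen]
    row = kv.get(key)             # expected shape: {'val', 'version', 'deps': [(k, v), ...]}
    if row is None:
        return None
    for dep_key, dep_version in row['deps']:
        if observed.get(dep_key, 0) >= dep_version:
            continue              # dependency already observed by this client
        dep_row = kv.get(dep_key)
        if dep_row is None or dep_row['version'] < dep_version:
            return None           # dependency not yet visible: hide the row instead
        observed[dep_key] = dep_row['version']
    observed[key] = max(observed.get(key, 0), row['version'])
    return row['val']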
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
python3 causal_check.py
The checker is O(ops × deps) per replica — fine for histories up to 100k ops. For verifying a real production trace, one starting point is Anish Athalye's Porcupine (a Go linearizability checker that takes a user-defined model) or Jepsen's Knossos; neither checks causal consistency out of the box, so either would need a custom checker that tracks happens-before edges recovered from RPC ack-graphs.
Where this leads next
Causal consistency is the third rung of Part 12's lattice: weaker than linearizability and sequential consistency, stronger than session guarantees and eventual consistency. It is the model that every "AP-side" system that wants to do better than eventual lands on, by Mahajan-Alvisi-Dahlin's optimality result.
Beyond the consistency hierarchy, causal consistency feeds into vector clocks (the canonical mechanism), CRDTs as distributed state machines (the convergence partner), and the CAP theorem — which it sits on the AP side of, but only just: the real-time causal refinement is the precise CAP-AP ceiling. Part 13's CRDT chapters operate above causal storage; Part 17's geo-replication chapters discuss how to enforce causal+ across continental round-trip times where latencies make linearizability simply infeasible.
References
- Lamport, L. — "Time, Clocks, and the Ordering of Events in a Distributed System" (CACM, 1978). The defining paper for happens-before.
- Mahajan, P., Alvisi, L., Dahlin, M. — "Consistency, Availability, and Convergence" (UT Austin TR-11-22, 2011). The optimality result for causal+.
- Lloyd, W., Freedman, M., Kaminsky, M., Andersen, D. — "Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS" (SOSP 2011). Explicit causality at scale.
- Lloyd, W., et al. — "Stronger Semantics for Low-Latency Geo-Replicated Storage" (NSDI 2013). Eiger's causal transactions.
- Shapiro, M., Preguiça, N., Baquero, C., Zawirski, M. — "Conflict-Free Replicated Data Types" (SSS 2011). The CRDT framework.
- Terry, D., et al. — "Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System" (SOSP 1995). The first practical causal+ system.
- Bailis, P., Davidson, A., Fekete, A., et al. — "Highly Available Transactions: Virtues and Limitations" (VLDB 2014). The lattice including HAT-causal.
- Sequential consistency — the previous chapter, the rung above causal in the lattice.
- CRDTs as distributed state machines — the convergence mechanism that pairs naturally with causal consistency.