In short
Vector clocks detect that two writes are concurrent; last-write-wins breaks the tie by throwing one away; application-side merging works when the data is set-like but leaves bugs in every custom merge function. CRDTs — conflict-free replicated data types — are data structures engineered so the merge is part of the type itself. Hand two replicas the same operations in any order, and they end up in the same state. By construction.
The zoo. G-Counter — increment only, merge by per-node max. PN-Counter — two G-Counters in opposition, supports decrements. OR-Set — add-wins set with per-add tags and observed-remove tombstones. LWW-Register — last-write-wins baked into the type. Sequence CRDTs (Logoot, RGA, Yjs, Automerge) power Google Docs, Figma, Linear, and Notion.
Two flavours. State-based CRDTs ship the whole state and merge with a least-upper-bound; simple, bandwidth-heavy. Op-based CRDTs ship each operation over reliable causal broadcast; bandwidth-light, delivery-semantics-heavy.
The guarantee is strong eventual consistency — replicas that receive the same set of operations converge, regardless of order, with no coordination. Replicas can be offline for weeks and still converge when they reconnect. The cost is metadata: OR-Set tombstones, causal context, larger-than-naive payloads, and the fact that designing a useful CRDT for a new data type is hard. Production: Redis Enterprise CRDTs, Riak DT, AntidoteDB, Roshi, Automerge, Yjs.
Two users type into the same Google Doc at the same time. There is no "document leader" serialising their keystrokes. The network between them jitters, reorders. And yet after the dust settles, both their characters survive in the same order on every client, with no spinning "resolving conflict" dialog. A sequence CRDT solves this — the data structure does the merging; the application does nothing.
Chapter 82 covered vector clocks and last-write-wins — tools for detecting and resolving concurrent writes. Both leak the problem upward: the application must tolerate lost writes (LWW) or implement a merge function (vector-clock siblings). CRDTs are the next step — bake the merge rule into the type so neither the application nor the coordinator has to think about it.
The CRDT property — strong eventual consistency
Pin down precisely what CRDTs guarantee.
Eventual consistency (the weak sense) says: if writes stop, replicas eventually agree. Useful as an aspiration, useless as a specification. "Eventually" is an existence claim with no bound; "agree" is unqualified. Two Cassandra replicas that gossip forever without making progress technically satisfy it.
Strong eventual consistency (SEC) tightens this to a constructive claim. Two replicas that have received the same set of operations (in any order, with any duplicates) are in the same state. Not "will eventually be"; are. Convergence is a mathematical property of the data structure, not an emergent property of some background process.
The formal recipe. The state space forms a join-semilattice under a partial order ≤. A join-semilattice means every pair of elements has a unique least upper bound (lub), written a ⊔ b. Updates move state upward in the lattice — every update produces a new state s' ≥ s. Merge is lub. Because lub is associative, commutative, and idempotent, replicas that apply the same updates in any order, with any duplication, end up at the same lattice point.
That is the whole mathematical content. Everything else — G-Counter, OR-Set, sequence CRDT — is finding a useful semilattice for a useful data type.
Why associative-commutative-idempotent (ACI) is the magic triple: associative means you can group operations any way you like; commutative means order does not matter; idempotent means duplicate delivery is harmless. Together they mean a replica can receive operations from any peer in any order, including re-receiving the same operation twice, and the final state is uniquely determined by the set of operations — not the sequence. Dropping any one of the three breaks convergence: non-commutative means peers see history in different orders; non-idempotent means at-least-once delivery causes drift; non-associative means the merge depends on grouping. ACI is the coordination-free jackpot.
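The ACI triple is easy to check mechanically for the per-key-max merge used throughout this chapter. A minimal sketch with plain dicts (node names are illustrative):

```python
from functools import reduce
from itertools import permutations

def merge(a, b):
    """Per-key max: associative, commutative, idempotent."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

states = [{"A": 2}, {"B": 1}, {"C": 3}]

# Commutative + associative: every delivery order yields the same state.
results = {tuple(sorted(reduce(merge, p).items())) for p in permutations(states)}
assert len(results) == 1

# Idempotent: re-receiving a state you already merged changes nothing.
once = reduce(merge, states)
assert merge(once, states[0]) == once
```

The final state is a function of the *set* of states received, which is exactly the SEC claim.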
No leader, no quorum, no vector clocks (though some CRDTs use causal context internally). Replicas sync whenever convenient — on a timer, on reconnect, lazily on read — and the moment two have seen the same operations, they match.
G-Counter — the simplest CRDT
Start with a counter that only goes up. Hits on a blog post, messages posted in a group chat, items added to a distributed queue.
Naive attempt. Store a single integer n, increment on each node, merge by... what? max(a, b) loses increments (two nodes both at 5 merge to 5 when the real total is 10). a + b double-counts on repeated merge. Neither works.
The G-Counter fix (grow-only counter). Store a dictionary {node_id: count_at_that_node}. Only the owning node increments its own entry. Merge is per-node max.
- Increment at node N: `state[N] += 1`.
- Value: `sum(state.values())` — the total across all nodes.
- Merge two states: for each key, take the max.
Per-node max is idempotent (max(a, a) = a), commutative (max(a, b) = max(b, a)), and associative (max(max(a, b), c) = max(a, max(b, c))). Each node's counter grows monotonically because only that node writes it. Two replicas that have seen all the same increments have the same per-node counts — hence the same sum.
Walk-through. Three nodes A, B, C, all starting at {A:0, B:0, C:0}. A does two increments → {A:2}, B does one → {B:1}, C does three → {C:3}. Gossip here is a one-way pull: the first-named replica merges the other's state into its own. A pulls from C and merges per-key max → {A:2, B:0, C:3}, value 5. A pulls from B → {A:2, B:1, C:3}, value 6. C pulls from B → {A:0, B:1, C:3}, value 4 — still missing A's contribution until one more round. Eventually all replicas converge to {A:2, B:1, C:3}, value 6.
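The rounds above can be replayed with plain dicts and a per-key-max merge (a sketch; the full class follows later in the chapter):

```python
def merge(a, b):
    # Per-key max over the union of keys.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

A, B, C = {"A": 2}, {"B": 1}, {"C": 3}   # states after the local increments

A = merge(A, C)                # A pulls from C
assert sum(A.values()) == 5
A = merge(A, B)                # A pulls from B
assert sum(A.values()) == 6
C = merge(C, B)                # C pulls from B: still missing A's increments
assert sum(C.values()) == 4
C = merge(C, A)                # one more round
assert sum(C.values()) == 6    # converged
```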
No ordering constraint. C could have merged with B before C ever incremented; A could have merged with C twice; the final state depends only on the set of operations observed, not when they were observed.
PN-Counter — handling decrements
G-Counter only goes up. Likes need unlike; cart items need remove. You cannot allow subtraction in a G-Counter — the monotonicity argument breaks and per-node max merge starts losing decrements.
PN-Counter (positive-negative counter) is two G-Counters in opposition. One tracks increments (P), one tracks decrements (N). Value is sum(P) - sum(N). Merge is the pair of G-Counter merges.
```python
class PNCounter:
    def __init__(self, node_id, nodes):
        self.node_id = node_id
        self.P = {n: 0 for n in nodes}  # increments
        self.N = {n: 0 for n in nodes}  # decrements

    def inc(self): self.P[self.node_id] += 1
    def dec(self): self.N[self.node_id] += 1
    def value(self): return sum(self.P.values()) - sum(self.N.values())

    def merge(self, other):
        for k in set(self.P) | set(other.P):
            self.P[k] = max(self.P.get(k, 0), other.P.get(k, 0))
        for k in set(self.N) | set(other.N):
            self.N[k] = max(self.N.get(k, 0), other.N.get(k, 0))
```
A counter can go negative if decrements outrun increments. Crucially, under arbitrary reordering of inc/dec operations across replicas, the final value is the same — because the count of increments and count of decrements each converge via per-node max, and subtraction is deterministic on converged inputs.
Why split into two counters instead of allowing signed increments: signed per-node counters are not a semilattice under max (max of -3 and +5 is +5, losing the decrement). The trick is to express every mutation as monotonic somewhere — for PN-Counter, monotonically increasing counts of how many times each operation happened. The actual value you care about is a derived quantity. This pattern repeats throughout CRDT design: never let a cell shrink; record history monotonically and derive the answer.
OR-Set — the observed-remove set
Counters are the easy case. Sets are harder. You want add(x), remove(x), and contains(x) on a distributed set that converges without coordination.
Naive attempts fail in instructive ways. A simple union-based set (G-Set) cannot express removes at all. A two-set 2P-Set (tracking adds and removes separately, contains if in adds and not in removes) breaks on re-adds — once you remove x, you can never add it back. An LWW-Set tags every add and remove with a timestamp and picks the latest; under clock skew it silently loses writes, and it is remove-wins on ties, which contradicts user intuition (when you add an item and someone else concurrently removes it, you typically want the add to survive).
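The 2P-Set failure mode is worth seeing concretely. A minimal sketch (two grow-only sets; membership means added and never removed):

```python
class TwoPSet:
    """2P-Set sketch: once an element is removed, it can never come back."""
    def __init__(self):
        self.adds, self.removes = set(), set()
    def add(self, x): self.adds.add(x)
    def remove(self, x): self.removes.add(x)
    def contains(self, x): return x in self.adds and x not in self.removes
    def merge(self, other):
        self.adds |= other.adds
        self.removes |= other.removes

s = TwoPSet()
s.add("apple")
s.remove("apple")
s.add("apple")                   # re-add after remove...
assert not s.contains("apple")   # ...is silently ignored: the remove is forever
```

The remove set has no way to say *which* add it retires, so it retires all of them, past and future. The OR-Set's per-add tags exist precisely to fix this.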
OR-Set (observed-remove set) gives a principled add-wins rule and avoids timestamps entirely. The state is a set of (element, tag) pairs plus a set of tombstones — tags that have been removed. contains(x) is "there exists a pair (x, tag) with tag not in tombstones".
- `add(x)` at node N mints a fresh tag `(N, k)` and inserts `(x, (N, k))`.
- `remove(x)` collects the tags for `x` currently observed at the local replica and adds them to the tombstone set.
- Merge: union of pairs; union of tombstones.
The magic is in "observed". A remove retires exactly the add-instances the remover can see. Concurrent adds on other replicas mint different tags the remove never knew about, so they survive the merge.
Walk-through. Replica 1 adds apple → pairs {(apple, (1,1))}. It replicates to Replica 2; both hold {(apple, (1,1))}. Replica 1 then removes apple — observes (1,1), tombstones {(1,1)}. Meanwhile, before receiving the remove, Replica 2 adds apple again → {(apple, (1,1)), (apple, (2,1))}. On sync: pairs {(apple,(1,1)), (apple,(2,1))}, tombstones {(1,1)}. contains(apple)? The tag (2,1) is not tombstoned → true. Apple survives. The remove retired the specific (1,1) instance; the concurrent (2,1) add it did not know about is untouched. Add-wins on concurrent add-remove.
Why add-wins rather than remove-wins: set-membership semantics in collaborative systems feel more natural when "I added X, someone else didn't see my add and removed X" leaves X in the set. The user who added has new information; the user who removed was working with stale data. The alternative (remove-wins, RW-Set) is a valid CRDT too — it is sometimes chosen for access-control lists where "revocation should be sticky" is the desired policy. The right default depends on the workload, and OR-Set's add-wins is the most common pick for shopping carts, friend lists, and collaborative tag sets.
Python — G-Counter, PN-Counter, OR-Set in 40 lines each
Real, runnable Python. Start with G-Counter.
```python
# crdts/gcounter.py
class GCounter:
    def __init__(self, node_id, state=None):
        self.node_id = node_id
        self.state = dict(state) if state else {}

    def inc(self):
        self.state[self.node_id] = self.state.get(self.node_id, 0) + 1

    def value(self):
        return sum(self.state.values())

    def merge(self, other):
        keys = set(self.state) | set(other.state)
        merged = {k: max(self.state.get(k, 0), other.state.get(k, 0)) for k in keys}
        return GCounter(self.node_id, merged)
```
Ten lines of counter code. The entire merge is a dict comprehension taking per-key max. Idempotent, commutative, associative — convergence guaranteed.
PN-Counter composes two G-Counters.
```python
# crdts/pncounter.py
from .gcounter import GCounter

class PNCounter:
    def __init__(self, node_id, p=None, n=None):
        self.node_id = node_id
        self.p = GCounter(node_id, p)
        self.n = GCounter(node_id, n)

    def inc(self): self.p.inc()
    def dec(self): self.n.inc()
    def value(self): return self.p.value() - self.n.value()

    def merge(self, other):
        merged_p = self.p.merge(other.p)
        merged_n = self.n.merge(other.n)
        return PNCounter(self.node_id, merged_p.state, merged_n.state)
```
Composition is a theme. Once you have G-Counter, you get PN-Counter for free. You can build G-Map (key → G-Counter) the same way, and every value converges because the keys merge as a set and the values as G-Counters.
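A G-Map along those lines might look like the following sketch (the GCounter is repeated so the snippet stands alone; names are illustrative):

```python
class GCounter:
    # Same as crdts/gcounter.py above.
    def __init__(self, node_id, state=None):
        self.node_id = node_id
        self.state = dict(state) if state else {}
    def inc(self):
        self.state[self.node_id] = self.state.get(self.node_id, 0) + 1
    def value(self):
        return sum(self.state.values())
    def merge(self, other):
        keys = set(self.state) | set(other.state)
        return GCounter(self.node_id,
                        {k: max(self.state.get(k, 0), other.state.get(k, 0))
                         for k in keys})

class GMap:
    """key -> GCounter. Keys merge as a grow-only set; values as counters."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.entries = {}
    def inc(self, key):
        self.entries.setdefault(key, GCounter(self.node_id)).inc()
    def value(self, key):
        return self.entries[key].value() if key in self.entries else 0
    def merge(self, other):
        for k, ctr in other.entries.items():
            mine = self.entries.setdefault(k, GCounter(self.node_id))
            self.entries[k] = mine.merge(ctr)

m1, m2 = GMap("A"), GMap("B")
m1.inc("page-views"); m2.inc("page-views"); m2.inc("signups")
m1.merge(m2)
assert m1.value("page-views") == 2 and m1.value("signups") == 1
```

Because both layers merge with ACI operations, the composite is itself ACI: the convergence proof is compositional.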
OR-Set is the first CRDT where the data carries metadata beyond the obvious.
```python
# crdts/orset.py
class ORSet:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0          # local tag sequence
        self.pairs = set()        # (element, (node, seq))
        self.tombstones = set()   # retired tags

    def _tag(self):
        self.counter += 1
        return (self.node_id, self.counter)

    def add(self, element):
        self.pairs.add((element, self._tag()))

    def remove(self, element):
        observed = {tag for (e, tag) in self.pairs if e == element}
        self.tombstones |= observed

    def contains(self, element):
        return any(e == element and tag not in self.tombstones
                   for (e, tag) in self.pairs)

    def merge(self, other):
        merged = ORSet(self.node_id)
        merged.pairs = self.pairs | other.pairs
        merged.tombstones = self.tombstones | other.tombstones
        return merged
```
About 20 lines. Two set unions and a membership predicate — every non-trivial property of OR-Set falls out of that. Run the earlier apple walk-through against this code; it does what the math says.
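Here is that walk-through as a runnable script (the ORSet class is repeated verbatim so the snippet stands alone):

```python
class ORSet:
    # Same as crdts/orset.py above.
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0
        self.pairs = set()
        self.tombstones = set()
    def _tag(self):
        self.counter += 1
        return (self.node_id, self.counter)
    def add(self, element):
        self.pairs.add((element, self._tag()))
    def remove(self, element):
        self.tombstones |= {tag for (e, tag) in self.pairs if e == element}
    def contains(self, element):
        return any(e == element and tag not in self.tombstones
                   for (e, tag) in self.pairs)
    def merge(self, other):
        merged = ORSet(self.node_id)
        merged.pairs = self.pairs | other.pairs
        merged.tombstones = self.tombstones | other.tombstones
        return merged

r1, r2 = ORSet(1), ORSet(2)
r1.add("apple")                 # mints (apple, (1, 1))
r2 = r2.merge(r1)               # replicate to Replica 2
r1.remove("apple")              # tombstones {(1, 1)}
r2.add("apple")                 # concurrent re-add mints (apple, (2, 1))
final = r1.merge(r2)
assert final.contains("apple")  # add-wins: tag (2, 1) was never observed
```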
State-based vs op-based CRDTs
Two dialects. Both converge; the difference is what gets shipped on the wire.
State-based CRDTs (CvRDTs) ship the entire state on reconciliation. A replica computes local.merge(remote) and stores the result; merge is the lub of the semilattice. This is what the code above implements. Simple to reason about, idempotent under redelivery (merging the same state twice is harmless). The cost is bandwidth — a G-Counter with 10000 nodes ships a 10000-entry dict on every sync. Production CvRDTs (Riak DT, AntidoteDB) optimise with delta-state CRDTs — ship only the bits that have changed since last sync.
Op-based CRDTs (CmRDTs) ship each individual operation to every replica. Operations are commutative, so replicas converge by applying them locally. Bandwidth-light: one add is one message. The catch is delivery semantics: operations must be delivered to every replica exactly once in an order respecting causality — a reliable causal broadcast primitive, non-trivial to build.
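A toy op-based counter illustrates the delivery burden: the operations themselves are tiny, but each replica must deduplicate redeliveries. Here per-operation ids stand in for the exactly-once guarantee a real causal broadcast layer would provide (a sketch, not a production design):

```python
import uuid

class OpCounter:
    """Toy CmRDT counter: ships ('inc', op_id) messages instead of state."""
    def __init__(self):
        self.seen = set()   # op-ids already applied (dedup gives idempotence)
        self.value = 0

    def local_inc(self):
        op = ("inc", uuid.uuid4().hex)   # in a real system: broadcast this op
        self.apply(op)
        return op

    def apply(self, op):
        kind, op_id = op
        if op_id in self.seen:
            return                       # at-least-once delivery is now safe
        self.seen.add(op_id)
        if kind == "inc":
            self.value += 1

a, b = OpCounter(), OpCounter()
op = a.local_inc()
b.apply(op)
b.apply(op)                   # duplicate delivery: ignored
assert a.value == b.value == 1
```

Note what moved: the state-based version got idempotence for free from max; here it is bought with a `seen` set, which is its own growing metadata.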
Which to choose. State-based for gossip over intermittent connectivity. Op-based when you have a reliable broadcast bus and care about bandwidth. Yjs and Automerge are op-based; Riak DT is state-based; AntidoteDB supports both. A theoretical result: the two are equi-expressive — any op-based CRDT can be simulated by a state-based one, and vice versa with stronger delivery assumptions. The choice is engineering, not power.
Sequence CRDTs — collaborative editing
The hardest class of CRDT, and the one that powers Google Docs, Figma, Linear, Notion, Miro, and every serious collaborative-editing product.
Problem. Two users typing into a shared document. Each keystroke is an insert or delete at some position. Replicas must converge to the same character sequence regardless of operation order. The naive "each character has an integer index" fails immediately — indices shift under concurrent inserts, and two replicas disagree on what position 5 means.
The idea across all sequence CRDTs is to give every character a globally-ordered identifier that does not depend on surrounding state. Inserts pick an ID that sorts between the ID of the left neighbour and the ID of the right neighbour. Different algorithms pick different identifier schemes: Logoot uses rational-number-like sequences, RGA records each insert's left-neighbour ID plus a timestamp, WOOT links characters to their neighbours at insert time, Yjs is a modern heavily-engineered descendant of RGA, and Automerge is an op-based CRDT with JSON-shaped documents. What they share is an ID scheme that makes "insert between X and Y" commutative; what they differ on is identifier size, delete semantics, and tombstone GC. Yjs dominates production front-ends; Automerge dominates the local-first library space.
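The identifier idea can be sketched with fractional positions (a toy, Logoot-flavoured scheme; real systems use digit sequences plus node ids to break ties and avoid exhausting float precision):

```python
import bisect

def insert_between(doc, left, right, char):
    """Assign char a position strictly between its neighbours; doc stays sorted."""
    bisect.insort(doc, ((left + right) / 2, char))

doc = [(0.25, "a"), (0.75, "c")]          # characters carry positions, not indices
insert_between(doc, 0.25, 0.75, "b")      # one replica inserts between a and c
insert_between(doc, 0.0, 0.25, "_")       # another concurrently prepends
assert "".join(ch for _, ch in doc) == "_abc"
```

Because each insert's position is computed from stable neighbour identifiers rather than an array index, the two inserts commute: applying them in either order yields the same document.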
Collaborative cart across three replicas
Alice, Bob, and Carol share a shopping cart backed by an OR-Set. Three replicas A, B, C — one per user device — run in three cities on a flaky mobile network. Each replica accepts writes locally.
- Alice on A adds `milk`. A mints tag `a1`. Pairs `{(milk, a1)}`, tombstones `{}`.
- Bob on B (has not yet received Alice's write because the network between A and B is stalled) adds `bread`. B mints tag `b1`. Pairs `{(bread, b1)}`, tombstones `{}`.
- Carol on C pulls from A first, sees `(milk, a1)`, then decides they already have milk at home and issues `remove(milk)`. C observes tag `a1` and adds it to tombstones. Pairs `{(milk, a1)}`, tombstones `{a1}`.
- Network heals; A, B, C gossip pairwise. Each computes pair-union and tombstone-union.
- Converged state everywhere: pairs `{(milk, a1), (bread, b1)}`, tombstones `{a1}`. `contains(milk) = false` (a1 tombstoned), `contains(bread) = true`. Final cart: `{bread}`.
No coordinator. No serialiser. Every operation applied locally and replicated asynchronously. The merge is set.union twice. Had Bob concurrently added milk on B while Carol was removing it on C, Bob's add would mint a distinct tag b2 that Carol's remove never observed — milk would survive, add-wins style.
CRDTs vs traditional concurrency control
Traditional databases achieve consistency via serialisability — every concurrent execution is equivalent to some serial order. The cost is that writes block or abort during conflicts; the system stops progress to preserve invariants.
CRDTs trade in the opposite direction. No conflicts are ever possible — every pair of operations commutes by construction — so replicas never need to stop progress. The cost is expressivity. You cannot implement "transfer 100 rupees from A to B atomically" with a CRDT, because the invariant "money is conserved" is a cross-key constraint that no local-only data structure can enforce. A PN-Counter will happily go negative under concurrent sales.
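The negative-inventory failure is easy to reproduce: two replicas each observe stock 1 and both sell the last unit. A sketch using the dict representation from earlier:

```python
def merge(a, b):
    # Per-key max, as in the G-Counter.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

P = {"A": 1}                         # one unit stocked, recorded at node A
N_at_A, N_at_B = {"A": 1}, {"B": 1}  # concurrent sales on replicas A and B

N = merge(N_at_A, N_at_B)
value = sum(P.values()) - sum(N.values())
assert value == -1                   # converged, consistent, and oversold
```

Convergence held perfectly; the invariant "inventory ≥ 0" did not, because enforcing it would require the two sales to coordinate.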
Rule of thumb. If your data type admits a commutative merge that preserves the invariants you care about — shopping carts, presence lists, like counters, collaborative documents, configuration registers — CRDTs are excellent. If your invariants are cross-key or inequality constraints (balance ≥ 0; inventory ≥ 0; one owner at a time), CRDTs alone cannot enforce them. You need coordination — at the CRDT level (escrow, reservations) or stepping outside CRDTs for those specific operations. CRDTs are a specific-workload tool, not a consistency-model replacement.
The tombstone problem
Every remove on an OR-Set leaves a tombstone, and tombstones never vanish on their own, because a replica that was offline might return tomorrow with a delayed add for the removed tag. The tombstone is the only way to reject it. Over weeks, tombstones accumulate; a cart that has seen thousands of add/remove cycles carries thousands of tombstones. Storage grows linearly with operation history, not with current-set size.
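The growth is linear in churn, not in live data. A sketch with the OR-Set representation reduced to plain sets (single writer "A" for simplicity):

```python
pairs, tombstones = set(), set()
seq = 0

def add(x):
    global seq
    seq += 1
    pairs.add((x, ("A", seq)))

def remove(x):
    tombstones.update(tag for (e, tag) in pairs
                      if e == x and tag not in tombstones)

for _ in range(1000):              # 1000 add/remove cycles of one element
    add("milk")
    remove("milk")

live = {e for (e, tag) in pairs if tag not in tombstones}
assert live == set()               # the current set is empty...
assert len(tombstones) == 1000     # ...but state carries 1000 tombstones
```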
GC is genuinely hard. The invariant you need is "no replica anywhere holds an unapplied operation referencing this tombstone's tag". Establishing that requires causally complete delivery — every replica has caught up past a given point. That is a coordination primitive. CRDTs are coordination-free for updates but cannot escape it for GC.
Production heuristics: time-based (tombstones older than 7-30 days deleted; Riak), causal barriers (periodic protocols confirming every replica has seen operations up to time T; AntidoteDB), vector-clock pruning (GC tombstones dominated by the min-peer frontier; Yjs), structural GC (coalesce adjacent tombstones in sequence CRDTs; Automerge). The elegant math says "just take unions"; the engineering reality says "please don't, the disk is full".
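A time-based heuristic along the lines of Riak's might look like the following sketch. It assumes tombstones are stored with a wall-clock timestamp (a representation change from the code above), and its safety rests entirely on the operator's promise that no replica lags longer than the window:

```python
import time

GC_WINDOW = 14 * 24 * 3600   # assumption: no replica is offline beyond 14 days

def gc(tombstones, now=None):
    """Drop tombstones older than the window; unsafe if a replica lags longer."""
    now = now if now is not None else time.time()
    return {(tag, ts) for (tag, ts) in tombstones if now - ts < GC_WINDOW}

now = time.time()
tombstones = {(("A", 1), now - 30 * 24 * 3600),   # 30 days old: collectable
              (("B", 2), now - 3600)}             # 1 hour old: keep
assert gc(tombstones, now) == {(("B", 2), now - 3600)}
```

If a replica does return after the window with a delayed add for a collected tag, the deleted element resurrects. That is the trade production systems knowingly make.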
A minimal in-memory CRDT sync loop
Stitch the pieces together. Two nodes running in-memory OR-Sets synchronising over a trivial function call:
```python
# crdts/sync.py
from .orset import ORSet

def sync(nodes):
    """Pairwise state-based merge: every node learns from every other."""
    # Snapshot each node first so all merges in this round see pre-round states.
    snapshots = [(n, n.merge(n)) for n in nodes]  # self-merge copies the state
    for a in nodes:
        for (b_node, b_snapshot) in snapshots:
            if a is not b_node:
                merged = a.merge(b_snapshot)
                a.pairs = merged.pairs
                a.tombstones = merged.tombstones

# Demo
a, b, c = ORSet("A"), ORSet("B"), ORSet("C")
a.add("milk")       # tag (A, 1)
b.add("bread")      # tag (B, 1)
sync([a, c])        # C learns about milk
c.remove("milk")    # tombstone (A, 1)
sync([a, b, c])     # everyone converges
assert not a.contains("milk")
assert a.contains("bread")
assert a.pairs == b.pairs == c.pairs
```
Thirty-ish lines including the demo. Production CRDT engines add delta-compression, network transport, persistence, and tombstone GC; the core is the merge methods you already have.
Common confusions
- "CRDTs are just eventual consistency." CRDTs are strong eventual consistency — convergence is a property of the data structure, not a hope about the network. Regular eventual consistency says replicas will agree; CRDTs say replicas that have received the same operations are equal.
- "Any data structure can be made a CRDT." False. The data type must support a commutative, associative, idempotent merge. Types with global invariants (balance, inventory, uniqueness) resist. You can sometimes carve out a CRDT subset but cannot CRDT-ify arbitrary semantics.
- "CRDTs replace vector clocks." Complementary, not replacement. Vector clocks detect concurrency; CRDTs resolve concurrency without application code. Some CRDTs carry vector-clock-shaped causal context internally.
- "CRDTs are slow." In-memory operations are dict-lookup fast. The "slow" reputation is metadata overhead — tombstones bloat state, state-based sync ships full state, sequence CRDT identifiers grow with edit history. Space and bandwidth, not CPU.
- "CRDTs eliminate coordination entirely." Updates are coordination-free; membership changes and tombstone GC still require coordination. Nothing escapes it entirely — it moves from the hot path (writes) to the cold path (cluster management).
- "OR-Set add-wins is always right." Workload-dependent. Access-control lists often want remove-wins (revocation should stick); the matching CRDT is RW-Set. Pick the one whose conflict-resolution matches user expectations.
Going deeper
The foundational papers
Shapiro, Preguiça, Baquero, Zawirski, A comprehensive study of Convergent and Commutative Replicated Data Types (INRIA TR 7506, 2011) is the canonical reference. It formalises CvRDTs (state-based) and CmRDTs (op-based), proves the semilattice convergence theorem, and catalogues G-Counter, PN-Counter, G-Set, 2P-Set, OR-Set, LWW-Element-Set, MV-Register, and more. The SSS 2011 paper Conflict-free Replicated Data Types by the same authors is the peer-reviewed short version.
Production engines
- Yjs — op-based sequence CRDT library, JavaScript-first, powers Notion's block editor, Linear's titles, and many collaborative front-ends.
- Automerge — Martin Kleppmann's JSON-shaped CRDT with a compact binary format and full change history. Anchors Ink & Switch's local-first software projects.
- Riak DT — Basho's embedded CRDT library, the first production use of formalised CRDTs (Riak 2.0, 2014). Counter, set, map, register types.
- AntidoteDB — research geo-replicated database implementing CRDTs at cluster scale, by some of the Shapiro et al. authors.
Delta CRDTs and pure op-based CRDTs
Full-state shipping is bandwidth-heavy. Delta CRDTs (Almeida, Shoker, Baquero, 2016) decompose state-based CRDTs so a replica ships only the delta since last sync; equivalent correctness, vastly less bandwidth. Pure op-based CRDTs (Baquero, Almeida, Shoker, ECOOP 2014) strip the op-based model down to operations plus a causal context and make tombstone GC principled via causal stability.
Where this leads next
CRDTs look like the quiet miracle: no coordination, no conflicts, convergence by construction. Chapter 84 looks at where that magic runs out — the wall: eventual consistency leaks. Read-your-writes violations under naive eventual consistency, session-guarantee layering, why "eventual" is often not enough for user-facing features, and the engineering patterns (sticky sessions, client-side version stamps, causal consistency) that paper over the gap when SEC alone is not sufficient.
After ch.84, Build 10 closes on the consistency topic and Build 11 opens on wide-column stores — Cassandra and HBase at scale — where many of the techniques you have seen across Builds 9 and 10 (leaderless replication, gossip, CRDTs, tunable quorums) meet a concrete production architecture.
References
- Shapiro, Preguiça, Baquero, Zawirski, A Comprehensive Study of Convergent and Commutative Replicated Data Types, INRIA Technical Report 7506, 2011 — the foundational reference. Formal definitions of CvRDT and CmRDT, the semilattice convergence theorem, and an encyclopedic catalogue of CRDT variants with proofs of convergence.
- Preguiça, Baquero, Shapiro, Conflict-free Replicated Data Types, Encyclopedia of Big Data Technologies, 2018 — a shorter pedagogical survey by the same authors, with production experience from Riak DT and AntidoteDB, and a frank discussion of where CRDTs fit versus where they don't.
- Kevin Jahns, Yjs Documentation and Internals — the most detailed public description of a production-grade sequence CRDT. Covers the identifier scheme, delta compression, WebSocket and WebRTC providers, and performance engineering that makes collaborative editing viable at Google-Docs scale.
- Kleppmann et al., Automerge — a library for building local-first software — the pragmatic JSON-shaped CRDT library with a binary change format, full history, and a design paper at local-first.dev explaining the broader software architecture.
- Roh, Jeon, Kim, Lee, Replicated Abstract Data Types: Building Blocks for Collaborative Applications, Journal of Parallel and Distributed Computing, 2011 — the companion foundational paper to Shapiro et al., with emphasis on collaborative-application primitives and a careful treatment of sequence CRDTs (RGA).
- Kleppmann, Designing Data-Intensive Applications, Chapter 5 — Replication, O'Reilly 2017 — the clearest pedagogical treatment of CRDTs in the DDIA tradition: intuition, worked examples, and honest discussion of tombstone GC and the trade-offs versus application-side merge functions.