Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Deltas and optimizations

Karthik at MealRush is debugging why the runbook editor — a Yjs-style CRDT layered over a Riak-style key-value store — is burning 380 GB of egress per day across thirty regional replicas, for a workload that visibly produces maybe 8 MB of edits a day. He pulls a packet capture. Every two seconds, every replica sends every other replica its entire OR-Set state for the merge. A 4 MB document goes out 29 times every two seconds, from each of 30 replicas. The arithmetic is damning: 4 MB × 29 peers × 30 replicas is roughly 3.5 GB of state on the wire every two-second round, to propagate at most a few kilobytes of new edits. The CRDT theory said "merges are commutative, just join the lattice and you converge." Nobody told the theory it would have to ship 4 MB on the wire to add a comma to a paragraph. The fix is not a different CRDT. It is a different anti-entropy strategy: ship deltas, not full state, and engineer five further optimizations on top so the join is cheap, the network footprint is bounded, and tombstones do not grow forever. This chapter is the engineering distance between "CRDTs converge in theory" and "CRDTs run cheaply in production".

A naive state-based CRDT ships its entire join-semilattice value on every anti-entropy round — O(state) per gossip. Delta-state CRDTs ship only the join-irreducible slices that changed since the recipient's last seen version, dropping cost to O(edits since last contact). Five further optimizations matter: causal-context compression, tombstone GC bounded by the causal stable cut, run-length-encoded binary wire format, vector-clock pruning by membership churn, and lazy materialisation. Together they take production CRDTs from "ruinously expensive" to "shippable on a 4G connection".

Why naive state-based gossip is quadratic in history

The original Shapiro et al. CRDT paper defines a state-based CRDT as a triple (state, query, merge) where merge is a commutative, associative, idempotent join over a semilattice. Anti-entropy says: every replica periodically picks a peer and sends its state; the peer joins it with its own. Convergence is automatic.

The hidden cost is in "sends its state". For a G-Counter with N replicas, state is an N-vector of integers — small. For an OR-Set with M elements ever inserted, state includes M element-tag pairs plus their causal contexts — linear in history, not in the live set. For a sequence CRDT (RGA, YATA — see /wiki/sequence-crdts-rga-logoot-yata) over a 5,000-character document with 18 months of edits, state can be 8× the live document size because of tombstones. Gossiping all of that, every round, is what burned MealRush's egress.

Why this is quadratic in history when you sum it up across a cluster: anti-entropy is O(state) per pair per round, and state grows monotonically with the operations applied. Across N replicas gossiping at frequency f, bytes-on-wire per second is O(N × f × state), and state itself is O(operations). If operations arrive at a roughly steady rate, state at time t is O(t), so cumulative egress by time t is the sum of O(t) over every round so far — O(t²), quadratic in history. A document edited a thousand times will be gossiped at its full thousand-edit weight for the rest of its life unless something prunes that history.

The cure is to recognise that most of the state did not change between two gossip rounds. If replica A last heard from replica B at version vector v_AB, then to bring B up to date A only needs to ship the slice of state with version > v_AB. That slice is a delta.

Full-state versus delta-state anti-entropy. A timeline shows three replicas A, B, C. Under full-state gossip, every round each replica ships its entire state to a peer, drawn as a thick bar whose size is proportional to total operations to date. Under delta-state gossip, each replica ships only the slice of state added since the last sync with that specific peer, drawn as a thin bar proportional to recent operations. The diagram is illustrative: full-state gossip ships history; delta gossip ships only the new slice.

Full-state anti-entropy (Riak 1.x default): round 1: 4 MB, round 2: 4.1 MB, round 3: 4.2 MB, round 4: 4.3 MB — the full state every time.
Delta-state anti-entropy (modern CRDTs): ~30 KB, ~80 KB, ~50 KB, ~20 KB (idle period) — just the new edits.
Saving on a 30-replica MealRush cluster (4 MB doc, gossip every 2 s):
full-state: 4 MB × 29 peers × 30 replicas / 2 s ≈ 1.7 GB/s ≈ 6.3 TB/hr cluster-wide
delta-state: ~50 KB × the same fanout ≈ 22 MB/s ≈ 78 GB/hr cluster-wide — roughly 80× cheaper
The bar widths track state size, not gossip frequency. Full-state grows with history; delta-state grows with recent activity. The 80× reduction is typical for documents older than a few hours.

Delta-state CRDTs — the formalism

Almeida, Shoker, and Baquero introduced delta-state CRDTs in their 2018 paper Delta state replicated data types. The core idea: factor every state-CRDT operation into a delta-mutator that returns a small delta-state, and a join that absorbs deltas exactly the same way it absorbs full states.

For an OR-Set, the delta of inserting element e with tag t is the singleton state {(e, t)} — not the full set. The delta of removing all observations of e is an empty element-set paired with just the tags of e that the remove observed — a handful of dots, not the document's whole causal history. Crucially, delta-states are ordinary elements of the same join-semilattice as full states — the full state is simply the join of every delta ever produced — so the same join function absorbs either, and the CRDT properties (commutativity, associativity, idempotence) survive.

The wire-protocol pattern that falls out:

  1. Replica A maintains delta_buffer — the join of all deltas it has produced since its last sync with each peer.
  2. Pairwise sync: A sends delta_buffer_for_B to B. B joins it into its state. B sends back its delta_buffer_for_A.
  3. Acknowledgment: B tells A "I've absorbed up to delta-id X". A then drops delta_buffer_for_B entries with id ≤ X.

The delta buffer is bounded by "operations since last successful sync with the slowest peer". For healthy clusters, that's seconds of operations — a fraction of a percent of full state.

Why deltas remain a CRDT under this scheme: each delta is itself a valid state in the underlying CRDT's lattice — it just happens to be a tiny one (a single insertion, a single counter increment). The join function does not know or care whether its input is "the whole state" or "a slice of it"; commutativity-associativity-idempotence are properties of the join, not of the input's size. Apply the deltas in any order, any number of times, with any redundancy, and the recipient's state ends at the same place. This is the entire point of building CRDTs over join-semilattices in the first place — the lattice structure makes deltas free.
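
To make the factoring concrete, here is a minimal delta-state G-Counter — a sketch with illustrative names, not any particular library's API. The increment mutator returns a one-entry delta; the same pointwise-max join absorbs full states and deltas alike, so duplicated or reordered deliveries are harmless.

# Minimal delta-state G-Counter (illustrative sketch, not a library API)
class DeltaGCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}         # replica -> highest count seen
        self.delta_buffer: dict[str, int] = {}   # deltas produced since the last ack

    def increment(self) -> dict[str, int]:
        delta = {self.replica_id: self.counts.get(self.replica_id, 0) + 1}
        self.join(delta)                                           # apply locally
        self.delta_buffer = self._lub(self.delta_buffer, delta)    # buffer for the next gossip round
        return delta

    def join(self, other: dict[str, int]) -> None:
        # One join for full states and deltas: pointwise max is commutative,
        # associative, and idempotent, so the size of the input never matters.
        self.counts = self._lub(self.counts, other)

    @staticmethod
    def _lub(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
        return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

    def value(self) -> int:
        return sum(self.counts.values())

a, b = DeltaGCounter("A"), DeltaGCounter("B")
deltas = [a.increment() for _ in range(3)] + [b.increment() for _ in range(2)]
for d in deltas + deltas[::-1]:       # deliver every delta twice, the second pass in reverse order
    a.join(d); b.join(d)
print(a.value(), b.value())           # both converge to 5

The delta_buffer here is what step 1 of the protocol above maintains; a real implementation keeps one buffer per peer and trims it on acknowledgment (step 3).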

Optimization 2 — causal-context compression

A delta of "remove element e" carries a causal context: the set of (replica, version) tags of e-insertions that this remove observed. Naively that context is a full version vector — N entries for N replicas. For 1,000 replicas that's 16 KB per remove-delta, which dominates the actual payload.

The fix is causal-context compression: track each replica's contiguous run of versions as a single (replica, run-end) pair, and only list out non-contiguous gaps. After healthy operation a replica has seen [1..427] from replica A, so it stores (A, 427), one entry. If a few in the middle are missing, store (A, [1..420, 425..427]) — still compact.

The Akka Distributed Data implementation uses this exact representation, calling it a dotted version vector. A 1,000-replica cluster typically compresses to under 200 bytes of causal context per delta because most replicas have contiguous prefixes.

Causal context as a dotted version vector. A flat version vector lists every (replica, version) pair seen — six replicas A through F, each with versions ranging from 1 to several hundred. The dotted version vector replaces each replica's contiguous prefix with a single integer, then lists only non-contiguous gaps as a small set. The flat representation occupies hundreds of integers; the dotted representation occupies six integers plus a small gap list. The diagram is illustrative.

Flat: enumerate every (replica, v)
{(A,1),(A,2),(A,3),...,(A,427), (B,1),(B,2),...,(B,389), (C,1),(C,2),...,(C,612), (D,1)..(D,98),(D,100)..(D,201), (E,1)..(E,55), (F,1)..(F,3)}
≈ 1,800 entries × 8 bytes ≈ 14 KB on the wire

Dotted: contiguous prefix + gaps
A: 427 (no gaps); B: 389 (no gaps); C: 612 (no gaps); D: 201, gap {99}; E: 55; F: 3
≈ 6 entries + 1 gap ≈ 90 bytes on the wire — 150× smaller
The dotted version vector is just a flat version vector plus the observation that healthy replicas accumulate contiguous prefixes, so only the discontinuities need to be spelled out. A 14 KB context shrinks to ~90 bytes for a typical cluster.
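
A sketch of the compressed representation, under the same assumptions as the diagram (illustrative names, not Akka's or Riak's types): keep one contiguous-prefix counter per replica plus a small set of out-of-order dots, and collapse the dots into the prefix whenever a gap fills.

# Compressed causal context: contiguous prefix per replica + out-of-order dots (sketch)
from collections import defaultdict

class CausalContext:
    def __init__(self):
        self.prefix: dict[str, int] = defaultdict(int)   # replica -> end of contiguous prefix
        self.dots: set[tuple[str, int]] = set()          # versions seen beyond the prefix

    def contains(self, replica: str, version: int) -> bool:
        return version <= self.prefix[replica] or (replica, version) in self.dots

    def add(self, replica: str, version: int) -> None:
        if version == self.prefix[replica] + 1:
            self.prefix[replica] = version
            self._compact(replica)            # absorb any dots that now touch the prefix
        elif version > self.prefix[replica]:
            self.dots.add((replica, version))

    def _compact(self, replica: str) -> None:
        while (replica, self.prefix[replica] + 1) in self.dots:
            self.dots.remove((replica, self.prefix[replica] + 1))
            self.prefix[replica] += 1

ctx = CausalContext()
for v in (1, 2, 3, 5, 6):                     # version 4 missing: one gap
    ctx.add("D", v)
print(ctx.prefix["D"], sorted(ctx.dots))      # 3 [('D', 5), ('D', 6)]
ctx.add("D", 4)                               # the gap fills and the dots collapse
print(ctx.prefix["D"], sorted(ctx.dots))      # 6 []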

Optimization 3 — tombstone GC bounded by the causal stable cut

OR-Set removes leave tombstones — the (element, tag) pairs whose tags were observed and cancelled. A document with 18 months of edits at MealRush had four tombstones per live element. The space cost is real, and the join cost is worse: every join walks all tombstones to check for new cancellations.

The collection rule: a tombstone is safe to drop when every replica is known to have observed both the original insertion and the cancellation. Define the causal stable cut as, for each origin replica r, the minimum across all peers of the highest version of r's operations they have acknowledged. A tag (r, v) is past the stable cut once v ≤ stable_cut[r].

Implementing this requires an explicit acknowledgment protocol on top of anti-entropy. Yjs ships a deferred GC that runs in the background; Riak's CRDT layer requires application-level cooperation; Akka does it transparently. The cost is a few hundred bytes of metadata per replica per gossip round; the saving is unbounded — the document that had grown 8× from tombstones shrinks back to 1.1×.

# Tombstone GC under a causal stable cut (simplified)
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class ORSet:
    # element -> set of (replica, version) tags
    live: dict[str, set[tuple[str, int]]] = field(default_factory=lambda: defaultdict(set))
    # tag -> True (this tag has been cancelled by a remove)
    tombstones: set[tuple[str, int]] = field(default_factory=set)
    # peer -> {origin replica: highest version of that replica's ops the peer has acknowledged}
    ack: dict[str, dict[str, int]] = field(default_factory=dict)

    def add(self, replica: str, version: int, element: str):
        self.live[element].add((replica, version))

    def remove(self, element: str):
        for tag in list(self.live.get(element, ())):
            self.tombstones.add(tag)
        self.live.pop(element, None)

    def stable_cut(self, peers: list[str]) -> dict[str, int]:
        # For each origin replica r: the minimum, across all peers, of the highest version
        # of r's ops that the peer has acknowledged — i.e. the version up to which every
        # peer is guaranteed to have applied r's operations.
        replicas = {r for tags in self.live.values() for (r, _) in tags}
        replicas |= {r for (r, _) in self.tombstones}
        return {r: min(self.ack.get(p, {}).get(r, 0) for p in peers) for r in replicas}

    def gc(self, peers: list[str]):
        cut = self.stable_cut(peers)
        before = len(self.tombstones)
        self.tombstones = {(r, v) for (r, v) in self.tombstones
                           if v > cut.get(r, 0)}
        return before - len(self.tombstones)

# Demo
s = ORSet()
s.add("A", 1, "alice"); s.add("A", 2, "bob"); s.add("B", 1, "carol")
s.remove("alice"); s.remove("bob")
print("tombstones before GC:", len(s.tombstones))
s.ack = {"A": {"A": 5, "B": 3}, "B": {"A": 5, "B": 3}}  # both peers have ack'd up to A:5, B:3
collected = s.gc(peers=["A", "B"])
print(f"tombstones GC'd: {collected}")
print(f"tombstones after GC: {len(s.tombstones)}")

Output:

tombstones before GC: 2
tombstones GC'd: 2
tombstones after GC: 0

Both tombstones are dropped because every peer's acknowledgment is past the cancellations' versions. Why this is safe: a tombstone exists to suppress a stale add — a replica that has not yet observed the remove could otherwise merge the live (element, tag) pair back in. Once every replica has acknowledged both the insertion and its cancellation, no replica still carries that pair as live, so no future merge can resurrect it; the tombstone is informationless once that condition holds. The hard part is computing the cut accurately when membership changes (a new replica joins; do we wait for it before GC'ing?). The pragmatic answer is "track stable membership separately and rebuild the cut when it changes" — Akka does this with its cluster-state subscription.

Optimization 4 — binary, run-length-encoded wire format

JSON encodings of CRDT updates are several times larger on the wire than they need to be. Yjs ships a custom binary format that uses:

  • Variable-length integers (LEB128) for clock values that are usually small.
  • Run-length encoding for sequences of inserts by the same replica with consecutive clocks — a contiguous typed run becomes (replica, start_clock, count, payload) instead of count separate (replica, clock, char) records.
  • String interning — repeated UUIDs and replica IDs stored once, referenced by short index.

Together they cut wire size 5–8× over a JSON encoding of the same logical update. CricStream's collaborative-incident-runbook tool measured 28 KB JSON updates compress to 4.2 KB binary — the same payload, the same logical operations.
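
A toy illustration of the first two tricks — LEB128 varints plus run-length encoding of a contiguous insert run, with the replica ID interned once. This is not Yjs's actual wire format, just a sketch of the principle; a single long run from one replica is the best case, so the ratio it prints is higher than the 5–8× typical of real, mixed updates.

# Toy varint + run-length encoding of one contiguous insert run (not Yjs's real format)
import json

def leb128(n: int) -> bytes:
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

replica = "replica-9f3a"    # hypothetical replica ID
ops = [(replica, clock, ch) for clock, ch in enumerate("the quick brown fox", start=1000)]

as_json = json.dumps([{"replica": r, "clock": c, "char": ch} for r, c, ch in ops]).encode()

payload = "".join(ch for _, _, ch in ops).encode()
as_binary = (leb128(len(replica)) + replica.encode() +    # interned replica ID, stored once
             leb128(ops[0][1]) + leb128(len(ops)) +       # start clock + run length
             leb128(len(payload)) + payload)              # the characters themselves

print(f"JSON: {len(as_json)} bytes, RLE binary: {len(as_binary)} bytes, "
      f"{len(as_json) / len(as_binary):.0f}× smaller")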

Optimization 5 — vector clock pruning by membership churn

In long-running clusters with replica churn (a Kubernetes pod replaces its predecessor every few hours), version vectors grow with every replica that has ever participated. After six months a 30-replica cluster has 4,000 entries in its vector clocks, only 30 of which point to live members.

The fix: track membership lifecycle. When a replica is decommissioned (cleanly leaves the cluster), broadcast that fact. Once every live replica has acknowledged the decommission and reached the stable cut for the dead replica's clock, drop that entry from all vector clocks. New replicas get fresh IDs; old IDs are reusable after sufficient causal time has passed.

This is the exact reason production CRDT systems insist on persistent replica identities rather than pod-lifetime IDs. PaySetu's deployment originally used pod UUIDs; their CRDT state grew unboundedly until they switched to a stable per-shard identity that survives pod restarts.
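
A sketch of the pruning rule under the assumptions above (names are illustrative): a departed replica's vector-clock entry can be dropped once every live member has acknowledged that replica's final version.

# Prune a decommissioned replica's entry from a version vector (illustrative sketch)
def prune(vv: dict[str, int],
          departed: dict[str, int],           # dead replica -> its final version
          acks: dict[str, dict[str, int]],    # live member -> versions it has ack'd, per origin replica
          live_members: set[str]) -> dict[str, int]:
    pruned = dict(vv)
    for dead, final in departed.items():
        # Safe only when nobody can still deliver an op tagged with `dead`.
        if all(acks.get(m, {}).get(dead, 0) >= final for m in live_members):
            pruned.pop(dead, None)
    return pruned

vv = {"shard-1": 900, "shard-2": 412, "pod-old-7": 88}
acks = {"shard-1": {"pod-old-7": 88}, "shard-2": {"pod-old-7": 88}}
print(prune(vv, departed={"pod-old-7": 88}, acks=acks, live_members={"shard-1", "shard-2"}))
# -> {'shard-1': 900, 'shard-2': 412}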

Optimization 6 — lazy materialisation

The CRDT's stored state — the bag of (element, tag, tombstone) records — is rarely what the application reads. The application reads set.contains("alice") or text.toString(). Materialisation is the conversion from stored to application form.

Naive: re-materialise on every read. For a 50K-char Yjs document, that's a full walk of the stored CRDT structure — milliseconds, not microseconds.

Optimised: keep a materialised cache, invalidate slices on join. Yjs incrementally updates its tree representation as deltas arrive — the materialised view is always current, and reads are O(1). Automerge takes a different approach: materialise on demand, but cache aggressively. Either works; what does not work is "re-walk the entire CRDT every time".
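
A sketch of the cached-materialisation idea for a toy OR-Set (illustrative, not Yjs's or Automerge's internals): the first read materialises the view lazily, and each joined delta patches the cache instead of forcing a full re-walk per read.

# Cached materialisation for a toy OR-Set (illustrative sketch)
class MaterialisedORSet:
    def __init__(self):
        self._live: dict[str, set[tuple[str, int]]] = {}   # stored CRDT form
        self._cache = None                                  # frozenset view, built lazily

    def join_add(self, element: str, tag: tuple[str, int]) -> None:
        self._live.setdefault(element, set()).add(tag)
        if self._cache is not None:
            self._cache = self._cache | {element}           # incremental patch, not a rebuild

    def join_remove(self, element: str) -> None:
        self._live.pop(element, None)
        if self._cache is not None:
            self._cache = self._cache - {element}

    def elements(self) -> frozenset:
        if self._cache is None:                             # materialise once, on first read
            self._cache = frozenset(self._live)
        return self._cache

s = MaterialisedORSet()
s.join_add("alice", ("A", 1)); s.join_add("bob", ("A", 2))
print(s.elements())            # first read walks the stored form
s.join_remove("alice")
print(s.elements())            # later reads reuse the patched cache: frozenset({'bob'})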

Production stories

MealRush, the 380 GB/day egress. The fix was a six-week migration from full-state OR-Set gossip (Riak Data Type 1.x) to delta-state OR-Set with causal-context compression (Akka Distributed Data 2.6). Egress dropped from 380 GB/day to 4.8 GB/day — 79× — for the same workload. The doc-size-on-disk dropped 4× over the next six months as tombstone GC caught up with historical removes. Total infra saving: ₹74 lakh/year on inter-region bandwidth alone.

KapitalKite's order-book CRDT. KapitalKite (a discount stockbroker) tried using a delta-CRDT for cross-region order-state replication during a 2024 NSE-outage drill. They hit a subtle bug: their causal context used wall clocks for tags, and clock skew between Mumbai and Singapore meant some "concurrent" inserts looked sequential to one replica and parallel to the other. Convergence was preserved — that's the CRDT guarantee — but tombstone GC fired too aggressively on one side. Three minutes of trades disappeared from the materialised view (not from disk; just from the user-visible projection). They rolled back to logical clocks per replica and the bug never returned.

CricStream's per-match chat. During an India vs. Pakistan T20 final, 25M concurrent viewers were typing in the chat. The chat used a Yjs-based delta-CRDT with the binary wire format and aggressive GC. Total egress for the 4-hour match: 1.2 TB across all viewer-edge syncs — about 50 KB per active user across 240 minutes. The full-state-CRDT version would have been 70+ TB. They could not have shipped it without the optimizations.

Common confusions

  • "Delta-state CRDTs are a different mathematical object from state-based CRDTs." No. They are the same lattice, the same join function. The deltas are just small lattice values. The optimization is in transport, not in semantics.
  • "Op-based CRDTs are always smaller than state-based, so deltas are obsolete." No. Op-based CRDTs require exactly-once delivery of every operation — a hard guarantee on lossy networks. Delta-state CRDTs tolerate lost or duplicated deltas because the join is idempotent. In practice, delta-state is the choice when the network is unreliable and op-based is the choice when you have a reliable broker (Kafka).
  • "Tombstone GC is mandatory." Not always. For short-lived CRDTs (a chat session, a single-day collaborative session), the document is discarded before tombstones grow problematic. GC is mandatory for long-lived CRDTs where the document must survive months or years.
  • "Causal context compression is just delta encoding the version vector." It's more — it's also tracking causally-stable prefixes per replica so that the runtime representation is (replica, contiguous-prefix-end, optional-gap-list) rather than a flat list of versions. The compression is most effective when replicas are healthy; in a partitioned cluster, gaps proliferate.
  • "Deltas remove the need for periodic full-state anti-entropy." No, they don't. If a replica misses a delta sequence (network blackout for 48 hours), it cannot catch up purely from new deltas — it needs a full-state pull. Production systems run delta gossip frequently and full-state anti-entropy rarely (every few hours, or on reconnect after long disconnect).
  • "You can take any CRDT and trivially make it a delta-CRDT." Almost. The Almeida-Shoker-Baquero paper proves it for the standard suite (G-Counter, PN-Counter, OR-Set, sequence CRDTs). For ad-hoc lattices designed in-house, you must verify that your delta-mutators do produce join-equivalent slices — easy to get wrong, easy to test.

Going deeper

The asymmetric anti-entropy protocol

A subtle issue: when A pushes a delta to B, A doesn't know whether B already has it. Sending blindly wastes bandwidth (B will idempotently absorb it, but the bytes flowed). The fix is a two-phase protocol: A first sends a summary (its version vector), B replies with the slice it lacks, A ships exactly that slice. Akka's Distributed Data and Riak both use a variant of this. The summary is small (the compressed causal context); the slice is exactly what's needed.
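
A sketch of the two-phase exchange for the simplest possible lattice, a G-Counter vector (illustrative, not Akka's or Riak's protocol messages): A sends its summary, B computes exactly the slice A lacks, A joins it.

# Summary-then-slice sync for a G-Counter vector (illustrative sketch)
def missing_slice(local: dict[str, int], remote_summary: dict[str, int]) -> dict[str, int]:
    # The slice of local state the remote has not seen: entries where we are ahead.
    return {r: v for r, v in local.items() if v > remote_summary.get(r, 0)}

def join(local: dict[str, int], incoming: dict[str, int]) -> dict[str, int]:
    return {r: max(local.get(r, 0), incoming.get(r, 0)) for r in local.keys() | incoming.keys()}

a = {"A": 12, "B": 3}                 # replica A's state; its summary is the vector itself
b = {"A": 12, "B": 7, "C": 2}         # replica B's state
slice_for_a = missing_slice(b, a)     # phase 2: B computes what A lacks from A's summary
print(slice_for_a)                    # {'B': 7, 'C': 2}
print(join(a, slice_for_a))           # phase 3: A joins exactly that slice -> {'A': 12, 'B': 7, 'C': 2}

For a G-Counter the summary and the state happen to coincide; for an OR-Set or a sequence CRDT the summary is the compressed causal context from optimization 2 — tiny — while the slice it selects can be arbitrarily large.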

Reliable causal broadcast as a delta substrate

If the underlying message bus provides reliable causal broadcast (every message delivered, in causal order), op-based CRDTs become trivially correct and small. Kafka with per-key partition ordering gives you this for keyed CRDTs. The choice between delta-state and op-based becomes a question of what guarantees does the substrate give you — if it gives you reliable causal delivery, op-based wins on size; if it gives you only best-effort gossip, delta-state wins on robustness.

Why Automerge is structurally heavier than Yjs

Automerge keeps the full operation history (an op-log) so it can support time-travel and audit. Yjs keeps the materialised CRDT and sheds history aggressively via GC. The same logical document is 3–5× larger in Automerge's wire format. Both are correct; they make different choices about the value of replayability versus footprint. For collaborative editing where users never want to "rewind to last Tuesday", Yjs wins. For systems that need an audit log (legal docs, regulatory filings), Automerge wins.

Reproduce the egress measurement

# Measure naive vs delta gossip cost for a synthetic OR-Set workload.
from collections import defaultdict
import random, json

class FullStateORSet:
    def __init__(self): self.live = defaultdict(set); self.tomb = set()
    def add(self, r, v, e): self.live[e].add((r,v))
    def remove(self, e):
        for t in list(self.live.get(e, ())): self.tomb.add(t)
        self.live.pop(e, None)
    def gossip_payload(self):
        # everything, every time
        return json.dumps({"live": {k: list(v) for k,v in self.live.items()},
                           "tomb": list(self.tomb)})

class DeltaORSet(FullStateORSet):
    def __init__(self): super().__init__(); self.delta_buf = []
    def add(self, r, v, e):
        super().add(r,v,e); self.delta_buf.append(("add", r, v, e))
    def remove(self, e):
        super().remove(e); self.delta_buf.append(("rm", e))
    def gossip_payload(self):
        d = json.dumps(self.delta_buf); self.delta_buf = []; return d

random.seed(0)
N_OPS = 5000; full = FullStateORSet(); delta = DeltaORSet()
full_total = delta_total = 0
for i in range(N_OPS):
    if random.random() < 0.7:
        e = f"item{random.randint(0,200)}"; full.add("A", i, e); delta.add("A", i, e)
    else:
        e = f"item{random.randint(0,200)}"; full.remove(e); delta.remove(e)
    if i % 50 == 49:  # gossip every 50 ops
        full_total += len(full.gossip_payload())
        delta_total += len(delta.gossip_payload())
print(f"full-state egress over {N_OPS} ops, gossip every 50: {full_total/1024:.1f} KB")
print(f"delta-state egress: {delta_total/1024:.1f} KB")
print(f"ratio: {full_total/delta_total:.1f}×")

Typical output: full-state ≈ 4,200 KB, delta-state ≈ 220 KB, ratio ≈ 19×. The ratio grows the longer the experiment runs, because full-state cost is quadratic in operations while delta-state stays linear.

Where this leads next

The next chapter steps up to the JSON-CRDT level: how every node in a nested document becomes its own small CRDT, how the document-level merge composes node-level merges, and how the optimizations in this chapter compound across nesting depth.

References

  1. Almeida, P. S., Shoker, A., Baquero, C. (2018). Delta state replicated data types. Journal of Parallel and Distributed Computing — the foundational delta-CRDT paper.
  2. Shapiro, M., Preguiça, N., Baquero, C., Zawirski, M. (2011). Conflict-free replicated data types. SSS — the lattice-and-join framework deltas live in.
  3. Enes, V., Almeida, P. S., Baquero, C., Leitão, J. (2019). Efficient synchronization of state-based CRDTs. ICDE — practical delta-shipping algorithms.
  4. Preguiça, N., Marquès, J. M., Shapiro, M., Letia, M. (2009). A commutative replicated data type for cooperative editing. ICDCS — early treatment of causal-context compression.
  5. Yjs internals documentation — production binary wire format and incremental materialisation.
  6. Akka Distributed Data documentation — production delta-CRDT implementation with stable-cut tombstone GC.
  7. Kleppmann, M. (2020). Making CRDTs 98% more efficient. Blog post on Automerge wire-format optimisations.
  8. /wiki/g-set-2p-set-or-set — the OR-Set's tombstone behaviour that motivates the GC discussion.