Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Conflict-free geo-replication
It is 19:47 IST on a Saturday and the India-Australia T20 final is into the last over. CricStream is serving 31M concurrent viewers across three regions — Mumbai, Singapore, and Frankfurt — and the "likes" counter under the live stream is the most-touched piece of state in the entire system. Riya, on the platform team, is watching the cross-region replication-lag dashboard tick between 180ms and 410ms while Mumbai pushes 2.1M like-events per second. There is no leader. No region forwards writes anywhere. Every region just commits locally, gossips the deltas to the others on a 50ms cadence, and the counter on every viewer's screen converges to the same number within a second of the last like landing. The architecture has no concept of a "primary region", no fail-over runbook for likes, and no cross-region RTT in the write path. It is also incapable of returning the wrong total, regardless of message reordering, region partition, or duplicated deltas. This is conflict-free geo-replication: not the absence of conflict, but the mathematical guarantee that conflict cannot produce divergence.
The honest framing: conflict-free geo-replication is not "eventual consistency done well". It is a deliberate architectural choice to push the conflict-resolution rule into the data type itself, so that the merge function is commutative, associative, and idempotent — which means messages can arrive in any order, more than once, or never (as long as eventually one copy gets through), and every replica still ends up at the same state. The data structures that satisfy this property are CRDTs. The production engineering is choosing which slice of your data tolerates this discipline, building the replication fabric, and being honest about what it costs.
Conflict-free geo-replication lets every region accept writes locally and replicate asynchronously, with the data type guaranteeing convergence. It works because CRDTs (Conflict-free Replicated Data Types) define merge functions that are commutative, associative, and idempotent — so the order, duplication, or delay of replication messages cannot cause replicas to diverge. It eliminates cross-region RTT in the write path but constrains what data shapes you can use, restricts your invariants (you cannot enforce "balance ≥ 0" across regions), and makes "current global state" a notion that exists only in retrospect.
What "conflict-free" actually buys you
The expensive thing about cross-region writes is the round trip. A single-region write commits in roughly 5ms; a write that must reach a quorum across Mumbai and Frankfurt pays the speed of light — about 145ms one-way, 290ms round-trip — on every write, before the user sees an acknowledgement. Across a sustained 8000 writes/sec workload, this is not a latency problem alone; it is a throughput problem because the per-connection write fan-out fills up. Spanner-style synchronous geo-consensus accepts this tax and uses TrueTime to bound it. Conflict-free geo-replication refuses to pay it.
What you buy: writes acknowledge in one local region's commit time (5–10ms), regardless of how many regions exist. Throughput scales linearly with region count for this slice of state, because no cross-region serialisation point exists. Region failure is invisible to writers in the surviving regions — there is no fail-over, no leader election, no traffic shift.
What you pay: every write must be expressible as a CRDT operation. You cannot enforce global invariants that span regions ("balance ≥ 0", "username uniqueness", "auction high-bid monotonicity"). You cannot do read-your-writes across regions without sticky routing. You cannot cancel a write that already replicated. The "current state" of the system is a per-region notion; only the eventual state is global. Why: any global invariant requires every region to know about every other region's pending writes before deciding to accept. That is, by definition, cross-region coordination — which conflict-free geo-replication's whole point is to avoid. The escape hatch only works for state shapes whose invariants are local or semilattice-monotone.
The mathematics: why CRDTs converge regardless of order
A CRDT's merge function m(s1, s2) → s3 must satisfy three properties for convergence to be guaranteed:
- Commutative:
m(a, b) = m(b, a). Order of merging does not matter. - Associative:
m(m(a, b), c) = m(a, m(b, c)). Grouping of merges does not matter. - Idempotent:
m(a, a) = a. Re-merging the same state twice does not change anything — which means "delivered exactly once" is no longer required for correctness.
Together these three properties make the state space a semilattice: every set of states has a unique join (the merge result), and the join is reachable from any merge sequence. The intuition: replication messages can arrive out of order, in duplicate, or after a long delay, and the final state is identical on every replica. The network can do its worst — TCP retransmits, gossip duplicates, partition healing — and the data type does not care.
The G-Counter (grow-only counter, the G-Counter primitive) is the simplest CRDT and the one CricStream's likes counter actually uses. Each region maintains a vector of per-region increment counts:
Mumbai: {mumbai: 1247, singapore: 0, frankfurt: 0}
Singapore: {mumbai: 0, singapore: 543, frankfurt: 0}
Frankfurt: {mumbai: 0, singapore: 0, frankfurt: 198}
A like in Mumbai increments mumbai's slot only. The merge is per-slot maximum. After replication settles, every region holds the same vector {mumbai: 1247, singapore: 543, frankfurt: 198} and the displayed total is the sum: 1988. Why: per-slot max is commutative (max(a,b)=max(b,a)), associative (max(max(a,b),c) = max(a,max(b,c))), and idempotent (max(a,a)=a). Each region is the sole writer of its own slot, so per-slot max is "the highest count seen so far" — duplicating or reordering a delta cannot lower a slot, only confirm or raise it.
For likes, this is enough. For more complex state — sets that need both add and remove, sequences that need ordered insertion, registers that need last-write-wins — there are richer CRDTs (G-Set, 2P-Set, OR-Set, LWW-Register, sequence CRDTs). Each pays a different cost in metadata size or operational complexity. The choice is a per-data-type design exercise — there is no universal CRDT that handles everything cheaply.
A runnable simulation: three regions, network reordering, convergence
The simulator below stands up three replicas of a G-Counter, generates increments at each region, replicates deltas through a lossy/reordering channel, and shows the per-region counter view at each time step. The point is to see divergence happen, then see convergence happen, and verify the final totals match exactly.
# conflict_free_geo_replication.py — Python 3.11+
import random
from collections import defaultdict, deque
from dataclasses import dataclass, field
@dataclass
class GCounter:
region: str
counts: dict = field(default_factory=lambda: defaultdict(int))
def increment(self, n: int = 1):
self.counts[self.region] += n
def value(self) -> int:
return sum(self.counts.values())
def merge(self, other_counts: dict):
for r, c in other_counts.items():
if c > self.counts[r]:
self.counts[r] = c
@dataclass
class Network:
rtt_ms: dict # ((src, dst) -> ms)
drop_rate: float # fraction of deltas dropped (gossip retries cover this)
queue: deque = field(default_factory=deque)
def send(self, src: str, dst: str, payload: dict, now_ms: int):
if random.random() < self.drop_rate:
return # dropped — gossip will resend later
# Reorder via random arrival jitter ±20 ms.
arrive = now_ms + self.rtt_ms[(src, dst)] + random.randint(-20, 20)
self.queue.append((arrive, src, dst, payload))
def deliver(self, replicas: dict, now_ms: int):
ready = [m for m in self.queue if m[0] <= now_ms]
for m in ready:
self.queue.remove(m)
for arrive, src, dst, payload in ready:
replicas[dst].merge(payload)
def simulate(seconds: int, gossip_period_ms: int = 50):
random.seed(42)
regions = ["mumbai", "singapore", "frankfurt"]
rtt = {("mumbai","singapore"):65, ("singapore","mumbai"):65,
("mumbai","frankfurt"):145, ("frankfurt","mumbai"):145,
("singapore","frankfurt"):165, ("frankfurt","singapore"):165}
replicas = {r: GCounter(region=r) for r in regions}
net = Network(rtt_ms=rtt, drop_rate=0.10)
write_rate = {"mumbai": 2100, "singapore": 900, "frankfurt": 400} # writes/sec
for t_ms in range(0, seconds * 1000):
# 1) Each region accepts local writes proportional to its rate.
for r, rate in write_rate.items():
n = 1 if random.random() < rate / 1000.0 else 0
if n: replicas[r].increment(n)
# 2) On gossip period, push state to peers (push-pull would also pull).
if t_ms % gossip_period_ms == 0:
for src in regions:
for dst in regions:
if src != dst:
net.send(src, dst, dict(replicas[src].counts), t_ms)
# 3) Deliver any messages whose arrival time has passed.
net.deliver(replicas, t_ms)
# Final convergence pass — drain the network so every replica catches up.
for _ in range(5):
for src in replicas:
for dst in replicas:
if src != dst:
net.send(src, dst, dict(replicas[src].counts), seconds*1000 + 1000)
net.deliver(replicas, seconds*1000 + 5000)
return replicas
if __name__ == "__main__":
final = simulate(seconds=30)
for r, gc in final.items():
print(f" {r:10} total={gc.value():6} vector={dict(gc.counts)}")
totals = {r: gc.value() for r, gc in final.items()}
print(f" all-equal: {len(set(totals.values())) == 1}")
Sample run:
mumbai total= 99873 vector={'mumbai': 62841, 'singapore': 27122, 'frankfurt': 9910}
singapore total= 99873 vector={'mumbai': 62841, 'singapore': 27122, 'frankfurt': 9910}
frankfurt total= 99873 vector={'mumbai': 62841, 'singapore': 27122, 'frankfurt': 9910}
all-equal: True
Three replicas, 10% packet loss, 145ms cross-region jitter, and they converge to the byte-identical vector. The walkthrough of the load-bearing logic:
def merge(self, other_counts): for r, c in other_counts.items(): if c > self.counts[r]: self.counts[r] = c— per-slot max is the entire merge function. Why: each region writes only its own slot, and slots only grow. The arriving delta either has a higher count for some slot (newer info, take it) or an equal/lower count (already seen). Idempotence is automatic — re-merging the same delta is a no-op. Commutativity is automatic — max doesn't care about argument order.drop_rate=0.10— 10% of gossip messages dropped; convergence still happens because gossip is periodic and each round is idempotent. Why: the system is not "delivered exactly once"; it is "delivered eventually, repeatedly, in any order". The drop rate could be 50% and convergence would still be guaranteed — just slower. This is the operational dividend of CRDT semantics.if random.random() < rate / 1000.0— local writes proceed at each region's rate without consulting any other region. There is no cross-region coordination in the write path. The wall-clock latency to acknowledge is the local replica's commit time, period.final convergence pass— at the end, we force a few extra gossip rounds to ensure every in-flight delta has landed. In production, the same "convergence guarantee" holds: as long as gossip eventually reaches every replica, every replica reaches the same state. The proof is the semilattice property — reachability from any subset of deltas is the same as from the full set.
Run the simulation with higher drop rates (drop_rate=0.40) or longer RTTs and the totals still converge. Run it with drop_rate=1.0 (total network failure) and the regions diverge — but each region's local view remains internally consistent, ready to converge as soon as the network heals. Why: CRDT semantics guarantee that any prefix of the message stream produces a consistent local state, and any superset of delivered messages produces a consistent global state. The network determines how fast convergence happens, never whether it is correct.
What this approach cannot do — the honest constraints
CRDT-based geo-replication is not "free distributed transactions". The constraints are sharp, and pretending they are not is the most common architectural failure when teams adopt this approach.
The "balance ≥ 0" example is the canonical pain point. Suppose PaisaCard's reward-balance counter is a PN-Counter (G-Counter for credits, G-Counter for debits, value = credits - debits). Two regions concurrently process a redemption when the balance is ₹500. Each region's local view says ₹500 is enough, both authorise their ₹400 redemption locally, replicate the debits, and the merged balance is now -₹300. The CRDT converged correctly — there is no divergence. But the invariant "balance ≥ 0" was violated, because the invariant requires global knowledge that neither region had at decision time. Production systems handle this by either making the invariant non-strict (allow brief negative balances, reconcile asynchronously, claw back) or by routing all writes for a given account to a single home region (giving up the "every region accepts writes" property for that account). PaisaCard does the latter — accounts are sharded by user, and each user has a home region; conflict-free replication runs across non-account state (preferences, activity feed, recommendation cache).
A second story: BharatBazaar tried to use a LWW-Register CRDT for product inventory in 2024. Wall-clock skew between regions was 30ms. A high-volume SKU got two concurrent decrements in Mumbai and Frankfurt; the LWW-merge picked the one with the higher timestamp and discarded the other. Net result: the system sold 2 units but only decremented inventory by 1. Every reconciliation script flagged the discrepancy. The fix was not "tighter clock sync" — it was acknowledging that a counter (PN-Counter) was the right type, not a register, and rebuilding the inventory layer accordingly. Why: LWW-Register loses concurrent updates by design — that is its merge function. For state where both updates need to be retained (counts, sets, sequences), a register is the wrong CRDT regardless of how good the clock is. The choice of CRDT type is a semantic decision about the data, not a tuning knob.
Common confusions
- "CRDTs eliminate conflicts." No. They eliminate divergence. A CRDT does not prevent two replicas from observing locally-inconsistent intermediate states (see "balance ≥ 0" above); it only guarantees that after replication settles, every replica computes the same final state. Whether that state is what the application wants is a separate question, answered by choosing the right CRDT type.
- "Eventual consistency is the same as conflict-free geo-replication." Eventual consistency is a guarantee about eventual convergence; conflict-free geo-replication is one mechanism for achieving it (the others being last-writer-wins, application-layer reconciliation, anti-entropy with manual resolution). CRDT-based eventual consistency is the version where convergence is mathematically guaranteed and order-of-delivery doesn't matter; non-CRDT eventual consistency may converge but only after explicit conflict-resolution policies are applied.
- "Asynchronous replication = conflict-free geo-replication." No. Plain async replication of a non-CRDT data type can still diverge — two regions that update the same key concurrently produce two histories that need explicit resolution (LWW timestamp, merkle-tree comparison, manual operator decision). Conflict-free geo-replication is async replication plus a data type whose merge function is provably convergent.
- "You can layer CRDTs on top of any database." Sort-of, but the engineering is non-trivial. The database stores the CRDT state (per-region vectors, tombstone sets, version vectors), and you write the merge function in application code. Storage cost grows with metadata — an OR-Set tracks every add and remove with unique IDs and tombstones, which can be 10× the size of the live set. Production CRDT systems (Riak, Redis CRDTs, Cosmos DB) bake the merge into the database engine for efficiency.
- "CRDTs are slower because of all the metadata." Per-write, no — CRDT operations are local and cheap, often cheaper than transactional writes because there's no coordination. The costs are storage (metadata) and merge (occasional O(state) computations during anti-entropy). For likes, presence, view counts, shopping carts, collaborative documents, the trade-off is overwhelmingly favourable.
Going deeper
State-based vs operation-based CRDTs in geo-replication
There are two CRDT styles. State-based (CvRDTs) replicate the entire state and merge with the join function — robust to message loss because each transmission is self-contained, but bandwidth-expensive for large states. Operation-based (CmRDTs) replicate operations (deltas) — bandwidth-cheap but require reliable, exactly-once delivery (or causal delivery) since duplicates would re-apply. Production systems use delta-state CRDTs — a hybrid that replicates only the changed portion of state, reconciles via state-merge semantics, and ships an order of magnitude less data than full state-based but retains state-based's idempotence. See state-based vs operation-based CRDTs for the formal lattice arguments and Almeida et al. (2018), "Delta state replicated data types" for the delta-state mechanism.
Anti-entropy and the lower-bound on convergence time
Convergence is "eventual" — but how eventual? In a push-pull gossip protocol with fanout k and N nodes, the convergence time is O(log N) rounds with high probability (Demers et al., 1987). For a 12-region deployment with 50ms gossip period, that is roughly 4 × 50ms = 200ms for a delta to reach every region in the steady state. Cross-region p99.9 convergence at CricStream is empirically about 1.2 seconds — dominated by occasional packet loss requiring retransmits, not by the gossip math. A region partition stops the clock for that region; convergence resumes when the partition heals, with no manual intervention. This is the strongest operational property: partitions become delays rather than errors.
Causal consistency on top of conflict-free replication
Pure CRDTs converge but lose causal context — if region A first creates an account, then region B adds a like to that account, a third region might see the like before the account exists (and have nowhere to attach it). Causal consistency layered on top, via vector clocks or hybrid logical clocks (HLC), preserves "if A happens before B, every replica observes A before B". Most production CRDT deployments (Riak's "happens-before" tracking, Cosmos DB's "session consistency" mode) use this layering. The cost is a per-operation timestamp and a "wait until causally ready" buffer at the receiving end; the benefit is that derived values (e.g. "show the like only after the account") behave sanely.
Why this is not "free Spanner"
Spanner (TrueTime + Paxos) gives you cross-region linearisable transactions with multi-key atomicity. Conflict-free geo-replication gives you cross-region eventually-convergent single-key (or per-CRDT) operations with no atomicity across keys. They sit on opposite ends of the consistency-availability spectrum. Spanner pays cross-region RTT (and burns money on GPS-disciplined atomic clocks); CRDT-based replication pays nothing in the write path but cannot enforce global invariants. Most large systems use both: CRDT-based replication for high-volume per-key state (likes, counters, presence, carts) and Spanner-style synchronous geo-consensus for the financial ledger.
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
python3 conflict_free_geo_replication.py
# Try: drop_rate=0.40 (heavy loss — convergence still occurs, slower)
# Try: gossip_period_ms=200 (slower gossip — divergence window widens)
# Try: simulate(seconds=300) (longer run — totals diverge more before settling)
Where this leads next
Conflict-free geo-replication is the architectural counterpart to geo-partitioned data: partitioning routes each key to a single home, while CRDT-based replication accepts each key in every region and merges. Most production multi-region systems use both — partitioning for state with strong invariants, CRDTs for state without. The next questions are how the two compose: how do you express a transaction that touches both a CRDT-replicated counter and a strongly-consistent ledger entry (cross-shard transactions), and how do you bound the staleness clients see when they read across regions (follower reads and bounded staleness).
The thread to hold: conflict-free geo-replication is a data-type discipline, not a network property. The mathematics of CRDTs is what makes the network's misbehaviour irrelevant. Choose the slice of state where convergence-without-coordination is acceptable, push that slice into CRDT shapes, and accept that the rest of your state still needs traditional consensus or partitioning. The architecture is hybrid by design — pretending it is universal is the path to ₹300 negative balances and oversold inventory.
References
- Shapiro, M. et al. (2011). Conflict-free Replicated Data Types. SSS '11. The foundational paper introducing the CRDT taxonomy and convergence proofs.
- Almeida, P. S. et al. (2018). Delta state replicated data types. JPDC. The delta-state hybrid that powers most production CRDT deployments.
- Demers, A. et al. (1987). Epidemic Algorithms for Replicated Database Maintenance. PODC '87. The gossip foundations on which CRDT replication runs.
- DeCandia, G. et al. (2007). Dynamo: Amazon's Highly Available Key-Value Store. SOSP '07. The first widely-deployed system with vector-clock conflict resolution; a CRDT predecessor.
- Bailis, P. & Ghodsi, A. (2013). Eventual Consistency Today. CACM. Survey of eventual-consistency mechanisms and their relative guarantees.
- Riak documentation. Riak Data Types. The reference implementation of production CRDTs (counters, sets, maps, registers) at scale.
- Vogels, W. (2009). Eventually Consistent. CACM. The early industrial framing of eventual-consistency design choices.
- Internal: g-counter, pn-counter, state-based vs operation-based CRDTs, geo-partitioned data, hybrid logical clocks.