Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Wall: eventual consistency needs conflict resolution

It is 02:14 on a Saturday at MealRush. A user named Riya opened the app on her phone (replica A, Mumbai region) and added an extra-cheese pizza to her cart. Her partner Karan, sharing the same account from their tablet (replica B, Chennai region), removed the pizza and added a paneer roll instead. The replicas are healthy. There is no partition. The async replication stream is humming along at 80 ms lag. Twelve seconds later both replicas have heard about both writes — and the cart now disagrees with itself. Replica A's cart says {pizza}. Replica B's cart says {paneer-roll}. "Eventual consistency" promised convergence. Convergence to what?

Eventual consistency guarantees that replicas, given enough quiet time, will agree — but it does not specify what they will agree on. Concurrent writes to the same key produce conflicts that the system must resolve, and the choice of resolver (last-writer-wins, multi-value, CRDT merge, application callback) is the part of "eventual consistency" that gets silently dropped from architecture diagrams. The wall is: silence about conflict resolution is a correctness bug, not an implementation detail.

What "eventual" leaves unsaid

Werner Vogels's 2009 CACM article defined eventual consistency precisely: if no new updates are made to an object, eventually all accesses will return the last updated value. Read that sentence twice. The word doing the work is last. Two replicas accepted writes concurrently. Which write is "last"? The wall-clocks differ by 9 ms. The vector clocks are incomparable. There is no last. The promise of convergence is silent on what value the replicas will converge to when concurrent writes happen to the same key.

This is not a hypothetical edge case. In any production AP system — Cassandra at consistency_level=ONE, DynamoDB at default reads, Riak, CouchDB — concurrent writes to the same key happen all day. A user with two devices writes from both. A retry storm replays an old write while a new write is in flight. A geographic failover lands writes in two regions before the failover completes. The system must accept all of these (otherwise it is no longer A in CAP) and then, sometime later, decide what the key's value is. The decision is the conflict-resolution policy. It is the part of the system that determines what your data looks like, and it is the part the documentation buries on page 47.

Why this is a wall and not a footnote: the choice of conflict-resolution policy changes the observable correctness of the application. A shopping cart that uses last-writer-wins quietly drops items every time two devices add at once — and the user discovers this at checkout when the bill is wrong. A shopping cart that uses set-union merge keeps every item ever added — including ones the user explicitly removed. A shopping cart that uses an OR-Set CRDT preserves both adds and removes correctly. Same store, three different applications, three different bug profiles. The "eventual consistency" label tells you none of this.

[Figure: two replica timelines, A (Mumbai) and B (Chennai). At t=0.0s A accepts add(pizza); at t=0.1s B accepts set({paneer-roll}); async replication (~80 ms lag) delivers both writes to both replicas by t=12s. Three outcome boxes show the converged state under each resolver: LWW by timestamp yields {paneer-roll} (B's timestamp is later; pizza is silently lost), set-union yields {pizza, paneer-roll} (the replace intent is lost), and an OR-Set CRDT yields {paneer-roll} (adds and removes tracked with tags; the replace semantic is preserved). Caption: "Same writes, three resolvers, three different 'eventual' answers." Illustrative, not measured data.]
Three resolvers, three different "correct" answers. The system documentation that calls itself "eventually consistent" is silent about which one your application got — that decision lives in the client library, the schema, or the application callback, and it determines whether the user's cart is right at checkout.

The four families of conflict resolvers

Every production AP system implements at least one of these. Most implement two or three and let you pick per-bucket.

1. Last-writer-wins by timestamp (LWW). Each write carries a wall-clock timestamp. On conflict, the higher timestamp wins. Cassandra is the canonical example — every column has an associated timestamp and reconciliation always picks the larger one. Simple, fast, and silently destructive: a write whose clock was 50 ms behind the other replica's clock loses, and the user who issued it is never told. LWW is correct for idempotent overwrite workloads (a status flag, a cached counter that gets recomputed) and incorrect for almost everything else. It is the default in Cassandra, ScyllaDB, and many in-memory caches because it requires no application cooperation.

2. Multi-value reconciliation (MVR), aka sibling-aware reads. The system stores all concurrent writes and returns them as a set on read. The client sees [{pizza}, {paneer-roll}] and must pick or merge. Dynamo's 2007 paper introduced this — they called the un-reconciled values "siblings". Riak inherits the design. This pushes the resolution decision to the application, where domain knowledge can act on it (e.g. shopping carts can take the union; bank balances cannot). The cost is that every read code path must handle the multi-value case, which production teams forget approximately every time.
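
To make the read-side burden concrete, here is a minimal sibling-store sketch in Python; the class and its write/read/resolve methods are illustrative inventions, not Riak's actual client API:

# sibling_store.py: minimal multi-value reconciliation sketch (not Riak's real API)
class SiblingStore:
    def __init__(self):
        self.siblings = []              # concurrent values, no causal order known
    def write(self, value):
        self.siblings.append(value)     # a concurrent write becomes a sibling
    def read(self):
        return list(self.siblings)      # every reader gets a list, not a value
    def resolve(self, merged):
        self.siblings = [merged]        # write-back collapses the siblings

store = SiblingStore()
store.write({"pizza"}); store.write({"paneer-roll"})
siblings = store.read()                 # [{'pizza'}, {'paneer-roll'}]: app must merge
store.resolve(siblings[0] | siblings[1])  # cart domain knowledge: union is safe here
print(store.read())                     # [{'pizza', 'paneer-roll'}]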

3. Application-defined merge callback. The system invokes a function the developer registers, passing both conflicting values. The callback returns the merged value. CouchDB's update functions and Riak's per-bucket merge functions operate in this style. Powerful, because the merge can encode any business logic; risky, because the function must be commutative, associative, and idempotent to guarantee convergence — and most application developers writing merge callbacks have never heard those words.
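
Here is a sketch of why those algebraic words matter. Both callbacks below are hypothetical; the first converges on every replica, while the second's result depends on which replica runs the merge:

# merge_callbacks.py: why commutativity matters (hypothetical callback signatures)
def merge_carts(v1: set, v2: set) -> set:
    return v1 | v2     # commutative, associative, idempotent: replicas converge

def merge_balances(v1: int, v2: int) -> int:
    return v1 - v2     # not commutative: merge(a, b) != merge(b, a)

print(merge_carts({"pizza"}, {"roll"}) == merge_carts({"roll"}, {"pizza"}))  # True
print(merge_balances(500, 400) == merge_balances(400, 500))  # False: replicas diverge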

4. Conflict-free replicated data types (CRDTs). The data structure itself is designed so that all valid operations commute. A G-Counter merges by per-replica max. An OR-Set tracks (element, unique-tag) pairs and a tombstone set; merge is union of both. An LWW-Register uses (timestamp, value) pairs and merge picks the higher timestamp. By construction, replicas converge regardless of message order. The cost is a more complex on-disk representation (a 32-byte set element becomes a 200-byte tombstone-bearing structure) and a more complex API. Riak's data types and Redis Enterprise's CRDT subsystem expose CRDTs as the primary value type. See the chapters on G-Counter, PN-Counter, OR-Set, LWW-Register for the lattice formalism.
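
A minimal G-Counter sketch, following the per-replica-max merge just described (the replica names are illustrative):

# g_counter.py: per-replica counts; merge takes the element-wise max
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}                        # replica-id -> local count
    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n
    def merge(self, other):
        for rid, c in other.counts.items():     # max is commutative, associative,
            self.counts[rid] = max(self.counts.get(rid, 0), c)  # and idempotent
    def value(self):
        return sum(self.counts.values())

a, b = GCounter("mumbai"), GCounter("chennai")
a.increment(3); b.increment(2)
a.merge(b); b.merge(a)
print(a.value(), b.value())                     # 5 5: converged, no coordination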

Why no single resolver dominates: each one is correct for a workload and wrong for another. LWW is correct for "what is the current sensor temperature" (later reading wins). MVR is correct for "what is in this user's shopping cart" (preserve all intentions, let the user resolve). Application-defined callbacks are correct for domain-specific logic (max of two stock-quantity decrements). CRDTs are correct when the operations have an algebraic structure that admits a join-semilattice. A general-purpose resolver does not exist because the right answer depends on what the data means — and the storage system does not know that.

A runnable demonstration: three resolvers, one workload

The simulator below runs the MealRush cart conflict against all three of LWW, set-union, and an OR-Set CRDT, and prints what each replica's converged state looks like. The output makes the silent data loss in LWW visible — you can count how many "intended adds" were thrown away.

# eventual_consistency_resolvers.py — three resolvers, one workload
import uuid
from dataclasses import dataclass, field
from typing import Set, Tuple

@dataclass
class LWWRegister:
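    # On conflict, the higher timestamp wins outright; the loser is discarded.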
    value: Set[str] = field(default_factory=set)
    ts: float = 0.0
    def write(self, value, ts): 
        if ts > self.ts: self.value, self.ts = value, ts
    def merge(self, other):
        if other.ts > self.ts: self.value, self.ts = other.value, other.ts

@dataclass
class UnionRegister:
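    # Merge is plain set union: adds accumulate, removes never survive a merge.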
    value: Set[str] = field(default_factory=set)
    def add(self, x): self.value.add(x)
    def remove(self, x): self.value.discard(x)
    def merge(self, other): self.value |= other.value  # union: removes are lost

@dataclass
class ORSet:
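    # Observed-remove set: adds carry unique tags; remove tombstones the tags it saw.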
    elements: Set[Tuple[str,str]] = field(default_factory=set)  # (value, unique-tag)
    tombstones: Set[Tuple[str,str]] = field(default_factory=set)
    def add(self, x): self.elements.add((x, str(uuid.uuid4())))
    def remove(self, x):
        for el in list(self.elements):
            if el[0] == x: self.tombstones.add(el)
    def value(self):
        return {v for (v,t) in self.elements if (v,t) not in self.tombstones}
    def merge(self, other):
        self.elements |= other.elements
        self.tombstones |= other.tombstones

def run_conflict(resolver_a, resolver_b, op_a, op_b, kind):
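    # Apply each replica's local op, exchange state both ways, then read replica A.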
    op_a(resolver_a); op_b(resolver_b)
    resolver_a.merge(resolver_b); resolver_b.merge(resolver_a)
    if kind == "lww":     return resolver_a.value
    if kind == "union":   return resolver_a.value
    if kind == "orset":   return resolver_a.value()

if __name__ == "__main__":
    # Riya adds pizza on replica A at t=0.0; Karan replaces with paneer-roll on B at t=0.1
    a, b = LWWRegister(), LWWRegister()
    print("LWW:   ", run_conflict(a, b, lambda r: r.write({"pizza"}, 1000.0),
                                    lambda r: r.write({"paneer-roll"}, 1000.1), "lww"))
    a, b = UnionRegister(), UnionRegister()
    print("Union: ", run_conflict(a, b, lambda r: r.add("pizza"),
                                    lambda r: r.add("paneer-roll"), "union"))
    a, b = ORSet(), ORSet()
    a.add("pizza")  # initial state shared via earlier sync
    b.elements |= a.elements
    print("OR-Set:", run_conflict(a, b, lambda r: None,
                                    lambda r: (r.remove("pizza"), r.add("paneer-roll")), "orset"))

Sample run:

LWW:    {'paneer-roll'}
Union:  {'pizza', 'paneer-roll'}
OR-Set: {'paneer-roll'}

Walk through the three lines:

  • LWW: B's timestamp 1000.1 is higher than A's 1000.0, so B's value {paneer-roll} overwrites A's {pizza} on merge. Riya's pizza intention is silently dropped. No callback fires, no audit log records the loss; the cart at checkout is just {paneer-roll}.
  • Union: both adds survive, merging to {pizza, paneer-roll}. Useful if both writes should accumulate (e.g. a contact list); broken for a cart where Karan explicitly replaced the items.
  • OR-Set: A added pizza first (with a unique tag), B then removed pizza (tombstoning that tag) and added paneer-roll (with a different tag). Merge unions both element-sets and both tombstone-sets, so the visible set is {paneer-roll}. The replace semantic is preserved across the conflict.

Why the OR-Set output is "right" but not free: the OR-Set carries every (element, tag) pair and every tombstone forever — the on-disk size grows with the number of write operations, not the number of distinct elements. Production OR-Set implementations periodically run garbage collection that removes tombstones whose causal antecedent is known to have been received by every replica (a non-trivial coordination problem). The CRDT is convergent by construction; the engineering challenge is keeping its size bounded.

What real production systems chose, and why

Cassandra — LWW everywhere, by default. Every column value is (value, timestamp, ttl). On read, the merge picks the column with the highest timestamp; on tie, the larger value (lexicographic). The reason is performance: LWW reconciliation is O(1) per column, no callback, no client-side merge. The reason it works at all is that Cassandra's primary use cases (time-series telemetry, mutable status fields, append-only event logs with primary-key disambiguation) tolerate LWW. The reason it bites teams is that "this is just a normal database, right?" leads people to use it for shopping carts and bank balances. There is even a phrase for the resulting bug class — "phantom writes" — where a write appears to succeed but is silently overwritten by an older write whose clock skew was larger than the gap between the two writes.
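
The skew failure is easy to reproduce with the LWWRegister class from the demonstration above; the timestamps below are invented to model a client clock running 50 ms fast:

# clock_skew.py: the physically later write loses to the skewed-clock write
from eventual_consistency_resolvers import LWWRegister   # the demo class above

a, b = LWWRegister(), LWWRegister()
a.write({"load-500"}, 1000.050)    # issued at real time t=0.000, clock 50 ms fast
b.write({"redeem-400"}, 1000.030)  # issued at real time t=0.030, accurate clock
a.merge(b); b.merge(a)
print(a.value, b.value)  # both {'load-500'}: the later real-world write silently lost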

DynamoDB — LWW by default, conditional writes for application-defined resolution. Single-region DynamoDB is LWW. Global Tables (multi-region) resolves concurrent cross-region writes with last-writer-wins. Conditional writes (ConditionExpression) let the application express "only write if the current value is X", giving compare-and-swap semantics that effectively forbid concurrent overwrites. Most production teams that need cross-region correctness use conditional writes or a separate consensus-backed coordination primitive (DynamoDB transactions, which are PC/EC) for the writes that matter.
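
A sketch of the compare-and-swap pattern with boto3; the wallets table, its attributes, and the version scheme are hypothetical, but ConditionExpression and ConditionalCheckFailedException are the standard conditional-write API:

# conditional_write.py: compare-and-swap on a hypothetical "wallets" table
import boto3
from boto3.dynamodb.conditions import Attr
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("wallets")      # hypothetical table name

def debit(user_id, amount, expected_version):
    # Succeeds only if no concurrent write bumped the version since our read.
    try:
        table.update_item(
            Key={"user_id": user_id},
            UpdateExpression="SET balance = balance - :amt, version = version + :one",
            ConditionExpression=Attr("version").eq(expected_version),
            ExpressionAttributeValues={":amt": amount, ":one": 1},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False   # lost the race: re-read the item and retry
        raise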

Riak — MVR by default, with optional CRDTs for opt-in correctness. Riak's core API returns siblings (multiple concurrent values) when conflicts exist. Read code is required to handle the multi-value case. Riak 2.0 added native CRDT data types (counters, sets, maps, registers) that converge automatically without exposing siblings. The Riak team's blog posts from 2013-2015 are the best public record of teams adopting CRDTs in production and discovering the size-growth problem the hard way.

CouchDB — multi-version with application-callback merge. CouchDB stores every concurrent write as a separate document revision (a tree of revisions, not a chain). On read, the application sees the conflict and must call a _bulk_docs API to write the resolved version back. PouchDB (CouchDB's browser-side counterpart) leans on this for offline-first applications — the user's device accumulates writes while disconnected, syncs on reconnect, and the application code resolves any conflicts that emerged during the offline window.

[Table: conflict-resolution trade-off matrix. Illustrative, not measured data.]

Resolver          Per-write cost           Read complexity          Silent loss?          Disk growth
LWW               O(1) timestamp compare   trivial (single value)   YES                   none
MVR (siblings)    O(1) + vector clock      app must handle list     no                    modest
App callback      O(callback)              depends on callback      depends on callback   modest
CRDT              O(merge fn)              trivial (value())        no                    grows with ops until GC

When the workload is overwrite-only (sensor readings, last-known-status fields, cached materializations), LWW is fine. Production teams typically pick a policy per-bucket, not per-database.
The cost is not a single dimension — LWW pays in silent data loss, MVR pays in read-path complexity, callbacks pay in correctness-of-the-callback, CRDTs pay in storage growth. There is no policy that dominates on all four axes; pick the one whose cost is acceptable for the workload.

When eventual consistency is silently the wrong choice

PaySetu's wallet team learned this the hard way during a 2024 IPL final. The wallet-balance store was a Cassandra cluster at consistency_level=ONE for both reads and writes, with LWW reconciliation. A user named Aditi loaded ₹500 onto her wallet from her phone (replica A) at the same moment her brother Jishant, sharing the account on the laptop, redeemed a ₹400 cashback offer (replica B). Both writes succeeded. The replicas exchanged state 80 ms later. Replica B's "redeem ₹400" had a timestamp 12 ms higher than replica A's "load ₹500", so LWW kept the redeem and discarded the load. Aditi's ₹500 was gone from the wallet view but the load had charged her bank account. Customer support saw 47 such tickets in two hours.

The fix was not "tune Cassandra". The fix was to recognise that wallet operations are not overwrites — they are deltas — and to switch the wallet's storage to a different model entirely. PaySetu moved the wallet to a single-region PostgreSQL with serializable isolation (PC/EC, accepting the latency) and kept the Cassandra cluster only for non-financial state (last-login-time, push-notification preferences) where LWW is correct. The post-mortem labelled the original design "LWW for a delta workload — silent corruption guaranteed".

Why this took customer-support tickets to discover: the silent-data-loss failure mode of LWW does not produce errors. The writes both succeed. The reads return a value. The dashboards are green. The only signal is the gap between "what users think they did" and "what the database thinks happened" — and that gap is invisible to the system. The lesson is that "the system is healthy" and "the data is correct" are independent claims under eventual consistency, and the second one needs application-level monitoring (audit-log reconciliation, event-sourcing replay, or a write-through ledger) to verify.

Common confusions

  • "Eventual consistency means the replicas eventually agree." Eventual consistency means the replicas eventually agree on a value. It does not specify which value, and concurrent writes turn the choice into a policy decision. Without a conflict-resolution policy, "eventual consistency" is incomplete.
  • "LWW is the same as the last write the user made." LWW is the last write by wall-clock timestamp, which is not the last write the user made if clocks skew or if the user's two devices issued writes simultaneously. LWW with a 50 ms clock skew can drop a write that arrived at the system 30 ms after the winning write.
  • "Vector clocks solve the conflict problem." Vector clocks detect concurrent writes — they tell you whether two writes are causally ordered or concurrent. They do not resolve concurrent writes. After detection, you still need one of the four resolver policies. See vector clocks.
  • "Strong consistency just costs latency." Strong consistency also costs availability under partition (by CAP). The PACELC framework (see PACELC) names both. Picking eventual consistency to "save latency" is a partial picture; you also save availability under partition, but pay in conflict-resolution complexity.
  • "CRDTs make conflict resolution free." CRDTs make conflict convergence automatic. They do not make conflict resolution free — the cost moves from runtime callbacks into the data structure design and into the on-disk size, which grows with the operation count and requires garbage collection coordinated across replicas.
  • "My system is eventually consistent; the application doesn't need to do anything." The application either explicitly chose a resolver (by configuring the database) or implicitly accepted whatever the database default was. There is no third option. If your team has not had the conversation about which resolver this workload uses, your team is using LWW by default and discovering the silent data loss in production.

Going deeper

The CALM theorem — when a workload can be safely eventually consistent

The CALM theorem (Consistency As Logical Monotonicity, conjectured by Hellerstein in 2010 and proved by Ameloot, Neven, and Van den Bussche in 2011) states: a distributed program produces a deterministic outcome under eventual consistency if and only if it can be expressed using only monotonic operations (operations whose output set only grows as more inputs arrive — set union, max, addition over non-negative reals). Non-monotonic operations (set difference, division, average over finite windows) require coordination to produce deterministic outcomes. CALM gives a precise way to ask "can this code path be eventually consistent without conflict?" The answer for a shopping cart's add-item operation is yes (set union is monotonic). The answer for a wallet balance's debit operation is no (debit is non-monotonic — the balance can go down). The CALM theorem tells you which workloads are inherently coordination-free and which workloads need a stronger consistency level to be correct, regardless of how clever your resolver is.
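
A sketch of the monotonicity distinction with toy values: set union converges under every delivery order, while an add/remove pair does not.

# calm_check.py: monotonic ops are order-insensitive; non-monotonic ops are not
import itertools
from functools import reduce

adds = [{"pizza"}, {"roll"}, {"soda"}]
union_results = {frozenset(reduce(set.union, p, set()))
                 for p in itertools.permutations(adds)}
print(len(union_results))   # 1: union is monotonic, every order converges

ops = [lambda s: s | {"pizza"},     # monotonic add
       lambda s: s - {"pizza"}]     # non-monotonic remove
diff_results = {frozenset(reduce(lambda s, f: f(s), p, set()))
                for p in itertools.permutations(ops)}
print(len(diff_results))    # 2: order changes the outcome; coordination required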

The Bayou paper and application-defined merge — the 1995 origin story

Conflict resolution as a first-class operating-system problem was the contribution of the Bayou project (Terry et al. 1995). Bayou was a mobile-first replicated database where laptops would disconnect, accept writes offline, then reconnect and reconcile. The Bayou team observed that pure LWW or set-union was insufficient for the calendar-and-meeting-room workload they targeted, and introduced dependency checks (preconditions a write must verify on each replica) and merge procedures (application-supplied callbacks that ran when dependency checks failed). Bayou's per-operation merge procedure is the direct ancestor of CouchDB's update functions and Riak's per-bucket merge callbacks. The 1995 paper is worth reading for the framing alone — it predicts the smartphone-era reconciliation problem 12 years before the iPhone.

Tombstone explosion — the practical CRDT cost

Production OR-Set deployments hit a wall around 10^7 cumulative add+remove operations: the tombstone set grows monotonically, the on-disk representation balloons, and merge times degrade from microseconds to seconds. The standard mitigation is causal stability — a tombstone whose causal antecedent has been received by every replica can be safely removed. Detecting causal stability requires every replica to know what every other replica has seen, which is itself a coordination problem (tracked via vector clocks or version vectors over the membership set). Riak's "delta-CRDT" work (Almeida et al. 2014) reduces the per-message size of CRDT updates but does not eliminate the tombstone cost. The practical advice is: bound the lifetime of OR-Set values, schedule periodic compactions, and treat CRDT storage growth as a first-class capacity-planning input rather than an afterthought.
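
You can watch the growth with the ORSet class from the demonstration file; the workload below is synthetic and the numbers are illustrative:

# tombstone_growth.py: footprint grows with operations, not distinct elements
import random
from eventual_consistency_resolvers import ORSet   # the demo class above

s = ORSet()
for _ in range(10_000):
    s.add(f"item-{random.randrange(100)}")          # only 100 distinct values
    if random.random() < 0.5:
        s.remove(f"item-{random.randrange(100)}")
print("visible elements:  ", len(s.value()))        # bounded by 100
print("stored (value,tag):", len(s.elements))       # 10,000 and growing
print("tombstones:        ", len(s.tombstones))     # grows until GC removes them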

Why concurrent writes happen even on a healthy network

It is tempting to think concurrent writes only happen during a partition. They do not. On a healthy network, two concurrent writes occur whenever:

  • A user has multiple devices issuing writes that arrive at different replicas before async replication catches up (the MealRush cart story).
  • A retry storm replays an old write while a newer write is in flight (idempotency keys mitigate but do not eliminate this).
  • A geographically distributed application receives writes in different regions before the replication topology has converged.
  • An offline-first mobile client accumulates writes for hours and then syncs them all at once into a system that has been accepting other writes meanwhile.

The conflict-resolution policy is exercised on a continuous basis under healthy operation, not just during partition events. This is why the wall is shaped the way it is: the silent-data-loss failure mode of LWW happens to MealRush and PaySetu every day, not just during a network event.

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
python3 eventual_consistency_resolvers.py
# Modify the timestamps in the LWW call to see how a 1ms clock skew flips the winner.
# Modify the OR-Set to add 10000 elements with random adds/removes — watch element-set + tombstone size grow linearly.

Where this leads next

The conflict-resolution wall is the entry point to two of the curriculum's most technical chapters:

Part 13's CRDT chapters take the lattice formalism seriously and derive the merge functions from algebraic axioms. Part 17's geo-distribution chapters revisit the conflict-resolution wall in the cross-region setting, where async replication is a baseline and concurrent writes are the norm.

References

  • Vogels, W. — "Eventually Consistent" (CACM 2009). The original definition; pay attention to what the definition does not specify.
  • DeCandia, G. et al. — "Dynamo: Amazon's Highly Available Key-Value Store" (SOSP 2007). Introduces siblings and application-driven reconciliation.
  • Terry, D. et al. — "Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System" (SOSP 1995). The 1995 paper that named the problem.
  • Shapiro, M. et al. — "Conflict-free Replicated Data Types" (SSS 2011). The foundational CRDT paper.
  • Bailis, P. et al. — "Coordination Avoidance in Database Systems" (VLDB 2014). Introduces invariant confluence, a criterion for when a workload can be safely eventually consistent.
  • Hellerstein, J. & Alvaro, P. — "Keeping CALM: When Distributed Consistency Is Easy" (CACM 2020). The CALM theorem in plain language.
  • Bailis, P. & Ghodsi, A. — "Eventual Consistency Today: Limitations, Extensions, and Beyond" (CACM 2013). Empirical look at production EC systems.
  • Kleppmann, M. — Designing Data-Intensive Applications, Chapter 5. Clear treatment of replication conflicts.
  • PACELC — the framework that names the trade-off whose EL arm forces the conflict-resolution question.
  • Eventual consistency — the consistency model whose silence on the resolver question is the wall this chapter names.