In short

In a leaderless system, two clients can write to the same key on different replicas at nearly the same instant. Neither has seen the other; neither is "newer" in any causal sense. When the replicas later reconcile — via read repair, anti-entropy, or a client-side read — the system finds two competing versions. Something must decide.

Two main strategies dominate. The first is last-write-wins (LWW). Every version carries a timestamp — wall clock, Lamport counter, or hybrid logical clock — and the version with the later timestamp wins; the loser is silently discarded. LWW is one line of code, which is why Cassandra, ScyllaDB, and DynamoDB-the-service use it. It also silently loses data whenever two clients legitimately update the same key concurrently: an increment vanishes, a cart-add is forgotten, a settings change evaporates.

The second is vector clocks, introduced independently by Fidge (1988) and Mattern (1989). A vector clock is a dictionary {node_id: counter} attached to every version. Each node increments its own counter on every write and preserves others' verbatim. Comparing two clocks tells you whether one happened-before the other (safe to discard the older) or whether they are truly concurrent (neither is older; both must be preserved). Riak used pure vector clocks; Riak 2.0 moved to dotted version vectors, a compact variant that handles client-caused growth.

Vector clocks preserve data that LWW loses. The cost is reader complexity — applications must merge multiple versions semantically — plus metadata overhead. For shopping carts, collaborative documents, multi-device user state, and anything with genuinely concurrent updates, the cost is worth paying.

You are running a shopping cart service on top of a leaderless Dynamo-style database. Two replicas, X and Y, hold the same key cart:alice. A network partition isolates them for thirty seconds.

Alice opens the shopping app on her phone. Her request routes to replica X. She adds milk to her cart. X accepts the write; milk is in the cart as far as X is concerned.

Her husband, Bob, on the other side of the partition, routes to replica Y. He adds bread. Y accepts the write; bread is in the cart as far as Y is concerned.

The partition heals. Replicas X and Y exchange data, as they must, because eventual consistency demands convergence. They discover they disagree about cart:alice. X says [milk]. Y says [bread]. Which is "correct"?

Both, obviously. The family wants both items. But the database does not know that. The database sees two conflicting values for the same key and must decide what to return on the next read. This chapter is about how that decision gets made.

The setup — when conflicts happen

Conflicts are a structural consequence of two design choices working together: eventual consistency and leaderless replication. Remove either and conflicts mostly vanish. A single leader serialises concurrent writes into a linear order; consensus (Raft, Paxos) forces every write to appear in an agreed-upon order before acknowledging; in either case, no ambiguity.

Dynamo-style systems threw both out. Any node accepts any write. Writes acknowledge as soon as W of N replicas have them. Two writes on two coordinators that arrive before gossip catches up — a window of tens to hundreds of milliseconds under healthy conditions, seconds to minutes under partition — never see each other. Both succeed. Both are durable. Neither is "first".

The conflict is discovered later, by one of three mechanisms covered earlier in this Build:

  • Read repair — a coordinator reading from several replicas notices divergent responses and pushes the resolved version back to the stale ones.
  • Anti-entropy — a background process (typically Merkle-tree comparison) walks the key space and flags keys whose versions have diverged between replicas.
  • Client-side reconciliation — a read returns multiple versions to the client, which must resolve them before writing back.

All three funnel into the same question: given two or more versions of a value for the same key, with some metadata, which do you keep? This chapter is about the answer.

Last-write-wins (LWW)

The simplest possible answer: tag every write with a timestamp; on conflict, keep the one with the larger timestamp.

def lww_resolve(versions: list[tuple[int, bytes]]) -> bytes:
    # versions is [(timestamp, value)]; ties broken by value so every replica agrees
    return max(versions, key=lambda v: (v[0], v[1]))[1]

Three lines. One merge function for every key in the database. Deterministic — every replica, seeing the same set of versions, picks the same winner. Idempotent — reconciling twice gives the same answer. Associative — (a merged with b) merged with c equals a merged with (b merged with c). These properties matter because reconciliation can happen in any order across replicas and must converge.

Cassandra uses LWW at the column level. Every column value has a timestamp (microseconds since epoch, 64-bit), and on read the coordinator picks the latest-timestamp version across all replicas' responses. Writes are idempotent at any timestamp; replays of old writes simply get out-voted by newer ones. ScyllaDB inherits the model. DynamoDB-the-service uses LWW between regions in its global-tables feature.

Why LWW is so popular despite its problems: for the common case — a single client issuing a stream of updates to a key, no concurrency, clocks roughly synchronised — LWW works perfectly. The version it discards is almost always a genuinely stale predecessor, not a concurrent write carrying independent intent. Database designers optimise for the common case and accept the pathological corner as a documented limitation.

The timestamp source matters. Three common choices:

  • Wall-clock time — microseconds since epoch from the node's (or the client's) system clock. Simple and human-readable, but hostage to clock skew.
  • Lamport counters — logical time that respects causality and never runs backwards, but carries no wall-time meaning.
  • Hybrid logical clocks — wall time plus a logical tie-breaker, covered later in this chapter.

All three share LWW's fundamental limitation: they impose a total order on events that are not actually totally ordered. If you and I simultaneously modify the same key without seeing each other's update, our writes are genuinely concurrent — but the tie-breaker of whichever timestamp is larger arbitrarily picks one and discards the other.
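The arbitrariness is easy to see with a resolver like the lww_resolve above (repeated here so the snippet runs standalone; the timestamps are illustrative):

```python
def lww_resolve(versions: list[tuple[int, bytes]]) -> bytes:
    # Keep the value with the largest timestamp; ties broken by value.
    return max(versions, key=lambda v: (v[0], v[1]))[1]

# Two genuinely concurrent cart writes, 500 microseconds apart on the wall clock.
alice = (1_700_000_000_000_000, b'["milk"]')   # accepted by replica X
bob   = (1_700_000_000_000_500, b'["bread"]')  # accepted by replica Y

# Whichever clock happened to read later wins; the other write vanishes.
assert lww_resolve([alice, bob]) == b'["bread"]'
assert lww_resolve([bob, alice]) == b'["bread"]'   # discovery order is irrelevant
```

Neither client did anything wrong; Alice's write is simply gone, with no error anywhere.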

When LWW is a bad idea

Clock skew. Your cluster runs NTP, but drift can still reach tens of milliseconds under load. Worse, a misconfigured node might be seconds or minutes off. LWW says: whichever node has the faster-running clock "wins" all conflicts, even if its writes are actually older in wall-time. A single bad VM with a skewed clock quietly shadows other nodes' updates.

Concurrent counter increments. You and I both read a counter at 100, both write back 101 (independently, within milliseconds). LWW picks one. The counter goes to 101 instead of 102. Every concurrent increment workload — ad click counts, page view counts, likes — suffers this. The only fix with LWW is to route every increment through a single node (defeating the point of leaderless) or to use a different data structure (CRDT counters).

Shopping cart adds. Exactly the scenario at the top of this chapter. Alice adds milk; Bob adds bread; both intend their item to be in the cart. LWW keeps whichever has the later timestamp and silently loses the other. The customer notices only when they check out and discover the missing item.

Collaborative edits. Two people editing the same document field, say a shared title. Both type, both save. LWW picks one and discards the other's keystrokes. Real collaborative editors (Google Docs, Notion) do not use LWW; they use operational transforms or CRDTs precisely because LWW destroys co-editing.

Multi-device user state. You tweak a setting on your phone; your laptop (which has a cached older version) syncs a slightly-stale version back a few seconds later. Without causal tracking, LWW on the laptop's timestamp can overwrite your phone's newer change.

The common thread: any workload where two independent writers can legitimately target the same key concurrently is a workload where LWW silently corrupts data. The user-visible symptom is not an error; the user-visible symptom is that their write disappears. This is the single worst class of bug a database can have, and LWW produces it by design.

When LWW is fine

LWW is not universally bad; it is a specific trade-off with specific applicability.

Single-writer-per-key workloads. If for any given key only one process ever writes it, concurrent writes to that key never happen, and LWW behaves exactly like leader-based replication would. Telemetry that is sharded per-device to per-device keys is a good example: device X writes telemetry:X, device Y writes telemetry:Y, no contention. Cassandra's column-per-row model is designed around this — different columns are effectively different keys, so two writers updating different columns of the same row do not conflict.

User settings where newer truly is newer. A user configuring their preferences generally does so from one device at a time; the moments of real concurrency are vanishingly rare, and the cost of occasionally losing a milliseconds-apart preference flip is tiny.

Cache layers. Memcached and Redis use LWW-like semantics (actually, last-write-wins-with-expiration). If a cache entry is lost, the worst case is a re-fetch from the source of truth. Cache layers accept occasional inconsistency by design.

Immutable append-only logs. If every write creates a new key (say events:uuid), no key is ever written twice, and the conflict question never arises. Event-sourced architectures lean on this.

The question to ask your workload is: are there keys that multiple independent writers can update concurrently, and is silently losing one of them acceptable? If "yes" and "acceptable", use LWW. If "yes" and "unacceptable", you need vector clocks or CRDTs.

Vector clocks — the formal alternative

Vector clocks, independently proposed by Fidge (1988) and Mattern (1989), answer a precise question: given two versions of a value, did one causally precede the other, or are they concurrent? LWW ignores this question and imposes a total order. Vector clocks answer it honestly.

The data structure is a dictionary mapping node identifiers to counters: {node_id: counter}. Every version stored in the database carries one. The rules:

  • On a local write, a node increments its own entry and leaves every other entry untouched.
  • Clocks attached to incoming versions are carried along verbatim.
  • A write that supersedes previously observed versions carries the coordinate-wise max of their clocks, plus the writer's own bump.

Comparing two vector clocks a and b:

  • a dominates b — every entry of a is greater than or equal to the corresponding entry of b, and at least one is strictly greater. a happened-after b, so b can safely be discarded.
  • b dominates a — the symmetric case; a can be discarded.
  • Neither dominates — the versions are concurrent. Both must be preserved as siblings.

The last case is the whole point. LWW forces a tie-breaker; vector clocks accept that there is no legitimate tie-breaker and defer to the application.

Why this is correct and LWW is not: causality is a partial order, not a total order. Two events that never observed each other are genuinely incomparable. A system that pretends otherwise — by comparing timestamps or any other total-order tag — is telling you a lie about which event happened first. Vector clocks tell the truth: "these are concurrent; you, the application, know what they mean better than I do."

Python implementation

The four core operations in about twenty lines of plain Python:

# vector_clock.py
VClock = dict[str, int]

def dominates(a: VClock, b: VClock) -> bool:
    """True if a happened-after b (and they are not equal)."""
    all_keys = set(a) | set(b)
    every_ge = all(a.get(k, 0) >= b.get(k, 0) for k in all_keys)
    some_gt  = any(a.get(k, 0) >  b.get(k, 0) for k in all_keys)
    return every_ge and some_gt

def concurrent(a: VClock, b: VClock) -> bool:
    """True if neither happened-before the other."""
    return not dominates(a, b) and not dominates(b, a) and a != b

def merge(a: VClock, b: VClock) -> VClock:
    """Coordinate-wise max — the clock for a version that observed both."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def bump(vc: VClock, node_id: str) -> VClock:
    """Increment this node's entry — called on every local write."""
    new = dict(vc)
    new[node_id] = new.get(node_id, 0) + 1
    return new

About twenty lines. The entire semantics of causality tracking fits in four small functions. dominates answers "can we safely discard the older version?" concurrent answers "must we preserve both?" merge answers "what is the clock for a value that supersedes two siblings?" bump is how nodes signal they have accepted a write.

Integrating with a replica: every stored value is (value, vclock). On read, return all versions whose clocks are not dominated by any other in the response set. On write, the client provides the vclock it observed; the server bumps its own entry and stores (new_value, bumped_vc).
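
A toy sketch of that contract (the Replica class is hypothetical; dominates is the vector_clock.py function, inlined so the snippet is self-contained):

```python
VClock = dict[str, int]

def dominates(a: VClock, b: VClock) -> bool:
    keys = set(a) | set(b)
    return (all(a.get(k, 0) >= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) > b.get(k, 0) for k in keys))

class Replica:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.siblings: list[tuple[bytes, VClock]] = []

    def read(self) -> list[tuple[bytes, VClock]]:
        # Surface only versions that no other stored version dominates.
        return [(v, vc) for v, vc in self.siblings
                if not any(dominates(other, vc) for _, other in self.siblings)]

    def write(self, value: bytes, context: VClock) -> None:
        # Bump our own entry past anything we have stored, then drop
        # every sibling the new clock supersedes.
        vc = dict(context)
        highest = max((s[1].get(self.node_id, 0) for s in self.siblings), default=0)
        vc[self.node_id] = max(vc.get(self.node_id, 0), highest) + 1
        self.siblings = [s for s in self.siblings
                         if not dominates(vc, s[1])] + [(value, vc)]

# A client reads, gets the version and its clock, and writes back with it:
r = Replica("X")
r.write(b'["milk"]', {})             # blind first write -> clock {X: 1}
(value, ctx), = r.read()
r.write(b'["milk","bread"]', ctx)    # carries the context; supersedes it
```

On a real cluster, siblings accumulate when different replicas accept writes independently and later exchange them via anti-entropy; on a single replica, each context-carrying write supersedes what it read.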

Worked example — two concurrent writes

Walk through the shopping-cart scenario step by step, with concrete vector clocks.

Initial state, empty cart on both replicas X and Y: vc = {}, value = [].

Step 1: client Alice writes ["milk"] to replica X. X bumps its entry:

  • X now holds (["milk"], {X: 1}). Y still holds ([], {}).

Step 2: network partition isolates Y from X. Client Bob writes ["bread"] to replica Y. Y has never seen Alice's write; its stored vc is still {}. Y bumps:

  • Y now holds (["bread"], {Y: 1}).

Step 3: partition heals. Anti-entropy or read repair exchanges data. A reconciling node compares vc_A = {X: 1} and vc_B = {Y: 1}:

  • Neither clock dominates the other — each is ahead in its own entry. The writes are concurrent; both versions are kept as siblings.

Step 4: Alice reads her cart. The coordinator returns two versions: (["milk"], {X:1}) and (["bread"], {Y:1}). The cart application (not the database) knows that cart items should be merged by set union. It produces:

  • value = ["milk", "bread"], causal context merge({X:1}, {Y:1}) = {X:1, Y:1}.

Alice writes this back through some replica, say X. X bumps its own entry:

  • Stored version: (["milk", "bread"], {X: 2, Y: 1}).

Step 5: future reads see one version. {X: 2, Y: 1} dominates both {X: 1} and {Y: 1} (greater-or-equal on every entry, strictly greater on at least one). The siblings can be garbage-collected; the merged value carries forward.
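
The five steps can be replayed with the vector_clock.py functions (dominates and merge are repeated here so the snippet runs standalone):

```python
VClock = dict[str, int]

def dominates(a: VClock, b: VClock) -> bool:
    keys = set(a) | set(b)
    return (all(a.get(k, 0) >= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) > b.get(k, 0) for k in keys))

def merge(a: VClock, b: VClock) -> VClock:
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

vc_a = {"X": 1}   # Alice's milk write on X
vc_b = {"Y": 1}   # Bob's bread write on Y, during the partition

# Step 3: neither dominates — the writes are concurrent siblings.
assert not dominates(vc_a, vc_b) and not dominates(vc_b, vc_a)

# Steps 4-5: the app merges the contexts; X bumps its entry on write-back.
vc_final = merge(vc_a, vc_b)
vc_final["X"] += 1
assert vc_final == {"X": 2, "Y": 1}

# The merged version supersedes both siblings; they can be garbage-collected.
assert dominates(vc_final, vc_a) and dominates(vc_final, vc_b)
```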

[Figure: two concurrent cart writes with vector clocks]
Two replicas accept concurrent writes during a partition. Each bumps only its own entry: X writes milk with {X:1}, Y writes bread with {Y:1}. After the partition heals, neither clock dominates the other — the writes are concurrent — so both siblings surface to the application, which merges them semantically (set union for a cart) and writes back with the coordinate-wise-max clock {X:2, Y:1} (the extra X bump coming from the merge-writing coordinator).

Note what did not happen. LWW would have compared the wall-clock timestamps of the two writes and picked whichever was microseconds later, silently discarding the other. Vector clocks have no such tie-breaker; they preserve both and hand the resolution to the application, which has semantic knowledge the database lacks.

Client-side responsibilities with vector clocks

Vector clocks shift work to the application. The client must be prepared to:

Receive multiple versions on read. The common case is one version (no concurrent writers). Occasionally two. Very rarely more. The client code must handle the list, not assume a single value.

Merge semantically. The merge is type-specific:

  • Carts and other item sets — set union (plus tombstones if removals must survive merges).
  • Counters — sum per-writer contributions; taking the max of totals would lose increments.
  • Single-value registers — an application rule: prefer one source, concatenate, or surface both versions to the user.

Write back the merge. After merging siblings, the client performs a write whose vector clock is merge(vc_1, vc_2, ..., vc_n). The server bumps its own entry and stores the result. This "heals" the divergence — future reads see one version, not siblings, because the merged clock dominates every sibling's clock.
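
The read-merge-write-back cycle for the cart, sketched end to end (resolve_cart is a hypothetical application-side helper; merge is the vector_clock.py function):

```python
VClock = dict[str, int]

def merge(a: VClock, b: VClock) -> VClock:
    # Coordinate-wise max, as in vector_clock.py.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def resolve_cart(siblings: list[tuple[set, VClock]]) -> tuple[set, VClock]:
    """Semantic merge for a cart: union the items, merge the causal contexts.

    The returned context goes with the write-back, so the server knows the
    new version supersedes every sibling it was built from."""
    items: set = set()
    context: VClock = {}
    for value, vc in siblings:
        items |= value
        context = merge(context, vc)
    return items, context

merged_items, ctx = resolve_cart([({"milk"}, {"X": 1}), ({"bread"}, {"Y": 1})])
assert merged_items == {"milk", "bread"}
assert ctx == {"X": 1, "Y": 1}
```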

Carry vector clocks on reads to writes. Riak exposed vector clocks to clients as an opaque token (the X-Riak-Vclock header). The client read, got (value, vclock), and passed vclock along with the modified value on the subsequent write. This "causal context" is how the server knows which versions the write is intended to supersede. A write without a vclock is a blind write — the server cannot tell if it supersedes old siblings or creates a new concurrent one.

Cassandra does not expose vector clocks to clients; the entire model is LWW with timestamps (per-column, not per-row). DynamoDB-the-service exposes neither but provides conditional writes (ConditionExpression: attribute_version = :expected) as a separate mechanism for optimistic concurrency control on individual keys. These are three different philosophies about where to put the merge responsibility.

Hybrid logical clocks (HLCs)

A middle ground that deserves mention. Hybrid logical clocks, introduced by Kulkarni et al. (2014) and deployed by CockroachDB and YugabyteDB, combine physical wall-clock time with a logical counter.

An HLC is a pair (wall_time, logical). On a local event, the clock advances to the maximum of its current wall-time and the physical clock's reading; if it matches an existing HLC, the logical counter bumps. On receiving a message with a higher HLC, the local HLC jumps to the received one's wall-time (plus logical tie-break).
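
A minimal sketch of those update rules (the names are illustrative; the physical clock reading is passed in explicitly so the example is deterministic):

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class HLC:
    wall: int     # physical component, e.g. microseconds since epoch
    logical: int  # tie-breaking counter

def local_event(clock: HLC, phys: int) -> HLC:
    # Advance to the physical clock if it is ahead; otherwise bump logical.
    if phys > clock.wall:
        return HLC(phys, 0)
    return HLC(clock.wall, clock.logical + 1)

def recv_event(clock: HLC, msg: HLC, phys: int) -> HLC:
    # Jump to the largest wall time seen anywhere; bump logical on ties.
    wall = max(clock.wall, msg.wall, phys)
    if wall == clock.wall and wall == msg.wall:
        return HLC(wall, max(clock.logical, msg.logical) + 1)
    if wall == clock.wall:
        return HLC(wall, clock.logical + 1)
    if wall == msg.wall:
        return HLC(wall, msg.logical + 1)
    return HLC(wall, 0)
```

A node whose physical clock reads 150 and receives a message stamped (200, 3) moves to (200, 4) — never behind anything it has seen, never far ahead of real time.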

The result:

  • Timestamps stay close to physical wall time (within the clock-skew bound), so they remain meaningful to humans and to monitoring.
  • Causality is respected: if A happened-before B, then HLC(A) < HLC(B).
  • The clock is a constant-size pair of integers, unlike a vector clock, whose size grows with the number of nodes.

HLCs are widely used for timestamp ordering in transaction managers (CockroachDB's distributed SQL) and for snapshot isolation timestamps (YugabyteDB). They are used less often as conflict-resolution metadata in the Dynamo-paper sense.

Why HLCs are not a substitute for vector clocks: HLCs still impose a total order. Two truly concurrent writes get two distinct HLCs, one larger than the other, and "the larger wins" is still LWW — just with a cleverer clock. Concurrent writes are still silently lost. HLCs shine when you want loosely-synchronised wall-time timestamps for ordering and auditing; they do not solve the data-loss problem that motivated vector clocks.

CRDTs — the next level

Vector clocks detect conflicts; the application still has to merge. For many data types the merge is mechanical and repetitive — set union, counter sum, max, register last-writer-wins. CRDTs (conflict-free replicated data types) bake the merge into the data structure itself, so the database can merge without asking the application.

A G-counter (grow-only counter) stores a per-node counter; the total is the sum; the merge is element-wise max. Two concurrent increments converge automatically because each node's contribution is tracked separately.
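
A sketch under those definitions (the helper names are illustrative; note the merge is the same coordinate-wise max used by vector clocks):

```python
# G-counter: one entry per node; the value is the sum; merge is element-wise max.
GCounter = dict[str, int]

def increment(c: GCounter, node_id: str) -> GCounter:
    new = dict(c)
    new[node_id] = new.get(node_id, 0) + 1
    return new

def value(c: GCounter) -> int:
    return sum(c.values())

def merge(a: GCounter, b: GCounter) -> GCounter:
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

# Two concurrent increments on different nodes both survive the merge —
# exactly the case where LWW would lose one.
x = increment({}, "X")
y = increment({}, "Y")
assert value(merge(x, y)) == 2
```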

An OR-set (observed-remove set) tracks add and remove operations with unique tags; concurrent adds and removes converge with well-defined semantics.

Chapter 83 covers CRDTs in depth. For now, the relationship to vector clocks: CRDTs do not replace vector clocks, they complement them. A CRDT-based system still tracks causality (often with vector clocks or dotted version vectors); the difference is that the merge step is deterministic and mechanical rather than a callback into the application.

The partition scenario in detail

Priya runs a kirana store in Bengaluru. She and her business partner Arjun share a shopping list for their wholesale buys; the list lives on a leaderless KV store with N=3.

Day 1, 14:00 IST. The list is empty. Both replicas X (data centre Mumbai) and Y (data centre Chennai) agree: value = [], vc = {}.

14:05. Priya adds "rice 25kg" from her shop's terminal; request lands on replica X. X bumps its entry:

  • X now holds (["rice 25kg"], {X: 1}).
  • Y still holds ([], {}). Gossip has not yet caught up.

14:05:30. A fibre cut partitions Mumbai from Chennai. Replicas X and Y can no longer see each other.

14:06. Arjun, sitting in Chennai, opens the app. His read routes to Y (local DC preference). He sees an empty list — Y never got the rice update. He adds "atta 10kg":

  • Y now holds (["atta 10kg"], {Y: 1}).

14:08. Arjun adds "sugar 20kg":

  • Y now holds (["atta 10kg", "sugar 20kg"], {Y: 2}).

14:15. Partition heals. Anti-entropy between X and Y kicks in. Merkle-tree comparison flags the shared key as divergent. X and Y exchange versions:

  • X offers (["rice 25kg"], {X: 1}).
  • Y offers (["atta 10kg", "sugar 20kg"], {Y: 2}).

Comparison: dominates({X:1}, {Y:2})? X:1 >= X:0 true, Y:0 >= Y:2 FALSE — no. dominates({Y:2}, {X:1})? X:0 >= X:1 FALSE — no. Neither dominates; they are concurrent. Both replicas now hold both siblings.

14:16. Priya opens the app on her phone in Mumbai. The coordinator reads from R=2 replicas, gets both siblings, returns both to the client. The kirana app knows that a shopping list is a set; it merges:

  • value_merged = ["rice 25kg", "atta 10kg", "sugar 20kg"]
  • vc_merged = merge({X:1}, {Y:2}) = {X:1, Y:2}

Priya sees the full list on her screen. She taps save. The write goes through replica X; X bumps its entry:

  • Final stored version: (["rice 25kg", "atta 10kg", "sugar 20kg"], {X:2, Y:2}).

14:17. Arjun reads in Chennai. {X:2, Y:2} dominates both {X:1} and {Y:2} (every entry greater-or-equal, at least one strictly greater). The siblings are garbage-collected on the next background pass. Arjun sees one unified list. The kirana is ready for tomorrow's wholesale run.

What LWW would have done. Alternative universe: the database uses LWW by wall-clock timestamp. Arjun's ["atta 10kg", "sugar 20kg"] write at 14:08 has a later timestamp than Priya's ["rice 25kg"] at 14:05. On reconciliation, LWW keeps Arjun's version and discards Priya's. The morning run forgets the rice. The kirana is short 25kg of its main staple. The customers grumble.

What vector clocks bought. The guarantee that concurrent intent is preserved. Both Priya's and Arjun's contributions survive the partition; the application, which knows what a shopping list means, merges them sensibly. The cost: client code that handles 2+ siblings and knows how to merge them. For a shopping list, set union is six lines of Python.

Common confusions

  • "Vector clocks resolve conflicts." They only detect them — a clock comparison says two versions are concurrent; the application (or a CRDT) still has to merge them.
  • "HLCs make vector clocks unnecessary." An HLC still imposes a total order, so one of two concurrent writes still loses — LWW with a cleverer clock.
  • "CRDTs replace vector clocks." They complement them: CRDTs still track causality underneath; what they mechanise is the merge step.
  • "LWW is safe once clocks are perfectly synchronised." Perfect clocks do not help — two genuinely concurrent writes still exist, and LWW still discards one of them by design.

Going deeper

Lamport clocks — the precursor

Lamport's 1978 paper "Time, Clocks, and the Ordering of Events in a Distributed System" is the foundational work. A Lamport clock is a single integer per node; every event increments it; every message carries the sender's clock; receivers take max(local, received) + 1. The result is a scalar logical time that respects causality: if A happens-before B, then LT(A) < LT(B). The converse is not guaranteed — two events can have LT(A) < LT(B) and be concurrent.

Lamport clocks give a total order when you need one (break ties by node ID) and detect some causal violations, but they cannot distinguish "older" from "concurrent". That distinction is what motivated vector clocks ten years later.
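
The scalar rules in code (a minimal illustration; clocks are passed around explicitly):

```python
def tick(lt: int) -> int:
    # Local event or send: count the event.
    return lt + 1

def recv(lt: int, msg_lt: int) -> int:
    # Receive: jump past everything the sender had seen, then count the receive.
    return max(lt, msg_lt) + 1

# Causality implies ordering: A (a send) happens-before B (its receipt)...
a = tick(0)        # A on node 1: LT = 1
b = recv(5, a)     # B on node 2, whose local clock was 5: LT = 6 > LT(A)
assert a < b

# ...but ordering does not imply causality: C never saw A or B, yet its
# scalar clock happens to fall between theirs.
c = tick(3)        # event on node 3: LT = 4
assert a < c < b   # looks ordered; actually concurrent with both
```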

Dotted version vectors (DVV)

Riak 2.0 replaced vector clocks with dotted version vectors, introduced by Preguiça et al. (2010). The motivation: classical vector clocks grow with the number of client sessions writing to a key, not the number of replicas, because each client's write bumps the coordinator's entry. Long-running clients with many writes to hot keys can produce pathologically long vclocks.

DVVs decompose the causal context into a per-replica "dot" (a single (node, counter) marking this particular write) plus a summary of observed history. The metadata stays bounded by the number of replicas regardless of client activity. The paper "Dotted Version Vectors: Logical Clocks for Optimistic Replication" is the reference; Riak 2.0's release notes describe the engineering.

Where this leads next

Chapter 83 covers CRDTs — convergence without coordination: data structures whose merge is commutative, associative, and idempotent by construction, so replicas converge without application-level merge code. Chapter 84 examines the leaky abstraction of eventual consistency — the corner cases where even well-designed merges produce user-visible weirdness, and what application-layer mitigations (read-your-writes, monotonic-reads, causal sessions) look like.

References

  1. Fidge, Timestamps in Message-Passing Systems That Preserve the Partial Ordering, 11th Australian Computer Science Conference, 1988 — one of the two independent introductions of vector clocks. Defines the element-wise increment, element-wise comparison, and the happened-before relation that motivates the whole edifice.
  2. Mattern, Virtual Time and Global States of Distributed Systems, Parallel and Distributed Algorithms, 1989 — the other independent introduction, with a deeper treatment of causal ordering and global snapshots. Together with Fidge 1988 the foundational citation pair.
  3. Lamport, Time, Clocks, and the Ordering of Events in a Distributed System, Communications of the ACM, 1978 — the predecessor paper. Introduces logical clocks and the happened-before relation that vector clocks later extended from scalars to vectors.
  4. DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, SOSP 2007 — the paper that brought vector clocks into production practice at internet scale. Section 4.4 covers data versioning; section 6.3 reports on the pruning strategy used in Amazon's deployment.
  5. Preguiça et al., Dotted Version Vectors: Logical Clocks for Optimistic Replication, arXiv 2010 — the DVV paper that Riak 2.0 adopted. Formalises the client-session problem with classical vector clocks and presents a compact alternative whose size is bounded by the replica count rather than the client-write count.
  6. Kleppmann, Designing Data-Intensive Applications, Chapter 5 — Replication, O'Reilly 2017 — the clearest pedagogical treatment of leaderless conflict resolution, LWW pitfalls, vector clocks and version vectors, with worked examples and discussion of dotted version vectors.