Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

LWW registers and their gotchas

Riya, a senior engineer at PaisaCard, is staring at a support ticket. A user updated their delivery address from "MG Road, Bengaluru" to "Indiranagar, Bengaluru" at 14:32:17 IST yesterday. The mobile app confirmed. The web dashboard, refreshed three minutes later, still showed MG Road. The user updated again on the web, this time to "Koramangala". This morning the courier delivered to MG Road. Three writes, three regions, three replicas, one wall-clock-based last-writer-wins register holding the address, and the answer the system converged on is the one nobody asked for. The cause is not a bug in any single line of code; it is the LWW register itself, the smallest CRDT in the catalogue, and the one whose gotchas catch every team eventually. Cassandra defaults to LWW. DynamoDB defaults to LWW. Riak offers LWW as a one-line configuration switch, and operators reach for it. It is the simplest convergent register and the most common source of silent data loss in distributed systems.

A last-writer-wins register stores a (value, timestamp) pair; merge keeps the entry with the larger timestamp (ties broken by replica ID). It is a CRDT, convergent by construction, and its merge is the join of a join-semilattice. The gotchas are not in the algebra: they are that wall-clock timestamps lie under skew, that concurrent writes silently lose data with no audit trail, and that a write issued milliseconds after another can be erased forever because some replica's clock ran fast yesterday. LWW is correct as a convergence mechanism and almost always wrong as a conflict resolution policy.

The lattice — a (value, timestamp) pair, max-by-timestamp

An LWW register's state is a single pair (v, t) where v is the value and t is a timestamp. The merge is:

merge((v_a, t_a), (v_b, t_b)) = (v_a, t_a) if t_a > t_b
                                (v_b, t_b) if t_b > t_a
                                (v_a, t_a) if t_a == t_b and id_a > id_b   # tie-break

The lattice is (V × T, ≤_t), where states compare by their timestamp component; on its own that comparison is only a preorder (distinct values can share a timestamp), and the replica-ID tie-break is what turns it into a total order. The merge is associative, commutative, and idempotent by inspection. Why this counts as a CRDT join: a join-semilattice requires a least upper bound for any two elements, and max(t_a, t_b) with ID tie-break delivers exactly that; there is always one designated "winner" between any two states, and merging in any order, with any duplication, arrives at the same final pair. Shapiro et al.'s 2011 convergence theorem then guarantees that all replicas agree regardless of message-arrival order.
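
A quick way to convince yourself of those three laws is to check them on concrete states. A minimal property check, assuming the merge defined above; the exhaustive loop is an illustration, not a proof:

# semilattice_check.py -- sanity-check that max-by-(timestamp, replica_id) is a join
# Illustrative property test over a handful of states; not a formal proof.
from itertools import product

def merge(a, b):
    # a and b are (value, timestamp, replica_id) triples
    return a if (a[1], a[2]) >= (b[1], b[2]) else b

states = [("MG Road", 10, "A"), ("Indiranagar", 14, "B"),
          ("Koramangala", 12, "C"), ("HSR Layout", 14, "A")]   # includes a timestamp tie

for x, y, z in product(states, repeat=3):
    assert merge(x, x) == x                                    # idempotent
    assert merge(x, y) == merge(y, x)                          # commutative
    assert merge(merge(x, y), z) == merge(x, merge(y, z))      # associative

print("merge is idempotent, commutative, and associative on all sampled states")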

The contract LWW signs with you is precisely: all replicas eventually agree on a single (value, timestamp) pair. Nothing more. It does not promise the agreed-on value reflects user intent, causality, or anything about which write happened first in real time. It promises convergence — full stop.

Figure (illustrative): LWW register merge as max-of-timestamp. Three replicas hold a delivery address: A at the Mumbai edge has ("MG Road", t=10), B at the Bangalore edge has ("Indiranagar", t=14), C at the Delhi edge has ("Koramangala", t=12). After gossip, all three converge on ("Indiranagar", t=14) because t=14 is the maximum; "MG Road" and "Koramangala" are dropped without audit. Had C's clock read 15 instead of 12, "Koramangala" would have won: same writes, different outcome. The LWW lattice is a totally-ordered set of (value, timestamp) pairs; merge is max-of-timestamp, and the loser's value vanishes. Convergence is mathematically guaranteed; the user's intent is not.

Why wall-clock LWW is dangerous: the three failure modes

The convergence proof says nothing about correctness of intent. Three independent failure modes turn LWW from "simple convergent CRDT" into "silent data-loss machine".

1. Clock skew silently inverts write order

Two replicas in two AZs typically run NTP and stay within ±5 ms of each other on a good day. On a bad day — a leap second, an NTP-server flap, a VM live-migration that paused the guest clock for 1.8 seconds — the skew can balloon to seconds. Why this matters for LWW: the register's merge function does not know about real time, only about the timestamp written into the pair. If replica A's clock is 9 ms ahead and the user issues write1 to A at "real time T" followed by write2 to B at "real time T+5ms", A's write1 carries timestamp T+9ms and B's write2 carries timestamp T+5ms. After gossip, A's earlier write wins because it was timestamped later. The user issued write2 chronologically last; the system silently picked write1.

In normal operation this hits a tiny fraction of writes, perhaps 0.01% under healthy NTP. During a network event, an NTP server replacement, or a VM migration, it can spike to 1–10% for a window. Cassandra's mailing list has multiple threads documenting "writes that disappear" traced back to drift between coordinator nodes. The defense is never to trust local wall-clocks for ordering; use a hybrid logical clock instead, which combines a logical counter with the wall clock so that even when wall clocks disagree, the logical component preserves the order of writes that causally observe one another.

Figure (illustrative, not measured data): clock skew inverts LWW write order. Two replica timelines are drawn against real wall-clock time: replica A's clock runs 9 ms ahead, replica B's clock is correct. At real time T, A receives write("MG Road") and stamps it T+9ms; at real time T+5ms, B receives the chronologically later write("Indiranagar") and stamps it T+5ms. After gossip, merge picks max(T+9ms, T+5ms) = T+9ms, so the write that happened first in real time wins and the newer write is silently dropped.

2. Concurrent writes lose data with no audit trail

Two users simultaneously edit the same field — a doc title, a cart item count, a user-profile field — from different regions. Both writes go to different replicas; both timestamps are within microseconds of each other. The merge picks one. The other write is silently discarded — there is no event in any log, no callback to the application, no flag on the survivor saying "I beat a contemporary".

Compare this to a vector-clock-based register, where the merge would detect concurrency and surface both versions to the application as siblings (Riak's "siblings" model). LWW gives you one answer; the user who lost has no way to know they lost. Why this is the most insidious LWW pitfall: the silent-loss property is not detectable from the application's view, because every replica reports the same converged state. The lost write does not show up as a conflict, an error, or a slow operation — it just isn't there. By the time a user calls support saying "I changed my address yesterday and the courier went to the old one", the audit trail is gone.
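
To make that contrast concrete, here is a minimal sketch of a multi-value register in the spirit of Riak's siblings model, using a version vector to detect concurrency. The class and field names are illustrative, not Riak's actual API; assume one version-vector entry per replica:

# mv_register.py -- illustrative multi-value register: concurrent writes survive as siblings
def vv_leq(a: dict, b: dict) -> bool:
    """a happened-before-or-equals b iff every counter in a is <= the matching one in b."""
    return all(v <= b.get(k, 0) for k, v in a.items())

class MVRegister:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.siblings = []                     # list of (value, version_vector)

    def write(self, value):
        # A new local write dominates every sibling this replica currently knows about.
        vv = {}
        for _, sib_vv in self.siblings:
            for k, v in sib_vv.items():
                vv[k] = max(vv.get(k, 0), v)
        vv[self.replica_id] = vv.get(self.replica_id, 0) + 1
        self.siblings = [(value, vv)]

    def merge(self, other: "MVRegister"):
        combined = self.siblings + other.siblings
        # Keep every entry not strictly dominated by another; concurrent entries all survive.
        kept = []
        for val, vv in combined:
            dominated = any(vv_leq(vv, o_vv) and vv != o_vv for _, o_vv in combined)
            if not dominated and (val, vv) not in kept:
                kept.append((val, vv))
        self.siblings = kept

A, B = MVRegister("A"), MVRegister("B")
A.write("Address-from-mobile")                 # concurrent with B's write
B.write("Address-from-web")
A.merge(B)
print([v for v, _ in A.siblings])              # both survive; the application must resolve them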

3. The "future write" problem and clock rollback

Replica A's clock is running 200 ms fast (a VM-migration glitch). It writes (v_old, T+200ms). Replica B writes (v_correct, T+5ms) 5 ms later in real time. NTP corrects A's clock back to T. Now B's write, chronologically last, is forever stamped earlier than A's stale value. Even after A's clock is corrected, the LWW register on every replica has converged on v_old and there is no rollback mechanism. Cassandra's USING TIMESTAMP clause lets the application supply an explicit timestamp; one of its practical uses is exactly this kind of recovery, re-issuing a write with a deliberately larger timestamp to beat a "future"-stamped value.

A runnable LWW register, with each gotcha demonstrated

The simulator below implements the LWW register and runs three failure scenarios — clock skew, concurrent writes, and a clock rollback — to make each pathology concrete and reproducible.

# lww_register.py — LWW register with three demonstrative failure scenarios
from dataclasses import dataclass
from typing import Any
import copy

@dataclass(order=False)
class LWWRegister:
    value: Any = None
    timestamp: float = 0.0
    replica_id: str = ""

    def write(self, v, ts: float, rid: str):
        if (ts, rid) > (self.timestamp, self.replica_id):
            self.value, self.timestamp, self.replica_id = v, ts, rid

    def merge(self, other: "LWWRegister"):
        if (other.timestamp, other.replica_id) > (self.timestamp, self.replica_id):
            self.value, self.timestamp, self.replica_id = other.value, other.timestamp, other.replica_id

# --- Scenario 1: clock skew silently inverts order
def scenario_clock_skew():
    A, B = LWWRegister(replica_id="A"), LWWRegister(replica_id="B")
    real_T = 1000.0                              # real wall-clock at user-event-1
    A.write("MG Road",   ts=real_T + 0.009, rid="A")   # A's clock is +9ms
    B.write("Indiranagar", ts=real_T + 0.005, rid="B") # B writes 5ms LATER in real time
    A.merge(copy.deepcopy(B)); B.merge(copy.deepcopy(A))
    return A.value                                # "MG Road" — the OLDER write wins

# --- Scenario 2: concurrent writes — silent loss
def scenario_concurrent():
    A, B = LWWRegister(replica_id="A"), LWWRegister(replica_id="B")
    same_t = 1000.0
    A.write("Address-from-mobile",  ts=same_t, rid="A")
    B.write("Address-from-web",     ts=same_t, rid="B")
    A.merge(copy.deepcopy(B)); B.merge(copy.deepcopy(A))
    return A.value                                # "Address-from-web" — picked by replica-ID tie-break
                                                  # The user on mobile has NO indication they lost.

# --- Scenario 3: clock rollback — a "future" write erases later truth
def scenario_clock_rollback():
    A, B = LWWRegister(replica_id="A"), LWWRegister(replica_id="B")
    A.write("stale_address", ts=1000.200, rid="A")    # A's clock was running fast
    B.write("correct_address", ts=1000.005, rid="B")  # B wrote chronologically AFTER, but clock is now correct
    A.merge(copy.deepcopy(B)); B.merge(copy.deepcopy(A))
    return A.value                                    # "stale_address" — the future-stamped write wins permanently

if __name__ == "__main__":
    print("Clock skew (9ms ahead): the older write wins   →", scenario_clock_skew())
    print("Concurrent writes:      one is silently lost    →", scenario_concurrent())
    print("Clock rollback:         stale future write wins →", scenario_clock_rollback())

Sample run:

Clock skew (9ms ahead): the older write wins   → MG Road
Concurrent writes:      one is silently lost    → Address-from-web
Clock rollback:         stale future write wins → stale_address

Walk through the three losses:

  • Scenario 1 (clock skew). Real-time order was: write("MG Road") first, then write("Indiranagar") 5 ms later. But replica A's clock is 9 ms fast, so MG Road is stamped T+9ms and Indiranagar is stamped T+5ms. The merge keeps the larger timestamp, T+9ms, and the chronologically-later write is dropped. The user updated to Indiranagar; the system kept MG Road. Why this is not a bug in the LWW algorithm: the merge is doing exactly what the contract says and picks the pair with the larger timestamp. The bug is in trusting wall-clock timestamps under skew. A hybrid logical clock carried through the client would have stamped the second write at least as high as the first regardless of replica skew; detecting that two writes are genuinely concurrent takes a version vector.
  • Scenario 2 (concurrent writes). Both replicas record the same wall-clock timestamp. The replica-ID tie-break decides: replica B's value wins because "B" > "A" lexicographically. The mobile user gets no error, no conflict callback, no sibling versions to choose from. Their write is gone, indistinguishable from never having happened.
  • Scenario 3 (clock rollback). Replica A's clock was running fast and wrote stale_address with timestamp T+200ms. Then NTP corrected A. Replica B writes chronologically later, with the correct clock, and gets T+5ms. The merge picks T+200ms because that is mathematically larger. There is no way for the system to know that the larger timestamp is from a clock that has since been retracted — once written, it's permanent. Recovery requires manual intervention with USING TIMESTAMP in Cassandra or an explicit "force-write" tombstone.
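
Reusing LWWRegister and copy from the listing above (appended to lww_register.py), a recovery force-write could look like the sketch below. The explicit 1000.300 stamp is arbitrary, chosen only to outrank the stale 1000.200, in the same spirit as Cassandra's USING TIMESTAMP:

# Recovery after scenario 3: re-issue the correct value with an explicitly larger timestamp.
A, B = LWWRegister(replica_id="A"), LWWRegister(replica_id="B")
A.write("stale_address", ts=1000.200, rid="A")        # the future-stamped write
B.write("correct_address", ts=1000.005, rid="B")
A.merge(copy.deepcopy(B)); B.merge(copy.deepcopy(A))  # both replicas converge on "stale_address"

B.write("correct_address", ts=1000.300, rid="B")      # force-write with a deliberately larger stamp
A.merge(copy.deepcopy(B))
print(A.value)                                        # correct_address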

Run the same three scenarios with simpy and 1000 replicas instead of 2, and the loss rates do not improve — they often get worse, because the probability of at least one replica having a skewed clock at any moment grows with the cluster size.
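
A back-of-envelope calculation shows why. Assume, purely for illustration, that each replica independently has probability p of carrying meaningful skew at a given moment; the chance that at least one replica in an n-node cluster is skewed is 1 - (1 - p)^n. With p = 0.5% that is roughly 1% at n = 2 but over 99% at n = 1000, so "some clock is wrong right now" approaches a certainty as the cluster grows.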

Production stories: where LWW silently breaks systems

Cassandra's default LWW is the most-cited example. Cassandra's CQL writes carry a microsecond-precision wall-clock timestamp from the coordinator node, not the client. If two coordinators have skewed clocks and you UPDATE the same row through both, whichever coordinator's clock is ahead wins. Cassandra's operational documentation explicitly recommends keeping all nodes NTP-synchronised, but that is precisely the assumption that fails when you most need it (during a network event or a leap second). Cassandra ships USING TIMESTAMP to let the application supply its own clock if it has a better one (an HLC, or an external sequencer).

DynamoDB's vector-clock retreat. Amazon's original Dynamo paper described a vector-clock-based conflict resolution that surfaced concurrent writes as siblings to the application. Production DynamoDB (the managed service) defaults to LWW because the sibling-resolution callback was operationally complex and most teams just wanted "give me one answer". DeCandia et al.'s 2007 SOSP paper describes the trade-off explicitly. The Dynamo team's lesson — and the one this article inherits — is that LWW is the choice teams gravitate to, not because it is correct, but because the alternative is conceptually expensive.

Riak's LWW-vs-siblings choice. Riak ships both modes. The default is "siblings" (vector-clock-based), where the application gets a list of conflicting versions and resolves them with custom logic. Operators frequently switch to LWW for "simplicity"; the production failure mode is then exactly the three scenarios above. Riak's documentation explicitly warns: "If you choose LWW you are choosing convergence over correctness."

MealRush's "user preferences" LWW disaster. MealRush — fictional food-delivery startup — stored each user's preferred cuisine, default delivery address, and tip percentage in an LWW register per field, replicated across two regions for sub-100ms reads. During an NTP-server replacement in ap-south-1, the region drifted +180 ms for 4 minutes. During those 4 minutes, 78,000 users updated at least one field. After NTP healed, 11% of those users found their changes had reverted to a state from before their edit. The post-incident review counted ₹47 lakh in support cost (call-center load) and migrated delivery_address and payment_method (the two highest-stakes fields) from LWW to a CRDT-Map of (vector-clock-tagged) values. Tip percentage and cuisine preference stayed LWW because the team judged the silent-loss-rate acceptable for those fields.

KapitalKite's "user display name" field. KapitalKite — fictional discount stockbroker — stores the user's chosen display name as an LWW register. A user changing the display name on web and mobile within seconds of each other reliably gets their last edit. This is fine: the field is single-user, the only writer is the user themself, and concurrent writes from the same user are intentional retries the user is happy to lose. LWW is correct here. The lesson is not "LWW is bad" — it is "LWW is right for single-writer fields and wrong for multi-writer ones".

CricStream's "current viewer count" badge. CricStream — fictional cricket-streaming service — initially used an LWW register per match to display "viewers right now". Two edge regions kept overwriting each other's count during a final, with the badge flickering between 18M and 22M every gossip round. The team migrated to a G-Counter (additive across regions) and the flicker disappeared overnight. The lesson here is the inverse of the KapitalKite story: any field that is the sum of multiple regions' contributions is the wrong shape for an LWW register, regardless of clock skew, because the merge truncates rather than accumulates.

Common confusions

  • "LWW with NTP-synced clocks is fine." Even at sub-millisecond NTP skew, LWW silently loses concurrent writes (the replica-ID tie-break is not user-visible). And NTP itself is unreliable during the events you most need correctness — leap seconds, network partitions, server replacements. The fundamental issue is that wall-clock timestamps cannot represent concurrent operations — they impose a total order even where none exists. You can reduce the loss rate with NTP, but you cannot eliminate it.
  • "LWW is the same as last-modification-time on a filesystem." A filesystem's mtime is a single-writer property — only one process at a time has the file open for write. LWW registers in distributed systems are explicitly multi-writer, with no exclusion. The naïve transplant of mtime semantics to a distributed register is exactly where the silent-loss bugs come from.
  • "Hybrid logical clocks fix LWW." They fix the clock-skew problem, not the concurrent-write problem. HLC ensures that causally-ordered writes get monotone timestamps. But genuinely concurrent writes (two replicas with no causal relationship) still produce timestamps that need a tie-break, and LWW's "pick one" approach still loses the other write silently. HLC is necessary; it is not sufficient. To preserve concurrent writes you need vector clocks or version vectors.
  • "You can audit LWW losses by logging every write." The local replica logs only what it accepted; it does not see the writes that lost during merge. To audit losses you would need to log every (incoming-merge, current-state) pair and post-process them — Riak's siblings model essentially does this, but it requires the application to handle the resolution. Pure LWW gives you no after-the-fact view of what was overwritten.
  • "LWW-Set is the same as LWW-Register applied to set elements." Almost. An LWW-Set keeps a per-element timestamp, with adds and removes both timestamped; an element is in the set if its add-timestamp is greater than its remove-timestamp. This composes LWW per-element rather than for the entire set. It has the same gotchas as LWW-Register, multiplied by the number of elements — concurrent add and remove of the same element silently loses one of them depending on clock skew. The OR-Set (see G-Set, 2P-Set, OR-Set) avoids this by tagging each add uniquely.
  • "Cassandra timestamps come from the client, so LWW is fine." Cassandra's default is coordinator-side timestamps — the node receiving the write stamps it with its own clock. The application can override with USING TIMESTAMP, but the override is rarely used in practice because most teams don't have a logical clock to provide. The default remains the failure mode.
  • "LWW is fine because the loss rate is small." The loss rate is small on average. The conditional loss rate during the events you care about — NTP-server replacements, leap seconds, VM live-migrations, network partitions, AZ failovers — is one to three orders of magnitude higher. And the lost writes are silently lost: there is no log, no error, no callback. The "small loss rate" framing assumes you would notice if the rate jumped during an event. You will not.

Going deeper

The lattice: (V × T, ≤_t) and why total order is both correct and dangerous

LWW's lattice is a totally-ordered set: any two (v, t) pairs are comparable by (t, replica_id). This is the simplest possible non-trivial join-semilattice: because every pair of elements is comparable, the whole carrier set is one chain, and every join is simply one of the two elements. Convergence is trivial because the join never has to construct a new state. Why this total-order property is also the source of the problem: the lattice cannot represent concurrency. In a vector-clock-based register, two states may be incomparable (neither dominates the other), and the lattice's join is a third state representing both, preserving the concurrency. LWW's total order forces a winner even when none exists in the underlying causal model. The expressiveness of the lattice is what determines whether the CRDT preserves user intent.

Hybrid logical clocks as a partial fix

Kulkarni et al.'s 2014 HLC paper introduced a clock that combines wall-clock time with a logical counter. An HLC timestamp is (physical, counter); on every event, physical becomes max(local_wall_clock, largest_physical_observed), and the counter increments when the physical component fails to advance, otherwise it resets to zero. The result: causally-ordered events get monotonically-increasing HLC timestamps regardless of wall-clock skew. Genuinely concurrent events, however, still receive ordinary comparable timestamps, so the merge cannot distinguish concurrency from causality. CockroachDB uses HLC for its transaction timestamps; it bounds skew with NTP and treats any node whose clock drifts beyond a configured maximum offset as faulty. HLC turns LWW from "wall-clock LWW" into "causally-correct LWW", but it still imposes a total order via tie-break, so the concurrent-write loss problem remains.
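
A minimal sketch of the HLC update rules described above, following the send/receive algorithm in the paper; the client-carries-the-stamp flow in the example is an assumption made to mirror Scenario 1:

# hlc.py -- sketch of hybrid logical clock updates (after Kulkarni et al. 2014)
import time

class HLC:
    def __init__(self, now=time.time):
        self.now = now                # injectable wall clock so skew can be simulated
        self.l = 0.0                  # physical component: largest physical time observed
        self.c = 0                    # logical counter: breaks ties when l does not advance

    def send(self):
        """Local or send event: advance l to the wall clock if possible, else bump the counter."""
        l_old = self.l
        self.l = max(l_old, self.now())
        self.c = self.c + 1 if self.l == l_old else 0
        return (self.l, self.c)

    def recv(self, l_m, c_m):
        """Receive event: fold in the remote stamp so causally later events stamp higher."""
        l_old = self.l
        self.l = max(l_old, l_m, self.now())
        if self.l == l_old == l_m:
            self.c = max(self.c, c_m) + 1
        elif self.l == l_old:
            self.c += 1
        elif self.l == l_m:
            self.c = c_m + 1
        else:
            self.c = 0
        return (self.l, self.c)

# Scenario 1 revisited: A's clock is 9 ms fast, B handles the later write with a correct clock.
A, B = HLC(now=lambda: 1000.009), HLC(now=lambda: 1000.005)
stamp1 = A.send()                       # write("MG Road"); the client observes this stamp
stamp2 = B.recv(*stamp1)                # write("Indiranagar") carries the stamp the client saw
print(stamp1, stamp2, stamp1 < stamp2)  # (1000.009, 0) (1000.009, 1) True -- causality preserved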

LWW with bounded uncertainty (Spanner's TrueTime trick)

Google's Spanner takes the opposite approach: instead of trusting wall-clocks more, it makes the uncertainty explicit. TrueTime returns an interval [earliest, latest] rather than a single instant; a write is assigned the latest bound as its timestamp, and the coordinator then waits out the uncertainty (commit-wait) until its own earliest has passed that timestamp before acknowledging, which guarantees that timestamp order matches real-time order even under clock skew. The cost is commit-wait: every write blocks for the uncertainty bound (typically 1–7 ms) before returning. Spanner is the rare production system that achieves external consistency from wall-clock timestamps; most others either tolerate LWW losses or use logical clocks. Corbett et al.'s 2012 OSDI paper covers the design.
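
A toy model of the commit-wait rule, with TrueTime replaced by a stub that returns a fixed ±4 ms interval around the local clock; this is purely illustrative and is not Spanner's API:

# commit_wait.py -- toy model of Spanner-style commit-wait under an explicit uncertainty bound
import time

EPSILON = 0.004                                 # assumed uncertainty bound (4 ms), illustrative

def true_time_now():
    """Stub TrueTime: an [earliest, latest] interval around the local wall clock."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_write():
    """Assign the write the 'latest' bound, then wait until that instant is surely in the past."""
    _, latest = true_time_now()
    commit_ts = latest
    while true_time_now()[0] <= commit_ts:      # commit-wait: block until earliest > commit_ts
        time.sleep(0.0005)
    return commit_ts                            # any write committed afterwards stamps strictly later

start = time.time()
ts = commit_write()
print(f"committed at {ts:.6f} after waiting ~{(time.time() - start) * 1000:.1f} ms")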

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
python3 lww_register.py
# Try the variations:
# - Modify scenario_clock_skew to use HLC (composite of (physical, counter)) and observe the fix
# - Increase the skew to 1 second, then 10 seconds — the loss rate stays the same; the duration of vulnerability grows
# - Replace the LWW merge with a vector-clock merge that keeps siblings — observe the API surface change

Where this leads next

LWW is the floor of the CRDT register family. Climbing the lattice gives you registers that preserve more of what LWW silently drops:

Part 13 continues with sequence CRDTs (RGA, Logoot, YATA — used by collaborative editors), JSON CRDTs (Automerge, Yjs), and the formal limits of what CRDTs cannot model. Part 17's geo-distribution chapters revisit LWW under cross-region clock skew, where the wall-clock failure modes get sharper because typical inter-region NTP skew is an order of magnitude larger than intra-AZ skew.

References

  • Shapiro, M., Preguiça, N., Baquero, C., Zawirski, M. — "Conflict-free Replicated Data Types" (SSS 2011). LWW-Register is introduced in §3.3 alongside MV-Register.
  • Kulkarni, S., Demirbas, M., Madappa, D., Avva, B., Leone, M. — "Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases" (OPODIS 2014). The HLC construction that composes physical and logical components.
  • Corbett, J. et al. — "Spanner: Google's Globally-Distributed Database" (OSDI 2012). The TrueTime / commit-wait alternative to LWW.
  • DeCandia, G. et al. — "Dynamo: Amazon's Highly Available Key-Value Store" (SOSP 2007). The vector-clock-based conflict resolution that production DynamoDB later abandoned in favour of LWW.
  • Cassandra documentation — USING TIMESTAMP and the operator guidance about NTP synchronisation.
  • Riak documentation — siblings vs LWW conflict resolution; the explicit warning about LWW choosing convergence over correctness.
  • Bailis, P., Ghodsi, A. — "Eventual Consistency Today: Limitations, Extensions, and Beyond" (CACM 2013). The broader picture of what eventual consistency does and does not promise.
  • G-Set, 2P-Set, OR-Set — the set CRDTs whose tag-based design solves concurrency where LWW drops it.