Dotted version vectors

Riya is on-call at PaySetu when an alert fires: the per-key conflict-vector size on their Dynamo-style cart store has crossed 4 KB on a single shopping-cart key. The key has only 7 items in it. The vector has 312 entries — one for each unique client process that has written to the cart over the last six months, almost all of them long-dead browser sessions. Every read now ships back ~4 KB of clock metadata for ~80 bytes of cart data. Every write reads-modifies-writes the entire vector. The naive vector clock is correct — concurrency detection still works — but the vector has eaten the payload. Dotted version vectors (DVVs) are the construction that solves exactly this problem: keep one entry per server replica, not per client, plus a tiny per-write "dot" that records which event produced which value. The vector size becomes proportional to the replica count and stops growing with traffic.

A dotted version vector is a vector clock with two changes. First, only server replicas get a coordinate — clients are no longer first-class actors, so the vector size is bounded by your replication factor (typically 3). Second, every concrete value carries a dot — a (replica_id, counter) pair that uniquely identifies the write event that produced that value. Concurrent writes from the same client to different replicas land with different dots, and the merge layer can detect them without bloating the vector. This is the structure Riak ships in production as the vclock field on every key.

Why classic vector clocks bloat under client-direct writes

A textbook vector clock assumes a fixed, known set of processes. Every process that participates in the system gets one coordinate, and the vector dimension N is the number of processes. That assumption breaks the moment your data store accepts writes from short-lived clients — browser sessions, mobile-app instances, microservice pods that come up and tear down on every deploy. Each one has to be assigned a process id (otherwise its writes cannot be timestamped consistently), but each one also rapidly disappears, leaving its coordinate behind in every vector that recorded one of its writes. Over months, the vector accumulates thousands of zombie entries.

The original Dynamo paper acknowledges this: "in practice we have seen vector clocks with as many as 80 entries on rare occasions, when many clients update the same key concurrently." Riak engineers running production keys saw worse — vectors with hundreds of entries on heavily-written keys. The bloat compounds into a second pathology with a name: sibling explosion. Each entry in the vector corresponds to a potential branch of history. When a read returns multiple "concurrent" versions to the application, the application has to reconcile them. With 300 entries and asynchronous replication, a single key can generate dozens of sibling versions on every read — the application code is now spending more time reconciling history than serving requests.

Figure: Vector-clock bloat under client-direct writes vs DVV bounded size. Two panels compare vector growth under the same six-month workload. Left, "Classic vector clock — one entry per client process": 312 entries (3 servers plus 309 dead clients), roughly 4 KB of clock metadata for an 80-byte payload, growing linearly with traffic and never shrinking under naive rules. Right, "Dotted version vector — one entry per server replica": 3 entries (roughly 24 bytes of vector plus a 12-byte dot per value) for the same 80-byte payload, bounded by the replication factor.
Illustrative — the same 6-month write workload produces a 312-entry classic vector or a 3-entry DVV depending on which actors are first-class. DVVs make server replicas the actors and reduce clients to "the thing that triggered a write on a server", which is why the dimension stops growing.

Why pruning the classic vector does not work: you cannot simply drop "old" entries from a vector clock without breaking the strong clock condition. If you drop entry i from V(a) because process i has been dead for a month, then a future event b that arrives with V(b)[i] = 5 cannot be correctly compared against V(a) — you've lost the information that says whether a was concurrent with anything that ever happened on i. Pruning is correct only when every replica has acknowledged that it has seen all events on i up to some count, which requires a global consensus mechanism that the clock structure was supposed to avoid in the first place. DVVs sidestep this by never letting clients into the vector to begin with.
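
To see the failure concretely, here is a minimal sketch (hypothetical process names p and i) of how dropping a dead process's entry flips a correct "concurrent" verdict into a false "happened-before":

# prune_breaks_comparison.py: why you cannot just drop "dead" vector entries
def leq(a, b):
    # componentwise <=, treating missing entries as 0
    return all(a.get(k, 0) <= b.get(k, 0) for k in set(a) | set(b))

def compare(a, b):
    if leq(a, b) and not leq(b, a): return "a -> b"
    if leq(b, a) and not leq(a, b): return "b -> a"
    return "equal" if a == b else "concurrent"

# a has absorbed 4 events from the now-dead client i; b has seen more of p.
V_a = {"p": 1, "i": 4}
V_b = {"p": 2, "i": 1}
print(compare(V_a, V_b))          # concurrent (correct)

# Prune the dead entry i from V_a and the verdict silently flips.
V_a_pruned = {k: v for k, v in V_a.items() if k != "i"}
print(compare(V_a_pruned, V_b))   # a -> b (wrong: the concurrency evidence is gone)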

The DVV construction — vectors of replicas, dots on values

Preguiça, Baquero, Almeida, Fonte, and Gonçalves formalised dotted version vectors in their 2010 paper. The construction has two pieces:

  1. Version vector V — exactly like a classic vector clock, but indexed by server replica id, not by process. If your data store has 3 replicas (call them r1, r2, r3), the vector has 3 entries. If you scale to 5 replicas, it has 5. The vector is attached to every value stored on every replica.
  2. Dot d — a single (replica_id, counter) pair attached to every concrete value. The dot identifies the unique write event that produced that value. Two values with the same dot are byte-identical. Two values with different dots are distinct events that may or may not be causally related.

A stored value on a replica is therefore a triple: (value, V, d) — the payload, the version vector summarising "what writes I have seen so far", and the dot summarising "which write produced this specific value".

The four protocol rules:

  1. Client read. Client requests key k from a coordinator replica. Coordinator returns the current set of (value, V, d) triples — usually one, but more if there are unresolved concurrent writes (siblings).
  2. Client write. Client sends (new_value, context) to a coordinator, where context is the version vector(s) the client read previously (proof of "I have seen these versions"). The coordinator increments its own counter c := V[my_id] + 1 and builds the new value with vector V_new = context extended with V_new[my_id] := c, and dot d = (my_id, c). The new value's vector records only what the client has seen plus the new event, never the coordinator's full local history; folding in the full history would claim causality the client never observed. Locally stored values whose dots are dominated by context are superseded and dropped; the rest remain as siblings. The coordinator writes (new_value, V_new, d) locally and asynchronously replicates.
  3. Replica receive. When a replica receives a value (v, V', d') from another replica, it merges. If d' is dominated by the local vector (i.e., the local vector's entry for d'.replica_id is at least d'.counter), the value is discarded — already-seen. Otherwise, the value is kept; the local vector is updated to V := V ⊔ V'; and any locally-stored values whose dots are now dominated by V' are discarded — they're in the causal past of the incoming value.
  4. Sibling formation. If after merge there are two stored values (v1, V1, d1) and (v2, V2, d2) such that neither dot is dominated by the other value's vector, both are kept as siblings. The next client read returns both; the next client write uses both as context and produces a value that supersedes both.

The crucial difference from naive vector clocks: the dot d = (my_id, c) is what lets the merge layer say "this specific value came from event #c on replica my_id", even though the version vector itself has been merged with other vectors. Without dots, two replicas that have both absorbed the same set of events would have identical vectors and you couldn't tell their concrete values apart. The dot pins the value to its origin event.

Figure: DVV write flow with sibling formation under concurrent client writes. Three replicas r1, r2, r3. Client A writes v1 via r1 with empty context, producing dot (r1,1) and vector [1,0,0]; client B concurrently writes v2 via r2 with empty context, producing dot (r2,1) and vector [0,1,0]. Both values replicate asynchronously. When r3 receives both, neither dot is dominated, so r3 stores both as siblings under the merged vector [1,1,0].
Illustrative — two clients writing concurrently via different coordinators. Each write produces a dot tied to its coordinator. When replica r3 absorbs both, the version vector merges to `[1,1,0]` (it has now seen one event on r1 and one on r2), but neither dot is dominated by the other so both values remain as siblings. The next client read sees both; the next client write supersedes both.

Why dots are necessary even with the merged vector: after r3 absorbs both writes, its vector is [1,1,0] for both values. Without dots, you could not tell the two stored values apart by looking at their vectors — they'd carry the same metadata and you wouldn't know which one came from r1 vs r2. The dot is the per-value identifier; the vector is the per-replica summary. They serve different roles, and DVVs need both. A naive vector clock conflates them — every value carries its full causal history as the vector, which is why the vector grows with every write.

A 40-line DVV implementation with sibling formation

The cleanest way to internalise DVVs is to run two clients writing concurrently to different coordinators and watch siblings form, then watch a context-aware write supersede them.

# dvv_sim.py — minimal DVV with three replicas and concurrent client writes
from dataclasses import dataclass, field
from typing import Tuple

@dataclass
class StoredValue:
    value: str
    vv: dict           # {replica_id: counter}
    dot: Tuple[str, int]  # (replica_id, counter)

@dataclass
class Replica:
    rid: str
    vv: dict = field(default_factory=dict)
    values: list = field(default_factory=list)

    def write(self, value, context):
        # Fresh dot: this coordinator's next event number.
        c = self.vv.get(self.rid, 0) + 1
        self.vv[self.rid] = c
        # The new value's vector is the client-supplied context plus the new
        # dot: it records only what the client has seen, not the coordinator's
        # full history (folding in the full history would claim false causality).
        new_vv = dict(context)
        new_vv[self.rid] = c
        new = StoredValue(value, new_vv, (self.rid, c))
        # Siblings the context covers are superseded; anything the client has
        # not seen survives alongside the new value.
        self.values = [v for v in self.values
                        if not self._dominated(v.dot, context)] + [new]
        return new

    @staticmethod
    def _dominated(dot, vv):
        rid, c = dot
        return vv.get(rid, 0) >= c

    def receive(self, incoming):
        # Already absorbed this exact event? Discard the duplicate.
        if self._dominated(incoming.dot, self.vv): return
        # Fold the incoming vector into this replica's summary.
        self.vv = {k: max(self.vv.get(k,0), incoming.vv.get(k,0))
                    for k in set(self.vv)|set(incoming.vv)}
        # Drop local values whose dots the incoming write has already seen;
        # everything else stays as a sibling next to the incoming value.
        kept = [v for v in self.values if not self._dominated(v.dot, incoming.vv)]
        if not any(v.dot == incoming.dot for v in kept):
            kept.append(incoming)
        self.values = kept

if __name__ == "__main__":
    r1, r2, r3 = Replica("r1"), Replica("r2"), Replica("r3")
    v1 = r1.write("cart=[milk]", context={})       # client A, no prior context
    v2 = r2.write("cart=[eggs]", context={})       # client B, no prior context
    for r in (r2, r3): r.receive(v1)
    for r in (r1, r3): r.receive(v2)
    print(f"r3 after both: {len(r3.values)} sibling(s)")
    for v in r3.values: print(f"  {v.value}  vv={v.vv}  dot={v.dot}")
    # client C reads from r3, sees siblings, writes a resolution with full context
    ctx = {k: max(v.vv.get(k,0) for v in r3.values) for k in r3.vv}
    v3 = r3.write("cart=[milk,eggs]", context=ctx)
    for r in (r1, r2): r.receive(v3)
    print(f"r1 after merge: {len(r1.values)} value(s)")
    for v in r1.values: print(f"  {v.value}  vv={v.vv}  dot={v.dot}")

Sample run:

r3 after both: 2 sibling(s)
  cart=[milk]  vv={'r1': 1}  dot=('r1', 1)
  cart=[eggs]  vv={'r2': 1}  dot=('r2', 1)
r1 after merge: 1 value(s)
  cart=[milk,eggs]  vv={'r1': 1, 'r2': 1, 'r3': 1}  dot=('r3', 1)

The output tells the whole story. After both concurrent writes propagate, replica r3 stores two siblings — neither dot is dominated, so the merge layer cannot collapse them automatically. A read from r3 returns both versions to the application. The application reconciles (here, it unions the cart) and issues a context-aware write — the ctx dict carries the union of vectors from the siblings, proving "I have seen both versions". The new value's vector is {r1:1, r2:1, r3:1} and its dot is (r3, 1). When r1 receives this new value, its dot is not dominated (r1 has seen no r3 events yet), so the value is stored. But the vector {r1:1, r2:1, r3:1} does dominate the previously-stored sibling (v1, vv={r1:1}, dot=(r1,1)) — because vv.get('r1', 0) = 1 >= 1. So v1 is discarded. The same happens for v2 on r2. The end state: every replica converges to the resolved cart with a single dot.

Why the line kept = [v for v in self.values if not self._dominated(v.dot, incoming.vv)] is the heart of the algorithm: this is where DVVs reclaim space. When a new value arrives whose vector says "I have seen all of {r1:1, r2:1, r3:1}", any locally-stored sibling whose dot is one of those events is now in the causal past of the incoming value — it's been superseded by the resolution. Dropping it is correct because the application has already explicitly absorbed it via context. Without dots, you couldn't safely drop siblings; with dots, you can drop precisely the ones the new vector dominates. This is what stops sibling explosion.

Why the context dict is the load-bearing API contract: the client must send back the version vector(s) it saw on read. If a client writes with context={} when there are existing siblings, it produces yet another concurrent value — sibling count grows by one. If it writes with the full union-context, the resolution dominates everything in the past. Riak's API exposes this as the vclock opaque blob the client must echo back; if you forget to echo it, you create an artificial sibling and slow the system down. The dot is the server's identifier for the write; the context is the client's claim about what it has read. Both are required.
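
The contract is easy to see with the simulator from the previous section (a sketch reusing the Replica class from dvv_sim.py; replica names and cart values are illustrative). The same key written with an empty context grows a sibling, and written with the context from a read collapses back to one value:

# context_contract.py: echoing the context vs forgetting it
from dvv_sim import Replica

r1, r2 = Replica("r1"), Replica("r2")
r2.receive(r1.write("cart=[milk]", context={}))    # r2 now stores the milk value

# Forgetting the context: "I have seen nothing", so the old value survives
# as a sibling next to the new one.
r2.write("cart=[juice]", context={})
print(len(r2.values))                               # 2 siblings

# Echoing the context from a read: the write supersedes what it has seen.
ctx = {k: max(v.vv.get(k, 0) for v in r2.values) for k in r2.vv}
r2.write("cart=[milk,juice]", context=ctx)
print(len(r2.values))                               # 1 value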

A production incident — MealRush's lunch-hour sibling explosion

MealRush, a food-delivery platform, runs a per-user "active orders" cart on a Dynamo-style key-value store with replication factor 3. On 14 March 2025, between 12:45 IST and 13:15 IST — peak lunch-hour traffic — the on-call SRE Aditi noticed p99 read latency on the active-orders key climbing from 8 ms to 340 ms. Investigation showed the cart for one user, a frequent customer named Karan, had 47 sibling versions accumulated. Each read returned ~12 KB of conflict-resolution data for what should have been a 200-byte cart.

The proximate cause was a buggy mobile-app build. The app issued cart-update writes to the data store directly (talking to whichever replica its load balancer routed it to), but a bug had wiped the version-vector context cache on the device. Every write went out with context={} — the client claiming it had seen no prior versions of the cart. Every such write created a new sibling, because the dot (coordinator_replica, n) was unique each time and the empty context dominated nothing. Within 30 minutes Karan's cart had 47 siblings, each one a distinct concurrent write the system could not collapse.

The fix had two layers. Short-term, the data store engineering team enabled a server-side sibling cap of 8 — when sibling count exceeds 8, the server refuses new writes with a precondition_required error, forcing the client to read the current state and re-issue with full context. This stopped the bleed. Long-term, the mobile app was patched to persist the version-vector cache to disk. The interesting part of the post-mortem was that classic vector clocks would have made this incident worse — not just sibling count growing, but vector size itself growing, with every transient client adding a new entry. With DVVs, the vector stayed at 3 entries the whole time; only the sibling list grew. That meant the recovery (read the current state, write with full context, dominate everything) was a single-write fix once Karan's app was upgraded. With classic vector clocks, the 47 zombie client entries in the vector would have remained forever.

Figure: MealRush sibling growth and recovery via context-aware write. A bar chart of sibling count over time on Karan's cart key: the count grows from 1 at 12:45 to 47 at 13:15 under context-empty writes from the buggy mobile app; at 13:18 a single write with full context supersedes all 47 siblings and the count drops back to 1.
Illustrative — sibling count growth on a single key under context-empty writes, and the single-write recovery once the mobile app was fixed. Vector size stayed at 3 throughout (one per replica) because DVVs do not let clients into the vector. With classic vector clocks the vector itself would have grown to 47-plus entries and the recovery would have been multi-step.

The deeper insight from the MealRush post-mortem was about API design. A data store that permits context-empty writes is implicitly accepting that every such write is a sibling-creator. Riak made this explicit in 2.0 by deprecating the legacy "vclock-less" API and requiring vclock in every write — a breaking change motivated by exactly this class of incident. The lesson generalises: when your conflict-detection structure depends on clients accurately reporting what they've seen, the protocol must make "I haven't seen anything" loud rather than silent. DVVs make zero-context writes detectable (their dot is unique, their vector is empty), but they don't prevent them. The prevention has to be at the API layer.
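
One way to picture such an API-layer guard in the toy model from earlier: a wrapper that refuses a write once the sibling count passes a cap unless the supplied context covers the existing siblings. The cap of 8 and the precondition_required idea come from the MealRush incident above; the wrapper itself is an illustrative sketch, not any particular store's implementation.

# sibling_cap.py: illustrative server-side guard in front of Replica.write
from dvv_sim import Replica

SIBLING_CAP = 8  # short-term cap from the MealRush post-mortem

class PreconditionRequired(Exception):
    """The client must read the key and retry with the returned context."""

def guarded_write(replica, value, context):
    # A context that does not cover the existing siblings cannot shrink the
    # sibling set, so once the cap is reached we force a read-before-write.
    covers_all = all(replica._dominated(v.dot, context) for v in replica.values)
    if len(replica.values) >= SIBLING_CAP and not covers_all:
        raise PreconditionRequired("read the current state and echo its context")
    return replica.write(value, context)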

Common confusions

  1. The dot is not the vector. The vector summarises a value's causal past; the dot names the single event that produced the value. Merging vectors never merges dots.
  2. DVVs do not prevent conflicts, they detect them precisely. Concurrent writes still produce siblings; the win is that the clock metadata stays bounded and a context-aware write can safely collapse them.
  3. The context is the client's claim, not the server's knowledge. A client that sends an empty context is asserting it has seen nothing, and the protocol takes it at its word; that is why zero-context writes create siblings instead of resolving them.
  4. DVVs do not beat the O(N) lower bound on clock size; they shrink N by making only server replicas first-class actors.

Going deeper

The Riak vclock field — DVVs in production

Riak's storage layer attaches a vclock to every key-value record. Internally it's an Erlang term encoding a list of {actor_id, counter, timestamp} tuples — the actor_id is the replica id, counter is the per-replica counter, and the timestamp is a coarse wall-clock value used only as a tiebreaker for sibling pruning under a sibling-cap limit (not for causality). A read returns a vclock blob the client must echo back on writes. The HTTP API exposes this as the X-Riak-Vclock header. Riak's tests show that with replication factor 3 and a typical workload, the vclock blob settles at ~80–120 bytes per key — bounded, predictable, no zombie growth. The 2.0 release added stricter context enforcement (rejecting writes whose vclock is older than the on-disk one) precisely to prevent the MealRush-style bug class.
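
What the echo cycle looks like from a client, sketched with Python's requests against a hypothetical local node: the X-Riak-Vclock header is Riak's real mechanism, but the bucket, key, and port here are made up and the exact URL path varies by Riak version, so treat this as a shape rather than a recipe.

# riak_vclock_echo.py: read, capture the vclock, echo it back on the write
import requests

URL = "http://127.0.0.1:8098/buckets/carts/keys/karan"   # hypothetical key

resp = requests.get(URL)
vclock = resp.headers.get("X-Riak-Vclock")   # opaque causal context, if present

headers = {"Content-Type": "application/json"}
if vclock:
    headers["X-Riak-Vclock"] = vclock        # omit this and you mint a sibling
requests.put(URL, data=b'["milk", "eggs"]', headers=headers)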

Why dots compose with CRDTs cleanly

CRDTs need to identify individual operations to support remove-anywhere semantics. An OR-Set, for example, tags every element-add with a unique dot so that a concurrent remove can target that specific add rather than "all adds of the element so far". DVVs provide exactly the right primitive: every value gets a dot, and the dot is comparable against vectors via dominance. Riak's data-type subsystem (introduced in 2.0 as riak_dt) implements OR-Sets, G-Counters, and registers using DVVs as the underlying timestamp. The composition is clean because the primitive (DVV dot) was designed to be the correct shape for this use case — which it was, because the same research group that produced the DVV paper (Baquero et al.) also produced foundational CRDT work.
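
A minimal sketch of the idea, not the riak_dt implementation: an observed-remove set in which every add is tagged with a dot and a remove deletes exactly the dots it has observed, so a concurrent add survives the merge. Real implementations replace the ever-growing removed-dot set with a compact causal context; the names here are illustrative.

# orset_dots.py: toy observed-remove set keyed by dots
class ORSet:
    def __init__(self, replica_id):
        self.rid = replica_id
        self.counter = 0
        self.entries = {}        # live adds: dot (rid, n) -> element
        self.removed = set()     # dots whose adds have been observed-removed

    def add(self, element):
        self.counter += 1
        self.entries[(self.rid, self.counter)] = element

    def remove(self, element):
        # Remove exactly the adds this replica has observed for the element.
        for dot in [d for d, e in self.entries.items() if e == element]:
            del self.entries[dot]
            self.removed.add(dot)

    def merge(self, other):
        self.removed |= other.removed
        merged = {**self.entries, **other.entries}
        self.entries = {d: e for d, e in merged.items() if d not in self.removed}

    def value(self):
        return set(self.entries.values())

a, b = ORSet("r1"), ORSet("r2")
a.add("milk")                       # dot (r1, 1)
b.merge(a)                          # b observes (r1, 1)
b.remove("milk")                    # targets exactly the observed dot (r1, 1)
a.add("milk")                       # concurrent re-add gets a fresh dot (r1, 2)
a.merge(b); b.merge(a)
print(a.value(), b.value())         # both {'milk'}: (r1,1) is gone, (r1,2) survives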

The state-vs-operation tradeoff and why DVVs are state-based

DVVs are a state-based clock — you replicate the full state (value + vector + dot) and merge with componentwise lattice operations. The alternative is operation-based — replicate the operations themselves and apply them at every replica, which allows smaller messages but requires reliable causal broadcast, itself a distributed-systems problem. State-based replication via DVVs is "just gossip the state, the merge takes care of correctness", which is why Dynamo-derived systems such as Riak chose it. The tradeoff is bandwidth: every replication message ships the full value, not just the delta. For small values this is fine; for large blobs (multi-megabyte), operation-based or hybrid schemes win. The "Delta State Replicated Data Types" work by Almeida et al. shows how to retrofit small delta-state messages onto DVV-style state-based replication, which is what most modern systems do.

The Charron-Bost lower bound still applies — DVVs don't beat O(N)

Charron-Bost's 1991 result — that any timestamping satisfying the strong clock condition needs vectors of dimension N, one entry per process, where N is the number of processes that ever participate — applies to DVVs just like to vector clocks. The reason DVVs don't violate the bound is that they redefine N to be the number of server replicas, not the number of all participants. Clients are not first-class processes in the DVV system; they are "the things that trigger writes on servers". The bound is still tight (N server replicas → N integers per vector), but N is now small and bounded. This is a pattern in distributed-systems engineering: when a lower bound bites, redefine the problem so the constants are small. DVVs are a textbook example.

Reproduce this on your laptop

# Reproduce the DVV simulation
python3 dvv_sim.py

# Stress: many concurrent writes with empty context — watch siblings grow
python3 -c "
from dvv_sim import Replica, StoredValue
r1, r2, r3 = Replica('r1'), Replica('r2'), Replica('r3')
writes = []
for i in range(20):
    coord = [r1, r2, r3][i % 3]
    writes.append(coord.write(f'v{i}', context={}))
for w in writes:
    for r in (r1, r2, r3):
        r.receive(w)
print(f'r1 sibling count: {len(r1.values)}')
print(f'r1 vector size: {len(r1.vv)}')
"

# Recovery: read all siblings, write with full context — sibling count collapses to 1
python3 -c "
from dvv_sim import Replica
r1 = Replica('r1')
v = r1.write('a', context={})
v = r1.write('b', context={})    # second context-empty write — sibling
print(f'before recovery: {len(r1.values)} siblings')
ctx = {k: max(s.vv.get(k,0) for s in r1.values) for k in r1.vv}
r1.write('a+b', context=ctx)
print(f'after recovery: {len(r1.values)} sibling')
"

Where this leads next

DVVs are the connective tissue between vector clocks and the production-deployable conflict-resolution machinery in eventually-consistent stores.

The unifying lesson: when classical theory says "you need O(N) per timestamp", production systems don't violate the bound — they redefine N. DVVs are an instance of a general engineering move: separate the structural identifiers (which clients should not control) from the per-value pins (which can be small and unique). Clean separations like that are what let theoretical guarantees survive the realities of mobile clients, transient processes, and a six-month uptime expectation.

References

  1. Preguiça, Baquero, Almeida, Fonte, Gonçalves, "Dotted Version Vectors: Logical Clocks for Optimistic Replication" — 2010. The canonical DVV paper. Introduces the construction and proves correctness.
  2. DeCandia et al., "Dynamo: Amazon's Highly Available Key-Value Store" — SOSP 2007. The system that brought vector-clock-style timestamps into production and motivated the DVV refinement.
  3. Riak vclock documentation — production reference for how DVVs are encoded and exposed via the Riak HTTP/PB API.
  4. Almeida, Shoker, Baquero, "Delta State Replicated Data Types" — 2016 (extended version). Delta-state messages layered on top of DVV-style state-based replication.
  5. Charron-Bost, "Concerning the size of logical clocks in distributed systems" — Information Processing Letters 1991. The O(N) lower bound DVVs cleverly sidestep by redefining N.
  6. Shapiro et al., "Conflict-free Replicated Data Types" — SSS 2011. The CRDT framework that uses DVV dots as per-operation identifiers.
  7. Vector clocks — internal cross-link to the chapter that DVVs refine.
  8. Justin Sheehy, "Why Vector Clocks Are Hard" — Riak engineering blog post on the operational pain that motivated the DVV deployment.