In short
In 2007, Amazon published the Dynamo paper at SOSP. The headline claim was unusual: for Amazon's shopping cart, the leader — the single node serialising writes — was not a tool, it was the problem. Strict leader-based replication gives up writes whenever the leader dies or is partitioned away; for a cart on Black Friday, giving up writes is not an option. Dynamo threw leaders out entirely.
The design. Every node is equal. Any node can accept any write. A key's data lives on N replicas selected by consistent hashing; a write is acknowledged once W of those replicas have it, and a read queries R of them and merges the results. Tuning N, R, W gives you a continuous knob between "very consistent, slower" and "very available, faster". Conflicts — when two nodes concurrently accept writes to the same key — are resolved with vector clocks that record causal history, or with last-write-wins timestamp tie-breakers.
The trade. Dynamo chooses the AP side of CAP: during a partition, every partition side keeps accepting writes. The cost is that your reads may see concurrent versions the application has to merge — the shopping cart union — and you lose general-purpose strong consistency.
The legacy. Cassandra (Facebook 2008), Riak (Basho), Voldemort (LinkedIn), and Amazon's own managed DynamoDB service all descend from this paper. The vocabulary — N/R/W, hinted handoff, read repair, sloppy quorum — is now the shared language of leaderless distributed databases. This chapter walks through the design choices, the shopping-cart motivation, the N/R/W knob, and the workloads that fit the model versus the ones that don't.
You have spent Builds 7 through 9 making leader-based databases robust. Primary plus followers, WAL shipping, semi-sync, quorum acks, failover with fencing. Every playbook shares one assumption — at any moment, one node is the single authoritative writer for each key.
In 2007 Amazon published a paper that rejected that assumption. Dynamo let any node accept any write at any time. No elections. No leader. The paper spent most of its pages on the consequences: concurrent writes, conflicting versions, eventual consistency, vector clocks, the N/R/W knob. The motivation in section 1 was simple and operational. Amazon's shopping cart had to accept writes 24/7, including during regional partitions and the multi-hour windows when a leader-based system would be in failover. No amount of failover engineering on a leader-based system gave them a cart that never refused writes.
So they stopped engineering failover, and built a database that had nothing to fail over.
The shopping-cart requirement
The Amazon.com cart service had three properties that made conventional databases a poor fit:
Always-writable. A customer adding an item on Black Friday cannot be told "try again in 30 seconds, we are electing a new leader". The internal goal was that the cart accept writes every minute of every year. Leader failover that takes 10-30 seconds on primary death — and longer for cross-datacenter partitions — violates the goal multiple times a year.
Per-customer isolation. Your cart has nothing to do with mine. There is no cross-customer invariant — no "total inventory across all carts must not exceed the warehouse". Each cart is an island. This is unlike a bank's ledger, where cross-account invariants (conservation of money) matter absolutely.
Conflict tolerance. The worst outcome of a mishandled concurrent write is: the same item added twice, or a removed item reappearing after a racy remove-then-add-elsewhere. Neither is silently wrong — the user sees their cart and the rare duplicate is noticeable and correctable. The loss function is flat.
Why optimise differently for different workloads: consistency and availability are not free; both cost engineering and latency. If your workload can tolerate the occasional merge — cart A has {X}, cart B has {Y}, merged cart = {X, Y} — you do not have to pay the availability tax of strong consistency. You get a strictly better system by optimising for availability and pushing conflict resolution to the application.
Dynamo's design choices
The Dynamo paper enumerates a handful of techniques, all of which had prior art individually, but which were combined into a coherent leaderless whole for the first time. Walk through them in the order they appear in a write.
1. Leaderless replication. Every node in a Dynamo cluster accepts writes. When a client sends a put(key, value), it contacts any node — typically a coordinator picked by a load balancer or by client-side hashing. The coordinator does the replication; no request is ever forwarded to a "primary" because none exists. This is the single structural choice that makes everything else necessary; everything that follows is machinery to make leaderless replication correct.
2. Consistent hashing for placement. A key is placed on N nodes chosen by hashing the key onto a ring. Nodes are also hashed onto the ring; a key's "home" node is the first node clockwise from the key, and its replicas are the next N-1 nodes clockwise. This scheme (Karger et al. 1997) has the property that adding or removing a node only reshuffles a 1/N fraction of keys, not the whole dataset. Consistent hashing is so central to Dynamo that the next chapter (ch.76) covers it in depth; here it suffices to know that given a key, there is a deterministic, agreed-upon list of N nodes responsible for it, called the preference list.
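A minimal sketch of that placement rule, assuming an MD5-hashed ring and illustrative node names (Ring and preference_list are this sketch's names, not the paper's):

```python
import hashlib
from bisect import bisect_right

def ring_position(name: str) -> int:
    # Hash a node name or key onto a 128-bit ring (MD5 here is an assumption).
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    """Toy consistent-hash ring without virtual nodes (see ch.76)."""
    def __init__(self, nodes):
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def preference_list(self, key: str, n: int):
        # First node clockwise from the key, then the next n-1 distinct nodes.
        pos = ring_position(key)
        idx = bisect_right(self.ring, (pos, chr(0x10FFFF)))
        out = []
        for i in range(len(self.ring)):
            node = self.ring[(idx + i) % len(self.ring)][1]
            if node not in out:
                out.append(node)
            if len(out) == n:
                break
        return out
```

The same key always yields the same preference list on every node, with no coordination — which is exactly what lets any node act as coordinator.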
3. Quorum replication — N/R/W. The coordinator sends the write to all N preference-list nodes and returns success once W of them ack. Reads query all N and return once R respond. N, R, W are configurable per Dynamo instance (Riak exposes them per bucket; Cassandra per keyspace and per query). Typical values are N=3, W=2, R=2 — a balanced middle ground that tolerates one replica failure on both reads and writes.
4. Vector clocks for conflict detection. Every value carries a small metadata object — a vector clock — recording which nodes have accepted writes and in what logical order. When a read returns multiple values whose vector clocks are causally concurrent (neither strictly happens-before the other), both values are returned to the application, which must merge them. This is how the shopping-cart union works in practice.
5. Hinted handoff for transient failures. When one of the N preference-list nodes is down, the coordinator writes the data to a different healthy node with a "hint" that says "deliver this to node X when X comes back". The hinted replica stores the write durably and pushes it to X once X is reachable. Writes never fail because one specific replica is down; the cluster substitutes.
6. Read repair and anti-entropy. Replicas drift over time — missed writes, dropped hints, node-local corruption. Two mechanisms reconcile them. On every read, the coordinator compares the responses from the N replicas and, if they disagree, writes the "winning" version back to the stale replicas (read repair, opportunistic, no extra round-trip). In the background, periodic Merkle-tree comparisons between replicas detect and fix divergent ranges (anti-entropy, systematic, bounded cost).
Each of these six ideas predates Dynamo. Consistent hashing is Karger 1997. Quorum systems are 1970s. Vector clocks are Mattern and Fidge 1988. Hinted handoff is an engineering pattern from distributed filesystems. Merkle trees are 1979. Dynamo's novelty was assembling them — specifically, dropping the leader — and showing the result was production-worthy for a real Amazon workload.
The N/R/W parameters — the tunable knob
N, R, and W are three numbers that characterise how a Dynamo-style system trades off consistency, availability, and performance.
- N — replication factor. How many replicas hold each key. Typically 3. Higher N gives more durability and better read scaling at the cost of more storage and wider coordination.
- W — write quorum. How many of the N replicas must ack before the write returns. Typically 2 when N=3. W=1 is "fire and forget"; W=N waits for every replica and halts if any is down.
- R — read quorum. How many of the N replicas must respond before the read returns a merged answer. Typically 2 when N=3. R=1 is "any one replica, fastest read, possibly stale"; R=N is "all replicas, slowest, most authoritative".
The consistency rule: R + W > N guarantees read-your-writes. If a write has been accepted by W replicas and a read queries R replicas, then R + W > N forces at least one replica to appear in both sets (pigeonhole — W write-acking replicas and R read-queried replicas out of N total cannot be disjoint). That one overlap replica holds the latest write, so the read is guaranteed to see it (possibly alongside older versions that the merge step will discard).
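The pigeonhole argument can be checked exhaustively for small N — a throwaway sketch, not production code:

```python
from itertools import combinations

def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Brute-force check that every possible W-sized write set
    intersects every possible R-sized read set out of N replicas."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

# Overlap is guaranteed exactly when R + W > N:
assert quorums_overlap(3, 2, 2)        # R + W = 4 > 3: reads see the write
assert not quorums_overlap(3, 1, 1)    # R + W = 2 <= 3: stale reads possible
```

The brute force agrees with the rule for every (N, R, W) you care to try, which is what the one-line pigeonhole proof promises.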
Common configurations:
- N=3, W=2, R=2. Strong consistency (2+2>3). One replica can fail on writes, one on reads. Both commit and read latency are the 2nd-fastest of 3 acks — responsive even if one replica is slow.
- N=3, W=3, R=1. Read-optimised. Every read is fast (one replica). Every write is slow (wait for all 3). Strong consistency. Bad if any replica is down — writes halt.
- N=3, W=1, R=1. Write-and-read-optimised. Very fast. Not strongly consistent (1+1=2 ≤ 3). Writes may be lost if the one replica that acked dies before gossiping. Used when latency dominates correctness.
- N=3, W=1, R=3. Write-optimised. Fast writes; reads check all 3 and merge. Reads are slow but see everything.
The pleasing property of Dynamo's design is that these are all the same code path with different numbers. You do not rebuild the database to change consistency level; you change the config. Cassandra exposes this per-query (CONSISTENCY QUORUM, CONSISTENCY ONE, CONSISTENCY ALL), so a single table serves both "fast cache-like reads" and "careful quorum reads" from different call sites.
An example placement: cart:42 hashes to a position between nodes B and C on the ring. The preference list — the next N=3 nodes clockwise — is [C, D, E]. C is the coordinator (the first node clockwise), so a client hitting any node forwards the write to C, which fans out to D and E. The write returns success once W=2 of the three have acked (typically C plus whichever of D or E is faster).

Vector clocks — resolving concurrent writes
Without a leader there is no single point serialising writes. Two clients can put different values for the same key at the same time, on different coordinators, and both succeed. The next read returns two values — siblings in Dynamo's vocabulary. Something must decide which is newest. Timestamps are tempting but fragile (clock skew). Dynamo's answer is vector clocks, which record causal history directly.
A vector clock is a dictionary {node_id: counter} attached to every version. Each time node X accepts a write to key K, X increments its own entry. Two clocks a and b compare element-wise: a < b iff every entry of a ≤ b and at least one is strictly less (then a happens-before b); symmetrically a > b; otherwise concurrent — neither knew about the other.
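That comparison rule is a few lines of Python (a sketch; the function name is ours):

```python
def compare(a: dict, b: dict) -> str:
    """Compare two vector clocks {node_id: counter}.
    Returns 'before', 'after', 'equal', or 'concurrent'."""
    keys = set(a) | set(b)
    # Missing entries count as zero: a node that never wrote is at 0.
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happens-before b: b supersedes a
    if b_le_a:
        return "after"
    return "concurrent"      # neither knew about the other: siblings
```

Note that "concurrent" is a real third outcome, not an error: compare({"X": 1}, {"Y": 1}) is "concurrent", and that is precisely the case the application's merge function exists to handle.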
When a read returns values with concurrent clocks, Dynamo returns all of them. The application merges — for the cart, set union of items; for a string value, perhaps last-write-wins or user prompt. Subsequent writes carry the coordinate-wise max of the siblings' clocks plus an increment on the writer's own entry, which "heals" the divergence.
Concrete example. Key cart:42 starts empty, clock {}. Coordinator X accepts "add apple"; writes {apple} with clock {X: 1}. A partition isolates X from Y. Y accepts "add banana" (not knowing about X); writes {banana} with clock {Y: 1}. Partition heals. A read sees {X: 1} and {Y: 1} — concurrent. The cart service merges: {apple, banana}. Write-back carries clock {X: 1, Y: 1}, which dominates both siblings. Future reads see one value.
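The write-back clock — the coordinate-wise max of the siblings' clocks — can be sketched directly, replaying the example above:

```python
def merge_clocks(*clocks: dict) -> dict:
    """Coordinate-wise max of sibling vector clocks.
    The result dominates every input, which is what 'heals'
    the divergence on the next write-back."""
    merged = {}
    for clock in clocks:
        for node, count in clock.items():
            merged[node] = max(merged.get(node, 0), count)
    return merged

a = {"X": 1}                    # {apple}, written on X
b = {"Y": 1}                    # {banana}, written on Y during the partition
assert merge_clocks(a, b) == {"X": 1, "Y": 1}   # dominates both siblings
```

A real write-back would additionally increment the coordinator's own entry on top of this max, so the merged version is strictly newer than both siblings.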
Why vector clocks rather than timestamps: a timestamp is a single number imposing a total order the system cannot guarantee. If node X's clock is 3 ms ahead of Y's, X's older write appears "newer" and last-write-wins silently discards Y's newer write. Vector clocks are partial orders — honest about which writes really happened before which, and which were concurrent. Last-write-wins can still sit on top of vector clocks as a deterministic tie-break for concurrent siblings; Cassandra uses exactly this combination.
Vector clocks do have cost — they grow with the number of nodes that have ever written a given key. Dynamo prunes oldest entries past a threshold; real deployments rarely see pathological growth because writes for a given key tend to come through a small set of coordinators.
Hinted handoff — surviving transient replica failure
A write to key K has preference list [C, D, E]. C is the coordinator. Suppose D has crashed. With W=2 the write could succeed with just C and E — but the paper wanted the stronger property that data eventually reaches all N replicas so a returning replica does not cause read-repair churn.
Hinted handoff: the coordinator writes to the next healthy node in the ring — say F — with a "hint" saying "this belongs to D; deliver when D returns." F stores the write in a separate hints region (not treated as a replica for reads). When gossip tells F that D is back, F pushes the hinted writes to D and deletes its local copies.
The optimisation has a cousin called sloppy quorum: if enough preference-list members are down that W acks from them are impossible, the coordinator writes to non-preference-list substitutes and counts them toward W. This preserves availability at the cost of consistency — a read from the real preference list may miss the write until handoff completes. Dynamo's paper makes sloppy quorum explicit; Cassandra makes it per-query.
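A toy model of the hint machinery — store_hint and deliver are this sketch's names, not Dynamo's actual API, and a real hint store would be durable, not in-memory:

```python
from collections import defaultdict

class HintStore:
    """Sketch of hinted handoff: a substitute node parks writes
    addressed to a down replica and replays them on recovery."""
    def __init__(self):
        self.hints = defaultdict(list)   # intended_node -> [(key, value)]

    def store_hint(self, intended_node, key, value):
        # Kept out of the normal data path: hinted copies are not
        # treated as replicas for reads.
        self.hints[intended_node].append((key, value))

    def deliver(self, intended_node, send):
        # Called when gossip reports intended_node is reachable again.
        # After successful delivery the local hinted copies are dropped.
        for key, value in self.hints.pop(intended_node, []):
            send(intended_node, key, value)
```

Usage: when D is down, F calls store_hint("D", key, versioned); when gossip says D is back, F calls deliver("D", rpc_send) and the cluster converges back to all N real replicas.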
Read repair and anti-entropy
Replicas drift. A hint gets dropped because F crashed before delivering. A disk bit flips. Dynamo has two reconciliation mechanisms.
Read repair runs opportunistically on every read. The coordinator fans out to all N replicas, returns to the client after R respond, and collects the rest in the background. If any response has a strictly-older vector clock than the merged result, the coordinator writes the newer version back asynchronously. Hot keys self-heal.
Anti-entropy covers the cold keys. Each replica keeps a Merkle tree over its key range: leaves hash key+value, internal nodes hash their children, the root hashes the whole range. Two replicas compare roots; equal means total agreement, zero data exchanged. Unequal means descend the tree, narrow the divergence, exchange only the differing keys. Bandwidth is O(differences · log N_keys) rather than O(N_keys) — practical for millions of keys with a handful differing.
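A toy Merkle comparison over a small sorted key range — this sketch assumes both replicas index the same keys partitioned identically, which real anti-entropy arranges per token range:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle(items):
    """Build a Merkle tree over a sorted list of (key, value) pairs.
    Leaf: (hash, key). Internal node: (hash, left, right)."""
    if len(items) == 1:
        k, v = items[0]
        return (h(f"{k}={v}".encode()), k)
    mid = len(items) // 2
    left, right = merkle(items[:mid]), merkle(items[mid:])
    return (h(left[0] + right[0]), left, right)

def diff(a, b):
    """Descend two same-shape trees; return keys whose values differ."""
    if a[0] == b[0]:
        return []              # hashes match: entire subtree agrees, stop
    if len(a) == 2:
        return [a[1]]          # differing leaf: this key needs exchange
    return diff(a[1], b[1]) + diff(a[2], b[2])
```

Equal roots cost one hash comparison for total agreement; a single divergent key costs one root-to-leaf walk — the O(differences · log N_keys) behaviour described above.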
A minimal Python sketch of the core path
The skeleton of a Dynamo coordinator, showing the write and read paths with N/R/W and read-repair:
# dynamo/coordinator.py
import threading
from dataclasses import dataclass

@dataclass
class Versioned:
    value: object
    vclock: dict  # {node_id: counter}

def dominates(a, b):
    # True iff clock a has seen everything b has, and strictly more.
    keys = set(a) | set(b)
    return (all(a.get(k, 0) >= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) > b.get(k, 0) for k in keys))

class DynamoNode:
    def __init__(self, node_id, ring, N=3, W=2, R=2):
        self.node_id = node_id
        self.ring = ring
        self.N, self.W, self.R = N, W, R

    def put(self, key, value, vclock):
        replicas = self.ring.preference_list(key, self.N)
        vclock = dict(vclock)
        vclock[self.node_id] = vclock.get(self.node_id, 0) + 1
        versioned = Versioned(value, vclock)
        acks = 0
        lock = threading.Lock()
        done = threading.Event()

        def on_ack():
            nonlocal acks
            with lock:
                acks += 1
                if acks >= self.W:
                    done.set()

        # Fan out to all N replicas; return as soon as W have acked.
        for r in replicas:
            r.store_async(key, versioned, callback=on_ack)
        if not done.wait(timeout=1.0):
            raise TimeoutError(f"only {acks}/{self.W} replicas acked")
        return versioned

    def get(self, key):
        replicas = self.ring.preference_list(key, self.N)
        # Simplification: query R replicas synchronously. The real path
        # fans out to all N and returns after the first R responses.
        responses = [r.fetch(key) for r in replicas[:self.R]]
        merged = resolve_siblings(responses)  # application-supplied merge hook
        for r, resp in zip(replicas, responses):
            if dominates(merged.vclock, resp.vclock):
                r.store_async(key, merged)  # read repair: push newer version back
        return merged
Under 40 lines. Why resolve_siblings is a separate function: merge policy is not one-size-fits-all. The shopping cart does set union. A counter might pick the max. A user profile might pick last-write-wins by timestamp with the coordinator's ID as tie-break. Dynamo exposes this as a hook; the database does not know what your values mean. Cassandra bakes last-write-wins into the column model; Riak lets you register custom merge functions; DynamoDB the service does last-write-wins by default.
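For the cart specifically, a plausible merge hook — assuming sibling values are Python sets, and reusing the Versioned shape from the sketch (cart_resolve_siblings is an illustrative name, not Dynamo's):

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    value: object
    vclock: dict  # {node_id: counter}

def cart_resolve_siblings(siblings):
    """Shopping-cart merge policy: set union of items across all
    concurrent versions, with the coordinate-wise max of their clocks."""
    merged_items = set()
    merged_clock = {}
    for s in siblings:
        merged_items |= s.value
        for node, count in s.vclock.items():
            merged_clock[node] = max(merged_clock.get(node, 0), count)
    return Versioned(merged_items, merged_clock)
```

Note what this policy cannot express: a concurrent remove loses to a concurrent add, because a plain set union has no memory of deletions. The Diwali scenario later in the chapter works only because the remove carried a newer clock, not because union handles removes; doing removes properly is what OR-set CRDTs are for.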
What the sketch skips: gossip protocols for membership, Merkle anti-entropy, hinted handoff machinery, sloppy quorums during failures, read-your-writes session guarantees. All of them are engineering on top of the core loop — put to N, wait for W; get from R, merge, repair.
Trade-offs — what Dynamo gives up
Leaderless replication is not a free lunch:
- No cross-key consistency. You get per-key read-your-writes if R + W > N, but no way to make a transaction span keys. No "transfer 100 rupees from A to B atomically". For banking or strict inventory, Dynamo is the wrong tool.
- Secondary indexes are hard. Dynamo is pure KV. "Find all carts containing apples" requires scanning every cart. Cassandra added secondary indexes but they are slow and fragile; production systems usually maintain inverted indexes in separate buckets.
- No ACID. Atomicity across keys, isolation, consistent snapshots — none exist in the Dynamo model.
- Clients think about conflicts. Any data model richer than "last-write-wins is acceptable" needs application merge logic. The cart works because it is set union. A mutable nested JSON document is much harder — this is where CRDTs enter as an automatic-merge framework (Riak 2.0 shipped with them).
- Complex queries are foreign. No joins, no cross-key aggregates (except batch jobs), no multi-row transactions.
Where the Dynamo model fits
Workloads that fit the model share a common shape:
- Session stores and shopping carts — the original Amazon workload.
- Write-heavy telemetry, metrics, logs — append-dominant, no conflicts, losses survivable.
- Time-series and IoT data — high volume, per-key, range scans on timestamp.
- Cache-like workloads — recommendations, popularity scores — stale is fine, missing is not.
- Per-entity state across populations — game players, social profiles, devices.
Cassandra's public deployments — Netflix's user metadata (hundreds of petabytes), Apple's iCloud backends, Instagram's feed, Uber's geo queries — fit these shapes.
Where it doesn't fit
Workloads needing cross-key consistency or strict correctness:
- Financial ledgers — double-entry bookkeeping needs cross-account transactions.
- Strict inventory — two concurrent "last item" purchases must not both succeed.
- Complex relational queries — joins, aggregates, window functions.
- Referential integrity — Dynamo does not enforce foreign keys, and under concurrent writes the application cannot reliably enforce them either.
Rule of thumb: if you sketch your schema with multi-row transactions, Dynamo is the wrong database. If your schema is a large population of independent per-key things, Dynamo is a great fit.
The shopping cart under a partition
Priya is shopping on Amazon.in during Diwali. Her cart lives in keyspace carts, key cart:priya-42. Preference list [C, D, E] across three AZs in ap-south-1. Config: N=3, W=2, R=2.
- 10:00:05 Priya adds a saree on her phone. Request routes to C. C's vclock becomes {C: 1}, value [saree]. D and E ack. W=2; OK.
- 10:00:15 Partition isolates D from C and E. D is still reachable to clients on its side.
- 10:00:20 Priya's husband opens the app on another device; routes to D. D adds a kurta. D's vclock becomes {C: 1, D: 1}, value [saree, kurta]. Sloppy quorum: D writes a hint on F. W=2; OK.
- 10:00:30 Priya on her phone removes the saree. C's vclock becomes {C: 2}, value []. E acks. C writes a hint on G for D. W=2; OK.
- 10:01:00 Partition heals. F delivers its hint to D; G delivers its hint to D. D now holds two concurrent versions.
- 10:01:05 Priya refreshes. Read sees vclocks {C: 1, D: 1} → [saree, kurta] and {C: 2} → [] — concurrent (C:1 < C:2 but D:1 > D:0). Siblings returned.
- Cart service merges: "saree removed, kurta added, cart = [kurta]". Write-back carries {C: 2, D: 1}, which dominates both siblings.
What happened. During the partition, both writes went through on different nodes. A leader-based system would have rejected one with "leader unavailable", and the customer on the wrong side of the partition would have been unable to edit their cart.
The cost: a merge decision. For a cart, set arithmetic. For a bank balance, not obvious — which is why banks do not run ledgers on Dynamo.
Common confusions
- "Dynamo is always eventually consistent, never strongly consistent." False. With R + W > N, Dynamo gives quorum consistency — every read sees every committed write. The eventual-consistency reputation comes from common low-latency configurations like R=W=1, not from the architecture itself.
- "Leaderless means no coordination." False. Every write still involves coordination: the coordinator fans out to N replicas and waits for W acks. What is gone is one specific node being authoritative; what remains is per-request coordination among N replicas. The coordination moves from election-time to request-time.
- "Dynamo is faster than Postgres." Sometimes, sometimes not. A single Dynamo put with W=2 waits for two cross-node acks; a Postgres commit with async replication waits for one local fsync. Postgres is often faster per request under healthy conditions; Dynamo's advantage is that it does not stop serving when a node dies.
- "CRDTs replace vector clocks." Not really. CRDTs are data structures that merge automatically without application code; vector clocks are metadata that detect concurrent updates. A system built on CRDTs still tracks causality — often with vector clocks or their close relatives (dotted version vectors) — to decide when a merge is needed. CRDTs complement rather than replace the vector-clock machinery.
- "N/R/W is a Cassandra feature." N/R/W comes from Dynamo 2007; Cassandra inherited it. Every Dynamo descendant exposes something equivalent — Riak calls W "write quorum"; the managed DynamoDB service has ConsistentRead, a read-at-quorum flag. The vocabulary is shared.
- "Leaderless systems do not need consensus." Mostly true for writes, but false for membership. Adding or removing nodes from the ring — changing which nodes are responsible for which key ranges — requires consensus. Dynamo used a gossip protocol with manual seed lists; Cassandra for many years had a "schema agreement" problem where two nodes could disagree on the schema; modern systems use lightweight Paxos or Raft for metadata even when the data path is leaderless. Gossip gets you most of the way; consensus closes the remaining gaps.
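The CRDT point above is easy to make concrete. A grow-only counter merges with coordinate-wise max — the same operation a vector-clock merge uses — so replicas converge no matter the order in which hints and repairs arrive (a sketch, not any library's API):

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node; merge = element-wise max.
    Increments are local; convergence comes from merging, not coordination."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, by=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + by

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Commutative, associative, idempotent -- replaying or reordering
        # merges cannot change the converged result.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)
```

Two replicas that each increment independently and then exchange state in either order end up with the same value — the automatic-merge behaviour Riak 2.0's data types provide for sets, counters, and maps.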
Going deeper
The original paper
DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007) is the canonical reference. Unusually for a systems paper, it was written by the production team running Amazon.com and describes what they actually deployed. Influences are cited explicitly: consistent hashing (Karger 1997), vector clocks (Mattern/Fidge 1988), Merkle trees (Merkle 1979), quorum systems (Gifford 1979). The contribution is the synthesis, not any single technique.
Cassandra's extensions
Cassandra (Facebook 2008, Apache 2009) inherited Dynamo's replication layer and bolted a Bigtable-style column-family data model on top. Key additions over the years: tunable consistency per query (the CONSISTENCY clause), lightweight transactions via Paxos (compare-and-set at 4 round-trips — Cassandra admitting some operations want serialisability), materialised views and secondary indexes, and CDC streams like MySQL's binlog. Cassandra is the most-deployed Dynamo descendant in industry.
Riak and DynamoDB
Riak (Basho) was the most direct Dynamo clone, contributing pluggable backends and first-class CRDTs (sets, counters, maps) so applications got automatic merging without merge functions. Basho shut down in 2017 but the code is open-source. Amazon's managed DynamoDB service (2012) is a spiritual descendant, not Dynamo itself: single-region strong consistency (via internal per-partition leaders!), ACID transactions added in 2018, and global tables with last-writer-wins across regions — recognisably Dynamo-the-paper at the multi-region layer.
Where this leads next
Dynamo is the architecture; Chapter 76 covers consistent hashing with virtual nodes, the placement algorithm that lets Dynamo rebalance when nodes join or leave. After that: replica placement and preference lists (ch.77), gossip and membership (ch.78), anti-entropy and Merkle trees (ch.79), hinted handoff in depth (ch.80), and CRDTs later in Build 10. Each is engineering that Dynamo made necessary by throwing away the leader.
References
- DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, SOSP 2007 — the paper that started it all. Written by the production team at Amazon.com, describing what they actually deployed, with a frank "experiences" section discussing operational lessons.
- Lakshman and Malik, Cassandra — A Decentralized Structured Storage System, LADIS 2009 — the original Cassandra paper. Describes how Facebook took Dynamo's replication layer and combined it with a Bigtable-style data model for the inbox search workload.
- Basho Technologies, Riak KV Documentation Archive — reference documentation for Riak KV's replication and conflict resolution, with detailed coverage of vector clocks, sibling resolution, and CRDT types. Archived but still the most complete public description of a pure Dynamo-style deployment.
- Amazon Web Services, Amazon DynamoDB Developer Guide — How It Works — the canonical reference for the managed DynamoDB service, including consistency options, global tables, and the transactional APIs added in 2018.
- Kleppmann, Designing Data-Intensive Applications, Chapter 5 — Replication, O'Reilly 2017 — the clearest pedagogical treatment of leaderless replication, N/R/W, quorum consistency, and vector clocks, with worked examples and careful discussion of the trade-offs.
- Kingsbury, Jepsen: Cassandra (2013) and later analyses — the Jepsen test suite's systematic investigation of Cassandra's behaviour under partitions, clock skew, and network faults. The posts include concrete reproductions of data-loss scenarios and their resolution, providing a sanity check on Dynamo-style systems' real-world consistency guarantees.