In short
Dynamo-style leaderless databases solve two real problems — partition tolerance and horizontal write scale — that single-leader systems cannot. For those workloads where availability is existential (carts, sessions, telemetry, social inboxes), that trade is often worth it. But the cost is architectural, not just operational: eventual consistency leaks out of the database into the application. Your code must now handle five classes of problem that a Postgres primary hides entirely.
Leak 1 — conflict resolution in application code. Without a CRDT that fits your data, every read can return multiple concurrent versions and your application must merge them. Merge bugs are distributed-systems bugs.
Leak 2 — retry and idempotency. Writes can fail with QuorumFailed, or succeed with a lost ack. Retrying naively produces duplicates; every write must carry an operation id or a conditional-write predicate.
Leak 3 — session semantics. "I just posted; why don't I see my post?" Read-your-writes, monotonic-read, and causal-session guarantees require application-level session tokens, sticky routing, or middleware — none are free.
Leak 4 — no cross-key guarantees. Uniqueness ("one active subscription per user", "no two users with the same username") cannot be enforced by the database. Lightweight-transactions, lock services, or compensating workflows fill the gap at real cost.
Leak 5 — schema migrations are hard. Replicas may hold different versions of a record during a migration window. Additive-only, never-drop, rolling-deploy becomes the iron law.
Single-leader SQL hides all of this behind ACID. Leaderless exposes it. That is why most applications in 2026 still start with Postgres and migrate away only when a hard availability or scale wall forces the move. Build 11 pivots to the storage model designed around these trade-offs: wide-column stores (Cassandra, HBase, DynamoDB) — the physical shape that complements the distribution protocol you just built.
A user files a bug report. "I added milk to my cart at 8:47 PM. When I checked out at 9:02, milk was gone. Order placed without it. Please credit me."
You open the logs. The POST /cart/add at 20:47 returned 200 OK. The GET /cart at 21:02 returned a cart without milk. The POST /checkout at 21:03 consumed that cart. No exceptions, no 5xx, no restarts. Every component did what its contract said.
This is not a code bug. It is a consistency-model bug. Somewhere between 20:47 and 21:02, a concurrent write on another replica — a stale "clear cart" from an old tab, a background "mark abandoned" job, a remove that raced with the add — was reconciled against the add, and LWW threw the add away. The database did nothing wrong. Your application made a promise (the add "succeeded") that the database cannot keep on its own.
Every user-visible bug in a well-run leaderless deployment is a distributed-systems bug in disguise. Welcome to the wall.
The five leaks
Across the rest of this chapter, follow one structural claim: when you choose leaderless, the database stops being a black box. Concerns that a Postgres primary hides — serialisation of concurrent writes, atomic commit, uniqueness, read-your-writes, schema evolution — all become application responsibilities.
Name them precisely.
- Leak 1 — Conflict resolution in application code. Without a CRDT that fits your data, every read can return multiple concurrent versions and your application must merge them. Merge logic lives in your repository, not the database's.
- Leak 2 — Retry and idempotency. Writes can fail with QuorumFailed, or succeed with a lost ack. Retries without idempotency produce duplicates; every write must carry an operation id or conditional predicate.
- Leak 3 — Session semantics. Read-your-writes, monotonic reads, and causal sessions require application-level session tracking or smart middleware. None are free in latency or complexity.
- Leak 4 — Cross-key invariants. There are no foreign keys and no UNIQUE across shards. "Each username is unique" must be enforced by conditional writes, a lock service, or a compensating workflow.
- Leak 5 — Schema migrations. Replicas may hold different record shapes during a migration window. Additive-only changes, never-drop, always-rolling becomes an engineering law.
Each of these deserves its own section. Each is a class of production bug that simply does not exist on Postgres.
Leak 1 — Merge logic in application code
You stored a user profile {name, email, address, phone} at key user:42. Two clients issued concurrent updates: a mobile app updated phone; a desktop app updated email. Both writes were accepted on different coordinators before they could see each other. Now replica X holds {name, email1, address, phone1} and replica Y holds {name, email2, address, phone2}, where only one field differs on each side.
The database has two options. It can apply LWW at the whole-document level and keep one update while silently discarding the other — the user's phone edit vanishes for no reason a human could explain. Or it can hand the application both versions (vector-clock siblings) and let the application merge. Either way, the problem has leaked upward.
If the application handles siblings, the merge function looks like this:
def merge_user_profile(versions: list[dict]) -> dict:
    """Merge concurrent sibling versions of a user profile.

    versions is a list of full profile dicts, each with a
    per-field timestamp like {"name": ("Priya", 1713000000), ...}.
    For each field, keep the (value, timestamp) pair with the
    largest timestamp.
    """
    fields = set().union(*(v.keys() for v in versions))
    merged = {}
    for f in fields:
        candidates = [v[f] for v in versions if f in v]
        merged[f] = max(candidates, key=lambda x: x[1])  # newest pair wins
    return merged
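To make the behaviour concrete, here is a condensed, self-contained restatement of the same per-field rule applied to the two-sibling scenario above (the data values and timestamps are illustrative):

```python
def merge_per_field(versions):
    # per-field LWW: for each field, keep the (value, timestamp) pair
    # with the largest timestamp across all sibling versions
    fields = set().union(*(v.keys() for v in versions))
    return {f: max((v[f] for v in versions if f in v), key=lambda p: p[1])
            for f in fields}

# replica X saw the mobile app's phone edit; replica Y saw the desktop email edit
sibling_x = {"name": ("Priya", 100), "phone": ("+91-98xxx", 250), "email": ("old@x", 100)}
sibling_y = {"name": ("Priya", 100), "phone": ("+91-97yyy", 100), "email": ("new@x", 300)}

merged = merge_per_field([sibling_x, sibling_y])
# both concurrent edits survive: the phone edit from X, the email edit from Y
```

Note what whole-document LWW would have done with the same input: one sibling wins wholesale, and either the phone edit or the email edit silently disappears.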
Why per-field LWW is safer than whole-document LWW: at the document level, the loser's edits to other fields get thrown away along with the losing field. Per-field LWW preserves all non-conflicting edits. The cost is that every field carries its own timestamp — the metadata overhead of the CRDT world, without the formal convergence guarantees. You are hand-building something that looks like an MV-Register without the rigor.
The subtle bugs are endless. Nested objects become a sequence-CRDT problem you did not sign up for. References to other keys? A merge on user:42 might restore a stale team_id that was deliberately cleared. Derived fields — full_name computed from first_name + last_name? Merge components, then recompute.
You will get some of this wrong. Every team writing merge logic gets some of it wrong; bugs surface weeks later as "my preference flipped back" tickets. In Postgres these bugs are impossible because the primary serialises writes — the second UPDATE simply overwrites the first and no sibling ever exists.
Leak 2 — Retry and idempotency
A client issues PUT cart:priya [milk, bread] to a coordinator. The coordinator sends to N=3 replicas; W=2 required. Two acks come back from replicas X and Y; a third from Z was lost in a timeout. From the database's view, the write succeeded — two replicas have it, anti-entropy will propagate to Z.
The coordinator tries to ack the client. The ack packet drops. The client, after a 500 ms timeout, retries.
Now the client issues the same PUT again. The second PUT lands on a different coordinator, fans out to the same three replicas, all three accept — but some databases do not treat the two PUTs as the same operation. The retry may be recorded as a second concurrent write with its own version tag. On the next read the application sees two siblings — both identical on the wire, but semantically duplicates.
Worse: imagine the operation was not a PUT but a counter.increment. The retry double-counts. The customer sees +100 for a single purchase.
The fix is idempotency at the application level. Every write carries a client-generated operation id:
import uuid

def charge_wallet(db, user_id, amount_paise):
    op_id = str(uuid.uuid4())  # client generates once, retries reuse it
    payload = {"op_id": op_id, "amount": amount_paise}
    for attempt in range(5):
        try:
            return db.apply_charge(user_id, payload, timeout=1.0)
        except (QuorumFailed, NetworkTimeout):
            continue  # safe: op_id is the same across retries
    raise BackendUnavailable()
The server side deduplicates by op_id:
def apply_charge(user_id, payload):
    if db.seen_op(user_id, payload["op_id"]):
        return db.get_result(user_id, payload["op_id"])  # cached result
    result = perform_charge(user_id, payload["amount"])
    db.record_op(user_id, payload["op_id"], result)
    return result
Why the op-id table is itself a consistency problem: seen_op and record_op must be atomic together with perform_charge, otherwise a crash between check and record lets a retry double-charge. On a single-leader SQL database, a transaction wraps all three. On a leaderless store, you need a conditional write — "record this op_id only if it does not exist" — as a lightweight-transaction or compare-and-swap. The idempotency layer is itself a mini-transaction protocol your code must implement.
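The mini-protocol can be sketched with a single atomic put-if-absent primitive standing in for the conditional write (the KV class and function names below are toy stand-ins, not a real driver API; the toy also collapses claim and charge into one process, which is exactly the atomicity gap described above):

```python
class KV:
    """Toy in-memory store exposing one atomic primitive: put-if-absent.
    In a real leaderless store this would be an LWT / conditional write."""
    def __init__(self):
        self.data = {}

    def put_if_absent(self, key, value):
        if key in self.data:
            return False
        self.data[key] = value
        return True

def charge_once(kv, balances, user_id, op_id, amount):
    # claim the op id; only the first attempt's claim succeeds, so a
    # retried request falls through without charging a second time
    if not kv.put_if_absent(f"op:{user_id}:{op_id}", amount):
        return balances[user_id]
    balances[user_id] -= amount
    return balances[user_id]

kv = KV()
balances = {"priya": 1000}
charge_once(kv, balances, "priya", "op-1", 100)  # original request
charge_once(kv, balances, "priya", "op-1", 100)  # network retry, same op id
# balances["priya"] is 900: charged exactly once
```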
The alternative to op-ids is conditional writes on version — "set this value only if the current version is v7". Every write becomes a read-modify-conditional-write loop. This is Dynamo's ConditionExpression shape; same complexity in a different place.
Leak 3 — Session semantics in application code
A user posts a comment. The POST /comments hits coordinator X, writes land on replicas X, Y, Z, coordinator X returns 201 Created. The user's next request — loading the page to see their comment — is routed by a load balancer to a different coordinator, W, which reads from replicas W, Y, X. Replica W has not yet received the write. The read completes before W catches up. The user's own comment is missing from the page.
This is not a bug. It is eventual consistency working as designed. A leaderless cluster does not guarantee a client will see their own writes on the next read. To get that guarantee, you need a session-level layer on top, covered in the chapter on read-your-writes, monotonic reads, and causal sessions. It has three implementation shapes, in ascending order of cost:
- Sticky routing. Pin a user's reads to the same coordinator as their writes for a window (typically session_ttl >= max_replication_lag). Cheap and mostly effective. Fails when the sticky node dies or when the user crosses data centres.
- Client-side version tokens. On every write, the coordinator returns a version token; the client stores it and sends it on subsequent reads; the coordinator waits until at least one replica has caught up to that version before reading. Effective, but every request now carries session state, and reads may block on replica lag.
- Causal consistency with vector clocks. The client tracks the set of versions it has observed across all keys; every read must be ordered after that set. This is the full session guarantee; it is also the most expensive and the one most systems do not implement.
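The second shape can be sketched against a toy single-copy store (every class and parameter name here is illustrative; a real client would block or retry until some replica reaches min_version rather than assert):

```python
class ToyDB:
    """Single-copy stand-in for a replicated store with monotonically
    increasing per-key versions."""
    def __init__(self):
        self.values, self.versions = {}, {}

    def put(self, key, value):
        v = self.versions.get(key, 0) + 1
        self.values[key], self.versions[key] = value, v
        return v  # version token handed back to the client

    def get(self, key, min_version=0):
        # a real coordinator would wait until some replica catches up to
        # min_version; the single-copy toy always satisfies the constraint
        assert self.versions.get(key, 0) >= min_version
        return self.values[key]

class Session:
    """Read-your-writes via client-side version tokens."""
    def __init__(self, db):
        self.db = db
        self.tokens = {}  # key -> version of our own last write

    def write(self, key, value):
        self.tokens[key] = self.db.put(key, value)

    def read(self, key):
        # never accept a view older than our own last write of this key
        return self.db.get(key, min_version=self.tokens.get(key, 0))

s = Session(ToyDB())
s.write("comment:1", "first!")
# s.read("comment:1") returns "first!" -- never a pre-write view
```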
The third shape, a causal session, might look like this (a sketch against a hypothetical db API that accepts an after= ordering constraint):

class CausalSession:
    def __init__(self, db):
        self.db = db
        self.observed = {}  # key -> last-seen version

    def write(self, key, value):
        version = self.db.put(key, value, after=self.observed)
        self.observed[key] = version

    def read(self, key):
        value, version = self.db.get(key, after=self.observed)
        self.observed[key] = max(self.observed.get(key, 0), version)
        return value
Why session tokens are not a free upgrade: blocking a read until a replica catches up turns a zero-latency request into one that waits for replication. Under heavy write load or partition, that wait can be seconds. Users experience "the page took forever to load" — the same user who experienced "my post is missing" under no-session-guarantee. The consistency dial is a dial, not a switch; every position has a cost.
Every Postgres application gets read-your-writes for free because the primary is the source of truth. In a leaderless store, you engineer for it, and the engineering is visible in the code — session.write(...) not db.write(...), session.read(...) not db.read(...).
Leak 4 — No cross-key guarantees
"Each username must be unique" is a one-line constraint in Postgres:
CREATE TABLE users (
    username TEXT UNIQUE NOT NULL,
    ...
);
That constraint is enforced atomically with the insert because all writes pass through a single serialising leader. Two concurrent inserts for username = "priya" — one wins, one fails with duplicate key. The application sees a clean error and tells the user to pick another name.
In a leaderless database, this constraint cannot be enforced by the database. Two concurrent INSERTs with the same username hit different coordinators, both pass (no replica has seen the other), both are durably stored. Later, read-repair or anti-entropy discovers two rows with the same username and has no basis to prefer either. LWW may discard one; vector-clock siblings surface both. Your application now has duplicate usernames in production, and the "fix" — decide which one wins, delete the other, email the loser — is a business problem.
Three workarounds exist, each with costs:
- Lightweight transactions (Cassandra's IF NOT EXISTS). A Paxos round on top of the leaderless write path, guaranteeing linearisability for one specific operation. Works correctly. Costs 4× the latency of a normal write and a round-trip to every preference-list node. Scale limited to low-throughput keys.
- External lock service (ZooKeeper, etcd). Acquire a distributed lock on the username, check for existing rows, insert, release the lock. Correct. Adds ZooKeeper/etcd to your deployment, and the lock service becomes the new single leader for this operation — the bottleneck you tried to avoid.
- Accept duplicates, reconcile asynchronously. Insert without checking; a background job scans for duplicate usernames, keeps the oldest, renames or deletes the others, notifies the users. Simple, eventually consistent, user-hostile unless the collision is rare.
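The conditional-write shape of the first workaround, reduced to its essence (dict.setdefault stands in for an atomic IF NOT EXISTS; the function name is illustrative):

```python
def claim_username(registry, username, user_id):
    # atomic insert-if-absent: setdefault stores the value only when the
    # key is new, and returns whatever is stored -- so only the first
    # claimant sees its own id come back
    return registry.setdefault(username, user_id) == user_id

registry = {}
claim_username(registry, "priya", "user-42")   # True: first claim wins
claim_username(registry, "priya", "user-99")   # False: name already taken
```

The whole difficulty of the leak is making that one line atomic across replicas; that is what the Paxos round in an LWT buys.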
The same pattern repeats for every cross-key invariant — "at most one active subscription", "no two orders share an invoice number". In Postgres each is a UNIQUE, CHECK, or FOREIGN KEY. In Dynamo each is a protocol.
Why this is the hardest leak to see until production: in a test suite with synthetic data and serial traffic, cross-key races never fire. In staging with a few dozen concurrent users, they fire once a week. In production on Black Friday with hundreds of thousands of concurrent sign-ups, they fire thousands of times per minute. The bug is invisible in development and catastrophic at scale. This is why Dynamo-style adoption almost always starts with "cart" or "session" — workloads where no cross-key invariant exists — and never with "ledger" or "user accounts".
Leak 5 — Schema migrations are hard
Adding a column to a Postgres table is ALTER TABLE users ADD COLUMN phone TEXT;. The DDL runs on the primary; replicas apply it via WAL; the schema is universal within seconds. Any application code that assumes phone exists will read NULL on old rows and something-meaningful on new ones.
In a leaderless store there is no "the primary" to run DDL on. When you change the shape of a value:
- Adding a field. Old replicas still write values without the field. Merges must handle missing fields as "not-yet-set", never as "explicitly-null".
- Dropping a field. Old replicas may still read values with the field. Code that is "done removing nickname" must still tolerate its presence for as long as the oldest replica can hold stale data — weeks or months under sloppy quorums.
- Renaming a field. Never atomic. Add the new name, double-write both for a migration window, backfill, switch reads, stop writing the old name. Five deploys.
The iron laws: never drop, additive only, rolling deploys, backwards-compatible parsing. Every schema change becomes a protocol change, because the database does not coordinate schema across replicas.
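What backwards-compatible parsing looks like in practice: a reader that tolerates records written before a field was added and after another started being removed (the field names are illustrative):

```python
def parse_profile(record: dict) -> dict:
    """Read old and new record shapes alike."""
    return {
        "name": record["name"],
        # 'phone' was added later: a missing value means "not yet set",
        # never an error and never an explicit null
        "phone": record.get("phone"),
        # 'nickname' is being removed: ignore it if present, never crash on it
    }

old = parse_profile({"name": "Priya", "nickname": "pri"})    # pre-migration shape
new = parse_profile({"name": "Priya", "phone": "555-0101"})  # post-migration shape
# old["phone"] is None; both shapes parse without error
```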
The overall pattern — distributed-systems problems become CRUD problems
Step back. The leaks are not separate bugs; they are instances of one structural fact.
In a single-leader database, the distributed-systems concerns have been solved for you by the primary — it serialises writes, enforces invariants, runs schema, manages replica state. Your code sees a clean RPC.
In a leaderless store your application code is the coordinator. You merge versions; dedupe retries; manage sessions; enforce invariants; coordinate schema evolution. The database's RPC is much thinner — put, get, cas — and the hard parts are above the line.
This is what "eventual consistency leaks" really means. The database is not incorrect; the abstraction boundary moves. Concerns that used to live inside the database now live in your application.
A useful diagnostic: in a Postgres codebase, grep for SELECT FOR UPDATE, transaction blocks, ON CONFLICT, and foreign keys. Every one of those constructs, in a leaderless port, becomes an application-level protocol — CAS loops, op-id tables, lock-service calls, compensating workflows. The cost is spread across the entire codebase.
When this is worth paying
Despite the cost, leaderless is the right answer for a specific shape of workload:
- Availability is existential. Carts, session stores, like-counts, telemetry, push queues, social inboxes. If writes must never be refused, leader-based systems eventually fail you and you accept the application complexity.
- Horizontal write scale beyond any single primary. Workloads above ~100k writes/s sustained — ingest pipelines, time-series metrics, distributed logs.
- Eventually-consistent semantics match user expectations. A like-count showing 1,247 vs 1,248 for ten seconds is fine. A view-count lagging a minute is fine. These are the cases CRDTs and LWW were built for.
- Per-user isolation with no cross-key invariants. A cart has no relationship to another cart. The absence of cross-key invariants eliminates Leak 4 entirely for these workloads — which is why Amazon picked the cart as the motivating example.
- Multi-region active-active with sub-100ms writes. Leader-based cross-region replication is asynchronous-only; leaderless is genuinely multi-master.
When it is not
For most applications, Postgres is still the right answer in 2026:
- Financial and transactional systems. Banking, payments, ledgers. Strong consistency is a requirement; lost writes are real money.
- Small-to-medium applications. Below ~20k writes/s on a decent primary — most applications — there is no reason to pay the leaderless tax.
- Complex queries and reporting. Joins, ad-hoc analytics, window functions, full-text search. Leaderless stores are key-value at heart.
- Teams without distributed-systems experience. The leaks require senior engineering. A team that has never read a Jepsen report will ship data-loss bugs.
Start with Postgres. Migrate away only when you hit a wall — Build 9's write ceiling, a cross-region active-active requirement, or a specific workload (cart, session, telemetry) that profits from availability.
Python demonstration — a "simple" read-modify-write
Show the leak on the smallest possible operation: incrementing a counter.
def increment_counter_naive(db, key):
    v = db.get(key)     # read current value
    db.put(key, v + 1)  # write back plus one
This is correct on Postgres with transactions; catastrophically wrong on a Dynamo-style store. Two concurrent callers both read v = 5, both write v + 1 = 6. One +1 is lost. The counter goes from 5 to 6 instead of from 5 to 7. If the workload is "count page views", your analytics drift downward by the concurrency rate. If the workload is "count stock-on-hand", you oversell.
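The lost update is easy to reproduce deterministically by interleaving two read-modify-write sequences by hand:

```python
store = {"views": 5}  # toy single-key store

# two clients interleave: both read before either writes
a = store["views"]      # client A reads 5
b = store["views"]      # client B reads 5
store["views"] = a + 1  # A writes 6
store["views"] = b + 1  # B writes 6, silently overwriting A's increment
# store["views"] is 6, not 7: one +1 vanished without any error
```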
Fixing it requires either a CRDT counter (encode the merge into the type — see CRDTs chapter) or a compare-and-swap loop:
def increment_counter_safe(db, key, max_retries=10):
    for attempt in range(max_retries):
        value, version = db.get(key, with_version=True)
        try:
            db.compare_and_set(
                key, value + 1, expected_version=version
            )
            return value + 1
        except VersionMismatch:
            # another writer beat us; reread and retry
            continue
    raise TooMuchContention(f"CAS failed {max_retries} times")
Why CAS loops are not a universal fix: under heavy contention, the retry rate explodes. If ten clients are all incrementing the same key, each round of CAS only lets one succeed; the other nine retry; of those nine, one succeeds; and so on. The expected number of retries grows linearly with contention. Production counters under load ("likes on a viral post") grind to a halt on CAS; they need a CRDT PN-Counter or a sharded-counter pattern (N sub-keys that aggregate on read). Every concurrency primitive has a contention ceiling, and CAS's ceiling is low.
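The sharded-counter pattern named above can be sketched in a few lines: writes scatter across N sub-keys so hot-key contention divides by N, and reads aggregate (the shard count and key names are illustrative):

```python
import random

N_SHARDS = 8

def incr(store, key):
    # write to one randomly chosen sub-key; contention on a hot key
    # is spread across N_SHARDS independent keys
    shard = f"{key}:{random.randrange(N_SHARDS)}"
    store[shard] = store.get(shard, 0) + 1

def read(store, key):
    # aggregate every sub-key on read
    return sum(store.get(f"{key}:{i}", 0) for i in range(N_SHARDS))

store = {}
for _ in range(1000):
    incr(store, "likes:post42")
# read(store, "likes:post42") == 1000
```

The trade is the usual one: writes get cheaper and less contended, reads get N times wider.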
The naive three lines of Postgres-style code become ten lines of retry logic, plus a pager alert for TooMuchContention, plus a separate implementation path for hot keys. This multiplies across every mutation in your codebase.
An e-commerce checkout under leaderless
Walk the flow — cart, inventory, payment, order — and identify what fits Dynamo-style and what does not.
Cart. cart:priya is a set of item-ids. Additions are CRDT adds; removes are OR-Set removes; "add from two devices" merges as union. Fits perfectly — the workload the paper was written for.
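The add-wins behaviour that makes the cart fit shows up in a minimal OR-Set sketch: each add carries a unique tag, and a remove deletes only the tags it has observed, so an unseen concurrent add survives a concurrent clear (a toy illustration, not a production CRDT):

```python
import uuid

class ORSet:
    def __init__(self):
        self.adds = {}     # element -> set of unique add-tags
        self.removes = {}  # element -> set of observed (removed) tags

    def add(self, elem):
        self.adds.setdefault(elem, set()).add(uuid.uuid4().hex)

    def remove(self, elem):
        # remove only the add-tags this replica has actually seen
        self.removes.setdefault(elem, set()).update(self.adds.get(elem, set()))

    def merge(self, other):
        for e, tags in other.adds.items():
            self.adds.setdefault(e, set()).update(tags)
        for e, tags in other.removes.items():
            self.removes.setdefault(e, set()).update(tags)

    def __contains__(self, elem):
        return bool(self.adds.get(elem, set()) - self.removes.get(elem, set()))

replica_a, replica_b = ORSet(), ORSet()
replica_a.add("milk")     # add on one replica...
replica_b.remove("milk")  # ...concurrent remove on another, which never saw the add
replica_a.merge(replica_b)
# "milk" in replica_a is True: the unobserved add survives the remove
```

This is precisely the semantics that would have saved the milk in the opening bug report.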
Inventory check. inventory:sku_12345 is a shared counter. Concurrent checkouts read the same count, both decrement, both write — you oversell. Does not fit. Move inventory to a strongly-consistent store, or accept oversell and compensate.
Payment. Retries must not double-charge. op_id deduplication is the application-level protocol; the gateway call itself must also be idempotent (an idempotency_key header). Fits only with careful idempotency engineering.
Order placement. An order spans order:12345, inventory:sku_12345 (decrement), payment:abc (capture), cart:priya (clear). In Postgres this is one transaction. In Dynamo it is a saga — multi-step workflow with compensating actions for partial failure. Does not fit natively; requires a saga orchestrator.
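A minimal saga orchestrator for that flow might look like the sketch below (the action/compensation pairs are illustrative; a production orchestrator also persists its progress so it can resume after a crash):

```python
def run_saga(steps):
    """Execute (action, compensation) pairs in order; on any failure,
    run compensations for the completed steps in reverse."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for undo in reversed(done):
                undo()
            return False
    return True

state = {"stock": 10, "charged": False}

def reserve():   state["stock"] -= 1
def unreserve(): state["stock"] += 1
def charge():    raise RuntimeError("payment gateway timeout")  # simulated failure
def refund():    state["charged"] = False

ok = run_saga([(reserve, unreserve), (charge, refund)])
# ok is False and state["stock"] is back at 10: the reservation was compensated
```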
The mature pattern. Use Postgres for order-critical steps (inventory, payment, final order — one ACID transaction) and Cassandra or DynamoDB for cart, session, telemetry, recommendation cache. Every mature e-commerce company, including Amazon itself, runs polyglot persistence of roughly this shape.
Common confusions
- "Eventual consistency is always weaker." Different, not weaker. For always-writable workloads, EC is strictly better — the strongly-consistent system simply refuses writes during partition. "Weaker" only makes sense for the same workload.
- "CRDTs solve the leaks." Only for data types that fit CRDT semantics. Counters, sets, maps, sequences have CRDTs. Business objects with custom invariants do not.
- "You can layer strong consistency on top." Via Paxos-per-key (Cassandra LWT) — but you lose most of the leaderless benefits: availability drops to CP, latency rises 4×, coordination returns. "Leaderless with strong-consistency bolted on" is a worse database than "single-leader with good failover".
- "Newer systems are better." Dynamo is 2007, Cassandra 2008, Riak 2009 — nearly two decades old. "Newer" is not a technical argument; fit-for-workload is. Postgres in 2026 is still right for most applications, and will be in 2036.
- "Managed leaderless services hide the leaks." Managed DynamoDB handles operations (replication, scaling, patching) but not the consistency model. The five leaks are architectural, not operational. Your code still merges, retries, tracks sessions, enforces invariants.
Going deeper
For senior engineers evaluating architectural choices, these three readings sharpen the trade-offs the chapter sketches.
CAP, twelve years later
Brewer's 2012 follow-up clarifies the practical reading of his 2000 conjecture. Partitions are rare; during non-partition operation you can have all three of C, A, and P. CAP's real message is: when partitions occur, you must design for them explicitly — pick A or C per operation, not once for the whole system. Dynamo is overwhelmingly AP; Spanner is overwhelmingly CP; most production systems are a per-operation mix.
PACELC — consistency vs latency even without partition
Abadi's 2012 refinement adds a second trade-off for the non-partition case: Else, pick Latency or Consistency. Even on a healthy network, strong consistency pays quorum round-trips that eventual consistency does not. Dynamo is PA/EL; Spanner is PC/EC. PACELC is more useful than CAP for architectural choices because the non-partition case is the common case.
Strong eventual consistency — the narrowest escape
The cleanest response to the five leaks is SEC via CRDTs. It does not eliminate the leaks — retries, sessions, invariants, schema still live in the application — but it eliminates Leak 1 by construction. Where your data fits a CRDT, use one. The practical limit is that most business data does not fit, and engineering new CRDTs is research-grade work.
Where this leads next
Build 10 ends here. You have walked the full Dynamo stack — leaderless replication (ch.75), consistent hashing (ch.76), N/R/W quorums (ch.77-78), sloppy quorums and hinted handoff (ch.79), gossip (ch.80), anti-entropy (ch.81), LWW and vector clocks (ch.82), CRDTs (ch.83), and the honest summary of what it costs the application (ch.84). The distribution layer is complete.
Build 11 opens on wide-column storage — Cassandra's, HBase's, and DynamoDB's physical layout. Where Build 10 was about how replicas talk to each other, Build 11 is about how one replica stores data on disk — column families, SSTables as storage primitive, partition-key-plus-clustering-key design, tombstone management. The two halves of the architecture were designed to match.
References
- Eric Brewer, CAP Twelve Years Later — How the Rules Have Changed, IEEE Computer 45(2), 2012 — the original CAP author revisits the theorem a decade in, introducing the per-operation reading and grounding it in production experience.
- Daniel Abadi, Consistency Tradeoffs in Modern Distributed Database System Design — CAP is Only Part of the Story, IEEE Computer 45(2), 2012 — the PACELC framework, sharper than CAP because it captures the common (non-partition) case.
- Kyle Kingsbury, Jepsen — Distributed Systems Safety Research — the public test series that has found real consistency bugs in Cassandra, Riak, MongoDB, CockroachDB, and dozens of others. Practitioner ground truth on what these systems actually guarantee.
- Martin Kleppmann, Designing Data-Intensive Applications, Chapter 12, O'Reilly 2017 — DDIA's closing chapter on polyglot persistence, eventual-consistency bugs, and how the five leaks play out in real deployments.
- Michael Stonebraker, The End of an Architectural Era, VLDB 2007 — the relational-database counter-argument. Stonebraker argues legacy architectures are wrong for modern workloads, but for different reasons than Dynamo does.
- Dan Pritchett, BASE — An Acid Alternative, ACM Queue 6(3), 2008 — the industry coinage of BASE (Basically Available, Soft-state, Eventually consistent). A short manifesto for the philosophy Dynamo embodies.