In short
Between "eventual consistency" — the async-replication default, under which a read may observe the state from any point in the past and two successive reads may even contradict each other — and "linearisability" — every read returns the globally most recent write, as if the whole system ran on one machine — lies a ladder of intermediate guarantees that cost far less than linearisability and fix almost every user-visible anomaly.
Read-your-writes (RYW) says: within a session, a read issued after a write sees that write. User posts a photo, reloads, sees the photo. Nothing more, nothing less. It does not promise that other users see your photo immediately; only that you do.
Monotonic reads says: within a session, successive reads move forward in time, never backward. If you saw a comment on page load 1, you still see it on page load 2. The replica's view, from your session's perspective, is a non-decreasing function of time.
Causal consistency says: if write B happened-after write A — a reply to a post, an edit to a document someone else created, a tag added to a photo someone else uploaded — then any reader who sees B also sees A. Causality is preserved across sessions; concurrent, unrelated writes may still be seen in different orders on different replicas.
Each guarantee is strictly weaker than linearisability and strictly stronger than eventual. Each is implementable on top of ordinary async log shipping — from ch.67 — with a session-scoped LSN cache, a sticky session pin, or a vector-clock-shaped dependency set. The cost is a few dozen bytes of per-session state and, occasionally, a short wait while a chosen replica catches up. You pay nothing on the write path; you add no synchronous replication hop; you keep the read-scaling benefit of replicas. This chapter is about how the three rungs are defined, how to implement each in real middleware, where each one fails, and where the ladder leads past linearisability into the CAP-constrained world of ch.74 and Build 10.
A user in Pune hits three bugs in ninety seconds, none a bug in any individual component.
First. They post a photo. The toast appears. They tap their profile — no photo. Reload, reload, reload. On the fourth, the photo appears. Primary committed; the replica their read hit had not yet applied the WAL record. Engineers call this replica lag; users call it "broken".
Second. They read a friend's birthday thread — top comment "happy bday bro". They come back via the notification bell; top comment is gone. Reload and it reappears. Two reads, two different replicas, reality running backwards.
Third. They see a reply — "totally agreed". They tap through to the parent — "no such post". Alice posted, Bob replied, and the replica serving Bob's reply had applied Bob's write but not Alice's.
Three distinct anomalies, three distinct fixes. You do not need linearisability to stop any of these bugs; you need the three rungs of the session-consistency ladder.
The consistency ladder — where these sit
Consistency models form a partial order; stronger ones imply weaker. If a system is linearisable it is also causal, monotonic, and read-your-writes; the reverse is not true. The useful portion of the lattice:
- Eventual consistency. In the absence of new writes, all replicas converge. Nothing more. Reads may be arbitrarily stale; two successive reads in the same session may disagree; causally dependent writes may arrive out of order. What plain async replication gives you free.
- Read-your-writes (RYW). Eventual, plus: within a session, a read after a write sees that write. The smallest useful strengthening.
- Monotonic reads. Eventual, plus: within a session, successive reads are non-decreasing in time. You never see a row vanish and reappear. Orthogonal to RYW.
- Causal consistency. Monotonic + RYW, plus: writes with a happens-before relationship (Lamport 1978) are observed in order, across sessions. If Alice's post happened-before Bob's reply, any session that sees the reply also sees the post. Concurrent (no happens-before) writes may still be seen in different orders on different replicas.
- Sequential consistency. All operations appear in some total order consistent with per-process program order, but no real-time guarantee. Mostly a stepping stone.
- Linearisability (aka external consistency, strong consistency). Every operation appears to take effect atomically at some instant between invocation and response, and the total order respects real-time ordering. The gold standard, and the expensive one.
Why this ladder matters operationally: most user-facing anomalies that get filed as bugs are violations of one of the three middle rungs. You do not need to pay for linearisability to fix them. You pay for session-scoped state (a few bytes), a pinning decision (a cookie), or a replica-LSN lookup (a cache). The CAP theorem's famous cost — giving up either consistency or availability under a partition — applies to linearisability, not to the middle rungs. Causal consistency can remain available under partitions; it is among the strongest guarantees compatible with availability, a point demonstrated in practice by Bailis et al. 2013.
Read-your-writes — implementation
Read-your-writes is the smallest useful strengthening of eventual, and the single guarantee that fixes the most-filed class of user-reported bugs. Three implementation strategies, from crudest to most principled.
Strategy A: read-from-primary window. The cheapest. On every write, record the wall-clock timestamp. For some window W after that timestamp (typically 2-4× the p99 replica lag), route that session's reads to the primary instead of any replica. After W expires, return to the replica pool. Covered in ch.70 in detail. Coarse but effective; works without any LSN plumbing.
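As a sketch, the window approach needs only a per-session last-write timestamp and a clock comparison (names here are illustrative, not from any particular framework):

```python
import time

class WindowRouter:
    """Route a session's reads to the primary for W seconds after its last write."""
    def __init__(self, window_s=2.0):
        self.window_s = window_s          # choose ~2-4x the p99 replica lag
        self.last_write = {}              # session_id -> wall-clock time of last write

    def note_write(self, session_id, now=None):
        self.last_write[session_id] = time.monotonic() if now is None else now

    def target(self, session_id, now=None):
        now = time.monotonic() if now is None else now
        t = self.last_write.get(session_id)
        if t is not None and now - t < self.window_s:
            return "primary"              # recent writer: replicas may still lag
        return "replica"                  # outside the window: any replica will do

router = WindowRouter(window_s=2.0)
router.note_write("alice", now=100.0)
assert router.target("alice", now=101.0) == "primary"   # inside the window
assert router.target("alice", now=103.0) == "replica"   # window expired
assert router.target("bob", now=101.0) == "replica"     # never wrote
```

Note what the sketch makes visible: the router never inspects any replica state, which is exactly why the window is both too strict and too loose.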
Strategy B: session LSN token. The principled answer. On every write, capture a primary WAL LSN at or past the commit record (Postgres: pg_current_wal_lsn() called immediately after the commit, since a pre-commit call can return an LSN that precedes the commit record itself; or the commit LSN reported by the replication stream). Store the LSN in the user's session. On every subsequent read, consult a cache of per-replica replay_lsn values; pick a replica whose replay_lsn >= session_lsn. If none exists, wait briefly (Postgres 17 adds pg_wal_replay_wait()) or route to the primary as a fallback.
```python
# ryw_session.py
import psycopg2  # any DB-API driver works; connections are injected below

def lsn_to_int(lsn):
    """Parse a Postgres LSN such as '0/D1000000' into a comparable integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

class RYWRouter:
    def __init__(self, primary_conn, replica_conns, replica_lsns):
        self.primary = primary_conn
        self.replicas = replica_conns     # dict: replica_id -> live replica connection
        self.replica_lsns = replica_lsns  # dict: replica_id -> latest replay_lsn
        self.sessions = {}                # session_id -> last commit LSN

    def write(self, session_id, sql, params=()):
        cur = self.primary.cursor()
        cur.execute(sql, params)
        self.primary.commit()
        # Capture the LSN after commit so it is at or past the commit record;
        # a pre-commit pg_current_wal_lsn() could land before the commit itself.
        cur.execute("SELECT pg_current_wal_lsn()")
        lsn = cur.fetchone()[0]
        self.primary.commit()
        self.sessions[session_id] = lsn
        return lsn

    def read(self, session_id, sql, params=()):
        needed = self.sessions.get(session_id)
        if needed is None:                # session never wrote: any replica will do
            return self._exec(next(iter(self.replicas.values())), sql, params)
        for r_id, r_lsn in self.replica_lsns.items():
            if self._lsn_ge(r_lsn, needed):
                return self._exec(self.replicas[r_id], sql, params)
        return self._exec(self.primary, sql, params)  # no replica caught up: fallback

    def _lsn_ge(self, a, b):
        return lsn_to_int(a) >= lsn_to_int(b)

    def _exec(self, conn, sql, params):
        cur = conn.cursor()
        cur.execute(sql, params)
        return cur.fetchall()
```
Why the LSN token approach is strictly better than the time window: the window says "recent writers cannot use replicas for W seconds", which is both stricter than necessary (most replicas catch up well within W) and looser than necessary (if a replica lags beyond W you have silently lost RYW). The LSN token routes each read to a replica known to satisfy the session's causal requirement. The primary absorbs only the small tail of reads for which no replica has caught up yet. Read scaling is preserved almost in full, and correctness stops being heuristic.
Strategy C: replica wait. A variant of B: tell the replica "wait until your replay_lsn reaches X, then answer, or fail after 100 ms". Postgres 17's pg_wal_replay_wait() is exactly this. Trades a bounded read-path wait for keeping load off the primary.
MongoDB productises strategy B as readConcern: 'majority' + afterClusterTime: <token>. The driver plumbs the token; you enable it with causalConsistency: true on the session. Transparent to application code.
Monotonic reads — implementation
Monotonic reads is about successive reads, not the first read after a write. The guarantee: if a session reads value V at time t1 and reads the same key at time t2 > t1, the read at t2 sees a state at least as recent as V. Never backward, possibly forward. The user's reality time-travels only in the positive direction.
Two implementation strategies.
Strategy A: sticky sessions. Pin a given session to a single replica — typically via a cookie the load balancer hashes on. All reads from that session hit the same replica. That replica's replay_lsn is monotonically increasing (apply never regresses), so the observed state is monotonically increasing too. Dead simple, and a common default in load-balancer configuration (HAProxy: cookie SRV insert indirect; nginx: sticky cookie; AWS ALB: "stickiness" toggle).
Strategy B: session-tracked max LSN. Every read answer returns the replica's replay_lsn at the moment of answer. The session remembers the maximum LSN it has ever observed (write-committed or read-returned). Subsequent reads carry that max LSN as a token and require a replica at or past it. Stronger than sticky sessions because it survives replica failover — if the pinned replica dies and the session re-pins, the new replica must also be caught up to the max observed LSN.
```python
# monotonic_session.py
class MonotonicSession:
    def __init__(self, router):
        self.router = router
        self.high_water_lsn = None       # max LSN this session has ever observed

    def write(self, sql, params=()):
        lsn = self.router.write_primary(sql, params)
        self._advance(lsn)
        return lsn

    def read(self, sql, params=()):
        replica = self.router.pick_replica_at_or_past(self.high_water_lsn)
        rows, replica_lsn = replica.execute_and_report_lsn(sql, params)
        self._advance(replica_lsn)
        return rows

    def _advance(self, lsn):
        if self.high_water_lsn is None or lsn > self.high_water_lsn:
            self.high_water_lsn = lsn
```
Why strategy B subsumes sticky sessions: sticky sessions give monotonic reads as a consequence of always hitting the same replica, but they break on replica failure — when the session re-pins to a new replica, the new replica's replay_lsn might be behind the old one's. The user sees time run backwards at exactly the moment the infrastructure was trying to recover. Strategy B passes the max observed LSN with every read; the new replica must catch up before it can answer. The cost is an extra LSN comparison per read, which is free, and occasionally a brief wait, which is rare.
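The failover difference can be shown in a few lines: with no token, a re-pin to a lagging replica silently serves a stale read; with a carried high-water LSN, the lagging replica is refused. A toy admission check (illustrative names, LSNs simulated as integers):

```python
def serve(replica_lsn, high_water):
    """Strategy B admission check: a replica may answer only at or past the
    session's high-water mark. high_water is None for a fresh session."""
    return replica_lsn >= (high_water or 0)

# Sticky session: pinned replica r1 (at LSN 120) dies; the LB re-pins the
# session to r2, which has only applied up to LSN 100.
high_water = None                      # sticky sessions carry no token
assert serve(100, high_water)          # stale read served; time runs backwards

# Strategy B: the session carries the max LSN it has ever observed.
high_water = 120                       # last read came back from r1 at 120
assert not serve(100, high_water)      # lagging r2 is refused...
assert serve(125, high_water)          # ...until it catches up past the mark
```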
Cost of sticky sessions. The load balancer loses some ability to rebalance. If one replica gets hot-spotted by a handful of heavy sessions, the LB cannot move them to a cooler replica without breaking monotonicity. In practice, session lifetimes are short enough (minutes) that imbalance averages out.
Causal consistency — implementation
Causal is the most interesting of the three. It is also the weakest guarantee that cleanly handles cross-session scenarios: the "reply exists, original does not" bug from the opening hook.
The definition: operations are partially ordered by the happens-before relation. Operation A happens-before B if (i) they are in the same session and A precedes B in program order, (ii) B reads a value written by A, or (iii) there exists some C such that A happens-before C happens-before B (transitive closure). A causal-consistent system guarantees that if A happens-before B, every session that observes B also observes A.
The implementation pattern: each session maintains a set of "writes I have observed so far" — call it the seen set. On every write, the session adds the write's identifier to the seen set. On every read, the session adds to the seen set all writes the read-result depends on (as reported by the replica). On every subsequent read, the session sends the seen set with the read; the replica must have applied every write in the seen set before it can answer.
In full generality, seen sets are tracked as vector clocks — one counter per writer, so the system can distinguish concurrent writes from causally ordered ones. For single-database systems (one logical primary, replicas following one WAL stream), a single LSN suffices as the seen-set summary, because one WAL stream gives a total order on all writes.
```python
# causal_session.py
class CausalSession:
    def __init__(self, router, session_id):
        self.router = router
        self.session_id = session_id
        self.seen_lsn = None  # monotonic

    def write(self, sql, params=()):
        lsn = self.router.write_primary(sql, params, after_lsn=self.seen_lsn)
        self._update(lsn)
        return lsn

    def read(self, sql, params=()):
        replica = self.router.pick_replica_at_or_past(self.seen_lsn)
        rows, replica_lsn = replica.execute_and_report_lsn(sql, params)
        self._update(replica_lsn)
        return rows

    def merge_from_peer(self, peer_seen_lsn):  # e.g. Alice reads Bob's reply
        if peer_seen_lsn is not None:
            self._update(peer_seen_lsn)

    def _update(self, lsn):
        if self.seen_lsn is None or lsn > self.seen_lsn:
            self.seen_lsn = lsn
```
The merge_from_peer call is how cross-session causality enters: when session A observes a value produced by session B, A's seen set must expand to include B's seen set. In an HTTP/API context, B's seen LSN travels in an HTTP header or in the payload of whatever A just read; A's middleware picks it up and merges.
Why a single LSN suffices on a single-leader Postgres or MySQL: the primary's WAL is a totally ordered log. Every write lands at a unique LSN, and LSNs are monotonic. "Has replica R applied write W?" is equivalent to "is R's replay_lsn >= W's LSN?". The seen set collapses to a single scalar. Vector clocks are only needed when writes originate at multiple nodes — leaderless systems (ch.75), multi-master replication, Dynamo-style quorums — where no single total order exists and "apply all writes causally before X" requires per-writer tracking.
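Where per-writer tracking is needed, the seen set becomes a vector clock. A minimal sketch, just enough to distinguish causally ordered writes from concurrent ones:

```python
class VectorClock:
    """One counter per writer: the general form of a causal seen set."""
    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def tick(self, writer):
        self.counts[writer] = self.counts.get(writer, 0) + 1

    def merge(self, other):
        # Observing a value means adopting its dependencies: element-wise max.
        for w, c in other.counts.items():
            self.counts[w] = max(self.counts.get(w, 0), c)

    def happened_before(self, other):
        """True iff self <= other on every writer and self != other."""
        le = all(c <= other.counts.get(w, 0) for w, c in self.counts.items())
        return le and self.counts != other.counts

    def concurrent(self, other):
        return not self.happened_before(other) and not other.happened_before(self)

a = VectorClock({"alice": 1})            # Alice's post
b = VectorClock({"alice": 1, "bob": 1})  # Bob's reply, written after reading it
c = VectorClock({"carol": 1})            # an unrelated write
assert a.happened_before(b)              # reply depends on post
assert a.concurrent(c) and b.concurrent(c)  # no order either way
```

On a single-leader system, every clock would have exactly one slot, which is why the structure collapses to a scalar LSN.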
Wait vs route. When the replica is behind the session's seen LSN, the router has two choices: wait for the replica to catch up (bounded timeout), or route to a different replica that is already caught up (or to the primary). The trade-off is latency (waiting adds milliseconds) vs load (routing more shifts traffic to the primary). Production systems typically wait up to 50-100 ms, then fall back.
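The wait-then-route policy reduces to a bounded polling loop over the replica-LSN cache. A simulation with integer LSNs and an injected refresh callable (both illustrative):

```python
import time

def pick_or_wait(replica_lsns, needed, refresh, timeout_s=0.1, poll_s=0.01):
    """Return the id of a replica at or past `needed`, waiting up to timeout_s
    for one to catch up; return 'primary' if none does in time.
    replica_lsns: dict replica_id -> replay LSN (comparable integers here);
    refresh: callable that updates replica_lsns in place."""
    deadline = time.monotonic() + timeout_s
    while True:
        for r_id, lsn in replica_lsns.items():
            if lsn >= needed:
                return r_id              # caught-up replica: serve the read here
        if time.monotonic() >= deadline:
            return "primary"             # bounded wait exhausted: fall back
        time.sleep(poll_s)
        refresh()                        # re-read replay LSNs

# Simulated: r1 starts two records behind and applies one per refresh.
lsns = {"r1": 98}
assert pick_or_wait(lsns, needed=100,
                    refresh=lambda: lsns.update(r1=lsns["r1"] + 1)) == "r1"
# A permanently lagging replica forces the primary fallback.
assert pick_or_wait({"r1": 90}, needed=100, refresh=lambda: None,
                    timeout_s=0.03) == "primary"
```

The two tunables are exactly the trade-off described above: timeout_s bounds the added read latency, and the fallback branch is what shifts load to the primary.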
A Python middleware implementation
Tie the three guarantees together into a single middleware. The interface: an application passes a session ID on every call, reads and writes are routed automatically, and the guarantee level is selectable per request.
```python
# consistency_middleware.py
from enum import Enum

class Level(Enum):
    EVENTUAL = 0
    RYW = 1
    MONOTONIC = 2
    CAUSAL = 3

class ConsistencyMiddleware:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas     # dict: id -> Replica
        self.session_lsn = {}        # session_id -> lsn

    def write(self, session_id, sql, params=()):
        lsn = self.primary.execute(sql, params)
        cur = self.session_lsn.get(session_id, 0)
        self.session_lsn[session_id] = max(cur, lsn)
        return lsn

    def read(self, session_id, sql, params=(), level=Level.RYW):
        needed = self.session_lsn.get(session_id, 0) if level != Level.EVENTUAL else 0
        r = self._find_replica_at_lsn(needed)
        if r is None:
            rows, lsn = self.primary.execute_and_report(sql, params)
        else:
            rows, lsn = r.execute_and_report(sql, params)
        if level in (Level.MONOTONIC, Level.CAUSAL):
            cur = self.session_lsn.get(session_id, 0)
            self.session_lsn[session_id] = max(cur, lsn)
        return rows

    def _find_replica_at_lsn(self, target):
        for r in self.replicas.values():
            if r.replay_lsn >= target:
                return r
        return None
```
Thirty lines, all three guarantees. The differences:
- EVENTUAL ignores the session LSN entirely; any replica answers.
- RYW requires the replica to be at or past the last write LSN from the session, but does not update the session LSN on reads.
- MONOTONIC additionally updates the session LSN on every read, so subsequent reads carry a monotonic frontier.
- CAUSAL is MONOTONIC plus a merge_from_peer path (not shown) that folds in seen LSNs from other sessions the user has interacted with.
The real-world version adds: a replica-LSN refresher polling pg_stat_replication every 500 ms, a bounded wait on catch-up (not immediate fallback to primary), per-replica health checks, a circuit breaker on primary fallback, and metrics for "read fell back to primary" rate — which tells you whether your p99 replica lag is within your session SLA or not.
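The refresher itself can be a small background thread that re-reads replay LSNs on an interval and publishes a snapshot. A sketch, with the fetch callable standing in for a query against pg_stat_replication:

```python
import threading, time

class LsnCache:
    """Periodically refresh per-replica replay LSNs for the router to consult."""
    def __init__(self, fetch, interval_s=0.5):
        self.fetch = fetch             # callable returning dict replica_id -> lsn
        self.lsns = fetch()            # prime the cache before serving reads
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self.interval_s = interval_s
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while not self._stop.wait(self.interval_s):
            fresh = self.fetch()       # real version: query pg_stat_replication
            with self._lock:
                self.lsns = fresh

    def snapshot(self):
        with self._lock:
            return dict(self.lsns)

    def stop(self):
        self._stop.set()

# Simulated replica state advancing underneath the cache.
state = {"r1": 100}
cache = LsnCache(lambda: dict(state), interval_s=0.01)
state["r1"] = 105
time.sleep(0.05)                       # let the refresher run a few cycles
snap = cache.snapshot()
cache.stop()
assert snap["r1"] == 105
```

The staleness of this cache bounds how wrong routing decisions can be; a read sent to a replica whose cached LSN is stale-low merely falls back unnecessarily, never violates the guarantee.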
Cost of these guarantees
RYW. Per-session state: one LSN (8 bytes). Storage: session store with TTL matching session lifetime. Extra read latency: a hash lookup in the replica-LSN cache. Extra primary load: the tail of reads where no replica is caught up, under 1% with healthy lag. No extra write latency.
Monotonic reads. Sticky sessions: LB constraint plus a cookie. Session-tracked max LSN: same as RYW plus updating the LSN on every read. Extra latency negligible; storage one LSN per session.
Causal. Single-leader: same as monotonic reads, plus a cross-session merge path. Multi-leader or leaderless: vector clocks scaling with number of writers.
The most material cost across all three is engineering: plumbing session IDs and tokens through every layer. Productised systems (MongoDB, Cosmos DB) make this transparent; hand-rolled middleware does not.
Where these guarantees fail
RYW fails if the session identifier is not preserved across requests. The serverless-in-the-dark-ages problem: short-lived Lambda containers, stateless edge functions, client-side routing without session cookies. If request-2's middleware cannot identify that it is the same session as request-1, it cannot read the stored LSN, and RYW degrades to eventual. The fix — stable session IDs in cookies or auth tokens — is trivial but silently regresses during a refactor. Monitor the "session LSN missing" rate.
Monotonic reads fails if the sticky replica dies and the session is re-routed to a lagging replica. Strategy B (session max LSN) survives this because the new replica is required to be caught up, but has its own failure: if no replica is caught up within the wait bound, the request falls back to the primary or fails.
Causal fails if writer and reader do not share a session-tracking mechanism. If Alice writes a post and Bob reads it, but Bob's middleware has no way to pick up Alice's seen LSN from the response, Bob's subsequent reads can see a reply-to-Alice without Alice. The fix is to include the seen LSN in response headers and have client-side middleware merge it.
All three fail across service boundaries without explicit propagation. A session may touch three databases through three services, each with its own middleware. A write in service A followed by a read in service B gets no RYW unless A propagates its LSN to B. W3C trace context headers are the usual carrier.
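Propagation is two small hooks: the upstream service attaches its session's LSN to the response, and the downstream middleware merges it. A sketch using plain dicts for headers; the header name is illustrative, not a standard:

```python
HEADER = "x-consistency-lsn"          # illustrative name, not a standard header

def attach(headers, seen_lsn):
    """Service A: expose the session's causal frontier on the outgoing response."""
    if seen_lsn is not None:
        headers[HEADER] = str(seen_lsn)
    return headers

def merge(headers, seen_lsn):
    """Service B: fold the incoming frontier into this session's frontier."""
    incoming = headers.get(HEADER)
    if incoming is None:
        return seen_lsn
    incoming = int(incoming)
    return incoming if seen_lsn is None or incoming > seen_lsn else seen_lsn

resp = attach({}, 1040)               # service A's session wrote at LSN 1040
assert merge(resp, None) == 1040      # fresh session in service B adopts it
assert merge(resp, 2000) == 2000      # an already-ahead session keeps its own
```

In a real deployment the same hooks would ride on W3C trace-context-style header propagation, so the token crosses every hop the trace does.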
Comparison with stronger levels
Distinguish the session guarantees here from the database-level isolation levels of Build 7.
Linearisability. Every read returns the most recent write globally, respecting real-time ordering. Requires synchronous replication or consensus (Raft, Paxos). etcd serves linearisable reads by default; ZooKeeper linearises writes but serves reads from local state unless the client issues a sync; production Postgres with async replicas is not linearisable.
Serialisability (isolation level). A property of concurrent transactions: the outcome is equivalent to some serial order. Per-transaction, database-level, not session-level. Snapshot Isolation (ch.47) and SSI (ch.51) are isolation levels; they say nothing about what successive transactions in the same session observe across replicas.
External consistency. Spanner's term: linearisability applied to transactions. Requires TrueTime's bounded-clock machinery or equivalent.
The session guarantees here are orthogonal to isolation levels. You can have SI + RYW, or SSI + causal, or read-committed + monotonic reads — the dimensions compose. An application chooses both: the isolation level for concurrent transactions, and the session guarantee for user-perceived cross-request consistency.
:::example Three users, three replicas, one scenario
Your social app runs on Postgres: one primary in Mumbai, three async replicas (R1, R2, R3) in Mumbai, Hyderabad, and Singapore. Three users — Alice, Bob, Carol — interact in the next 200 ms.
- t = 0 ms. Alice posts "new feature launched" (write W1). Primary commits at LSN 0/D1000000. Alice's session stores seen_lsn = 0/D1000000.
- t = 5 ms. R1 (Mumbai) applies W1. replay_lsn = 0/D1000000.
- t = 12 ms. R2 (Hyderabad) applies W1. replay_lsn = 0/D1000000.
- t = 20 ms. Bob sees Alice's post (via his own read from R1 at t = 15, which returned replay_lsn = 0/D1000000). Bob's session merges, seen_lsn = 0/D1000000.
- t = 25 ms. Bob replies: "congrats!" (write W2 at LSN 0/D1000040). Bob's middleware passed his seen_lsn >= 0/D1000000 with the write, so the primary accepted W2 as causally-after W1. Bob's session updates to seen_lsn = 0/D1000040.
- t = 30 ms. R1 applies W2. replay_lsn = 0/D1000040.
- t = 50 ms. Carol reads the feed. Her session has seen_lsn = None (she has not interacted yet). Her request lands on R3 (Singapore), which is lagging: replay_lsn = 0/D0FFF000. R3 answers Carol's feed query. Carol sees neither Alice's post nor Bob's reply — consistent, because she has no causal constraint.
- t = 60 ms. Carol taps Bob's reply (via a push notification that included Bob's seen LSN). Her middleware merges: seen_lsn = 0/D1000040. Her request to load Bob's reply thread requires a replica at >= 0/D1000040. R3 does not qualify. R1 does.
- t = 65 ms. The router sends the thread query to R1. R1 answers with Bob's reply and Alice's original post, because R1 is caught up past both. Carol sees a complete, causally consistent thread.

No write blocked waiting for replicas. Async replication, full primary throughput, read scaling preserved. The only extra work is LSN plumbing.
:::
Common confusions
- "Eventual consistency is fine for everything." Fine for background aggregations, analytics, ML feature stores. Not fine for interactive UIs where a user has just acted and expects to see the effect.
- "Read-your-writes means strong consistency." No. RYW is per-session. A system with only RYW still ships the "reply without the original" bug for third-party readers.
- "Causal consistency is the same as linearisability." No. Causal respects happens-before; linearisability additionally respects real-time ordering of concurrent operations. Two sessions writing to unrelated keys concurrently can be seen in different orders under causal; under linearisability they cannot.
- "You need vector clocks to do causal." Only for multi-writer systems. Single-leader async-replicated Postgres has a total order on writes (the WAL), so a single LSN per session suffices. Vector clocks are the general solution for leaderless and multi-master systems.
- "Sticky sessions are the same as causal." No. Sticky sessions give monotonic reads within a session. They do not give RYW (the pinned replica may be behind the user's write) and they do not track cross-session dependencies.
- "These guarantees are free." They cost engineering — plumbing session IDs and tokens through every layer, replica-LSN tracking, middleware. They are free in write-path latency and disk I/O, which distinguishes them from consensus-based strong consistency.
Going deeper
MongoDB causal consistency and cluster time
MongoDB 3.6+ implements causal consistency via cluster time — a hybrid logical clock combining wall-clock seconds with a logical counter, signed so it cannot be forged. Every operation returns the cluster time at which it executed; causal sessions carry this as afterClusterTime. The replica requires its own applied time to be at or past afterClusterTime before answering. Exposed through the session's causalConsistency: true flag. Drivers plumb the token transparently. The implementation is precisely the LSN-tracking middleware pattern above, productised.
Spanner and external consistency via TrueTime
Google Spanner (OSDI 2012) achieves external consistency — linearisability applied to transactions — by bounding clock uncertainty rather than synchronising on each read. Every datacentre has GPS receivers and atomic clocks; the TrueTime API returns [earliest, latest] intervals bracketing true global time. Transactions commit at timestamp s and wait until TrueTime.earliest > s before releasing locks. The wait averages 7 ms in Google's deployment and gives global linearisability without per-read coordination. CockroachDB implements a software approximation using a configured MaxOffset; exceed it and correctness is silently lost.
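The commit-wait rule is mechanical: pick commit timestamp s = TT.now().latest, then hold the transaction until TT.now().earliest > s. A toy simulation with a fixed uncertainty bound (this is obviously not TrueTime itself, just the shape of the rule):

```python
import time

EPSILON = 0.004                        # toy clock-uncertainty bound, seconds

def tt_now():
    """Return (earliest, latest) bracketing true time, per the TrueTime API shape."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_with_wait():
    s = tt_now()[1]                    # commit timestamp: latest possible 'now'
    while tt_now()[0] <= s:            # commit wait: until earliest exceeds s
        time.sleep(EPSILON / 4)
    return s                           # s is now guaranteed to be in the past

start = time.time()
s = commit_with_wait()
elapsed = time.time() - start
assert tt_now()[0] > s                 # any later observer sees s as past
assert elapsed >= 2 * EPSILON - 0.001  # the wait costs about 2 * epsilon
```

The simulation makes the cost model visible: commit wait is roughly twice the uncertainty bound, which is why Spanner spends hardware (GPS, atomic clocks) to shrink epsilon.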
CRDTs — eventual consistency with automatic merging
Conflict-free replicated data types (Shapiro et al. 2011) design data types so that concurrent updates merge deterministically regardless of order. G-Counter merges by vector max; LWW-Register picks latest timestamp; sequence CRDTs (RGA, Yjs) use identifier trees. CRDTs give strong eventual consistency: two replicas that have received the same set of updates (in any order) converge to identical state. Different axis from session guarantees: CRDTs fix "what is the correct state after concurrent writes"; session guarantees fix "what does one user see across successive operations". Figma, Notion, and Google Docs combine both.
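The G-Counter is the smallest demonstration of strong eventual consistency: one slot per replica, merge by element-wise max, and replicas that have received the same updates agree regardless of delivery order. A minimal sketch:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge by element-wise max."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, n=1):
        # Each replica only ever advances its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Commutative, associative, idempotent: order of syncs cannot matter.
        for r, c in other.counts.items():
            self.counts[r] = max(self.counts.get(r, 0), c)

# Two replicas take increments concurrently, then sync in either order.
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5     # same update set, identical state
```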
Where this leads next
The three rungs: read-your-writes, monotonic reads, causal. Each has a textbook definition, a 30-line middleware, a known failure mode, a clear cost model. The jump from causal to sequential or linearisable crosses into consensus territory, where writes get expensive again.
- Chapter 74: Single-leader limits. The primary is a write bottleneck and a single coordination point; at multi-region, hot-partition, or geographic-failover scale, single-leader stops working. Sets up Build 10.
- Build 10: Leaderless replication. Cassandra, Dynamo, Riak. Quorum reads and writes. Vector clocks as first-class citizens because no single LSN exists. Read repair, hinted handoff. The CAP trilemma becomes a daily operational knob.
The middle rungs are where most production systems live. A working engineer should name each rung, identify which an app needs per operation class, and implement the middleware in an afternoon.
References
- Bailis et al., Bolt-On Causal Consistency, SIGMOD 2013 — the definitive demonstration that causal consistency can be layered as a client-side shim above an eventually-consistent store, with the dependency-tracking shim architecture that underlies every session-guarantee middleware since.
- MongoDB documentation, Causal Consistency and Read and Write Concerns — a productised implementation of session-level causal consistency, with the afterClusterTime token, readConcern: 'majority', and transparent driver plumbing that together hide all the LSN-tracking from application code.
- Corbett et al., Spanner: Google's Globally-Distributed Database, OSDI 2012 — Spanner's TrueTime-backed external consistency is the definitive example of a level stronger than causal, achieved by bounding clock uncertainty rather than paying per-read coordination.
- PostgreSQL documentation, Chapter 27: High Availability, Load Balancing, and Replication — the operational reference for pg_current_wal_lsn(), pg_last_wal_replay_lsn(), pg_wal_replay_wait(), and pg_stat_replication, which are the primitives an LSN-tracking session middleware needs.
- Kleppmann, Designing Data-Intensive Applications, O'Reilly 2017, chapter 9 — the clearest book-length prose on the consistency ladder, with especially good treatments of the difference between causal consistency and linearisability, and why the latter is much more expensive than the former.
- Lamport, Time, Clocks, and the Ordering of Events in a Distributed System, CACM 1978 — the foundational paper defining the happens-before relation that underlies every definition of causal consistency, and introducing logical clocks as the first dependency-tracking mechanism.