In short

Between "eventual consistency" (the async-replication default, under which a read may observe any state from any point in the past and two successive reads may even contradict each other) and "linearisability" (every read returns the globally most recent write, as if the whole system ran on one machine) lies a ladder of intermediate consistency guarantees that cost far less than linearisability and fix almost every user-visible anomaly.

Read-your-writes (RYW) says: within a session, a read issued after a write sees that write. User posts a photo, reloads, sees the photo. Nothing more, nothing less. It does not promise that other users see your photo immediately; only that you do.

Monotonic reads says: within a session, successive reads move forward in time, never backward. If you saw a comment on page load 1, you still see it on page load 2. The replica's view, from your session's perspective, is a non-decreasing function of time.

Causal consistency says: if write B happened-after write A — a reply to a post, an edit to a document someone else created, a tag added to a photo someone else uploaded — then any reader who sees B also sees A. Causality is preserved across sessions; concurrent, unrelated writes may still be seen in different orders on different replicas.

Each guarantee is strictly weaker than linearisability and strictly stronger than eventual. Each is implementable on top of ordinary async log shipping — from ch.67 — with a session-scoped LSN cache, a sticky session pin, or a vector-clock-shaped dependency set. The cost is a few dozen bytes of per-session state and, occasionally, a short wait while a chosen replica catches up. You pay nothing on the write path; you add no synchronous replication hop; you keep the read-scaling benefit of replicas. This chapter is about how the three rungs are defined, how to implement each in real middleware, where each one fails, and where the ladder leads past linearisability into the CAP-constrained world of ch.74 and Build 10.

A user in Pune hits three bugs in ninety seconds, none a bug in any individual component.

First. They post a photo. The toast appears. They tap their profile — no photo. Reload, reload, reload. On the fourth, the photo appears. Primary committed; the replica their read hit had not yet applied the WAL record. Engineers call this replica lag; users call it "broken".

Second. They read a friend's birthday thread — top comment "happy bday bro". They come back via the notification bell; top comment is gone. Reload and it reappears. Two reads, two different replicas, reality running backwards.

Third. They see a reply — "totally agreed". They tap through to the parent — "no such post". Alice posted, Bob replied, and the replica serving Bob's reply had applied Bob's write but not Alice's.

Three distinct anomalies, three distinct fixes. You do not need linearisability to stop any of these bugs; you need the three rungs of the session-consistency ladder.

The consistency ladder — where these sit

Consistency models form a partial order; stronger ones imply weaker. If a system is linearisable it is also causal, monotonic, and read-your-writes; the reverse is not true. The useful portion of the lattice:

[Figure: The session-consistency ladder. A vertical ladder with six rungs, weakest at the bottom: eventual consistency (plain async, free), read-your-writes (session LSN token), monotonic reads (sticky session or max-LSN), causal consistency (seen-set / vector clock), sequential consistency (consensus, no wall-clock), linearisability (consensus plus real-time ordering).]
The consistency ladder from eventual (free, weakest) to linearisable (expensive, strongest). The three middle rungs — read-your-writes, monotonic reads, causal — are the subject of this chapter. Each is implementable on top of async replication with small per-session state; none of them requires consensus or synchronous writes. The jump from causal to sequential is where you must pay consensus costs.

Why this ladder matters operationally: most user-facing anomalies that get filed as bugs are violations of one of the three middle rungs. You do not need to pay for linearisability to fix them. You pay for session-scoped state (a few bytes), a pinning decision (a cookie), or a replica-LSN lookup (a cache). The CAP theorem's famous cost — giving up either consistency or availability under a partition — applies to linearisability, not to the middle rungs. Causal consistency is available under partitions; you can have both causal and available on the same system, a fact usually called "the causal escape from CAP" after Bailis et al. 2013.

Read-your-writes — implementation

Read-your-writes is the smallest useful strengthening of eventual, and the single guarantee that fixes the most-filed class of user-reported bugs. Three implementation strategies, in ascending order of principled-ness.

Strategy A: read-from-primary window. The cheapest. On every write, record the wall-clock timestamp. For some window W after that timestamp (typically 2-4× the p99 replica lag), route that session's reads to the primary instead of any replica. After W expires, return to the replica pool. Covered in ch.70 in detail. Coarse but effective; works without any LSN plumbing.
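The window approach fits in a few lines. A minimal sketch; `WindowRouter` and the 2-second `WINDOW` are illustrative names and values, and a real router would derive the window from measured p99 replica lag:

```python
import time

WINDOW = 2.0  # seconds; illustrative — tune to 2-4x observed p99 replica lag

class WindowRouter:
    def __init__(self):
        self.last_write_at = {}   # session_id -> wall-clock time of last write

    def record_write(self, session_id, now=None):
        self.last_write_at[session_id] = now if now is not None else time.monotonic()

    def route_read(self, session_id, now=None):
        now = now if now is not None else time.monotonic()
        last = self.last_write_at.get(session_id)
        if last is not None and now - last < WINDOW:
            return "primary"    # recent writer: replicas may not have the write yet
        return "replica"        # window expired: back to the read-scaled pool
```

The `now` parameter exists only so the clock can be injected; production code would use `time.monotonic()` directly.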

Strategy B: session LSN token. The principled answer. On every write, capture the primary's WAL LSN after commit (Postgres: pg_current_wal_lsn() called once the transaction has committed, so the captured LSN covers the commit record itself, or the commit_lsn returned by the replication protocol). Store the LSN in the user's session. On every subsequent read, consult a cache of per-replica replay_lsn values; pick a replica whose replay_lsn >= session_lsn. If none exists, wait briefly (Postgres 17 adds pg_wal_replay_wait()) or route to the primary as a fallback.

# ryw_session.py
import psycopg2

def parse_lsn(lsn):
    # Postgres pg_lsn values look like '16/B374D848'; compare numerically,
    # never lexicographically ('9/...' would sort above '10/...').
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

class RYWRouter:
    def __init__(self, primary_conn, replica_conns, replica_lsns):
        self.primary = primary_conn
        self.replicas = replica_conns          # dict: replica_id -> connection
        self.replica_lsns = replica_lsns       # dict: replica_id -> latest replay_lsn
        self.sessions = {}                     # session_id -> last_commit_lsn

    def write(self, session_id, sql, params=()):
        cur = self.primary.cursor()
        cur.execute(sql, params)
        self.primary.commit()
        # Capture the LSN *after* commit so it covers the commit record;
        # a replica at or past this LSN is guaranteed to show the committed row.
        cur.execute("SELECT pg_current_wal_lsn()")
        lsn = cur.fetchone()[0]
        self.primary.commit()                  # close the read-only transaction
        self.sessions[session_id] = lsn
        return lsn

    def read(self, session_id, sql, params=()):
        needed = self.sessions.get(session_id)
        if needed is None:
            return self._exec(next(iter(self.replicas.values())), sql, params)
        for r_id, r_lsn in self.replica_lsns.items():
            if self._lsn_ge(r_lsn, needed):
                return self._exec(self.replicas[r_id], sql, params)
        return self._exec(self.primary, sql, params)  # fallback

    def _exec(self, conn, sql, params):
        cur = conn.cursor()
        cur.execute(sql, params)
        return cur.fetchall()

    def _lsn_ge(self, a, b):
        return parse_lsn(a) >= parse_lsn(b)

Why the LSN token approach is strictly better than the time window: the window says "recent writers cannot use replicas for W seconds", which is both stricter than necessary (most replicas have caught up well within W) and looser than necessary (if a replica lags beyond W you have silently lost RYW). The LSN token routes each read to the exact replica that is known to satisfy causality for this read. The primary absorbs only the tiny tail of reads for which no replica has caught up yet. Read-scaling is preserved at full effectiveness; correctness is not heuristic.

Strategy C: replica wait. A variant of B: tell the replica "wait until your replay_lsn reaches X, then answer, or fail after 100 ms". Postgres's pg_wal_replay_wait() is this. Trades a bounded read-path wait for keeping load off the primary.

MongoDB productises strategy B as readConcern: 'majority' + afterClusterTime: <token>. The driver plumbs the token; you enable it with causalConsistency: true on the session. Transparent to application code.

Monotonic reads — implementation

Monotonic reads is about successive reads, not the first read after a write. The guarantee: if a session reads value V at time t1 and reads the same key at time t2 > t1, the read at t2 sees a state at least as recent as V. Never backward, possibly forward. The user's reality time-travels only in the positive direction.

Two implementation strategies.

Strategy A: sticky sessions. Pin a given session to a single replica — typically via a cookie the load balancer hashes on. All reads from that session hit the same replica. That replica's replay_lsn is monotonically increasing (apply never regresses), so the observed state is monotonically increasing too. Dead simple, and a common default in load-balancer configuration (HAProxy: cookie SRV insert indirect; nginx: sticky cookie; AWS ALB: "stickiness" toggle).
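As a concrete sketch of the HAProxy variant: the `cookie ... insert indirect` pair below pins an HTTP session to one backend server (which in turn reads from one replica). Backend name, server names, and addresses are illustrative, not from any particular deployment:

```
backend api_replicas
    mode http
    balance roundrobin
    cookie SRV insert indirect nocache
    server app_r1 10.0.0.11:8000 check cookie app_r1
    server app_r2 10.0.0.12:8000 check cookie app_r2
```

`insert` sets the cookie on the first response, `indirect` strips it before forwarding upstream, and `nocache` keeps shared caches from serving a pinned response to other users.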

Strategy B: session-tracked max LSN. Every read answer returns the replica's replay_lsn at the moment of answer. The session remembers the maximum LSN it has ever observed (write-committed or read-returned). Subsequent reads carry that max LSN as a token and require a replica at or past it. Stronger than sticky sessions because it survives replica failover — if the pinned replica dies and the session re-pins, the new replica must also be caught up to the max observed LSN.

# monotonic_session.py
class MonotonicSession:
    def __init__(self, router):
        self.router = router
        self.high_water_lsn = None

    def write(self, sql, params=()):
        lsn = self.router.write_primary(sql, params)
        self._advance(lsn)
        return lsn

    def read(self, sql, params=()):
        replica = self.router.pick_replica_at_or_past(self.high_water_lsn)
        rows, replica_lsn = replica.execute_and_report_lsn(sql, params)
        self._advance(replica_lsn)
        return rows

    def _advance(self, lsn):
        if self.high_water_lsn is None or lsn > self.high_water_lsn:
            self.high_water_lsn = lsn

Why strategy B subsumes sticky sessions: sticky sessions give monotonic reads as a consequence of always hitting the same replica, but they break on replica failure — when the session re-pins to a new replica, the new replica's replay_lsn might be behind the old one's. The user sees time run backwards at exactly the moment the infrastructure was trying to recover. Strategy B passes the max observed LSN with every read; the new replica must catch up before it can answer. The cost is an extra LSN comparison per read, which is free, and occasionally a brief wait, which is rare.

Cost of sticky sessions. The load balancer loses some ability to rebalance. If one replica gets hot-spotted by a handful of heavy sessions, the LB cannot move them to a cooler replica without breaking monotonicity. In practice, session lifetimes are short enough (minutes) that imbalance averages out.

Causal consistency — implementation

Causal is the most interesting of the three. It is also the weakest guarantee that cleanly handles cross-session scenarios: the "reply exists, original does not" bug from the opening hook.

The definition: operations are partially ordered by the happens-before relation. Operation A happens-before B if (i) they are in the same session and A precedes B in program order, (ii) B reads a value written by A, or (iii) there exists some C such that A happens-before C happens-before B (transitive closure). A causal-consistent system guarantees that if A happens-before B, every session that observes B also observes A.
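The transitive-closure clause can be made concrete in a few lines. A sketch, with `causal_predecessors` and the edge set as illustrative names: given direct happens-before edges (program order and read-from), any session observing an operation must also observe every transitive predecessor.

```python
def causal_predecessors(edges, op):
    """edges: dict mapping op -> set of ops that directly happen-before it.
    Returns the full set of transitive predecessors of op."""
    seen, stack = set(), [op]
    while stack:
        cur = stack.pop()
        for dep in edges.get(cur, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# Alice posts (A); Bob reads it and replies (B); Carol reads the reply (C).
edges = {"B": {"A"}, "C": {"B"}}
# Any session observing C must also observe both B and A.
```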

The implementation pattern: each session maintains a set of "writes I have observed so far" — call it the seen set. On every write, the session adds the write's identifier to the seen set. On every read, the session adds to the seen set all writes the read-result depends on (as reported by the replica). On every subsequent read, the session sends the seen set with the read; the replica must have applied every write in the seen set before it can answer.

In full generality, seen sets are tracked as vector clocks — one counter per writer, so the system can distinguish concurrent writes from causally ordered ones. For single-database systems (one logical primary, replicas following one WAL stream), a single LSN suffices as the seen-set summary, because one WAL stream gives a total order on all writes.
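For the multi-writer case, a seen-set vector clock needs exactly two operations: pointwise max (merging two sessions' views) and pointwise comparison (can this replica serve this session?). A sketch with illustrative names:

```python
def vc_merge(a, b):
    """Pointwise max: the seen-set union of two vector clocks."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def vc_dominates(a, b):
    """True if a has seen everything b has seen (b <= a pointwise)."""
    return all(a.get(n, 0) >= c for n, c in b.items())

# A replica that has applied 5 writes from node n1 and 3 from n2 can serve
# a session whose seen set is {n1: 4, n2: 3} — but not {n1: 6}.
replica_applied = {"n1": 5, "n2": 3}
session_seen    = {"n1": 4, "n2": 3}
```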

# causal_session.py
class CausalSession:
    def __init__(self, router, session_id):
        self.router = router
        self.session_id = session_id
        self.seen_lsn = None                  # monotonic

    def write(self, sql, params=()):
        lsn = self.router.write_primary(sql, params, after_lsn=self.seen_lsn)
        self._update(lsn)
        return lsn

    def read(self, sql, params=()):
        replica = self.router.pick_replica_at_or_past(self.seen_lsn)
        rows, replica_lsn = replica.execute_and_report_lsn(sql, params)
        self._update(replica_lsn)
        return rows

    def merge_from_peer(self, peer_seen_lsn):   # e.g. Alice reads Bob's reply
        if peer_seen_lsn is not None:
            self._update(peer_seen_lsn)

    def _update(self, lsn):
        if self.seen_lsn is None or lsn > self.seen_lsn:
            self.seen_lsn = lsn

The merge_from_peer call is how cross-session causality enters: when session A observes a value produced by session B, A's seen set must expand to include B's seen set. In an HTTP/API context, B's seen LSN travels in an HTTP header or in the payload of whatever A just read; A's middleware picks it up and merges.

Why a single LSN suffices on a single-leader Postgres or MySQL: the primary's WAL is a totally ordered log. Every write lands at a unique LSN, and LSNs are monotonic. "Has replica R applied write W?" is equivalent to "is R's replay_lsn >= W's LSN?". The seen set collapses to a single scalar. Vector clocks are only needed when writes originate at multiple nodes — leaderless systems (ch.75), multi-master replication, Dynamo-style quorums — where no single total order exists and "apply all writes causally before X" requires per-writer tracking.

Wait vs route. When the replica is behind the session's seen LSN, the router has two choices: wait for the replica to catch up (bounded timeout), or route to a different replica that is already caught up (or to the primary). The trade-off is latency (waiting adds milliseconds) vs load (routing more shifts traffic to the primary). Production systems typically wait up to 50-100 ms, then fall back.
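The wait-then-fall-back policy can be sketched as a bounded polling loop. The replica-LSN snapshot is injected as a callable so the timing knobs stay visible; the function name and defaults are illustrative:

```python
import time

def read_with_bounded_wait(replica_lsns, target_lsn, wait_ms=100, poll_ms=10,
                           clock=time.monotonic, sleep=time.sleep):
    """Return a replica id whose LSN is at or past target_lsn, waiting up to
    wait_ms. None means 'no replica caught up: fall back to the primary'."""
    deadline = clock() + wait_ms / 1000.0
    while True:
        for r_id, lsn in replica_lsns().items():
            if lsn >= target_lsn:
                return r_id
        if clock() >= deadline:
            return None          # timed out: route to primary instead
        sleep(poll_ms / 1000.0)
```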

A Python middleware implementation

Tie the three guarantees together into a single middleware. The interface: an application passes a session ID on every call, reads and writes are routed automatically, and the guarantee level is selectable per request.

# consistency_middleware.py
from enum import Enum

class Level(Enum):
    EVENTUAL = 0
    RYW = 1
    MONOTONIC = 2
    CAUSAL = 3

class ConsistencyMiddleware:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas               # dict: id -> Replica
        self.session_lsn = {}                  # session_id -> lsn

    def write(self, session_id, sql, params=()):
        lsn = self.primary.execute(sql, params)
        cur = self.session_lsn.get(session_id, 0)
        self.session_lsn[session_id] = max(cur, lsn)
        return lsn

    def read(self, session_id, sql, params=(), level=Level.RYW):
        needed = self.session_lsn.get(session_id, 0) if level != Level.EVENTUAL else 0
        r = self._find_replica_at_lsn(needed)
        if r is None:
            rows, lsn = self.primary.execute_and_report(sql, params)
        else:
            rows, lsn = r.execute_and_report(sql, params)
        if level in (Level.MONOTONIC, Level.CAUSAL):
            cur = self.session_lsn.get(session_id, 0)
            self.session_lsn[session_id] = max(cur, lsn)
        return rows

    def _find_replica_at_lsn(self, target):
        for r in self.replicas.values():
            if r.replay_lsn >= target:
                return r
        return None

Thirty lines, all three guarantees. The differences: EVENTUAL ignores the session token entirely; RYW consults it before a read but does not advance it afterwards; MONOTONIC and CAUSAL additionally advance it on every read; CAUSAL further requires the cross-session merge path (merge_from_peer above), which this minimal middleware omits.

The real-world version adds: a replica-LSN refresher polling pg_stat_replication every 500 ms, a bounded wait on catch-up (not immediate fallback to primary), per-replica health checks, a circuit breaker on primary fallback, and metrics for "read fell back to primary" rate — which tells you whether your p99 replica lag is within your session SLA or not.
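The refresher piece can be sketched as a small polling loop. `LSNRefresher` is illustrative, and `fetch_lsns` stands in for a query against pg_stat_replication's replay_lsn column:

```python
import threading

class LSNRefresher:
    def __init__(self, fetch_lsns, interval_s=0.5):
        self.fetch_lsns = fetch_lsns   # callable returning {replica_id: lsn}
        self.interval_s = interval_s
        self.replica_lsns = {}         # last known-good snapshot
        self._stop = threading.Event()

    def refresh_once(self):
        try:
            self.replica_lsns = dict(self.fetch_lsns())
        except Exception:
            pass                       # query failed: keep the stale snapshot

    def run(self):                     # target for a daemon thread
        while not self._stop.wait(self.interval_s):
            self.refresh_once()

    def stop(self):
        self._stop.set()
```

Keeping the last known-good snapshot on query failure is a deliberate choice: a stale snapshot only makes routing more conservative (more primary fallbacks), never incorrect.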

Cost of these guarantees

RYW. Per-session state: one LSN (8 bytes). Storage: session store with TTL matching session lifetime. Extra read latency: a hash lookup in the replica-LSN cache. Extra primary load: the tail of reads where no replica is caught up, under 1% with healthy lag. No extra write latency.

Monotonic reads. Sticky sessions: LB constraint plus a cookie. Session-tracked max LSN: same as RYW plus updating the LSN on every read. Extra latency negligible; storage one LSN per session.

Causal. Single-leader: same as monotonic reads, plus a cross-session merge path. Multi-leader or leaderless: vector clocks scaling with number of writers.

The most material cost across all three is engineering: plumbing session IDs and tokens through every layer. Productised systems (MongoDB, Cosmos DB) make this transparent; hand-rolled middleware does not.

Where these guarantees fail

RYW fails if the session identifier is not preserved across requests. The serverless-in-the-dark-ages problem: short-lived Lambda containers, stateless edge functions, client-side routing without session cookies. If request-2's middleware cannot identify that it is the same session as request-1, it cannot read the stored LSN, and RYW degrades to eventual. The fix — stable session IDs in cookies or auth tokens — is trivial but silently regresses during a refactor. Monitor the "session LSN missing" rate.

Monotonic reads fails if the sticky replica dies and the session is re-routed to a lagging replica. Strategy B (session max LSN) survives this because the new replica is required to be caught up, but has its own failure: if no replica is caught up within the wait bound, the request falls back to the primary or fails.

Causal fails if writer and reader do not share a session-tracking mechanism. If Alice writes a post and Bob reads it, but Bob's middleware has no way to pick up Alice's seen LSN from the response, Bob's subsequent reads can see a reply-to-Alice without Alice. The fix is to include the seen LSN in response headers and have client-side middleware merge it.

All three fail across service boundaries without explicit propagation. A session may touch three databases through three services, each with its own middleware. A write in service A followed by a read in service B gets no RYW unless A propagates its LSN to B. W3C trace context headers are the usual carrier.
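Propagation itself is two small functions per service: inject the token into outgoing requests, merge it from incoming ones. A sketch; the header name x-session-lsn is an assumption, not a standard — any stable name works if every service's middleware agrees on it:

```python
HEADER = "x-session-lsn"   # illustrative header name; not a standard

def inject_token(headers, seen_lsn):
    """Attach the session's seen LSN to an outgoing request."""
    if seen_lsn is not None:
        headers[HEADER] = str(seen_lsn)
    return headers

def merge_token(headers, seen_lsn):
    """Fold an incoming header into the local session's seen LSN (take max)."""
    incoming = headers.get(HEADER)
    if incoming is None:
        return seen_lsn
    incoming = int(incoming)
    return incoming if seen_lsn is None or incoming > seen_lsn else seen_lsn
```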

Comparison with stronger levels

Distinguish the session guarantees here from the database-level isolation levels of Build 7.

Linearisability. Every read returns the most recent write globally, respecting real-time ordering. Requires synchronous replication or consensus (Raft, Paxos). Etcd and ZooKeeper are linearisable; production Postgres with async replicas is not.

Serialisability (isolation level). A property of concurrent transactions: the outcome is equivalent to some serial order. Per-transaction, database-level, not session-level. Snapshot Isolation (ch.47) and SSI (ch.51) are isolation levels; they say nothing about what successive transactions in the same session observe across replicas.

External consistency. Spanner's term: linearisability applied to transactions. Requires TrueTime's bounded-clock machinery or equivalent.

The session guarantees here are orthogonal to isolation levels. You can have SI + RYW, or SSI + causal, or read-committed + monotonic reads — the dimensions compose. An application chooses both: the isolation level for concurrent transactions, and the session guarantee for user-perceived cross-request consistency.

:::example Three users, three replicas, one scenario

Your social app runs on Postgres: one primary in Mumbai, three async replicas (R1, R2, R3) in Mumbai, Hyderabad, and Singapore. Three users — Alice, Bob, Carol — interact in the next 200 ms.

[Figure: Causal session tracking across three users and three replicas. Timeline from t=0 to t=70 ms with swim-lanes for Alice, Bob, Carol, and replicas R1, R2, R3. Alice's write W1 at t=0 reaches R1 by t=5 and R2 by t=12. Bob reads from R1 at t=15 (R1 has W1), then writes W2 at t=25 carrying seen_lsn = W1's LSN; R1 applies W2 by t=30. Carol first reads from R3 at t=50 with no seen LSN (legitimately eventual — sees nothing). At t=60 she merges Bob's seen LSN via a notification, and her read at t=65 is routed to R1, which has both W1 and W2 applied. LSN bars advance per replica.]
Causal consistency across sessions. Alice writes, Bob reads and replies, Carol reads both Alice's and Bob's writes — but Carol's first read has no causal dependency and legitimately observes a lagging replica. Only when Carol's session merges Bob's seen LSN (via the notification payload) does her subsequent read require catch-up. The three replicas serve different causal-consistency needs at different moments; no read is unnecessarily routed to the primary.

No write blocked waiting for replicas. Async replication, full primary throughput, read scaling preserved. The only extra work is LSN plumbing. :::

Going deeper

MongoDB causal consistency and cluster time

MongoDB 3.6+ implements causal consistency via cluster time — a hybrid logical clock combining wall-clock seconds with a logical counter, signed so it cannot be forged. Every operation returns the cluster time at which it executed; causal sessions carry this as afterClusterTime. The replica requires its own applied time to be at or past afterClusterTime before answering. Exposed through the session's causalConsistency: true flag. Drivers plumb the token transparently. The implementation is precisely the LSN-tracking middleware pattern above, productised.

Spanner and external consistency via TrueTime

Google Spanner (OSDI 2012) achieves external consistency — linearisability applied to transactions — by bounding clock uncertainty rather than synchronising on each read. Every datacentre has GPS receivers and atomic clocks; the TrueTime API returns [earliest, latest] intervals bracketing true global time. Transactions commit at timestamp s and wait until TrueTime.earliest > s before releasing locks. The wait averages 7 ms in Google's deployment and gives global linearisability without per-read coordination. CockroachDB implements a software approximation using a configured MaxOffset; if real clock skew exceeds that bound, consistency violations can pass undetected.

CRDTs — eventual consistency with automatic merging

Conflict-free replicated data types (Shapiro et al. 2011) design data types so that concurrent updates merge deterministically regardless of order. G-Counter merges by vector max; LWW-Register picks latest timestamp; sequence CRDTs (RGA, Yjs) use identifier trees. CRDTs give strong eventual consistency: two replicas that have received the same set of updates (in any order) converge to identical state. Different axis from session guarantees: CRDTs fix "what is the correct state after concurrent writes"; session guarantees fix "what does one user see across successive operations". Figma, Notion, and Google Docs combine both.
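The G-Counter mentioned above fits in a dozen lines. An illustrative sketch of the state-based form: one monotonically increasing slot per replica, merged by pointwise max, so merges in any order converge.

```python
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.slots = {}                # node_id -> local increment count

    def increment(self, n=1):
        self.slots[self.node_id] = self.slots.get(self.node_id, 0) + n

    def value(self):
        return sum(self.slots.values())

    def merge(self, other):
        """Pointwise max; commutative, associative, idempotent."""
        for node, count in other.slots.items():
            self.slots[node] = max(self.slots.get(node, 0), count)

# Two replicas increment concurrently; merging in either order converges.
```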

Where this leads next

The three rungs: read-your-writes, monotonic reads, causal. Each has a textbook definition, a 30-line middleware, a known failure mode, a clear cost model. The jump from causal to sequential or linearisable crosses into consensus territory, where writes get expensive again.

The middle rungs are where most production systems live. A working engineer should name each rung, identify which an app needs per operation class, and implement the middleware in an afternoon.

References

  1. Bailis et al., Bolt-On Causal Consistency, SIGMOD 2013 — the definitive demonstration that causal consistency can be layered as a client-side shim above an eventually-consistent store, with the dependency-tracking shim architecture that underlies every session-guarantee middleware since.
  2. MongoDB documentation, Causal Consistency and Read and Write Concerns — a productised implementation of session-level causal consistency, with the afterClusterTime token, readConcern: 'majority', and transparent driver plumbing that together hide all the LSN-tracking from application code.
  3. Corbett et al., Spanner: Google's Globally-Distributed Database, OSDI 2012 — Spanner's TrueTime-backed external consistency is the definitive example of a level stronger than causal, achieved by bounding clock uncertainty rather than paying per-read coordination.
  4. PostgreSQL documentation, Chapter 27: High Availability, Load Balancing, and Replication — the operational reference for pg_current_wal_lsn(), pg_last_wal_replay_lsn(), pg_wal_replay_wait(), and pg_stat_replication, which are the primitives an LSN-tracking session middleware needs.
  5. Kleppmann, Designing Data-Intensive Applications, O'Reilly 2017, chapter 9 — the clearest book-length prose on the consistency ladder, with especially good treatments of the difference between causal consistency and linearisability, and why the latter is much more expensive than the former.
  6. Lamport, Time, Clocks, and the Ordering of Events in a Distributed System, CACM 1978 — the foundational paper defining the happens-before relation that underlies every definition of causal consistency, and introducing logical clocks as the first dependency-tracking mechanism.