Replica Lag and the Read-Your-Writes Problem

Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

In short

Replica lag is the distance, in bytes of unshipped WAL or in wall-clock seconds, between what the primary has committed and what a replica has durably applied. Under asynchronous log shipping — which is the default for Postgres, MySQL, and essentially every single-leader database in production — lag is always non-zero. A replica that shows zero lag is either receiving no writes or measuring its lag wrong. In a well-provisioned intra-region setup the steady-state number is 1-100 ms; under bulk writes, long replica queries, saturated networks, or replica disk pressure it climbs to seconds; during a disconnection it grows without bound until the replica reconnects.

The user-visible consequence of this innocuous-sounding number is the read-your-writes anomaly. A user submits a POST to the primary, the primary commits, returns OK, the browser redirects, the redirect's GET arrives 30 ms later at your load balancer, the load balancer hashes it to a replica that has not yet applied the WAL record for the POST, and the query answers as if the post does not exist. The user sees a blank feed. They reload. They reload again. Four or five seconds pass; the replica finally catches up; their post appears. In the interim, they have decided your product is broken.

Four standard mitigations exist, and every serious deployment picks at least one. (a) Route reads to the primary for a short window after each write — simple, coarse, works. (b) Pin one user's session to one replica, so at least their reads are self-consistent even if stale versus the primary. (c) Have the client carry a causal-consistency token — the commit LSN the primary returned — and refuse to read from a replica whose applied LSN is below it. (d) Write through a local cache on every write and let reads hit the cache first until the replica catches up. Stronger per-session consistency — monotonic reads, linearisable reads — is covered in ch.73.

A user in Kochi posts to their feed. The spinner goes away; the "posted" toast appears. They reload the page. The feed is blank. They reload again — still nothing. Twenty seconds later, on the fourth reload, the post appears. From the user's perspective your application lost their post, then un-lost it.

Nothing was lost. The primary committed the insert, wrote the WAL record, acknowledged the client. The read at reload-time went to a replica 300 ms behind the primary — enough to miss the write by five transactions. The replica was doing its job; the primary was doing its job; the router was doing its job. The system produced an answer the user experienced as a bug, because nobody made sure the user's read saw the user's own write.

This chapter dissects that pipeline — what lag is formally, why it happens, what the read-your-writes anomaly looks like in detail, and the four engineering patterns that make it invisible.

What replica lag is, formally

Lag has two customary definitions, and both are useful.

Byte lag — the WAL-byte distance between the primary's current write position and the replica's applied position. In Postgres:

-- Run on the primary
SELECT client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS byte_lag
  FROM pg_stat_replication;

pg_stat_replication has one row per connected replica with four LSN columns: sent_lsn, write_lsn, flush_lsn, replay_lsn (how far the replica's startup process has actually applied). Byte lag is usually computed against replay_lsn, because that is the position a replica read would see.

Time lag — the wall-clock difference between when the primary committed the most recent transaction and when the replica applied it. On the replica itself:

SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS seconds_lag;

pg_last_xact_replay_timestamp() returns the commit time (from the primary's clock) of the last transaction the replica has applied. Subtract it from the replica's current wall clock and you get seconds of apply lag.

Why byte lag is the more reliable alarm signal: if the primary is briefly idle — no new transactions for 10 seconds — every replica's pg_last_xact_replay_timestamp() freezes at the timestamp of the last applied transaction, and time lag looks like it is climbing even though nothing is wrong. Byte lag reports zero in this case because replay_lsn has caught up to pg_current_wal_lsn(). Time lag is what humans want to reason about ("how far back in time is the replica?"); byte lag is what monitoring should actually page on.

MySQL exposes SHOW REPLICA STATUS with a Seconds_Behind_Master field. Its semantics are famously subtle: it measures the gap between when the current replaying event was originally written on the primary and the replica's wall clock. If both primary and replica are idle, it reports 0. If the replica's clock is skewed from the primary's, the number is meaningless. Prefer GTID-based lag (Retrieved_Gtid_Set vs Executed_Gtid_Set) for anything operationally load-bearing.

Why lag happens

Lag has four common causes, and a real replica is usually experiencing some combination.

Network round-trip. The walsender pushes a WAL record over TCP; the walreceiver gets it half an RTT later. Intra-AZ this is 0.2-2 ms. Cross-region, 40-180 ms. A floor, not a variable — physics.

Replica apply throughput. Traditional Postgres physical replication applies WAL serially on a single startup process. The primary has 32 cores generating WAL in parallel; the replica applies through one. A workload that saturates the primary at 40% CPU can push the replica's apply thread to 100% and lag grows linearly with write rate. MySQL addressed this in 5.7 with multi-threaded applier workers; Postgres is still single-threaded for physical replication as of 16.

Replica disk I/O. The replica fsyncs every WAL record it applies. If the replica sits on cheaper storage than the primary — a common cost-saving that bites later — its fsync latency can be 3-10× the primary's, and apply lag grows.

Bandwidth saturation during bursts. A COPY 10GB bulk load writes ~10 GB of WAL in minutes. On a 1 Gbps cross-region link, shipping 10 GB takes at least 80 seconds, and normal WAL traffic queues behind it. Lag climbs to tens of seconds and recovers only after the burst ends.

Why a single slow query on the replica can cause lag across the entire instance: Postgres pauses WAL apply if applying would invalidate a tuple that an active read query on the replica is depending on. This is governed by hot_standby_feedback (replica tells primary about its active snapshots, primary holds off vacuum) and max_standby_streaming_delay (replica bounds how long it will pause apply before cancelling the blocking query). The default max_standby_streaming_delay = 30s means one long-running analytics query on your replica can create 30 seconds of lag for every other reader on the same replica.

The read-your-writes anomaly — in detail

Walk the exact timeline of the anomaly.

t = 0 ms. User submits POST /post with body "chai is superior to coffee". Request hits the load balancer, routed to an app server, app server opens a write transaction to the primary, inserts the row, commits. Primary writes the WAL record, assigns LSN 0/A8F2C110, fsyncs, returns COMMIT OK to the app server. App server returns HTTP 302 redirect to /feed.
t = 2 ms. Primary's walsender picks up the new WAL record and starts transmission to replicas.
t = 4 ms. Replica A (same AZ, 1 ms RTT) receives the record into its walreceiver, writes to its local WAL.
t = 5 ms. Replica A's startup process applies the record; its replay_lsn advances to 0/A8F2C110.
t = 30 ms. User's browser acts on the 302, sends GET /feed.
t = 32 ms. Load balancer receives the GET, routes to an app server.
t = 33 ms. App server opens a read connection to "any replica" and the pool picks Replica B (different AZ, 8 ms RTT).
t = 33 ms. Replica B's replay_lsn at this moment is 0/A8F2A800 — before the user's commit.
t = 34 ms. App server issues SELECT * FROM posts WHERE user_id = 42 ORDER BY created_at DESC LIMIT 20.
t = 35 ms. Replica B answers with 19 posts. The user's new post is not among them.
t = 36 ms. User sees their own feed, minus their own post, and concludes the submission failed.

The read-your-writes anomaly. The POST commits on the primary at t=2 ms; the WAL record is in flight to Replica B from t=3 ms and applied at t=11 ms. But the GET arrives on Replica B at t=33 ms — well after the apply — in this idealised picture. In practice, Replica B's apply is often further behind, the GET routes to whichever replica the load balancer picks, and some non-trivial fraction of first-reads-after-write hit a replica before the apply completes. The fix is not "make lag zero" but "do not route the recent writer's read to an arbitrary replica".

This is not a bug in any individual component. Primary, replica, load balancer, and query all behaved correctly. The system produced the wrong user experience because no layer knew that this particular read needed to see this particular write.

Mitigation 1 — read from primary after own writes

The cheapest and coarsest mitigation. The application tracks, per user, the wall-clock timestamp of that user's last write. For some window W after the write (5 seconds is a typical default), all reads from that user route to the primary instead of any replica. After W, reads return to the replica pool.

# app/read_routing.py
import time
from dataclasses import dataclass

RYW_WINDOW_S = 5.0          # tune to ~2x p99 replica lag

@dataclass
class UserSession:
    user_id: int
    last_write_at: float = 0.0   # monotonic timestamp

def on_write(session: UserSession) -> None:
    session.last_write_at = time.monotonic()

def pick_backend(session: UserSession) -> str:
    """Return 'primary' or 'replica' for this user's next read."""
    if time.monotonic() - session.last_write_at < RYW_WINDOW_S:
        return 'primary'
    return 'replica'

The window needs to cover the replica lag tail, not the median. If p99 lag is 500 ms, a 5-second window is very safe; if p99 is 5 seconds, you need a 20-second window and should probably fix your lag instead. RYW_WINDOW_S should be roughly 2-4× your p99 replica lag.

Cost. Primary absorbs reads for all recent writers. If 10% of active users are within their window, primary handles an extra 10% of read load — manageable, since most reads are from non-writers. Non-recent-writers still hit replicas, so read scaling is preserved for most traffic.

Weakness. Only the user's own writes are accounted for. If user A reads user B's just-posted comment, that comment is still subject to replica lag — A was not the writer. For social feeds where "did I post" matters but "did they post one second ago" does not, fine. For collaborative editors where every participant needs every keystroke, not fine.

Mitigation 2 — sticky sessions

Pin a given user's session to a single replica. The replica may lag the primary, but it lags consistently, and the same user's subsequent reads see the same replica's state — so the user does not see their feed toggle between two replicas' views.

Sticky sessions do not solve the first-read-after-write problem — the pinned replica can still be behind the primary at commit time. They do solve monotonic reads: once a user has seen a post, they continue to see it on every subsequent read, because their replica's state monotonically increases. Without sticky sessions, a user could see their post, then not see it (different, further-behind replica), then see it again. That flicker is confusing in a way plain staleness is not.

Typical implementation: a cookie tagged with a replica ID; the load balancer hashes on the cookie so requests land on the same replica. On replica failure, the user may see a state jump (possibly backward) when re-pinned. Sticky sessions are usually paired with mitigation 1 or 3 to cover first-write.

Mitigation 3 — causal-consistency tokens

The principled answer. Teach the client to carry the information the router needs to pick the right replica.

On write. The primary returns, alongside COMMIT OK, the LSN of the commit. Postgres exposes this via pg_current_wal_lsn() or the commit_lsn in the replication protocol. The application reads this LSN and stores it in the user's session or in a cookie.

On read. The client sends the stored LSN with the request. The read router consults pg_stat_replication (or cached per-replica replay_lsn state) and picks a replica whose replay_lsn >= client_lsn. If none qualifies, it either waits briefly for one to catch up or routes to the primary.

Postgres 10+ exposes pg_wal_replay_wait() — a function you can call on a replica to block until its replay position reaches a target LSN or a timeout fires. The read-path middleware pattern:

# app/causal_router.py
from typing import Optional
import psycopg2

def read_with_causal_token(user_lsn: Optional[str], sql: str, params: tuple):
    """Route a read respecting the user's causal token.

    user_lsn: the LSN the user last committed, e.g. '0/A8F2C110', or None."""
    if user_lsn is None:
        conn = connect_replica_any()
        return conn.execute(sql, params).fetchall()

    replica = pick_replica_caught_up_to(user_lsn)
    if replica is not None:
        conn = connect_replica(replica)
        return conn.execute(sql, params).fetchall()

    # nobody caught up — wait briefly, then escalate to primary
    replica = wait_for_replica(user_lsn, timeout_s=0.1)
    if replica is not None:
        return connect_replica(replica).execute(sql, params).fetchall()
    return connect_primary().execute(sql, params).fetchall()

Why this is the conceptually correct fix: it attaches the causal dependency — "this read depends on a write at LSN X" — to the request itself. The router uses global state (replica LSNs) plus request-carried state (client LSN) to pick a replica whose answer will respect causality. Mitigations 1 and 2 are heuristic approximations of this; mitigation 3 is the actual thing. The cost is engineering: you must plumb LSNs through your application layer, your cookies, your RPC calls. The payoff is that you get read-your-writes with no arbitrary time window and no loss of read-scaling for unrelated traffic.

What the token looks like. In Postgres, a 64-bit LSN. In logical systems, a transaction timestamp or a vector clock. MongoDB's afterClusterTime paired with readConcern: 'majority' is this pattern, productised.

Worst case. No replica is caught up; the router routes to the primary. You have gracefully degraded to mitigation 1 for this one read.

Mitigation 4 — write-through cache

Orthogonal to the replica layer. On every write, the application additionally writes the new value into a fast cache (Redis, Memcached, in-process); on every read, it checks the cache first and falls back to the replica pool only on a miss. The cache TTL must exceed the expected replica lag tail — typically a few seconds to a minute.

# app/write_through.py
import time, json
from typing import Optional

class WriteThroughReader:
    def __init__(self, cache, replica_pool, primary, ttl_s: float = 10.0):
        self.cache = cache
        self.replica = replica_pool
        self.primary = primary
        self.ttl_s = ttl_s

    def post(self, user_id: int, body: str) -> dict:
        row = self.primary.insert_post(user_id, body)
        key = f'post:{row["id"]}'
        self.cache.set(key, json.dumps(row), ex=int(self.ttl_s))
        self.cache.zadd(f'feed:{user_id}', {row["id"]: row["created_at"]})
        return row

    def feed(self, user_id: int, limit: int = 20) -> list[dict]:
        cached_ids = self.cache.zrevrange(f'feed:{user_id}', 0, limit - 1)
        if cached_ids:
            blobs = self.cache.mget([f'post:{i}' for i in cached_ids])
            if all(b is not None for b in blobs):
                return [json.loads(b) for b in blobs]
        return self.replica.query_feed(user_id, limit)

The post-insert path writes to both primary and cache. The feed-read path checks the cache first and falls back to the replica only on a cache miss. For the recently-written user, the cache has their post; the replica may not; the read succeeds either way.

Costs. Cache invalidation on updates and deletes is manual and error-prone. Primary failover can leave the cache inconsistent if the failover lost writes. Cache TTL interacts with replica lag tuning — too short and reads fall through to a lagging replica, too long and stale data survives legitimate updates.

Not a substitute. Write-through caches are an optimisation on top of the database's consistency machinery, not a replacement. Production systems combine mitigation 4 (performance) with mitigation 1 or 3 (correctness).

Monitoring replica lag

You cannot mitigate what you cannot see. Every production replica deployment exports lag metrics to Prometheus or equivalent. Canonical exporter queries:

-- on primary, per replica
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS byte_lag,
       EXTRACT(EPOCH FROM (now() - reply_time))         AS reply_age_s
  FROM pg_stat_replication;

-- on each replica
SELECT CASE WHEN pg_is_in_recovery()
            THEN EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
            ELSE 0 END AS apply_lag_s;

Scrape every 15 seconds. Typical alerting thresholds:

Warning: byte lag > 100 MB or apply lag > 10 s, sustained for 2 minutes.
Critical: byte lag > 1 GB or apply lag > 60 s, sustained for 5 minutes.
Disaster: replica disconnected for 5 minutes (no sent_lsn progress).

A sudden lag spike almost always has one of three causes: a large bulk write (investigate pg_stat_activity on the primary), a replica disk issue (check iostat on the replica), or a network regression (check packet loss on the replication link). The spike's shape tells you which.

Replica-apply parallelism — how modern engines fix the bottleneck

Serial single-threaded apply is the most common bottleneck in physical replication. Recent engines address it:

Postgres logical replication (core from version 10) parallelises apply across tables and, with streaming = parallel (from 14), applies large in-progress transactions using multiple workers. Physical replication remains single-threaded for apply.

MySQL multi-threaded replication (MTR, since 5.7) parallelises along several dimensions: per-schema, per-transaction (based on writeset conflict analysis), or commit-order preserving. A MySQL 8.0 replica with replica_parallel_workers = 16 and replica_parallel_type = LOGICAL_CLOCK sustains several times single-threaded throughput.

CockroachDB, TiKV, YugabyteDB use Raft replication at the range level. Each range has its own replication group; applies across ranges are inherently parallel. Replica lag per range is bounded by that range's throughput alone, not the cluster's aggregate. This is the architectural answer — range-partitioned distributed databases scale write throughput without the apply bottleneck leader-follower systems hit.

Your application: a feed service with 10 million monthly actives, peak 8,000 write TPS and 80,000 read TPS. Deployment: Postgres primary in ap-south-1a, one sync replica in ap-south-1b (2 ms RTT, zero-RPO per ch.68), two async read-scaling replicas in the same region (3 ms RTT), one async DR replica in ap-southeast-1 (40 ms RTT).

Measured lag on the three async replicas, over 24 hours:

Async read-1: p50=4 ms   p95=35 ms   p99=180 ms   max=1.8 s
Async read-2: p50=5 ms   p95=40 ms   p99=210 ms   max=2.3 s
Async DR:     p50=45 ms  p95=120 ms  p99=900 ms   max=12 s

The read-your-writes scenario. A user in Bengaluru posts at t = 0. Primary commits, returns LSN 0/C1A47820. App server stores this LSN in the user's session cookie. User's browser redirects to /feed, issues GET /feed at t = 40 ms.

Router reads the LSN from the session cookie (0/C1A47820). It checks cached per-replica replay_lsn state (refreshed every 500 ms from pg_stat_replication):

read-1: replay_lsn = 0/C1A47820   (caught up)
read-2: replay_lsn = 0/C1A477F0   (behind by 48 bytes, ~1 tuple)
DR:     replay_lsn = 0/C1A0E000   (behind by ~2 MB, ~50 s)

Router picks read-1. Query executes. User sees their post.

Without causal tokens, using mitigation 1 (read-from-primary for 5 s): the GET at t = 40 ms routes to primary. Correct, higher primary load. For an app where ~5% of users are in a 5-s window at any moment, this adds 5% to primary read load — fine.

Using mitigation 2 alone: the session is pinned to read-2. At t = 40 ms, read-2's replay_lsn = 0/C1A477F0, short of the commit. User's post is missing.

Using no mitigation: the load balancer hashes the GET to DR. DR is 50 seconds behind. User reloads for 50 seconds and files a support ticket.

The DR replica's role. It is a disaster-recovery standby for region failure, not in the read-serving rotation. Adding it to the read pool because "there's a spare replica over there" destroys read-your-writes for every user. :::

Common confusions

"Replica lag is a bug." Async replication is defined as "does not wait"; lag is the observable consequence. A zero-lag replica is one receiving no writes. Production engineering is about bounding lag and making the application tolerate it, not eliminating it.
"The user will never notice." Modern frontends redirect within 20-50 ms of a POST response, and any user who double-tapped submit just produced a second request that sees the state from 30 ms ago. The tail of noticed cases is fatter than intuition suggests.
"Eventual consistency and replica lag are the same thing." Related but different. Eventual consistency is the formal property: in the absence of new writes, replicas converge. Replica lag is the empirical measurement now. Eventual consistency says nothing about how long "eventually" is; replica lag measures it.
"Sticky sessions solve read-your-writes." They solve monotonic reads, not first-read-after-write. Sticky sessions must be paired with mitigation 1 or 3 to cover the first-write case.
"More replicas reduce lag." The opposite can be true. Each replica consumes walsender CPU and bandwidth; cross-region replicas saturate shared links during bursts. More replicas help read capacity, not lag.
"Causal tokens make everything linearisable." No. Causal tokens give causal consistency — reads see all writes that causally precede them. They do not give linearisability. Concurrent writers can still see each other's writes in different orders.

Going deeper

CAP, PACELC, and where replica lag sits

CAP (Brewer 2000, Gilbert and Lynch 2002) says that during a network partition a distributed system must choose between Consistency and Availability. Replica lag is the CAP world's "P" side expressed empirically: a partition is the limit case of lag (growing unboundedly while the link is down). An async system during a partition picks A; strict-sync picks C.

PACELC (Abadi 2012) refines CAP for when no partition is occurring: the system still faces a latency-vs-consistency trade-off. "In a Partition, choose Availability or Consistency; Else, choose Latency or Consistency." Replica lag is the currency of the "Else" clause — every mitigation in this chapter is a point on the PACELC-else curve.

Spanner's TrueTime — engineering around lag at the clock layer

Google's Spanner (OSDI 2012) takes an unusual path: rather than mitigate lag in the application, it deploys TrueTime, a clock API backed by atomic clocks and GPS, returning a timestamp interval [earliest, latest] guaranteed to contain the true current time. Transactions commit at timestamp s, then wait until now.earliest > s before releasing locks. This commit-wait guarantees any later-timestamped transaction truly started after the earlier's commit was visible — linearisability across the global database without per-read coordination.

The cost is hardware. CockroachDB mimics it in software with a MaxOffset parameter; if clocks drift past it, correctness is silently lost. Spanner turns replica-lag into clock-uncertainty — tractable if you control the clocks.

Read consistency levels — preview

Chapter 73 covers the spectrum of session-level consistency guarantees — monotonic reads, monotonic writes, read-your-writes, writes-follow-reads, consistent prefix — each ruling out a specific anomaly plain async replication permits. The four mitigations here deliver read-your-writes specifically. The full bundle needs session-long LSN accumulation (client stores the max LSN observed, sends it with every read, updates on every write and every read answer) and replica-side wait semantics that respect all five.

Where this leads next

You have the full picture: formal definition, causes, the anomaly in operational detail, four mitigations with costs and failure modes, and monitoring. Build 9 continues:

Chapter 71: Failover, fencing, and split-brain — which replica becomes primary, how to ensure the old primary cannot still accept writes, and what data loss is attributable to lag at the failover moment.
Chapter 72: Point-in-time recovery — base backups plus retained WAL to restore to any chosen moment; built on the same WAL stream.
Chapter 73: Read-your-writes, monotonic reads, causal sessions — the formal consistency levels that generalise this chapter's mitigations into a per-session contract.

Replica lag is a number on a dashboard; the read-your-writes anomaly is what happens to users when no one is accountable for that number. Measure it, bound it, make the application route around it. Every serious async deployment does all three; those that skip one eventually ship the anomaly to production and blame the frontend.

References

Kleppmann, Designing Data-Intensive Applications, O'Reilly 2017, chapter 5 — the canonical book-length discussion of replication lag, read-your-writes, monotonic reads, and consistent prefix reads, with the cleanest prose on why each anomaly matters.
PostgreSQL documentation, Chapter 27: High Availability, Load Balancing, and Replication — the operational reference: pg_stat_replication, pg_last_xact_replay_timestamp, hot_standby_feedback, max_standby_streaming_delay, and pg_wal_replay_wait.
Abadi, Consistency Tradeoffs in Modern Distributed Database System Design (PACELC), IEEE Computer 2012 — the short paper that refined CAP by pointing out that the latency-vs-consistency trade-off exists even without partitions, which is the regime replica lag lives in.
Corbett et al., Spanner: Querion's Globally-Distributed Database, OSDI 2012 — Spanner's TrueTime mechanism is the definitive example of engineering around replica lag by bounding clock uncertainty rather than bounding the lag itself.
MongoDB documentation, Causal Consistency and Read and Write Concerns — a productised causal-token implementation; afterClusterTime is the token, readConcern: 'majority' is the replica-side check, and the driver plumbs the token through for you.
Bailis et al., Probabilistically Bounded Staleness for Practical Partial Quorums, VLDB 2012 — treats replica lag as a probability distribution and derives expected-staleness bounds for quorum-replicated systems, giving a principled way to reason about "how often does a read return stale data?" instead of worst-case guarantees.