In short

Asynchronous log shipping is the oldest and simplest replication strategy, and it is what you get by default in Postgres, MySQL, and most commercial systems besides. The primary writes every change to its write-ahead log — the same WAL you already use for crash recovery. A separate sender process reads WAL records as they land and streams them over TCP to one or more replica servers. Each replica writes the records into its own WAL, then a replay loop applies them to the local tablespace. The replica is, byte for byte, a running re-execution of the primary's history.

"Asynchronous" means the primary does not wait. It commits locally — fsyncs its own WAL, returns success to the client — and then fires the record at the replica over the network. The primary's commit latency is unchanged by the presence of replicas. The replica lags behind by whatever the network and the replay loop happen to be doing: typically 10-100 ms under light load, seconds under heavy load, unbounded during a network partition. If the primary dies between committing locally and shipping the last few records, those records are lost — the client got an OK the replica never heard about.

In exchange for that hazard you get three wins. Read scaling — dashboard queries go to replicas, which together serve many times the read throughput of the primary alone. Disaster recovery — a replica in another region survives the primary's region going dark. Zero-downtime major-version upgrades — promote a replica running the new version and cut over.

This chapter builds an async log shipper in ~80 lines of Python, shows the byte flow through walsender and walreceiver in Postgres, explains how to measure replica lag, walks the failover procedure, and sets up ch.68 (synchronous replication) and ch.69 (semi-sync and quorum acks) by making the async trade-offs concrete enough to argue against.

It is 3 AM, and an AWS availability zone has gone dark. Your primary Postgres was in that zone. Your replica, in a neighbouring zone, is configured for asynchronous replication; monitoring says its lag at the outage was 800 ms. The deploy script promotes it, applications reconnect, reads and writes resume in about ninety seconds. Then a support ticket lands: a user's payment, confirmed at 3:00:00.412 AM, does not exist in the database. The primary did return COMMIT OK at that instant. The replica never received the record.

Twelve transactions in the primary's last 800 ms are gone. Async replication did exactly what it promises. The question is whether that promise is one your business can accept — yes for a social media timeline, no for a payments ledger. This chapter is about understanding precisely what async replication costs you so you can make that call.

Why ship the log, not the data

To replicate a database you need some stream of information that lets the replica reconstruct the primary's state. There are three broad choices, and shipping the log beats the other two.

Option 1 — ship the data pages. Ship every modified buffer-pool page. The problem is volume: a single UPDATE changing one tuple inside an 8 KB page ships the entire page, and pages get dirtied for reasons unrelated to user writes (hint bits, freespace map updates, autovacuum). You end up shipping background noise.

Option 2 — ship the SQL. Capture each UPDATE, INSERT, and DELETE and replay it on the replica. The problem is non-determinism. UPDATE events SET created_at = NOW() gives the primary 14:02:17.334; the replica ten seconds later gets 14:02:27.891. INSERT INTO t VALUES (nextval('seq')) — primary allocates 5001, replica's sequence is at 4822. Data drifts. MySQL's historical "statement-based replication" hit exactly this and added "mixed" and "row-based" modes to paper over it.

Option 3 — ship the write-ahead log. The WAL is the ordered record of every physical change — page 17, offset 2048, 73 bytes of tuple data. You already have it for crash recovery (see torn writes and the need for a log).

Shipping the WAL wins on both axes. Compact — a 100-byte tuple update produces ~100 bytes of WAL, not an 8 KB page. Deterministic — the record says which bytes to write where, no computation on the replica side. The replica's WAL replay is the same code path it would run after a crash; the only difference is the records arrive over a socket instead of off local disk.

Determinism is what makes async safe in the small: after applying WAL records up to LSN X, the replica's on-disk state is byte-identical to the primary's state at the same LSN. Nothing is "interpreted" on the replica that could drift. The only thing that can go wrong is the replica missing records — not applying them incorrectly.

The physical log stream — Postgres example

In Postgres, the WAL is a sequence of 16 MB segment files under $PGDATA/pg_wal/. WAL records are addressed by LSN (log sequence number), a 64-bit byte offset into the virtual concatenation of all WAL ever generated.
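The addressing arithmetic is worth seeing concretely. Below is a sketch assuming the default 16 MB segment size and timeline 1; `lsn_to_bytes` and `wal_segment_file` are illustrative helpers that mirror the segment-file naming convention, not Postgres APIs.

```python
# LSNs print as 'hi/lo': two hex halves of a 64-bit byte offset into
# the virtual concatenation of all WAL ever generated.
SEG_SIZE = 16 * 2**20              # default wal_segment_size: 16 MB

def lsn_to_bytes(lsn: str) -> int:
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

def wal_segment_file(lsn: str, timeline: int = 1) -> str:
    # Segment file names are three 8-hex-digit fields: the timeline,
    # then the segment number split at the 4 GB boundary.
    segno = lsn_to_bytes(lsn) // SEG_SIZE
    per_xlogid = (1 << 32) // SEG_SIZE     # 256 segments per 4 GB
    return f"{timeline:08X}{segno // per_xlogid:08X}{segno % per_xlogid:08X}"

print(wal_segment_file('0/16B374D8'))   # 000000010000000000000016
```

So LSN 0/16B374D8 lives in segment file 000000010000000000000016 under pg_wal/.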

When a replica connects for streaming replication, it issues a START_REPLICATION command over the normal Postgres wire protocol. The primary's postmaster forks a walsender process dedicated to that replica. The walsender's job is simple: read WAL records from the local pg_wal/ starting at the LSN the replica asked for, and send each one to the replica over TCP as fast as TCP will carry it. Periodically it records the replica's feedback — "I have applied up to LSN X" — for monitoring.

On the replica side, a walreceiver process reads the stream and writes records into the replica's own pg_wal/. A separate startup process (the same code that runs during crash recovery) consumes WAL from local disk and applies it to the tablespace. The two-stage design — receive, then apply — lets the replica keep durable WAL even when the apply loop falls behind.

[Figure: streaming replication topology. Left box, Primary: application writes land in pg_wal/, read by a walsender forked from the postmaster. Right box, Replica: a walreceiver writes incoming records into the replica's own pg_wal/, and the startup (apply) process reads them from local disk and applies them to the replica tablespace. WAL records flow primary to replica over TCP; a dashed arrow carries the replica's replay-LSN feedback back to the walsender.]
Streaming replication in Postgres. Application writes land in the primary's WAL; walsender reads from pg_wal and streams records over TCP to the replica's walreceiver, which writes them to the replica's pg_wal. A separate startup process reads from the replica's pg_wal and applies records to the tablespace. The replica periodically reports its replay LSN back to the walsender for monitoring — but in async mode the primary does not wait for that report before returning COMMIT OK to the client.

Every streaming replication system looks more or less like this. MySQL's binlog replication swaps WAL for the logical binlog and the walsender/walreceiver pair for the binlog dump thread and the replica's I/O and SQL threads. Oracle Data Guard swaps them for redo log shipping and the managed recovery process. Engine-specific machinery varies; the abstract picture is one sender, one receiver, one applier, records flowing left to right over TCP.

Implementing a toy log shipper in Python

Build the primary side first. Strip the idea to its core: an append-only log, a list of replicas, and a broadcast on every append.

# replication/primary.py
from dataclasses import dataclass
from typing import Any
import socket, pickle, threading

@dataclass
class WALRecord:
    lsn: int           # monotonic log sequence number
    op: str            # 'SET' or 'DELETE'
    key: str
    value: Any = None

class WALPrimary:
    """Owns the WAL and fans records out to registered replicas.
    Async: the append returns as soon as the local WAL is updated;
    network failures on replicas are ignored."""

    def __init__(self):
        self.wal: list[WALRecord] = []
        self.replicas: list[socket.socket] = []
        self._lock = threading.Lock()
        self._next_lsn = 1

    def register(self, sock: socket.socket) -> None:
        self.replicas.append(sock)

    def append(self, op: str, key: str, value: Any = None) -> int:
        with self._lock:
            rec = WALRecord(self._next_lsn, op, key, value)
            self.wal.append(rec)
            self._next_lsn += 1
        self._broadcast(rec)
        return rec.lsn

    def _broadcast(self, rec: WALRecord) -> None:
        for sock in list(self.replicas):
            try:
                sock.sendall(pickle.dumps(rec))
            except (BrokenPipeError, OSError):
                pass   # async: don't care about dead replicas

The append path takes the internal lock only long enough to assign an LSN and append locally, then fans out to replicas with pickle.dumps over the sockets. The try/except is the whole idea of async: if a replica's socket is broken, the record is dropped for that replica and the primary moves on. The client gets its OK from append regardless. That is fire-and-forget semantics.

Now the replica:

# replication/replica.py
import io, socket, pickle, threading
from typing import Any
from replication.primary import WALRecord

class WALReplica:
    """Receives WAL records on a socket, applies them, tracks the last
    LSN applied. Real replicas also persist to their own WAL first;
    this toy applies directly."""

    def __init__(self):
        self.state: dict[str, Any] = {}
        self.applied_lsn: int = 0
        self._lock = threading.Lock()

    def run(self, sock: socket.socket) -> None:
        buf = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                return          # primary closed connection
            buf += chunk
            # Pickle streams are self-delimiting: Unpickler.load() consumes
            # exactly one record, so we can peel complete records off the
            # front of the buffer and keep any trailing partial bytes.
            stream = io.BytesIO(buf)
            consumed = 0
            while True:
                try:
                    rec = pickle.Unpickler(stream).load()
                except (pickle.UnpicklingError, EOFError):
                    break       # incomplete record: wait for more bytes
                consumed = stream.tell()
                self._apply(rec)
            buf = buf[consumed:]

    def _apply(self, rec: WALRecord) -> None:
        with self._lock:
            if rec.op == 'SET':
                self.state[rec.key] = rec.value
            elif rec.op == 'DELETE':
                self.state.pop(rec.key, None)
            self.applied_lsn = rec.lsn
Together the two files are about seventy lines. They implement the full async-shipping loop: primary appends, broadcasts without waiting, replica receives, applies, updates applied_lsn. A third party asking the replica for data sees whatever state the replica has applied so far.

What is missing that a production system has: durable local WAL on the replica before apply, a reconnect-and-resume protocol keyed on last applied LSN, proper binary framing, TCP flow control, TLS, authentication. But the core semantics — append, broadcast, apply, never wait — are all already here.
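To watch the loop run end to end, wire the two sides over socket.socketpair(). The sketch below is self-contained: it re-declares condensed versions of the chapter's classes (locking elided, SET only) so it runs standalone.

```python
# demo: primary appends and returns immediately; replica catches up async.
import io, pickle, socket, threading, time
from dataclasses import dataclass
from typing import Any

@dataclass
class WALRecord:
    lsn: int
    op: str
    key: str
    value: Any = None

class Primary:
    def __init__(self):
        self.wal, self.replicas, self._next_lsn = [], [], 1

    def append(self, op, key, value=None):
        rec = WALRecord(self._next_lsn, op, key, value)
        self.wal.append(rec)
        self._next_lsn += 1
        for s in self.replicas:              # async fan-out
            try:
                s.sendall(pickle.dumps(rec))
            except OSError:
                pass                         # dead replica: don't care
        return rec.lsn

class Replica:
    def __init__(self):
        self.state, self.applied_lsn = {}, 0

    def run(self, sock):
        buf = b""
        while chunk := sock.recv(4096):
            buf += chunk
            stream, consumed = io.BytesIO(buf), 0
            while True:
                try:
                    rec = pickle.Unpickler(stream).load()
                except (pickle.UnpicklingError, EOFError):
                    break                    # partial record: wait
                consumed = stream.tell()
                self.state[rec.key] = rec.value
                self.applied_lsn = rec.lsn
            buf = buf[consumed:]

primary_end, replica_end = socket.socketpair()
primary, replica = Primary(), Replica()
primary.replicas.append(primary_end)
threading.Thread(target=replica.run, args=(replica_end,), daemon=True).start()

primary.append('SET', 'user:1', 'alice')     # returns immediately
last = primary.append('SET', 'user:2', 'bob')
while replica.applied_lsn < last:            # watch the lag close
    time.sleep(0.01)
print(replica.state)                         # {'user:1': 'alice', 'user:2': 'bob'}
primary_end.close()
```

The two appends return before the replica has seen anything; the poll loop at the bottom is the lag window made visible.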

Replica lag — the one metric you always watch

In a working async setup the replica is always a little behind; "by how much?" is the single most important observability metric in replication.

Lag is measured two ways. Time lag — the wall-clock difference between when the primary committed a transaction and when the replica applied it. Postgres: now() - pg_last_xact_replay_timestamp(). MySQL: Seconds_Behind_Master. Easy to reason about, matches what users care about, but only meaningful when the primary is actively writing. Byte lag — the WAL-byte distance between the primary's current write position and the replica's apply position. Postgres: pg_wal_lsn_diff(pg_current_wal_lsn(), pg_last_wal_replay_lsn()). A more robust alarm signal because it keeps updating even when the primary is temporarily idle.
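Byte lag is just subtraction over the two positions. A minimal sketch of what pg_wal_lsn_diff computes, using the hi/lo text form of LSNs; lsn_diff is an illustrative helper, not the server function.

```python
def lsn_to_bytes(lsn: str) -> int:
    hi, lo = lsn.split('/')
    return (int(hi, 16) << 32) | int(lo, 16)

def lsn_diff(a: str, b: str) -> int:
    """Bytes of WAL between position b and the later position a."""
    return lsn_to_bytes(a) - lsn_to_bytes(b)

# primary write position vs replica replay position
print(lsn_diff('2/4B3C0000', '2/4A800000'))   # 12320768 bytes, ~11.75 MB of lag
print(lsn_diff('1/0', '0/FF000000'))          # 16777216: works across the hi boundary
```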

What is normal. A well-provisioned intra-region replica under moderate write load lags by single-digit milliseconds to low tens of milliseconds, or roughly 1 MB of WAL, with occasional spikes to hundreds of milliseconds when a big transaction commits.

What is abnormal. Sustained seconds-or-minutes lag is a red flag. Causes, in order of frequency:

  1. Replica CPU or I/O saturation. Postgres physical replay is single-threaded: one startup process applies all WAL. If the primary generates WAL faster than that process can apply it, lag grows without bound.
  2. Long read query conflict. Postgres pauses WAL apply if a replica query is reading a tuple the primary just vacuumed. Tunable via hot_standby_feedback (costs primary vacuum aggressiveness) or max_standby_streaming_delay (costs query cancellations).
  3. Network throughput. Rare intra-datacenter, common over slow VPN or cross-region links with insufficient bandwidth.
  4. A bloated transaction. A DELETE FROM huge_table generates a WAL record per tuple; the replica replays a billion records that the primary produced in minutes.

Production monitoring alarms on lag > 10 s or lag > 1 GB of WAL. The alarm is not "something is broken" — it is "your disaster-recovery window has opened wider than you planned for."

Catching up after a disconnect

A replica that loses its connection — a brief outage, a deploy, an OS upgrade — needs to catch up when it returns. The protocol: on reconnect the replica tells the primary the last LSN it durably received; the primary streams everything from there forward.

The subtlety is that the primary must still have the WAL covering that LSN. Postgres retains WAL segments until the checkpointer decides they are no longer needed — once a segment's contents have been flushed to the main data files, the segment becomes eligible for recycling. A replica that was offline long enough for its last-applied LSN to fall behind the oldest retained segment is stale: the primary cannot stream missing records, and the replica must re-sync from scratch with pg_basebackup.
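In the toy model of this chapter, the reconnect decision reduces to one comparison against the oldest retained record. A sketch — resume_stream and Rec are illustrative names, not the Postgres protocol:

```python
from collections import namedtuple

Rec = namedtuple('Rec', 'lsn op key')

def resume_stream(retained_wal, replica_lsn):
    """retained_wal: the contiguous tail of records the primary still has,
    oldest first. Returns the records to stream, or None if the replica is
    stale and must re-sync from a base backup."""
    if retained_wal and retained_wal[0].lsn > replica_lsn + 1:
        return None                   # gap: the WAL it needs was recycled
    return [r for r in retained_wal if r.lsn > replica_lsn]

tail = [Rec(n, 'SET', f'k{n}') for n in range(90, 101)]  # LSNs 90..100 kept

print([r.lsn for r in resume_stream(tail, 95)])   # [96, 97, 98, 99, 100]
print(resume_stream(tail, 42))                    # None: needs pg_basebackup
```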

Two mechanisms control WAL retention. wal_keep_size is a hard minimum of recent WAL bytes the primary always keeps for potential replica catch-up; typical 1-16 GB. Why not set it huge: every byte of kept WAL is disk space not available for anything else, and the primary's own startup scans kept WAL. Replication slots go further — the replica "reserves" a position, and the primary promises not to recycle WAL past it. Safe for transient outages, dangerous for abandoned replicas where nobody notices the disk filling. The Postgres 13 max_slot_wal_keep_size cap exists specifically to prevent abandoned slots from destroying production primaries.

Every production setup has an explicit answer to: if a replica disappears for longer than X, what happens? Some combination of wal_keep_size, slot configuration, slot-lag monitoring, and a pg_basebackup runbook for when all else fails.

The data-loss trade-off

The async promise, measured in microseconds:

  1. Client sends COMMIT.
  2. Primary writes and fsyncs the commit record to its WAL.
  3. Primary returns COMMIT OK to the client. (T + few ms)
  4. Primary hands the record to the walsender.
  5. Walsender TCP-sends. (T + RTT/2)
  6. Replica walreceiver writes to local WAL. (T + RTT/2 + disk)
  7. Replica startup applies. (T + RTT/2 + disk + apply)

The client's OK arrives at step 3. Everything from step 4 onward is best-effort. If the primary crashes anywhere between steps 3 and 5, the replica never sees the record, and the client's guarantee — "your write is committed" — is a lie about everything except the dead primary's disk.

The industry term for this window is RPO (Recovery Point Objective): the maximum data you are willing to lose in a disaster. For async replication, RPO equals current replica lag. At 10,000 TPS with 800 ms of lag, you can lose 8,000 transactions. Whether that is acceptable depends on what those transactions are. "User clicked a like button" — fine. "User sent a wire transfer" — career-ending.

Why async is the default anyway: most workloads really can tolerate a small RPO, and the latency gain is dramatic. A primary that waits for a cross-region replica to fsync adds 100+ ms to every commit. An e-commerce cart update that took 15 ms now takes 120 ms. Users notice. Async lets you keep the 15 ms commit latency and accept the small data-loss window as the cost of doing business.

For workloads where the RPO is unacceptable, ch.68 covers synchronous replication — primary waits for replica ack before returning OK, making RPO zero but paying RTT on every commit. Ch.69 covers the quorum-based middle ground.

Failover with async replication

When the primary dies you must promote a replica. The procedure, stripped:

  1. Detect the failure. A health check misses N consecutive heartbeats; typical N=3, interval 1 s — detection takes 3-5 seconds. Faster detection risks false positives from transient hiccups.
  2. Fence the old primary. Make absolutely sure it cannot still accept writes. This is the split-brain failure mode: the network between health-checker and primary is partitioned, but the primary is still alive and serving clients on the other side. Without fencing you get two primaries and divergent writes, and you must throw one side's data away. Tools: Pacemaker with STONITH, cloud APIs to forcibly terminate the instance, shared quorum disks.
  3. Promote the replica. Postgres: pg_ctl promote. MySQL: STOP REPLICA; RESET REPLICA ALL. Seconds.
  4. Re-point application traffic. Update DNS, PgBouncer upstream, HAProxy — whatever routes in front of the database. The trap here is DNS TTLs still cached in clients keeping some applications talking to a dead primary for minutes.
  5. Re-sync surviving replicas against the new primary. Often a full pg_basebackup because the new primary's WAL history has diverged past the failover point.

End-to-end failover time rarely matches the per-step "seconds" sum. Detection (3-5 s) + fencing (2-10 s) + promotion (1-2 s) + DNS propagation (30 s to minutes) + pool reconnect (seconds to tens of seconds). Well-tuned: 10-30 seconds. Poorly tuned: 5-10 minutes. The application sees database unavailability across that window.

The data between the old primary's last streamed WAL record and its crash is gone. Either you recover it from the old primary's disk after the fact (if the disk survived) or you accept the loss as the cost of your RPO.

Read scaling with replicas

The other reason to run replicas is throughput. A primary plus N replicas all serving reads handle roughly N+1 times the read load. Writes go to primary; reads — analytics, dashboards, background jobs — go to any replica. ORMs and middleware poolers (PgBouncer split pools, ProxySQL) route automatically by statement type.

The hazard is read-your-own-writes consistency: POST /comment commits on primary at t=0; GET /comments at t=100 ms routes to a replica lagging 300 ms, and the user's own comment is missing. "I just posted it — where is it?"

Standard mitigations:

  • Read from the primary for N seconds after a write: keep a per-user last-write timestamp in the session and route to the primary until elapsed > lag + margin.
  • Causal-consistency tokens: the primary returns its commit LSN with the write response; reads carry that LSN, and the router picks a replica whose applied_lsn >= client_lsn, falling back to the primary.
  • Accept the staleness for dashboards and aggregates.

Tokens are the principled answer; most systems use the N-seconds heuristic because it is cheaper to implement.
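The token scheme fits in a few lines. A sketch over toy nodes; route_read and Node are illustrative names, not a real pooler API:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    applied_lsn: int   # last LSN this node has applied

def route_read(client_lsn, replicas, primary):
    """Send the read to any replica that has applied at least the
    client's last write; otherwise fall back to the primary."""
    for r in replicas:
        if r.applied_lsn >= client_lsn:
            return r
    return primary

primary = Node('primary', 1000)
replicas = [Node('replica-a', 998), Node('replica-b', 990)]

print(route_read(995, replicas, primary).name)   # replica-a: caught up enough
print(route_read(999, replicas, primary).name)   # primary: no replica has LSN 999
```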

A three-region async replication setup

You run a Postgres primary in ap-south-1 (Mumbai) serving a user-facing application. Peak load 10,000 TPS writes, 50,000 TPS reads.

  • Primary: ap-south-1a.
  • Replica A: ap-south-1b (different AZ). Intra-region RTT: 1-2 ms.
  • Replica B: ap-south-1c (third AZ). Intra-region RTT: 1-2 ms.
  • Replica C: eu-west-1 (Dublin). Cross-region RTT: 130 ms.

Measured lag distribution over 24 hours:

Replica A (ap-south-1b):  p50=8 ms   p95=45 ms   p99=180 ms   max=1.2 s
Replica B (ap-south-1c):  p50=7 ms   p95=40 ms   p99=160 ms   max=900 ms
Replica C (eu-west-1):    p50=140 ms p95=280 ms  p99=1.1 s    max=8 s

Intra-region replicas lag by milliseconds under normal load, with spikes to hundreds of milliseconds during large batch jobs. The cross-region replica always lags by at least the one-way network delay (~65 ms) plus buffering and apply; under load it spikes to seconds.

RPO for a region-wide failure: if ap-south-1 goes dark, you failover to Replica C. Its p50 lag is 140 ms, p99 is 1.1 s, worst-case 8 s. At 10,000 TPS that is up to 80,000 lost transactions worst-case, ~1,400 expected.
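Both figures fall straight out of lag times throughput:

```python
def lost_transactions(tps: int, lag_seconds: float) -> int:
    # transactions the primary committed during the lag window
    return int(tps * lag_seconds)

print(lost_transactions(10_000, 8.0))    # 80000: worst-case, at Replica C's max lag
print(lost_transactions(10_000, 0.14))   # 1400: expected, at Replica C's p50 lag
```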

Read routing: reads unrelated to recent writes go to any replica; reads within 2 s of a user's last write go to primary (intra-region p99 is 180 ms; 2 s is a comfortable margin). Replica C stays DR-only, out of the normal read rotation — its p99 lag would cause frequent read-your-own-writes failures for Mumbai users.

What ch.68 would change: if you wanted RPO ≈ 0 you would configure synchronous replication to Replica A or B (2 ms RTT is tolerable); primary commit latency rises from ~3 ms to ~5 ms. Replica C stays async — 130 ms per commit is unacceptable. For regional disasters, synchronous intra-region replication does nothing; this is why serious systems layer quorum-based synchronous replication (ch.69) across at least three regions.

Going deeper

Replication slots — guaranteed catch-up, at a price

wal_keep_size gives eventual catch-up: the replica can reconnect as long as the WAL covering its last-applied LSN has not been recycled. Offline longer than the retention window means a full pg_basebackup. For setups where a full re-sync is unacceptable (large databases, thin network pipes), Postgres offers replication slots — a named resource on the primary that tracks a specific replica's restart_lsn. The primary guarantees WAL at or after that position will not be recycled.

The cost: an abandoned slot will destroy your primary. If a replica is torn down without cleaning up its slot, the primary accumulates WAL forever. Postgres 13 added max_slot_wal_keep_size as a safety cap: once slot lag exceeds it, the slot is invalidated. Always set this in production.

Logical vs physical replication

Everything in this chapter is physical replication — byte-for-byte WAL streaming, replica on-disk format identical to primary. Constraints: same major version, whole-instance replication only, replica is read-only.

Logical replication (Postgres 10+) decodes WAL into per-row changes and ships them to subscribers, who replay as SQL-equivalent operations. Costs: more CPU on primary, larger on-wire data. Wins: selective replication, cross-version replication, non-Postgres consumers (Kafka via Debezium), writable subscribers. A later Build 9 chapter covers logical in detail; the async streaming semantics apply equally.

Cross-region throughput limits

Streaming across continents — Mumbai to London at 130 ms RTT — hits a quiet limit most single-region setups never see: TCP throughput is capped by the receive window divided by the RTT. A default Linux 6 MB window at 130 ms RTT caps throughput around 46 MB/s regardless of link bandwidth — a primary generating 100 MB/s of WAL falls behind forever. Fixes: TCP window scaling with tcp_rmem/tcp_wmem auto-tuning, and wal_compression = on, which typically achieves 3-5× ratios because WAL records are repetitive (Postgres 15 added LZ4 and Zstandard options for wal_compression).
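The cap is bandwidth-delay arithmetic: at steady state, TCP delivers at most one receive window per round trip.

```python
def tcp_throughput_cap(window_bytes: int, rtt_seconds: float) -> float:
    """Max steady-state TCP throughput: one receive window per RTT."""
    return window_bytes / rtt_seconds

mb_per_s = tcp_throughput_cap(6 * 2**20, 0.130) / 2**20
print(f"{mb_per_s:.0f} MB/s")   # 46 MB/s, no matter how fat the link is
```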

Where this leads next

You now have the full picture of async log shipping: the WAL-vs-alternatives argument, the walsender/walreceiver physical flow, a working toy implementation, lag as the central metric, catch-up semantics, RPO, the failover procedure, and read scaling with its caveats. Every other Build 9 chapter adjusts one dimension of this picture without discarding the rest.

Async is the default because it is simplest, fastest, most available — and because "a small window of data loss on a rare disaster" is acceptable for most workloads. Sync and quorum exist for the ones that cannot accept it. You pick your rung on the ladder by measuring your RPO tolerance against your latency and availability budgets.

References

  1. PostgreSQL documentation, Chapter 27: High Availability, Load Balancing, and Replication — the canonical reference for streaming replication, synchronous_commit, wal_keep_size, replication slots, and failover procedures.
  2. MySQL documentation, Chapter 19: Replication — the equivalent reference for MySQL's binlog-based replication, including GTID-based failover and the Seconds_Behind_Master semantics that differ subtly from Postgres lag.
  3. Oracle, Oracle Data Guard Concepts and Administration, 19c — Oracle's enterprise replication stack. Maximum Performance, Maximum Availability, and Maximum Protection modes map directly to async, semi-sync, and sync.
  4. Kleppmann, Designing Data-Intensive Applications, O'Reilly 2017, chapter 5 — the clearest book-length treatment of replication trade-offs, covering leader-follower, multi-leader, and leaderless models and the consistency implications of replication lag.
  5. Kingsbury, Jepsen: PostgreSQL 12.3 — Aphyr's Jepsen tests of Postgres replication under network partitions, with documented split-brain modes when fencing is misconfigured.
  6. Heroku Engineering Blog, Postgres HA architecture — a production-grade walk-through of running thousands of Postgres HA clusters, with concrete numbers on failover detection timing, promotion latency, and real-world RPO distributions.