In short
Synchronous replication closes the data-loss window that async log shipping opens, and it does so by inverting the one rule that made async fast: the primary now waits. When a client sends COMMIT, the primary writes the WAL record locally, ships it to at least one replica, and then — before telling the client anything — waits for that replica to answer "got it, persisted." Only after the ack does the primary return COMMIT OK. The guarantee you buy: any transaction the client saw as committed survives an immediate primary crash, because at least one replica already has the WAL record on disk. The RPO (Recovery Point Objective) drops from "up to your current lag window" to exactly zero.
The price is paid on every commit, forever. The client's commit latency now includes one full network round-trip to the replica plus the replica's own fsync. For a replica in the same rack (~0.5 ms RTT) the extra latency is ~2 ms per commit; same availability zone (~2 ms RTT) adds ~5 ms; cross-AZ adds ~20 ms; cross-region Mumbai-to-Frankfurt adds ~120 ms. A workload that previously committed in 3 ms now commits in 5-120 ms depending on where its sync partner lives, and a sequential-commit loop tops out at 1 / (commit_latency) transactions per second.
In Postgres the switch is synchronous_commit = on combined with synchronous_standby_names. The non-obvious failure mode: with a single sync standby configured ('replica1'), the primary stalls all writes when replica1 dies, because it is waiting forever for an ack that cannot come. Fix: name multiple sync candidates — 'FIRST 1 (replica1, replica2)' — so at least one is always alive. Sync replication is about failure semantics (RPO=0) and not about steady-state throughput; for throughput, group commit and quorum configurations recover most of what a naive sync setup loses.
The CTO of a fintech startup reads chapter 67 and calls you into a conference room with the door closed. "The payment window. 3 AM. You said twelve transactions can just vanish if the primary dies between committing locally and shipping the WAL. We absolutely cannot lose a payment. What is the fix?"
The fix is this chapter. It is also a five-millisecond tax on every commit for the system's life, and an availability hazard if misconfigured. Sync replication is not strictly safer than async — it trades one risk (data loss on primary crash) for another (writes stall when the sync replica is unreachable).
The sync vs async protocol difference
The entire difference between async and sync replication fits in one edit to the commit protocol. Async, as you saw in ch.67:
1. Client sends COMMIT.
2. Primary writes WAL record locally, fsyncs.
3. Primary returns COMMIT OK to client.
4. Primary, in background, ships WAL record to replica.
Sync inserts one new wait between steps 2 and 3:
1. Client sends COMMIT.
2. Primary writes WAL record locally, fsyncs.
3. Primary sends WAL record to replica.
4. Primary waits for replica to fsync and ack.
5. Primary returns COMMIT OK to client.
That one wait — steps 3 and 4 — is the entire chapter. Every operational property of sync replication (latency, availability, RPO) derives from it.
Why moving the ack inside the commit path changes durability absolutely: the client's "OK" is now contingent on the WAL existing in at least two places. If the primary explodes the instant after returning OK, the WAL record is still on the replica's disk; promotion recovers it. In async, the primary's OK only meant "my local disk has it" — an explosion after OK but before shipping lost the record forever.
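The crash-survival difference can be made concrete with a toy model — in-memory lists standing in for fsync'd disks, and all names here illustrative rather than taken from the chapter's replication module:

```python
# Toy model of the durability difference. "Disk" is an in-memory list;
# discarding a node models a crash that loses everything on it.

class Node:
    def __init__(self):
        self.disk = []  # stands in for fsync'd WAL storage


def async_commit(primary, replica, rec):
    primary.disk.append(rec)  # step 2: local WAL write + fsync
    # Step 3: return COMMIT OK now. Step 4 (shipping) never happens,
    # because in this scenario the primary crashes first.
    return "COMMIT OK"


def sync_commit(primary, replica, rec):
    primary.disk.append(rec)  # step 2: local WAL write + fsync
    replica.disk.append(rec)  # steps 3-4: ship, replica fsyncs, acks
    return "COMMIT OK"        # step 5: only now does the client see OK


# Async: client saw OK, primary crashes, record exists nowhere else.
p, r = Node(), Node()
assert async_commit(p, r, "txn-1") == "COMMIT OK"
p = None                      # primary explodes after returning OK
assert "txn-1" not in r.disk  # the acknowledged commit is gone forever

# Sync: the ack preceded OK, so the replica already has the record.
p, r = Node(), Node()
assert sync_commit(p, r, "txn-1") == "COMMIT OK"
p = None
assert "txn-1" in r.disk      # promoting the replica recovers the commit
```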
The new wait is not small. It is at least one network round-trip (primary to replica, ack back) plus one replica-side fsync. In the fast case (same rack, NVMe) it is a few milliseconds. In the slow case (cross-continent) it is over a hundred. And the commit path is the critical path for almost every OLTP workload.
The latency tax — quantified
Break down where the milliseconds go.
Primary-local commit (baseline, no replication):
- WAL write: ~10 μs (kernel buffer).
- fsync on NVMe with battery-backed cache: ~200 μs; spinning rust with write cache off: ~5 ms.
- Commit LSN update, lock release: ~10 μs.
- Total: ~0.5-5 ms depending on storage.
Async (ch.67): Zero extra. The walsender hands bytes to the kernel TCP buffer and returns; the client has long since been told OK.
Sync: Add one RTT and one replica fsync.
| Distance | Typical RTT | Replica fsync | Extra latency |
|---|---|---|---|
| Same rack | 0.1-0.5 ms | 0.2-1 ms | ~1-2 ms |
| Same AZ | 1-2 ms | 0.5-1 ms | ~2-5 ms |
| Cross-AZ (same region) | 5-15 ms | 0.5-1 ms | ~8-20 ms |
| Cross-region (same continent) | 20-40 ms | 1 ms | ~25-45 ms |
| Cross-continent | 80-180 ms | 1 ms | ~85-185 ms |
Why the replica fsync matters as much as the network: the replica is promising durability, not just receipt. A cross-rack deployment with a replica on spinning disks can have its fsync dominate the round-trip. This is why many production setups co-locate a fast NVMe sync replica and a slow-disk async replica — fast storage for the sync partner is far more valuable than fast storage for an async reader pool.
For a serial-commit workload — one transaction waiting for the previous to commit before starting — these extras dictate a hard throughput ceiling. Async at 1 ms commit latency tops out at ~1000 TPS serial. The same workload with a 5 ms same-AZ sync replica tops at ~200 TPS. Cross-region, 8 TPS. The tax is not linear on throughput; it is linear on the serial part of the workload.
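The serial ceiling is just arithmetic — a sketch with the chapter's numbers plugged in:

```python
def serial_tps(commit_latency_ms: float) -> float:
    """Throughput ceiling for one-at-a-time commits: 1 / commit latency."""
    return 1000.0 / commit_latency_ms

assert serial_tps(1.0) == 1000.0       # async baseline, 1 ms commits
assert serial_tps(5.0) == 200.0        # same-AZ sync partner
assert round(serial_tps(125.0)) == 8   # cross-region sync: ~8 TPS
```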
Parallel commits fare much better (see group commit below). For a payment API with 500 concurrent connections, the throughput hit is typically 10-20%, not 80%. For a batch job committing in a tight sequential loop, it can be catastrophic.
The Postgres configuration
Postgres exposes sync replication through two knobs. The first, synchronous_commit, controls what kind of ack the primary waits for on each commit. It takes five values in increasing order of durability:
| Value | Primary waits for |
|---|---|
| off | Nothing. Even the primary's own WAL fsync is deferred. |
| local | Primary's own WAL fsync only. Replication still ships, but the commit does not wait for it — async semantics. |
| remote_write | Replica's walreceiver has written the record to the OS page cache. |
| on (default when sync is configured) | Replica has fsync'd the record to disk. |
| remote_apply | Replica has fsync'd and replayed the record into the replica's data files. |
remote_write is faster than on because it skips waiting for the replica's fsync: it saves a few milliseconds of commit latency at the cost of a tiny extra data-loss window (a record sitting in the replica's OS buffer is lost if the replica crashes before flushing). remote_apply is the strongest guarantee — it means any SELECT on the replica will see the just-committed row — and the slowest, because the replica's apply loop usually lags its write loop. Most production deployments pick on as the sane middle.
The second knob, synchronous_standby_names, names which replicas can satisfy the ack. Two grammar forms:
synchronous_standby_names = 'FIRST 1 (replica_a, replica_b, replica_c)'
synchronous_standby_names = 'ANY 2 (replica_a, replica_b, replica_c)'
FIRST N (...) means: of the listed replicas currently connected, the first N in list order must ack. Ordering is explicit priority. If replica_a is connected it is the sync partner; if it disconnects, replica_b takes over; and so on.
ANY N (...) is unordered quorum: any N of the listed replicas can ack. This is the building block of quorum-based replication (ch.69) and is what you actually want for HA.
Why the naming matters for availability: 'replica_a' (just a name, an implicit FIRST 1) means the primary's entire write path is gated on replica_a's liveness. 'FIRST 1 (replica_a, replica_b)' means the primary stays up as long as either is alive. The difference between "one sync replica" and "one required ack from a pool" is the difference between "your primary stalls when your sync replica dies" and "your primary survives any single-replica failure." You almost always want the latter.
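A small model of the two grammars makes the availability difference mechanical. This is a hypothetical helper, not Postgres's actual standby-matching code: commits can proceed only while enough listed standbys are connected.

```python
def sync_set(mode: str, n: int, pool: list[str], connected: set[str]):
    """Which standbys can satisfy the ack, or None if the guarantee fails.

    FIRST n: the first n *connected* standbys in list order (priority).
    ANY n:   any n connected standbys from the pool (quorum).
    """
    alive = [name for name in pool if name in connected]
    if len(alive) < n:
        return None  # primary stalls: not enough sync candidates
    if mode == "FIRST":
        return alive[:n]       # explicit priority order
    if mode == "ANY":
        return alive           # any n acks from these satisfy the commit
    raise ValueError(mode)

# 'replica_a' alone == FIRST 1 (replica_a): replica_a down => writes stall.
assert sync_set("FIRST", 1, ["replica_a"], set()) is None

# 'FIRST 1 (replica_a, replica_b)': replica_b covers replica_a's failure.
assert sync_set("FIRST", 1, ["replica_a", "replica_b"],
                {"replica_b"}) == ["replica_b"]
```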
A transaction can also opt out at statement scope with SET LOCAL synchronous_commit = off. Useful for bulk loads where you accept async durability for the import and then resume sync for the live traffic afterwards. The off at transaction scope does not disable replication — it just declines to wait.
Python sketch of a sync shipper
Modify the async primary from ch.67. The structural change is small: append-then-broadcast becomes append-then-broadcast-and-wait. The semantic change is enormous: if no replica answers, the caller does not return.
```python
# replication/sync_primary.py
import pickle
import socket
import time

from replication.primary import WALRecord, WALPrimary


class CommitStalled(Exception):
    """Raised when sync commit cannot complete — no sync partner acked."""


class WALSyncPrimary(WALPrimary):
    """Like WALPrimary but blocks on at least `needed` acks before
    declaring a commit done. Models synchronous_standby_names 'ANY N'."""

    def __init__(self, needed: int = 1, timeout_s: float = 5.0):
        super().__init__()
        self.needed = needed
        self.timeout_s = timeout_s

    def append(self, op: str, key: str, value=None) -> int:
        with self._lock:
            rec = WALRecord(self._next_lsn, op, key, value)
            self.wal.append(rec)
            self._next_lsn += 1
        # Broadcast outside the lock so other transactions can append
        # (and later batch) while this one waits for its acks.
        acks = self._broadcast_and_wait(rec)
        if acks < self.needed:
            raise CommitStalled(f"got {acks} acks, needed {self.needed}")
        return rec.lsn

    def _broadcast_and_wait(self, rec: WALRecord) -> int:
        payload = pickle.dumps(rec)
        deadline = time.monotonic() + self.timeout_s
        acks = 0
        for sock in list(self.replicas):
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                sock.sendall(payload)
                reply = sock.recv(64)  # blocks until replica acks
                if reply.startswith(b"ACK"):
                    acks += 1
                    if acks >= self.needed:
                        return acks
            except (socket.timeout, OSError):
                continue  # try the next replica
        return acks
```
About thirty lines. The visible difference from ch.67's append: the method no longer returns the LSN as soon as the record has been broadcast. It returns only after it has collected needed ACKs or exhausted timeout_s. A caller asking for a commit on a dead sync partner blocks for timeout_s seconds and then gets CommitStalled.
The corresponding replica change is one extra line — after applying, send b"ACK" on the socket. You already have durable enough semantics in the receive-and-apply path; promoting it to sync is purely the ack.
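To see the replica side in isolation, here is a minimal loopback sketch — a toy protocol, not ch.67's actual replica class: the server receives one record, "applies" it, and sends the one extra line, b"ACK", that turns the shipper into a sync shipper.

```python
import pickle
import socket
import threading

def replica_once(server_sock):
    """Accept one connection, receive one record, apply it, ack it."""
    conn, _ = server_sock.accept()
    rec = pickle.loads(conn.recv(4096))  # receive-and-apply (a no-op here)
    conn.sendall(b"ACK")                 # the one extra line: acknowledge
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))               # OS-assigned ephemeral port
srv.listen(1)
t = threading.Thread(target=replica_once, args=(srv,))
t.start()

primary = socket.socket()
primary.connect(("127.0.0.1", srv.getsockname()[1]))
primary.sendall(pickle.dumps({"lsn": 7, "op": "set", "key": "k", "value": "v"}))
reply = primary.recv(64)                 # this recv IS the commit-path wait
t.join()
assert reply == b"ACK"
```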
Why the timeout matters and why Postgres does not have one: in the real thing, the primary waits forever (or until the replica reconnects) for the ack. A timeout-then-fail policy creates a new ambiguity — the client thinks the commit failed and retries, but the WAL record is already in the primary's WAL and may have reached the replica, so the retry can apply the transaction twice. The canonical answer is "block forever, fix the replica, let the client sit." That stall-forever behaviour is exactly the single-sync-replica trap described below.
Availability gotcha — the single-sync-replica trap
The most surprising property of sync replication, and the one that fells every team deploying it for the first time, is that a single-sync-replica configuration trades your availability for your durability. Naively:
synchronous_standby_names = 'replica1'
means: wait for replica1 to ack. If replica1 is down, no ack comes. The primary does not commit. Every client COMMIT hangs. The application appears to have a dead database, even though the primary itself is perfectly healthy.
This is not a bug. It is consistent with what you asked for. You said "never commit without replica1's ack." Replica1 is not acking. The primary is obeying. Any implementation that "helps you out" by silently committing without the ack has thrown away the durability guarantee — precisely what you installed sync replication to obtain. Postgres's authors chose correctness over availability here, and it is the right choice given the spec.
The correct configuration is to name a pool:
synchronous_standby_names = 'FIRST 1 (replica1, replica2)'
Now the primary waits for the first connected replica in the list. If replica1 is down but replica2 is up, replica2 is the sync partner. The primary keeps committing. Only if both are down do writes stall — and if both of your sync partners are simultaneously unreachable, stalling is arguably the right answer.
This is the first place sync-replication deployments fall over. A team configures one sync replica in testing, ships it to production, the replica has a disk event at 3 AM, and all writes stall. The fix is trivial once you know; the failure is brutal if you do not.
Sync is about failure semantics, not steady-state throughput
In steady state — all replicas up, network healthy — sync replication looks like "async plus a few milliseconds per commit." It is easy to evaluate it against async on that steady-state curve and conclude "not worth it; we will save 5 ms per commit and deal with rare data loss."
That framing misses the point. Sync replication does not exist to make steady state better. It exists to make the failure mode better. The steady-state cost is the tax you pay so the crash-and-recover scenario has a different outcome.
Async, primary crash at time T: RPO = current replica lag. Typical 10-800 ms. At 10,000 TPS with 200 ms lag, 2,000 transactions evaporate post-failover.
Sync, primary crash at time T: RPO = 0, as long as a sync replica was connected. In-flight commits (primary hadn't returned OK yet) return error and the client retries; no committed transaction is lost.
The right question is not "is 5 ms/commit worth it?" but "what is the business cost of losing N transactions in a rare failure?" Social feed: near zero — the user will post again. Payments ledger: catastrophic — a reversed payment is a real-money loss and a regulatory incident.
Why "failure semantics" is the right lens and not "throughput": every replication mode is "fine" in steady state. All of them commit, all of them ship, all of them replay. They diverge only when something breaks. Picking a replication mode by benchmarking steady-state TPS is the equivalent of picking car insurance by benchmarking top speed.
Batch-friendly workloads — group commit
The latency-tax story so far has assumed a serial committer: one transaction waits for the previous to finish. Real OLTP workloads are almost never serial. An app server pool of 200 connections commits 200 transactions roughly concurrently.
Postgres exploits this with group commit. Between the primary's WAL flush and the replica ack wait, the primary batches as many pending transactions as are ready: their WAL records go into one network packet, one ack covers them all, and every transaction in the group gets notified when the single ack arrives. From the per-transaction view, each commit still costs one RTT — but the RTT is amortised across the whole batch, which can include hundreds of commits.
A sync primary with 5 ms RTT and 200 concurrent committers runs close to its async throughput. Each committer waits ~5 ms for its batch's ack, but 200 committers wait on the same ack, so the primary processes 200 commits every ~5 ms: ~40,000 TPS. The same workload serial caps at 200 TPS.
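The amortisation arithmetic, as a sketch:

```python
def grouped_sync_tps(rtt_ms: float, concurrent_committers: int) -> float:
    """One ack covers the whole batch, so throughput scales with concurrency:
    the primary processes `concurrent_committers` commits per RTT."""
    return concurrent_committers * 1000.0 / rtt_ms

assert grouped_sync_tps(5.0, 1) == 200.0      # serial: the hard ceiling
assert grouped_sync_tps(5.0, 200) == 40000.0  # 200 committers share each ack
```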
"Sync replication kills throughput" is a half-truth. It kills serial throughput. For concurrent workloads with many in-flight transactions, the loss is often under 20%. The latency to the individual committer is still ~5 ms higher, but that is usually acceptable.
Sync replication across regions — the hard case
Every analysis so far has been intra-region. Cross-region sync is a different beast.
Characteristic RTTs from Mumbai:
- Mumbai ↔ Singapore: ~40 ms.
- Mumbai ↔ Frankfurt: ~120 ms.
- Mumbai ↔ us-east-1 (N. Virginia): ~185 ms.
A sync replica in Frankfurt means every commit costs 120+ ms on the wire alone. Group commit helps throughput but cannot help the individual commit's latency. For a human-facing API, 120 ms added to every write is noticeable. For a payment confirmation, it can push over the "user retries because they thought it timed out" threshold.
Worse, cross-region RTT is unstable. Submarine cable congestion can push Mumbai-Frankfurt to 200+ ms for hours. Your p99 commit latency becomes a function of ocean traffic.
The real-world solution is layered replication:
- Sync replica in the same region (another AZ, ~2-5 ms RTT). Zero RPO under single-AZ failure. Modest latency tax (~5 ms).
- Async replica in a different region (cross-region, best effort). Non-zero RPO under region-wide failure, but the same-region sync replica has already paid the zero-loss cost for the common case.
Why this two-tier design is the norm for serious OLTP: the most likely failure you will face is not a continent sinking, it is a single AZ losing power for 45 minutes. You want the sync guarantee to cover that failure, because it is frequent. For the much rarer region-wide failure, you accept some data loss in exchange for not paying 120 ms on every commit for the lifetime of the system. This is what AWS Aurora and every large Postgres deployment I have seen do, and it maps directly to two rungs on the ladder ch.67 → ch.68 → ch.69 lays out.
A three-node fintech deployment: numbers end to end
A payments service runs a Postgres primary in ap-south-1a (Mumbai AZ-1) with the following replicas:
- Replica A in ap-south-1b (Mumbai AZ-2). RTT 2 ms. Same-region NVMe. Configured as sync.
- Replica B in ap-southeast-1a (Singapore). RTT 40 ms. Configured as async.
Postgres config:
synchronous_commit = on
synchronous_standby_names = 'FIRST 1 (replicaA, replicaB)'
Listing both with FIRST 1 means A is the preferred sync partner; if A is gone, B takes over. The primary will not stall on A's failure.
Measured commit latency (p50): ~5 ms. Breakdown: 1 ms primary fsync, 1 ms send to A, 1 ms A fsync, 1 ms ack back, 1 ms overhead.
RPO under AZ-1 (primary) failure: zero. Replica A in AZ-2 has every WAL record the primary ever acknowledged; promote A, lose nothing.
RPO under full Mumbai region failure (both AZs gone): equal to the lag to Singapore, typically ~40-200 ms. Promote B. At 1000 TPS that is 40-200 lost transactions — non-zero, but the frequency of such a failure (perhaps once every 5-10 years) and its magnitude (~0.2 seconds of transactions) is a trade management has signed off on.
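The RPO arithmetic for the two failure scenarios, as a sketch:

```python
def lost_transactions(tps: float, replica_lag_ms: float) -> float:
    """Commits the primary acknowledged but never shipped to the survivor."""
    return tps * replica_lag_ms / 1000.0

# AZ-1 fails: promote sync Replica A, which has zero lag by construction.
assert lost_transactions(1000, 0) == 0

# Whole region fails: promote async Replica B at 40-200 ms behind.
assert lost_transactions(1000, 40) == 40.0
assert lost_transactions(1000, 200) == 200.0
```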
What if Replica A dies? FIRST 1 (replicaA, replicaB) falls back to B. Commit latency jumps from 5 ms to 45 ms — slow, alerts fire, SRE pages — but writes do not stop. Fix A; the primary preferentially returns to using A once A reconnects.
What if both A and B die? Writes stall. This is the correct behaviour: you asked for zero RPO, and with no sync partner the guarantee cannot be upheld. The application must be coded to handle commit timeouts gracefully.
Common confusions
- "Sync replication is always safer than async." Safer for data, not for availability. A misconfigured single-sync-replica setup stalls all writes on replica failure; an async setup would keep accepting them. Synchronous ≠ highly available. You trade one hazard for another and must pick which matters more.
- "Sync gives you RPO = 0 always." Only if a sync replica is actually reachable at the moment of primary failure. Some systems silently degrade to async when the sync partner disappears — MySQL's semi-sync falls back after a timeout, and an operator can produce the same downgrade in Postgres by editing synchronous_standby_names under pressure. After such a downgrade you are back to async RPO. Alert on sync replica disconnection immediately; it is a silent durability downgrade.
- "Sync doubles your write latency." It adds one RTT plus the replica's fsync to the existing primary-local commit time. If the primary commit was 1 ms and the sync cost is 4 ms, the new latency is 5 ms — 5× the baseline, not 2×. If the primary commit was 10 ms on slow disks and the sync cost is 2 ms, the new latency is 12 ms — 1.2×. The ratio depends entirely on baselines.
- "The primary waits for all replicas." No. It waits for the number and identity specified in synchronous_standby_names. 'FIRST 1 (a, b, c)' waits for one. 'ANY 2 (a, b, c)' waits for two. Other replicas can be async and never enter the commit path.
- "Sync commit means the replica has applied the change." The default synchronous_commit = on means the replica has fsync'd the WAL record. It has not necessarily replayed that record into the data files, so a SELECT on the replica may not yet see the new row. For apply-level sync you need remote_apply, which is slower because the apply loop lags the write loop.
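The flush-versus-apply distinction in the last confusion can be modelled with two LSN pointers. The names below are illustrative, not Postgres internals: a record is durable once the flush pointer reaches it, but visible to replica reads only once the apply pointer does.

```python
class ToyReplica:
    def __init__(self):
        self.flush_lsn = 0   # highest WAL record fsync'd to disk
        self.apply_lsn = 0   # highest WAL record replayed into data files

    def satisfies(self, level: str, commit_lsn: int) -> bool:
        if level == "on":            # durable on the replica's disk
            return self.flush_lsn >= commit_lsn
        if level == "remote_apply":  # visible to SELECTs on the replica
            return self.apply_lsn >= commit_lsn
        raise ValueError(level)

r = ToyReplica()
r.flush_lsn = 42          # record 42 fsync'd; the apply loop is behind
assert r.satisfies("on", 42)                 # 'on' would ack this commit
assert not r.satisfies("remote_apply", 42)   # but a replica SELECT may miss it
r.apply_lsn = 42                             # apply loop catches up
assert r.satisfies("remote_apply", 42)
```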
Going deeper
Group commit tuning in Postgres
commit_delay (microseconds) and commit_siblings (minimum concurrent transactions) control how aggressively the primary batches commits. The defaults (commit_delay = 0) prioritise individual latency; tuning commit_delay up to 100-1000 μs on a high-concurrency sync deployment can multiply throughput by 2-5× at the cost of a sub-millisecond latency increase per commit. The right value depends on your median concurrent-transaction count — benchmark, don't guess.
Two-safe and three-safe (DB2 terminology)
IBM DB2 uses "two-safe" (primary + one sync replica) and "three-safe" (primary + two sync replicas). Postgres's synchronous_standby_names = 'FIRST 2 (a, b)' is three-safe. Two-safe tolerates single-node failure without data loss; three-safe tolerates two simultaneous failures. Most fintech systems run two-safe with a third async replica for DR; banks with regulatory constraints run three-safe.
Semi-sync as the middle ground
Ch.69 covers semi-sync and quorum acks in depth. The essential idea: require a replica to receive the WAL (into TCP buffer or OS buffer cache) but not to fsync it. MySQL's lossless semi-sync and Postgres's remote_write are variants. The durability guarantee weakens slightly (a replica crash right after receive loses the record) but the latency drops by the fsync time (often 1-5 ms). For many workloads this is the sweet spot.
Where this leads next
You now have the full picture of synchronous replication: the protocol edit (one wait inserted into commit), the latency tax (one RTT + one fsync, quantified by distance), the Postgres configuration (synchronous_commit + synchronous_standby_names), the availability gotcha (single-sync-replica stalls writes; use a pool with FIRST 1), the failure-semantics framing (sync is about what happens when things break, not throughput), group commit as the throughput recovery mechanism, and the cross-region hard case with layered sync+async.
- Semi-sync and quorum acks — chapter 69. The middle ground between async's zero latency and sync's full RTT. The primary waits for K of N replicas to ack, giving a tunable point on the durability-latency-availability triangle. This is how most modern HA databases actually run in production.
- Raft and Paxos (ch.70+). Distributed consensus generalises quorum acks from "replicas of a primary" to "peers in a consensus group." Every committed write is acknowledged by a majority; primary election itself is a consensus decision. Spanner, CockroachDB, TiDB, and etcd all live here.
Sync replication is the strongest durability primitive a single-primary database offers. Its cost is non-trivial but predictable. Its failure modes are surprising but knowable. The choice between async, sync, and quorum is a choice about which risks you can tolerate and which latencies you can afford — and now you have the vocabulary and numbers to make the trade-off deliberately instead of by default.
References
- PostgreSQL documentation, Synchronous Replication — the canonical reference for synchronous_commit, synchronous_standby_names, the FIRST/ANY grammar, and the availability implications of each configuration.
- MySQL documentation, Semisynchronous Replication — MySQL's take on sync replication, including the "lossless" mode introduced in 5.7 that closes a small phantom-read window the older semi-sync exposed.
- Kleppmann, Designing Data-Intensive Applications, O'Reilly 2017, chapter 5 — the clearest book-length framing of sync vs async vs quorum replication, including the failure-mode-centric analysis this chapter leans on.
- Kingsbury, Jepsen: PostgreSQL 12.3 — Aphyr's Jepsen tests covering sync replication under network partitions, with documented behaviour when the sync partner is unreachable and the primary's stall-forever semantics.
- Verbitski et al., Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases, SIGMOD 2017 — Aurora's storage layer decomposes sync replication into per-segment quorums across 6 copies, which is one answer to "how do you get sync's guarantee without paying its full RTT on every commit."
- Corbett et al., Spanner: Google's Globally-Distributed Database, OSDI 2012 — Spanner replaces single-leader sync replication with Paxos-per-shard, making every committed write a quorum decision across globally-distributed replicas. The conceptual successor to the ideas in this chapter.