Semi-Sync Replication and Quorum Acknowledgements

Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

In short

Async replication (ch.67) loses whatever is in the replica-lag window when the primary dies. Strict synchronous replication (ch.68) fixes that by waiting for one specific replica's ack on every commit — which means that replica's health is now your primary's health. This chapter covers the two standard middle grounds.

Semi-sync replication is MySQL's answer. The primary sends the WAL record to the replica, starts its local fsync in parallel, and returns COMMIT OK as soon as either the local fsync completes or the replica acknowledges receipt — whichever arrives first. When the replica is healthy, the ack usually arrives before the local fsync, so semi-sync behaves like strict sync with zero data loss. When the replica disconnects, the primary times out and falls back to local-only commits, logging that it has degraded. You get sync-grade safety most of the time and async-grade availability during outages — at the cost of a small data-loss window when the primary and its one sync replica die together within the ack window.

Quorum acks generalise the idea. Instead of one sync replica, run N and commit after any Q ack. With N=3 and Q=2, one replica can fail without affecting commit latency — the fastest two of three ack quickly, and the slow or dead one is never on the critical path. General rule: to tolerate F failures, run 2F+1 replicas and require F+1 acks.

Raft, Paxos, and their descendants — CockroachDB, etcd, Spanner, TiKV — use quorum acks natively. Every write flows through a majority commit. Semi-sync is the tactical patch on a leader-follower system; quorum acks are the strategic architecture of distributed databases.

Strict sync gives you zero RPO by binding your primary's liveness to one replica's. Async gives you primary liveness with a data-loss window on failover. Both trade-offs are endpoints of a design space; production systems live in the middle.

The middle is occupied by two ideas. Semi-sync relaxes strict sync just enough that a dead or slow replica no longer halts the primary. Quorum acks generalise "one sync replica" to "any majority of N replicas" — individual failures are absorbed by the ensemble without anyone noticing. The rule: to tolerate F failures, run 2F+1 replicas and wait for F+1 acks.

Semi-sync is the simpler of the two; its failure modes are the ones to internalise before the quorum version makes sense. Quorum acks are the bridge from "primary-replica" thinking to the consensus-based distributed databases that dominate the modern cloud stack.

Semi-sync — MySQL's design

Semi-synchronous replication was added to MySQL in 5.5 as a plugin, made native in later versions, and has been the template everyone copies since. The primary-side flow for a single commit:

Transaction reaches the commit boundary. Primary assembles the commit WAL record (binlog event, in MySQL terminology).
Primary ships the record to the registered semi-sync replica over the normal replication stream — this kicks off a network round-trip in the background.
Primary starts its own local fsync of the WAL record to disk — this kicks off a disk round-trip.
Primary waits for whichever of (replica ack) or (local fsync) completes first.
On completion, primary returns COMMIT OK to the client.

Why "whichever first" is the correct condition: the commit is durable as soon as either the replica has the record in memory (it will replay it even if the primary dies) or the primary has the record on disk (the primary will find it on restart). You do not need both conditions for durability; you need either one. Waiting for the later of the two would be strict sync and would pay the maximum of the two latencies on every commit.

In practice, local fsync takes 1-5 ms on NVMe, 5-20 ms on ordinary cloud SSDs; intra-region network RTT is 0.5-2 ms. The replica ack almost always arrives before the local fsync returns, so semi-sync in the steady state behaves exactly like strict sync: the replica has the record before the client gets its OK.

The difference appears when the replica is absent. If the replica is disconnected, the "wait for replica ack" branch never completes. Real implementations add a timeout (MySQL default 10 seconds, production setups often tighten to hundreds of milliseconds); on timeout the primary gives up on the ack, logs that it has fallen back to async, and commits based on the local fsync alone. Subsequent commits stay async until a replica reconnects.

This is the operational property: a dead replica cannot take your primary down. You lose the zero-RPO guarantee during the fallback window — back to normal async exposure — but the primary keeps serving.

Semi-sync trade-offs

vs async. Async's data-loss window is always current replica lag (10-100 ms of WAL typically). Semi-sync's is usually zero — the replica has the record when the client gets OK — except during the fallback window after a replica disappears. Expected-case RPO drops from tens of milliseconds to zero at essentially no cost in commit latency.

vs strict sync. Strict sync waits for the ack unconditionally: slow replica means slow commits, dead replica means halted commits. Semi-sync de-couples primary availability from any single replica — forward progress is guaranteed once the local fsync returns. You buy back availability that strict sync surrendered.

The failure window. Semi-sync is not strict sync. Consider: primary ships the record, replica receives it and is about to ack, primary crashes before the ack arrives, and simultaneously the replica's network to the failover orchestrator also fails. The orchestrator promotes a different replica that never got the record. Data lost. The probability is low (two correlated failures in a millisecond window) but not zero. Semi-sync is "almost always zero RPO"; strict sync is zero RPO full stop. For most workloads, almost-always is the right point; quorum acks close the remaining gap.

Implementation sketch in Python

A minimal semi-sync primary using threading events to express the "whichever first" logic:

# replication/semisync.py
import threading, time
from dataclasses import dataclass

@dataclass
class WALRecord:
    lsn: int
    payload: bytes

class SemiSyncPrimary:
    """Commits as soon as EITHER local fsync completes OR replica acks.
    Falls back to local-only after timeout if replica is unresponsive."""

    def __init__(self, replica, fallback_timeout=0.5):
        self.wal = []
        self.replica = replica
        self.fallback_timeout = fallback_timeout
        self._next_lsn = 1
        self._degraded = False

    def commit(self, payload: bytes) -> int:
        rec = WALRecord(self._next_lsn, payload)
        self._next_lsn += 1

        ack_event = threading.Event()
        # ship to replica — background; signals ack_event on receipt
        self.replica.send_async(rec, on_ack=ack_event.set)

        # local fsync runs concurrently with the network round-trip
        fsync_done = threading.Event()
        threading.Thread(
            target=lambda: (self._fsync_locally(rec), fsync_done.set()),
            daemon=True,
        ).start()

        # wait for EITHER condition, with a cap
        got = self._wait_for_either(ack_event, fsync_done,
                                    timeout=self.fallback_timeout)
        if got is None:
            self._degraded = True  # neither returned — very bad
            raise RuntimeError("both local fsync and replica ack timed out")

        self._degraded = not ack_event.is_set()
        return rec.lsn

    def _wait_for_either(self, a, b, timeout):
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if a.is_set() or b.is_set():
                return True
            time.sleep(0.0005)
        return None

    def _fsync_locally(self, rec):
        self.wal.append(rec)
        # real code: os.fsync(wal_fd)

The critical piece is _wait_for_either: it returns as soon as either the ack or the fsync completes. Why I return _degraded = not ack_event.is_set(): if the fsync returned first and the ack never arrived, the commit is still durable (local disk has it), but you have effectively fallen back to async for this commit — the replica did not confirm receipt. Production implementations expose this as a counter so monitoring can alarm on sustained degradation.

Under 40 lines of interesting logic. Real MySQL code adds timeouts, reconnection, a "rejoin as semi-sync" handshake when the replica returns, and careful accounting for the LSN ranges that shipped during the fallback window.

Quorum acknowledgements — the majority-rule generalisation

Semi-sync has one sync replica. If that replica is down, you are async. If it is slow, its slowness leaks into commit latency somewhat. Both issues disappear if, instead of one specific sync replica, you have N replicas and accept an ack from any majority of them.

Define quorum as Q = (N / 2) + 1 (integer arithmetic — simple majority). The primary waits for the Q-th fastest replica's ack before committing. Examples:

N=3, Q=2. Primary needs any 2 of 3 acks. One replica can be dead, slow, or partitioned; commits proceed at the speed of the fastest 2 of 3. This is the canonical HA setup.
N=5, Q=3. Any 3 of 5 acks. Tolerates 2 concurrent replica failures. Commit latency is the median-ish RTT of 5 replicas — slower than N=3 under normal conditions, much more robust to correlated failures.
N=7, Q=4. Any 4 of 7. Tolerates 3 failures. Uncommon in practice because the cost of maintaining 7 copies rarely justifies the third failure of tolerance.

The general rule: N = 2F + 1 replicas tolerates F failures, both for durability (no write is acknowledged without surviving F crashes) and for consensus (the surviving majority can always make progress). F=1 gives N=3, F=2 gives N=5, F=3 gives N=7. Even N is strictly worse than odd N-1 at the same F — you pay for the extra replica without gaining any fault tolerance — which is why production deployments are almost always odd.

Raft and Paxos — distributed consensus with quorum acks

"Wait for a majority" is an old idea; making it work correctly under partitions, leader crashes, and reordering is the subject of the distributed-consensus literature. The foundational algorithms are Paxos (Lamport 1989, published 1998) and Raft (Ongaro and Ousterhout 2014). Both implement a replicated log that every surviving node agrees on, even as individual nodes crash and recover. A write:

One replica is the leader (elected by majority vote after the previous leader dies). All writes go through it.
The leader appends the record to its own log and broadcasts to followers.
The leader waits for a majority of acks (including its own disk-write as one vote).
Once the majority has it on disk, the leader marks the record committed and returns success.
Remaining followers catch up asynchronously.

Raft adds machinery around this core to make leader election safe, to ensure new leaders already have all committed records, and to handle partitions, out-of-order messages, and returning-from-dead leaders. The core is still "quorum ack on every write".

Production systems built on these algorithms: etcd (Kubernetes control plane), CockroachDB and TiKV (per-range Raft groups), Spanner (Paxos per tablet, typically 5-way across continents), ZooKeeper (ZAB, a Paxos variant). Semi-sync is a bolt-on to a leader-follower database; these systems are built around quorum acks.

Quorum latency — the "slowest of Q" effect

Commit latency under quorum acks is the Q-th fastest ack time, not the fastest or the average. For N=3 and Q=2 with i.i.d. RTTs, that is the median of 3 draws — only slightly worse than the fastest.

The useful property: commit latency is bounded above by the second-fastest replica, not the slowest. A replica 10× slower than the others — distant region, GC pause, contended disk — never touches the critical path once the faster 2 have acked. Slow replicas are invisible to commit latency as long as Q fast ones exist.

For N=5 and Q=3: a cluster with three intra-region replicas and two cross-region replicas commits at intra-region speed; the cross-region acks arrive long after quorum has formed. The cross-region replicas exist for DR and do not slow writes.

The corner case: if quorum necessarily crosses a slow link, commit latency includes that link. N=3 spread across three distant regions (Mumbai, Singapore, Sydney) has Q=2, and any 2 of 3 includes an inter-region hop. Global databases like Spanner run N=5 so a quorum of 3 can form within a nearby region cluster, or they carefully pick 3 regions with matched RTT.

A 3-replica quorum commit. The leader in AZ-a appends a record and sends it to both followers concurrently. AZ-b is one hop away at 1.5 ms RTT, AZ-c at 2 ms. AZ-b's ack arrives first, forming a quorum of 2 (leader + AZ-b), and the leader returns COMMIT OK immediately. AZ-c's ack arrives slightly later and is simply recorded without changing the commit latency. If AZ-c were instead on the far side of a 50 ms link, commit latency would still be 1.5 ms — the slow replica is not on the critical path.

Read consistency levels in quorum systems

Quorum acks give durable writes with majority safety. Reads are a separate question:

Eventual reads. Read from any replica. Fast, possibly stale. Fine for "total count of likes".
Linearisable reads. Must reflect the most recent committed write. Two ways: route through the leader (which must first confirm it is still the leader, otherwise a deposed leader could serve stale data); or perform a quorum read, contacting a majority and taking the latest value. Both cost a round-trip.
Snapshot reads. Consistent point-in-time across replicas. Between the two in cost; usually served by a read timestamp chosen recent-but-safe.

A practical note: default reads in Raft systems are usually linearisable via the leader. Adding replicas does not scale read throughput unless you opt into stale reads. CockroachDB offers AS OF SYSTEM TIME for snapshot reads servable by any replica; etcd's linearised reads always hit the leader.

Python sketch — quorum commit

A minimal quorum-commit primary using a counter and a condition variable:

# replication/quorum.py
import threading, time
from dataclasses import dataclass

@dataclass
class WALRecord:
    lsn: int
    payload: bytes

class QuorumPrimary:
    """Commits once Q of N replicas ack. Tolerates N-Q simultaneous failures."""

    def __init__(self, replicas, timeout=1.0):
        self.replicas = replicas
        self.quorum = len(replicas) // 2 + 1
        self.timeout = timeout
        self.wal = []
        self._next_lsn = 1

    def commit(self, payload: bytes) -> int:
        rec = WALRecord(self._next_lsn, payload)
        self._next_lsn += 1

        # leader's own append counts as one vote
        self.wal.append(rec)
        acks = [1]                     # list for mutable closure
        acks_lock = threading.Lock()
        quorum_event = threading.Event()

        def on_ack(_rep):
            with acks_lock:
                acks[0] += 1
                if acks[0] >= self.quorum:
                    quorum_event.set()

        for r in self.replicas:
            r.send_async(rec, on_ack=lambda rep=r: on_ack(rep))

        if not quorum_event.wait(self.timeout):
            raise TimeoutError(
                f"quorum={self.quorum} not reached in {self.timeout}s "
                f"(got {acks[0]} of {len(self.replicas) + 1} votes)"
            )
        return rec.lsn

The structure is: a shared counter, a lock, a condition variable. Why the leader counts as a vote: in Raft the leader has already persisted the record to its own log by the time it sends to followers. Its own disk-write is one acknowledgement toward the quorum. If leader itself has a slow disk, its vote is also the slow one — but at least you are not double-counting or waiting unnecessarily.

What this sketch elides: leader election, term numbers that handle stale leaders, the pre-vote optimisation, log compaction, snapshot transfer for followers so far behind that replaying the log from the start is infeasible. A full Raft implementation runs 2000-5000 lines. The commit path is genuinely just "wait for Q acks".

A three-AZ CockroachDB-style cluster

You run a Raft-based distributed database across three AZs in ap-south-1. Replica A (leader) in AZ-a, Replica B in AZ-b, Replica C in AZ-c. Quorum is Q=2 of N=3. Intra-AZ RTTs: A↔B 1.2 ms, A↔C 1.8 ms, B↔C 1.5 ms.

Normal operation. A appends to its log (1 ms fsync) and dispatches to B and C concurrently. B's ack comes at ~2.2 ms, C's at ~2.8 ms. Leader reaches quorum (self + B) at 2.2 ms and commits. C's ack arrives slightly later and is recorded.

Scenario 1: AZ-c goes dark. A dispatches to B and C. B acks at 2.2 ms; C never responds. Leader reaches quorum (self + B) at 2.2 ms regardless. Commit latency is unchanged — the failure is invisible to clients.

Scenario 2: AZ-b AND AZ-c both go dark. A is alone, with only 1 of 3 votes. Writes time out and return errors. Writes halt — correctly: no other replica has seen the write, so acknowledging it would risk data loss if A also dies.

Scenario 3: The leader (AZ-a) dies. Raft's leader election triggers. B and C exchange votes; one becomes leader. Total time: election timeout (150-300 ms) plus a few RTTs. Writes unavailable during that window, then resume with the new leader and a 2-of-2 quorum until A rejoins.

Scenario 4: AZ-c is slow but not dead. C's fsyncs take 50 ms due to GC. B acks at 2.2 ms, C at 52 ms. Quorum forms at 2.2 ms via A+B. C's slowness never touches commit latency. This is the property that makes quorum acks operationally forgiving in a way semi-sync is not — in a semi-sync setup with B as the sync replica, a slow B would slow every commit until timeout-driven fallback.

Common confusions

"Semi-sync is always safe." Not quite. Expected-case RPO is zero, but a narrow window exists where the primary and the one sync replica die together before other nodes learn about the committed transaction. Small window, low probability, not zero. Quorum acks with N≥3 close it at the cost of extra replicas.
"Quorum means all replicas have the data." No. Only a majority at commit time. The remaining replicas catch up asynchronously and may never catch up if permanently dead. As long as the majority survives, the data is durable.
"Linearisable reads are free in Raft." False. Default reads go through the leader, which must first confirm its leadership is current (a heartbeat to a quorum) before serving. That is a round-trip. Stale-read optimisations exist but must be opted in per query.
"N=2 with Q=2 is a valid quorum." Technically yes, practically strict-sync in disguise: any single failure stalls commits. Always use odd N.
"Quorum replication and consensus are different." The same idea from two vocabularies. A correct implementation of one is a correct implementation of the other; Paxos, Raft, ZAB, Viewstamped Replication all deliver both.
"Quorum reads always return the latest value." Only if you read-quorum correctly — read from Q replicas and take the highest version. Reading from a leader that has silently been deposed (it has not yet noticed) gives stale data.

Going deeper

Raft in detail

Raft splits the replication problem into three sub-problems: leader election, log replication, and safety. Each server is a follower, a candidate, or a leader. Followers expect heartbeats; if they stop arriving, a follower becomes a candidate, increments its term, and requests votes. Servers grant at most one vote per term and only to candidates whose log is at least as up-to-date as their own (the log-completeness check). A majority of votes wins.

The leader then sends AppendEntries RPCs; when a majority has appended, the leader commits. The core safety invariant is: if a log entry is committed in a given term, it is present in every future leader's log. The log-completeness check is what guarantees this — a new leader always already has every committed entry.

Ongaro and Ousterhout's 2014 paper is genuinely readable. The core algorithm fits on two pages of pseudocode; etcd, CockroachDB, TiKV, and Hashicorp's Raft library are all close readings of it.

Multi-Paxos and its variants

Paxos predates Raft by two decades. Basic Paxos solves consensus on a single value in two round-trips; Multi-Paxos chains instances for a sequence of values and lets an established leader skip the prepare phase.

The pain point historically was that Paxos papers described the protocol in terms of safety properties rather than operational rules, and practitioners had to invent the engineering — leader leases, log compaction, reconfiguration, failure detection — per implementation. Raft was explicitly designed to close that gap. Multi-Paxos still runs Chubby and Spanner; it works, and rewriting them was not justified.

Flexible quorums

Heidi Howard and colleagues' 2016 Flexible Paxos showed that you only need the election and replication quorums to intersect, not to both be majorities. That gives a degree of freedom: small write quorums (fast commits) paired with large read/election quorums. For a read-heavy workload with rare elections, N=5 with write quorum 2 and election quorum 4 is safe and noticeably faster than the symmetric design.

Spanner's TrueTime

Spanner avoids quorum reads by leaning on TrueTime, a global clock API that returns an interval [earliest, latest] guaranteed to contain the current absolute time (width ~7 ms, backed by atomic clocks and GPS in every datacenter). Commits wait out the uncertainty before releasing locks, which lets any follower serve linearisable reads by choosing a read timestamp far enough in the past to be safe. The approach needs special hardware; most organisations use quorum reads or leader reads instead.

Where this leads next

Async, strict sync, semi-sync, and quorum acks form a ladder from lowest to highest safety, each with a characteristic latency and availability profile. Build 9 continues with:

Logical replication — decoding the WAL into row-level changes shipped to heterogeneous subscribers (other databases, Kafka, analytics systems). The replication semantics (async, sync, quorum) still apply; what changes is the representation.
Distributed transactions — when a single transaction spans multiple Raft groups, quorum acks alone are not enough. You need a commit protocol across groups — two-phase commit, Percolator-style timestamped commits, or Spanner's TrueTime-augmented 2PC.
Consensus in the wild — the operational reality: leader flapping, split votes, reconfiguration, slow replicas, snapshot-install for followers that fell too far behind.

Every scalable strongly-consistent distributed database built after 2012 rests on quorum acks. Learn the pattern once here; you will recognise it in every production system you run.

References

Ongaro and Ousterhout, In Search of an Understandable Consensus Algorithm (Extended Version), USENIX ATC 2014 — the Raft paper. The single best starting point for understanding quorum-based consensus; the paper is deliberately accessible and the algorithm fits on two pages.
Lamport, The Part-Time Parliament, ACM TOCS 1998 — the original Paxos paper. Harder going than Raft but the canonical reference for Multi-Paxos and the theoretical foundations of consensus.
MySQL documentation, Section 19.4.10: Semisynchronous Replication — the canonical reference for production semi-sync configuration, including the rpl_semi_sync_master_timeout setting that controls fallback-to-async behaviour.
CockroachDB documentation, CockroachDB Architecture: Replication Layer — a production-oriented walk-through of per-range Raft groups, with concrete numbers on quorum sizes, lease management, and how the system behaves under AZ and region failures.
Howard, Malkhi, and Spiegelman, Flexible Paxos: Quorum Intersection Revisited, arXiv 2016 — the paper showing that read and write quorums can be sized independently as long as they intersect, with applications to latency-optimised deployments.
Corbett et al., Spanner: Querion's Globally-Distributed Database, OSDI 2012 — the Spanner paper introducing TrueTime, commit-wait, and Paxos-per-tablet as the foundation for externally-consistent global transactions.