In short

Build 9 has taught you every trick that makes a primary-replica database safer and faster. Async log shipping, synchronous replication, semi-sync, quorum acks, failover runbooks, point-in-time recovery, read-your-own-writes, causal consistency tokens, read scaling, replication slots, cascading replicas, delayed replicas. Every one of those techniques preserves a single invariant: there is one primary, and every write goes through it.

That invariant is also the ceiling. A well-tuned primary on modern hardware handles roughly 50,000 to 100,000 writes per second before it tips over. The bottleneck is not software quality — Postgres, MySQL, Oracle, and SQL Server all end up in the same 50-100k range on comparable hardware. The bottleneck is physics: one WAL writer, one fsync queue, one CPU topology, one NIC. Past that ceiling, adding replicas gives you exactly zero additional write throughput. Replicas only scale reads. A cluster of fifty replicas behind one primary has the same write capacity as one primary alone, just with fifty expensive machines reading it back.

This chapter is the honest closure of Build 9. You have reached the wall. Beating the wall requires abandoning the single-leader assumption, and that is what Build 10 is about. The paths forward are sharding (N primaries, each owning a slice of the keyspace), leaderless replication (Dynamo-style: every node accepts writes, quorums reconcile), and multi-primary with conflict resolution (last-write-wins, vector clocks, CRDTs). Each has different trade-offs on consistency, complexity, and cost. This chapter shows you where the wall is, why it is there, how to recognise you have hit it, and which door out of single-leader thinking fits which workload. Build 10 walks through each door.

The number is fifty thousand to one hundred thousand writes per second. Memorise it. That is the write ceiling of a single-primary OLTP database on a modern well-tuned server in 2026. Above that, no knob inside Build 9 saves you. More replicas do not help. Faster disks give maybe another factor of two and you are out of road. Beefier instances hit the same wall slightly later. Every one of those tricks funnels every write through one machine's CPU, one WAL, one fsync queue, one NIC — every one of which is capacity-bounded.

If your peak is 3,000 TPS, you are far from the wall. If your peak is 30,000 TPS, careful tuning matters. If your peak is 300,000 TPS, no single-leader configuration will hold you — you must restructure the database, not re-configure it. That restructure is what Build 10 teaches. Before you open that door, understand precisely why the door on this side is locked.

Where the ceiling comes from

There is no single number for "maximum writes per second" on a primary — there are four numbers, and the system is capped at the smallest. Every capacity-planning conversation reduces to asking which binds first.

The WAL fsync. Every committed write produces a WAL record that must be forced to durable storage before the client sees OK. On modern NVMe with power-loss-protected cache, one fsync costs ~100 μs, so the naive ceiling is 10,000 fsyncs/s. Group commit batches many transactions into one fsync and pushes this to 50,000-100,000 commits/s on enterprise NVMe. Spinning disks are 10-100× worse; SATA SSDs without PLP give ~5,000.
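The fsync arithmetic can be sketched in a few lines. The figures below are the illustrative ones from this section (100 μs per fsync on PLP NVMe), not measurements:

```python
# Back-of-envelope model of the fsync ceiling. All numbers are the
# illustrative assumptions from the text, not benchmarks.

FSYNC_US = 100  # assumed cost of one fsync on power-loss-protected NVMe

def naive_ceiling() -> int:
    """One fsync per commit: commits/s = 1e6 / fsync cost in microseconds."""
    return int(1_000_000 / FSYNC_US)

def group_commit_ceiling(batch: int) -> int:
    """Group commit amortises one fsync over `batch` concurrent commits."""
    return int(batch * 1_000_000 / FSYNC_US)

print(naive_ceiling())           # 10000  -- the naive ceiling from the text
print(group_commit_ceiling(8))   # 80000  -- inside the 50-100k band
```

The batch size of 8 is arbitrary; the point is that group commit multiplies the naive ceiling by however many transactions share one flush.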

The WAL writer thread. Postgres writes WAL through one writer process. For extremely small transactions, per-record overhead (LSN assignment, locking, memcpy, flush scheduling) dominates and you see "wal_writer at 100% CPU" before the disk saturates. Roughly 500,000 trivial writes/s is the ceiling; real transactional writes cap earlier because each transaction produces multiple WAL records.

The CPU, for transaction processing. Parsing, planning, executing, locks, buffer pool updates, triggers, constraint checks — every write is CPU work on the primary. A 64-core server handles 50,000-200,000 transactions/s. An UPDATE t SET n = n + 1 WHERE id = ? is cheap; a transaction with three joins and five foreign key checks is twenty times more expensive.

The network out to replicas. A small WAL record is ~100 bytes; on a 1 Gbps NIC with per-record overhead you saturate at roughly 200,000 records/s. Multiple replicas multiply the bandwidth required. On 10 Gbps NICs this has headroom; on 1 Gbps it is the binding constraint.
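One way to reproduce the ~200,000 records/s figure, under stated assumptions. The ~56 bytes of per-record framing overhead and the four-replica fan-out are illustrative assumptions, not measured values:

```python
# Rough reproduction of the ~200k records/s network ceiling, assuming:
# 100-byte WAL records, ~56 bytes of per-record TCP/IP framing overhead,
# a 1 Gbps NIC, and 4 replicas each receiving a full copy of the stream.
GBPS_BYTES = 1_000_000_000 / 8    # 1 Gbps expressed in bytes/s
RECORD = 100                      # assumed WAL record size, bytes
OVERHEAD = 56                     # assumed per-record framing, bytes
REPLICAS = 4

records_per_s = GBPS_BYTES / ((RECORD + OVERHEAD) * REPLICAS)
print(f"{records_per_s:,.0f} records/s")   # ~200,000
```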

[Figure: Single-primary write ceilings by subsystem. Bars, left to right: WAL fsync ~100k/s (NVMe); WAL writer thread ~500k/s (trivial writes); CPU for transaction processing ~100k/s (real workload); network out to replicas ~200k/s (1 Gbps, 4 replicas). A dashed horizontal line at ~75k/s marks the effective primary ceiling, set by whichever bottleneck binds first.]
The four subsystems that cap write throughput on a single-primary database. The effective ceiling is the minimum of all four, typically 50,000-100,000 writes/s on modern hardware. WAL fsync and CPU transaction processing are the most common binding constraints for realistic OLTP workloads; the WAL writer thread and network bandwidth usually have headroom. Numbers are approximate and workload-dependent.

Why these numbers are remarkably stable across databases: Postgres, MySQL, and Oracle all end up in the same 50-100k writes/s zone on comparable hardware not because they are similar code, but because they are all constrained by the same four physical limits. The transaction log is serialised (one writer), the fsync is serialised (one disk commit), the CPU has fixed core count, the NIC has fixed bandwidth. Software quality decides where in the 50-100k range you land, not whether the range itself moves.

Which constraint binds first depends on workload shape. Small writes — INSERT INTO events VALUES (now(), user_id, event) — hit fsync first. Complex writes — multi-row updates with cascading foreign keys — hit CPU first. A cluster with many replicas and skinny records hits network first. There is no single ceiling; there are four, and your workload picks which one kills you.
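The min-of-four rule can be sketched directly, using the illustrative per-subsystem ceilings from this section:

```python
# The effective write ceiling is the minimum of the four subsystem
# ceilings; which one binds depends on workload shape. Figures are the
# illustrative numbers from the text.
def effective_ceiling(fsync, wal_writer, cpu, network):
    limits = {"fsync": fsync, "wal_writer": wal_writer,
              "cpu": cpu, "network": network}
    bottleneck = min(limits, key=limits.get)
    return bottleneck, limits[bottleneck]

# Skinny event inserts: fsync binds first.
print(effective_ceiling(fsync=100_000, wal_writer=500_000,
                        cpu=150_000, network=200_000))   # ('fsync', 100000)

# Complex multi-row transactions: CPU binds first.
print(effective_ceiling(fsync=100_000, wal_writer=500_000,
                        cpu=60_000, network=200_000))    # ('cpu', 60000)
```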

Why adding replicas doesn't help

Replicas scale reads only. They do not help write capacity at all.

Stare at the architecture. A write flows: client → primary → WAL → replicas. The primary does all the work of validating, logging, and executing the write. Replicas are downstream followers applying the primary's WAL records to their own state. Ten replicas apply the same records — but every record originated on the one primary.

Adding N replicas gives you roughly N× the read throughput, N candidate machines for failover, and exactly zero additional write throughput.

A team at 80,000 writes/s that responds to capacity pressure by adding four replicas has done nothing for their write bottleneck. Adding replicas is the correct action only if the observed bottleneck is read throughput.

Why this confuses people: in most distributed systems (web servers, stateless workers) throwing more machines at the problem is the correct answer. Single-leader databases violate this intuition because the leader is not a coordination point — it is the only member of the cluster that can accept writes at all. The architectural shape prohibits horizontal scaling of writes.
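The asymmetry can be stated as a toy capacity model. The throughput numbers are illustrative:

```python
# Toy capacity model: replicas multiply read capacity but leave write
# capacity untouched, because every write originates on the one primary.
def cluster_capacity(primary_writes_per_s, reads_per_node_per_s, n_replicas):
    return {
        "writes_per_s": primary_writes_per_s,                   # primary only
        "reads_per_s": reads_per_node_per_s * (1 + n_replicas)  # all nodes
    }

print(cluster_capacity(80_000, 120_000, 0))  # writes 80k, reads 120k
print(cluster_capacity(80_000, 120_000, 4))  # writes still 80k, reads 600k
```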

Symptoms you've hit the wall

Four symptoms appear as a primary approaches its ceiling, usually in this order.

Primary CPU sustained near 100% during peak hours. Not brief spikes — sustained 95%+ utilisation. top shows all cores saturated by postgres processes. Reads on replicas are fine; the primary alone is fighting for CPU.

Replica lag growing without bound. Async replicas that normally lag 10-100 ms now lag seconds or minutes. The primary generates WAL faster than the single-threaded apply loop on each replica can replay it.

Commit latency rising steadily. Writes that used to commit in 3 ms now commit in 20 ms, then 50 ms. The fsync queue is growing; sessions appear in pg_stat_activity with wait_event = 'WALWrite' or 'SyncRep'. p99 grows far faster than p50 because fsync queueing has long-tail effects.

Your SLA breaks. p99 write latency was 50 ms; now it sits at 200 ms. Pagers fire. No configuration change fixes it, because the hardware is saturated, not misconfigured.

When all four appear at once, you are at the wall. You cannot tune your way out.

What you CAN'T do from here

The exits that do not exist.

Add more replicas. Zero write improvement. Useful only if you are also read-bound.

Scale up the primary. More cores, bigger NVMe, faster NIC. You gain ~1.5-2× on CPU-bound workloads. But cloud primaries top out at a few hundred vCPUs, and even the biggest box still has one WAL writer and one kernel fsync queue. Doubling hardware rarely doubles throughput because the bottlenecks are serialised, not parallel.

Aggressive group commit. Push commit_delay up to 1000 μs, gain 2-3× on concurrent workloads, pay up to 1 ms extra latency per commit. A last mile of optimisation, not another order of magnitude.

Disable durability. synchronous_commit = off, fsync = off. Gain 3-5× throughput; lose the guarantee that committed data survives a primary crash. Acceptable for analytics ingestion, career-ending for a payments ledger. This changes what the database is, not just how fast it runs.

Combined, these exits buy maybe 5× over the stock primary. Past that, the architecture itself must change.

What you MUST do (ahead)

You have three architectural doors out of the single-leader box, plus one pressure-release valve for workloads that can trade durability. Build 10 walks through each in turn; here is the map.

Sharding (horizontal partitioning)

Split the data across N independent primaries, each responsible for a slice of the keyspace. User IDs 0 to 999,999 live on primary A; 1,000,000 to 1,999,999 on primary B. A write for user_id = 1,234,567 goes to B. Writes for different users land on different primaries and proceed in parallel.

Write capacity scales nearly linearly with N: four primaries handle roughly 4× the writes of one. The core cost is that transactions spanning shards are hard — a single-shard update is as fast as a stock primary, but a transaction touching two shards needs two-phase commit (slow, blocks on coordinator failure) or distributed consensus (Raft, Paxos). Vitess (MySQL), Citus (Postgres), MongoDB, CockroachDB, and TiDB differ primarily in how they handle cross-shard transactions, rebalancing, and routing. Your schema either shards cleanly (most user-keyed OLTP) or does not (global aggregates).
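The routing step in the range-sharding example above is only a few lines. The host names below are hypothetical:

```python
# Range-based shard routing, following the example in the text: user IDs
# 0-999,999 on primary A, 1,000,000-1,999,999 on primary B, and so on.
# Primary names are hypothetical placeholders.
SHARD_SIZE = 1_000_000
PRIMARIES = ["primary-a", "primary-b", "primary-c", "primary-d"]

def route(user_id: int) -> str:
    """Map a user_id to the primary that owns its keyspace slice."""
    return PRIMARIES[user_id // SHARD_SIZE]

print(route(1_234_567))   # primary-b, as in the text's example
print(route(0))           # primary-a
```

Real sharding layers typically hash the key instead of range-partitioning it, precisely so that hot ranges do not pile onto one primary; the routing idea is the same.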

Leaderless (Dynamo-style)

Abandon the primary concept entirely. Every node accepts writes. A write to key K is sent to all N replicas owning K (chosen by consistent hashing); the write succeeds if W of N replicas ack; reads query R of N and reconcile. If W + R > N you get quorum consistency. Write capacity scales with node count because no single machine coordinates any key. Concurrent conflicting writes are reconciled by vector clocks or last-write-wins, and read repair propagates the resolution.
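A minimal sketch of the quorum arithmetic and read-side reconciliation, using a plain version number as a stand-in for a vector clock:

```python
# W + R > N guarantees the read set intersects the write set, so at
# least one of the R responses carries the latest committed version.
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    return w + r > n

def reconcile(responses):
    """Pick the value with the highest version among the R responses.
    A plain integer version stands in for vector-clock comparison."""
    return max(responses, key=lambda vv: vv[0])[1]

print(quorum_overlaps(n=3, w=2, r=2))   # True: the classic quorum setting
print(quorum_overlaps(n=3, w=1, r=1))   # False: stale reads are possible
print(reconcile([(7, "new"), (6, "old")]))   # new
```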

The costs: consistency is eventual (or tunable quorum, but never single-writer linearisability); conflicts can be the application's problem; schema and query capabilities are narrower than an RDBMS. DynamoDB, Cassandra, Riak, and ScyllaDB live here.

Multi-primary with conflict resolution

Multiple nodes accept writes for the same data; conflicts are resolved after the fact via last-write-wins timestamps, vector clocks preserving siblings the app reconciles, or CRDTs whose operations commute so concurrent edits converge deterministically. Useful when each geographic region needs to accept local writes without a global-primary round trip (CouchDB, Postgres BDR, collaborative documents). The complexity cost is high; for most OLTP, sharding or leaderless is preferred.
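As a concrete taste of the CRDT option, here is a minimal grow-only counter: each node increments its own slot, merge takes the element-wise max, and concurrent increments converge regardless of merge order.

```python
# Minimal G-counter CRDT sketch. Increments commute, so two nodes that
# accept writes concurrently converge to the same value after gossip.
class GCounter:
    def __init__(self, n_nodes: int):
        self.counts = [0] * n_nodes      # one slot per node

    def increment(self, node: int, amount: int = 1):
        self.counts[node] += amount      # each node touches only its slot

    def merge(self, other: "GCounter"):
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self) -> int:
        return sum(self.counts)

a, b = GCounter(2), GCounter(2)
a.increment(0, 3)        # region A sees 3 local writes
b.increment(1, 5)        # region B sees 5, concurrently
a.merge(b); b.merge(a)   # gossip, in either order
print(a.value(), b.value())   # 8 8 -- both replicas converge
```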

In-memory with write-behind

Use an in-memory primary (Redis, or MemSQL, now SingleStore) that absorbs writes at rates unreachable by a durable RDBMS, and flush to durable storage asynchronously. Trades durability for throughput. Used for leaderboards, sessions, and metric ingestion.
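A minimal sketch of the write-behind pattern: writes are acknowledged from memory and flushed to durable storage in batches, so a crash loses the unflushed tail. The class and batch size are illustrative, not any particular product's API:

```python
# Write-behind sketch: acknowledge from memory, flush durably in batches.
# A crash between accept and flush loses the unflushed tail -- exactly
# the durability trade described in the text.
class WriteBehindBuffer:
    def __init__(self, flush_fn, batch_size=3):
        self.buffer = []
        self.flush_fn = flush_fn
        self.batch_size = batch_size

    def write(self, record):
        self.buffer.append(record)          # acknowledged immediately
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        self.flush_fn(self.buffer)          # one durable write per batch
        self.buffer = []

durable = []
wb = WriteBehindBuffer(durable.extend, batch_size=3)
for i in range(7):
    wb.write(i)
print(durable)     # [0, 1, 2, 3, 4, 5] -- record 6 is still only in memory
```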

The transition from Build 9 to Build 10

Build 9 asked: "How do you make a single primary as safe, fast, and available as possible?" Every chapter — async shipping, sync replication, quorum acks, PITR, failover, consistency levels — answered that question within the fixed constraint of one primary accepting all writes.

Build 10 asks: "How do you escape the one-primary constraint?" Every chapter drops the single-leader assumption and picks a different structure. The new questions are: how do you route writes? how do you handle conflicts? how do you maintain consistency when there is no single authoritative copy?

Why this is the right time for the transition: you have now seen every lever inside the single-leader paradigm, named its failure modes, and tuned it to an expert level. You are ready to understand what a system that does not assume a primary looks like — which you could not have earlier, because you would have conflated the primary-replica assumption with "how databases work" rather than treating it as one architectural choice among several.

Python demonstration — measuring the wall

Pick a workload that is sympathetic to a single primary (small writes, concurrent committers, NVMe storage) and scale the client fan-in. You will see throughput grow roughly linearly with client count, then plateau, then stay plateaued no matter how many more clients you add. That plateau is the wall.

Below is a synthetic benchmark in Python that fires trivial inserts against a local Postgres from increasing concurrency. Run it against a warm primary with max_connections = 500 and wal_level = replica.

# bench/measure_wall.py
import psycopg2, threading, time
from psycopg2.pool import ThreadedConnectionPool

DSN = "host=localhost dbname=bench user=bench"

def setup():
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS events")
    cur.execute("""CREATE TABLE events (
        id BIGSERIAL PRIMARY KEY,
        ts TIMESTAMPTZ DEFAULT now(),
        payload TEXT)""")
    conn.close()

def worker(pool, duration_s, counter):
    conn = pool.getconn()
    end = time.monotonic() + duration_s
    local = 0
    while time.monotonic() < end:
        with conn.cursor() as cur:
            cur.execute("INSERT INTO events (payload) VALUES ('x')")
        conn.commit()
        local += 1
    counter.append(local)
    pool.putconn(conn)

def run(n_clients, duration_s=20):
    pool = ThreadedConnectionPool(n_clients, n_clients, DSN)
    counter, threads = [], []
    t0 = time.monotonic()
    for _ in range(n_clients):
        t = threading.Thread(target=worker, args=(pool, duration_s, counter))
        t.start(); threads.append(t)
    for t in threads: t.join()
    elapsed = time.monotonic() - t0
    tps = sum(counter) / elapsed
    print(f"clients={n_clients:4d}  tps={tps:10.0f}")
    pool.closeall()

if __name__ == "__main__":
    setup()
    for n in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
        run(n)

Typical output on a recent 16-core workstation with NVMe (Postgres 16, synchronous_commit = on, no replication):

clients=   1  tps=      4100
clients=   2  tps=      8050
clients=   4  tps=     15800
clients=   8  tps=     30200
clients=  16  tps=     54100
clients=  32  tps=     71000
clients=  64  tps=     78500
clients= 128  tps=     79200
clients= 256  tps=     78900

Throughput grows roughly linearly from 1 to 16 clients (fsync coalescing via group commit helps here), rolls off from 16 to 64, and flattens at roughly 80,000 writes/s from 64 onwards. Adding more clients past that point does not increase throughput — it only increases per-commit latency, because more work piles up in the fsync queue for the same service rate.
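Little's law (L = λW) explains the latency side of the plateau: at a fixed service rate, every additional in-flight client adds queueing time. Using the sample TPS figures from the run above:

```python
# Little's law: in-flight requests L = throughput (lambda) x latency (W),
# so mean latency W = L / lambda. At a fixed plateau TPS, latency grows
# linearly with client count. TPS values come from the sample run above.
def mean_commit_latency_ms(clients: int, tps: float) -> float:
    return clients / tps * 1000

print(round(mean_commit_latency_ms(64, 78_500), 2))    # 0.82 ms
print(round(mean_commit_latency_ms(256, 78_900), 2))   # 3.24 ms -- 4x the
                                                       # latency, same TPS
```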

Why the plateau is hardware, not software: iostat during the saturated phase will show the NVMe device near its write-IOPS ceiling. top will show multiple postgres processes pinning cores. Neither CPU nor disk has headroom. The database is doing every bit of work the hardware can sustain; the plateau line is the hardware speaking. You could re-run this experiment on Oracle or MySQL and you would land in the same 50-100k band on the same machine — the architectural shape is the same.

The benchmark does not test sharding or leaderless systems — those require entirely different infrastructure and their own benchmarks (YCSB, sysbench, tpcc-like suites). But it demonstrates the single number you need to believe before Build 10 makes sense: the plateau is real, it is hardware, and no amount of client concurrency crosses it.

How the major databases handle the wall

Every production database has confronted this wall and picked one (or several) of the three doors. A quick tour.

PostgreSQL. Stock Postgres is single-primary forever; vertical scaling only. Citus (Microsoft-maintained) adds sharding via a coordinator + worker topology. Greenplum is an analytical fork with MPP sharding. BDR adds multi-primary. Core Postgres's stance is that write scaling past one primary requires an opinionated sharding layer that does not belong in the core.

MySQL. Single-primary with read replicas out of the box. Vitess (YouTube's sharding layer, CNCF graduated) is the canonical horizontal-scaling story, routing queries to per-shard primaries and handling re-sharding and cross-shard 2PC. Galera / Group Replication provide multi-primary via certification-based conflict resolution.

MongoDB. Native sharding. A sharded cluster has a mongos router, a config server, and N shards each itself a replica set with its own primary. First-class, not bolted on.

CockroachDB, TiDB, Spanner. Natively sharded into "ranges" or "regions". Each range is replicated via Raft (or Paxos); writes within a range commit via quorum. Cross-range transactions use distributed 2PC over the consensus groups. There is no single primary for the whole database — only per-range leaders — and the number of ranges grows with data.

Cassandra, DynamoDB. Fully leaderless. Every node accepts writes. Tunable consistency via R + W > N. Trade-off is weaker consistency and narrower query capabilities than an RDBMS.

Redis. Single-primary per key, but Redis Cluster shards across primaries using 16,384 hash slots (CRC16(key) mod 16384 routes to the owning primary).

Kafka. Not a database, but the lesson applies: a topic is partitioned, each partition has one leader and followers, partitions are independent. Throughput scales with partition count because each partition is an independent single-leader unit.

Why the industry convergence is striking: every major horizontally-scaled data system — Vitess, CockroachDB, Cassandra, MongoDB, DynamoDB, Kafka, Spanner, etcd — chose some flavour of "shard into independent pieces, each with its own ordering protocol." The naming differs (partitions, shards, ranges, tablets) but the idea is the same: break the single-primary assumption by partitioning, and either pick a primary per partition (Raft-per-range) or avoid primaries entirely (Dynamo-style).

A fintech's scaling journey from 1,000 to 100,000 TPS

Year 1 (Seed, 1,000 write TPS peak). Single Postgres primary on an AWS r5.2xlarge, one async replica in a neighbouring AZ for DR. Commit p50 is 3 ms. Build 9 is all you need.

Year 2 (Series A, 10,000 write TPS peak). Primary upgraded to r6i.8xlarge. Two async read replicas for dashboards, one sync replica in another AZ (synchronous_standby_names = 'FIRST 1 (sync_replica)') for zero-RPO. Commit p50 is 6 ms, p99 is 30 ms. Primary CPU peaks at 60%. Still comfortably inside Build 9.

Year 3 (Series B, 100,000 write TPS peak). Primary now the largest r8g available. CPU peaks at 98%, commit p99 at 250 ms (SLO is 50 ms), async replicas lag 30 seconds. The team tries commit_delay tuning, autovacuum, faster disks, removing some triggers; each yields ~10%. None solves it. A staff engineer runs the measurement benchmark above against a scaled-up replica; TPS rolls off at ~85,000 and flattens. The team sees the wall and plans a sharding migration.

They split users, accounts, and transactions by account_id mod 8 with Citus as the sharding layer. Each shard is its own Postgres primary on an r6i.4xlarge (× 8). Cross-shard transactions (rare inter-account transfers) go through Citus's distributed transaction protocol. Migration takes 4 months; cutover, 6 hours. Post-migration capacity: 8 × ~60,000 = ~480,000 writes/s. Commit p99 is back to 15 ms.

[Figure: Fintech scale-out across three stages. Panel 1 (seed, 1k TPS): one primary with an async DR replica (Build 9, single primary). Panel 2 (Series A, 10k TPS): primary, one sync replica, two async read replicas (Build 9 stretched). Panel 3 (Series B, 100k TPS): a Citus router in front of eight sharded primaries P1-P8, routed by hash(account_id) mod 8 (Build 10, sharded primaries).]
A fintech's three-stage journey: one primary at seed, primary plus replicas at Series A, sharded across eight primaries at Series B. The vertical axis is not drawn but each stage represents a 10× growth in write demand. The transition from stage 2 to stage 3 crosses the single-primary wall — it is not a configuration change but a structural rebuild of the data layer.

Common confusions

"Adding replicas adds write capacity." It does not. Replicas scale reads only; every write still flows through the one primary.

"The wall is a software limitation." It is not. Postgres, MySQL, Oracle, and SQL Server all land in the same 50-100k band on comparable hardware, because the binding limits are the serialised WAL, the fsync queue, the CPU, and the NIC.

"Sharding is just more replication." It is not. Replication copies the same data to followers; sharding splits the data so that different primaries own different slices and writes proceed in parallel.

Going deeper

Spanner's Paxos-per-range — the elegant middle ground

Google Spanner partitions data into ranges (roughly 100 MB each); each range is replicated across 3-5 nodes using Paxos; each range has its own Paxos leader that serialises writes for that range. Writes to different ranges proceed in parallel. Cross-range transactions use distributed 2PC coordinated on top of the ranges' leaders. Scaling is implicit: as data grows, ranges split; as a range gets hot, it splits further; each split creates a new Paxos group that absorbs writes in parallel. CockroachDB and TiDB copied this architecture. The cost is that every write is a Paxos round — one RTT to a quorum — so single-range commits are 5-20 ms, slower than a local Postgres primary. Per-node throughput is lower; cluster throughput scales to hundreds of nodes and millions of writes/s.

Redis Cluster's hash slots — sharding at its simplest

Redis Cluster uses 16,384 fixed hash slots, each assigned to one primary. CRC16(key) mod 16384 picks the slot, and the slot's owner serves the key. Rebalancing moves slots between nodes via a slot-migration protocol. Multi-key commands must target keys in the same slot, enforced via "hash tags" like {user123}:profile. The routing logic fits in a hundred lines. The limitation: anything requiring cross-slot consistency has to be built on top.
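The routing really is tiny. A sketch using the XMODEM CRC16 variant (which Redis Cluster specifies) with simplified hash-tag extraction — the real rule takes the first "{" and then the first "}" after it, which this version approximates:

```python
# Redis Cluster slot routing: CRC16 (XMODEM variant) mod 16384, with
# hash-tag extraction so {user123}:profile and {user123}:settings land
# in the same slot. Hash-tag handling here is a simplification of the
# full specification.
def crc16(data: bytes) -> int:
    """CRC-16/XMODEM: poly 0x1021, init 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    start, end = key.find("{"), key.find("}")
    if 0 <= start < end - 1:            # non-empty {...} hash tag found
        key = key[start + 1:end]        # hash only the tag contents
    return crc16(key.encode()) % 16384

print(key_slot("{user123}:profile") == key_slot("{user123}:settings"))  # True
```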

The "sticky primary" pattern — app-level sharding

Some teams implement sharding at the application level rather than in the database. The app knows that tenant_id = 42 lives on db-shard-3.internal and connects directly. The database sees simple single-primary semantics. Common in multi-tenant SaaS where each tenant is naturally a shard. Advantages: tenants scale, back up, and migrate independently. Disadvantages: re-sharding a hot tenant is the app team's problem; schema migrations run N times; the connection pool manages N shard connections. Salesforce, Slack, and Shopify have used variations of this pattern.

Where this leads next

Build 9 is over. You have the complete toolkit for running a single-primary database well — WAL, async and sync replication, consistency models, failover, PITR, isolation levels, MVCC — and you can now articulate where it stops working and why.

Build 10 opens the door on the other side of the wall: sharding across N independent primaries, leaderless quorum replication, and multi-primary conflict resolution.

Each of these drops the single-primary assumption. Real production systems often combine them — a sharded cluster where each shard is Raft-replicated, a leaderless store with a sharded schema layer on top — but you must understand each on its own before the combinations make sense. The wall is not a failure of Build 9's ideas. It is their limit. Build 10 is what lives past that limit.

References

  1. DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, SOSP 2007 — the foundational paper on leaderless replication with tunable quorums, vector clocks, and consistent hashing. Every leaderless system since (Cassandra, Riak, DynamoDB) traces its design to this paper.
  2. Corbett et al., Spanner: Google's Globally-Distributed Database, OSDI 2012 — the paper that introduced Paxos-per-range sharding with external-consistency guarantees. The architectural template copied by CockroachDB, TiDB, and YugabyteDB.
  3. Citus documentation, Distributed PostgreSQL with Citus — the canonical reference for how Postgres gets sharded in production, including shard placement, query routing, and cross-shard transactions.
  4. Vitess documentation, Vitess Architecture Overview — YouTube's MySQL-sharding layer, the dominant horizontal-scaling story for MySQL at hyperscale. Describes vtgate routing, vtablet operation, and resharding workflows.
  5. Kleppmann, Designing Data-Intensive Applications, O'Reilly 2017, chapter 6 — the clearest book-length treatment of partitioning strategies (by key range, by hash, by document) and the trade-offs of rebalancing. The intellectual predecessor of this chapter and most of Build 10.
  6. Cockroach Labs, CockroachDB Design Document — the public design document for CockroachDB, detailing how ranges, Raft groups, the distributed SQL layer, and the transaction model compose to provide a horizontally-scaled SQL database. An excellent concrete example of the Build 10 architectural pattern.