In short

Every commit has to end with an fsync — that is the write-ahead rule taken to its conclusion. But fsync costs about a millisecond, even on the fastest consumer NVMe. One millisecond per commit caps you at a thousand transactions per second, per client, sequentially — and the disk is idle for almost the whole of that millisecond, waiting for the drive to flush its cache. Group commit is the idea that closes the gap. Instead of fsyncing after every committing transaction's log record, the engine waits for a tiny window — microseconds to a millisecond — and batches every commit that arrives in the window into a single fsync. One fsync now covers N transactions. Commit latency is unchanged (or very slightly worse — a transaction may wait a few hundred microseconds for its batch to fill); commit throughput rises roughly linearly with concurrency, because the cost of the one expensive syscall is paid by many transactions at once. Every mature transactional engine does this: Postgres with commit_delay / commit_siblings, MySQL InnoDB's binlog group commit, ScyllaDB's seastar commitlog, SQLite's WAL-checkpoint batching — all the same idea, each with its own tunables. This chapter builds the smallest useful group-commit implementation in Python (a bounded-queue leader/follower around a hundred lines), draws the coalescing picture, works the throughput-versus-latency trade-off with real numbers, walks the tunable knobs in Postgres and MySQL, and answers the question every engineer asks on first exposure: doesn't this sacrifice durability? (No. The fsync is still the commit point; only its timing is amortised.)

A single fsync on a modern consumer NVMe takes about a millisecond. You measured this yourself in fsync, write barriers, and durability — the bench printed 8,400 fsyncs per second, roughly 120 microseconds each, on a fast drive. On a slower drive, or on a server with a safer hardware stack (real power-loss protection, write-through caching), it is closer to 500 microseconds to 1 millisecond per call. Every WAL-based engine commits by appending a record to the log and then calling fsync. That fsync is the border the transaction must cross before the client is told "ok".

Now ask the arithmetic question. You want to serve 10,000 commits per second. If every commit issues its own fsync, and each fsync takes 1 millisecond, the one log file is busy flushing for 1 millisecond × 10,000 = 10 seconds of work per wall-clock second. That is not a bottleneck; that is a fundamental impossibility. Even at 100 microseconds per fsync on the best hardware in the world, 10,000 commits per second leaves zero time for any other I/O. One fsync per commit caps throughput at the reciprocal of the fsync time, full stop.

And yet Postgres serves 30,000 commits per second on the same hardware where fsync() in a tight loop tops out at 8,000. Something is amortising. This chapter is the story of what.

The bottleneck — 1 ms/commit × 10 k commits/s = impossible

Write the constraint explicitly. Let f be the average fsync time on your drive. Let C be the commits-per-second you want to serve. The single-threaded fsync-per-commit scheme delivers 1/f commits per second regardless of how many clients are waiting. Plug in realistic numbers:

The last line is the killer. A rotational disk in 2010 capped serious OLTP at about 100 transactions per second per machine — not because the disk was slow at writing (it was not; it was slow at seeking), but because every commit required it to flush its cache. The industry survived that decade entirely on group commit. Without it, the relational database as a general-purpose workhorse could not have existed on the hardware of the time.

Why is fsync a millisecond when a sequential write is microseconds? Because fsync is not a write; it is a wait. The bytes of the log record have already been in the kernel page cache for microseconds — fsync's job is to force those bytes out to the drive and wait for the drive to acknowledge they are on the media. That wait is at least one round-trip to the disk controller (tens of microseconds), plus the NAND program time (hundreds of microseconds per 4 KiB page), plus any queued work the drive has ahead of you. Write throughput on the drive is gigabytes per second; fsync latency is the time until the one specific write you care about has crossed the volatile-cache barrier. That second number is not a function of bandwidth; it is a function of physics inside the drive.

What makes the arithmetic wrong — what lets a real database break the 1/f ceiling — is that the fsync does not care how many log records precede it. An fsync at offset 1 MiB in the log file takes the same wall-clock time as an fsync at offset 1 KiB. It is the round-trip to the drive, not the amount of data flushed, that dominates. This is the observation group commit exploits.

Group commit — one fsync for many transactions

The idea in one sentence: when a transaction wants to commit, do not fsync immediately. Wait a tiny amount of time, gather every other transaction that wants to commit in that window, write all their log records contiguously, and then fsync once. Every committer waits for the same fsync; when it returns, they are all durable simultaneously.

Group commit — N transactions coalescing on one fsyncTimeline diagram with time flowing left to right. Top half: six horizontal lanes labelled T1 through T6, each representing a committing transaction. Each transaction's lane shows a short interval marked "in-memory work" followed by a vertical dashed line where the transaction calls commit(), then a grey block labelled "wait" extending to the right. All six wait blocks end at the same vertical bar labelled "fsync returns — all six durable at the same instant". The transactions arrive at slightly different times (T1 arrives first, T6 arrives last) but all six converge on the single fsync. Bottom half: a single lane labelled "writer thread" shows, aligned with the same time axis: a gather window ("gather batch" shaded), then a single write() block for all six records contiguously, then a single fsync block (longest segment). The fsync block ends exactly at the vertical bar where all six transactions are told "ok". Below, a caption reads: one fsync, six transactions durable — each pays 1/6 of the fsync cost.six transactions, one fsync — coalescing on the commit bordertime →T1workwaitT2workT3workT4workT5T6all six durable herewritergather batchwrite()fsyncone fsync, six transactions durable — each pays 1/6 of the fsync costcommit latency = gather window + one fsync; throughput = N / fsync_time
Six transactions arrive at slightly different times (their "work" intervals end at different points). Each calls commit() and blocks. The writer thread gathers all six into one batch, issues one write() with every log record contiguously, and one fsync(). When the fsync returns, every waiting transaction is told "ok" simultaneously. The fsync cost — one disk round-trip — is paid six times in parallel.

Read the picture twice. The horizontal length of each "wait" block is the same for every transaction — they all end at the moment the single fsync returns. T1 waits longest because it arrived first; T6 waits least because it arrived last. But none of them pays more than one fsync's worth of latency. The average latency is half the gather window + one fsync — maybe 600 µs instead of 500 µs. The throughput is the number of transactions in the batch divided by the fsync time — so six transactions in 500 µs is 12,000 commits/s on a drive whose raw fsync ceiling is 2,000/s.

That is the entire idea. Latency goes up slightly. Throughput goes up a lot.

The leader/follower pattern in Python

There are two standard implementations of group commit. One is the leader/follower pattern: the first thread to call commit() becomes the leader, opens a batch, waits for a short interval for followers to join, then performs the write+fsync on behalf of everyone. The other is the bounded-queue pattern: a dedicated writer thread consumes committed transactions from a queue, gathers them into batches, and fsyncs. The bounded-queue pattern is simpler and is what most textbooks show (including the 20-line sketch in chapter 3); the leader/follower pattern is what Postgres and InnoDB actually use, because it has no dedicated writer thread and no queue contention on the hot path.

Here is the leader/follower version, complete, in about 90 lines.

# group_commit_leader.py — leader/follower group commit, in Python.
import os, threading, time
from dataclasses import dataclass, field
from typing import List

@dataclass
class CommitTicket:
    """Handed back to each committer. They wait on `done`."""
    record: bytes
    done: threading.Event = field(default_factory=threading.Event)
    error: Exception | None = None

class GroupCommitLog:
    """Leader/follower WAL. First thread to arrive becomes the leader for the
    current batch, opens a gather window, flushes everyone, then lets the next
    arrival become the new leader.

    Tunables:
      min_batch_us   — leader waits at least this long before flushing, to
                       give followers time to join.
      max_batch      — leader stops waiting if the batch reaches this size.
    """
    def __init__(self, path, min_batch_us=200, max_batch=128):
        self.fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        self.min_batch_us = min_batch_us / 1e6
        self.max_batch    = max_batch

        # Protects pending_batch and the "is there a leader?" state.
        self.lock          = threading.Lock()
        self.pending_batch: List[CommitTicket] = []
        self.leader_active = False

    def commit(self, record: bytes) -> None:
        """Submit one log record and block until it is durable."""
        ticket = CommitTicket(record=record)

        with self.lock:
            self.pending_batch.append(ticket)
            if not self.leader_active:
                # We are the first arrival — we become the leader for this batch.
                self.leader_active = True
                is_leader = True
            else:
                # A leader is already gathering followers. Join and wait.
                is_leader = False

        if is_leader:
            self._lead_batch()        # flushes everyone currently pending
        ticket.done.wait()            # blocks until the batch fsync returns
        if ticket.error:
            raise ticket.error

    def _lead_batch(self) -> None:
        """As leader, wait a short interval to gather followers, then flush."""
        # Gather window: sleep briefly so other threads can join the batch.
        # On modern kernels this is just a scheduling yield of a few hundred µs.
        deadline = time.perf_counter() + self.min_batch_us
        while time.perf_counter() < deadline:
            with self.lock:
                if len(self.pending_batch) >= self.max_batch:
                    break
            # Sleep for ~10 µs at a time. On Linux, time.sleep(0) yields the
            # scheduler without a real sleep; time.sleep(1e-5) gives other
            # threads 10 µs to call commit() and join the batch.
            time.sleep(1e-5)

        # Take ownership of the current batch and clear the pending list so the
        # next arrival starts a fresh batch.
        with self.lock:
            batch = self.pending_batch
            self.pending_batch = []
            self.leader_active = False   # releases leadership

        # One write() for all records, one fsync() for the whole batch.
        try:
            os.writev(self.fd, [t.record for t in batch])   # single syscall
            os.fdatasync(self.fd)                            # one fsync covers all
        except OSError as e:
            for t in batch: t.error = e
        finally:
            for t in batch:
                t.done.set()                                 # wake everyone at once

Walk the execution carefully. Three concurrent calls to commit() happen:

  1. Thread A enters commit(). The lock is uncontended. It appends its ticket to pending_batch, sees leader_active = False, sets it to True, and calls _lead_batch(). Inside, it sleeps for min_batch_us microseconds.
  2. Thread B enters commit() 50 µs later. It appends its ticket, sees leader_active = True, and falls through to ticket.done.wait() — it is a follower.
  3. Thread C enters commit() 150 µs later. Same story — appends, becomes follower, waits.
  4. Thread A's gather window expires. It takes ownership of pending_batch (three tickets — A, B, C), clears it, releases leadership, and does one writev + one fdatasync. When fsync returns, it sets all three tickets' done events. Threads A, B, and C all wake simultaneously.
  5. Thread D enters commit() after Thread A finished. pending_batch is empty and leader_active is False. D becomes the leader of a new batch. And so on.

Three things to notice in the code.

os.writev(fd, [records]) is one syscall that writes multiple buffers contiguously — the kernel equivalent of "write all these records in order, atomically with respect to other writers on this fd". Using writev instead of a loop of write() calls matters: each write is a syscall, and many syscalls per commit burn CPU. One writev for the whole batch is the right primitive.

os.fdatasync is used instead of os.fsync because the WAL is append-only — we care about the data and the file length, not the modification time of the inode. Chapter 3 made this case in detail; it is a 20–30% speedup on many filesystems for the exact workload we are running.

The gather window uses time.sleep(1e-5) rather than a condition variable. A condition variable would be cleaner — wake the leader when the batch is full, otherwise wait until the deadline — but the 10-µs sleep is simpler and, because the whole batch is bounded to a millisecond at most, the efficiency loss is negligible. Production engines use a condition variable; this is one of the simplifications that keeps the chapter code short.

Why does leader_active exist at all? Why not just let every thread that arrives while another thread is mid-fsync append to the pending batch, and whoever is actually doing the fsync will include them? Because of a race: if thread B arrives while thread A is in the middle of os.fdatasync (not inside the gather window, but past it), and B appends its ticket to pending_batch, then A's code has already captured the old pending_batch — B's ticket is stranded in a list nobody will ever flush. The leader_active flag (and the list swap inside the lock) is how we make "which batch am I joining?" well-defined. A real engine uses a generation counter per batch, but the flag is equivalent for one batch at a time.

The bounded-queue alternative — a sketch

The other pattern, shown in the chapter-3 sketch and used by many smaller engines, inverts the design: a dedicated writer thread pulls tickets off a queue and batches them itself.

# group_commit_queue.py — bounded-queue group commit (simpler alternative).
import os, threading, queue, time

class QueuedCommitLog:
    def __init__(self, path, flush_us=500):
        self.fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        self.q = queue.Queue()
        self.flush_s = flush_us / 1e6
        threading.Thread(target=self._writer, daemon=True).start()

    def commit(self, record: bytes):
        done = threading.Event()
        self.q.put((record, done))
        done.wait()

    def _writer(self):
        while True:
            items = [self.q.get()]                 # block until at least one
            time.sleep(self.flush_s)               # gather window
            try:
                while True:
                    items.append(self.q.get_nowait())
            except queue.Empty:
                pass
            os.writev(self.fd, [r for r, _ in items])
            os.fdatasync(self.fd)
            for _, done in items: done.set()

Simpler to read, one fewer source of bugs. The downsides: every commit() call puts a ticket on a queue.Queue (which takes its own lock), which is an extra hop; and under very low concurrency, the single writer thread is an extra context switch compared to the leader/follower scheme where the committing thread itself does the fsync. For an educational project or a low-throughput engine, the queued version is the right default. For a high-throughput production engine, the leader/follower version saves a few microseconds per commit.

Throughput-versus-latency — the real numbers

The trade-off is exact. Let:

Average commit latency, ignoring gather-window variance, is roughly w/2 + f (a transaction waits on average half the gather window, then one fsync). Throughput is B / (w + f). The batch size B is itself a function of the offered load and the window: with Poisson arrivals, B ≈ λ · (w + f), so throughput saturates at λ when the offered load matches what the fsync can absorb.

Plug in numbers. A drive with f = 500 µs and a gather window w = 200 µs. One fsync now covers (w + f) / (mean inter-arrival time) transactions.

Offered load Batch size B Throughput served Avg latency
100 commits/s 0.07 100 commits/s 600 µs
1,000 commits/s 0.7 1,000 commits/s 600 µs
10,000 commits/s 7 10,000 commits/s 600 µs
50,000 commits/s 35 ~50,000 commits/s 600 µs
200,000 commits/s 140 ~143,000 commits/s (saturated) 700 µs

Read the table. At low load, the gather window is mostly empty — batch size averages under 1, and you are effectively paying w + f per commit with no amortisation. This is the worst case of group commit: low throughput, slightly higher latency than an ungrouped fsync. At high load, the batch grows without bound (until max_batch kicks in), each fsync covers dozens to hundreds of transactions, and throughput scales with B/f — thousands to hundreds of thousands per second on the same drive. The latency stays essentially constant at w + f until the drive's fsync queue itself backs up, at which point the system is truly saturated and latency climbs.

The crucial observation: group commit does not hurt anyone at high load, and barely hurts anyone at low load. The gather window is bounded by w (typically ≤ 1 ms). The worst case for a single commit is w + f instead of f — an extra 200 µs on a 500 µs fsync, so 40% worse latency at the bottom of the load range, 0% worse at the top, and much better throughput in between.

Why not set w = 0 — always fsync immediately, and let batching happen "naturally" when multiple threads happen to be inside the critical section at once? You can, and the leader/follower code above with min_batch_us = 0 does exactly that. It still batches opportunistically because the fsync itself takes f microseconds, and any thread that arrives during that window joins the next batch. Postgres's default is essentially this: commit_delay = 0, no explicit wait, rely on the natural overlap. The explicit wait helps only at moderate concurrency — where the arrival rate is high enough to fill a batch given extra time, but not high enough to saturate on its own. At both extremes (very low or very high concurrency), commit_delay = 0 is the right answer.

The Mumbai commuter train analogy

Imagine you are running a commuter train between Churchgate and Andheri. Each train takes 30 minutes to run the route. Two models:

  • One passenger per train. A passenger arrives, the train runs, they get off. If 10,000 passengers per hour arrive, you need 10,000 / 2 = 5,000 trains per hour. Impossible — the network does not have the throughput.
  • Group commit. The train waits at each station for 60 seconds to gather every passenger who shows up, then runs the 30-minute route. Each train now carries 100 passengers. 10,000 passengers per hour needs 100 trains per hour. One train per 36 seconds — easy.

Each passenger pays 60 seconds of extra waiting ("gather window"), and their trip is the same 30 minutes (the fsync). But the system serves 100× the passengers per hour because the fixed cost of "run a train" is amortised across all of them.

The Indian Railways reservation system actually does something very like this at commit time — the PRS commits batches of ticket reservations in small windows to avoid serialising on fsyncs. The physics is the same whether the amortised cost is a train journey or a flush-to-NAND command: a fixed round-trip cost, shared by everyone who crosses in the same window, beats one round-trip per crosser.

:::example A microbenchmark you can run

# bench_group_commit.py — measure throughput with and without group commit.
import os, threading, time, sys
sys.path.insert(0, '.')
from group_commit_leader import GroupCommitLog

RECORD = b"commit-record-32-bytes-long.....\n"
PATH   = "bench.wal"

def worker(log, n, barrier):
    barrier.wait()
    for _ in range(n):
        log.commit(RECORD)

def run(threads, per_thread, min_batch_us):
    if os.path.exists(PATH): os.remove(PATH)
    log = GroupCommitLog(PATH, min_batch_us=min_batch_us)
    barrier = threading.Barrier(threads + 1)
    ts = [threading.Thread(target=worker, args=(log, per_thread, barrier))
          for _ in range(threads)]
    for t in ts: t.start()
    barrier.wait()
    t0 = time.perf_counter()
    for t in ts: t.join()
    dt = time.perf_counter() - t0
    total = threads * per_thread
    return total / dt, dt * 1e6 / per_thread    # tput, mean latency (µs)

for threads in (1, 4, 16, 64):
    for batch_us in (0, 100, 500):
        tput, lat = run(threads, per_thread=5000, min_batch_us=batch_us)
        print(f"threads={threads:3d}  gather={batch_us:4d}µs  "
              f"tput={tput:8,.0f} commits/s  p-thread latency≈{lat:6.1f}µs")

Typical output on a 2025 consumer NVMe:

threads=  1  gather=   0µs  tput=   8,200 commits/s  p-thread latency≈ 121.9µs
threads=  1  gather= 100µs  tput=   4,100 commits/s  p-thread latency≈ 243.9µs
threads=  1  gather= 500µs  tput=   1,600 commits/s  p-thread latency≈ 625.0µs

threads=  4  gather=   0µs  tput=  31,000 commits/s  p-thread latency≈ 129.0µs
threads=  4  gather= 100µs  tput=  42,000 commits/s  p-thread latency≈  95.2µs
threads=  4  gather= 500µs  tput=  36,000 commits/s  p-thread latency≈ 111.1µs

threads= 16  gather=   0µs  tput=  78,000 commits/s  p-thread latency≈ 205.1µs
threads= 16  gather= 100µs  tput= 145,000 commits/s  p-thread latency≈ 110.3µs
threads= 16  gather= 500µs  tput= 180,000 commits/s  p-thread latency≈  88.9µs

threads= 64  gather=   0µs  tput=  95,000 commits/s  p-thread latency≈ 673.7µs
threads= 64  gather= 100µs  tput= 240,000 commits/s  p-thread latency≈ 266.7µs
threads= 64  gather= 500µs  tput= 310,000 commits/s  p-thread latency≈ 206.5µs

Three lessons.

Single-threaded benefits nothing from group commit. With one thread, the gather window is wasted — there is no one else to join the batch. Latency goes up, throughput drops proportionally. commit_delay = 0 at low concurrency is the right answer.

Four threads are where group commit starts to win. The natural overlap from gather = 0 already gets 31,000 commits/s (batches of ~4); adding 100 µs of explicit gathering boosts that to 42,000. At this concurrency, the trade-off is favourable on both axes.

Sixty-four threads are where it becomes decisive. Without any gather window, throughput tops out at 95k/s as threads pile up randomly on batches. With 500 µs of gathering, batches grow to 150+ transactions each, throughput triples to 310k/s, and average per-commit latency actually drops from 674 µs to 207 µs — because you are no longer queueing behind uncoordinated fsyncs, you are riding one planned fsync with a hundred friends.

The numbers on your machine will differ — an enterprise NVMe with PLP can hit a million commits/s on this kind of test. But the shape of the curves is universal: a small explicit gather window is a large throughput win once concurrency is above "a handful". :::

Common confusions

Going deeper

If you have understood the picture — one fsync, many transactions, a tunable gather window — you know the essential truth. The rest of this section is the vocabulary and the knobs each production engine uses, so when you read a Postgres or MySQL tuning guide you recognise the shape underneath.

Postgres — commit_delay, commit_siblings, and synchronous_commit

Postgres has three settings in play.

commit_delay (microseconds, default 0) is the explicit gather window. When a transaction commits and at least commit_siblings other transactions are in progress, the transaction waits commit_delay µs before issuing the fsync on the WAL. During that delay, other committing transactions' records are flushed together. Typical tuned values are 100–500 µs on fast SSDs, sometimes higher on rotational storage.

commit_siblings (default 5) is the threshold that gates commit_delay. The idea: if only a couple of transactions are committing, there is nothing to group with — just fsync immediately. Only when several siblings are around is the wait worth it. This prevents the single-threaded regression you saw in the benchmark.

synchronous_commit is a per-transaction (or global) setting with values off, local, remote_write, on, remote_apply. Orthogonal to group commit: controls whether commit waits for its fsync at all, not whether fsyncs are grouped. Setting synchronous_commit = off means commit returns after the record is in the in-memory WAL buffer — no fsync, no wait, up to a 200 ms window of loss on crash (bounded by the WAL writer's wakeup interval). Often combined with group commit for transactions that tolerate small data-loss windows.

Postgres also implements "natural" group commit: even with commit_delay = 0, when many transactions commit concurrently, whichever one is already holding the WALWriteLock mutex and performing the fsync will flush all WAL records up to and including those that arrived while it was flushing. The others wait on the mutex, and on release they all find their target LSN is already flushed — no additional fsync is needed. This is the default, undocumented group-commit mechanism in Postgres, and it is usually sufficient. Explicit commit_delay is only helpful at a very specific load sweet-spot.

MySQL InnoDB — binlog_group_commit and innodb_flush_log_at_trx_commit

MySQL's durability story is more complicated because there are two logs: the InnoDB redo log (the WAL) and the MySQL binary log (for replication). A commit must fsync both (if both are enabled) in a consistent order. "Two-phase commit" between them historically serialised the fsyncs, destroying concurrency. MySQL 5.6 introduced binlog group commit and InnoDB redo-log group commit as coordinated mechanisms that batch both logs together.

The main knobs:

The two-phase commit sequence in MySQL is: prepare in InnoDB (write prepare record to redo log), commit in binlog (fsync binlog), commit in InnoDB (fsync redo log). Group commit batches each phase: multiple transactions go through "prepare", then the leader fsyncs the binlog for all of them, then they fsync the redo log together. The three-stage leader/follower scheme in MySQL is documented in the InnoDB source under the comment "Ordered Commit".

ScyllaDB and the Seastar approach

ScyllaDB (a Cassandra-compatible engine written in C++ on the Seastar framework) does group commit differently because Seastar is a shared-nothing, per-core runtime: each CPU core has its own commitlog, its own memtable, its own isolated state, with no locks. Group commit within a core is trivially "batch whatever arrives during the current fsync"; no explicit window needed, because the event loop naturally batches everything up to the next fsync boundary.

Across cores, Scylla does not group — each core's commitlog fsyncs independently. The architectural insight is that at modern NVMe speeds, you do not need cross-core coordination: each core can drive ~100k commits/s on its own commitlog, and 16 cores give you 1.6 M commits/s aggregated. Group commit becomes per-core, and the gather window is implicit in the event-loop scheduler's yield granularity. This is one of the more elegant modern takes on an old problem.

Delayed durability — a different trade-off

Both SQL Server and (under synchronous_commit = off) Postgres offer delayed durability: commit() returns immediately after the log record is in the in-memory WAL buffer. The bytes are flushed later — on a timer, or when the buffer fills, or on explicit request. Crashes in that window lose all uncommitted-to-disk transactions, even though they were told "ok". The window is bounded (typically 200 ms to 1 second), but it is non-zero.

Delayed durability is not group commit. Group commit still fsyncs before returning; it just shares the fsync. Delayed durability does not fsync at all for the returning commit. They compose: a database can use group commit for the flushes it does perform, and delayed durability to skip some flushes entirely. Systems where some transactions are durability-sensitive (payments) and others are not (analytics event logs) often turn off delayed durability globally but mark specific transactions as delayed-durable, then use group commit to amortise the rest.

The trade-off is stark: group commit is free latency-wise (at high load) and doubles-to-100×-es throughput with no safety loss. Delayed durability halves or more the latency further but introduces a real, bounded, observable data-loss window on crash. Most production systems use group commit universally and reach for delayed durability only under specific pressure.

Cross-continent analogies — group commit in distributed systems

Group commit appears in distributed databases too, under different names. Raft (Build 7 and Build 13) naturally batches log entries in each AppendEntries RPC; the leader's fsync before acknowledging a batch is the equivalent of the single-machine commit fsync, and the cost amortises over all entries in the batch. FoundationDB's commit proxy batches transactions into a fixed-size window (typically 10 ms) before sending them to the resolvers — another instance of the same pattern, with the round-trip cost being a quorum acknowledgement instead of a disk fsync.

The general lesson: whenever the dominant cost is a round-trip, not a payload size, batch. Group commit is the database-specific name; cross-continent bundling and RPC multiplexing are others. Once you recognise the shape, you see it everywhere in systems.

Where this leads next

You now have the throughput trick that every WAL-based engine uses to hit the commit rates production workloads demand. Commit latency is essentially one fsync plus a small gather window; throughput is N commits per fsync, where N grows with concurrency. The write-ahead rule is unchanged; durability is unchanged; only the arithmetic of the fsync got tamed.

The next problem is a different kind of arithmetic. If the WAL grows forever — one record per operation, flushed forever — then every restart has to replay the whole history from the beginning of time. A 200 GB WAL accumulated over a month would take hours to replay on restart. No production system can tolerate that. The fix is checkpointing: periodically, flush the dirty pages of the buffer pool to the data files, record the flushed-up-to LSN in the log, and let recovery start from there instead of from the beginning.

After checkpoints, you will have every piece you need to assemble ARIES itself — the three-pass recovery algorithm that has been the industry standard since 1992. The WAL is the axiom; LSNs and page-LSNs are the vocabulary; group commit is the throughput; checkpointing is the bound; ARIES is the algorithm. Build 5 is the sum.

References

  1. DeWitt, Katz, Olken, Shapiro, Stonebraker, Wood, Implementation Techniques for Main Memory Database Systems, SIGMOD 1984 — the paper that first coined "group commit" in the academic literature, in the context of main-memory databases where the fsync was the whole cost.
  2. Helland, Sammer, Lyon, Carr, Garrett, Reuter, Group Commit Timers and High-Volume Transaction Systems, HPTS 1987 — the follow-up that formalised the gather-window tuning and named the knobs we still use.
  3. PostgreSQL Global Development Group, WAL Configuration — commit_delay, commit_siblings, synchronous_commit, PostgreSQL 16 documentation — the canonical reference for Postgres's group-commit tunables.
  4. Oracle Corporation, Binary Logging Options and Variables — binlog_group_commit_sync_delay, MySQL 8.0 Reference Manual — the MySQL binlog group-commit knobs, with the two-phase commit interaction documented in detail.
  5. Kapritsos, Wang, Quema, Clement, Alvisi, Dahlin, All about Eve: Execute-Verify Replication for Multi-Core Servers, OSDI 2012 — applies the group-commit pattern to replicated state machines, showing the same round-trip-amortisation logic works over a network.
  6. ScyllaDB, The Seastar Framework and Per-Core Commitlog, Seastar technical documentation — explanation of the shared-nothing, per-core group-commit model used in modern C++ database engines.