In short
Every commit has to end with an fsync — that is the write-ahead rule taken to its conclusion. But an fsync costs hundreds of microseconds to a millisecond, even on fast consumer NVMe. At one millisecond per commit you are capped at a thousand transactions per second, per client, sequentially — and the disk is idle for almost the whole of that millisecond, waiting for the drive to flush its cache. Group commit is the idea that closes the gap. Instead of fsyncing after every committing transaction's log record, the engine waits for a tiny window — microseconds to a millisecond — and batches every commit that arrives in the window into a single fsync. One fsync now covers N transactions. Commit latency is unchanged (or very slightly worse — a transaction may wait a few hundred microseconds for its batch to fill); commit throughput rises roughly linearly with concurrency, because the cost of the one expensive syscall is paid by many transactions at once. Every mature transactional engine does this: Postgres with commit_delay / commit_siblings, MySQL InnoDB's binlog group commit, ScyllaDB's Seastar commitlog, SQLite's WAL-checkpoint batching — all the same idea, each with its own tunables. This chapter builds the smallest useful group-commit implementation in Python (a leader/follower batcher around a hundred lines, plus a simpler bounded-queue alternative), draws the coalescing picture, works the throughput-versus-latency trade-off with real numbers, walks the tunable knobs in Postgres and MySQL, and answers the question every engineer asks on first exposure: doesn't this sacrifice durability? (No. The fsync is still the commit point; only its timing is amortised.)
A single fsync on a modern consumer NVMe takes on the order of a hundred microseconds to a millisecond. You measured this yourself in fsync, write barriers, and durability — the bench printed 8,400 fsyncs per second, roughly 120 microseconds each, on a fast drive. On a slower drive, or on one whose volatile write cache must run write-through because it lacks real power-loss protection, it is closer to 500 microseconds to 1 millisecond per call. Every WAL-based engine commits by appending a record to the log and then calling fsync. That fsync is the border the transaction must cross before the client is told "ok".
Now ask the arithmetic question. You want to serve 10,000 commits per second. If every commit issues its own fsync, and each fsync takes 1 millisecond, the one log file is busy flushing for 1 millisecond × 10,000 = 10 seconds of work per wall-clock second. That is not a bottleneck; that is a fundamental impossibility. Even at 100 microseconds per fsync on the best hardware in the world, 10,000 commits per second leaves zero time for any other I/O. One fsync per commit caps throughput at the reciprocal of the fsync time, full stop.
And yet Postgres serves 30,000 commits per second on the same hardware where fsync() in a tight loop tops out at 8,000. Something is amortising. This chapter is the story of what.
The bottleneck — 1 ms/commit × 10 k commits/s = impossible
Write the constraint explicitly. Let f be the average fsync time on your drive. Let C be the commits-per-second you want to serve. The single-threaded fsync-per-commit scheme delivers 1/f commits per second regardless of how many clients are waiting. Plug in realistic numbers:
- Consumer NVMe, best case: f = 100 µs ⇒ ceiling 10,000 commits/s.
- Consumer NVMe, typical: f = 200 µs ⇒ ceiling 5,000 commits/s.
- Enterprise NVMe with PLP: f = 20 µs ⇒ ceiling 50,000 commits/s.
- Spinning disk (one head seek per flush): f = 8 ms ⇒ ceiling 125 commits/s.
The last line is the killer. A rotational disk in 2010 capped serious OLTP at about 100 transactions per second per machine — not because the disk was slow at writing (it was not; it was slow at seeking), but because every commit required it to flush its cache. The industry survived that decade entirely on group commit. Without it, the relational database as a general-purpose workhorse could not have existed on the hardware of the time.
Why is fsync a millisecond when a sequential write is microseconds? Because fsync is not a write; it is a wait. The bytes of the log record have already been in the kernel page cache for microseconds — fsync's job is to force those bytes out to the drive and wait for the drive to acknowledge they are on the media. That wait is at least one round-trip to the disk controller (tens of microseconds), plus the NAND program time (hundreds of microseconds per 4 KiB page), plus any queued work the drive has ahead of you. Write throughput on the drive is gigabytes per second; fsync latency is the time until the one specific write you care about has crossed the volatile-cache barrier. That second number is not a function of bandwidth; it is a function of physics inside the drive.
What makes the arithmetic wrong — what lets a real database break the 1/f ceiling — is that the fsync does not care how many log records precede it. An fsync at offset 1 MiB in the log file takes the same wall-clock time as an fsync at offset 1 KiB. It is the round-trip to the drive, not the amount of data flushed, that dominates. This is the observation group commit exploits.
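You can probe this claim on your own drive with a few lines — a rough check, not a benchmark, and on some drives the larger flush will cost somewhat more; the point is that it does not cost a thousand times more:

```python
# fsync_probe.py — rough check that fsync latency is dominated by the
# round-trip, not the amount of data flushed: time a 1 KiB flush and a
# 1 MiB flush on the same append-only file.
import os, time, tempfile

def timed_flush(fd: int, payload: bytes) -> float:
    """Append payload, then measure how long fdatasync takes."""
    os.write(fd, payload)
    t0 = time.perf_counter()
    os.fdatasync(fd)                      # wait for the drive to acknowledge
    return time.perf_counter() - t0

fd = os.open(os.path.join(tempfile.gettempdir(), "probe.log"),
             os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
small = timed_flush(fd, b"x" * 1024)            # 1 KiB
large = timed_flush(fd, b"x" * (1024 * 1024))   # 1 MiB — 1024x the data
print(f"1 KiB fsync: {small * 1e6:7.0f} µs   1 MiB fsync: {large * 1e6:7.0f} µs")
os.close(fd)
```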
Group commit — one fsync for many transactions
The idea in one sentence: when a transaction wants to commit, do not fsync immediately. Wait a tiny amount of time, gather every other transaction that wants to commit in that window, write all their log records contiguously, and then fsync once. Every committer waits for the same fsync; when it returns, they are all durable simultaneously.
Picture six transactions, T1 through T6, arriving within a few hundred microseconds of each other. Each calls commit() and blocks. The writer thread gathers all six into one batch, issues one write() with every log record contiguous, and one fsync(). When the fsync returns, every waiting transaction is told "ok" simultaneously. The fsync cost — one disk round-trip — is paid once and shared six ways.

Read the picture twice. The horizontal length of each "wait" block is the same for every transaction — they all end at the moment the single fsync returns. T1 waits longest because it arrived first; T6 waits least because it arrived last. But none of them pays more than one fsync's worth of latency. The average latency is half the gather window plus one fsync — maybe 600 µs instead of 500 µs. The throughput is the number of transactions in the batch divided by the fsync time — six transactions per 500 µs fsync is 12,000 commits/s on a drive whose raw fsync ceiling is 2,000/s.
That is the entire idea. Latency goes up slightly. Throughput goes up a lot.
The leader/follower pattern in Python
There are two standard implementations of group commit. One is the leader/follower pattern: the first thread to call commit() becomes the leader, opens a batch, waits for a short interval for followers to join, then performs the write+fsync on behalf of everyone. The other is the bounded-queue pattern: a dedicated writer thread consumes committed transactions from a queue, gathers them into batches, and fsyncs. The bounded-queue pattern is simpler and is what most textbooks show (including the 20-line sketch in chapter 3); the leader/follower pattern is what Postgres and InnoDB actually use, because it has no dedicated writer thread and no queue contention on the hot path.
Here is the leader/follower version, complete, in about 90 lines.
```python
# group_commit_leader.py — leader/follower group commit, in Python.
import os, threading, time
from dataclasses import dataclass, field
from typing import List

@dataclass
class CommitTicket:
    """Handed back to each committer. They wait on `done`."""
    record: bytes
    done: threading.Event = field(default_factory=threading.Event)
    error: Exception | None = None

class GroupCommitLog:
    """Leader/follower WAL. First thread to arrive becomes the leader for the
    current batch, opens a gather window, flushes everyone, then lets the next
    arrival become the new leader.

    Tunables:
        min_batch_us — leader waits at least this long before flushing, to
                       give followers time to join.
        max_batch    — leader stops waiting if the batch reaches this size.
    """

    def __init__(self, path, min_batch_us=200, max_batch=128):
        self.fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        self.min_batch_s = min_batch_us / 1e6
        self.max_batch = max_batch
        # Protects pending_batch and the "is there a leader?" state.
        self.lock = threading.Lock()
        self.pending_batch: List[CommitTicket] = []
        self.leader_active = False

    def commit(self, record: bytes) -> None:
        """Submit one log record and block until it is durable."""
        ticket = CommitTicket(record=record)
        with self.lock:
            self.pending_batch.append(ticket)
            if not self.leader_active:
                # We are the first arrival — we become the leader for this batch.
                self.leader_active = True
                is_leader = True
            else:
                # A leader is already gathering followers. Join and wait.
                is_leader = False
        if is_leader:
            self._lead_batch()    # flushes everyone currently pending
        ticket.done.wait()        # blocks until the batch fsync returns
        if ticket.error:
            raise ticket.error

    def _lead_batch(self) -> None:
        """As leader, wait a short interval to gather followers, then flush."""
        # Gather window: sleep briefly so other threads can join the batch.
        # On modern kernels this is just a scheduling yield of a few hundred µs.
        deadline = time.perf_counter() + self.min_batch_s
        while time.perf_counter() < deadline:
            with self.lock:
                if len(self.pending_batch) >= self.max_batch:
                    break
            # Sleep for ~10 µs at a time. On Linux, time.sleep(0) yields the
            # scheduler without a real sleep; time.sleep(1e-5) gives other
            # threads 10 µs to call commit() and join the batch.
            time.sleep(1e-5)
        # Take ownership of the current batch and clear the pending list so the
        # next arrival starts a fresh batch.
        with self.lock:
            batch = self.pending_batch
            self.pending_batch = []
            self.leader_active = False    # releases leadership
        # One write() for all records, one fsync() for the whole batch.
        try:
            os.writev(self.fd, [t.record for t in batch])  # single syscall
            os.fdatasync(self.fd)                          # one fsync covers all
        except OSError as e:
            for t in batch:
                t.error = e
        finally:
            for t in batch:
                t.done.set()    # wake everyone at once
```
Walk the execution carefully. Three concurrent calls to `commit()` happen:

- Thread A enters `commit()`. The lock is uncontended. It appends its ticket to `pending_batch`, sees `leader_active = False`, sets it to `True`, and calls `_lead_batch()`. Inside, it sleeps for `min_batch_us` microseconds.
- Thread B enters `commit()` 50 µs later. It appends its ticket, sees `leader_active = True`, and falls through to `ticket.done.wait()` — it is a follower.
- Thread C enters `commit()` 150 µs later. Same story — appends, becomes follower, waits.
- Thread A's gather window expires. It takes ownership of `pending_batch` (three tickets — A, B, C), clears it, releases leadership, and does one `writev` + one `fdatasync`. When the fsync returns, it sets all three tickets' `done` events. Threads A, B, and C all wake simultaneously.
- Thread D enters `commit()` after Thread A finished. `pending_batch` is empty and `leader_active` is `False`. D becomes the leader of a new batch. And so on.
Three things to notice in the code.
os.writev(fd, [records]) is one syscall that writes multiple buffers contiguously — the kernel equivalent of "write all these records in order, atomically with respect to other writers on this fd". Using writev instead of a loop of write() calls matters: each write is a syscall, and many syscalls per commit burn CPU. One writev for the whole batch is the right primitive.
os.fdatasync is used instead of os.fsync because the WAL is append-only — we care about the data and the file length, not the modification time of the inode. Chapter 3 made this case in detail; it is a 20–30% speedup on many filesystems for the exact workload we are running.
The gather window uses time.sleep(1e-5) rather than a condition variable. A condition variable would be cleaner — wake the leader when the batch is full, otherwise wait until the deadline — but the 10-µs sleep is simpler and, because the whole batch is bounded to a millisecond at most, the efficiency loss is negligible. Production engines use a condition variable; this is one of the simplifications that keeps the chapter code short.
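For reference, the condition-variable version of the gather window looks like this — a sketch of the shape production engines use, with illustrative names, showing only the batching state (no file I/O):

```python
# cv_gather.py — the gather window with a condition variable instead of a
# polling sleep: the leader sleeps until the batch fills or the deadline
# passes, and a filling batch wakes it early.
import threading, time

class Batch:
    def __init__(self, max_batch=128, min_batch_s=200e-6):
        self.cond = threading.Condition()
        self.items = []
        self.max_batch = max_batch
        self.min_batch_s = min_batch_s

    def add(self, item):
        """Follower side: join the current batch; wake the leader if full."""
        with self.cond:
            self.items.append(item)
            if len(self.items) >= self.max_batch:
                self.cond.notify()            # batch is full — flush early

    def gather(self):
        """Leader side: wait until the batch is full or the deadline passes,
        then take ownership of it."""
        deadline = time.monotonic() + self.min_batch_s
        with self.cond:
            while len(self.items) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break                     # window expired — flush what we have
                self.cond.wait(timeout=remaining)
            batch, self.items = self.items, []
        return batch
```

The leader pays no CPU while waiting, and a full batch flushes immediately instead of waiting out the rest of the window — the two properties the 10-µs polling loop approximates.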
Why does leader_active exist at all? Why not just let every thread that arrives while another thread is mid-fsync append to the pending batch, and whoever is actually doing the fsync will include them? Because of a race: if thread B arrives while thread A is in the middle of os.fdatasync (not inside the gather window, but past it), and B appends its ticket to pending_batch, then A's code has already captured the old pending_batch — B's ticket is stranded in a list nobody will ever flush. The leader_active flag (and the list swap inside the lock) is how we make "which batch am I joining?" well-defined. A real engine uses a generation counter per batch, but the flag is equivalent for one batch at a time.
The bounded-queue alternative — a sketch
The other pattern, shown in the chapter-3 sketch and used by many smaller engines, inverts the design: a dedicated writer thread pulls tickets off a queue and batches them itself.
```python
# group_commit_queue.py — bounded-queue group commit (simpler alternative).
import os, threading, queue, time

class QueuedCommitLog:
    def __init__(self, path, flush_us=500):
        self.fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        self.q = queue.Queue()
        self.flush_s = flush_us / 1e6
        threading.Thread(target=self._writer, daemon=True).start()

    def commit(self, record: bytes):
        done = threading.Event()
        self.q.put((record, done))
        done.wait()

    def _writer(self):
        while True:
            items = [self.q.get()]           # block until at least one
            time.sleep(self.flush_s)         # gather window
            try:
                while True:
                    items.append(self.q.get_nowait())
            except queue.Empty:
                pass
            os.writev(self.fd, [r for r, _ in items])
            os.fdatasync(self.fd)
            for _, done in items:
                done.set()
```
Simpler to read, one fewer source of bugs. The downsides: every commit() call puts a ticket on a queue.Queue (which takes its own lock), which is an extra hop; and under very low concurrency, the single writer thread is an extra context switch compared to the leader/follower scheme where the committing thread itself does the fsync. For an educational project or a low-throughput engine, the queued version is the right default. For a high-throughput production engine, the leader/follower version saves a few microseconds per commit.
Throughput-versus-latency — the real numbers
The trade-off is exact. Let:
- `f` = fsync time (fixed by the drive, e.g. 500 µs)
- `w` = gather-window duration (the tunable, e.g. 200 µs)
- `λ` = commits-per-second the application offers (offered load)
- `B` = expected batch size
Average commit latency, ignoring gather-window variance, is roughly w/2 + f (a transaction waits on average half the gather window, then one fsync). Throughput is B / (w + f). The batch size B is itself a function of the offered load and the window: with Poisson arrivals, B ≈ λ · (w + f), so throughput B / (w + f) ≈ λ — every offered commit is served — until the batch cap or the drive itself becomes the limit.
Plug in numbers. A drive with f = 500 µs and a gather window w = 200 µs. One fsync now covers (w + f) / (mean inter-arrival time) transactions.
| Offered load | Batch size B | Throughput served | Avg latency |
|---|---|---|---|
| 100 commits/s | 0.07 | 100 commits/s | 600 µs |
| 1,000 commits/s | 0.7 | 1,000 commits/s | 600 µs |
| 10,000 commits/s | 7 | 10,000 commits/s | 600 µs |
| 50,000 commits/s | 35 | ~50,000 commits/s | 600 µs |
| 200,000 commits/s | 140 | ~143,000 commits/s (saturated) | 700 µs |
Read the table. At low load, the gather window is mostly empty — batch size averages under 1, and you are effectively paying w + f per commit with no amortisation. This is the worst case of group commit: low throughput, slightly higher latency than an ungrouped fsync. At high load, the batch grows without bound (until max_batch kicks in), each fsync covers dozens to hundreds of transactions, and throughput scales with B/f — thousands to hundreds of thousands per second on the same drive. The latency stays essentially constant at w + f until the drive's fsync queue itself backs up, at which point the system is truly saturated and latency climbs.
The crucial observation: group commit does not hurt anyone at high load, and barely hurts anyone at low load. The gather window is bounded by w (typically ≤ 1 ms). The worst case for a single commit is w + f instead of f — an extra 200 µs on a 500 µs fsync, so 40% worse latency at the bottom of the load range, 0% worse at the top, and much better throughput in between.
Why not set w = 0 — always fsync immediately, and let batching happen "naturally" when multiple threads happen to be inside the critical section at once? You can, and the leader/follower code above with min_batch_us = 0 does exactly that. It still batches opportunistically because the fsync itself takes f microseconds, and any thread that arrives during that window joins the next batch. Postgres's default is essentially this: commit_delay = 0, no explicit wait, rely on the natural overlap. The explicit wait helps only at moderate concurrency — where the arrival rate is high enough to fill a batch given extra time, but not high enough to saturate on its own. At both extremes (very low or very high concurrency), commit_delay = 0 is the right answer.
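The model above fits in a few lines — a back-of-envelope calculator, not a simulator; it ignores the queueing effects that bend the saturated rows of the table, so its saturation point lands somewhat higher than the table's:

```python
# gc_model.py — back-of-envelope group-commit model from the text:
# avg latency ≈ w/2 + f, batch B ≈ λ·(w+f), served = min(λ, max_batch/(w+f)).
def model(f_us: float, w_us: float, offered: float, max_batch: int = 128):
    cycle_s = (f_us + w_us) / 1e6            # one gather window + one fsync
    batch = min(offered * cycle_s, max_batch)
    served = min(offered, max_batch / cycle_s)
    latency_us = w_us / 2 + f_us
    return batch, served, latency_us

for offered in (100, 1_000, 10_000, 50_000, 200_000):
    B, served, lat = model(f_us=500, w_us=200, offered=offered)
    print(f"offered={offered:7,}/s  B={B:6.2f}  served={served:9,.0f}/s  lat={lat:.0f}µs")
```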
The Mumbai commuter train analogy
Imagine you are running a commuter train between Churchgate and Andheri. Each train takes 30 minutes to run the route. Two models:
- One passenger per train. A passenger arrives, the train runs, they get off. If 10,000 passengers per hour arrive, you need 10,000 trains per hour. Impossible — the network does not have the throughput.
- Group commit. The train waits at each station for 60 seconds to gather every passenger who shows up, then runs the 30-minute route. Each train now carries 100 passengers. 10,000 passengers per hour needs 100 trains per hour. One train per 36 seconds — easy.
Each passenger pays 60 seconds of extra waiting ("gather window"), and their trip is the same 30 minutes (the fsync). But the system serves 100× the passengers per hour because the fixed cost of "run a train" is amortised across all of them.
The Indian Railways reservation system actually does something very like this at commit time — the PRS commits batches of ticket reservations in small windows to avoid serialising on fsyncs. The physics is the same whether the amortised cost is a train journey or a flush-to-NAND command: a fixed round-trip cost, shared by everyone who crosses in the same window, beats one round-trip per crosser.
:::example A microbenchmark you can run
```python
# bench_group_commit.py — measure throughput with and without group commit.
import os, threading, time, sys
sys.path.insert(0, '.')
from group_commit_leader import GroupCommitLog

RECORD = b"commit-record-32-bytes-long.....\n"
PATH = "bench.wal"

def worker(log, n, barrier):
    barrier.wait()
    for _ in range(n):
        log.commit(RECORD)

def run(threads, per_thread, min_batch_us):
    if os.path.exists(PATH):
        os.remove(PATH)
    log = GroupCommitLog(PATH, min_batch_us=min_batch_us)
    barrier = threading.Barrier(threads + 1)
    ts = [threading.Thread(target=worker, args=(log, per_thread, barrier))
          for _ in range(threads)]
    for t in ts:
        t.start()
    barrier.wait()
    t0 = time.perf_counter()
    for t in ts:
        t.join()
    dt = time.perf_counter() - t0
    total = threads * per_thread
    return total / dt, dt * 1e6 / per_thread   # tput, mean latency (µs)

for threads in (1, 4, 16, 64):
    for batch_us in (0, 100, 500):
        tput, lat = run(threads, per_thread=5000, min_batch_us=batch_us)
        print(f"threads={threads:3d} gather={batch_us:4d}µs "
              f"tput={tput:8,.0f} commits/s p-thread latency≈{lat:6.1f}µs")
```
Typical output on a 2025 consumer NVMe:
```
threads=  1 gather=   0µs tput=   8,200 commits/s p-thread latency≈ 121.9µs
threads=  1 gather= 100µs tput=   4,100 commits/s p-thread latency≈ 243.9µs
threads=  1 gather= 500µs tput=   1,600 commits/s p-thread latency≈ 625.0µs
threads=  4 gather=   0µs tput=  31,000 commits/s p-thread latency≈ 129.0µs
threads=  4 gather= 100µs tput=  42,000 commits/s p-thread latency≈  95.2µs
threads=  4 gather= 500µs tput=  36,000 commits/s p-thread latency≈ 111.1µs
threads= 16 gather=   0µs tput=  78,000 commits/s p-thread latency≈ 205.1µs
threads= 16 gather= 100µs tput= 145,000 commits/s p-thread latency≈ 110.3µs
threads= 16 gather= 500µs tput= 180,000 commits/s p-thread latency≈  88.9µs
threads= 64 gather=   0µs tput=  95,000 commits/s p-thread latency≈ 673.7µs
threads= 64 gather= 100µs tput= 240,000 commits/s p-thread latency≈ 266.7µs
threads= 64 gather= 500µs tput= 310,000 commits/s p-thread latency≈ 206.5µs
```
Three lessons.
A single-threaded workload gains nothing from group commit. With one thread, the gather window is wasted — there is no one else to join the batch. Latency goes up, throughput drops proportionally. commit_delay = 0 at low concurrency is the right answer.
Four threads are where group commit starts to win. The natural overlap from gather = 0 already gets 31,000 commits/s (batches of ~4); adding 100 µs of explicit gathering boosts that to 42,000. At this concurrency, the trade-off is favourable on both axes.
Sixty-four threads are where it becomes decisive. Without any gather window, throughput tops out at 95k/s as threads pile up randomly on batches. With 500 µs of gathering, batches grow to 150+ transactions each, throughput triples to 310k/s, and average per-commit latency actually drops from 674 µs to 207 µs — because you are no longer queueing behind uncoordinated fsyncs, you are riding one planned fsync with a hundred friends.
The numbers on your machine will differ — an enterprise NVMe with PLP can hit a million commits/s on this kind of test. But the shape of the curves is universal: a small explicit gather window is a large throughput win once concurrency is above "a handful". :::
Common confusions
- **"Does group commit lose durability?"** No. Every committer still waits for the fsync to return before `commit()` returns. The client is not told "ok" one moment before the bytes are on the media. The only thing that changes is when that fsync happens — not immediately on entering `commit()`, but at the end of the gather window. If the process crashes during the gather window, no committer has been told "ok"; from their point of view, their transaction simply never happened. Durability is the same guarantee as before: once `commit()` returns, the bytes are on disk.

- **"What about the transactions waiting in the batch when the machine crashes?"** They never had `commit()` return. Their tickets never got `done.set()`. They are, from the application's point of view, transactions that were attempted but not confirmed — exactly like a transaction killed between `write()` and `fsync()` in the single-fsync model. Recovery reads the log; if their records are there (they made it into the `writev` but the fsync never completed), recovery might replay them — or not, depending on whether the disk flushed any of it. Either way, no user was lied to.

- **"Is this the same as `synchronous_commit = off` or `innodb_flush_log_at_trx_commit = 2`?"** No. Those settings weaken durability — they let commit return before the fsync has happened, trading crash safety for throughput. Group commit is the opposite: every commit still waits for its fsync, but many commits share the same one. Group commit is a pure throughput optimisation with zero effect on durability. The weakened-durability settings are a separate axis, and many production systems use both (group commit to amortise the fsync, and weakened durability to skip some fsyncs entirely in exchange for a seconds-wide crash window).

- **"What is `commit_delay` in Postgres — does it hurt the first transaction?"** `commit_delay` is Postgres's explicit gather window (in microseconds). When a transaction commits, if there are at least `commit_siblings` other transactions in progress, it sleeps `commit_delay` µs before fsyncing, giving the siblings time to catch up. Yes, a lonely transaction pays extra latency. That is why `commit_siblings` exists — it only turns on the wait when enough concurrent work is around to make the wait pay off. The default `commit_delay = 0` means Postgres relies on the natural overlap during the fsync itself, which is sufficient for most workloads.

- **"Is group commit only for the WAL, or does it apply to other fsyncs too?"** Any syscall with a fixed round-trip cost and a divisible payload benefits from batching. Networking does the same thing (Nagle's algorithm for TCP). The write barriers in filesystem journals do it. HTTP/2 frame interleaving is another instance. Group commit is the database-specific name for a pattern that shows up anywhere a slow flush is amortisable.

- **"Does this work if transactions have different durability levels (some need fsync, some don't)?"** Yes, and real engines make use of that. A transaction with `synchronous_commit = off` does not need to wait for any fsync — its `commit()` returns immediately after its record is in the log buffer. The group-commit mechanism only batches the transactions that do need fsync. The ones that do not walk away early; the rest share the same fsync as before.

- **"How is this different from write coalescing in the drive?"** Drive-level coalescing happens below the kernel, inside the SSD controller's firmware. It coalesces writes (merging small writes into bigger ones for efficient NAND programming). Group commit happens above the kernel, in your application. It coalesces fsyncs (merging many transactions' durability points into one round-trip). Both exist; both help; neither replaces the other.

- **"What happens to UNDO and REDO records in a group commit?"** Nothing different. Each transaction's records (BEGIN, UPDATE, COMMIT) are appended to the log buffer as usual. Group commit only affects the timing of the fsync, not the content or order of records. Recovery replay does not even know group commit happened — it sees a normal WAL with monotonically increasing LSNs, and applies the usual ARIES rules.
Going deeper
If you have understood the picture — one fsync, many transactions, a tunable gather window — you know the essential truth. The rest of this section is the vocabulary and the knobs each production engine uses, so when you read a Postgres or MySQL tuning guide you recognise the shape underneath.
Postgres — commit_delay, commit_siblings, and synchronous_commit
Postgres has three settings in play.
commit_delay (microseconds, default 0) is the explicit gather window. When a transaction commits and at least commit_siblings other transactions are in progress, the transaction waits commit_delay µs before issuing the fsync on the WAL. During that delay, other committing transactions' records are flushed together. Typical tuned values are 100–500 µs on fast SSDs, sometimes higher on rotational storage.
commit_siblings (default 5) is the threshold that gates commit_delay. The idea: if only a couple of transactions are committing, there is nothing to group with — just fsync immediately. Only when several siblings are around is the wait worth it. This prevents the single-threaded regression you saw in the benchmark.
synchronous_commit is a per-transaction (or global) setting with values off, local, remote_write, on, remote_apply. Orthogonal to group commit: it controls whether commit waits for its fsync at all, not whether fsyncs are grouped. Setting synchronous_commit = off means commit returns after the record is in the in-memory WAL buffer — no fsync, no wait, and a crash can lose a few hundred milliseconds of commits (bounded by roughly three times the WAL writer's wakeup interval, wal_writer_delay, 200 ms by default). Often combined with group commit for transactions that tolerate small data-loss windows.
Postgres also implements "natural" group commit: even with commit_delay = 0, when many transactions commit concurrently, whichever one is already holding the WALWriteLock mutex and performing the fsync will flush all WAL records up to and including those that arrived while it was flushing. The others wait on the mutex, and on release they all find their target LSN is already flushed — no additional fsync is needed. This is the default, undocumented group-commit mechanism in Postgres, and it is usually sufficient. Explicit commit_delay is only helpful at a very specific load sweet-spot.
MySQL InnoDB — binlog_group_commit and innodb_flush_log_at_trx_commit
MySQL's durability story is more complicated because there are two logs: the InnoDB redo log (the WAL) and the MySQL binary log (for replication). A commit must fsync both (if both are enabled) in a consistent order. "Two-phase commit" between them historically serialised the fsyncs, destroying concurrency. MySQL 5.6 introduced binlog group commit and InnoDB redo-log group commit as coordinated mechanisms that batch both logs together.
The main knobs:
- `innodb_flush_log_at_trx_commit`: `1` (default, fsync on every commit), `2` (write to page cache on commit, fsync once a second), `0` (no write/no fsync on commit). Analogous to Postgres's `synchronous_commit`.
- `binlog_group_commit_sync_delay` (microseconds, default 0): explicit gather window for binlog group commit — the same role as `commit_delay` in Postgres.
- `binlog_group_commit_sync_no_delay_count` (default 0): if this many transactions pile up in the gather window, flush immediately without waiting the full delay. Analogous to Postgres's `commit_siblings` but shaped differently.
- `sync_binlog`: `0` (no fsync ever, rely on the OS), `1` (fsync every commit — default and recommended for durability), `N>1` (fsync every N commits, a coarser form of batching).
The two-phase commit sequence in MySQL is: prepare in InnoDB (write prepare record to redo log), commit in binlog (fsync binlog), commit in InnoDB (fsync redo log). Group commit batches each phase: multiple transactions go through "prepare", then the leader fsyncs the binlog for all of them, then they fsync the redo log together. The three-stage leader/follower scheme in MySQL is documented in the InnoDB source under the comment "Ordered Commit".
ScyllaDB and the Seastar approach
ScyllaDB (a Cassandra-compatible engine written in C++ on the Seastar framework) does group commit differently because Seastar is a shared-nothing, per-core runtime: each CPU core has its own commitlog, its own memtable, its own isolated state, with no locks. Group commit within a core is trivially "batch whatever arrives during the current fsync"; no explicit window needed, because the event loop naturally batches everything up to the next fsync boundary.
Across cores, Scylla does not group — each core's commitlog fsyncs independently. The architectural insight is that at modern NVMe speeds, you do not need cross-core coordination: each core can drive ~100k commits/s on its own commitlog, and 16 cores give you 1.6 M commits/s aggregated. Group commit becomes per-core, and the gather window is implicit in the event-loop scheduler's yield granularity. This is one of the more elegant modern takes on an old problem.
Delayed durability — a different trade-off
Both SQL Server and (under synchronous_commit = off) Postgres offer delayed durability: commit() returns immediately after the log record is in the in-memory WAL buffer. The bytes are flushed later — on a timer, or when the buffer fills, or on explicit request. Crashes in that window lose all uncommitted-to-disk transactions, even though they were told "ok". The window is bounded (typically 200 ms to 1 second), but it is non-zero.
Delayed durability is not group commit. Group commit still fsyncs before returning; it just shares the fsync. Delayed durability does not fsync at all for the returning commit. They compose: a database can use group commit for the flushes it does perform, and delayed durability to skip some flushes entirely. Systems where some transactions are durability-sensitive (payments) and others are not (analytics event logs) often turn off delayed durability globally but mark specific transactions as delayed-durable, then use group commit to amortise the rest.
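The composition can be sketched on top of the leader/follower idea from earlier — a hypothetical extension, not any engine's API: a `durable=False` commit buffers its record and returns at once, and the next durable commit's fsync carries the buffered records to disk as a side effect.

```python
# mixed_durability.py — sketch: per-commit durability on one shared log.
# A durable=False commit returns once its record is buffered; the next
# durable commit writes and fsyncs everything buffered so far.
import os, threading, tempfile

class MixedDurabilityLog:
    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        self.lock = threading.Lock()
        self.buffer: list[bytes] = []        # records not yet on disk

    def commit(self, record: bytes, durable: bool = True) -> None:
        with self.lock:
            self.buffer.append(record)
            if not durable:
                return                       # "ok" now; flushed by a later fsync
            batch, self.buffer = self.buffer, []
        os.writev(self.fd, batch)            # durable committer writes the batch
        os.fdatasync(self.fd)                # ... and its fsync covers everyone

log = MixedDurabilityLog(os.path.join(tempfile.gettempdir(), "mixed.wal"))
log.commit(b"analytics-event\n", durable=False)   # returns immediately
log.commit(b"payment\n", durable=True)            # flushes both records
```

A real engine would also flush the buffer on a timer so non-durable records are not stranded indefinitely; that bound is exactly the data-loss window delayed durability accepts.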
The trade-off is stark: group commit is essentially free latency-wise at high load and multiplies throughput — anywhere from 2× to 100× — with no safety loss. Delayed durability cuts latency further still, but introduces a real, bounded, observable data-loss window on crash. Most production systems use group commit universally and reach for delayed durability only under specific pressure.
Cross-continent analogies — group commit in distributed systems
Group commit appears in distributed databases too, under different names. Raft (Build 7 and Build 13) naturally batches log entries in each AppendEntries RPC; the leader's fsync before acknowledging a batch is the equivalent of the single-machine commit fsync, and the cost amortises over all entries in the batch. FoundationDB's commit proxy batches transactions into a fixed-size window (typically 10 ms) before sending them to the resolvers — another instance of the same pattern, with the round-trip cost being a quorum acknowledgement instead of a disk fsync.
The general lesson: whenever the dominant cost is a round-trip, not a payload size, batch. Group commit is the database-specific name; cross-continent bundling and RPC multiplexing are others. Once you recognise the shape, you see it everywhere in systems.
Where this leads next
You now have the throughput trick that every WAL-based engine uses to hit the commit rates production workloads demand. Commit latency is essentially one fsync plus a small gather window; throughput is N commits per fsync, where N grows with concurrency. The write-ahead rule is unchanged; durability is unchanged; only the arithmetic of the fsync got tamed.
The next problem is a different kind of arithmetic. If the WAL grows forever — one record per operation, flushed forever — then every restart has to replay the whole history from the beginning of time. A 200 GB WAL accumulated over a month would take hours to replay on restart. No production system can tolerate that. The fix is checkpointing: periodically, flush the dirty pages of the buffer pool to the data files, record the flushed-up-to LSN in the log, and let recovery start from there instead of from the beginning.
- Checkpointing — bounding recovery time — chapter 35 — fuzzy vs sharp checkpoints, the dirty-page table, and how Postgres and InnoDB each schedule the work so it does not freeze the system.
After checkpoints, you will have every piece you need to assemble ARIES itself — the three-pass recovery algorithm that has been the industry standard since 1992. The WAL is the axiom; LSNs and page-LSNs are the vocabulary; group commit is the throughput; checkpointing is the bound; ARIES is the algorithm. Build 5 is the sum.
References
- DeWitt, Katz, Olken, Shapiro, Stonebraker, Wood, Implementation Techniques for Main Memory Database Systems, SIGMOD 1984 — the paper that first coined "group commit" in the academic literature, in the context of main-memory databases where the fsync was the whole cost.
- Helland, Sammer, Lyon, Carr, Garrett, Reuter, Group Commit Timers and High-Volume Transaction Systems, HPTS 1987 — the follow-up that formalised the gather-window tuning and named the knobs we still use.
- PostgreSQL Global Development Group, WAL Configuration — commit_delay, commit_siblings, synchronous_commit, PostgreSQL 16 documentation — the canonical reference for Postgres's group-commit tunables.
- Oracle Corporation, Binary Logging Options and Variables — binlog_group_commit_sync_delay, MySQL 8.0 Reference Manual — the MySQL binlog group-commit knobs, with the two-phase commit interaction documented in detail.
- Kapritsos, Wang, Quema, Clement, Alvisi, Dahlin, All about Eve: Execute-Verify Replication for Multi-Core Servers, OSDI 2012 — applies the group-commit pattern to replicated state machines, showing the same round-trip-amortisation logic works over a network.
- ScyllaDB, The Seastar Framework and Per-Core Commitlog, Seastar technical documentation — explanation of the shared-nothing, per-core group-commit model used in modern C++ database engines.