In short

Build 4 ended with a wall: a B+ tree split touches three to seven pages and no primitive in POSIX, Linux, or NVMe gives you multi-page atomicity. The fix every serious in-place engine has converged on is one sentence: write a record describing the change to a durable, sequential log before modifying any data page. That is the write-ahead rule. The log is append-only, small per operation (a few hundred bytes), and fsynced once per commit. The data pages can then be written in any order, with or without fsyncs between them — if the crash catches any subset of them, recovery reads the log and either redoes the missing writes from the log's after-image or undoes the partial ones from the log's before-image. The rule is sufficient to survive any crash because it reverses the source-of-truth relationship: the log is canonical, the data pages are a cache of what the log says. A crash that reaches the log durably means the transaction happened; a crash that did not reach the log means it did not happen; no in-between state is reachable. That reversal also buys you the freedom to use the fast corner of the durability matrix — no-force (do not flush data pages at commit, flush only the log record) and steal (let the buffer pool write dirty uncommitted pages out whenever it likes) — which is what Postgres, InnoDB, SQL Server, and every modern relational engine actually do. This chapter states the rule, sketches REDO and UNDO in one sentence each, draws the 2×2 force/steal matrix and shows why WAL lands you at no-force/steal, and builds a 50-line Python toy WAL for a single key-value store that you can crash in the middle and watch recover.

You have just spent Build 4 watching a B+ tree split corrupt itself three different ways under a crash. Torn writes, half-landed page sets, orphan leaves, parents pointing at the wrong children. By the end of chapter 31, one observation was unavoidable: no arrangement of per-page fsyncs, no amount of clever ordering, no torn-write defence closes the multi-page atomicity gap. The storage stack does not expose a primitive that commits N pages together. If you want that primitive, you must build it yourself, in software, on top of the primitives the stack does give you — sequential append and fsync.

The primitive you build is called the write-ahead log, and the whole of Build 5 is spent formalising it. But before the formalism, there is one sentence. Write it down, tape it above your monitor, read it before every commit to a recovery code path you ever touch:

Write a record describing the change to a durable, sequential log. Then apply the change to the data pages.

That is the write-ahead rule. Nothing else in Build 5 — not log records, not LSNs, not checkpoints, not ARIES — makes sense without it. Get the rule wrong and every sophistication you layer on top inherits the bug. Get the rule right and every crash becomes a replay problem, not a corruption problem.

This chapter does five things. It states the rule precisely. It explains in one paragraph why the rule is sufficient to recover from any crash. It introduces REDO and UNDO in one sentence each — the two operations recovery does. It draws the 2×2 force/steal matrix and shows where the rule places you. And it builds a ~50-line Python toy WAL for a single key-value store, so you can see the whole of Build 5 in one screenful of code.

The rule, stated precisely

The informal version — log the change before applying it — hides a subtlety that matters on day one. Here is the precise version, which is what every in-place recovery algorithm actually obeys:

No dirty data page may be written to disk before the log record describing that change has been durably written to the log.

Three things in that sentence are load-bearing.

"No dirty data page may be written to disk." The constraint is on the flush of the data page, not on the modification in memory. You are free to update the page in RAM — flip bits, add records, rearrange the slot array — the moment the transaction wants to. You are only forbidden from persisting that page to disk before the log record is durable. The buffer pool can still hand out the page to readers. Latches still work the same way. What changes is the policy governing when a dirty buffer-pool page may be written back to the tablespace: it may not, until the log record for its latest change has been fsynced.

"Durably written to the log." Durably means: fsync on the log file has returned. It is not enough that the log record is in the log buffer (in RAM, where a crash will lose it). It is not enough that it has been write()d to the kernel page cache (chapter 3 taught you why). The log fsync must have completed. This fsync is the moment the transaction is committed — not the moment the data pages land, which can be milliseconds or minutes later.

"Describing that change." The log record must contain enough information to either redo the change from scratch (if the data page did not land) or undo it (if the data page landed but the transaction aborted). In the simplest form, that means the record carries both the before-image and the after-image of whatever it changed — the old bytes and the new bytes. Clever engines pack this down to deltas, but the invariant is: from the log alone, recovery can reconstruct the intended final state.
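A minimal sketch of such a record, with illustrative field names (no real engine's format), makes the two directions concrete:

```python
import json

# A hypothetical log record for one page change, carrying both images.
# Field names are illustrative; real engines use compact binary formats.
record = {
    "txn_id": 7,
    "page_id": 42,
    "offset": 128,
    "before": "4f4c44",   # old bytes at that offset (hex) -> enables UNDO
    "after":  "4e4557",   # new bytes at that offset (hex) -> enables REDO
}

# From the log alone, recovery can reconstruct either direction:
def redo(page: bytearray, rec: dict) -> None:
    new = bytes.fromhex(rec["after"])
    page[rec["offset"]:rec["offset"] + len(new)] = new

def undo(page: bytearray, rec: dict) -> None:
    old = bytes.fromhex(rec["before"])
    page[rec["offset"]:rec["offset"] + len(old)] = old

# The record round-trips through the log as a few hundred bytes at most.
serialized = json.dumps(record).encode()
```

Note that the same record serves both recovery operations; which one fires depends only on whether the transaction committed, as the next sections spell out.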

The rule is a constraint on ordering. Read it aloud as an ordering on events:

t0 : transaction modifies page P in buffer pool
t1 : log record for the change is appended to the log buffer
t2 : log buffer is fsync'ed   ← commit point
t3 : page P (dirty) is written to the tablespace  ← MAY NOT precede t2
t4 : tablespace fsync (eventually, on checkpoint or eviction)

The forbidden order is t3 < t2. Any code path that could let a dirty page reach disk before its log record is fsynced violates the rule and breaks recovery. Every WAL-based engine's buffer pool has latches and page-LSN checks specifically to forbid that ordering.
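A toy version of that check (a sketch; real engines key it on per-page LSNs, which arrive in a later chapter) shows the shape of the enforcement:

```python
class LogAndBufferPool:
    """Sketch: the page-flush path refuses to let t3 precede t2."""
    def __init__(self):
        self.log = []               # appended records (t1: log buffer)
        self.flushed_upto = 0       # how many records are durable (t2)
        self.page_last_rec = {}     # page id -> index of its latest log record

    def append(self, page_id, rec):
        self.log.append(rec)                        # t1: record in log buffer
        self.page_last_rec[page_id] = len(self.log)

    def fsync_log(self):
        self.flushed_upto = len(self.log)           # t2: commit point

    def flush_page(self, page_id):
        needed = self.page_last_rec.get(page_id, 0)
        if needed > self.flushed_upto:
            self.fsync_log()        # force the log first: never t3 < t2
        # ... now the dirty page may be written to the tablespace (t3)
```

The invariant lives in flush_page: the data-page write is gated on the log being durable at least up to the page's latest record, never the other way round.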

Why does the rule phrase itself in terms of disk writes rather than memory modifications? Because the purpose of the rule is to make recovery work, and recovery only sees what is on disk. If the transaction modifies a page in RAM but the crash happens before that page reaches disk, recovery finds the page in its old state on disk — no special handling needed; from disk's point of view, the change simply never happened. The only state recovery cannot handle is the one where the data page moved on disk but the log record describing why did not. The rule exists solely to make that state unreachable.

Why this is sufficient

The sufficiency argument is worth making explicit, because it is a single paragraph — and that paragraph is the whole of Build 5.

Suppose the power cuts at some arbitrary instant. Two questions matter: did log record R make it durably to the log? and did the corresponding data page change make it to the tablespace? That is a 2×2 table of possibilities. Walk the four cells.

                           data page change on disk?
                     YES                          NO

log record    YES    COMMITTED + APPLIED          COMMITTED + PAGE LOST
durable?             log says: change happened    log says: change happened
                     disk says: change happened   disk says: old bytes
                     recovery: no-op              recovery: REDO from log

              NO     IMPOSSIBLE                   NEVER HAPPENED
                     log says: nothing            log says: nothing
                     disk says: change happened   disk says: old bytes
                     rule forbids this order      recovery: no-op

Of the four cells, one is impossible by construction and the other three are trivially handled: scan the log, redo anything whose page did not land, leave everything else alone.
Why the write-ahead rule is sufficient. The forbidden cell — data on disk, log not on disk — is exactly the one that would be catastrophic (a data change with no way to understand or undo it). The rule constructs the ordering that makes that cell unreachable. The remaining three cells are all handled by one simple recovery loop.

Cell A — log durable, data durable. The transaction committed and its pages landed. Recovery opens the database, scans the log, sees a committed record whose after-image matches what is on the page, and leaves everything alone. No work.

Cell B — log durable, data not durable. The transaction committed (its log fsync returned) but one or more of its dirty pages never reached the tablespace before the crash. Recovery scans the log, sees the committed record, reads the data page, notices it is still in the pre-change state, and redoes the change from the log's after-image. The page is fixed; the commit stands.

Cell C — log not durable, data durable. The rule forbids this. The data page could only reach disk after its log record was fsynced; if the log record did not reach disk, the buffer pool would have refused to write the page out. By construction, this cell is unreachable.

Cell D — log not durable, data not durable. The transaction never committed from recovery's point of view. The log has no record of it; the data pages are in their pre-change state. Recovery leaves everything alone. From the client's perspective, the commit was never acknowledged (because the log fsync never returned), so no lie was told.

Three cells are trivially handled; the fourth is impossible by construction. That is the whole recovery argument. Every elaboration in Build 5 — LSNs to remember what has already been replayed, checkpoints to bound how far back recovery scans, fuzzy checkpoints to let it run concurrently with normal work, ARIES's three-pass structure to handle uncommitted transactions — is machinery to make this simple argument efficient on a terabyte database. The argument itself is the table above.

Why is the log the source of truth rather than the data pages? Because the log is the thing that gets fsynced on commit; the data pages are flushed later, opportunistically, in batches. If you made the data pages canonical, every commit would have to wait for every dirty page it touched to fsync — which might be dozens of pages scattered across the file, each with its own disk seek and flush overhead. Making the log canonical means commit waits for one fsync on one sequential file, and the data pages become a cache that can fall behind safely. This is the throughput argument for WAL, and it is why group commit (chapter 34) can then amortise even that one fsync across many transactions.

REDO and UNDO, one sentence each

Recovery does exactly two things to data pages, no more:

REDO: apply a log record's after-image to its data page when the page on disk is older than the log record. "The log says the page should look like X; the page on disk looks like Y, which is older; overwrite it with X."

UNDO: apply a log record's before-image to its data page when the transaction that wrote it did not commit. "The log says an uncommitted transaction changed the page; restore the before-image so the change is rolled back."

REDO handles Cell B above — committed changes whose pages did not land. UNDO handles a different hazard we have not yet named: steal, the case where a dirty page from an uncommitted transaction was written to disk (because the buffer pool needed its slot for something else) and then the transaction rolled back or crashed. That page on disk now reflects a change the user never committed; UNDO erases it.

These two sentences are the entire vocabulary of recovery. Every log record type in every engine is some combination of "carries information to enable redo", "carries information to enable undo", or both. Chapter 33 catalogues the shapes; for this chapter, the one-sentence definitions are enough.
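The two sentences combine into a single recovery loop. A sketch, assuming illustrative record shapes with before/after images and per-transaction COMMIT markers (this is not ARIES, just the two definitions above in code):

```python
# Illustrative record shapes: each change carries a key, a before-image,
# and an after-image; a COMMIT marker closes a transaction.
log = [
    {"txn": 1, "key": "a", "before": None, "after": "x"},
    {"txn": 1, "type": "COMMIT"},
    {"txn": 2, "key": "b", "before": None, "after": "y"},  # never committed
]

committed = {r["txn"] for r in log if r.get("type") == "COMMIT"}

# Simulated on-disk state after the crash: txn 2's dirty page was
# stolen to disk before the transaction could commit.
page = {"b": "y"}

for rec in log:
    if rec.get("type") == "COMMIT":
        continue
    if rec["txn"] in committed:
        page[rec["key"]] = rec["after"]        # REDO: install the after-image
    elif rec["before"] is None:
        page.pop(rec["key"], None)             # UNDO: key did not exist before
    else:
        page[rec["key"]] = rec["before"]       # UNDO: restore the before-image

assert page == {"a": "x"}   # committed change redone, stolen change undone
```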

The force/steal matrix and where WAL lands you

Recovery literature uses two axes to classify how a storage engine handles dirty pages at commit and under buffer-pool pressure. They are the force/no-force axis and the steal/no-steal axis, and together they partition storage engines into four families.

Force vs no-force is about what happens at commit. Force means every dirty page touched by the committing transaction is written to disk (and fsynced) before the commit is acknowledged. No-force means commit acknowledges without waiting for the data pages — only the log record is forced. Force is slow (many scattered disk writes per commit) but durable even without a log. No-force is fast (one sequential log fsync) but needs a log for REDO on crash.

Steal vs no-steal is about what happens under buffer-pool pressure. No-steal means the buffer pool may not write a dirty page to disk while any of its changes belong to an uncommitted transaction. Steal means the buffer pool may evict and write out any dirty page at any time, regardless of transaction state. No-steal is simple but restricts eviction (what if the buffer pool runs out of room and every page belongs to an uncommitted long transaction?). Steal is flexible but requires UNDO for recovery, because a stolen page from a later-aborted transaction must be rolled back from disk.

The 2×2 gives you four cells. Three of them are pathological; one is what every modern engine does.

                                 commit-time policy
                       FORCE                          NO-FORCE
                       (flush dirty pages             (flush only the log
                        at commit)                     at commit)

buffer-pool  NO-STEAL  no REDO, no UNDO               REDO only
eviction     (don't    simple, no log needed          fast commits
policy       write     slow commits (many fsyncs)     rigid buffer pool
             uncomm.   rigid buffer pool              long txns starve eviction
             pages)    textbook toy engines           rare in practice

             STEAL     UNDO only                      REDO + UNDO
             (write    slow commits                   fast commits (1 log fsync)
             any dirty flexible buffer pool           flexible buffer pool
             page at   fsync tax on every commit      group-commit amortisable
             any time) System R prototype (1977)      Postgres, InnoDB, SQL Server

The WAL pays the cost of writing both REDO and UNDO information — and in exchange lives in the fast, flexible corner.
The force/steal matrix. No-force/steal is where every modern WAL-based engine lives. The price is that the log records must carry enough information for both REDO (to recover commits whose data pages did not land) and UNDO (to roll back uncommitted changes whose dirty pages leaked to disk via steal). The payoff is a commit that costs one sequential log fsync and a buffer pool with no eviction restrictions.

Force / no-steal (top-left). The engine flushes every dirty page at commit, and refuses to evict any uncommitted-transaction page. Commits are slow (many fsyncs per commit). Buffer pool is rigid (one long transaction can pin arbitrarily many pages and starve eviction). Advantage: recovery does nothing — every committed change is on disk, every uncommitted change is in RAM and discarded on crash. No REDO, no UNDO, no log needed. This is the textbook toy engine; no production database does it.

Force / steal (bottom-left). Slow commits but flexible eviction. Needs UNDO (to roll back stolen uncommitted pages) but not REDO (because force guarantees committed changes are on disk). Rare combination.

No-force / no-steal (top-right). Fast commits and no UNDO — but the buffer pool cannot evict uncommitted pages, which is fragile under long transactions. Also rare.

No-force / steal (bottom-right). Fast commits, flexible eviction. Needs both REDO (committed changes that did not reach disk) and UNDO (stolen uncommitted changes that did reach disk). Every serious production engine lives in this cell: InnoDB, Postgres, SQL Server, Oracle, DB2. The WAL is what makes it safe: every change — committed or not — has a log record describing both its before and after. Recovery uses the before for UNDO and the after for REDO.

The write-ahead rule is the enabling invariant for this cell. Without it, the stolen uncommitted page on disk has no before-image to roll back to, and the missing-committed page on disk has no after-image to redo from. With it, every operation recovery needs to do is reconstructible from the log alone.

Why do engines pay the cost of REDO+UNDO rather than picking a simpler cell? Because the cost is paid in log bandwidth (sequential, cheap) and the benefit is throughput and operational flexibility. A storage engine that cannot evict uncommitted pages cannot safely run a transaction that touches more rows than fit in the buffer pool — which is any analytical query, any bulk load, any long-running migration. A storage engine that forces dirty pages at commit pays commit latency proportional to how many pages the transaction touched, which is unusable under OLTP load. No-force/steal is the only cell compatible with the workloads production databases are actually asked to run. The write-ahead rule is the price of admission.

A 50-line Python WAL for a single KV store

Enough theory. Let's build the smallest WAL that demonstrably survives a crash. The system has one in-memory dict as the "data page", an append-only log file as the WAL, and a trivial recovery routine that replays the log at startup.

# wal_kv.py — the write-ahead rule in 50 lines.
import os, json, struct

class WAL_KV:
    LOG_PATH  = "wal.log"        # sequential, append-only
    DATA_PATH = "data.json"      # the "data page" — snapshot of committed state

    def __init__(self):
        self.mem = {}            # in-memory KV (the buffer pool of one page)
        self._load_snapshot()    # read the last durable snapshot, if any
        self._recover()          # replay the log to catch up

    def _load_snapshot(self):
        if os.path.exists(self.DATA_PATH):
            with open(self.DATA_PATH, "r") as f:
                self.mem = json.load(f)

    def _recover(self):
        """REDO pass: apply every committed log record to in-memory state."""
        if not os.path.exists(self.LOG_PATH): return
        with open(self.LOG_PATH, "rb") as f:
            while hdr := f.read(4):
                if len(hdr) < 4: break       # torn tail: partial length prefix
                (n,) = struct.unpack(">I", hdr)
                body = f.read(n)
                if len(body) < n: break      # torn tail: partial record body
                rec = json.loads(body)
                if rec["type"] == "PUT":    self.mem[rec["k"]] = rec["v"]
                elif rec["type"] == "DEL":  self.mem.pop(rec["k"], None)
                # COMMIT records mark the end of a transaction — nothing to apply.

    # --- the write-ahead rule: log FIRST, then modify memory -----------------
    def _log_append(self, rec: dict):
        body = json.dumps(rec).encode()
        fd = os.open(self.LOG_PATH, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            os.write(fd, struct.pack(">I", len(body)) + body)
            os.fsync(fd)         # ← durable BEFORE we touch self.mem
        finally:
            os.close(fd)

    def put(self, k, v):
        self._log_append({"type": "PUT", "k": k, "v": v})   # log first
        self._log_append({"type": "COMMIT"})                 # commit marker
        self.mem[k] = v                                      # then apply

    def delete(self, k):
        self._log_append({"type": "DEL", "k": k})
        self._log_append({"type": "COMMIT"})
        self.mem.pop(k, None)

    def get(self, k): return self.mem.get(k)

    def checkpoint(self):
        """Write the in-memory state to disk and truncate the log."""
        tmp = self.DATA_PATH + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.mem, f)
            f.flush()
            os.fsync(f.fileno())             # snapshot durable BEFORE log truncation
        os.replace(tmp, self.DATA_PATH)      # atomic rename
        # only after the snapshot is durable may the log be discarded
        open(self.LOG_PATH, "wb").close()    # truncate

Read it line by line. Forty-odd lines of code; every line is doing one of the things we named above.

_log_append is the write-ahead rule made flesh. It writes a length-prefixed record to the log file and fsyncs it. The fsync returns before put() modifies self.mem. If the power cuts after the log fsync but before self.mem[k] = v, no harm done — memory is lost on crash anyway; recovery replays the log from disk and sets self.mem[k] to the same value.

put calls _log_append twice: once for the change, once for a COMMIT marker. The commit marker is what tells recovery these records are committed, replay them. If the crash happens after the change record is logged but before the commit record is logged, recovery sees an uncommitted change and should ignore it — in this toy, we do not strictly enforce that (we replay every PUT), but a real WAL uses the commit marker to know which records to apply.
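For illustration, here is the stricter, commit-aware replay the toy skips: buffer each record and apply it only once its COMMIT marker is seen (a sketch using the toy's record shapes, not its actual _recover):

```python
def replay_committed(records):
    """Apply PUT/DEL records only if followed by their COMMIT marker.
    In the toy every operation is its own transaction, so 'pending'
    holds at most the records since the last COMMIT."""
    mem, pending = {}, []
    for rec in records:
        if rec["type"] == "COMMIT":
            for r in pending:                        # transaction is durable: apply
                if r["type"] == "PUT":   mem[r["k"]] = r["v"]
                elif r["type"] == "DEL": mem.pop(r["k"], None)
            pending = []
        else:
            pending.append(rec)                      # hold until COMMIT is seen
    return mem                                       # a trailing 'pending' is dropped

# A crash between the PUT record and its COMMIT marker leaves the PUT uncommitted:
crashed_log = [{"type": "PUT", "k": "a", "v": 1}, {"type": "COMMIT"},
               {"type": "PUT", "k": "b", "v": 2}]   # no COMMIT for this one
assert replay_committed(crashed_log) == {"a": 1}
```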

_recover walks the log from the start, decodes every record, and applies it to self.mem. This is a pure-REDO pass: no UNDO, because this toy only commits atomically (one PUT, one COMMIT, no partial transactions). A real engine's recovery would be ARIES's three passes (chapter 37); ours is a single pass of a few lines.

checkpoint writes the in-memory state out as a snapshot and truncates the log. The next startup reads the snapshot and has no log to replay. This bounds how long recovery takes — without checkpoints, the log grows forever and every restart is slower than the last. Chapter 35 is the whole story of checkpointing; here it is a handful of lines at the end of the class.

Crashing it on purpose

Here is a test program that uses WAL_KV, injects a crash, and verifies recovery.

# crash_test.py — prove the WAL survives a crash mid-operation.
import os, subprocess, sys

# Clean slate.
for p in ("wal.log", "data.json"):
    if os.path.exists(p): os.remove(p)

# Phase 1: write some data, then die.
proc = subprocess.Popen([sys.executable, "-c", """
import os, sys
sys.path.insert(0, '.')
from wal_kv import WAL_KV
db = WAL_KV()
db.put('city', 'Bengaluru')
db.put('pincode', '560001')
db.put('temp_c', '24')
os.kill(os.getpid(), 9)   # simulate power cut — NO checkpoint, NO clean shutdown
"""])
proc.wait()

# Phase 2: reopen and check.
from wal_kv import WAL_KV
db = WAL_KV()
print("after crash + recovery:")
print("  city   =", db.get('city'))
print("  pin    =", db.get('pincode'))
print("  temp   =", db.get('temp_c'))

Output:

after crash + recovery:
  city   = Bengaluru
  pin    = 560001
  temp   = 24

The SIGKILL at the end of phase 1 means Python never got to run any shutdown hook. The in-memory self.mem dict died with the process. Nothing was snapshotted to data.json. The only thing on disk was wal.log. On reopen, _recover replayed the three PUT records, reconstructed self.mem, and the get() calls returned the right values.

Now modify _log_append to violate the write-ahead rule: move the os.write and os.fsync calls so they run after put modifies self.mem. Run the same test with an extra twist: kill the process between the self.mem[k] = v line and the log write. (One nuance: a SIGKILL loses only process memory — bytes already write()d to the kernel page cache survive a process death, so to model a power cut the log write itself must not yet have run.) Now the memory change did happen (briefly, in RAM, before the process died) but the log record did not. On recovery, the log is short by one record, and the "data page" (the snapshot) never saw the change either. The PUT evaporates.
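The same evaporation can be shown deterministically in-process, without a real SIGKILL, by modelling the durable log and the crash point explicitly (a simulation with hypothetical names, not the file-backed toy):

```python
class CrashMidPut(Exception):
    pass

def broken_put(mem, durable_log, k, v, crash=False):
    """Rule VIOLATED: memory is modified before the log is durable."""
    mem[k] = v                          # apply first (wrong order)
    if crash:
        raise CrashMidPut()             # power cut before the log write
    durable_log.append({"type": "PUT", "k": k, "v": v})

def recover(durable_log):
    mem = {}
    for rec in durable_log:
        if rec["type"] == "PUT":
            mem[rec["k"]] = rec["v"]
    return mem

mem, log = {}, []
broken_put(mem, log, "city", "Bengaluru")
try:
    broken_put(mem, log, "pincode", "560001", crash=True)
except CrashMidPut:
    pass                                # process dies; mem is gone with it
recovered = recover(log)                # reboot: replay the durable log only
assert recovered == {"city": "Bengaluru"}   # the second PUT evaporated
```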

This is the experiment that justifies the rule. Every production database has a harness like this running continuously, because the rule is easy to state and a nightmare to get right under real concurrency.

Putting it all together — a crash-before vs crash-after example

Consider a single db.put("balance", "500") call. The instruction sequence inside _log_append plus put is roughly:

1. append log record   (in-memory log buffer)
2. write() to log fd   (bytes now in kernel page cache)
3. fsync(log fd)       ← durability commits here
4. return from _log_append
5. self.mem["balance"] = "500"
6. return from put

Crash between 1 and 3. The log record exists only in RAM — kernel page cache at best, not on the device. Reboot. The log file on disk does not contain the record. Recovery replays an empty log, self.mem has no "balance" entry, get("balance") returns None. The user's put call never returned, so no commit was acknowledged; no lie is told. The transaction is as if it never happened.

Crash after 3, before 5. The log record is on disk. self.mem was never updated. Reboot. Recovery opens the log, reads the PUT record, applies self.mem["balance"] = "500", and get("balance") returns "500". The user's put returned (because the log fsync returned), the commit was acknowledged, and the value is preserved.

Crash after 5. self.mem was updated; the log record is on disk. Reboot. Recovery replays the log, sets self.mem["balance"] = "500" again (idempotent), and get("balance") returns "500". Same outcome as the previous case.
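Idempotence is doing quiet work here: recovery can replay a record without first checking whether it already landed, because installing an after-image twice produces the same state as installing it once:

```python
def replay(records, state=None):
    """Pure-REDO replay: installing an after-image is idempotent."""
    state = {} if state is None else state
    for rec in records:
        state[rec["k"]] = rec["v"]       # overwrite with the after-image
    return state

log = [{"k": "balance", "v": "500"}]
once  = replay(log)
twice = replay(log, replay(log))         # crash during recovery, then recover again
assert once == twice == {"balance": "500"}
```

This is also why a crash during recovery itself is harmless in this scheme: the restart simply replays the log again from the start.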

The only moment the user has been told "committed" is after step 3. At every crash point where the user was told "committed", recovery produces the committed state. At every crash point where the user was not told "committed", recovery produces the pre-commit state. No in-between. That is what the write-ahead rule buys you.

Going deeper

This section is for the reader who wants to know how Postgres, MySQL, and SQLite each implement the write-ahead rule in practice, and how the rule interacts with group commit — the single most important performance optimisation in transactional storage.

Postgres's WAL

Postgres's write-ahead log lives in the pg_wal directory as a sequence of 16 MiB segment files named by a monotonically increasing identifier. Every DML operation — INSERT, UPDATE, DELETE, index modification, B-tree split, vacuum — emits one or more WAL records into an in-memory WAL buffer. The buffer is flushed to disk at three moments: on transaction commit (XLogFlush up to the commit LSN), when the buffer fills (wal_buffers threshold), and on the WAL writer's periodic wakeup (wal_writer_delay, default 200 ms).

Postgres enforces the write-ahead rule via a field called PageLSN on every data page. When the buffer pool considers flushing a dirty page to the tablespace, it first checks the page's LSN — the log sequence number of the latest WAL record that modified the page — against the current flushed-WAL LSN. If the page's LSN is greater than the flushed-WAL LSN, the buffer pool must fsync the WAL up to the page's LSN before writing the page out. This check is the write-ahead rule in code, and it lives in src/backend/storage/buffer/bufmgr.c around the FlushBuffer function.

The synchronous_commit GUC controls how aggressively commit waits for the fsync. Values: off (commit returns immediately, the fsync happens later — fastest, with a small data-loss window on crash), local (fsync on the local WAL only), on (the default; equivalent to local when there is no synchronous replication), remote_write (wait for a replica to write the WAL), remote_apply (wait for a replica to apply it). These knobs are the production-facing side of the write-ahead rule: the rule itself is absolute (always log before page), but the durability of the log fsync is the tunable.

InnoDB's redo log and the mini-transaction framework

MySQL's InnoDB groups physical changes into mini-transactions (mtr). An mtr is a bounded sequence of page modifications that must commit atomically from recovery's perspective; a B+ tree split is one mtr, a row update is another. During an mtr, the engine collects redo log records in a per-mtr buffer; on mtr commit, the buffer is appended to the global redo log atomically (under the log mutex). Only after the mtr's redo records are in the global redo log may the mtr's dirty pages be flushed to the tablespace.

The write-ahead rule is encoded in the mtr framework: a page cannot be flushed until its dirtying mtr has committed and the global log has been flushed up to the mtr's LSN. This is checked on every buffer-pool flush path, in storage/innobase/buf/buf0flu.cc.

InnoDB's innodb_flush_log_at_trx_commit is the analogue of synchronous_commit: 1 means fsync on every commit (default, durable), 2 means write() to the kernel page cache on every commit but fsync only once per second (commits survive process crashes but not power cuts), 0 means do not even write() on commit (fastest, up to 1 second of loss). Again, the rule is absolute; the tunable is how aggressive the commit fsync is.

SQLite's WAL mode

SQLite has two durability modes. The older rollback journal mode is force/steal: before any modification, SQLite copies the pre-change pages to a journal file, fsyncs it, modifies the data file in place, fsyncs that, deletes the journal. The journal file is for UNDO only (there is no REDO — the modifications are force-flushed before commit completes), which is exactly the UNDO-only cell of the matrix above.

The newer WAL mode (introduced in 2010, enabled with PRAGMA journal_mode=WAL, and the recommended choice for most workloads) inverts the design to match the modern no-force/steal pattern. All writes go to a separate -wal file, appended to sequentially. Readers read from both the main data file and the WAL (newer values in the WAL override older ones in the data file). A checkpoint operation periodically copies pages from the WAL back into the main data file and truncates the WAL.

SQLite's WAL mode is the purest small-scale illustration of the write-ahead rule: the WAL file is literally a log of page-sized after-images, and the checkpoint is literally the batch transfer from log to data pages. The whole mechanism lives in a single file, src/wal.c. It is the best piece of code for a reader who wants to see a real WAL in production-grade detail without drowning in InnoDB or Postgres.

Group commit — preview

The one fsync per commit is the rate-limiting step of every WAL-based engine. On an NVMe drive, fsync takes 50–200 microseconds; on a spinning disk, 5–15 milliseconds. A naive implementation caps transaction throughput at around 1 / fsync_time — a few thousand TPS on SSDs, a couple hundred TPS on rust.
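The ceiling arithmetic, using the latency ranges quoted above (the figures are those quoted ranges, not measurements):

```python
# Naive ceiling: one fsync per commit caps throughput at 1 / fsync_time.
fsync_seconds = {
    "NVMe (fast)":     50e-6,
    "NVMe (slow)":     200e-6,
    "spinning (fast)": 5e-3,
    "spinning (slow)": 15e-3,
}
for device, t in fsync_seconds.items():
    print(f"{device:16s} ~{1 / t:>8,.0f} commits/sec")
# prints ~20,000 and ~5,000 for NVMe, ~200 and ~67 for spinning disk
```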

Group commit amortises the fsync across concurrent transactions. When transaction A commits, the engine pauses briefly (microseconds to a millisecond) before issuing the fsync, accumulating the WAL records of any transactions B, C, D that commit in that window. One fsync covers all of them; all four commits complete at roughly the same time. Throughput scales with concurrency: 100 concurrent committers share one fsync, giving 100× the per-fsync throughput.

The beauty of group commit is that the write-ahead rule does not care: as long as the log fsync happens before any of the group's data pages reach disk, the rule is satisfied. Group commit is a throughput optimisation that sits on top of the rule without modifying it. Chapter 34 is this story in full.
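A leader-based sketch shows the shape: the first committer in a window becomes the flush leader, and its single simulated fsync covers every transaction that queued behind it (all names are illustrative; a sleep stands in for the fsync):

```python
import threading, time

class GroupCommitLog:
    """Sketch: many commits, few fsyncs."""
    def __init__(self):
        self.cond = threading.Condition()
        self.buffer = []          # records appended, not yet durable
        self.flushed_upto = 0     # count of durable records
        self.flushing = False     # is some thread currently the flush leader?
        self.fsync_count = 0

    def commit(self, rec):
        with self.cond:
            self.buffer.append(rec)
            my_pos = len(self.buffer)
            while self.flushed_upto < my_pos:
                if not self.flushing:
                    self.flushing = True             # become the leader
                    target = len(self.buffer)        # cover everyone queued so far
                    self.cond.release()
                    time.sleep(0.002)                # one "fsync" for the group
                    self.cond.acquire()
                    self.fsync_count += 1
                    self.flushed_upto = target
                    self.flushing = False
                    self.cond.notify_all()
                else:
                    self.cond.wait()                 # follower: ride the leader's fsync

log = GroupCommitLog()
threads = [threading.Thread(target=log.commit, args=(i,)) for i in range(50)]
for t in threads: t.start()
for t in threads: t.join()
print(f"{len(threads)} commits, {log.fsync_count} fsyncs")
```

On a lightly loaded machine the 50 commits typically share a handful of fsyncs; the write-ahead rule is untouched, because every record is durable before its commit call returns.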

Where this leads next

You now know the invariant that holds all of Build 5 together: log before page, durably, always. The rest of Build 5 is the machinery that makes this invariant efficient and complete at database scale.

The write-ahead rule is the axiom. Build 5 is the theorems.

References

  1. Mohan, Haderle, Lindsay, Pirahesh, Schwarz, ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 17(1), 1992 — the foundational paper that formalises the write-ahead rule and the no-force/steal combination. Build 5 is a guided tour of this paper.
  2. Härder and Reuter, Principles of Transaction-Oriented Database Recovery, ACM Computing Surveys 15(4), 1983 — the paper that introduced the force/no-force × steal/no-steal taxonomy and named the four quadrants.
  3. PostgreSQL Global Development Group, Write-Ahead Logging (WAL), PostgreSQL 16 documentation, Chapter 30 — the official description of Postgres's WAL, including PageLSN enforcement and synchronous_commit semantics.
  4. Oracle Corporation, InnoDB Redo Log and Mini-Transactions, MySQL 8.0 Reference Manual — the official description of InnoDB's mtr framework and how it enforces the write-ahead rule.
  5. D. Richard Hipp, Write-Ahead Logging in SQLite, SQLite official documentation — the cleanest small-scale production implementation of a WAL, complete enough to read end-to-end in an afternoon.
  6. Jim Gray, Andreas Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1993 — chapters 9 and 10, the canonical textbook treatment of the write-ahead rule, REDO/UNDO, and the force/steal matrix.