In short
A B+ tree split is multi-page by construction. Inserting one key into a full leaf means writing a fresh right-half leaf, rewriting the left-half leaf, rewriting the parent to add a new child pointer, and — if the parent cascaded — rewriting one or more ancestors and possibly a brand-new root. Between three and seven pages on disk have to change for one insert, and from the operating system's perspective those are independent writes. You can call fsync after every single one of them and still be destroyed by a power cut: with six pages, there are five moments between writes at which some pages have reached the disk and the rest have not, and every one of those moments leaves the tree in a state that violates its invariants. The previous chapter's torn-write defences — full-page-writes, double-write buffers — solve atomicity within a single page. This chapter is the hazard one layer up: atomicity across multiple pages. No amount of fsync-per-page fixes it, because the problem is not that any one page is torn; the problem is that the set of pages landed partially. The fix is the one every serious in-place storage engine has converged on: write a single log record describing the whole logical operation, fsync the log, then do the physical page writes in any order. If the power cuts mid-split, recovery reads the log and either redoes the missing pages or discards the partial ones. The log is the atomicity layer that fsync does not give you. That log is the write-ahead log, and the next Build is about nothing else. Copy-on-write engines like LMDB dodge the hazard entirely by never modifying a live page in place — the meta-page flip is the only atomic operation, and a crash before it leaves the old tree intact — which is why LMDB has no WAL and still survives every power cut. This is the closer of Build 4. The next chapter you read begins Build 5.
You have spent the last ten chapters building a storage engine that would make a 1990s database engineer recognise their own handiwork. Pages as the unit of I/O, a buffer pool to cache them, a B+ tree laid out across them, slotted layouts for variable-length records, latches for concurrency, a torn-write defence for single-page atomicity. Every one of those pieces is necessary. Every one has been standing on a load-bearing assumption that this chapter is going to break.
The assumption is: one logical mutation is one physical page write.
It is not. Sometimes it is. An in-place update to a row that fits on its existing page, changing a non-indexed column — that is one page write, and the torn-write defence from chapter 29 makes it crash-safe. But the moment the tree has to split, the one-page story collapses. A split is several pages, and the several pages have to either all land or none land, and the kernel does not give you that primitive.
This is the wall Build 4 runs into. Every page you have learned about so far — header, slot array, keys, pointers, checksum — is ready to be crash-safe one page at a time. The tree itself is not. The gap between "each page is crash-safe" and "the tree is crash-safe" is precisely what Build 5 closes, and this chapter is the last look at the storage engine before that work begins.
The crash scenario — one insert, six pages
Look back at the tree from chapter 23. A B+ tree of branch factor 4 on the verge of inserting key 20 into the already-full leaf [17, 19, 21]. The algorithm said: split the leaf into [17, 19] and [20, 21], promote key 20 to the parent, adjust the parent. Now imagine the parent was also full. The parent splits too, promoting a key to the grandparent. If the grandparent was also full — bad luck, but possible on a skinny tree — it splits, and a new root is allocated above it.
Count the pages that physically have to change on disk for that one insert:
- Old leaf — rewritten with the left half of keys.
- New right leaf — a freshly allocated page with the right half of keys.
- Old leaf's right-sibling pointer — the `next` pointer lives on the old leaf's own page, already counted, but the previous right sibling's `prev` pointer, if the implementation uses doubly-linked leaves, is on page 3 and must be rewritten too.
- Parent internal node — rewritten to include the new child pointer and routing key.
- New right internal node — if the parent split.
- Grandparent — rewritten if the parent split.
- New root — if the cascade reached the top.
For a common case (leaf splits, parent does not), you have three pages to write. For a full cascade that reaches the root, you have six or seven. Every one of those pages, on disk, is an independent 4 KiB write at an independent offset. The kernel treats them as six unrelated I/O requests. The SSD treats them as six unrelated NAND program operations. There is no primitive in POSIX, in the block layer, or in NVMe that says "commit all of these or none of them."
Four pages, four independent writes, any subset of which the kernel may have completed. With four writes there are 2^4 = 16 possible landed-sets, and fourteen of them are inconsistent. Only the all-landed case and the none-landed case are states your tree can actually tolerate. Every in-between is a corruption.
Why doesn't fsync between the writes help? Let's say after writing A you call fsync, after writing B you call fsync, after writing C you call fsync, after writing D you call fsync. Now each page is individually durable the moment its fsync returns. But the crash can happen between two fsync calls — for instance, after the fsync on B returned and before the fsync on C started. At that instant, A and B are on disk, C and D are not. You have not saved yourself; you have spent four times as many fsyncs to land in the same inconsistent state, just with better-defined boundaries between the states. Per-page fsync gives you per-page durability. It does not give you cross-page atomicity, which is what a split needs.
Why can't you write the pages in a careful order so that no intermediate state is inconsistent? Because there is no such ordering. If you write the new leaf B first, then the parent C (now pointing at B), then the old leaf A (now with fewer keys) — a crash between C and A leaves the parent pointing at B, which is fine, and at A, which still has the old four keys including 21, which the parent's routing says should live in B. Two copies of key 21, one of which lookup will never reach. If you write A first, then B, then C — a crash between B and C leaves the parent pointing at A (with only two keys) and nothing pointing at B. Every ordering has at least one bad intermediate. There is no write order that makes the sequence self-repairing.
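The "no safe ordering" claim can be checked mechanically. The sketch below is a toy model (the untouched sibling leaf is omitted, and the names A, B, C are the three writes from the paragraph above): it enumerates all 2^3 landed-sets and tests the invariants — every key inside its routed range, no dangling parent pointers, no orphan pages — confirming that only the empty set and the full set survive.

```python
# Toy model: A = old leaf rewrite, B = new right leaf, C = parent rewrite.
# Enumerate every possible landed-set and test the tree invariants.
from itertools import chain, combinations

NEW = {"A": [17, 19], "B": [20, 21]}    # leaf contents after the split
OLD = {"A": [17, 19, 21], "B": None}    # before the split, page B does not exist

def consistent(landed):
    # parent routing: old = key <25 goes to A; new = <20 to A, [20,25) to B
    seps, kids = ([20, 25], ["A", "B"]) if "C" in landed else ([25], ["A"])
    leaves = {p: (NEW[p] if p in landed else OLD[p]) for p in "AB"}
    if "B" not in kids and leaves["B"] is not None:
        return False                     # orphan: B written but nothing points at it
    keys = set()
    for i, kid in enumerate(kids):
        if leaves[kid] is None:
            return False                 # dangling pointer: parent names an unwritten page
        lo = seps[i - 1] if i > 0 else float("-inf")
        hi = seps[i] if i < len(seps) else float("inf")
        if any(not (lo <= k < hi) for k in leaves[kid]):
            return False                 # a key sits outside its routed range
        keys |= set(leaves[kid])
    return keys in ({17, 19, 21}, {17, 19, 20, 21})

subsets = list(chain.from_iterable(combinations("ABC", r) for r in range(4)))
bad = [set(s) for s in subsets if not consistent(set(s))]
print(len(subsets), "landed-sets,", len(bad), "inconsistent")
# 8 landed-sets, 6 inconsistent
```

Every proper, non-empty subset fails at least one invariant — which is exactly why no write order can be self-repairing.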
Seeing the half-split in Python
Let's make the hazard concrete. We will implement a toy B+ tree split, write its pages to disk one at a time, and simulate a crash after k of n pages have landed. Then we will open the result and walk the tree.
```python
# half_split.py — watch a B+ tree split half-commit under a crash.
import os, json

PAGE_SIZE = 256   # small pages for readability
NPAGES = 32       # whole-file page count

def page_offset(pid):
    return pid * PAGE_SIZE

def write_page(f, pid, contents: dict):
    data = json.dumps(contents).encode()
    assert len(data) <= PAGE_SIZE - 1
    buf = data + b" " * (PAGE_SIZE - len(data))
    os.pwrite(f, buf, page_offset(pid))

def read_page(f, pid):
    raw = os.pread(f, PAGE_SIZE, page_offset(pid)).strip()
    return json.loads(raw) if raw.startswith(b"{") else None

# --- the starting tree ----------------------------------------------------
# Page 1: root internal node routing between leaves at pages 2 and 3.
# Page 2: full leaf [17, 19, 21].  Page 3: leaf [25, 29].
with open("tree.db", "wb") as f:
    f.write(b"\0" * PAGE_SIZE * NPAGES)
fd = os.open("tree.db", os.O_RDWR)
write_page(fd, 1, {"kind": "internal", "keys": [25], "children": [2, 3]})
write_page(fd, 2, {"kind": "leaf", "keys": [17, 19, 21], "values": ["v17", "v19", "v21"]})
write_page(fd, 3, {"kind": "leaf", "keys": [25, 29], "values": ["v25", "v29"]})
os.fsync(fd)

# --- the insert(20) sequence ---------------------------------------------
# Algorithm:
#   1. Overwrite page 2 with the left half of the split: [17, 19].
#   2. Write a new right leaf at page 4: [20, 21].
#   3. Overwrite page 1 (parent) to add child 4 and separator key 20.
# The fsync at the end would make all three durable — but we will
# simulate a crash *before* step 3 lands.
INTENDED_WRITES = [
    (2, {"kind": "leaf", "keys": [17, 19], "values": ["v17", "v19"]}),
    (4, {"kind": "leaf", "keys": [20, 21], "values": ["v20", "v21"]}),
    (1, {"kind": "internal", "keys": [20, 25], "children": [2, 4, 3]}),
]
CRASH_AFTER = 2   # only the first 2 of 3 writes land

for pid, contents in INTENDED_WRITES[:CRASH_AFTER]:
    write_page(fd, pid, contents)
os.fsync(fd)
os.close(fd)      # ← power cut before write 3

# --- reopen and walk ------------------------------------------------------
fd = os.open("tree.db", os.O_RDONLY)
print("root :", read_page(fd, 1))
print("leaf2 :", read_page(fd, 2))
print("leaf3 :", read_page(fd, 3))
print("leaf4 :", read_page(fd, 4))

# descend from root looking up key 21
root = read_page(fd, 1)
for k in root["keys"]:
    if 21 < k:
        target_child = root["children"][root["keys"].index(k)]
        break
else:
    target_child = root["children"][-1]
print(f"lookup(21) descends to page {target_child} ->", read_page(fd, target_child))
os.close(fd)
```
Run it. The output is the entire chapter in five lines:
```
root : {'kind': 'internal', 'keys': [25], 'children': [2, 3]}
leaf2 : {'kind': 'leaf', 'keys': [17, 19], 'values': ['v17', 'v19']}
leaf3 : {'kind': 'leaf', 'keys': [25, 29], 'values': ['v25', 'v29']}
leaf4 : {'kind': 'leaf', 'keys': [20, 21], 'values': ['v20', 'v21']}
lookup(21) descends to page 2 -> {'kind': 'leaf', 'keys': [17, 19], 'values': ['v17', 'v19']}
```
Three invariants of the tree are simultaneously broken.
- Key 21 has vanished from every reachable path. The root still has its pre-split routing (`keys=[25]`, so `21 < 25` goes to child page 2), but child page 2 is now the post-split left half `[17, 19]`. The lookup descends, scans, returns `None`. A durable insert has evaporated.
- Key 20 has vanished from every reachable path. Same story. Page 4 holds it, but the root does not know about page 4. Nothing points at page 4. A reader cannot find it.
- Page 4 is an orphan. It exists on disk, it has valid data, and the free-list does not consider it free either (the write landed). The next allocator pass will see a page that is neither referenced nor on the free list — garbage-collection confusion that can only be resolved by a full scan of the tree.
Every individual page is internally consistent. Every individual page passes its checksum. The tree is broken because the set of pages did not move together.
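That inter-page breakage is itself checkable. The sketch below reconstructs the crashed page set in memory (so it runs standalone, independent of the script above) and performs the two checks a consistency scanner would: which pages the root actually reaches, and which keys survive on a reachable path.

```python
# Crash state from the simulation: writes 1 and 2 landed, write 3 (parent) did not.
pages = {
    1: {"kind": "internal", "keys": [25], "children": [2, 3]},  # old root bytes
    2: {"kind": "leaf", "keys": [17, 19]},                      # post-split left half
    3: {"kind": "leaf", "keys": [25, 29]},
    4: {"kind": "leaf", "keys": [20, 21]},                      # landed, but unreferenced
}

def reachable(pid, acc):
    """Collect every page id the tree walk from `pid` can reach."""
    acc.add(pid)
    node = pages[pid]
    if node["kind"] == "internal":
        for child in node["children"]:
            reachable(child, acc)
    return acc

live = reachable(1, set())
orphans = set(pages) - live
reachable_keys = {k for pid in live
                  if pages[pid]["kind"] == "leaf"
                  for k in pages[pid]["keys"]}
print("orphan pages:", orphans)                            # {4}
print("keys lost:", sorted({20, 21} - reachable_keys))     # [20, 21]
```

Key 20 was the fresh insert; key 21 was durable before the crash. Both are gone from every reachable path, even though every page individually checksums clean.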
Scaling the hazard
The toy tree's split is 3 pages. A real split on a 4 KiB-page B+ tree with 4 levels, where the cascade reaches the root, is 7 pages: two leaves (old rewritten + new), two internal level-3 (old rewritten + new), two level-2 (old rewritten + new), and one new root. Seven independent writes, 2^7 = 128 possible landed-sets, 126 of them inconsistent.
Now scale the write rate. A production OLTP database doing 10,000 inserts per second, with say 5% of inserts triggering a split somewhere, is doing 500 splits per second. Each split is 3 to 7 pages. If every page had to land together and there is no log, the mean time to corruption is approximately however long the datacentre takes to lose power once.
This is not a theoretical hazard. Between 2005 and 2012 the Linux kernel mailing list carried a long stream of corruption reports from databases running on hardware without battery-backed write caches, and the pattern in most of them is exactly this: a split that half-committed, an orphan page, a routing path that lost its target. Every one of those reports led to the same fix: turn on `innodb_doublewrite`, turn on `full_page_writes`, use a filesystem that implements fsync correctly, and — above all — let the write-ahead log, which is the subject of the next Build, do its job.
Why fsync-per-page cannot save you
A natural first instinct, seeing the hazard, is: what if I fsync after every page? Surely the kernel cannot have lost a page that I fsynced?
Let's test the instinct. Modify the script to fsync after each of the three writes:
```python
for pid, contents in INTENDED_WRITES[:CRASH_AFTER]:
    write_page(fd, pid, contents)
    os.fsync(fd)   # now every landed page is truly on disk
```
Now simulate a crash after the second fsync returned but before the third write even started. What does disk hold?
- Page 2 (the rewritten old leaf): durably on disk — the first fsync made it so.
- Page 4 (the new right leaf): durably on disk — the second fsync made it so.
- Page 1 (the rewritten parent): still the old bytes — the third write never happened.
The outcome is identical to the no-fsync case: same three invariants broken, same orphan page, same vanished keys. The only thing per-page fsync did was guarantee that the crash-visible state of pages 2 and 4 is the intended new state rather than the possibly-torn new state — which is useful against torn writes per page, but does nothing against the multi-page atomicity gap.
Why is this worth saying out loud? Because it is the most common misunderstanding among engineers who have just learned about fsync and are trying to reason about durability. They collapse two separate guarantees: "my write reached non-volatile media" and "my write happened atomically with some other write". The first is what fsync gives you — after the earlier chapters' long discussion of write caches and FUA and battery-backed RAID controllers. The second is what fsync does not give you, and no amount of fsync-ing will ever give you. The gap between them is exactly wide enough to fit a write-ahead log.
The hazard is structural, not operational. It is not about slow fsyncs or buggy firmware or weird filesystems. It is about the fundamental absence of a multi-page atomicity primitive in the storage stack. POSIX does not expose one. Linux does not expose one. NVMe 1.4 added optional Atomic Write Unit Normal and Atomic Write Unit Power Fail parameters, but both are per-I/O-command, not per-N-commands, and both are advertised at up to a single sector — still smaller than one database page, let alone seven.
If you want multi-page atomicity, you must build it yourself in software. Every database that does in-place mutation has built it. The structure they converged on is the write-ahead log.
The fix — log the logical change first, then do the physical work
The insight that carries you out of this chapter is one sentence:
Write a description of what you are about to do to a durable, sequential log. Fsync the log. Then do the physical page writes in any order. If the power cuts mid-split, recovery reads the log and either redoes the missing writes or discards the partial ones.
Unpack it. The log is a sequential file (append-only) in the same directory as the database. Before any split touches a page, the engine writes one log record describing the whole operation:
```
LSN 0x7A3F : SPLIT-LEAF
    parent_page_id = 1
    left_leaf_id   = 2
    right_leaf_id  = 4   (newly allocated)
    median_key     = 20
    left_keys      = [17, 19]
    left_values    = [v17, v19]
    right_keys     = [20, 21]
    right_values   = [v20, v21]
    parent_before  = {keys:[25],    children:[2,3]}
    parent_after   = {keys:[20,25], children:[2,4,3]}
```
This one record, perhaps 200 bytes, contains every bit of information needed to redo the split from scratch or undo it back to the pre-split state. The engine fsyncs the log once. That single fsync is the commit of the split — conceptually and operationally.
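The log-first protocol is short enough to sketch. The record layout, file name `wal.log`, and JSON encoding below are invented for illustration — a real engine uses a binary format with LSNs and per-record checksums — but the ordering is the whole point: the record is appended and fsynced before any page is touched.

```python
# Hedged sketch: append one SPLIT-LEAF record and fsync it BEFORE any
# in-place page write. The fsync here is the commit point of the split.
import json, os

def log_split(log_fd, record: dict):
    line = (json.dumps(record) + "\n").encode()
    os.write(log_fd, line)   # sequential append to the log
    os.fsync(log_fd)         # the split is now committed, pages or no pages

log_fd = os.open("wal.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
log_split(log_fd, {
    "op": "SPLIT-LEAF", "parent": 1, "left": 2, "right": 4, "median": 20,
    "left_after":   {"kind": "leaf", "keys": [17, 19]},
    "right_after":  {"kind": "leaf", "keys": [20, 21]},
    "parent_after": {"kind": "internal", "keys": [20, 25], "children": [2, 4, 3]},
})
# only now may the engine rewrite page 2, allocate page 4, and rewrite
# page 1 — in any order, with or without intervening fsyncs
os.close(log_fd)
```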
Now the engine does the physical page writes: A, B, C in any order, with or without intervening fsyncs. If the power cuts halfway through, recovery works like this:
1. Open the log. Scan it forward from the last checkpoint.
2. For each log record, check whether the operation is fully reflected on disk. For the SPLIT-LEAF record above, that means: does page 2 have `keys=[17,19]`, does page 4 exist with `keys=[20,21]`, does page 1 have `keys=[20,25]`?
3. If any page is missing the update, redo it — overwrite that page from the log record's "after" image.
4. If no log record exists for some partial state (for instance, an orphan page that was allocated but the log transaction never committed), undo or discard it.
After step 3 runs for every record in the log tail, the tree on disk matches exactly what the log says — and the log says the split happened, so the tree reflects the split, entirely. The hazard is gone. The same machinery handles multi-page updates of any shape: splits, merges, page allocations, free-list mutations, tuple-level MVCC updates. Every physical change the engine wants to make gets described in the log first, and the log is the single source of truth.
The cost is modest. One extra sequential write (the log record), one extra fsync (of the log, before any in-place writes). In exchange you get multi-page atomicity, which is the hazard this chapter is about, and single-page torn-write recovery for free (the log record contains the page's full new bytes), and the ability to replay lost transactions after a crash. Three problems, one mechanism, one log. This is the write-ahead log, and it is the foundation of every serious in-place storage engine since the 1980s.
What the recovery sequence looks like, concretely
Take the crash from the Python simulation above. Pages A (leaf 2) and B (leaf 4) landed; page C (parent 1) did not. The log contains one committed record: the SPLIT-LEAF described above. Recovery runs:
```
scan log forward from checkpoint_lsn = 0x7A00
record @ 0x7A3F : SPLIT-LEAF parent=1, left=2, right=4, median=20
  → read page 2 from disk: {kind:leaf, keys:[17,19], values:[v17,v19]}
      matches record.left_after exactly. leave alone.
  → read page 4 from disk: {kind:leaf, keys:[20,21], values:[v20,v21]}
      matches record.right_after exactly. leave alone.
  → read page 1 from disk: {kind:internal, keys:[25], children:[2,3]}
      does NOT match record.parent_after = {keys:[20,25], children:[2,4,3]}
      REDO: overwrite page 1 with record.parent_after
fsync.
recovery complete.
```
Four reads, one write, one fsync. The tree on disk now reflects the split exactly. The lookup search(21) that previously returned None now descends via the rewritten root to page 4 and finds the value. The orphan is no longer orphaned. The vanished keys have returned. The database opens and accepts writes as if the crash never happened.
If the crash had been earlier — only page 2 landed, pages 4 and 1 did not — recovery would redo both: allocate page 4 from the log's right-after bytes, rewrite page 1 from the log's parent-after bytes. If the crash had been before even page 2 landed, recovery would redo page 2 as well.
If the crash had been before the log record's fsync completed, the log record would not be in the durable log at all, and recovery would treat the split as if it never happened. Pages 2, 4, and 1 would all be in their pre-split states (or would be redone to pre-split if partially overwritten by the engine's subsequent work). The split is atomic: it is either in the log and fully applied by recovery, or it is not in the log and not applied at all.
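The redo pass compresses into a few lines. The sketch below models disk and log as in-memory dicts so it runs standalone; the compare-then-overwrite loop is the same one the trace above walks through, and it is naturally idempotent — running it twice redoes nothing the second time.

```python
# Crash state: the two leaf writes landed, the parent rewrite did not.
disk = {
    1: {"kind": "internal", "keys": [25], "children": [2, 3]},
    2: {"kind": "leaf", "keys": [17, 19]},
    4: {"kind": "leaf", "keys": [20, 21]},
}
log = [{
    "op": "SPLIT-LEAF",
    "after": {   # full after-images for every page the split touches
        2: {"kind": "leaf", "keys": [17, 19]},
        4: {"kind": "leaf", "keys": [20, 21]},
        1: {"kind": "internal", "keys": [20, 25], "children": [2, 4, 3]},
    },
}]

def recover(disk, log):
    """Scan the log tail forward; redo any page missing its update."""
    redone = []
    for rec in log:
        for pid, after in rec["after"].items():
            if disk.get(pid) != after:   # page not fully reflected on disk?
                disk[pid] = after        # REDO from the after-image
                redone.append(pid)
    return redone

print("redone pages:", recover(disk, log))   # [1] — only the parent needed redo
```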
This is the write-ahead rule in miniature. Build 5 turns this paragraph into eight chapters.
When copy-on-write engines sidestep the hazard
The whole story above assumes in-place update. The leaf at page 2 is rewritten. The parent at page 1 is rewritten. The engine is modifying pages that are live-referenced by the tree's root. That is the only setting in which the multi-page atomicity gap exists.
In a copy-on-write engine, the same split is done entirely differently. The engine does not touch page 2, page 1, or any page the live tree's root reaches. Instead:
1. Allocate a fresh page (say 8,001) and write the new left leaf `[17, 19]` there.
2. Allocate another fresh page (8,002) and write the new right leaf `[20, 21]` there.
3. Allocate another fresh page (8,003) and write the new parent `{keys:[20,25], children:[8001, 8002, 3]}` there.
4. If the root itself changed, allocate a fresh page for the new root too.
5. `fsync` all the fresh pages. They are durable now, but nothing points at them — the meta-page still has the old root pointer.
6. Write the alternate meta-page with the new root pointer and a higher transaction id. `fsync`.
The crash story is trivial. A crash at any point before step 6's fsync completes leaves the old meta-page as the authoritative root. The fresh pages are orphans, reclaimable by the next GC pass. The old tree — page 1 still pointing at pages 2 and 3, leaves 2 and 3 holding their original contents — is intact. The split simply did not happen from any reader's perspective.
A crash after step 6's fsync completes leaves the new meta-page as the authoritative root. The fresh pages are the live tree. The split happened entirely.
There is no in-between, because the only operation that distinguishes "old tree" from "new tree" is a single-page write to the meta-page at a fixed offset — and that write is either a whole new page (good) or a whole old page (also good, if the write never started) or a torn chimera (detected by the meta-page's checksum, treated as absent, the other meta-slot's old-but-valid contents used instead).
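The "pick the valid meta-page" rule is small enough to sketch. The slot layout and CRC scheme below are invented for illustration (a real engine packs this into a fixed binary header); what matters is the selection logic: a slot whose checksum fails is treated as absent, and the survivor with the highest transaction id wins.

```python
# Hedged sketch of two-meta-slot selection: highest txn id with a valid CRC.
import json, zlib

def make_meta(txn_id, root_pid):
    body = json.dumps({"txn": txn_id, "root": root_pid}).encode()
    return body + b"|" + str(zlib.crc32(body)).encode()

def read_meta(raw):
    try:
        body, crc = raw.rsplit(b"|", 1)
        if zlib.crc32(body) != int(crc):
            return None        # torn chimera: checksum fails, slot ignored
        return json.loads(body)
    except ValueError:
        return None            # not even parseable: also ignored

slot_a = make_meta(41, root_pid=1)                          # old, valid tree
slot_b = make_meta(42, root_pid=9).replace(b"9", b"7", 1)   # newer, torn mid-write

metas = [m for m in (read_meta(s) for s in (slot_a, slot_b)) if m]
root = max(metas, key=lambda m: m["txn"])["root"]
print("authoritative root page:", root)   # 1 — the old, intact tree wins
```

A crash that tears the newer slot costs nothing: the older slot still describes a complete, consistent tree.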
LMDB is the canonical CoW B+ tree. BoltDB (Ben Johnson's Go take on the LMDB design, whose bbolt fork serves as etcd's storage backend) is another. Berkeley DB's "HA" mode has a CoW variant. Every one of these engines has zero lines of code for multi-page atomicity recovery, because they made it a non-problem by construction. They pay in space (every update copies every page on the path) and throughput (single writer, because two concurrent writers would race on the meta-page flip). What they buy is a storage engine that cannot be corrupted by a crash, ever, no matter where it happens, no matter what the hardware does below 4 KiB.
Why doesn't everyone use CoW then? Because the 5-10× write amplification on heavy workloads is a real cost. A Postgres cluster ingesting 100,000 writes per second cannot afford to rewrite 5-7 pages for every write — the storage bandwidth would be the bottleneck. In-place + WAL costs one sequential log append plus one dirty-page write (eventually batched) per operation, which is roughly one page of I/O, and uses group-commit to amortise fsyncs across many transactions. CoW is simpler; in-place + WAL is faster under load. The dichotomy is the same one that has shaped storage-engine design since 1978, and it has not been resolved; both families are alive and well in 2026.
The three strategies against the crash-mid-split hazard
| strategy | what happens during the split | crash mid-split outcome | recovery cost |
|---|---|---|---|
| naive in-place (no WAL) | rewrite live pages one at a time | orphan pages, vanished keys, permanent corruption | impossible without backup |
| in-place + WAL | log record fsync; then rewrite live pages | log has the whole split atomically | scan log tail, redo missing pages |
| copy-on-write | write fresh pages; flip meta-page | old tree intact, fresh pages are orphans | zero — reclaim orphans at GC |
All three end up with the split either applied fully or not at all. The first achieves this only by refusing to handle a real crash. The second and third achieve it by adding a different layer — a sequential log, or a meta-page flip — that provides the multi-page atomicity the page-writing layer lacks.
LSM-tree engines (Build 3) are a fourth approach: memtable in RAM plus immutable SSTables on disk. The memtable is ephemeral and rebuilt from a separate WAL on restart; SSTables are never modified in place, so the multi-page-split hazard simply does not apply. Different part of the design space; same problem solved by avoiding mutation rather than by logging it.
Common confusions
- "The torn-write defence from chapter 29 already solved this." No — that defence is for a single page's atomicity against a write that tears mid-page. This chapter is about multi-page atomicity when the individual pages are all written atomically (because the torn-write defence worked) but the set of writes landed partially. Full-page-writes and the double-write buffer are about page-granular atomicity; the WAL is about transaction-granular atomicity. Real engines use both together.
- "If I fsync between every page write, surely I am safe." You are safe against losing a page that you claimed was durable — that is what fsync guarantees. You are not safe against a crash between two fsyncs, which is the window in which the tree is inconsistent. Per-page fsync gives you per-page durability, not cross-page atomicity.
- "Can't I just make the split one page?" A split by definition is a restructuring — new leaf, rewritten parent. You cannot collapse it into one page unless you accept a different tree structure (LSM, append-only, CoW) where "rewrite parent" is not the operation being performed. Within a B+ tree doing in-place update, the multi-page nature of splits is inherent.
- "LMDB has no WAL, so it must be slower." The read path of LMDB is frequently faster than in-place-plus-WAL engines, because reads go through the kernel's page cache via mmap with no lock acquisition. The write path is slower per-operation due to copy-on-write write amplification; whether that matters depends on workload. LMDB is used in production at scale (Cloudflare's load balancer, OpenLDAP) precisely because for many workloads the tradeoff favours CoW.
- "This is all solved by battery-backed RAID controllers and enterprise SSDs with power-loss protection." Enterprise storage with PLP capacitors reduces the frequency of problems — if the drive finishes its in-flight writes on capacitor power, there is no torn-write hazard and no lost-writes hazard. But (a) the capacitor is sized for a bounded number of in-flight operations, not arbitrary recovery, and (b) the vast majority of deployments are on hardware without these guarantees, including every cloud VM and every laptop. The software defences must exist even when the hardware is good, because the software cannot assume the hardware is good.
- "Can I just take a filesystem-level snapshot before every split?" You can, but the cost is enormous — a filesystem snapshot is a heavyweight operation that takes tens to hundreds of milliseconds on most filesystems, and you cannot do one per insert. CoW databases essentially internalise this idea at a much finer grain: every update is its own tiny snapshot, at the cost of doing all the per-update work. The snapshot approach at filesystem granularity is useful for backup, not for per-operation atomicity.
- "Does this hazard apply to deletes with merges too?" Absolutely. A merge is structurally the same — rewrite left sibling with merged contents, rewrite parent to drop separator and child pointer, sometimes cascade upward and possibly shrink the tree by one level when the root collapses. Three to six pages, exactly the same multi-page atomicity problem. Every operation that mutates tree structure has the hazard.
- "Can't checksums detect the problem and trigger a repair?" Checksums detect page-internal corruption, not cross-page inconsistency. Every page in the crash-mid-split example has a valid checksum — they were each individually written cleanly. The tree is inconsistent at the inter-page level, where there is no checksum to compute, because the "correct" relationship between pages is defined by the tree's invariants and the operation being performed, not by a hash. The WAL is what records the operation; checksums are orthogonal.
Going deeper
This section is for the reader who wants to know how three production engines actually implement the multi-page atomicity fix, and how those implementations differ in the fine print.
InnoDB's redo log and the mini-transaction
MySQL's InnoDB handles multi-page atomicity via mini-transactions (mtr). Every physical change the engine makes — a page modification, a page allocation, a latch acquisition — is grouped into an mtr. The mtr collects redo log records in a private buffer; when the mtr commits, the buffer is atomically appended to the global redo log, and only then can any of the mtr's pages be flushed from the buffer pool to disk.
A split operation is one mtr. Inside it, the mtr builds up redo log records for each of the pages it touches — the rewritten leaf, the new right leaf, the rewritten parent, any cascaded ancestors. When the mtr commits, all those records go to the redo log as a contiguous block, with a boundary marker ("mtr end"). The redo log is fsynced when the user transaction commits (or later, depending on innodb_flush_log_at_trx_commit).
Recovery processes mtrs one at a time. For each mtr that is complete in the log (has an end marker), apply all its records. For an mtr that is incomplete (the end marker never made it), discard everything since its start. The mtr boundary is the atomicity unit. InnoDB's multi-page atomicity is therefore a property of the mtr framework, built on top of the redo log.
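The mtr-boundary rule is generic enough to sketch without InnoDB's actual record format — the record names and log shape below are invented for illustration. Recovery applies a mini-transaction's records only if its end marker made it to the durable log; an mtr whose end marker is missing is discarded wholesale.

```python
# Generic sketch of mtr-boundary recovery: apply records only from
# mini-transactions whose end marker reached the durable log.
log = [
    ("MTR_BEGIN", 1), ("WRITE", "page2"), ("WRITE", "page4"),
    ("WRITE", "page1"), ("MTR_END", 1),
    ("MTR_BEGIN", 2), ("WRITE", "page7"),   # crash: no MTR_END for mtr 2
]

def complete_mtrs(log):
    applied, pending = [], []
    for rec in log:
        if rec[0] == "MTR_BEGIN":
            pending = []                 # start collecting a new mtr
        elif rec[0] == "MTR_END":
            applied.extend(pending)      # mtr is atomic: apply all of it
        else:
            pending.append(rec[1])       # buffered until the end marker
    return applied                       # pending without MTR_END is discarded

print(complete_mtrs(log))   # ['page2', 'page4', 'page1'] — mtr 2 discarded
```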
The Percona and Oracle documentation for InnoDB covers this in the section "Redo Log" and the source code in storage/innobase/mtr/. Reading the source once is a rite of passage.
Postgres's WAL records and the FPI
Postgres does not have the mtr concept, but achieves the same effect through WAL record types that describe entire physical operations. A leaf split in a nbtree index emits records of type XLOG_BTREE_SPLIT_L or XLOG_BTREE_SPLIT_R (left-biased and right-biased splits), which contain the new contents of every page the split touches.
Because Postgres pairs this with the full-page-image (FPI) mechanism from chapter 29, every page that is rewritten by a split carries its full new 8 KiB image in the WAL (at least on the first modification after a checkpoint). Recovery replay is: for each WAL record, overwrite every page whose FPI is in the record, then apply any delta records. The split's atomicity is enforced by the fact that the WAL record is written as a single fsync'd append; if the fsync did not complete, the record is not in the durable log and the split did not happen.
The code lives in src/backend/access/nbtree/nbtxlog.c and the redo function is btree_xlog_split. Every B+ tree WAL record type is defined there, and each one carries enough information to reconstruct the post-split pages from scratch. Postgres's approach is slightly more redundant than InnoDB's (full page images for every modified page on every first write after checkpoint) but conceptually simpler.
LMDB — zero lines because zero hazard
LMDB's copy-on-write design means the hazard does not exist to be defended against. The commit sequence is:
1. Write all the fresh pages in the transaction.
2. `fsync` the data file.
3. Write one meta-page at a fixed offset.
4. `fsync` the data file.
If the crash happens before step 4 completes, step 3's meta-page write is either not started, in flight (possibly torn — detected by CRC on read), or completed. In cases 1 and 2, the other meta-page (the one with the previous transaction id) is still valid and authoritative, so the old tree is used and the new fresh pages are orphans. In case 3, the new meta-page's CRC is valid and the new tree is used.
The entire crash-safety code in LMDB is a few hundred lines: the CRC check on meta-page read, the "pick the higher-txn-id valid meta-page" logic, and the free-list reclamation that eventually sweeps up orphans. There is no redo log, no undo log, no recovery procedure that touches data pages. The reason LMDB is about 4,000 lines of C total is precisely because this hazard — which is most of the complexity budget of an in-place engine — is designed out rather than designed around.
Howard Chu's talks and papers make this point explicitly: LMDB's design goal was to eliminate entire classes of bugs by avoiding the operations that create them. No in-place writes means no multi-page atomicity problem.
Group commit — amortising the cost of the WAL fsync
The one thing all three in-place approaches (InnoDB mtrs, Postgres WAL records, hypothetical custom engines) share is the WAL fsync on commit. fsync is expensive: on an NVMe drive with power-loss protection, around 20-50 microseconds; on commodity drives, 1-10 milliseconds. If every insert paid its own fsync, write throughput would cap somewhere between roughly 100 ops per second on commodity drives and 50,000 on the best hardware.
Group commit batches many user transactions into one fsync. The sequence is:
- Transaction A calls commit. Its WAL records are copied into the in-memory log buffer.
- Before fsyncing, the engine waits a tiny interval (microseconds to milliseconds) to see if other transactions will also commit.
- Transactions B, C, D also commit in that window; their WAL records join the buffer.
- One fsync makes all of A, B, C, and D's records durable.
- Acknowledge commit to all four clients.
The fsync cost amortises across the batch. A workload of 10,000 concurrent committing transactions can pay one fsync for the whole batch if they arrive within the group-commit window. InnoDB, Postgres, and every serious WAL-based engine implement this. The tradeoff: latency-sensitive single-transaction workloads pay a small batching delay; throughput-sensitive high-concurrency workloads get orders of magnitude higher commit throughput.
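The batching sequence above is usually implemented as a leader/follower protocol: the first committer to find no flush in progress becomes the leader, captures everything buffered so far, pays one fsync for the whole batch, and wakes the followers. A minimal sketch with a simulated fsync; the class and its fields are illustrative, not any engine's actual API:

```python
import threading
import time

class GroupCommitLog:
    def __init__(self):
        self.cond = threading.Condition()
        self.buffer = []        # in-memory WAL records awaiting fsync
        self.flushed_upto = 0   # how many records are durable
        self.flushing = False   # is a leader currently fsyncing?
        self.fsync_calls = 0

    def commit(self, record):
        with self.cond:
            self.buffer.append(record)
            my_pos = len(self.buffer)
            # While a leader is mid-fsync, wait: its fsync (or the
            # next leader's) will cover this record too.
            while self.flushing and self.flushed_upto < my_pos:
                self.cond.wait()
            if self.flushed_upto >= my_pos:
                return          # another leader's fsync covered us
            # Become the leader for everything buffered so far.
            self.flushing = True
            batch_end = len(self.buffer)
        time.sleep(0.001)       # simulated fsync, done outside the lock
        with self.cond:
            self.fsync_calls += 1
            self.flushed_upto = batch_end
            self.flushing = False
            self.cond.notify_all()

log = GroupCommitLog()
threads = [threading.Thread(target=log.commit, args=(f"txn-{i}",))
           for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{log.flushed_upto} commits, {log.fsync_calls} fsyncs")
```

With the 1 ms simulated fsync, most of the 100 threads queue up while the first leader is flushing, so a run typically pays a handful of fsyncs rather than one hundred.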
Group commit is possible because the WAL is sequential — many transactions can append to it concurrently, and one fsync makes all of their records durable. You cannot group-commit a CoW meta-page flip: there is only one meta-page and one writer, so the CoW family pays the full fsync cost per transaction. This is one of the reasons in-place + WAL wins on write-heavy OLTP.
Where this leads next — Build 5
This is the last chapter of Build 4. You have built a page-oriented storage engine: pages allocated from a file, a buffer pool caching them, a B+ tree organised across them, slotted layouts for records, torn-write defences for single-page atomicity. What you have not built is a way to make transactions — groupings of multi-page changes — survive a crash intact.
That is Build 5. Eight chapters, one topic: the write-ahead log and the ARIES-style recovery algorithm that uses it. The structure of Build 5, in the order it will arrive:
- The write-ahead rule — the invariant every crash-safe in-place engine obeys: no dirty page is written to disk before its describing log record has been durably written to the log. One sentence; the whole of recovery follows from it.
- Log records — pure redo, pure undo, redo-undo — the vocabulary a WAL speaks. Each record type is a different trade between log size and replay complexity.
- LSN, log ordering, and page LSN — the monotonic sequence number that threads through every page and every record, and how recovery uses it to know what has already been applied.
- Checkpoints and the REDO/UNDO split — how recovery knows where to start replaying from, and how it knows when it is done.
- The ARIES algorithm — the 1992 Mohan et al. paper that unified every working recovery scheme into one framework. This paper is the reason you can run psql after a datacentre power loss and have it come back in seconds with no data loss.
- Group commit and the throughput of the log — how the fsync on the critical path gets amortised across thousands of transactions.
- Physical vs logical logging — the dimension along which engines differ in what they put in each log record.
- Recovery in production — what actually happens when Postgres or MySQL opens a crashed database, measured in operations and in time.
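The write-ahead rule in the first item above reduces to a single guard in the buffer pool's flush path. A minimal sketch, with every name illustrative and a simulated fsync, just to make the shape of the invariant concrete:

```python
class Log:
    """Toy sequential log. durable_lsn tracks the highest LSN that a
    (simulated) fsync has made durable."""
    def __init__(self):
        self.records = []
        self.durable_lsn = 0

    def append(self, payload):
        lsn = len(self.records) + 1
        self.records.append((lsn, payload))
        return lsn

    def fsync_up_to(self, lsn):
        self.durable_lsn = max(self.durable_lsn, lsn)  # simulated fsync

class Page:
    def __init__(self):
        self.lsn = 0  # LSN of the last record that modified this page

def flush_page(page, log, disk):
    # The write-ahead rule: the page may reach disk only after the
    # log record describing its latest change is durable.
    if page.lsn > log.durable_lsn:
        log.fsync_up_to(page.lsn)
    disk.append(page)  # simulated page write

log, disk, page = Log(), [], Page()
page.lsn = log.append("split: leaf 7 -> leaves 7 and 9; parent gains key")
flush_page(page, log, disk)  # forces the log fsync before the page write
assert log.durable_lsn >= page.lsn
```

Everything else in the Build 5 list — LSNs, checkpoints, the REDO/UNDO split — exists to make that one guard cheap to check and cheap to recover from.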
The storage engine you have built is ready to receive the log. Every hazard named in this chapter — the crash-mid-split, the fsync-per-page deception, the in-place atomicity gap — is a motivation for the next Build. Build 5 is where database engineering earns its reputation as a discipline: taking a simple idea (a sequential log of operations) and deriving from it a machine that survives the worst hardware failure you can throw at it.
Close the Build 4 book. Open Build 5.
References
- Mohan, Haderle, Lindsay, Pirahesh, Schwarz, ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 17(1), 1992 — the foundational paper for every serious WAL-based recovery algorithm. Build 5 is a guided tour of this paper.
- Oracle Corporation, InnoDB Mini-Transactions and Redo Log, MySQL 8.0 Reference Manual — the official description of the mtr framework and how it enforces multi-page atomicity via the redo log.
- PostgreSQL Global Development Group, Write-Ahead Logging (WAL), PostgreSQL 16 Documentation, Chapter 30 — the official introduction to Postgres's WAL mechanism, including full-page images and recovery.
- Howard Chu, LMDB: Lightning Memory-Mapped Database, LinuxCon Europe 2012 — the design rationale for LMDB's CoW-without-a-log approach, and a clear articulation of why "design the hazard out" beats "defend against the hazard".
- Jim Gray, Andreas Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1993 — chapters 7-11 on recovery, the canonical textbook treatment of every idea in this chapter and Build 5.
- Goetz Graefe, A Survey of B-Tree Logging and Recovery Techniques, ACM TODS 37(1), 2012 — a modern catalogue of the specific WAL record shapes engines use for B-tree operations, and the tradeoffs each shape implies.