In short

Your B+ tree uses 16 KiB pages. When you update a leaf, you seek to that page's offset and write 16 KiB. That one write() call is not atomic on the media. Underneath, the kernel breaks it into several 4 KiB page-cache writes; the SSD breaks each of those into one or more sector operations; the NAND breaks those into programmed sub-pages. If the power cuts at any unfortunate instant, the page on disk is a torn write: the first few kilobytes are the new version, the rest is the old version, and the checksum stored in the page matches neither. No syscall return code told you this happened — fsync was still in flight or had just started. The filesystem does not heal it; POSIX never promised it would. Consumer SSDs typically guarantee atomicity only at the sector level (512 B or 4 KiB), which is smaller than the page your database thinks it wrote. Two production defences exist. Postgres's full_page_writes copies the whole page into the write-ahead log the first time each page is dirtied after a checkpoint; if a crash tears the real page, recovery overwrites it with the pristine log copy. InnoDB's double-write buffer stages every dirty page in a 2 MiB contiguous region on disk first, fsyncs that, then writes the page in place; if the in-place write tears, recovery reads the good copy out of the buffer. Both defences cost you a second write per dirty page — they double your write I/O in exchange for a tree that cannot be corrupted by a crash. This chapter names the hazard, simulates it, walks both defences, and lines up the insight that drives Build 5: if you have to log the page anyway, you may as well log the operation — and now you have invented the write-ahead log.

Here is a bug that will not show up in any test on your laptop, but will destroy a production database the first time the datacentre loses power.

You have a B+ tree. It uses 16 KiB pages, the InnoDB default. You update a leaf: the leaf's new contents are in RAM, you seek to the leaf's offset, you call write(fd, buf, 16384) and then fsync. fsync returns 0. Your commit is acknowledged to the client.

Except the "fsync returns" step is not the one you should be staring at. Stare at the step before it — the 16 KiB write. That one write() is a single syscall in your code, but underneath it the kernel splits it into four separate 4 KiB page-cache operations, then hands them to the block layer, which hands them to the NVMe driver, which issues one or more commands to the SSD, which passes them to firmware, which programs the data onto NAND in chunks whose size depends on the drive's geometry and mood.

At any of those subdivision points, if the 5 V rail collapses — because the power supply failed, because the rack breaker tripped, because somebody tripped over a cord — some of your 16 KiB may have reached the media and some may not have. The drive, on reboot, presents you with a page whose first 4 KiB is the new version you wrote and whose last 12 KiB is the old version that was there before. Or the first 8 KiB new, last 8 KiB old. Or the first 512 bytes new, the next 512 bytes old, alternating. The exact pattern depends on the hardware. Every pattern has a name: torn write.

The page on disk is now a chimera. Its header says the tree has n+1 keys. Its body lays out the n keys from before. Its checksum, computed across the whole page, matches neither version. Your B+ tree, the moment you read page 427 back from disk after the reboot, sees garbage where a node should be, and every read that descends through that subtree returns a lie or crashes the process.

This is the hazard every in-place update must survive. This chapter explains where it comes from, simulates it so you can see it with your own eyes, and walks the two production defences — Postgres's full_page_writes and InnoDB's double-write buffer — that let a real database ship in-place updates without fearing the power cord. At the end we will spot the insight that falls out of this defence: if we have to copy the page to a log anyway, we may as well log the operation rather than the page, and we have now invented the thing we will spend Build 5 formalising — the write-ahead log.

Seeing the tear — one page, two truths

Picture a single 16 KiB page on disk, caught right in the middle of a write.

[Figure: one 16 KiB B+ tree page, mid-write power cut. On the left, the intended all-new 16 KiB page — what you called write() with. On the right, what ended up on disk: the first 4 KiB sector NEW, the second sector HALF-WRITTEN (torn), the third and fourth sectors OLD. The checksum over all 16 KiB matches neither NEW nor OLD; the header claims n+1 keys while the body lays out n — a chimera no invariant will hold.]

Why it happens:
  • SSDs typically guarantee atomicity at 512 B or 4 KiB — smaller than your page.
  • NVMe may split a 16 KiB write into several sector commands that can land out of order.
  • The kernel page cache writes dirty 4 KiB pages independently to the block layer.
  • Capacitor-backed (PLP) drives can sometimes finish the flight — most consumer SSDs cannot.

The tear is undetectable from a syscall return code alone. Only the checksum, read on the next open, reveals it.
A 16 KiB B+ tree page mid-write, with the power cut captured between sector 1 and sector 2. The first 4 KiB is the new page; the second sector is physically half-programmed; the remaining 8 KiB is still the old page. The page's own internal checksum, computed end-to-end, matches neither the new nor the old contents. No syscall could have warned you: fsync had not yet returned.

Four properties of this hazard are worth saying out loud before we get to defences.

Atomicity is a property of a layer, not of the whole stack. POSIX write(2) does not promise that multi-block writes are atomic. The page cache does not promise that dirty pages are flushed in order. The block layer does not promise that a multi-sector I/O is all-or-nothing. The SSD firmware promises atomicity at one unit — typically a 512 B or 4 KiB sector — and a guarantee at a smaller unit says nothing about a larger one.

The page size the database chose is usually bigger than the atomic unit the hardware gives. Postgres pages are 8 KiB by default (other sizes are a compile-time option). InnoDB pages are 16 KiB. SQL Server pages are 8 KiB. SQLite pages default to 4 KiB but are often configured higher. The mismatch between database page and hardware atom is the entire reason torn writes exist as a named hazard.

fsync does not help. fsync makes sure the bytes reach the media; it does not make them arrive as one indivisible unit. If the power cuts halfway through an fsync call, some of the pages the kernel was flushing are on the media and some are not. The torn write happens inside the fsync window, not before it.

The filesystem does not heal it. Journaling filesystems (ext4, XFS) protect filesystem metadata — the inode, the directory entry, the block map — from tearing. They do not journal user data by default. A database page living inside an ext4 file is on its own.

Why SSDs do not just promise 16 KiB atomicity: because it is expensive. The drive would need a DRAM buffer big enough to hold every in-flight multi-sector write and a capacitor big enough to flush all of them on power loss. Consumer drives minimise both. Enterprise drives with power-loss protection can sometimes offer larger atomic-write guarantees — NVMe has optional atomic-write parameters you can query, and the power-fail one (AWUPF) is the guarantee that matters here — but the database cannot assume the guarantee without checking, and many deployments will not have a drive that supports it. So every portable engine must assume the smallest reasonable atomic unit (512 B) and defend itself.

A Python simulation of the tear

You can reproduce the failure mode without ever losing real power. Write an old page to a file, then overwrite it with only the first N of the new page's four 4 KiB sectors, and read the page back. The sectors you did not write still hold the old contents — exactly what a crash after N sectors leaves behind on a drive whose atomic unit is one sector.

# torn.py — reproduce a torn write on a 16 KiB database page.
import os, hashlib

PAGE = 16384                         # one B+ tree page
SECTOR = 4096                        # hardware atomic unit — smaller than PAGE

def checksum(body: bytes) -> bytes:
    return hashlib.sha256(body).digest()   # 32 bytes

def make_page(payload: bytes) -> bytes:
    # layout: [payload, zero-padded to PAGE - 32 bytes] [sha256 of body (32 bytes)]
    body = payload + bytes(PAGE - 32 - len(payload))
    return body + checksum(body)

def write_page_with_crash(path: str, new_page: bytes, crash_after_sectors: int):
    """Simulate a crash after `crash_after_sectors` of the new page have landed."""
    # Open for in-place update. The file already contains the old page.
    fd = os.open(path, os.O_WRONLY)
    try:
        # Write the new page sector-by-sector, as the kernel might.
        for s in range(crash_after_sectors):
            os.pwrite(fd, new_page[s*SECTOR:(s+1)*SECTOR], s*SECTOR)
        os.fsync(fd)                 # only the sectors that made it
    finally:
        os.close(fd)                 # power cut — the remaining sectors never happened

def read_page(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read(PAGE)

def verify(buf: bytes) -> bool:
    return checksum(buf[:PAGE - 32]) == buf[PAGE - 32:]

# --- scenario -------------------------------------------------------------
# 681 repetitions of the 24-byte marker fill all but 8 bytes of the body,
# so both ends of the page carry a recognisable version string.
old_page = make_page(b"OLD-VERSION-OF-LEAF-427 " * 681)
new_page = make_page(b"NEW-VERSION-OF-LEAF-427 " * 681)

with open("leaf.page", "wb") as f:
    f.write(old_page)                # the file starts in a clean state

# Crash after only the first sector of the new page has been written.
write_page_with_crash("leaf.page", new_page, crash_after_sectors=1)

torn = read_page("leaf.page")
print("first 24 bytes :", torn[:24])
print("last 24 bytes  :", torn[PAGE-56:PAGE-32])
print("checksum valid :", verify(torn))

The output is the entire chapter in three lines:

first 24 bytes : b'NEW-VERSION-OF-LEAF-427 '
last 24 bytes  : b'OLD-VERSION-OF-LEAF-427 '
checksum valid : False

The first sector is the new page. The last sector is the old page. The checksum, computed across the whole 16 KiB, matches neither — the stored checksum (which lives in the old tail of the page) was computed over the all-old bytes, but what the disk holds is a mixture. The invariant — the page on disk is either the old version or the new version — is violated. This is the torn write, in miniature.

Why the checksum is not itself the defence: a valid checksum tells you the page is either the old version or the new version (whichever you managed to re-checksum after the last write). It cannot tell you which version is there, nor can it reconstruct the bytes that are wrong. The checksum is a detector. What you need for recovery is a source of truth from somewhere else — a spare copy of the page, taken at a moment you knew was consistent, that you can read when the in-place page fails its checksum. The two production defences below are both ways of keeping that spare copy.
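That division of labour — checksum as detector, spare copy as healer — can be made concrete in a few lines. A minimal sketch, reusing the page layout from the simulation above (the exception name and function are invented for illustration):

```python
import hashlib

PAGE = 16384   # same layout as torn.py: body, then 32-byte sha256 of the body

class TornPageError(Exception):
    """The page failed its checksum; only a spare copy can heal it."""

def read_page_checked(buf: bytes) -> bytes:
    # A passing checksum proves the page is *some* consistent version.
    # A failing one tells you nothing except "go fetch the spare".
    body, stored = buf[:PAGE - 32], buf[PAGE - 32:]
    if hashlib.sha256(body).digest() != stored:
        raise TornPageError("page torn — recover from WAL image or doublewrite buffer")
    return body
```

In a real engine this check runs on every page read; the exception is the trigger that routes recovery to whichever spare-copy mechanism the engine chose.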

What happens if the database ignores torn writes

Suppose your engine does no torn-write defence at all. You crash in the middle of writing page 427 — an internal B+ tree node that routes keys between children. On reboot, you read page 427; its checksum fails. You panic the process and alert the operator. Good — no corruption returned to a client.

But now picture a checksum-less engine (or one with a weak checksum like XOR that a torn write can happen to re-satisfy by luck). On reboot, you read page 427 and trust its bytes. Its header says the left child covers keys up to 500; its left-child pointer points at page 90 (new) but its right-child pointer still says page 91 (old). A get(250) descends to page 90, which now holds a completely different subtree's worth of keys, none of them near 250. The read returns the wrong value — or more commonly, a structure-violation assertion fires deep in the tree code and the whole database refuses to open.

Either way, the torn write has turned a durable transaction into either a crashed process (good) or silent data corruption (catastrophic). The goal of the defences below is to eliminate the second outcome entirely.

Postgres full_page_writes — log the pristine page once per checkpoint

Postgres's answer is conceptually the simplest: before you modify a page in place, first copy the whole pristine page into the write-ahead log. Then, on recovery, if the in-place page is torn, overwrite it with the copy from the log.

The mechanism lives behind the GUC parameter full_page_writes (default on), documented in the Postgres configuration manual. Here is what it does.

A Postgres checkpoint is a point in time where every dirty page in the buffer pool has been flushed to disk; the WAL up to the checkpoint's LSN can be safely discarded for recovery purposes. Between checkpoints, pages get dirtied and eventually flushed. The rule is:

The first time a given page is modified after a checkpoint, Postgres writes a full image of that page into the WAL, as part of the WAL record for the modification.

Subsequent modifications to the same page between the same two checkpoints need only log the delta (the row insert, the index update, whatever), because the full page image earlier in the WAL already provides a pristine anchor. At the next checkpoint, the counter resets: the first modification after the new checkpoint writes the full page image again.

Recovery logic:

  1. After a crash, read every WAL record since the last checkpoint.
  2. For each WAL record that contains a full page image, overwrite the corresponding on-disk page with that image unconditionally — even if the on-disk page's checksum is fine, because you cannot distinguish "checksum fine but torn" from "checksum fine and correct" without the anchor.
  3. After applying all full-page-image records, replay the delta records on top.

The result: every page that was being written when the crash hit gets restored from the WAL copy. The torn version is obliterated. The database's invariants hold.
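The write path and the recovery pass fit in a toy model. Below is a minimal sketch — all class and record names are invented, and the "delta" is a whole page image so the replay stays trivial — of the first-modification-after-checkpoint rule:

```python
# Toy model of full_page_writes: pages are byte strings, the WAL is a list.
PAGE = 16384

class ToyFPW:
    def __init__(self):
        self.disk = {}       # page_no -> bytes (may be torn by a "crash")
        self.wal = []        # records since the last checkpoint
        self.logged = set()  # pages whose full image is already in the WAL

    def checkpoint(self):
        # All dirty pages are assumed flushed; earlier WAL is discardable.
        self.wal.clear()
        self.logged.clear()

    def modify(self, page_no: int, new_image: bytes):
        if page_no not in self.logged:
            # First touch after the checkpoint: log the pristine page image.
            self.wal.append(("fpi", page_no, self.disk.get(page_no, bytes(PAGE))))
            self.logged.add(page_no)
        self.wal.append(("delta", page_no, new_image))  # toy delta = whole image
        self.disk[page_no] = new_image                  # in-place write (may tear)

    def crash_tearing(self, page_no: int, torn: bytes):
        self.disk[page_no] = torn                       # simulate a torn write

    def recover(self):
        # Restore every full page image unconditionally...
        for kind, page_no, image in self.wal:
            if kind == "fpi":
                self.disk[page_no] = image
        # ...then replay the deltas on top, in log order.
        for kind, page_no, image in self.wal:
            if kind == "delta":
                self.disk[page_no] = image
```

Postgres does this in a single pass in LSN order — the image record restores the page and later records apply on top; the two-pass loop here produces the same final state for a toy log. A real delta is a compact operation record, not a second image.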

The cost: write amplification. The first modification to a page after a checkpoint costs you ~16 KiB in the WAL (for the page image) on top of the actual delta. Under a write-heavy workload and frequent checkpoints, this can double or triple your WAL volume. The checkpoint_timeout and max_wal_size parameters are tuned partly to manage this cost — less frequent checkpoints mean more modifications share one full-page-image and the amortised cost drops, at the price of longer recovery times.

You can turn full_page_writes off. The Postgres manual is explicit about what happens if you do: "turning this parameter off speeds normal operation, but might lead to either unrecoverable data corruption, or silent data corruption, after a system failure." The only scenarios where turning it off is safe are filesystems or hardware that themselves guarantee atomic page-sized writes — ZFS, btrfs in some configurations, or an NVMe drive with AWUPF (Atomic Write Unit Power Fail) greater than or equal to the Postgres page size. Even then, the setting is a sharp knife; the Postgres developers' advice is to leave it on.

A WAL record with a full-page image, schematically

WAL record @ LSN 0x1A2B3C40
  xid         : 42817
  rmgr        : Heap
  operation   : INSERT
  relation    : public.users  (relfilenode 16384)
  page        : 427
  ---- full page image (16 KiB) ----
    0000: [page header]
    0018: [item pointers]
    ...
    3FE0: [tuples from bottom up]
    3FF0: [pd_checksum, pd_lsn, ...]
  ---- end image ----
  delta       : INSERT tuple (id=7, name='Aarav') at item offset 5

On recovery, when the replay reaches this record:

  1. Read page 427 from disk.
  2. Overwrite it with the 16 KiB image embedded in the record (ignoring whatever was there).
  3. Apply the delta — insert the new tuple at item offset 5.
  4. Update pd_lsn so this page knows about the record.

If the torn-write victim was page 427 — whose in-place write was interrupted by the crash — the overwrite in step 2 is the heal. If page 427 was fine and the torn victim was some other page, that other page has its own full-page-image record somewhere in the WAL between the same two checkpoints; step 2 fires for that one too.

Why the rule is "first modification after a checkpoint" and not "every modification": because once we have written the pristine image once, any subsequent in-place write to the same page that gets torn can still be recovered by going back to the image plus replaying the delta records up to the crash. The image gives us a known-good anchor; the deltas get us back to where we were. We only need one anchor per (page, checkpoint-interval) pair, not one per operation. This is why infrequent checkpoints amortise the full-page-write cost — a heavily-updated page might be dirtied a hundred times between checkpoints and pay the 16 KiB tax only on the first of those hundred.

InnoDB's double-write buffer — stage, fsync, then write in place

InnoDB, the MySQL storage engine, solves the same problem with a different primitive: a fixed-size double-write buffer that sits in the database's data file at a known offset.

The double-write buffer is typically 2 MiB — 128 pages of 16 KiB each. Every time InnoDB flushes dirty pages from its buffer pool to the tablespace, it does the following, documented in the MySQL reference manual under "InnoDB Doublewrite Buffer":

  1. Copy the dirty pages (up to 128 of them) into the double-write buffer region, contiguously.
  2. fsync the double-write buffer. Now all those pages have a pristine spare on disk.
  3. For each page, seek to its real tablespace offset and write it in place.
  4. fsync the tablespace.
  5. (The double-write buffer slots are now reusable for the next flush batch.)

If the power cuts during step 3 — the in-place write phase — some pages are torn in the tablespace. On recovery:

  1. Read every page in the double-write buffer. Check its checksum. Discard any with bad checksums — those slots were themselves mid-write when the power cut, which means the step-2 fsync never completed and their in-place counterparts were never touched this round.
  2. For each good page in the buffer, read the corresponding page at its tablespace offset. Check its checksum.
  3. If the tablespace page's checksum fails, overwrite it with the buffer copy. If the tablespace page's checksum is fine, leave it alone (it finished before the crash).
  4. Clear the double-write buffer.

The invariant the buffer guarantees is: after the step-2 fsync, every dirty page has a consistent on-disk copy somewhere. If the in-place copy gets torn, the spare survives. If the spare gets torn (which would require a crash during step 2 rather than step 3), the in-place copy is still the old version, which is also consistent.
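The five-step flush and the recovery pass can be sketched together. A toy model — names invented, the tablespace and buffer are dicts, and fsync ordering is implied by statement order:

```python
# Toy double-write buffer: stage a flush batch, "fsync", then write in place.
import hashlib

PAGE = 16384

def seal(body: bytes) -> bytes:
    """Append a 32-byte sha256 so a torn copy is detectable."""
    return body + hashlib.sha256(body).digest()

def valid(page: bytes) -> bool:
    return hashlib.sha256(page[:-32]).digest() == page[-32:]

class ToyDoublewrite:
    def __init__(self):
        self.tablespace = {}  # page_no -> sealed page (the in-place copies)
        self.buffer = {}      # page_no -> sealed page (the staged spares)

    def flush(self, batch: dict, tear_in_place=frozenset()):
        # Steps 1-2: stage every page in the buffer, then fsync it.
        self.buffer = dict(batch)
        # Steps 3-4: write each page in place; a crash may tear some of them.
        for page_no, page in batch.items():
            if page_no in tear_in_place:
                old = self.tablespace.get(page_no, bytes(PAGE + 32))
                # First half new, second half old: a torn in-place write.
                self.tablespace[page_no] = page[:PAGE // 2] + old[PAGE // 2:]
            else:
                self.tablespace[page_no] = page

    def recover(self):
        for page_no, spare in self.buffer.items():
            if not valid(spare):
                continue  # the buffer slot itself was mid-write; in-place copy is intact
            live = self.tablespace.get(page_no)
            if live is None or not valid(live):
                self.tablespace[page_no] = spare  # heal the torn in-place copy
```

At every crash point one of the two copies passes its checksum, which is exactly the invariant stated above.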

The cost is the same shape as Postgres's: every dirty page is written twice. Once to the buffer, once in place. MySQL exposes the setting as innodb_doublewrite; the documentation is as blunt as Postgres's about what happens if you turn it off: "If you disable innodb_doublewrite for performance ... your data may be unrecoverable in the event of an operating system crash or power outage."

InnoDB also supports a per-tablespace variant (innodb_doublewrite_files, introduced in MySQL 8.0.20) that splits the buffer across multiple files to reduce contention, and a fast-path for filesystems known to support atomic writes (e.g., ext4 on certain Linux kernels with Fusion-io drives) that can skip the buffer entirely. The mechanism has evolved, but the core idea — stage, fsync, then commit in place — has been in MySQL since InnoDB's earliest days.

Postgres vs InnoDB — same insight, different placement

Both defences obey the same law: before you overwrite a page in place, make sure a consistent copy of it exists on durable storage somewhere else. They differ in where they put the copy.

  • Postgres puts the copy in the write-ahead log — a sequential append-only stream the engine was writing anyway for other reasons. The full-page image is one more record in that stream, amortised across later delta records for the same page.
  • InnoDB puts the copy in a fixed-size circular buffer in the tablespace itself. The buffer is not a log; it holds only the pages currently being flushed, and it is overwritten on every flush batch.

Consequences:

  aspect               | Postgres FPW                          | InnoDB double-write
  ---------------------+---------------------------------------+--------------------------------
  storage              | WAL grows with page images            | fixed 2 MiB buffer
  frequency            | first modification per checkpoint     | every flush
  write amplification  | 1× the page per checkpoint interval   | 2× the page, always
  tunables             | full_page_writes, checkpoint_timeout  | innodb_doublewrite, buffer size
  interacts with       | recovery replay                       | buffer-pool flush path

The Postgres approach is slightly more I/O-efficient for hot pages (you pay the full-page image once per checkpoint, not once per flush), at the cost of a larger WAL. The InnoDB approach uses bounded space and does not grow the WAL, at the cost of writing every dirty page twice forever.

Neither is wrong. They are two reasonable placements of the same spare copy.
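To put rough numbers on the trade (illustrative figures, not measurements):

```python
PAGE = 16 * 1024

def extra_bytes_fpw(flushes_per_checkpoint: int) -> int:
    # Postgres: one full page image per (page, checkpoint interval),
    # no matter how many times the page is flushed.
    return PAGE

def extra_bytes_doublewrite(flushes_per_checkpoint: int) -> int:
    # InnoDB: every flush stages the page in the buffer first.
    return PAGE * flushes_per_checkpoint

# A hot page flushed 20 times between checkpoints:
print(extra_bytes_fpw(20))          # 16384  -- one 16 KiB image in the WAL
print(extra_bytes_doublewrite(20))  # 327680 -- twenty staged copies
```

The crossover is immediate: any page flushed more than once per checkpoint interval is cheaper under FPW, which is why the Postgres scheme favours hot pages and long checkpoint intervals.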

Common confusions

  • "fsync prevents torn writes." fsync forces the bytes to the media; it does not make a multi-sector write indivisible. The tear happens inside the fsync window.
  • "The filesystem journal protects my data." ext4 and XFS journal filesystem metadata by default, not the contents of your file. A database page inside an ext4 file is on its own.
  • "The checksum is the defence." The checksum is only the detector. It can tell you a page is torn; it cannot reconstruct the bytes. Recovery needs a spare copy from somewhere else.
  • "Enterprise hardware makes the problem go away." Only if every drive in the fleet verifiably offers an atomic-write unit at least as large as the database page — a guarantee that must be queried, not assumed.

Going deeper

The rest of this section is for the reader who wants to know how the industry is trying to eliminate the torn-write problem at the hardware layer — and why, even with those efforts, the log-based defences are not going away.

NVMe Atomic Write Unit — what the standard actually promises

NVMe defines several atomicity parameters a device can advertise:

  • AWUN (Atomic Write Unit Normal) — the largest write, in logical blocks, the controller executes atomically during normal operation.
  • AWUPF (Atomic Write Unit Power Fail) — the largest write guaranteed atomic across a power failure. This is the field a torn-write defence cares about.
  • NAWUN / NAWUPF — per-namespace versions of the same two guarantees.

You can query these on Linux with nvme id-ctrl /dev/nvme0 and nvme id-ns /dev/nvme0n1. A typical consumer drive reports AWUPF = 0 (one 512 B block). A well-configured enterprise drive can report AWUPF = 31 (32 blocks — 16 KiB, exactly one database page). If you verify AWUPF ≥ your page size for every drive in your fleet, and you configure the I/O path so a page-sized write is never split (O_DIRECT, aligned writes, a filesystem that passes them through intact), you can reasonably turn off full_page_writes or innodb_doublewrite. Few deployments do this verification end-to-end because the savings (maybe 20–40% WAL volume) are usually not worth the audit burden.
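The NVMe atomic-unit fields are zero-based counts of logical blocks, so turning a reported AWUPF into bytes is one multiplication (a sketch; check the namespace's actual LBA format first):

```python
def awupf_bytes(awupf: int, lba_size: int = 512) -> int:
    """NVMe atomic-unit fields are 0-based counts of logical blocks."""
    return (awupf + 1) * lba_size

print(awupf_bytes(0))        # 512    -- one sector, the consumer-drive norm
print(awupf_bytes(31))       # 16384  -- a full 16 KiB database page
print(awupf_bytes(7, 4096))  # 32768  -- 8 blocks on a 4 KiB-formatted namespace
```

The same conversion applies to AWUN and the per-namespace fields; the comparison a deployment audit needs is awupf_bytes(...) ≥ the database's page size, on every drive.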

ZNS — zoned namespaces and the disappearance of in-place update

Zoned namespaces (ZNS) is an NVMe extension where the SSD's address space is divided into zones, each of which must be written sequentially and reset as a whole. ZNS drives do not support random in-place overwrites at the block level — the only mutation primitives are "append to a zone" and "reset a zone." This matches the underlying NAND's physical model much more closely than the legacy block-device pretense of random writes, and it eliminates the internal garbage collection that consumer SSDs spend much of their lifecycle doing.

For a database on ZNS, in-place update at the page level is simply not available; every update is effectively append-to-a-zone. This collapses the torn-write problem — an append that tears corrupts only the tail of the zone, which the database never considered live. LSM-tree and CoW engines map onto ZNS naturally; B+ tree engines need to do copy-on-write anyway to avoid in-place mutation, which means they are essentially becoming CoW engines. The long-term industry bet is that storage engines will converge on append-only / CoW designs as ZNS adoption grows.

Direct Access (DAX) and byte-addressable persistent memory

DAX is the Linux mechanism for mapping a persistent-memory device (Intel Optane, now discontinued, and its successors) directly into a process's address space. There is no page cache; there is no block layer; a store instruction to the mapped region goes straight through the CPU's caches to persistent memory, and a clflushopt plus sfence makes it durable with nanosecond latency.

On DAX, the atomic unit is the CPU's 8-byte store — not 512 bytes, not 4 KiB. A crash during a 16-byte write can leave 8 bytes new and 8 bytes old. This is smaller than conventional storage, which means database engines porting to persistent memory have to rethink torn-write defences at the cache-line level. The PMDK library (libpmemobj) offers transactional primitives for exactly this, built around 8-byte atomic pmem_memcpy_persist and a redo log sitting in the same memory region. Even there, the log-based defence reappears — the scale is finer but the shape is identical.

Copy-on-write versus double-write — same family, different accounting

The double-write buffer is structurally a bounded copy-on-write. Every dirty page is written to a fresh location (the buffer) before its in-place destination is touched. The meta-page equivalent — "which copy is authoritative" — is implicit in the flush protocol's ordering rather than in a pointer flip, but the safety argument is the same: at any crash point, one of the two copies is consistent, and recovery picks the right one.

Postgres's full_page_writes is structurally redo-log-based copy-on-write: the "fresh location" is inside the WAL, interleaved with the operation records that will be replayed on top of the image. Recovery reads the image then replays; the image plus the replay reconstructs the intended page state.

LMDB's CoW B+ tree is pointer-flip copy-on-write: the fresh location is a new page in the file proper, and the meta-page switch makes it authoritative without a log at all.

Three placements of the same idea — stage a known-good copy before overwriting in place — chosen based on whether the engine wants a log (Postgres: yes, using the one it already has), a bounded buffer (InnoDB: yes, to keep WAL volume low), or no log at all (LMDB: write the whole path from leaf to root, swap the meta).

The torn-write hazard is what forces the choice. Every in-place storage engine will pick one of the three, and all three cost you roughly one extra page-sized write per commit. Database engineering is, in the most exasperating sense, the art of paying that tax as efficiently as possible.

Where this leads next

You now know the hazard — torn writes — and you know two defences that guard against it. Postgres's full-page-write and InnoDB's double-write both copy a pristine page to durable storage before touching the live copy. That observation, stated one level up, is the seed of the entire next Build:

If you are going to copy the page to durable storage before modifying it, why stop at the page? Copy the operation to durable storage — a description of what you are about to do, compact and append-friendly — and now you have a log that can both prevent torn writes and also let you replay lost transactions on recovery.

That log is the write-ahead log. Build 5 turns this paragraph into 800 lines of code, 30 pages of the ARIES paper, and every serious database's recovery subsystem.

The last five chapters of Build 4 and the entire Build 5 sit on one observation: durability-of-operation is cheaper than durability-of-state. Torn writes are the reason that observation had to be made.

References

  1. PostgreSQL Global Development Group, Reliability and the Write-Ahead Log — full_page_writes, PostgreSQL 16 documentation — the official statement of what full_page_writes does and why turning it off is dangerous.
  2. Oracle Corporation, InnoDB Doublewrite Buffer, MySQL 8.0 reference manual — the official description of the double-write buffer, its placement, and its recovery protocol.
  3. Zheng et al., Understanding the Robustness of SSDs under Power Fault, FAST 2013 — the empirical paper that measured how often and how badly real SSDs tear writes on power loss. The numbers remain the best public evidence that the hazard is not theoretical.
  4. NVM Express, Inc., NVMe Base Specification — Atomic Write Unit parameters, §5.15 — the standard's definitions of AWUN, AWUPF, and their per-namespace variants (NAWUN, NAWUPF).
  5. Bjørling et al., ZNS: Avoiding the Block Interface Tax for Flash-based SSDs, USENIX ATC 2021 — the paper motivating zoned namespaces and showing how sequential-only writes eliminate classes of hazards including torn writes.
  6. Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 1992 — the recovery algorithm that formalises the write-ahead rule and motivates why operation-logging is cheaper than page-logging.