In short

There are exactly two ways to change a page that is already on disk. In-place update writes the new bytes over the old, at the same disk offset — one page of I/O, zero extra space, and a window of milliseconds during which the page on disk is neither the old version nor the new one but a half-smudged mixture of both, which a power cut turns into permanent corruption. Copy-on-write (CoW) writes the new page to a fresh disk offset, leaves the old page untouched, and then — as the very last step — overwrites a single meta-page at a fixed location to point at the new page as the new root of the tree. The old page and the old meta-page are still there until the next checkpoint reclaims them; the new page and new meta-page are either both fully written or neither visible. A crash in the middle simply wastes the half-written new page and the reader sees the old tree, intact, as if the write never happened. The price you pay for this atomicity is space (every update touches every page on the path from leaf to root, all of which must be copied) and CPU (you are writing more bytes and re-checksumming more pages). What you gain is that you need no write-ahead log at all — the meta-page swap is the commit. LMDB, Symas Corporation's single-file B+ tree used by OpenLDAP, Cloudflare, and (for a while) Bitcoin Core, is the canonical example: about four thousand lines of C, one writer at a time, MVCC snapshots for free, crash-safe by construction. In-place engines — InnoDB, Postgres's heap, SQLite's rollback mode — get back the space and speed but must pair every in-place write with a WAL entry to survive a crash. This chapter walks both mechanisms, traces a crash in each, and names the workloads where each one wins.

You have a B+ tree. Every node is a 4 KiB page. You just called insert(key, value) and your code has done the interesting part: descended to the right leaf, noted that the leaf needs one more key, and now holds the new leaf contents in a buffer in RAM. The leaf's page id is 427. It lives at file offset 427 × 4096 = 1,748,992 bytes into your database file.

There is exactly one question left in this chapter, and it is not complicated to state: how do the new bytes get into that file?

Two answers exist. Both are in production use today. Both have their champions and their graveyards.

The first answer is the one you would invent in an afternoon if nobody told you it was dangerous. Seek to offset 1,748,992. Write 4,096 bytes. Call fsync. Return. One page of I/O; zero wasted space; the fastest possible mutation. This is in-place update, and it is what InnoDB, Oracle, DB2, and Postgres's heap pages all do. It is also, by itself, catastrophically unsafe — which is why none of those engines ship it alone. Every one of them pairs in-place writes with a write-ahead log (the subject of Build 5), an append-only record of what they are about to do that they write first and fsync before touching the real page. The log is the safety net; without it, the in-place write is a tightrope over a wood chipper.

The second answer is the one that sounds wasteful the first time you hear it but turns out, on reflection, to be the one a cautious engineer might invent instead. Do not touch page 427. Write the new leaf to a fresh page at a free offset somewhere else — say page 900. Now the parent internal node has to point at page 900 instead of 427, so do not touch the parent either: copy it to page 901. Its parent has to be rewritten too; copy it to page 902. Work all the way up to the root, copying every node along the path. Finally — and this is the whole trick — the file has a reserved fixed-offset page called the meta-page that says "the root of the current tree is at page X". Atomically overwrite that one meta-page to point at the new root instead of the old one. The instant that write lands, the entire new tree becomes visible; until it lands, the entire old tree is still there, untouched. A crash in the middle throws away the half-built new tree, costs nothing, and leaves the database in exactly the state it was in before the transaction started. This is copy-on-write, and LMDB is its most famous incarnation.

The price of CoW is right there in the description: you wrote a new copy of the leaf, of every internal node on the path, of the root, and finally of the meta-page — several pages of I/O to change one key. You used up more disk space, at least temporarily — the old pages are still allocated until garbage collection frees them. You did more CPU work on every update.

The benefit is that you got atomic transactions without a write-ahead log. A single meta-page flip is the entire commit mechanism. No redo log, no rollback log, no undo chain, no recovery procedure that has to read a log and replay operations. Crash? The old meta-page still points at the old tree. Readers see a consistent snapshot. Restart, and you are exactly where you were. Recovery takes zero milliseconds.

Two mechanisms, two sets of tradeoffs, two families of databases. This chapter walks them both.

The picture — one page, two destinies

Figure: In-place update versus copy-on-write for one leaf modification. Two side-by-side diagrams sharing a title bar reading "updating leaf at page 427 with a new key". Left half, labelled IN-PLACE UPDATE: a column of stacked pages 425–429, with page 427 highlighted and an arrow labelled "overwrite at same offset" pointing at it; below, text reads "one page of I/O, zero extra space, needs a WAL to be crash-safe". Right half, labelled COPY-ON-WRITE: the same column of pages with page 427 unchanged; off to the right, new pages 900 ("new leaf"), 901, and 902 ("new internal, new root"); a fixed box labelled "meta-page" whose old arrow to root page 100 is crossed out and whose new arrow to root page 902 is labelled "atomic pointer swap = commit"; below, text reads "six pages of I/O; old tree still visible until meta flip; no WAL needed".
Two strategies for modifying a single leaf. In-place overwrites page 427 at its original offset, which is fast and compact but leaves a window during which the page is neither old nor new. CoW writes a new leaf at page 900, copies every page on the path up to a new root at 902, and then makes one final write: the meta-page at a reserved fixed offset, which switches atomically from pointing at the old root to pointing at the new one. That last write is the transaction commit.

The picture is the whole chapter in one figure. Every detail below is an elaboration of it.

In-place update — fast, dangerous, universal

The in-place story is short because it is the one you already know.

def update_in_place(file, page_id, new_page_bytes):
    offset = page_id * PAGE_SIZE
    file.seek(offset)
    file.write(new_page_bytes)        # kernel page cache
    os.fsync(file.fileno())           # durable

That is the entire primitive. Three lines. Every database you have used relies on it at some level, and every database that does also pays the price of making it crash-safe. That price is the subject of Build 5, but you need a sketch of it to understand why CoW is interesting.

Why is in-place unsafe on its own? Because write(fd, buf, 4096) is not atomic at any layer below your application. The kernel might have copied 2 KiB of your 4 KiB into the page cache when a power cut hits; the disk controller might have acknowledged 512 bytes of a 4 KiB sector when the capacitor drains; the NAND flash might have programmed four of eight 512-byte sub-pages inside a single logical sector. After the reboot, page 427 contains some bytes from the old version and some from the new, interleaved unpredictably. No checksum will match; no B+ tree invariant will hold; your database is broken. This failure mode has a name — the torn write — and every in-place storage engine has to defend against it.
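A per-page checksum is how an engine detects a torn page on reboot — detection only, not repair; repair needs the WAL. A minimal sketch, assuming a hypothetical page layout with a CRC32 of the body stored in the first four bytes:

```python
import zlib

PAGE_SIZE = 4096

def seal_page(body: bytes) -> bytes:
    # Store a CRC32 of the body in the first 4 bytes of the page.
    assert len(body) == PAGE_SIZE - 4
    crc = zlib.crc32(body)
    return crc.to_bytes(4, "little") + body

def page_intact(page: bytes) -> bool:
    # A torn write leaves a mix of old and new bytes; the CRC will not match.
    crc = int.from_bytes(page[:4], "little")
    return zlib.crc32(page[4:]) == crc

old = seal_page(b"A" * (PAGE_SIZE - 4))
new = seal_page(b"B" * (PAGE_SIZE - 4))
torn = new[:2048] + old[2048:]       # power cut halfway through the overwrite
assert page_intact(old) and page_intact(new)
assert not page_intact(torn)         # detected, but nothing here can repair it
```

The checksum tells you page 427 is garbage; it cannot tell you what the bytes should have been. That missing information is exactly what the WAL carries.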

The defence is the write-ahead log. Before touching page 427, write a WAL record that says "I am about to change page 427 from X to Y", fsync the log, and only then do the in-place write. If the power dies mid-page-write, you reboot, read the log, find the unfinished record, and either redo the page write from the log (if the log has the new bytes) or undo it (if the page got half-written and the log records what the old bytes were). Either way, the tree is restored to a consistent state.

The WAL adds a second fsync to every commit, a second stream of writes to the disk, and an entire recovery subsystem to the database — the ARIES paper (Build 5) runs to some seventy pages describing exactly how to do this right. For now the point is just: in-place update by itself is not safe; in-place + WAL is safe but expensive; without the WAL, a crash corrupts the database. Postgres calls its WAL pg_wal; MySQL's InnoDB calls it ib_logfile; SQLite in its default mode uses a rollback journal that plays the same role in reverse (write the old bytes to the journal first, then overwrite the page, then delete the journal). Different names, same mechanism.
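The redo half of that defence fits in a sketch. This toy (redo-only, fixed-size records, no checkpointing or record checksums; all names are hypothetical) logs the new bytes and fsyncs the log before the in-place overwrite, so recovery can always rewrite a torn page from the log:

```python
import os
import struct

PAGE_SIZE = 4096

def wal_write(db, wal, page_id, new_bytes):
    # 1. Log what we are about to do, durably, BEFORE touching the page.
    wal.seek(0, os.SEEK_END)
    wal.write(struct.pack("<I", page_id) + new_bytes)
    wal.flush(); os.fsync(wal.fileno())       # this fsync is the commit
    # 2. Only now is the in-place overwrite safe: if it tears, the log
    #    still holds the full intended contents.
    db.seek(page_id * PAGE_SIZE)
    db.write(new_bytes)
    db.flush(); os.fsync(db.fileno())

def recover(db, wal):
    # Redo pass: reapply every logged write.  Rewriting a page with the
    # same bytes is idempotent, so torn pages are simply overwritten.
    # (A real WAL checksums records to detect a torn log tail.)
    wal.seek(0)
    while (hdr := wal.read(4)):
        page_id = struct.unpack("<I", hdr)[0]
        new_bytes = wal.read(PAGE_SIZE)
        db.seek(page_id * PAGE_SIZE)
        db.write(new_bytes)
    db.flush(); os.fsync(db.fileno())
```

Note the ordering: the log fsync is the commit point, and the page write can happen whenever the engine gets around to it — which is why real engines batch page writes into checkpoints.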

Copy-on-write — the meta-page flip

The CoW story requires that we first name the actors.

In a CoW B+ tree, every page on disk is at one of three lifecycle stages: live (reachable from the current meta-page's root), free (no longer reachable from the current root, recorded on the free list and available for reuse), or freshly written (allocated by the in-flight write transaction, invisible to everyone until the next meta-page flip).

At the very start of the file there are two meta-pages at fixed offsets — say, pages 0 and 1. Each meta-page holds: a transaction id (a monotonically increasing counter), the page number of that transaction's root, a pointer to the head of the free-page list, and a checksum. Only one of the two meta-pages is "current" at any given time — the one with the higher, valid transaction id. The other is the previous one.
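Packing and checking such a meta-page is a few lines of struct code. A sketch, assuming an illustrative little-endian layout (not LMDB's actual format): three 64-bit fields followed by a CRC32 over them, padded to a full page:

```python
import struct
import zlib

PAGE_SIZE = 4096
# txn_id, root_page, free_list_head as unsigned 64-bit fields, then a
# CRC32 of those 24 bytes.  The rest of the page is zero padding.
META_FMT = "<QQQI"

def pack_meta(txn_id, root_page, free_list_head):
    body = struct.pack("<QQQ", txn_id, root_page, free_list_head)
    page = body + struct.pack("<I", zlib.crc32(body))
    return page.ljust(PAGE_SIZE, b"\0")

def unpack_meta(page):
    txn_id, root_page, free_list_head, crc = struct.unpack_from(META_FMT, page)
    if zlib.crc32(page[:24]) != crc:
        return None                 # torn or corrupt: treat this slot as absent
    return (txn_id, root_page, free_list_head)

good = pack_meta(117, 100, 7042)
assert unpack_meta(good) == (117, 100, 7042)
torn = bytes([good[0] ^ 0xFF]) + good[1:]    # one corrupted byte
assert unpack_meta(torn) is None             # the CRC catches it
```

Returning None on a CRC mismatch is the whole recovery story for a torn meta-write: the caller just falls back to the other slot.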

The whole commit protocol — the entire transactional machinery of the engine — is this:

  1. Start a write transaction. Read the current meta-page to find the root.
  2. For every page you need to modify (the leaf, every ancestor on the path up to the root), allocate a free page and write the new version of that page there. Do not touch the old pages.
  3. When the root has been rewritten to its new location, fsync the file — this ensures every new page is durable.
  4. Prepare the other meta-page (the non-current one) with the new transaction id, the new root pointer, the updated free-list head. Write it, fsync it.

The commit is step 4's fsync. If the power dies before that fsync returns, the other meta-page is either still the old one or is half-written — and in either case the current meta-page, sitting untouched in its own slot with the highest valid transaction id, is unchanged, so on reboot the database opens the old meta-page and sees the old tree exactly as it was. The new pages are orphans; they will be reclaimed when the free-list gets to them.

If the power dies after that fsync returns, the new meta-page is on disk and has the highest valid transaction id, so on reboot the database opens it and sees the new tree.

There is no in-between. The only thing that could create an in-between is a torn write on the meta-page itself — but the meta-page is a single block, typically 4 KiB, and we check its checksum on read. If the checksum fails, we treat that meta-page as absent and use the other one, which is still the old valid meta-page. Zero-downtime recovery, zero log replay.

Why two meta-pages?

With one meta-page, a torn write to that page would leave you with no valid meta-page and no way to open the database. With two, a torn write can only ever hit the slot currently being written; the other slot still holds the previous commit, fully fsynced, so at least one valid meta-page always survives. In steady state, the two slots alternate: even-numbered transactions land in slot 0, odd-numbered in slot 1, or similar.

The pseudocode

# cowtree.py — a minimal copy-on-write B+ tree, stripped to the essentials.
# Pages are 4096-byte buffers; pages 0 and 1 are reserved meta-pages.
# Underscore-prefixed helpers (meta serialization, checksums) are elided.

import os

PAGE_SIZE = 4096
META_SLOTS = (0, 1)

class MetaPage:
    # "slot" records which of the two meta slots this page was read from.
    __slots__ = ("slot", "txn_id", "root_page", "free_list_head", "checksum")

def read_current_meta(file):
    m0 = _read_meta(file, 0)
    m1 = _read_meta(file, 1)
    valid = [m for m in (m0, m1) if _checksum_ok(m)]
    if not valid:
        raise RuntimeError("both meta-pages corrupt — database unopenable")
    return max(valid, key=lambda m: m.txn_id)

def begin_write(file):
    current = read_current_meta(file)
    return WriteTxn(file, current)

class WriteTxn:
    def __init__(self, file, meta):
        self.file = file
        self.meta = meta
        self.new_pages = {}              # page_id -> new bytes, for pages this txn rewrote
        self.allocated = []              # freshly-allocated page ids
        self.freed = []                  # old page ids superseded by this txn

    def rewrite(self, old_page_id, new_bytes):
        # Always allocate a new page; never touch the old one.  The old
        # page goes on the free list at commit, via _updated_free_list.
        new_page_id = self._alloc_page()
        self.new_pages[new_page_id] = new_bytes
        self.freed.append(old_page_id)
        return new_page_id

    def _alloc_page(self):
        # In a real implementation, pull from the free list; here just grow
        # the file.  Count pages already handed out this transaction, since
        # nothing reaches the disk until commit.
        file_size = os.path.getsize(self.file.name)
        new_id = file_size // PAGE_SIZE + len(self.allocated)
        self.allocated.append(new_id)
        return new_id

    def commit(self, new_root_id):
        # 1. Write every new page to its new offset.
        for pid, data in self.new_pages.items():
            self.file.seek(pid * PAGE_SIZE)
            self.file.write(data)
        # 2. fsync the data pages. They are all at fresh offsets, so no torn-write
        #    can corrupt live data. The worst case is orphaned new pages.
        os.fsync(self.file.fileno())
        # 3. Pick the other meta-slot and write the new meta there.
        new_slot = META_SLOTS[1] if self.meta.slot == META_SLOTS[0] else META_SLOTS[0]
        new_meta = MetaPage()
        new_meta.txn_id = self.meta.txn_id + 1
        new_meta.root_page = new_root_id
        new_meta.free_list_head = self._updated_free_list()   # old pages go on the list
        new_meta.checksum = _compute_checksum(new_meta)
        _write_meta(self.file, new_slot, new_meta)
        # 4. fsync. THIS fsync is the commit. Returning from it means the
        #    new tree is durable and visible to every future reader.
        os.fsync(self.file.fileno())

Read that commit function slowly. There is no WAL. There is no redo log. There is no rollback log. There is no recovery procedure. There are two fsyncs — one for data pages, one for the meta — and the second is the entire commit mechanism.

Why does the first fsync (of the data pages) matter? Because if we wrote the new meta-page pointing at a new root whose data pages had not yet reached the disk, a crash between the two fsyncs would leave us with a meta-page pointing at pages that — from the physical disk's perspective — do not yet exist, or contain garbage. The first fsync guarantees the new pages are on disk before the meta-page that references them. Order of durability matters: data first, then meta. This ordering rule is the same one filesystems use for journaled writes and the same one every CoW engine enforces.

Why is writing to a fresh offset safer than writing in place, when both are the same write() syscall and the same fsync? Because a torn write to a fresh page corrupts only that new page, which the database does not yet consider live. Until the meta-page flip, nothing points at the new page; corruption of it is just wasted allocation. A torn write to a page the meta already points at (in-place) corrupts a page the database believes is authoritative — and any reader that touches it sees the corruption. CoW separates "write the new data" from "make the new data authoritative"; in-place collapses them into one step.

LMDB — the canonical single-file CoW B+ tree

If you want to see CoW done with real engineering in about four thousand lines of code, read LMDB — the Lightning Memory-Mapped Database, written by Howard Chu at Symas Corporation starting in 2011 and used today by OpenLDAP, Cloudflare's load balancer, Monero, Bitcoin Core (until 2021), and a long tail of embedded applications that need a transactional key-value store inside a single file.

LMDB takes the CoW picture and makes four design choices that are worth knowing about:

  1. Memory-mapped. The entire database file is mmaped into the process's address space. Reads are pointer dereferences; there is no buffer pool because the kernel's page cache is the buffer pool. Writes use plain pwrite to punch fresh pages into the file, then msync (or fsync) at commit. The RAM-to-disk interface is entirely the kernel's problem.
  2. Single writer, many readers. Exactly one write transaction can be active at a time. This is not a bug; it is the feature that makes MVCC snapshots essentially free. Readers take an O(1) snapshot by recording the current meta-page's transaction id; the writer's new pages are invisible to them because the readers are following pointers from a meta they captured before the writer began. Multiple writers would require complicated locking; LMDB gives up throughput on the write path to get simplicity and concurrent-read throughput that scales linearly with cores.
  3. Free-list as a B+ tree. The list of pages freed by old transactions is itself stored as a B+ tree inside the same file, keyed by the transaction id that freed them. When a new writer starts, it reads the free-list B+ tree to find reusable pages; when a writer commits, it appends its own freed pages to the free-list as part of the transaction. This means page allocation is transactional and crash-safe just like everything else.
  4. Two meta-pages, alternating slots, CRC-checked. Exactly the protocol described above. Pages 0 and 1 of the file hold the two meta-pages; the one with the higher valid CRC'd transaction id is the current one. Recovery is: open the file, read both meta-pages, pick the good one with the higher transaction id, done.

The result is a database that opens in microseconds (no log replay), survives any crash (the old meta-page is always valid), provides serializable isolation for the writer and repeatable-read snapshots for every reader (MVCC by construction), and fits in a single file you can cp to back up (the file is internally consistent at any instant).

The cost, as always, is written in the description. Every update rewrites every page on the path. A one-key change to a billion-record B+ tree at depth 4 writes four pages (leaf, two internal, root) plus the meta-page — about 20 KiB for a 24-byte change. This is called write amplification, and LMDB's is among the highest of any engine in common use. On workloads where that amplification is not the bottleneck — read-heavy, low-write-throughput, needs simple crash safety — LMDB is unbeatable. On write-heavy workloads, the in-place+WAL engines win by 5-10× on sustained throughput.
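The amplification factor is easy to put a number on. A quick calculation, assuming 4 KiB pages and counting one meta-page on top of the root-to-leaf path:

```python
PAGE_SIZE = 4096

def write_amplification(path_depth: int, bytes_changed: int) -> float:
    # CoW rewrites every page on the root-to-leaf path plus one meta-page.
    pages_written = path_depth + 1
    return pages_written * PAGE_SIZE / bytes_changed

# A 24-byte logical change to a depth-4 tree costs 5 physical pages:
print(round(write_amplification(4, 24)))   # → 853
```

Roughly 850 bytes hit the disk for every byte the application changed. Larger transactions amortise this: the path pages are copied once per transaction, not once per key.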

Trace an update under each scheme

Start with the tree from chapter 23, of depth 4 on a billion-row table. The current root is at page 100; the relevant internal ancestors are pages 250 and 3,500; the target leaf is page 4,827. The database file has grown to just over 8,000 pages, so fresh allocations begin at page 8,001. Meta-pages are at 0 and 1; the current meta (slot 0) has txn_id = 117 and root_page = 100.

You call update(key=42, value="hello"). Both engines find the same leaf to modify: page 4,827.

In-place trace (simplified — no WAL here, which is unsafe):

t=0    seek to offset 4827 * 4096 = 19,771,392
t=0+   write 4096 bytes over the old contents of page 4,827
t=1ms  fsync — kernel flushes page cache to disk controller cache
t=2ms  FUA flag forces controller to write to NAND; fsync returns
       total I/O:  1 page written
       space used: 0 extra pages
       recovery:   impossible without a WAL; the old page is gone

In-place trace (with WAL, the real world):

t=0    append WAL record "txn 118: page 4,827 from OLD to NEW" (say 200 bytes)
t=0+   fsync the WAL  — this is the commit
t=1ms  at some later time (checkpoint), write page 4,827 over its old contents
t=?    on crash: scan WAL forward, reapply any committed records to pages
       total I/O:  1 WAL fsync + 1 page write (eventually)
       space used: some WAL segments, rotated on checkpoint
       recovery:   O(time since last checkpoint) — replay the WAL tail

CoW trace (LMDB-style):

t=0    allocate page 8,001 (fresh), write the new leaf bytes there
t=0+   allocate page 8,002, write internal node (copy of page 3,500 but
       with the pointer to 4,827 replaced by 8,001)
t=0+   allocate page 8,003, write internal node (copy of 250 with pointer
       to 3,500 replaced by 8,002)
t=0+   allocate page 8,004, write root (copy of 100 with pointer to
       250 replaced by 8,003)
t=1ms  fsync — all four new pages are durable
t=1ms+ write the OTHER meta-page (slot 1): txn_id=118, root_page=8,004,
       free_list includes pages {4,827; 3,500; 250; 100}
t=2ms  fsync — THIS is the commit; now slot 1 has the higher txn id
       total I/O:  5 pages written (4 data + 1 meta)
       space used: 4 pages freed, 4 pages allocated; net zero at next GC
       recovery:   instantaneous; on reboot, meta with highest valid CRC wins

Five pages under CoW versus one under pure in-place (or one-plus-some-WAL under in-place+WAL). Five times the I/O for one key change. That is the number you are paying for the zero-recovery property.

Now run the trace again but this time assume the power cuts at t=1.5ms — halfway through the meta-write in the CoW case, halfway through the page-write in the in-place case.

In-place (no WAL), mid-write crash. The kernel had written 2,048 of 4,096 bytes to page 4,827 when the rails dropped. The disk controller had acknowledged some but not all of them. On reboot, page 4,827 is a mixture of old and new bytes. The page's checksum (if there is one) fails. Its B+ tree invariants (if checked) fail. Every lookup that descends to page 4,827 returns corrupt data or a checksum error. The database is corrupt and unrecoverable without a backup.

In-place + WAL, mid-write crash. Same torn write on page 4,827. But the WAL record was fsynced at t=0+. On reboot, the WAL scan finds an unfinished redo record for page 4,827, reads the intended new bytes from the log, writes them over the torn page, fsyncs. The page is now consistent. Recovery time: milliseconds. Database intact.

CoW, mid-write crash. The new pages at 8,001 through 8,004 had been fsynced at t=1ms. The meta-page write at slot 1 had started but not completed. On reboot, both meta-pages are read: slot 0 has txn_id=117, checksum valid; slot 1 has txn_id=118, checksum INVALID (because the write was torn). The database picks slot 0. The root is page 100. Pages 8,001–8,004 are orphans; they will be discovered and reclaimed on the next full GC scan. The database is in the state it was at txn_id=117, exactly as if the update never happened. Recovery time: zero.
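That recovery decision can be exercised end to end in a few lines. A self-contained toy — two CRC-checked 20-byte meta slots; the layout is illustrative, not LMDB's — simulates the torn slot-1 write and shows reopen falling back to slot 0:

```python
import struct
import zlib

def make_meta(txn_id, root):
    body = struct.pack("<QQ", txn_id, root)
    return body + struct.pack("<I", zlib.crc32(body))

def read_meta(slot):
    txn_id, root, crc = struct.unpack("<QQI", slot)
    return (txn_id, root) if zlib.crc32(slot[:16]) == crc else None

def open_db(slot0, slot1):
    # The entire recovery procedure: drop any slot that fails its CRC,
    # follow the survivor with the highest transaction id.
    valid = [m for m in (read_meta(slot0), read_meta(slot1)) if m]
    return max(valid)

slot0 = make_meta(117, 100)     # current commit: root at page 100
slot1 = make_meta(116, 95)      # previous commit, about to be overwritten
new   = make_meta(118, 8004)    # the commit in flight

# Power cut mid-overwrite of slot 1: new txn id landed, old root and old
# CRC remain.  Without the CRC you would open txn 118 at the WRONG root.
torn = new[:8] + slot1[8:]
assert open_db(slot0, torn) == (117, 100)    # CRC rejects the tear; old tree
assert open_db(slot0, new)  == (118, 8004)   # fsync completed; new tree
```

The torn case is the instructive one: the half-written slot momentarily claims a newer transaction id, and only the checksum stops recovery from trusting it.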

Crash analysis — the whole picture

Gather the three crash outcomes into a table. This is the clearest way to see why CoW is worth the write amplification for durability-paranoid workloads.

Scheme          | Crash during data write              | Crash during commit step                            | Recovery cost on reboot
In-place only   | Torn data page, permanent corruption | Same: torn page, permanent corruption               | Impossible without backup
In-place + WAL  | Torn data page, restored from log    | Log fsync either landed (commit) or not (no commit) | Scan + replay log tail (ms to s)
Copy-on-write   | Orphaned new pages, old tree intact  | Meta fsync either landed (new tree) or not (old)    | Zero — pick the valid meta

Why does CoW have zero recovery cost? Because the only atomic step that distinguishes "committed" from "not committed" is a single-page write to a fixed offset — the meta-page. A single aligned 4 KiB write to a checksummed page is either (a) entirely the new contents, (b) entirely the old contents because the disk ignored the write, or (c) torn and detected by checksum mismatch. Case (a) means commit succeeded; cases (b) and (c) mean commit failed and the old state is still valid. There is no "partially committed" state to roll back, so there is nothing to replay. Contrast with in-place: a mid-page torn write can leave a page in a state that is neither old nor new and whose recovery requires knowing what the full old bytes or full new bytes were — which is exactly what the WAL stores.

Tradeoffs — when each wins

The fair summary is that neither scheme dominates. Each wins on a different class of workload.

CoW wins when:

  - reads dominate and write throughput is modest, so the write amplification never becomes the bottleneck
  - you need instant restart after a crash: no log to replay, no recovery subsystem to build or maintain
  - you want MVCC read snapshots and consistent single-file backups essentially for free

In-place + WAL wins when:

  - writes dominate; on sustained write throughput the in-place + WAL engines win by 5-10×
  - many writers must commit concurrently, which single-writer CoW designs give up
  - space is tight and you cannot afford to hold old page versions until garbage collection reclaims them

Empirically: OLTP workloads with thousands of concurrent writers (Postgres, MySQL, Oracle) always use in-place + WAL. Embedded key-value stores, MVCC-first engines, and anything that cares more about operational simplicity than raw write throughput (LMDB, BoltDB, Berkeley DB in some modes) use CoW. LSM-trees (Build 3) are a third family that is CoW-like in spirit (SSTables are immutable; compaction writes fresh files) but organised for sequential bulk writes rather than random updates.

Common confusions

Going deeper

If the meta-flip picture is clear, what follows are the refinements that real CoW engines add on top.

LMDB's single-writer design and how it works in practice

Howard Chu's 2011 paper on LMDB spends most of its space on two ideas: the single-writer choice, and the memory-mapped read path. The combination is what makes LMDB's read throughput scale to hundreds of thousands of operations per second per core.

A read transaction in LMDB acquires a slot in a shared reader table — a small shared-memory array where each concurrent reader records its transaction id. Readers do not take any lock on the tree. A writer cannot reuse a page that is still referenced by any live reader's snapshot (the reader table tells it which transaction ids are still live). When a reader finishes, it clears its slot; the writer can then reclaim pages freed after that transaction.

This works because the writer is the only thread that modifies the free-list, so the reader-visibility check is a single-threaded operation against a shared-memory table — no coordination between readers is needed, no locks are acquired except a short-lived one on the writer. The read path is wait-free and lock-free in the common case.
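The reclamation rule in that paragraph can be sketched directly. A toy model — names and structures are illustrative, not LMDB's actual code: each reader records its snapshot's transaction id in a shared table, and the writer may reuse pages freed by transaction T only once the oldest live reader is at least T:

```python
# Toy reader table and page-reclamation rule.
reader_table = {}                                # reader slot -> snapshot txn id
freed = {115: [812, 813], 117: [4827, 3500]}     # txn that freed -> page ids

def begin_read(slot, current_txn):
    reader_table[slot] = current_txn     # O(1) snapshot: just record the id

def end_read(slot):
    del reader_table[slot]               # clearing the slot unblocks reclamation

def reclaimable(current_txn):
    # A page freed by txn T was still live in every snapshot older than T,
    # so it can be reused only once the oldest live reader is at least T.
    oldest = min(reader_table.values(), default=current_txn)
    return [p for t, pages in freed.items() if t <= oldest for p in pages]

begin_read(0, 115); begin_read(1, 117)
assert reclaimable(118) == [812, 813]              # reader at 115 pins txn-117 frees
end_read(0)
assert reclaimable(118) == [812, 813, 4827, 3500]  # now everything is reusable
```

Note what this implies operationally: a long-running read transaction pins every page freed since its snapshot, so the file cannot reuse that space until the reader finishes — a well-known LMDB footgun.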

The cost of "only one writer" is that write throughput is capped at whatever one thread can achieve against the storage device — typically tens of thousands of transactions per second on fast NVMe. For read-heavy workloads this is more than enough. For write-heavy workloads, LMDB is not the right tool; Postgres or RocksDB are.

Shadow paging — the historical ancestor

CoW is the modern name for what Jim Gray and others called shadow paging in the 1970s. Gray's 1978 paper "Notes on Database Operating Systems" describes a system where every page in the file has two slots — the current slot and the shadow slot — and a page table maps logical page numbers to one or the other. A write copies the page to the shadow slot, updates the page table's shadow copy, and at commit atomically swaps the page table pointer.

The mechanism is identical to LMDB's meta-flip, but the implementation is heavier: an extra indirection table, double the space for every page, and more complex free-space management. Shadow paging fell out of favour in the 1980s as WAL-based in-place engines (System R, then commercial Oracle and DB2) demonstrated higher throughput on the workloads IBM and Oracle cared about.

It came back in the 2000s under the "copy-on-write" name when workloads shifted: embedded databases needed crash safety without DBA-managed logs, filesystems needed snapshots, and virtual machines needed fork-like cloning. What was a losing tradeoff for 1985 mainframes turned out to be a winning tradeoff for 2010 flash-based single-node key-value stores.

Btrfs and ZFS — CoW at the filesystem layer

Both Btrfs (Linux) and ZFS (Solaris, now OpenZFS) implement CoW at the filesystem level. Every block write allocates a new block; the pointer to it propagates up the filesystem's tree of block groups, directories, and the superblock; the superblock flip is the commit.

This gives filesystems three things that traditional in-place filesystems (ext4, XFS, NTFS) need a journal to approximate: crash consistency without replay (the superblock flip is the commit, exactly as with LMDB's meta-page), nearly free snapshots and clones (an old tree can simply be kept instead of reclaimed), and block-level integrity checksums (every write lands at a fresh location, so a checksum can travel with the pointer that references it).

The cost — write amplification and no in-place rewrites — is the same as LMDB's. On heavy random write workloads (databases on ZFS, VM disks on Btrfs), performance suffers relative to ext4+WAL, and operators often either tune the filesystem heavily or choose a different one.

Postgres HOT updates — a middle ground

Postgres uses in-place storage (heap pages) with a WAL, but it has a clever optimisation called heap-only tuple (HOT) updates. When a row is updated in a way that doesn't touch any indexed columns and the new version fits on the same page as the old, Postgres writes the new version in the same page with a pointer from the old tuple slot, without updating any secondary indexes. The old tuple becomes invisible after the transaction commits; VACUUM reclaims it later.

This is CoW-at-tuple-granularity inside an otherwise in-place engine. The new tuple is a copy; the old tuple survives as long as any snapshot might need it (MVCC). The commit mechanism is still the WAL, so HOT doesn't change Postgres's crash story — but it does reduce write amplification and index-update traffic for common update patterns (UPDATE ... SET last_seen_at = NOW()).

HOT is the kind of hybrid you get when a mature engine has room to adopt good ideas from the other camp without rewriting from scratch. The lesson: "CoW vs in-place" is the textbook dichotomy, but real engines mix and match at different granularities.

Where this leads next

A page, in both schemes, is still just 4 KiB of bytes. So far we have treated it as a fixed-layout structure — 16-byte header, then keys, then pointers, fixed widths throughout. But real databases store variable-length keys (strings, blobs) and variable-length values (rows of varying column lengths). The next chapter introduces the slotted page — a layout that packs variable-length records into a page by growing a slot array from one end and data from the other, meeting in the middle. It is the standard layout used by Postgres, SQLite, InnoDB, and every other page-oriented engine.

After that, Build 4 picks up the remaining machinery: free-space maps, latching for concurrency, and the bridge to Build 5's write-ahead log. The storage engine is becoming a real storage engine, one page-layout and one lock at a time.

Choose your mutation strategy — in-place with WAL, or CoW with meta-flip — and everything else in the storage stack hangs off that choice.

References

  1. Howard Chu, LMDB: Lightning Memory-Mapped Database, LinuxCon Europe (2012) — the original design talk. Short, punchy, the best introduction by the author.
  2. Jim Gray, Notes on Database Operating Systems, in "Operating Systems: An Advanced Course" (Springer LNCS, 1978) — the paper that described shadow paging before anyone called it CoW.
  3. Ohad Rodeh, B-trees, Shadowing, and Clones, ACM TOS (2008) — the design paper for the Btrfs B-tree, a CoW B+ tree on which the filesystem is built. Bridges the database and filesystem takes on the idea.
  4. Mohan, Haderle, Lindsay, Pirahesh, Schwarz, ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS (1992) — the foundational in-place + WAL recovery paper. Build 5 is this paper.
  5. Jeff Bonwick, Matt Ahrens, The Zettabyte File System, technical report (2003) — the ZFS design paper; CoW applied to a filesystem, with integrity checksums and snapshots.
  6. LMDB source, libraries/liblmdb/mdb.c in the OpenLDAP repository — the canonical CoW B+ tree, about 10,000 lines of C, readable over a weekend.