In short
There are exactly two ways to change a page that is already on disk. In-place update writes the new bytes over the old, at the same disk offset — one page of I/O, zero extra space, and a window of milliseconds during which the page on disk is neither the old version nor the new one but a half-smudged mixture of both, which a power cut turns into permanent corruption. Copy-on-write (CoW) writes the new page to a fresh disk offset, leaves the old page untouched, and then — as the very last step — overwrites a single meta-page at a fixed location to point at the new page as the new root of the tree. The old page and the old meta-page are still there until the next checkpoint reclaims them; the new page and new meta-page are either both fully written or neither visible. A crash in the middle simply wastes the half-written new page and the reader sees the old tree, intact, as if the write never happened. The price you pay for this atomicity is space (every update touches every page on the path from leaf to root, all of which must be copied) and CPU (you are writing more bytes and re-checksumming more pages). What you gain is that you need no write-ahead log at all — the meta-page swap is the commit. LMDB, Symas Corporation's single-file B+ tree used by OpenLDAP, Cloudflare, and (for a while) Bitcoin Core, is the canonical example: about four thousand lines of C, one writer at a time, MVCC snapshots for free, crash-safe by construction. In-place engines — InnoDB, Postgres's heap, SQLite's rollback mode — get back the space and speed but must pair every in-place write with a WAL entry to survive a crash. This chapter walks both mechanisms, traces a crash in each, and names the workloads where each one wins.
You have a B+ tree. Every node is a 4 KiB page. You just called insert(key, value) and your code has done the interesting part: descended to the right leaf, noted that the leaf needs one more key, and now holds the new leaf contents in a buffer in RAM. The leaf's page id is 427. It lives at file offset 427 × 4096 = 1,748,992 bytes into your database file.
There is exactly one question left in this chapter, and it is not complicated to state: how do the new bytes get into that file?
Two answers exist. Both are in production use today. Both have their champions and their graveyards.
The first answer is the one you would invent in an afternoon if nobody told you it was dangerous. Seek to offset 1,748,992. Write 4,096 bytes. Call fsync. Return. One page of I/O; zero wasted space; the fastest possible mutation. This is in-place update, and it is what InnoDB, Oracle, DB2, and Postgres's heap pages all do. It is also, by itself, catastrophically unsafe — which is why none of those engines ship it alone. Every one of them pairs in-place writes with a write-ahead log (the subject of Build 5), an append-only record of what they are about to do that they write first and fsync before touching the real page. The log is the safety net; without it, the in-place write is a tightrope over a wood chipper.
The second answer is the one that sounds wasteful the first time you hear it but turns out, on reflection, to be the one a cautious engineer might invent instead. Do not touch page 427. Write the new leaf to a fresh page at a free offset somewhere else — say page 900. Now the parent internal node has to point at page 900 instead of 427, so do not touch the parent either: copy it to page 901. Its parent has to be rewritten too; copy it to page 902. Work all the way up to the root, copying every node along the path. Finally — and this is the whole trick — the file has a reserved fixed-offset page called the meta-page that says "the root of the current tree is at page X". Atomically overwrite that one meta-page to point at the new root instead of the old one. The instant that write lands, the entire new tree becomes visible; until it lands, the entire old tree is still there, untouched. A crash in the middle throws away the half-built new tree, costs nothing, and leaves the database in exactly the state it was in before the transaction started. This is copy-on-write, and LMDB is its most famous incarnation.
The price of CoW is right there in the description: you wrote six pages (leaf, three internal, root, meta) to change one key. You used up more disk space, at least temporarily — the old pages are still allocated until garbage collection frees them. You did more CPU work on every update.
The benefit is that you got atomic transactions without a write-ahead log. A single meta-page flip is the entire commit mechanism. No redo log, no rollback log, no undo chain, no recovery procedure that has to read a log and replay operations. Crash? The old meta-page still points at the old tree. Readers see a consistent snapshot. Restart, and you are exactly where you were. Recovery takes zero milliseconds.
Two mechanisms, two sets of tradeoffs, two families of databases. This chapter walks them both.
The picture — one page, two destinies
The picture is the whole chapter in one figure. Every detail below is an elaboration of it.
In-place update — fast, dangerous, universal
The in-place story is short because it is the one you already know.
import os

def update_in_place(file, page_id, new_page_bytes):
    offset = page_id * PAGE_SIZE
    file.seek(offset)
    file.write(new_page_bytes)   # bytes land in the kernel page cache
    os.fsync(file.fileno())      # now durable on the device
That is the entire primitive. Three lines. Every database you have used relies on it at some level, and every database that does also pays the price of making it crash-safe. That price is the subject of Build 5, but you need a sketch of it to understand why CoW is interesting.
Why is in-place unsafe on its own? Because write(fd, buf, 4096) is not atomic at any layer below your application. The kernel might have copied 2 KiB of your 4 KiB into the page cache when a power cut hits; the disk controller might have acknowledged 512 bytes of a 4 KiB sector when the capacitor drains; the NAND flash might have programmed four of eight 512-byte sub-pages inside a single logical sector. After the reboot, page 427 contains some bytes from the old version and some from the new, interleaved unpredictably. No checksum will match; no B+ tree invariant will hold; your database is broken. This failure mode has a name — the torn write — and every in-place storage engine has to defend against it.
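The torn write is easy to demonstrate, and the demonstration is also the standard defence: a per-page checksum. A minimal sketch (the page layout and function names here are illustrative, not taken from any particular engine):

```python
import struct
import zlib

PAGE_SIZE = 4096

def seal_page(payload: bytes) -> bytes:
    """Prefix a page's payload with a CRC32 of its contents."""
    assert len(payload) == PAGE_SIZE - 4
    return struct.pack("<I", zlib.crc32(payload)) + payload

def page_ok(page: bytes) -> bool:
    """Verify the stored CRC32 against the payload actually read back."""
    (stored,) = struct.unpack("<I", page[:4])
    return stored == zlib.crc32(page[4:])

old = seal_page(b"A" * (PAGE_SIZE - 4))
new = seal_page(b"B" * (PAGE_SIZE - 4))

# Simulate a torn write: power dies after the first 2 KiB of the new
# version reached the disk, leaving the tail of the old version behind it.
torn = new[:2048] + old[2048:]

assert page_ok(old) and page_ok(new)
assert not page_ok(torn)   # the mixture matches neither version's checksum
```

The checksum detects the corruption; it cannot repair it. Knowing what the full old or new bytes were requires a second copy somewhere, which is exactly where the WAL comes in.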
The defence is the write-ahead log. Before touching page 427, write a WAL record that says "I am about to change page 427 from X to Y", fsync the log, and only then do the in-place write. If the power dies mid-page-write, you reboot, read the log, find the unfinished record, and either redo the page write from the log (if the log has the new bytes) or undo it (if the page got half-written and the log records what the old bytes were). Either way, the tree is restored to a consistent state.
The WAL adds a second fsync to every commit, a second stream of writes to the disk, and an entire recovery subsystem to the database — the ARIES paper (Build 5) is thirty pages and describes exactly how to do this right. For now the point is just: in-place update by itself is not safe; in-place + WAL is safe but expensive; without the WAL, a crash corrupts the database. Postgres calls its WAL pg_wal; MySQL's InnoDB calls it ib_logfile; SQLite in its default mode uses a rollback journal that plays the same role in reverse (write the old bytes to the journal first, then overwrite the page, then delete the journal). Different names, same mechanism.
Copy-on-write — the meta-page flip
The CoW story requires that we first name the actors.
In a CoW B+ tree, every page on disk lives at one of three lifecycle stages:
- live — the current committed tree references this page (from its meta-page, through the root, through internal nodes, down to leaves).
- old — a previous version's page that is no longer referenced by the current meta-page, but is still there on disk. May be reclaimed on the next free-list pass.
- new — freshly written during an in-flight transaction that has not yet committed. Will become live on commit; will become old (and then reclaimable) on abort.
At the very start of the file there are two meta-pages at fixed offsets — say, pages 0 and 1. Each meta-page holds: a transaction id (a monotonically increasing counter), the page number of that transaction's root, a pointer to the head of the free-page list, and a checksum. Only one of the two meta-pages is "current" at any given time — the one with the higher, valid transaction id. The other is the previous one.
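A meta-page of that shape is small enough to serialize by hand. A sketch of one possible layout — the field order, widths, and checksum choice are illustrative, not LMDB's actual format:

```python
import struct
import zlib

META_FMT = "<QQQ"   # txn_id, root_page, free_list_head: one u64 each

def pack_meta(txn_id: int, root_page: int, free_list_head: int) -> bytes:
    body = struct.pack(META_FMT, txn_id, root_page, free_list_head)
    return body + struct.pack("<I", zlib.crc32(body))   # checksum last

def unpack_meta(raw: bytes):
    """Return (txn_id, root_page, free_list_head), or None if the
    checksum fails — a torn or never-written slot looks the same."""
    body = raw[:24]
    (crc,) = struct.unpack("<I", raw[24:28])
    if zlib.crc32(body) != crc:
        return None           # caller falls back to the other meta slot
    return struct.unpack(META_FMT, body)

assert unpack_meta(pack_meta(117, 100, 5000)) == (117, 100, 5000)
assert unpack_meta(b"\x00" * 28) is None   # garbage slot is rejected
```

The checksum is what lets "read both slots, keep the valid one with the higher txn_id" work even when one slot was torn mid-write.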
The whole commit protocol — the entire transactional machinery of the engine — is this:
1. Start a write transaction. Read the current meta-page to find the root.
2. For every page you need to modify (the leaf, every ancestor on the path up to the root), allocate a free page and write the new version of that page there. Do not touch the old pages.
3. When the root has been rewritten to its new location, fsync the file — this ensures every new page is durable.
4. Prepare the other meta-page (the non-current one) with the new transaction id, the new root pointer, the updated free-list head. Write it, fsync it.
The commit is step 4's fsync. If the power dies before that fsync returns, the other meta-page is either still the old one or is half-written — and in either case the current meta-page, sitting untouched in its own slot with the highest valid transaction id, is unchanged, so on reboot the database opens the old meta-page and sees the old tree exactly as it was. The new pages are orphans; they will be reclaimed when the free-list gets to them.
If the power dies after that fsync returns, the new meta-page is on disk and has the highest valid transaction id, so on reboot the database opens it and sees the new tree.
There is no in-between. The only thing that could create an in-between is a torn write on the meta-page itself — but the meta-page is a single block, typically 4 KiB, and we check its checksum on read. If the checksum fails, we treat that meta-page as absent and use the other one, which is still the old valid meta-page. Zero-downtime recovery, zero log replay.
Why two meta-pages?
With one meta-page, a torn write to that page would leave you with no valid meta-page and no way to open the database. With two, one slot is always the previous valid state, so even a double torn write cannot destroy both transaction ids at once (the transactions are separated by at least one full fsync boundary). In steady state, the two slots alternate: even-numbered transactions land in slot 0, odd-numbered in slot 1, or similar.
The pseudocode
# cowtree.py — a minimal copy-on-write B+ tree, stripped to the essentials.
# Pages are 4096-byte buffers; pages 0 and 1 are reserved meta-pages.
import os

PAGE_SIZE = 4096
META_SLOTS = (0, 1)

class MetaPage:
    __slots__ = ("slot", "txn_id", "root_page", "free_list_head", "checksum")

def read_current_meta(file):
    m0 = _read_meta(file, 0)   # helper: parse the meta-page in slot 0
    m1 = _read_meta(file, 1)
    valid = [m for m in (m0, m1) if _checksum_ok(m)]
    if not valid:
        raise RuntimeError("both meta-pages corrupt — database unopenable")
    return max(valid, key=lambda m: m.txn_id)

def begin_write(file):
    current = read_current_meta(file)
    return WriteTxn(file, current)

class WriteTxn:
    def __init__(self, file, meta):
        self.file = file
        self.meta = meta
        self.new_pages = {}   # page_id -> new bytes, for pages this txn rewrote
        self.allocated = []   # freshly-allocated page ids

    def rewrite(self, old_page_id, new_bytes):
        # Always allocate a new page; never touch the old one.
        new_page_id = self._alloc_page()
        self.new_pages[new_page_id] = new_bytes
        return new_page_id

    def _alloc_page(self):
        # In a real implementation, pull from the free list; here just grow
        # the file (counting pages this txn has already claimed but not written).
        file_size = os.path.getsize(self.file.name)
        new_id = file_size // PAGE_SIZE + len(self.allocated)
        self.allocated.append(new_id)
        return new_id

    def commit(self, new_root_id):
        # 1. Write every new page to its new offset.
        for pid, data in self.new_pages.items():
            self.file.seek(pid * PAGE_SIZE)
            self.file.write(data)
        # 2. fsync the data pages. They are all at fresh offsets, so no
        #    torn-write can corrupt live data. The worst case is orphaned
        #    new pages.
        os.fsync(self.file.fileno())
        # 3. Pick the other meta-slot and write the new meta there.
        new_slot = META_SLOTS[1] if self.meta.slot == META_SLOTS[0] else META_SLOTS[0]
        new_meta = MetaPage()
        new_meta.slot = new_slot
        new_meta.txn_id = self.meta.txn_id + 1
        new_meta.root_page = new_root_id
        new_meta.free_list_head = self._updated_free_list()  # old pages go on the list
        new_meta.checksum = _compute_checksum(new_meta)
        _write_meta(self.file, new_slot, new_meta)
        # 4. fsync. THIS fsync is the commit. Returning from it means the
        #    new tree is durable and visible to every future reader.
        os.fsync(self.file.fileno())
Read that commit function slowly. There is no WAL. There is no redo log. There is no rollback log. There is no recovery procedure. There are two fsyncs — one for data pages, one for the meta — and the second is the entire commit mechanism.
Why does the first fsync (of the data pages) matter? Because if we wrote the new meta-page pointing at a new root whose data pages had not yet reached the disk, a crash between the two fsyncs would leave us with a meta-page pointing at pages that — from the physical disk's perspective — do not yet exist, or contain garbage. The first fsync guarantees the new pages are on disk before the meta-page that references them. Order of durability matters: data first, then meta. This ordering rule is the same one filesystems use for journaled writes and the same one every CoW engine enforces.
Why is writing to a fresh offset safer than writing in place, when both are the same write() syscall and the same fsync? Because a torn write to a fresh page corrupts only that new page, which the database does not yet consider live. Until the meta-page flip, nothing points at the new page; corruption of it is just wasted allocation. A torn write to a page the meta already points at (in-place) corrupts a page the database believes is authoritative — and any reader that touches it sees the corruption. CoW separates "write the new data" from "make the new data authoritative"; in-place collapses them into one step.
LMDB — the canonical single-file CoW B+ tree
If you want to see CoW done with real engineering in about four thousand lines of code, read LMDB — the Lightning Memory-Mapped Database, written by Howard Chu at Symas Corporation starting in 2011 and used today by OpenLDAP, Cloudflare's load balancer, Monero, Bitcoin Core (until 2021), and a long tail of embedded applications that need a transactional key-value store inside a single file.
LMDB takes the CoW picture and makes four design choices that are worth knowing about:
- Memory-mapped. The entire database file is mmaped into the process's address space. Reads are pointer dereferences; there is no buffer pool because the kernel's page cache is the buffer pool. Writes use plain pwrite to punch fresh pages into the file, then msync (or fsync) at commit. The RAM-to-disk interface is entirely the kernel's problem.
- Single writer, many readers. Exactly one write transaction can be active at a time. This is not a bug; it is the feature that makes MVCC snapshots essentially free. Readers take an O(1) snapshot by recording the current meta-page's transaction id; the writer's new pages are invisible to them because the readers are following pointers from a meta they captured before the writer began. Multiple writers would require complicated locking; LMDB gives up write-path throughput to get simplicity and concurrent-read performance that scales linearly with cores.
- Free-list as a B+ tree. The list of pages freed by old transactions is itself stored as a B+ tree inside the same file, keyed by the transaction id that freed them. When a new writer starts, it reads the free-list B+ tree to find reusable pages; when a writer commits, it appends its own freed pages to the free-list as part of the transaction. This means page allocation is transactional and crash-safe just like everything else.
- Two meta-pages, alternating slots, CRC-checked. Exactly the protocol described above. Pages 0 and 1 of the file hold the two meta-pages; the one with the higher valid CRC'd transaction id is the current one. Recovery is: open the file, read both meta-pages, pick the good one with the higher transaction id, done.
The result is a database that opens in microseconds (no log replay), survives any crash (the old meta-page is always valid), provides serializable isolation for the writer and repeatable-read snapshots for every reader (MVCC by construction), and fits in a single file you can cp to back up (the file is internally consistent at any instant).
The cost, as always, is written in the description. Every update rewrites every page on the path. A one-key change to a billion-record B+ tree at depth 4 writes four pages (leaf, two internal, root) plus the meta-page — about 20 KiB for a 24-byte change. This is called write amplification, and LMDB's is among the highest of any engine in common use. On workloads where that amplification is not the bottleneck — read-heavy, low-write-throughput, needs simple crash safety — LMDB is unbeatable. On write-heavy workloads, the in-place+WAL engines win by 5-10× on sustained throughput.
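The arithmetic generalizes: a CoW commit rewrites one page per tree level plus the meta-page, no matter how few bytes logically changed. A quick back-of-envelope:

```python
PAGE_SIZE = 4096

def cow_bytes_written(tree_depth: int) -> int:
    # one fresh copy per level (leaf up to root), plus the meta-page
    return (tree_depth + 1) * PAGE_SIZE

# depth-4 tree (root, two internals, leaf), 24-byte logical change:
payload = 24
written = cow_bytes_written(4)
assert written == 20480                 # 20 KiB hit the disk
assert written // payload == 853        # ~850x write amplification
```

Doubling the page size or deepening the tree scales the cost linearly, which is why CoW engines care so much about keeping trees shallow (high fanout) and batching many key changes into one transaction — the path copies are shared across every change in the same commit.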
Trace an update under each scheme
Start with the tree from chapter 23, of depth 4 on a billion-row table. The current root is at page 100; the relevant internal ancestors are pages 250 and 3,500; the target leaf is page 4,827. The database file has grown to 8,000 pages. Meta-pages are at 0 and 1; the current meta (slot 0) has txn_id = 117 and root_page = 100.
You call update(key=42, value="hello"). Both engines find the same leaf to modify: page 4,827.
In-place trace (simplified — no WAL here, which is unsafe):
t=0 seek to offset 4827 * 4096 = 19,771,392
t=0+ write 4096 bytes over the old contents of page 4,827
t=1ms fsync — kernel flushes page cache to disk controller cache
t=2ms FUA flag forces controller to write to NAND; fsync returns
total I/O: 1 page written
space used: 0 extra pages
recovery: impossible without a WAL; the old page is gone
In-place trace (with WAL, the real world):
t=0 append WAL record "txn 118: page 4,827 from OLD to NEW" (say 200 bytes)
t=0+ fsync the WAL — this is the commit
t=1ms at some later time (checkpoint), write page 4,827 over its old contents
t=? on crash: scan WAL forward, reapply any committed records to pages
total I/O: 1 WAL fsync + 1 page write (eventually)
space used: some WAL segments, rotated on checkpoint
recovery: O(time since last checkpoint) — replay the WAL tail
CoW trace (LMDB-style):
t=0 allocate page 8,001 (fresh), write the new leaf bytes there
t=0+ allocate page 8,002, write internal node (copy of page 3,500 but
with the pointer to 4,827 replaced by 8,001)
t=0+ allocate page 8,003, write internal node (copy of 250 with pointer
to 3,500 replaced by 8,002)
t=0+ allocate page 8,004, write root (copy of 100 with pointer to
250 replaced by 8,003)
t=1ms fsync — all four new pages are durable
t=1ms+ write the OTHER meta-page (slot 1): txn_id=118, root_page=8,004,
free_list includes pages {4,827; 3,500; 250; 100}
t=2ms fsync — THIS is the commit; now slot 1 has the higher txn id
total I/O: 5 pages written (4 data + 1 meta)
space used: 4 pages freed, 4 pages allocated; net zero at next GC
recovery: instantaneous; on reboot, meta with highest valid CRC wins
Five pages under CoW versus one under pure in-place (or one-plus-some-WAL under in-place+WAL). Five times the I/O for one key change. That is the number you are paying for the zero-recovery property.
Now run the trace again but this time assume the power cuts at t=1.5ms — halfway through the meta-write in the CoW case, halfway through the page-write in the in-place case.
In-place (no WAL), mid-write crash. The kernel had written 2,048 of 4,096 bytes to page 4,827 when the rails dropped. The disk controller had acknowledged some but not all of them. On reboot, page 4,827 is a mixture of old and new bytes. The page's checksum (if there is one) fails. Its B+ tree invariants (if checked) fail. Every lookup that descends to page 4,827 returns corrupt data or a checksum error. The database is corrupt and unrecoverable without a backup.
In-place + WAL, mid-write crash. Same torn write on page 4,827. But the WAL record was fsynced at t=0+. On reboot, the WAL scan finds an unfinished redo record for page 4,827, reads the intended new bytes from the log, writes them over the torn page, fsyncs. The page is now consistent. Recovery time: milliseconds. Database intact.
CoW, mid-write crash. The new pages at 8,001 through 8,004 had been fsynced at t=1ms. The meta-page write at slot 1 had started but not completed. On reboot, both meta-pages are read: slot 0 has txn_id=117, checksum valid; slot 1 has txn_id=118, checksum INVALID (because the write was torn). The database picks slot 0. The root is page 100. Pages 8,001–8,004 are orphans; they will be discovered and reclaimed on the next full GC scan. The database is in the state it was at txn_id=117, exactly as if the update never happened. Recovery time: zero.
Crash analysis — the whole picture
Gather the three crash outcomes into a table. This is the clearest way to see why CoW is worth the write amplification for durability-paranoid workloads.
| Scheme | Crash during data write | Crash during commit step | Recovery cost on reboot |
|---|---|---|---|
| In-place only | Torn data page, permanent corruption | Same | Impossible without backup |
| In-place + WAL | Torn data page, restored from log | Log fsync either landed (commit) or not (no commit) | Scan + replay log tail (ms to s) |
| Copy-on-write | Orphaned new pages, old tree intact | Meta fsync either landed (new tree) or not (old tree) | Zero — pick the valid meta |
Why does CoW have zero recovery cost? Because the only atomic step that distinguishes "committed" from "not committed" is a single-page write to a fixed offset — the meta-page. A single aligned 4 KiB write to a checksummed page is either (a) entirely the new contents, (b) entirely the old contents because the disk ignored the write, or (c) torn and detected by checksum mismatch. Case (a) means commit succeeded; cases (b) and (c) mean commit failed and the old state is still valid. There is no "partially committed" state to roll back, so there is nothing to replay. Contrast with in-place: a mid-page torn write can leave a page in a state that is neither old nor new and whose recovery requires knowing what the full old bytes or full new bytes were — which is exactly what the WAL stores.
Tradeoffs — when each wins
The fair summary is that neither scheme dominates. Each wins on a different class of workload.
CoW wins when:
- The workload is read-heavy. Readers take free snapshots by capturing the meta-page's transaction id; they see a consistent view without taking any lock, and readers never block writers.
- You want zero-recovery-time crash safety without building a WAL subsystem. If your engine is embedded, you distribute the database file as a single artefact, or your operators insist on "just copy the file" backups, the file-level consistency of CoW is a huge operational win.
- You need MVCC for cheap. Every CoW commit naturally keeps the previous version around (until GC); readers pinned to older meta-pages can read old data for free.
- Write volume is modest. A desktop app's address book, an embedded configuration store, a caching index at the edge — hundreds to thousands of writes per second are well below CoW's ceiling.
In-place + WAL wins when:
- The workload is write-heavy and sustained. A 5-10× write amplification penalty, multiplied by hundreds of thousands of writes per second, is the difference between "database is fine" and "storage is saturated".
- Transactions touch many pages. CoW's cost scales with the number of distinct pages modified, each of which drags a full-page copy (and its share of the ancestor path) to disk; in-place pays only a small WAL record per change, so its cost tracks the size of the logical change rather than the page count.
- You need fine-grained locking and concurrent writers. In-place engines use page-level or row-level latches; multiple writers can touch disjoint pages at once. CoW's single-writer restriction is a hard ceiling.
- You need group commit. Batching many transactions into one fsync of the WAL is straightforward; batching many transactions into one CoW meta-flip requires that they serialize on the writer, defeating the point.
Empirically: OLTP workloads with thousands of concurrent writers (Postgres, MySQL, Oracle) always use in-place + WAL. Embedded key-value stores, MVCC-first engines, and anything that cares more about operational simplicity than raw write throughput (LMDB, BoltDB, Berkeley DB in some modes) use CoW. LSM-trees (Build 3) are a third family that is CoW-like in spirit (SSTables are immutable; compaction writes fresh files) but organised for sequential bulk writes rather than random updates.
Common confusions
- "Is CoW the same as MVCC?" Related but not identical. MVCC (multi-version concurrency control) is a general scheme where reads see a snapshot at some point in time. You can implement MVCC on top of in-place storage by keeping old versions in a separate undo log (Postgres, Oracle). You can implement MVCC on top of CoW by keeping old versions as the old pages that CoW already did not overwrite (LMDB). CoW makes MVCC cheap and natural; in-place makes it possible but requires explicit versioning machinery.
- "Does CoW mean no WAL ever?" For the tree itself, yes — LMDB has no WAL. For transactions that want to batch multiple logical operations into one commit, some CoW engines still keep a short in-memory redo buffer that gets written as part of the meta-flip, but it is not the ARIES-style persistent log of in-place engines. The term "copy-on-write" here is specifically about how pages are mutated on disk, not about how logical transactions are grouped.
- "Doesn't CoW fragment the file horribly?" Over time, without GC, yes — every update allocates a fresh page and leaves the old page on disk. Production CoW engines run a free-list compactor that identifies pages freed by transactions older than any live reader, and makes them available for reuse. LMDB's free-list lives in its own B+ tree and is updated transactionally with every write. Sequential allocation from the free-list also tends to keep related pages close together on disk, which helps scan performance.
- "What is the difference between CoW and 'shadow paging'?" Shadow paging is the 1970s ancestor of CoW, described in Jim Gray's early database papers. The mechanism — write modified pages to new locations, flip a root pointer — is identical. Modern CoW engines add refinements (free-list management, MVCC integration, mmap) but the core idea is the same one Gray sketched forty years ago.
- "I've seen 'copy-on-write' in filesystems (Btrfs, ZFS) and in fork(). Are those the same thing?" Same mechanism, different unit. A CoW filesystem (Btrfs, ZFS, APFS) applies the meta-flip idea to the whole filesystem's block map: a snapshot is a captured root pointer, a write allocates a new block, a transaction is a tree of new blocks made live by a single superblock flip. fork()'s copy-on-write applies it to memory pages: until a page is written, the parent and child share it; on first write, the kernel allocates a new page and rewires one side's page table. All three exploit the insight that sharing is cheap and divergence only needs to happen on write.
- "Why does LMDB only allow one writer at a time?" Because two concurrent writers would both need to update the free-list B+ tree and the meta-page, and resolving the conflict requires either a lock (which defeats the point of CoW's simplicity) or a complicated merge protocol. Howard Chu made the deliberate choice: single writer, and use that simplicity to make the engine small, correct, and fast on the read path. If you need many writers, you wrap LMDB in a queue or use a different engine — simple, explicit, one-problem-per-tool.
- "Does fsync really flush the meta-page atomically?" fsync waits for the kernel to send the write to the device and (on modern Linux) for the device to acknowledge it into non-volatile storage (see chapter 3). Whether the device itself writes a 4 KiB page atomically depends on the device: SSDs with power-loss protection do; cheap consumer SSDs sometimes don't, which is why every serious CoW engine puts a CRC on the meta-page and checks it on read. If the CRC fails, the meta-page write was torn; the engine falls back to the other meta-slot. Torn meta-page writes are the only failure mode CoW needs a defence for, and a 4-byte checksum is sufficient.
Going deeper
If the meta-flip picture is clear, what follows are the refinements that real CoW engines add on top.
LMDB's single-writer design and how it works in practice
Howard Chu's 2011 paper on LMDB spends most of its space on two ideas: the single-writer choice, and the memory-mapped read path. The combination is what makes LMDB's read throughput scale to hundreds of thousands of operations per second per core.
A read transaction in LMDB acquires a slot in a shared reader table — a small shared-memory array where each concurrent reader records its transaction id. Readers do not take any lock on the tree. A writer cannot reuse a page that is still referenced by any live reader's snapshot (the reader table tells it which transaction ids are still live). When a reader finishes, it clears its slot; the writer can then reclaim pages freed after that transaction.
This works because the writer is the only thread that modifies the free-list, so the reader-visibility check is a single-threaded operation against a shared-memory table — no coordination between readers is needed, no locks are acquired except a short-lived one on the writer. The read path is wait-free and lock-free in the common case.
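The visibility check reduces to one scan of the reader table: a page freed by transaction T can be reused only once every live reader's pinned snapshot is at least T (a reader at snapshot S < T may still walk the old tree that contains the page). A sketch under those assumptions — the table layout and function names are illustrative, not LMDB's actual structures:

```python
def oldest_live_snapshot(reader_table, current_txn):
    """Each slot holds the txn id a reader pinned, or None when free."""
    pinned = [t for t in reader_table if t is not None]
    return min(pinned, default=current_txn)

def reclaimable(free_list, reader_table, current_txn):
    """free_list maps freeing-txn-id -> list of page ids it freed.
    A page is reusable once no live reader's snapshot predates the
    transaction that freed it."""
    horizon = oldest_live_snapshot(reader_table, current_txn)
    return [p for txn, pages in free_list.items()
            if txn <= horizon
            for p in pages]

# Txn 115 freed pages 200-201; txn 117 freed the pages replaced in our
# trace. One reader is still pinned at snapshot 116, so only txn 115's
# pages may be reused — the txn-116 reader still walks the tree that
# contains pages 4827 and 3500.
free_list = {115: [200, 201], 117: [4827, 3500]}
readers = [None, 116, None]
assert sorted(reclaimable(free_list, readers, current_txn=118)) == [200, 201]
```

Note the single-threaded simplicity: only the writer runs this scan, so no lock protects free_list, and a long-lived reader (a forgotten open transaction) pins the horizon and makes the file grow — a well-known LMDB operational footgun.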
The cost of "only one writer" is that write throughput is capped at whatever one thread can achieve against the storage device — typically tens of thousands of transactions per second on fast NVMe. For read-heavy workloads this is more than enough. For write-heavy workloads, LMDB is not the right tool; Postgres or RocksDB are.
Shadow paging — the historical ancestor
CoW is the modern name for what Jim Gray and others called shadow paging in the 1970s. Gray's 1978 paper "Notes on Database Operating Systems" describes a system where every page in the file has two slots — the current slot and the shadow slot — and a page table maps logical page numbers to one or the other. A write copies the page to the shadow slot, updates the page table's shadow copy, and at commit atomically swaps the page table pointer.
The mechanism is identical to LMDB's meta-flip, but the implementation is heavier: an extra indirection table, double the space for every page, and more complex free-space management. Shadow paging fell out of favour in the 1980s as WAL-based in-place engines (System R, then commercial Oracle and DB2) demonstrated higher throughput on the workloads IBM and Oracle cared about.
It came back in the 2000s under the "copy-on-write" name when workloads shifted: embedded databases needed crash safety without DBA-managed logs, filesystems needed snapshots, and virtual machines needed fork-like cloning. What was a losing tradeoff for 1985 mainframes turned out to be a winning tradeoff for 2010 flash-based single-node key-value stores.
Btrfs and ZFS — CoW at the filesystem layer
Both Btrfs (Linux) and ZFS (Solaris, now OpenZFS) implement CoW at the filesystem level. Every block write allocates a new block; the pointer to it propagates up the filesystem's tree of block groups, directories, and the superblock; the superblock flip is the commit.
This gives filesystems three things that traditional in-place filesystems (ext4, XFS, NTFS) need a journal to approximate:
- Instant snapshots. Capture the current superblock pointer; you have a snapshot. New writes go to fresh blocks; the snapshot's blocks are never freed while the snapshot exists. A zfs snapshot is an O(1) operation.
- Integrity. Every block has a checksum stored in its parent; a block read that returns wrong bytes is detected and (on a mirrored or RAID-Z pool) repaired automatically. This is the feature that made ZFS famous in the Sun era.
- Clone-on-write. A file copy is just a pointer copy; the data blocks are shared until one side writes. zfs clone is O(1).
The cost — write amplification and no in-place rewrites — is the same as LMDB's. On heavy random write workloads (databases on ZFS, VM disks on Btrfs), performance suffers relative to ext4+WAL, and operators often either tune the filesystem heavily or choose a different one.
Postgres HOT updates — a middle ground
Postgres uses in-place storage (heap pages) with a WAL, but it has a clever optimisation called heap-only tuple (HOT) updates. When a row is updated in a way that doesn't touch any indexed columns and the new version fits on the same page as the old, Postgres writes the new version in the same page with a pointer from the old tuple slot, without updating any secondary indexes. The old tuple becomes invisible after the transaction commits; VACUUM reclaims it later.
This is CoW-at-tuple-granularity inside an otherwise in-place engine. The new tuple is a copy; the old tuple survives as long as any snapshot might need it (MVCC). The commit mechanism is still the WAL, so HOT doesn't change Postgres's crash story — but it does reduce write amplification and index-update traffic for common update patterns (UPDATE ... SET last_seen_at = NOW()).
HOT is the kind of hybrid you get when a mature engine has room to adopt good ideas from the other camp without rewriting from scratch. The lesson: "CoW vs in-place" is the textbook dichotomy, but real engines mix and match at different granularities.
Where this leads next
A page, in both schemes, is still just 4 KiB of bytes. So far we have treated it as a fixed-layout structure — 16-byte header, then keys, then pointers, fixed widths throughout. But real databases store variable-length keys (strings, blobs) and variable-length values (rows of varying column lengths). The next chapter introduces the slotted page — a layout that packs variable-length records into a page by growing a slot array from one end and data from the other, meeting in the middle. It is the standard layout used by Postgres, SQLite, InnoDB, and every other page-oriented engine.
- Slotted pages and variable-length records — chapter 26: how one 4 KiB page holds a heterogeneous set of records without wasting space on padding, and how a row-id like `(page_id, slot)` stays stable across compactions.
After that, Build 4 picks up the remaining machinery: free-space maps, latching for concurrency, and the bridge to Build 5's write-ahead log. The storage engine is becoming a real storage engine, one page-layout and one lock at a time.
Choose your mutation strategy — in-place with WAL, or CoW with meta-flip — and everything else in the storage stack hangs off that choice.
References
- Howard Chu, LMDB: Lightning Memory-Mapped Database, LinuxCon Europe (2012) — the original design talk. Short, punchy, the best introduction by the author.
- Jim Gray, Notes on Data Base Operating Systems, in "Operating Systems: An Advanced Course" (Springer LNCS, 1978) — the paper that described shadow paging before anyone called it CoW.
- Ohad Rodeh, B-trees, Shadowing, and Clones, ACM TOS (2008) — the design paper for the Btrfs B-tree, a CoW B+ tree on which the filesystem is built. Bridges the database and filesystem takes on the idea.
- Mohan, Haderle, Lindsay, Pirahesh, Schwarz, ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS (1992) — the foundational in-place + WAL recovery paper. Build 5 is this paper.
- Jeff Bonwick, Matt Ahrens, The Zettabyte File System, technical report (2003) — the ZFS design paper; CoW applied to a filesystem, with integrity checksums and snapshots.
- LMDB source, `libraries/liblmdb/mdb.c` in the OpenLDAP repository — the canonical CoW B+ tree, about 10,000 lines of C, readable over a weekend.