In short
When your code says "read one byte at offset 1,000,000," the disk does not read one byte. The operating system, and every layer beneath it, speaks in pages — fixed-size chunks, typically 4,096 bytes on modern Linux, 4 KiB to 16 KiB on the databases you have heard of. A one-byte read pulls an entire page into the kernel's page cache, pays the full seek and transfer cost of that page, and throws the other 4,095 bytes away unless you ask for them. The corollary is simple and brutal: every data structure that lives on disk is either page-oriented or slow. The append-only log of Build 1 got away with byte-level thinking because its access pattern — big sequential reads — made the page mismatch invisible. The B+ tree cannot get away with it: every lookup is a small random read, and "small random read" means "one page", period. This chapter walks the page boundary from both sides: what the hardware actually transfers on a one-byte read (os.pread(fd, 1, off) still fetches 4,096 bytes of silicon), what your tools are (pread/pwrite, mmap, O_DIRECT), and what changes the moment you start designing data structures that fit a page rather than spill across it. By the end you will see a B+ tree node as what it is — a 4 KiB block, sized to the hardware, with everything it needs packed inside — and you will be ready to build one.
You finished Build 3 with a working LSM-tree: memtables, SSTables, Bloom filters, compaction, the works. It handles a million writes per second and serves most reads from RAM. It is an honest database. It has one problem you cannot compact your way out of: the read path is not bounded. A key that misses every Bloom filter and lives in the deepest level still costs you one seek per SSTable in the worst case. Most of the time this is fine. For the workloads where it is not, you need a data structure where the worst-case read is written on the box — four hops, no more, no less.
That data structure is the B+ tree. And the B+ tree does not live in the abstract; it lives on the disk, and the disk has opinions about how it is organised. Before you can build a B+ tree you have to learn to think like the disk thinks. The disk thinks in pages.
Why disks and operating systems talk in pages
Start with the hardware. A NAND flash chip — the thing inside your SSD — is organised in rows of cells. The smallest unit you can read is called a page on the device: typically 4 KiB or 16 KiB on a modern consumer SSD, up to 32 KiB on enterprise drives. The smallest unit you can write is also a page. The smallest unit you can erase is a much larger block — typically 256 KiB to 4 MiB. There is no hardware operation that reads one byte, or writes one byte, or erases one byte. The chip's command set does not have such a verb.
A spinning disk works differently but arrives at the same answer. The drive has sectors, historically 512 bytes, increasingly 4,096 bytes ("4Kn" or "Advanced Format"). The head positions over a track, waits for the platter to rotate the right sector under it, and streams 512 or 4,096 bytes in one go. Asking for one byte means seeking, waiting, and transferring a whole sector; the drive hands you all 512 or 4,096 bytes, of which you keep one.
Between your program and the silicon sits the operating system, and the operating system standardises. On Linux (and macOS, and Windows, and every Unix), the kernel's page cache is organised in fixed-size pages matched to the CPU's memory-management unit — 4 KiB on x86_64 and aarch64 by default. When the kernel services a read syscall, it first checks whether the relevant pages are in the cache; if not, it issues a block-device read of a whole page (or several pages — the kernel also prefetches ahead when it detects a sequential pattern). The cache stores pages; the MMU maps pages; the filesystem lays files out in units of filesystem blocks (typically 4 KiB matching the page); every layer agrees.
Why pages, and why 4 KiB specifically: a small page size means more pages to track, more metadata per gigabyte of RAM, more TLB pressure on the CPU. A large page size means more waste on small accesses. 4 KiB is the compromise that x86 chose in 1985 and has propagated to every general-purpose system since. Databases that know their workload better sometimes pick larger pages — Postgres uses 8 KiB, InnoDB uses 16 KiB, Oracle offers 2 KiB to 32 KiB — because their pages are bigger than the hardware pages, not smaller. No serious storage engine uses pages smaller than 4 KiB.
The takeaway is not "4,096 is a magic number." The takeaway is that there is a unit, and you cannot read less than it. Once you believe that, every other choice in this build follows.
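The unit is something you can ask the operating system for directly. A quick check, standard library only: on a typical x86_64 Linux box all three calls report 4,096, while Apple Silicon machines report 16,384.

```python
import mmap, os, resource

# Three ways to ask for the same number: the kernel's page size.
print(mmap.PAGESIZE)                # granularity of mmap mappings
print(resource.getpagesize())       # the MMU page size
print(os.sysconf("SC_PAGESIZE"))    # the POSIX sysconf spelling
```

(The `resource` module is Unix-only; on Windows you would go through `ctypes` and `GetSystemInfo` instead.)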
Reading one byte is reading a page
Let us prove the claim with the smallest piece of code that can. Python's os.pread(fd, n, offset) reads n bytes from position offset without changing the file pointer. It is the cleanest way to see what a syscall-level read actually does.
```python
import os

fd = os.open("big.bin", os.O_RDONLY)
byte = os.pread(fd, 1, 1_000_000)   # one byte at offset 1 MB
os.close(fd)
print(byte)                          # b'\x17' (or whatever)
```
On the surface this reads one byte. Measured at the syscall, the kernel handled the request by checking whether the page containing offset 1,000,000 was in the page cache. If not — a cold cache — it issued a block-layer read for that 4 KiB page (and possibly a prefetch for the next few). The whole page now lives in RAM, indexed by (inode, page_number). Your program sees one byte. The page cache holds all 4,096.
Now read the byte beside it.
```python
fd = os.open("big.bin", os.O_RDONLY)
a = os.pread(fd, 1, 1_000_000)   # cold: disk read, 4,096 bytes in
b = os.pread(fd, 1, 1_000_001)   # warm: served from page cache
c = os.pread(fd, 1, 1_003_000)   # warm: same page (bytes 999_424..1_003_519)
d = os.pread(fd, 1, 1_005_000)   # cold again: next page
os.close(fd)
```
The first pread pays the disk cost. The second and third are microsecond page-cache hits: the data is already in RAM. The fourth crosses into the next page and pays the disk cost again.
You cannot observe this from inside Python except by timing. But the effect is dramatic and repeatable. On a warm NVMe, a page-cache read takes about 1 microsecond; a cold disk read takes 50 to 200 microseconds. The ratio is roughly one hundred. If you organise your data so that successive reads stay inside the same page — spatial locality — you get 100× more throughput than if you scatter them across pages.
Measure the page boundary yourself
```python
# bench_pageboundary.py — observe that reads inside the same page are cheap.
import os, time

# First make a 64 MiB file of random-ish bytes.
path = "/tmp/big.bin"
with open(path, "wb") as f:
    f.write(os.urandom(64 * 1024 * 1024))

# Drop the page cache between runs (Linux: echo 3 > /proc/sys/vm/drop_caches,
# needs root). In an unprivileged environment, open with O_DIRECT for a fresh
# read each time.
fd = os.open(path, os.O_RDONLY)

def time_reads(label, offsets, n=50_000):
    t0 = time.perf_counter()
    for _ in range(n):
        for off in offsets:
            os.pread(fd, 1, off)
    dt = time.perf_counter() - t0
    print(f"{label:<28} {(len(offsets) * n) / dt:>10,.0f} reads/s")

# 8 reads that all live in one 4 KiB page — should be fast. Start at the page
# boundary so none of the eight offsets spills into the next page.
base = (1_000_000 // 4096) * 4096
time_reads("same page", [base + i for i in range(0, 4096, 512)])

# 8 reads spread across 8 different pages — should be slower once cache is warm
# and dramatically slower when cold.
time_reads("different pages", [base + i * 4096 for i in range(8)])

os.close(fd)
```
Typical output, cache fully warm:
```
same page                    19,400,000 reads/s
different pages               4,100,000 reads/s
```
Same page: about 50 nanoseconds per read — essentially a memcpy from the page cache. Different pages: about 240 nanoseconds per read — still RAM, but each read touches a different cache line, different TLB entry, different page-cache bucket. And if you drop the OS page cache between runs, the "different pages" number falls by another factor of 100, because now you are paying real disk latency eight times per iteration instead of once.
The lesson you want to internalise. The cost of a read is not proportional to the bytes you asked for. It is proportional to the number of distinct pages your reads touch. Design your data structures to pack hot bytes into the same page, and you win.
The append-only log from Build 1 did not care about any of this. A scan reads the file from start to end in sequential pages; the kernel prefetches aggressively; every page is used fully. The mismatch between byte semantics and page reality was invisible because the access pattern happened to align. B+ trees cannot rely on that alignment. A point lookup of a random key touches log(N) pages, each one separately, with no locality between them — the worst possible pattern for page caching. So the B+ tree has to design around the page: every node is a page, every edge is a page-to-page pointer, every key-value lives inside the page that owns it.
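To put numbers on "log(N) pages," here is a back-of-envelope sketch of my own, using the fanout of 255 that this chapter arrives at below, of how many page reads a point lookup costs at various tree sizes:

```python
import math

def lookup_pages(n_records: int, fanout: int = 255) -> int:
    """Pages a point lookup touches: the length of the root-to-leaf path."""
    if n_records <= fanout:
        return 1
    return math.ceil(math.log(n_records, fanout))

for n in (10_000, 10_000_000, 4_000_000_000):
    print(f"{n:>13,} records -> {lookup_pages(n)} page reads")
```

Ten thousand records cost two page reads; four billion cost four. The logarithm grows so slowly that the tree's height is effectively a small constant.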
pread/pwrite, mmap, and the choices between them
You have two serious ways to move page-sized chunks of a file between disk and your program: explicit syscalls (pread, pwrite) and memory mapping (mmap). They give you the same data in the end. They have wildly different ergonomics, performance profiles, and failure modes.
pread(fd, n, offset) reads n bytes into a buffer you own, starting at offset in the file. No internal file pointer is updated. No seek/read race — the offset is part of the syscall. pwrite(fd, buf, n, offset) is the writing counterpart. These are the stock tools of database code. Python exposes them as os.pread and os.pwrite.
```python
import os

fd = os.open("store.db", os.O_RDWR | os.O_CREAT, 0o644)

# Write a 4 KiB page at page index 12.
page = b"X" * 4096
os.pwrite(fd, page, 12 * 4096)

# Read it back.
data = os.pread(fd, 4096, 12 * 4096)
assert data == page

os.close(fd)
```
Two things to notice. First, the offset is page_index * PAGE_SIZE. A database built on pread/pwrite almost always treats the file as an array of pages, identifies each page by its index (the "page number" or "page id"), and computes byte offsets only at the syscall boundary. Second, pread and pwrite are safe to call from multiple threads on the same fd, because they do not touch the shared file-position pointer. The older read/lseek/write trio is not, and race bugs on a shared fd are one of the classic mistakes of storage-engine code.
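That file-as-array-of-pages discipline fits in a dozen lines. A hedged sketch — the class name and API are illustrative, not the interface Build 4 will settle on:

```python
import os

PAGE_SIZE = 4096

class Pager:
    """Treat a file as an array of fixed-size pages, addressed by page id."""

    def __init__(self, path: str):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)

    def read_page(self, page_id: int) -> bytes:
        data = os.pread(self.fd, PAGE_SIZE, page_id * PAGE_SIZE)
        # Reading past EOF returns short data; hand back a zero-filled page.
        return data.ljust(PAGE_SIZE, b"\x00")

    def write_page(self, page_id: int, page: bytes) -> None:
        assert len(page) == PAGE_SIZE, "whole pages only"
        os.pwrite(self.fd, page, page_id * PAGE_SIZE)

    def sync(self) -> None:
        os.fsync(self.fd)

    def close(self) -> None:
        os.close(self.fd)
```

Byte offsets appear in exactly two places, both at the syscall boundary; everything above this class speaks in page ids.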
mmap(fd, length) maps the file into your process's address space. The kernel sets up a range of virtual memory; when you touch a byte at address base + offset, the MMU checks whether the page is resident, and if not, triggers a page fault that the kernel services by reading from disk. You get the file as if it were a bytearray.
```python
import os, mmap

fd = os.open("store.db", os.O_RDWR)
size = os.fstat(fd).st_size
m = mmap.mmap(fd, size, prot=mmap.PROT_READ | mmap.PROT_WRITE)

# Read a 4 KiB page — no syscall, just a memory access (maybe a page fault).
page = bytes(m[12 * 4096 : 13 * 4096])

# Write the page back.
m[12 * 4096 : 13 * 4096] = b"Y" * 4096

m.flush()   # msync — push dirty pages to disk
m.close()
os.close(fd)
```
mmap is seductive. Reads become slice operations; writes become assignments; no syscalls per access once the pages are resident. The kernel's page cache is literally visible as memory. A whole school of databases (LMDB is the most famous) is built around mmap for exactly this reason.
It also has sharp edges. You cannot map a file larger than your address space (no problem on 64-bit; a hard limit on 32-bit). A read that touches an unresident page is a blocking syscall disguised as a memory access — your thread stalls on disk I/O with no warning in the code. A write to a read-only page is a segfault, not an exception. And handling I/O errors on mmap is a nightmare: a failed disk read on a page fault delivers SIGBUS, which your Python code cannot reasonably catch and recover from. The mmap-based code assumes the disk never errors, and when it does, the process dies.
The pread/pwrite-based code assumes nothing and deals with every error explicitly.
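Concretely, "explicit" means the failure arrives as an exception you can catch at the call site. A sketch of a defensive page read, with a hypothetical `read_page_checked` helper of my own:

```python
import errno, os

def read_page_checked(fd: int, page_id: int, page_size: int = 4096) -> bytes:
    """pread with explicit error handling: a disk failure is an exception
    you can catch and act on, not a signal that kills the process."""
    try:
        data = os.pread(fd, page_size, page_id * page_size)
    except OSError as e:
        if e.errno == errno.EIO:
            raise RuntimeError(f"disk error reading page {page_id}") from e
        raise
    if len(data) < page_size:
        raise RuntimeError(f"short read: page {page_id} is past EOF")
    return data
```

The equivalent failure under `mmap` is a `SIGBUS` delivered mid-memory-access, with no frame in your code to put the `try` around.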
Here is the honest comparison.
| property | pread / pwrite | mmap |
|---|---|---|
| API shape | explicit syscalls with buffers | memory access through a pointer |
| cost per access (warm) | ~500 ns syscall overhead + copy | ~5 ns — just a memory access |
| cost per access (cold) | real disk I/O, visible in code | real disk I/O, hidden as a page fault |
| data flows through the page cache | yes (unless `O_DIRECT`) | yes — page cache is the map |
| extra copy user ↔ kernel | yes (one memcpy per call) | no — the map is the kernel's page cache |
| thread-safe on shared fd | yes (offset is an argument) | yes (address is independent) |
| I/O error handling | syscall returns -1 / `OSError` | `SIGBUS` — process dies |
| max file size | unbounded | process virtual address space (~128 TiB on Linux x86_64) |
| alignment required | no (unless `O_DIRECT`) | no |
| control over when dirty pages flush | full — you call `fsync` when ready | partial — `msync` plus kernel's own schedule |
| durability primitive | `fsync(fd)` | `msync(addr, len, MS_SYNC)` |
| used by | Postgres, MySQL InnoDB, SQLite, RocksDB | LMDB, BoltDB, MongoDB's old MMAPv1 engine |
Most modern production databases use pread/pwrite. They pay the syscall overhead for the error-handling clarity, the explicit control over caching, and the ability to use O_DIRECT to run their own buffer pool. The mmap camp exists — LMDB is an exceptionally good database — but it runs in a different design space where crash-on-I/O-error is considered acceptable because the deployment is simple and the hardware is trusted.
For Build 4, you are going to use pread and pwrite. They match how real B+ trees are built and they leave you in control of the buffer pool you will design two chapters from now.
Alignment and O_DIRECT — what happens when you bypass the kernel
There is a third option, and it is the one professional databases reach for when they want every last microsecond of control: O_DIRECT. You met it briefly in chapter 3 on fsync. It is the open-time flag that says skip the kernel page cache entirely — read directly from the disk controller into my buffer, and write directly from my buffer to the disk controller.
```python
import os, mmap

# O_DIRECT requires an aligned buffer, an aligned offset, and an aligned length.
# (os.O_DIRECT is a Linux flag; macOS spells the same idea F_NOCACHE via fcntl.)
PAGE = 4096
buf = mmap.mmap(-1, PAGE)   # anonymous buffer, page-aligned by construction
buf.write(b"hello" + b"\x00" * (PAGE - 5))

fd = os.open("direct.db", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
os.pwrite(fd, buf, 0)       # offset 0, length 4096, buffer 4 KiB-aligned — ok
os.fsync(fd)                # still needed — O_DIRECT is not durable
os.close(fd)
```
The alignment rules on Linux are strict:
- The buffer address in memory must be aligned to the filesystem's logical block size (almost always 512 bytes, often 4,096).
- The offset in the file must be aligned to the same boundary.
- The length must be a multiple of the same boundary.
Violate any of these and the kernel returns EINVAL — OSError: [Errno 22] Invalid argument. There is no "it works but is slow" mode; you are either aligned or you are rejected.
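The three rules are cheap to verify before issuing the syscall. A small helper of my own (illustrative, not from the build), plus a `ctypes` trick for recovering the address of an mmap-backed buffer:

```python
import ctypes, mmap

def check_direct_io(buf_addr: int, file_off: int, length: int,
                    align: int = 4096) -> None:
    """Raise before the kernel would return EINVAL."""
    for name, value in (("buffer address", buf_addr),
                        ("file offset", file_off),
                        ("length", length)):
        if value % align != 0:
            raise ValueError(f"{name} {value} is not {align}-byte aligned")

buf = mmap.mmap(-1, 4096)   # anonymous mmap: page-aligned by construction
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))

check_direct_io(addr, 0, 4096)    # passes
check_direct_io(addr, 8192, 4096) # passes
```

A call like `check_direct_io(addr, 100, 4096)` raises immediately, which is a friendlier failure than an `EINVAL` from deep inside your write path.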
Why alignment is mandatory: with O_DIRECT, the DMA controller on the disk talks directly to the physical RAM pages that back your buffer. DMA engines cannot start mid-sector on a physical read, cannot handle buffers that straddle unaligned boundaries, and cannot transfer partial sectors. Without O_DIRECT, the kernel absorbs these restrictions by copying through an aligned page in the page cache on your behalf. O_DIRECT is the kernel saying you wanted to skip the copy; now you carry the restriction.
O_DIRECT is not a durability primitive. Data written with O_DIRECT still sits in the disk controller's DRAM cache until you fsync; the only thing you skipped was the kernel's copy. What O_DIRECT buys you is cache control: the kernel page cache no longer holds a duplicate of your hot pages. For a database that runs its own buffer pool (chapter 24 — buffer pool design), duplicating a 16 GB buffer pool in the kernel page cache as well would waste 16 GB of RAM. O_DIRECT reclaims that RAM for the application.
Trade-offs:
- Pro: no page-cache duplication; deterministic I/O latencies; your buffer pool is the only cache.
- Pro: works well with `fadvise` and `io_uring` for asynchronous I/O patterns.
- Con: you lose the kernel's read-ahead prefetch and write-coalescing.
- Con: every read is at least one round-trip to the SSD controller — no hot-path memory access.
- Con: every buffer you use has to be page-aligned, which is a constraint on your memory allocator.
- Con: if you get it wrong, you get `EINVAL`, no warning.
Build 4 will start with the page cache (no O_DIRECT) because it is simpler and the performance is already good. Build 5 will revisit O_DIRECT when we introduce an async buffer pool with its own prefetcher.
Designing around pages — the B+ tree node preview
Now the payoff. Every data structure in Build 4 is going to treat the page as its fundamental unit, not the byte. Let us sketch what that means for the central data structure, so the rest of the build has a concrete image to aim at.
A B+ tree node sized to fit exactly one page
A B+ tree internal node is a list of (key, child_pointer) pairs plus a rightmost child pointer. If you fit it into a 4,096-byte page, the shape is:
```
+------------------------------------------------------------+
| header (16 bytes)                                          |
|   - magic          (2 B)  node type + version              |
|   - n_keys         (2 B)  number of keys in this node      |
|   - is_leaf        (1 B)  0 for internal, 1 for leaf       |
|   - padding        (3 B)                                   |
|   - right_sibling  (8 B)  page-id of next leaf (leaf only) |
+------------------------------------------------------------+
| keys[0..n-1]       fixed 8 B each — up to 254 keys         |
+------------------------------------------------------------+
| child_ptrs[0..n]   fixed 8 B each — page-ids               |
+------------------------------------------------------------+
| free space (shrinks as keys are inserted)                  |
+------------------------------------------------------------+
```
With 8-byte keys and 8-byte child pointers, the arithmetic is: 4,096 bytes per page minus the 16-byte header leaves 4,080 bytes; n keys plus n + 1 child pointers cost 8n + 8(n + 1) = 16n + 8 bytes, and 16n + 8 ≤ 4,080 gives n ≤ 254.

One page holds up to 254 keys and 255 child pointers. That number — the fanout — is the single most important parameter of a B+ tree. A fanout of 255 means a tree with 4 levels can index 255⁴ ≈ 4.2 billion records. Four billion. That is why B+ trees have the reputation they do: four disk reads, worst case, for any key in a four-billion-record table, with each read being exactly one page.
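The same arithmetic as executable code, assuming the 16-byte header and 8-byte keys and pointers from the diagram:

```python
PAGE_SIZE, HEADER, KEY, PTR = 4096, 16, 8, 8

# n keys and n + 1 child pointers must fit in what the header leaves free:
#   KEY * n + PTR * (n + 1) <= PAGE_SIZE - HEADER
max_keys = (PAGE_SIZE - HEADER - PTR) // (KEY + PTR)
fanout = max_keys + 1

print(max_keys, fanout, fanout ** 4)   # 254, 255, 4_228_250_625
```

Change `PAGE_SIZE` to 16,384 and the fanout quadruples; that is the whole argument for InnoDB's larger pages in one line.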
The moment you internalise that the node is a page, a lot of design decisions fall out:
- Keys are fixed-size when possible, because variable-size keys complicate the packing arithmetic and the binary search. Many engines compress variable keys (prefix compression) but keep the total page size fixed.
- Pointers are page-ids, not byte offsets, because pointers are smaller and faster to compare than offsets.
- Lookups inside a node are a binary search over the keys array — pure in-RAM work once the page is loaded. The disk cost of a lookup is only the page fetches along the root-to-leaf path.
- Modifications write a whole page at a time. You do not rewrite one byte of a node; you rebuild the 4,096 bytes, `pwrite` them in one syscall, and `fsync` when durability demands it. This is exactly aligned to how the hardware wants writes.
- Splits happen when a page is full, not when a byte count is exceeded. Full is a property of the page, not of the data.
Every subsequent chapter in Build 4 — insertion and split, buffer pool, write-ahead log, concurrency — is working inside this constraint. The page is the atom of the design. Everything above the page is data structure; everything below it is the operating system's job.
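As a bridge to chapter 23, here is a hedged sketch of serialising such a node with the standard `struct` module. The field layout follows the header diagram earlier in this section; the build's real encoding may differ:

```python
import struct

PAGE_SIZE = 4096
HEADER = struct.Struct("<HHB3xQ")   # magic, n_keys, is_leaf, pad, right_sibling

def pack_node(magic, is_leaf, keys, children, right_sibling=0):
    """Serialise one node into exactly one zero-padded page."""
    assert len(children) == len(keys) + 1 or (is_leaf and not children)
    body = struct.pack(f"<{len(keys)}Q", *keys)
    body += struct.pack(f"<{len(children)}Q", *children)
    page = HEADER.pack(magic, len(keys), is_leaf, right_sibling) + body
    assert len(page) <= PAGE_SIZE, "node overflow — time to split"
    return page.ljust(PAGE_SIZE, b"\x00")

def unpack_node(page):
    magic, n, is_leaf, sib = HEADER.unpack_from(page)
    off = HEADER.size
    keys = list(struct.unpack_from(f"<{n}Q", page, off))
    n_children = 0 if is_leaf else n + 1
    children = list(struct.unpack_from(f"<{n_children}Q", page, off + 8 * n))
    return magic, is_leaf, keys, children, sib
```

Note the shape of the API: whole pages in, whole pages out, always exactly 4,096 bytes, ready to hand to `pwrite`.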
You have just glimpsed the architecture of every relational database in serious production use. Postgres, MySQL, SQL Server, Oracle, SQLite — they all have this shape. They differ in their page size (Postgres 8 KiB, InnoDB 16 KiB, SQLite 4 KiB, Oracle configurable), their key encoding, their concurrency strategy, their locking — but the page-as-node-as-atom is common to all of them.
Common confusions
- "Do my Python `pread` calls always touch the disk?" No — almost never, in fact. The kernel's page cache sits between you and the disk. If the page you are reading was read recently (by you or by another process, or prefetched by the kernel), it lives in RAM and your "disk read" is a memcpy. On a warm system doing random reads over a working set smaller than physical RAM, most `pread` calls hit zero disk I/O. The disk only enters the picture when the page is not already cached. This is why benchmarks that do not drop the cache between runs show absurdly high throughput and do not predict production behaviour.
- "Isn't `mmap` always faster than `pread`?" It can be, once the cache is warm — no syscall per access. But the first access to a cold page is a page fault, which is roughly as expensive as a `pread` call (the kernel does the same work either way), and you have no way to batch those faults. And `mmap` has no way to tell you "this read failed" — it raises `SIGBUS` and kills your process. For correctness-critical systems, the extra syscall of `pread` is usually worth it.
- "Why 4 KiB and not 512 bytes like old disks used?" Because the kernel's page cache is organised in 4 KiB pages to match the MMU, and the kernel would never issue a sub-page read even if the disk could do it. The hardware caught up: modern SSDs and modern "4Kn" HDDs report their native block size as 4 KiB. Drives that still report 512 B logical blocks (many SATA drives) actually have 4 KiB physical blocks and silently do the translation, which is why writes that straddle 512 B but not 4 KiB boundaries can be slower than aligned 4 KiB writes.
- "`O_DIRECT` makes my writes durable." No. `O_DIRECT` bypasses the kernel page cache; it does not bypass the disk controller cache. Bytes written with `O_DIRECT` still live in the SSD's DRAM until the next `fsync` or FUA write. Covered in detail in chapter 3.
- "Larger pages are always better for databases." No — there is a sweet spot. Larger pages mean higher fanout and fewer levels (good), but also more wasted space on half-full pages, larger write-amplification on small updates (you rewrite the whole page for a one-row change), and more CPU work per page (binary searches over 16 KiB of keys cost more than over 4 KiB). 4–16 KiB is the sweet spot, and no serious engine goes above 64 KiB.
- "The filesystem block size and the database page size must match." They should be compatible — the database page size should be a multiple of the filesystem block size — but they do not have to be equal. Postgres uses 8 KiB pages on a filesystem with 4 KiB blocks: each Postgres page is two filesystem blocks. What you do not want is a database page smaller than a filesystem block, because then a database write forces the filesystem to read-modify-write (read the full block, patch the sub-block, write the full block back).
Going deeper
If the picture is clear — pages are the atom of disk I/O, pread/pwrite/mmap are the three API styles, O_DIRECT bypasses the page cache at the cost of alignment — the rest of this section is about the hardware realities hiding beneath that picture.
Sectors vs pages vs blocks — three words for adjacent things
The vocabulary is confusing because three communities coined terms at different times. Here is the disambiguation.
- Sector. The hardware unit of a spinning disk, historically 512 bytes, now often 4,096 on "4Kn" drives. The disk exposes this as the logical block size in `blockdev --getss`. Some drives with 4 KiB physical sectors still report 512 B logical sectors for compatibility ("512e" drives); they handle the translation in firmware.
- Block. The filesystem's allocation unit. Typically 4,096 bytes on ext4, XFS, ZFS. Chosen to match the kernel page size. The number you get from `stat -f` or `blockdev --getbsz`.
- Page. Two meanings, context-dependent. (1) The kernel's memory-management unit, typically 4 KiB on x86_64. (2) The database's logical I/O unit — Postgres page, InnoDB page, SQLite page. The two meanings usually match by design but are not required to.
- Extent. A run of contiguous blocks, variable-sized, allocated as a unit. ext4 and XFS allocate in extents internally; SQL Server allocates tables in 64 KiB extents of eight 8 KiB pages.
A database page is typically one or more filesystem blocks, allocated contiguously so the kernel can issue one I/O for the whole page. All modern filesystems try hard to keep a database's pages contiguous; this is why databases that pre-allocate their files (fallocate) perform better than ones that grow incrementally.
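From Python, pre-allocation is one call (`os.posix_fallocate` is POSIX and available on Linux; it does not exist on macOS):

```python
import os

fd = os.open("store.db", os.O_RDWR | os.O_CREAT, 0o644)

# Reserve 64 MiB of real blocks up front, giving the filesystem one chance
# to pick a contiguous extent, instead of growing the file a page at a time.
os.posix_fallocate(fd, 0, 64 * 1024 * 1024)
os.close(fd)
```

Unlike `truncate`, which creates a sparse hole, `posix_fallocate` actually assigns blocks, so later writes cannot fail with `ENOSPC` and the layout is decided while the allocator still has room to be smart.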
Atomicity at the page level — how much can the disk actually write atomically?
When the power dies mid-write, what survives? The answer depends on the level of the stack.
- NVMe SSDs typically guarantee atomicity of a single 4 KiB write: either all 4,096 bytes are the old content or all 4,096 are the new content, never a mix. Some enterprise drives guarantee atomicity of larger units (16 KiB or 32 KiB) via vendor-specific extensions, but the portable guarantee is 4 KiB.
- SATA SSDs usually guarantee 4 KiB atomicity in practice, but the SATA specification does not require it. Some cheap drives can tear a write at the 512 B boundary within a 4 KiB write.
- HDDs only guarantee 512 B (or 4 KiB on 4Kn) atomicity. A 4 KiB logical write to a 512 B-sector drive is eight separate sector writes internally, and a power cut can tear it.
This has real consequences for database design. Postgres's 8 KiB page is larger than the 4 KiB atomicity guarantee of the hardware, so Postgres cannot assume a page write is atomic. Instead, Postgres's write-ahead log contains a full-page image the first time each page is modified within a checkpoint, so recovery can patch up any torn page. The cost is massive extra WAL volume — sometimes the majority of WAL writes. InnoDB's double-write buffer solves the same problem by writing every page twice, first to a staging area, then to its real location. Either way, you are paying for atomicity the hardware does not give you.
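The cheap half of the problem is detection. A per-page checksum — a sketch of my own, not something Build 4 needs at 4 KiB pages — turns a torn page from silent corruption into an error you can repair from the WAL:

```python
import struct, zlib

PAGE_SIZE = 4096

def seal_page(payload: bytes) -> bytes:
    """Prefix the page payload with a CRC32 of its contents."""
    assert len(payload) == PAGE_SIZE - 4
    return struct.pack("<I", zlib.crc32(payload)) + payload

def verify_page(page: bytes) -> bytes:
    (crc,) = struct.unpack_from("<I", page)
    payload = page[4:]
    if zlib.crc32(payload) != crc:
        # A torn write leaves a mix of old and new bytes; the CRC cannot
        # match both halves. Recovery would re-fetch the page from the WAL.
        raise ValueError("torn or corrupted page")
    return payload
```

Postgres's `data_checksums` option and InnoDB's page checksums are production versions of exactly this idea.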
Build 4 will use 4 KiB pages matching the hardware's guaranteed atomicity, which lets us skip full-page images — one fewer thing to engineer.
Write amplification at the page level
You change one byte in a database row. How many bytes are actually written to flash? This chain is brutal.
- Database level. You modify one byte in a row, so the B+ tree rewrites the 4 KiB page that holds the row. That is 4,096 bytes at the database layer. Amplification so far: 4,096×.
- WAL level. The WAL logs the change. If this is the first modification of the page within the checkpoint, it logs a full-page image (another 4 KiB) plus the delta (~100 bytes). Amplification: ~8,200×.
- Filesystem level. Modern filesystems with checksums (ZFS, btrfs, XFS with metadata CRC) update their own metadata on every write. Small additional amplification.
- SSD level. The SSD's garbage collector, when it consolidates the partially-valid block that holds your page, rewrites every other valid page in that block. If the block is 256 KiB and 90% valid, that is an additional 230 KiB written for nothing. This is the flash-translation-layer (FTL) write amplification and is the dominant source of wear on a consumer SSD.
End to end, a one-byte database change can easily produce 100 KiB of flash writes. Hot-spot rows on a badly-designed schema can wear out a consumer SSD in months. Industrial databases obsess over write amplification for this reason: every buffer-pool optimisation, every choice of flush strategy, is in part a choice about how much flash to burn per logical write.
The B+ tree you build in Build 4 will have a write amplification of roughly the page size divided by the average update size — worse than an LSM-tree for small updates, much better for reads. LSM-trees trade the write amplification of compaction for the read amplification of level merging; B+ trees trade the read amplification for higher write amplification. The decision which to build starts from which direction your workload skews.
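That rough ratio is worth keeping as a one-line model (the numbers below are illustrative, not measured):

```python
def btree_write_amp(page_size: int, avg_update: int) -> float:
    """B+ tree write amplification: a whole page rewritten per logical update."""
    return page_size / avg_update

print(btree_write_amp(4096, 100))   # ~41x for 100-byte row updates
print(btree_write_amp(4096, 16))    # 256x for tiny counter bumps
```

The model also shows why larger pages hurt write-heavy workloads: doubling the page size doubles this number before the WAL and the FTL take their cut.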
Huge pages and the TLB — when the MMU itself becomes a bottleneck
The CPU's Translation Lookaside Buffer (TLB) caches page-table entries. On x86_64, a typical TLB has 1,024 to 3,072 entries covering 4 KiB pages — so at most 12 MiB of memory has its translation cached. A database with a 100 GiB buffer pool is guaranteed to thrash the TLB: most page accesses pay a full page-table walk (an extra 4 memory accesses) on top of the data access.
Huge pages — 2 MiB and 1 GiB on x86_64 — are the fix. One TLB entry covers 2 MiB of memory, so the same 1,024-entry TLB now covers 2 GiB. Databases that allocate their buffer pool on huge pages can see 10–30% throughput improvements on large working sets, purely from fewer TLB misses.
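The coverage arithmetic in the two paragraphs above, as a quick check:

```python
TLB_ENTRIES = 1024   # a typical, conservative data-TLB size

for page_bytes in (4 * 1024, 2 * 1024 * 1024, 1024 * 1024 * 1024):
    covered = TLB_ENTRIES * page_bytes
    print(f"{page_bytes // 1024:>10,} KiB pages -> "
          f"{covered / 2**30:>9.3f} GiB of TLB coverage")
```

With 4 KiB pages the TLB covers 4 MiB; with 2 MiB huge pages, 2 GiB; with 1 GiB pages, 1 TiB — each step traded against coarser-grained memory management.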
Postgres has huge_pages; Oracle has USE_LARGE_PAGES. Setting these up is a systems-administration job, and Build 4 will not touch them — but if you ever profile a database and see page-table walks dominating the profile, this is the knob.
Where this leads next
You now have the vocabulary of disk I/O: pages, sectors, alignment, page cache, pread versus mmap, O_DIRECT. More importantly, you have the habit of mind: every disk access is a page access, and every data structure worth putting on disk is designed around that fact.
The next chapter takes the abstract "a node is a page" sketch and makes it real.
- B+ Tree Insertion, Split, and Merge — chapter 23: lay out a 4 KiB page as a B+ tree node; implement binary search within the page, insertion, the split that happens when the page is full, and the merge that happens when it falls below half-full. By the end of chapter 23 you will have a working B+ tree on disk; everything after that is about making it fast and safe.
The page is the atom. Build 4 is the first time you are working inside an atom the hardware cares about. Everything from here on fits inside, or splits across, pages.
References
- Gray, Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann (1992), chapter 13 — the canonical treatment of page-oriented storage and its relationship to recovery.
- Love, Linux Kernel Development, 3rd ed., chapter on the Page Cache and Page Writeback — how the kernel's page cache actually works, in the kernel's own terms.
- Graefe, Modern B-Tree Techniques, Foundations and Trends in Databases (2011) — an exhaustive survey of how real databases lay out B+ tree pages.
- PostgreSQL documentation, Database Physical Storage — how one production database divides files into 8 KiB pages and arranges nodes within them.
- Jonas Bonér, Latency Numbers Every Programmer Should Know — the back-of-envelope table that makes page-cache vs disk latency visceral.
- Linux `open(2)` manual page, `O_DIRECT` section — the canonical description of alignment requirements, with the warnings that follow.