In short

When your code says "read one byte at offset 1,000,000," the disk does not read one byte. The operating system, and every layer beneath it, speaks in pages — fixed-size chunks, typically 4,096 bytes on modern Linux; the databases you have heard of use 4 KiB to 16 KiB pages of their own. A one-byte read pulls an entire page into the kernel's page cache, pays the full seek and transfer cost of that page, and leaves the other 4,095 bytes sitting in the cache whether you asked for them or not. The corollary is simple and brutal: every data structure that lives on disk is either page-oriented or slow. The append-only log of Build 1 got away with byte-level thinking because its access pattern — big sequential reads — made the page mismatch invisible. The B+ tree cannot get away with it: every lookup is a small random read, and "small random read" means "one page", period. This chapter walks the page boundary from both sides: what the hardware actually transfers on a one-byte read (os.pread(fd, 1, off) still fetches 4,096 bytes of silicon), what your tools are (pread/pwrite, mmap, O_DIRECT), and what changes the moment you start designing data structures that fit a page rather than spill across it. By the end you will see a B+ tree node as what it is — a 4 KiB block, sized to the hardware, with everything it needs packed inside — and you will be ready to build one.

You finished Build 3 with a working LSM-tree: memtables, SSTables, Bloom filters, compaction, the works. It handles a million writes per second and serves most reads from RAM. It is an honest database. It has one problem you cannot compact your way out of: the read path is not bounded. A key that misses every Bloom filter and lives in the deepest level still costs you one seek per SSTable in the worst case. Most of the time this is fine. For the workloads where it is not, you need a data structure where the worst-case read is written on the box — four hops, no more, no less.

That data structure is the B+ tree. And the B+ tree does not live in the abstract; it lives on the disk, and the disk has opinions about how it is organised. Before you can build a B+ tree you have to learn to think like the disk thinks. The disk thinks in pages.

Why disks and operating systems talk in pages

Start with the hardware. A NAND flash chip — the thing inside your SSD — is organised in rows of cells. The smallest unit you can read is called a page on the device: typically 4 KiB or 16 KiB on a modern consumer SSD, up to 32 KiB on enterprise drives. The smallest unit you can write is also a page. The smallest unit you can erase is a much larger block — typically 256 KiB to 4 MiB. There is no hardware operation that reads one byte, or writes one byte, or erases one byte. The chip's command set does not have such a verb.

A spinning disk works differently but arrives at the same answer. The drive has sectors, historically 512 bytes, increasingly 4,096 bytes ("4Kn" or "Advanced Format"). The head positions over a track, waits for the platter to rotate the right sector under it, and streams 512 or 4,096 bytes in one go. Asking for one byte means seeking, waiting, transferring a whole sector, and the drive hands you 4 KiB minus the one you wanted.

Between your program and the silicon sits the operating system, and the operating system standardises. On Linux (and macOS, and Windows, and every Unix), the kernel's page cache is organised in fixed-size pages matched to the CPU's memory-management unit — 4 KiB on x86_64 and aarch64 by default. When the kernel services a read syscall, it first checks whether the relevant pages are in the cache; if not, it issues a block-device read of a whole page (or several pages — the kernel also prefetches ahead when it detects a sequential pattern). The cache stores pages; the MMU maps pages; the filesystem lays files out in units of filesystem blocks (typically 4 KiB matching the page); every layer agrees.

[Figure: Logical bytes vs physical pages. Top: your view of the file as a long byte array, with an arrow at byte 1,000,000. Bottom: the same file sliced into 4,096-byte pages, with page 244 highlighted — byte 1,000,000 falls inside page 244, which covers bytes 999,424 to 1,003,519. What actually moves from disk to RAM: the whole 4,096-byte page.]
Your program asks for one byte. The kernel asks the disk for the 4 KiB page containing that byte. Everything else in the page comes along whether you want it or not.
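The mapping from a byte offset to its page is one integer division. A minimal sketch, with the 4,096-byte page size assumed throughout this chapter:

```python
PAGE_SIZE = 4096

def page_of(offset: int) -> int:
    """Which page a byte offset falls in."""
    return offset // PAGE_SIZE

def page_span(page_no: int) -> tuple[int, int]:
    """First and last byte offset covered by a page."""
    start = page_no * PAGE_SIZE
    return start, start + PAGE_SIZE - 1

print(page_of(1_000_000))    # 244
print(page_span(244))        # (999424, 1003519)
```

Reading byte 1,000,000 and byte 1,003,519 costs one disk transfer; reading byte 1,003,520 costs a second one.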

Why pages, and why 4 KiB specifically: a small page size means more pages to track, more metadata per gigabyte of RAM, more TLB pressure on the CPU. A large page size means more waste on small accesses. 4 KiB is the compromise that x86 chose in 1985 and has propagated to every general-purpose system since. Databases that know their workload better sometimes pick larger pages — Postgres uses 8 KiB, InnoDB uses 16 KiB, Oracle offers 2 KiB to 32 KiB — because their pages are bigger than the hardware pages, not smaller. No serious storage engine uses pages smaller than 4 KiB.

The takeaway is not "4,096 is a magic number." The takeaway is that there is a unit, and you cannot read less than it. Once you believe that, every other choice in this build follows.

Reading one byte is reading a page

Let us prove the claim with the smallest piece of code that can. Python's os.pread(fd, n, offset) reads n bytes from position offset without changing the file pointer. It is the cleanest way to see what a syscall-level read actually does.

import os

fd = os.open("big.bin", os.O_RDONLY)
byte = os.pread(fd, 1, 1_000_000)   # one byte at offset 1 MB
os.close(fd)
print(byte)                         # b'\x17' (or whatever)

On the surface this reads one byte. Measured at the syscall, the kernel handled the request by checking whether the page containing offset 1,000,000 was in the page cache. If not — a cold cache — it issued a block-layer read for that 4 KiB page (and possibly a prefetch for the next few). The whole page now lives in RAM, indexed by (inode, page_number). Your program sees one byte. The page cache holds all 4,096.

Now read the byte beside it.

fd = os.open("big.bin", os.O_RDONLY)
a = os.pread(fd, 1, 1_000_000)      # cold: disk read, 4,096 bytes in
b = os.pread(fd, 1, 1_000_001)      # warm: served from page cache
c = os.pread(fd, 1, 1_003_000)      # warm: same page (bytes 999_424..1_003_519)
d = os.pread(fd, 1, 1_004_000)      # cold again: next page starts at 1_003_520
os.close(fd)

The first pread pays the disk cost. The second and third are microsecond page-cache hits: the data is already in RAM. The fourth crosses into the next page and pays the disk cost again.

You cannot observe this from inside Python except by timing. But the effect is dramatic and repeatable. On a warm NVMe, a page-cache read takes about 1 microsecond; a cold disk read takes 50 to 200 microseconds. The ratio is roughly one hundred. If you organise your data so that successive reads stay inside the same page — spatial locality — you get on the order of 100× more throughput than if you scatter them across pages.

Measure the page boundary yourself

# bench_pageboundary.py — observe that reads inside the same page are cheap.
import os, time

# First make a 64 MiB file of random-ish bytes.
path = "/tmp/big.bin"
with open(path, "wb") as f:
    f.write(os.urandom(64 * 1024 * 1024))

# To see cold numbers, drop the page cache between runs
# (Linux: echo 3 > /proc/sys/vm/drop_caches, needs root).
fd = os.open(path, os.O_RDONLY)

def time_reads(label, offsets, n=50_000):
    t0 = time.perf_counter()
    for _ in range(n):
        for off in offsets:
            os.pread(fd, 1, off)
    dt = time.perf_counter() - t0
    print(f"{label:<28}  {(len(offsets)*n)/dt:>10,.0f} reads/s")

# 8 reads that all live in one 4 KiB page — should be fast.
base = (1_000_000 // 4096) * 4096   # page-aligned start, so all 8 stay inside
time_reads("same page",  [base + i for i in range(0, 4096, 512)])

# 8 reads spread across 8 different pages — should be slower once cache is warm
# and dramatically slower when cold.
time_reads("different pages", [1_000_000 + i * 4096 for i in range(8)])

os.close(fd)

Typical output, cache fully warm:

same page                     19,400,000 reads/s
different pages                4,100,000 reads/s

Same page: about 50 nanoseconds per read — essentially a memcpy from the page cache. Different pages: about 240 nanoseconds per read — still RAM, but each read touches a different cache line, different TLB entry, different page-cache bucket. And if you drop the OS page cache between runs, the "different pages" number falls by another factor of 100, because now you are paying real disk latency eight times per iteration instead of once.

The lesson you want to internalise. The cost of a read is not proportional to the bytes you asked for. It is proportional to the number of distinct pages your reads touch. Design your data structures to pack hot bytes into the same page, and you win.

The append-only log from Build 1 did not care about any of this. A scan reads the file from start to end in sequential pages; the kernel prefetches aggressively; every page is used fully. The mismatch between byte semantics and page reality was invisible because the access pattern happened to align. B+ trees cannot rely on that alignment. A point lookup of a random key touches log(N) pages, each one separately, with no locality between them — the worst possible pattern for page caching. So the B+ tree has to design around the page: every node is a page, every edge is a page-to-page pointer, every key-value lives inside the page that owns it.

pread/pwrite, mmap, and the choices between them

You have two serious ways to move page-sized chunks of a file between disk and your program: explicit syscalls (pread, pwrite) and memory mapping (mmap). They give you the same data in the end. They have wildly different ergonomics, performance profiles, and failure modes.

pread(fd, n, offset) reads n bytes into a buffer you own, starting at offset in the file. No internal file pointer is updated. No seek/read race — the offset is part of the syscall. pwrite(fd, buf, n, offset) is the writing counterpart. These are the stock tools of database code. Python exposes them as os.pread and os.pwrite.

import os

fd = os.open("store.db", os.O_RDWR | os.O_CREAT, 0o644)

# Write a 4 KiB page at page index 12.
page = b"X" * 4096
os.pwrite(fd, page, 12 * 4096)

# Read it back.
data = os.pread(fd, 4096, 12 * 4096)
assert data == page

os.close(fd)

Two things to notice. First, the offset is page_index * PAGE_SIZE. A database built on pread/pwrite almost always treats the file as an array of pages, identifies each page by its index (the "page number" or "page id"), and computes byte offsets only at the syscall boundary. Second, pread and pwrite are safe to call from multiple threads on the same fd, because they do not touch the shared file-position pointer. The older read/lseek/write trio is not, and race bugs on a shared fd are one of the classic mistakes of storage-engine code.
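That convention — the file as an array of pages, byte offsets computed only at the syscall boundary — is worth capturing in code. A minimal sketch of what Build 4 will grow into (the `Pager` name and methods are illustrative, not the build's final API):

```python
import os

PAGE_SIZE = 4096

class Pager:
    """Treat a file as an array of fixed-size pages, addressed by page id."""

    def __init__(self, path: str):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)

    def read_page(self, page_id: int) -> bytes:
        data = os.pread(self.fd, PAGE_SIZE, page_id * PAGE_SIZE)
        # A read past EOF comes back short; zero-pad so callers always
        # see a full page.
        return data.ljust(PAGE_SIZE, b"\x00")

    def write_page(self, page_id: int, data: bytes) -> None:
        assert len(data) == PAGE_SIZE, "pages are written whole, never partially"
        os.pwrite(self.fd, data, page_id * PAGE_SIZE)

    def close(self) -> None:
        os.close(self.fd)

p = Pager("/tmp/pager_demo.db")
p.write_page(12, b"A" * PAGE_SIZE)
assert p.read_page(12) == b"A" * PAGE_SIZE
p.close()
```

Note that byte offsets appear exactly once, inside the two methods; every caller speaks in page ids.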

mmap(fd, length) maps the file into your process's address space. The kernel sets up a range of virtual memory; when you touch a byte at address base + offset, the MMU checks whether the page is resident, and if not, triggers a page fault that the kernel services by reading from disk. You get the file as if it were a bytearray.

import os, mmap

fd = os.open("store.db", os.O_RDWR)
size = os.fstat(fd).st_size
m = mmap.mmap(fd, size, prot=mmap.PROT_READ | mmap.PROT_WRITE)

# Read a 4 KiB page — no syscall, just a memory access (maybe a page fault).
page = bytes(m[12 * 4096 : 13 * 4096])

# Write the page back.
m[12 * 4096 : 13 * 4096] = b"Y" * 4096
m.flush()                           # msync — push dirty pages to disk
m.close()
os.close(fd)

mmap is seductive. Reads become slice operations; writes become assignments; no syscalls per access once the pages are resident. The kernel's page cache is literally visible as memory. A whole school of databases (LMDB is the most famous) is built around mmap for exactly this reason.

It also has sharp edges. You cannot map a file larger than your address space (no problem on 64-bit; a hard limit on 32-bit). A read that touches an unresident page is a blocking syscall disguised as a memory access — your thread stalls on disk I/O with no warning in the code. A write to a read-only page is a segfault, not an exception. And handling I/O errors on mmap is a nightmare: a failed disk read on a page fault delivers SIGBUS, which your Python code cannot reasonably catch and recover from. The mmap-based code assumes the disk never errors, and when it does, the process dies.

The pread/pwrite-based code assumes nothing and deals with every error explicitly.

Here is the honest comparison.

| property | pread / pwrite | mmap |
|---|---|---|
| API shape | explicit syscalls with buffers | memory access through a pointer |
| cost per access (warm) | ~500 ns syscall overhead + copy | ~5 ns — just a memory access |
| cost per access (cold) | real disk I/O, visible in code | real disk I/O, hidden as a page fault |
| data flows through the page cache | yes (unless `O_DIRECT`) | yes — the page cache is the map |
| extra copy user ↔ kernel | yes (one memcpy per call) | no — the map is the kernel's page cache |
| thread-safe on shared fd | yes (offset is an argument) | yes (address is independent) |
| I/O error handling | syscall returns -1 / `OSError` | `SIGBUS` — process dies |
| max file size | unbounded | process virtual address space (~128 TiB on Linux x86_64) |
| alignment required | no (unless `O_DIRECT`) | no |
| control over when dirty pages flush | full — you call `fsync` when ready | partial — `msync` plus the kernel's own schedule |
| durability primitive | `fsync(fd)` | `msync(addr, len, MS_SYNC)` |
| used by | Postgres, MySQL InnoDB, SQLite, RocksDB | LMDB, BoltDB, MongoDB's old MMAPv1 engine |

Most modern production databases use pread/pwrite. They pay the syscall overhead for the error-handling clarity, the explicit control over caching, and the ability to use O_DIRECT to run their own buffer pool. The mmap camp exists — LMDB is an exceptionally good database — but it runs in a different design space where crash-on-I/O-error is considered acceptable because the deployment is simple and the hardware is trusted.

For Build 4, you are going to use pread and pwrite. They match how real B+ trees are built and they leave you in control of the buffer pool you will design two chapters from now.

Alignment and O_DIRECT — what happens when you bypass the kernel

There is a third option, and it is the one professional databases reach for when they want every last microsecond of control: O_DIRECT. You met it briefly in chapter 3 on fsync. It is the open-time flag that says skip the kernel page cache entirely — read directly from the disk controller into my buffer, and write directly from my buffer to the disk controller.

import os, mmap

# O_DIRECT requires an aligned buffer and aligned offsets and aligned lengths.
PAGE = 4096
buf = mmap.mmap(-1, PAGE)            # anonymous page-aligned buffer
buf.write(b"hello" + b"\x00" * (PAGE - 5))

fd = os.open("direct.db", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
os.pwrite(fd, buf, 0)                # offset 0, length 4096, buffer 4 KiB-aligned — ok
os.fsync(fd)                         # still needed — O_DIRECT is not durable
os.close(fd)

The alignment rules on Linux are strict:

  1. The file offset must be a multiple of the logical block size (512 or 4,096 bytes, depending on the drive).
  2. The transfer length must be a multiple of the same block size.
  3. The user-space buffer's memory address must be aligned to it as well.

Violate any of these and the kernel returns EINVAL — in Python, OSError: [Errno 22] Invalid argument. There is no "it works but is slow" mode; you are either aligned or you are rejected.

Why alignment is mandatory: with O_DIRECT, the DMA controller on the disk talks directly to the physical RAM pages that back your buffer. DMA engines cannot start mid-sector on a physical read, cannot handle buffers that straddle unaligned boundaries, and cannot transfer partial sectors. Without O_DIRECT, the kernel absorbs these restrictions by copying through an aligned page in the page cache on your behalf. O_DIRECT is the kernel saying you wanted to skip the copy; now you carry the restriction.
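You can check all three conditions from Python before issuing a direct I/O. A sketch — the helper name is illustrative, and the 4,096-byte block size is assumed rather than queried from the device:

```python
import ctypes
import mmap

BLOCK = 4096  # assumed logical block size; real code queries the device

def direct_io_ok(buf, offset: int, length: int) -> bool:
    """True if buffer address, file offset, and length are all block-aligned."""
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    return addr % BLOCK == 0 and offset % BLOCK == 0 and length % BLOCK == 0

aligned = mmap.mmap(-1, BLOCK)            # anonymous mmap is page-aligned
print(direct_io_ok(aligned, 0, BLOCK))    # True
print(direct_io_ok(aligned, 100, BLOCK))  # False — offset not block-aligned
print(direct_io_ok(aligned, 0, 1))        # False — a 1-byte O_DIRECT read is rejected
```

The last line is why anonymous `mmap` buffers appear in the O_DIRECT example above: a plain `bytearray` gives no alignment guarantee, while an anonymous mapping always starts on a page boundary.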

O_DIRECT is not a durability primitive. Data written with O_DIRECT still sits in the disk controller's DRAM cache until you fsync; the only thing you skipped was the kernel's copy. What O_DIRECT buys you is cache control: the kernel page cache no longer holds a duplicate of your hot pages. For a database that runs its own buffer pool (chapter 24 — buffer pool design), duplicating a 16 GB buffer pool in the kernel page cache as well would waste 16 GB of RAM. O_DIRECT reclaims that RAM for the application.

Trade-offs:

  1. You lose the kernel's readahead and its cache — every page you want cached, you must cache yourself.
  2. You take on the alignment bookkeeping forever: aligned buffers, aligned offsets, aligned lengths.
  3. You gain the RAM the kernel would have spent on duplicate pages, and full control over what stays resident.

Build 4 will start with the page cache (no O_DIRECT) because it is simpler and the performance is already good. Build 5 will revisit O_DIRECT when we introduce an async buffer pool with its own prefetcher.

Designing around pages — the B+ tree node preview

Now the payoff. Every data structure in Build 4 is going to treat the page as its fundamental unit, not the byte. Let us sketch what that means for the central data structure, so the rest of the build has a concrete image to aim at.

A B+ tree node sized to fit exactly one page

A B+ tree internal node is a list of (key, child_pointer) pairs plus a rightmost child pointer. If you fit it into a 4,096-byte page, the shape is:

+------------------------------------------------------------+
| header (16 bytes)                                          |
|   - magic (2 B)          node type + version               |
|   - n_keys (2 B)         number of keys in this node       |
|   - is_leaf (1 B)        0 for internal, 1 for leaf        |
|   - padding (3 B)                                          |
|   - right_sibling (8 B)  page-id of next leaf (leaf only)  |
+------------------------------------------------------------+
| keys[0..n-1]        fixed 8 B each — up to 254 keys        |
+------------------------------------------------------------+
| child_ptrs[0..n]    fixed 8 B each — page-ids              |
+------------------------------------------------------------+
| free space (shrinks as keys are inserted)                  |
+------------------------------------------------------------+

With 8-byte keys and 8-byte child pointers, the arithmetic is:

\text{max keys per page} = \left\lfloor \frac{4096 - 16 - 8}{8 + 8} \right\rfloor = \left\lfloor \frac{4072}{16} \right\rfloor = 254

One page holds up to 254 keys and 255 child pointers. That number — the fanout — is the single most important parameter of a B+ tree. A fanout of 255 means a tree with 4 levels can index

255^4 \approx 4.2 \times 10^9

records. Four billion. That is why B+ trees have the reputation they do: four disk reads, worst case, for any key in a four-billion-record table, with each read being exactly one page.
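The same arithmetic, parameterised over page size, shows what Postgres and InnoDB buy with their bigger pages. A quick sketch using the header, key, and pointer sizes from the diagram above:

```python
HEADER, KEY, PTR = 16, 8, 8

def max_keys(page_size: int) -> int:
    # header + n*KEY + (n+1)*PTR <= page_size, solved for n
    return (page_size - HEADER - PTR) // (KEY + PTR)

for ps in (4096, 8192, 16384):
    fanout = max_keys(ps) + 1         # children = keys + 1
    print(f"{ps:>6} B page: {max_keys(ps):>4} keys, "
          f"4 levels reach {fanout ** 4:,} records")
```

A 16 KiB page roughly quadruples the fanout, which is why InnoDB's four-level trees reach into the trillions of rows.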

The moment you internalise that the node is a page, a lot of design decisions fall out:

  1. Keys are fixed-size when possible, because variable-size keys complicate the packing arithmetic and the binary search. Many engines compress variable keys (prefix compression) but keep the total page size fixed.
  2. Pointers are page-ids, not byte offsets, because a page-id is denser — the low 12 bits of a 4 KiB-aligned byte offset are always zero — and the byte offset is recovered with one multiply at the syscall boundary.
  3. Lookups inside a node are a binary search over the keys array — pure in-RAM work once the page is loaded. The disk cost of a lookup is only the page fetches along the root-to-leaf path.
  4. Modifications write a whole page at a time. You do not rewrite one byte of a node; you rebuild the 4,096 bytes, pwrite them in one syscall, and fsync when durability demands it. This is exactly aligned to how the hardware wants writes.
  5. Splits happen when a page is full, not when a byte count is exceeded. Full is a property of the page, not of the data.
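The header diagram above translates directly into a serialisation routine. A sketch, assuming fixed 8-byte keys and pointers — the `pack_node`/`unpack_node` names are illustrative, not the build's final API:

```python
import struct

PAGE_SIZE = 4096
HEADER = struct.Struct("<HHB3xQ")   # magic, n_keys, is_leaf, 3 B pad, right_sibling

def pack_node(magic, is_leaf, keys, children, right_sibling=0):
    """Serialise one node into exactly one fixed-size page."""
    buf = bytearray(PAGE_SIZE)      # unused tail stays zeroed — the free space
    HEADER.pack_into(buf, 0, magic, len(keys), is_leaf, right_sibling)
    off = HEADER.size
    for k in keys:                  # keys[0..n-1], 8 B each, right after the header
        struct.pack_into("<Q", buf, off, k); off += 8
    for c in children:              # child_ptrs[0..n], 8 B page-ids
        struct.pack_into("<Q", buf, off, c); off += 8
    return bytes(buf)

def unpack_node(page):
    magic, n, is_leaf, sib = HEADER.unpack_from(page, 0)
    off = HEADER.size
    keys = list(struct.unpack_from(f"<{n}Q", page, off))
    children = list(struct.unpack_from(f"<{n + 1}Q", page, off + 8 * n))
    return keys, children

page = pack_node(0xB7EE, 0, [10, 20, 30], [2, 5, 7, 9])
assert len(page) == PAGE_SIZE
assert unpack_node(page) == ([10, 20, 30], [2, 5, 7, 9])
```

Note the invariant: `pack_node` always produces 4,096 bytes, no matter how few keys the node holds — the page, not the data, fixes the size.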

Every subsequent chapter in Build 4 — insertion and split, buffer pool, write-ahead log, concurrency — is working inside this constraint. The page is the atom of the design. Everything above the page is data structure; everything below it is the operating system's job.

You have just glimpsed the architecture of every relational database in serious production use. Postgres, MySQL, SQL Server, Oracle, SQLite — they all have this shape. They differ in their page size (Postgres 8 KiB, InnoDB 16 KiB, SQLite 4 KiB, Oracle configurable), their key encoding, their concurrency strategy, their locking — but the page-as-node-as-atom is common to all of them.

Common confusions

Going deeper

If the picture is clear — pages are the atom of disk I/O, pread/pwrite/mmap are the three API styles, O_DIRECT bypasses the page cache at the cost of alignment — the rest of this section is about the hardware realities hiding beneath that picture.

Sectors vs pages vs blocks — three words for adjacent things

The vocabulary is confusing because three communities coined terms at different times. Here is the disambiguation.

  1. Sector — the drive's unit: historically 512 bytes on spinning disks, 4,096 bytes on Advanced Format and most NVMe drives. The smallest thing the hardware will transfer.
  2. Page — the memory-management unit: the kernel's page cache, the MMU, and (confusingly) the NAND flash read/write unit all use the word; 4 KiB on x86_64 and aarch64. Databases reuse it again for their own fixed-size I/O unit.
  3. Block — the filesystem's allocation unit, typically 4 KiB to match the page; NAND vendors use the same word for the much larger erase unit (256 KiB to 4 MiB).

A database page is typically one or more filesystem blocks, allocated contiguously so the kernel can issue one I/O for the whole page. All modern filesystems try hard to keep a database's pages contiguous; this is why databases that pre-allocate their files (fallocate) perform better than ones that grow incrementally.
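Pre-allocation is a one-liner from Python on Linux, via `os.posix_fallocate`. A sketch — the path and size are illustrative:

```python
import os

PAGE_SIZE = 4096
N_PAGES = 1024   # reserve room for 1,024 pages = 4 MiB up front

fd = os.open("/tmp/prealloc.db", os.O_RDWR | os.O_CREAT, 0o644)
# Allocate real, contiguous-as-possible blocks for the whole range now,
# instead of letting the file grow page by page as writes land.
os.posix_fallocate(fd, 0, N_PAGES * PAGE_SIZE)
print(os.fstat(fd).st_size)   # 4194304
os.close(fd)
```

Unlike seeking past EOF and writing one byte (which creates a sparse hole), `posix_fallocate` makes the filesystem commit blocks immediately, which is what keeps later page writes from fragmenting.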

Atomicity at the page level — how much can the disk actually write atomically?

When the power dies mid-write, what survives? The answer depends on the level of the stack.

  1. The drive guarantees sector atomicity: a 512-byte (or, on 4Kn drives, 4,096-byte) sector is written entirely or not at all.
  2. Anything larger than a sector can tear: a multi-sector write interrupted by power loss may leave old sectors sitting next to new ones inside one logical page.
  3. Neither the kernel nor the filesystem adds atomicity on top — a pwrite of 8 KiB is two or more device writes with no guarantee they land together.

This has real consequences for database design. Postgres's 8 KiB page is larger than the 4 KiB atomicity guarantee of the hardware, so Postgres cannot assume a page write is atomic. Instead, Postgres's write-ahead log contains a full-page image the first time each page is modified within a checkpoint, so recovery can patch up any torn page. The cost is massive extra WAL volume — sometimes the majority of WAL writes. InnoDB's double-write buffer solves the same problem by writing every page twice, first to a staging area, then to its real location. Either way, you are paying for atomicity the hardware does not give you.
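Detecting a torn page (as opposed to repairing it) is cheap: reserve a few bytes of each page for a checksum of the rest, and verify on every read. A sketch — CRC32 and the 4-byte prefix layout are illustrative choices, not what any particular engine uses:

```python
import struct
import zlib

PAGE_SIZE = 4096

def seal_page(payload: bytes) -> bytes:
    """Prefix a page's payload with a CRC32 of its contents."""
    assert len(payload) == PAGE_SIZE - 4
    return struct.pack("<I", zlib.crc32(payload)) + payload

def check_page(page: bytes) -> bool:
    """True if the stored checksum still matches the payload."""
    (crc,) = struct.unpack_from("<I", page, 0)
    return crc == zlib.crc32(page[4:])

page = seal_page(b"k" * (PAGE_SIZE - 4))
assert check_page(page)

# Simulate a torn write: the first half landed, the second half did not.
torn = page[:2048] + b"\x00" * 2048
assert not check_page(torn)
```

A checksum tells recovery that the page is torn; it does not restore the old contents — that is what the full-page image or the double-write buffer is for.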

Build 4 will use 4 KiB pages matching the hardware's guaranteed atomicity, which lets us skip full-page images — one fewer thing to engineer.

Write amplification at the page level

You change one byte in a database row. How many bytes are actually written to flash? This chain is brutal.

  1. Database level. You modify one byte in a row, so the B+ tree rewrites the 4 KiB page that holds the row. That is 4,096 bytes at the database layer. Amplification so far: 4,096×.
  2. WAL level. The WAL logs the change. If this is the first modification of the page within the checkpoint, it logs a full-page image (another 4 KiB) plus the delta (~100 bytes). Amplification: ~8,300×.
  3. Filesystem level. Modern filesystems with checksums (ZFS, btrfs, XFS with metadata CRC) update their own metadata on every write. Small additional amplification.
  4. SSD level. The SSD's garbage collector, when it consolidates the partially-valid block that holds your page, rewrites every other valid page in that block. If the block is 256 KiB and 90% valid, that is an additional 230 KiB written for nothing. This is the flash-translation-layer (FTL) write amplification and is the dominant source of wear on a consumer SSD.

End to end, a one-byte database change can easily produce 100 KiB of flash writes. Hot-spot rows on a badly-designed schema can wear out a consumer SSD in months. Industrial databases obsess over write amplification for this reason: every buffer-pool optimisation, every choice of flush strategy, is in part a choice about how much flash to burn per logical write.
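The back-of-envelope arithmetic, using the illustrative numbers from the chain above (not measurements):

```python
PAGE = 4096

logical = 1                 # one byte changed in a row
db = PAGE                   # step 1: whole B+ tree page rewritten
wal = PAGE + 100            # step 2: full-page image + ~100 B delta, first touch
ssd_gc = 230 * 1024         # step 4: FTL rewrites a 90%-valid 256 KiB block's neighbours

total = db + wal + ssd_gc
print(f"{total:,} bytes of flash for {logical} logical byte "
      f"(~{total // 1024} KiB, {total // logical:,}x amplification)")
```

Roughly 240 KiB of flash for a single logical byte — the "100 KiB, easily" claim above is, if anything, conservative when garbage collection is in the picture.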

The B+ tree you build in Build 4 will have a write amplification of roughly the page size divided by the average update size — worse than an LSM-tree for small updates, much better for reads. LSM-trees accept the write amplification of compaction to get sequential writes and cheap inserts; B+ trees accept per-update page rewrites to get bounded reads. The decision of which to build starts from which way your workload skews.

Huge pages and the TLB — when the MMU itself becomes a bottleneck

The CPU's Translation Lookaside Buffer (TLB) caches page-table entries. On x86_64, a typical TLB has 1,024 to 3,072 entries covering 4 KiB pages — so at most 12 MiB of memory has its translation cached. A database with a 100 GiB buffer pool is guaranteed to thrash the TLB: most page accesses pay a full page-table walk (an extra 4 memory accesses) on top of the data access.

Huge pages — 2 MiB and 1 GiB on x86_64 — are the fix. One TLB entry covers 2 MiB of memory, so the same 1,024-entry TLB now covers 2 GiB. Databases that allocate their buffer pool on huge pages can see 10–30% throughput improvements on large working sets, purely from fewer TLB misses.
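The coverage arithmetic for the 1,024-entry case:

```python
ENTRIES = 1024
KIB, MIB, GIB = 2**10, 2**20, 2**30

small = ENTRIES * 4 * KIB    # TLB reach with 4 KiB pages
huge = ENTRIES * 2 * MIB     # TLB reach with 2 MiB huge pages

print(small // MIB, "MiB")   # 4 MiB
print(huge // GIB, "GiB")    # 2 GiB
```

Same silicon, 512× the reach — that is the entire case for huge pages.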

Postgres has huge_pages; Oracle has USE_LARGE_PAGES. Setting these up is a systems-administration job, and Build 4 will not touch them — but if you ever profile a database and see page-table walks dominating the profile, this is the knob.

Where this leads next

You now have the vocabulary of disk I/O: pages, sectors, alignment, page cache, pread versus mmap, O_DIRECT. More importantly, you have the habit of mind: every disk access is a page access, and every data structure worth putting on disk is designed around that fact.

The next chapter takes the abstract "a node is a page" sketch and makes it real.

The page is the atom. Build 4 is the first time you are working inside an atom the hardware cares about. Everything from here on fits inside, or splits across, pages.

References

  1. Gray, Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann (1992), chapter 13 — the canonical treatment of page-oriented storage and its relationship to recovery.
  2. Love, Linux Kernel Development, 3rd ed., chapter on the Page Cache and Page Writeback — how the kernel's page cache actually works, in the kernel's own terms.
  3. Graefe, Modern B-Tree Techniques, Foundations and Trends in Databases (2011) — an exhaustive survey of how real databases lay out B+ tree pages.
  4. PostgreSQL documentation, Database Physical Storage — how one production database divides files into 8 KiB pages and arranges nodes within them.
  5. Jonas Bonér, Latency Numbers Every Programmer Should Know — the back-of-envelope table that makes page-cache vs disk latency visceral.
  6. Linux open(2) manual page, O_DIRECT section — the canonical description of alignment requirements, with the warnings that follow.