Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
In short
A B+ tree is a balanced search tree designed to sit on disk with every node being exactly one page. The shape has three invariants that matter: every node except the root holds between \lceil b/2 \rceil and b keys (so it is always at least half full), all data lives in the leaves (internal nodes hold only routing keys and child pointers), and the leaves are stitched together as a linked list so a range scan is one descent plus a sequential walk along the leaf level. Insertion is a straight descent to the correct leaf, a local insert, and — if the leaf now overflows — a split that promotes the median up to the parent; if the parent overflows, it splits too, and the cascade stops at worst at the root, which grows the tree by one level. Deletion is the mirror image: descend, remove, and if the leaf falls below half full, either borrow a key from a neighbour or merge with it, demoting a key down from the parent; cascades stop when a node is still at least half full or the root itself collapses. Range scans ignore the internal structure entirely once you have landed on the first leaf — you walk right along the leaf linked list until the upper bound is crossed. In this chapter you write all of it in about sixty lines of Python (an in-memory model, so you can watch the shape directly) and trace one insert that splits and one delete that merges, end to end. The next chapter makes the nodes actually live on disk.
You left the last chapter with a picture of a 4 KiB page laid out as a B+ tree node — 16 bytes of header, a run of fixed-width keys, a run of page-id pointers, and a few hundred bytes of free space. You computed the fanout: 254 keys per page, which means a four-level tree can index 4.2 billion records, any one of which is reachable in four disk reads. That is the advertising copy. This chapter is the engineering.
The engineering has two halves. One half is the shape: what a B+ tree actually looks like, what invariants the shape has to maintain, and why those invariants are the ones they are. The other half is the operations: insert, search, range scan, delete — the four things a storage engine asks of its tree, each of which has to preserve the shape while running in O(\log N) time against a disk that charges a round-trip for every page touched.
We will build the whole thing in Python, in RAM, without touching the file system yet. The point of this chapter is the algorithm; Build 4 chapter 24 puts it on disk. You will find, as always, that once you can see the shape move, the code is short.
The shape — four rules and a linked list
A B+ tree of order b (also called the branch factor or fanout) has four structural rules and one that is specific to B+ as opposed to a plain B-tree.
- Every node is a page. No node straddles pages; no page holds a partial node. Whatever splitting or merging the algorithm does, the unit of allocation stays the page.
- Every non-root node holds between \lceil b/2 \rceil and b keys. A leaf with fewer than \lceil b/2 \rceil keys is underflowing; a leaf with b+1 keys is overflowing. The algorithm's job is to handle overflow (by splitting) and underflow (by merging or borrowing) so that this invariant is restored before any operation returns.
- The tree is always balanced. Every path from the root to a leaf is the same length. You do not get to have one leaf two levels deep and another five levels deep — insertion and deletion both preserve this.
- Internal nodes hold only routing information. An internal node with k keys has k+1 child pointers. The keys partition the key space: the i-th child contains every key \ge the (i-1)-th key and < the i-th key. No values, no data — just signposts.
- All records live in the leaves, and the leaves form a linked list. This is what makes it a B+ tree and not a plain B-tree. Every
(key, value)pair is stored in a leaf node; internal nodes do not store values. And every leaf has aright_siblingpointer to the next leaf in sorted order, so a range scan, once it has found the first leaf, walks the leaf level horizontally without ever going back up.
Why "all data in the leaves"? In a plain B-tree, some keys live in internal nodes and some in leaves, so a lookup can stop early if the key is found on the way down. It saves hops on average. The B+ tree gives that up on purpose, for three reasons. First, internal nodes become smaller — no values to store — so fanout goes up and the tree becomes shallower; with 8-byte keys and 8-byte pointers, a 4 KiB internal node fits 255 child pointers, whereas a B-tree node carrying values fits maybe 30. Second, range scans are much cheaper: once you find the left edge of a range in a B+ tree, you just walk the leaf linked list. In a plain B-tree a range scan has to zig-zag up and down the internal nodes. Third, every lookup costs exactly the tree's height, never less — which gives you a predictable worst case, and predictability is what database operators pay for. Every serious relational database uses B+, not B.
[11, 29] is found by descending once to the leaf containing 11 and then walking right through leaf boundaries until the first key above 29 appears.The picture shows the invariants in action. Every leaf has between 2 and 3 records (order 4 allows up to 3 keys per node and requires at least 2). Every internal node has between 2 and 4 child pointers. The leaves chain from left to right, which is how every range query in your career will be served.
Now watch the same tree after one insert.
insert(20). The leaf that would have held 20 was already full, so it split into two half-full leaves and pushed the median key 20 up to its parent. The parent absorbed the new key without overflowing itself, so the split stopped there. If the parent had also been full, the split would have cascaded one level further up, and so on — at most as far as the root, which is the only way a B+ tree's height ever increases.Two pictures are enough to tell you what the algorithm has to do: descend to the right leaf, do the insert locally, and if the leaf overflowed, split it and hand the median up to the parent. The parent does the same dance. In the worst case the cascade reaches the root; when the root splits, the tree gains a new root above it, one level taller. That is the only way a B+ tree's height grows.
Insertion, search, range — the core Python
Time to write it. Sixty lines of Python, in memory, branch factor configurable — the real implementation will replace lists with 4 KiB page buffers and refs with page ids, but the algorithm is exactly this.
# bplustree.py — a minimal B+ tree. Nodes are Python objects; leaves form a linked list.
class _Node:
__slots__ = ("keys", "children", "values", "is_leaf", "next")
def __init__(self, is_leaf):
self.keys: list = [] # sorted list of routing/data keys
self.children: list = [] # child nodes (internal) — len = len(keys) + 1
self.values: list = [] # values (leaf) — len = len(keys)
self.is_leaf = is_leaf
self.next: "_Node|None" = None # leaf linked-list pointer
class BPlusTree:
def __init__(self, branch=4):
if branch < 3:
raise ValueError("branch factor must be >= 3")
self.b = branch
self.root = _Node(is_leaf=True)
# ---- search: descend to the leaf, linear (or binary) search inside ----
def search(self, key):
node = self.root
while not node.is_leaf:
# first child whose separator key > target; else last child
i = 0
while i < len(node.keys) and key >= node.keys[i]:
i += 1
node = node.children[i]
for k, v in zip(node.keys, node.values):
if k == key:
return v
return None
# ---- insert: descend, insert at leaf, split and bubble up as needed ----
def insert(self, key, value):
split = self._insert(self.root, key, value)
if split is not None:
promoted_key, right = split
new_root = _Node(is_leaf=False)
new_root.keys = [promoted_key]
new_root.children = [self.root, right]
self.root = new_root # tree grows one level
def _insert(self, node, key, value):
if node.is_leaf:
i = 0
while i < len(node.keys) and node.keys[i] < key:
i += 1
if i < len(node.keys) and node.keys[i] == key:
node.values[i] = value # update in place, no split
return None
node.keys.insert(i, key)
node.values.insert(i, value)
if len(node.keys) <= self.b - 1: # leaf allows b-1 keys; split if >
return None
return self._split_leaf(node)
# internal node
i = 0
while i < len(node.keys) and key >= node.keys[i]:
i += 1
split = self._insert(node.children[i], key, value)
if split is None:
return None
promoted_key, right = split
node.keys.insert(i, promoted_key)
node.children.insert(i + 1, right)
if len(node.keys) <= self.b - 1:
return None
return self._split_internal(node)
def _split_leaf(self, leaf):
mid = len(leaf.keys) // 2
right = _Node(is_leaf=True)
right.keys = leaf.keys[mid:]
right.values = leaf.values[mid:]
leaf.keys = leaf.keys[:mid]
leaf.values = leaf.values[:mid]
right.next = leaf.next # re-stitch leaf linked list
leaf.next = right
return (right.keys[0], right) # promote the right's first key
def _split_internal(self, node):
mid = len(node.keys) // 2
promoted = node.keys[mid] # the median goes up
right = _Node(is_leaf=False)
right.keys = node.keys[mid + 1:]
right.children = node.children[mid + 1:]
node.keys = node.keys[:mid]
node.children = node.children[:mid + 1]
return (promoted, right)
# ---- range: descend to the lower bound's leaf, walk right ----
def range(self, lo, hi):
node = self.root
while not node.is_leaf:
i = 0
while i < len(node.keys) and lo >= node.keys[i]:
i += 1
node = node.children[i]
while node is not None:
for k, v in zip(node.keys, node.values):
if k < lo:
continue
if k > hi:
return
yield (k, v)
node = node.next
About sixty lines of real code (headers, whitespace, and docstrings aside). Read it once end to end and notice the shape.
Why promote the first key of the right half from a leaf split, but the median itself from an internal split? In a leaf split we need to keep every key accessible from the tree — if we took the median out, we would lose it. So we keep all keys in the leaves and use a copy of the right half's first key as the separator in the parent. In an internal split the promoted key is just routing information; it does not "belong" to any child, so lifting it out entirely is correct. This small asymmetry is easy to get wrong; it is one of the two or three bugs every first-time implementer hits.
Why split at the median and not at the end? Because splitting in the middle guarantees both halves are at least half full — which is exactly the invariant you need the new nodes to satisfy. Splitting at the end (take the last key out) would leave one half with b-1 keys and the other with 1 key, immediately violating the \lceil b/2 \rceil lower bound on the right half. Median splits are the only choice that preserves the invariant with one stroke.
Four things deserve a closer look.
Search. Descent is just a linear scan of the routing keys at each node (binary search in the real implementation — we use linear here for readability; with 254 keys per node a bisect is a 30% speedup, but not essential to the algorithm). At the leaf, the same linear scan finds the exact key or returns None. The depth of the descent is the tree's height, \log_b N, and that is the only disk-cost term: every operation is bounded by the height.
Insert. The recursion descends, inserts at the leaf, and returns a "split result" back up the call stack. Each frame on the way up checks whether the child split and, if so, inserts the promoted key into itself and possibly splits in turn. The split result threads upward; the recursion stops as soon as some frame does not overflow. If the cascade reaches the root and the root splits, insert() builds a new root above it — the one and only mechanism by which the tree grows in height.
Range. Look at how little work this does. Descend to the leaf that contains lo (one root-to-leaf path). Then walk the leaf linked list, yielding every key in [lo, hi]. When a key exceeds hi, return. No further descent, no back-tracking, no upward navigation. On a billion-record table, a range query that returns 10,000 rows costs four disk reads for the descent plus roughly 10{,}000 / b leaf reads to walk — about 40 leaves on a tree with b = 254. Forty page reads for ten thousand rows. Compare that to the LSM-tree from Build 3, which has to merge-scan across several sorted runs. The range query is the B+ tree's strongest card.
Split leaf vs split internal. The two split functions look almost identical, but note the one-character difference in how they slice: leaf.keys[:mid] vs node.keys[:mid + 1]. The leaf keeps the median on its left half (because every key needs a leaf); the internal pushes the median up and keeps only the keys strictly less than it. Get that wrong and the tree is broken in a way that might only surface on the 100,000th insert.
Trace insert and split on a tiny tree
Run the code, step by step, on the tree from the first diagram.
>>> t = BPlusTree(branch=4)
>>> for k in [17, 3, 5, 7, 9, 11, 13, 15, 19, 21, 23, 25, 29, 31, 37, 41]:
... t.insert(k, f"v{k}")
After that sequence the tree looks like the "before" diagram above. Now insert 20.
>>> t.insert(20, "v20")
What happens, step by step:
searchdescends from root to the leaf that should contain 20. The root has key[17]; 20 ≥ 17, so go tochildren[1]. That internal node has keys[23, 31]; 20 < 23, so go tochildren[0]— the leaf[17, 19, 21].- Insert 20 into that leaf:
[17, 19, 20, 21]. The branch factor is 4, so a leaf holds up to 3 keys.len(keys) == 4 > b-1 == 3— overflow. _split_leafcuts atmid = 4 // 2 = 2. Left half:[17, 19]. Right half:[20, 21]. Re-stitch the linked list: the new right leaf'snextbecomes what the old leaf'snextwas, and the old leaf'snextnow points at the new right leaf.- Return
(20, right_leaf)to the parent. The parent is the internal node[23, 31]with three children. It inserts the promoted key at position 0 (because 20 < 23) and the new right child at position 1. The parent becomes[20, 23, 31]with four children — three keys, four children, still withinb-1 = 3keys. No further split. insertreturnsNoneto_inserton the root, which also returnsNone. Done.
The final tree matches the "after" diagram: root unchanged, right internal expanded from [23, 31] to [20, 23, 31], leaf [17, 19, 21] replaced by [17, 19] → [20, 21] → ... Tree height unchanged.
>>> t.search(20)
'v20'
>>> list(t.range(13, 25))
[(13, 'v13'), (15, 'v15'), (17, 'v17'), (19, 'v19'), (20, 'v20'), (21, 'v21'), (23, 'v23'), (25, 'v25')]
The range query descended once (to the leaf [13, 15]), then walked the linked list through [17, 19], [20, 21], [23, 25, 29], yielding every key in [13, 25] and stopping at the first key > 25. No internal node was touched after the descent.
Deletion — the mirror of insertion
Deletion is where most half-implemented B+ trees fail. The write-up is longer than insertion and the edge cases are sharper. The shape, though, is the mirror of what you already have: descend, remove, and if the node fell below \lceil b/2 \rceil keys, fix it. Two ways to fix: borrow from a sibling (redistribute), or merge with a sibling.
Here is the logic, written as pseudocode for readability — the production version plugs into BPlusTree the same way insert did.
# Pseudocode — deletion with underflow handling.
def delete(tree, key):
_delete(tree.root, key, parent=None, idx_in_parent=None)
# If the root lost its only key and still has one child, shrink the tree.
if not tree.root.is_leaf and len(tree.root.keys) == 0:
tree.root = tree.root.children[0]
def _delete(node, key, parent, idx_in_parent):
if node.is_leaf:
if key in node.keys:
i = node.keys.index(key)
del node.keys[i]
del node.values[i]
else:
i = 0
while i < len(node.keys) and key >= node.keys[i]:
i += 1
_delete(node.children[i], key, parent=node, idx_in_parent=i)
# After recursion returns, check for underflow. Root is exempt.
min_keys = (tree.b + 1) // 2 - 1 # ⌈b/2⌉ - 1 keys minimum
if parent is None or len(node.keys) >= min_keys:
return
# Try to borrow from left sibling, then right sibling; else merge.
left = parent.children[idx_in_parent - 1] if idx_in_parent > 0 else None
right = parent.children[idx_in_parent + 1] if idx_in_parent < len(parent.children) - 1 else None
if left is not None and len(left.keys) > min_keys:
borrow_from_left(node, left, parent, idx_in_parent)
elif right is not None and len(right.keys) > min_keys:
borrow_from_right(node, right, parent, idx_in_parent)
elif left is not None:
merge(left, node, parent, idx_in_parent) # merges node into left
else:
merge(node, right, parent, idx_in_parent + 1)
The three sub-operations are short but must be exact.
- borrow_from_left moves the rightmost key of the left sibling into the node, and updates the parent's separator to reflect the new boundary. For leaves, the key physically moves; for internal nodes, the parent's separator drops down into the node and the sibling's last key climbs up into the parent (a rotation). Either way, after the borrow both nodes are at least half full.
- borrow_from_right is the mirror image.
- merge combines the underflowing node with its sibling into a single node, pulling the separator key from the parent down into the merged node (for internal nodes) or simply concatenating (for leaves, dropping the now-redundant separator from the parent). The parent loses one key and one child pointer. If the parent now underflows, the cascade continues upward, exactly as split cascades during insertion — and when the cascade reaches the root and the root's key count drops to zero, the root is replaced by its only remaining child and the tree shrinks by one level.
Why prefer borrowing to merging? Merging halves the number of nodes at a level and can trigger further merges all the way to the root — costly when it happens. Borrowing is local: two siblings and their parent are the only nodes touched. On workloads with a mix of inserts and deletes, a policy of "borrow if possible, merge otherwise" keeps nodes comfortably above the half-full line and avoids oscillation where the same pair of nodes keeps splitting and merging.
delete(13). Its right sibling also has only two keys, so we cannot borrow. We merge the two leaves into [15, 17, 19], drop the separator key 17 from the parent, and remove one child pointer. The parent now has one key and two children, which is still legal (because the parent is the root; a non-root internal node would have to check its own underflow and possibly cascade). The leaf linked list has one less link, exactly as expected.Trace a delete that triggers a merge
Start with a tree whose root is [13, 17] with three children:
[13 | 17]
/ | \
[9, 11] [13, 15] [17, 19]
Every leaf has exactly 2 keys — the minimum for b = 4. Now run delete(13).
- Descent finds key 13 in the middle leaf.
del node.keys[0]; del node.values[0]leaves the leaf with[15], one key — below the minimum of 2. Underflow. - Check siblings. Left sibling
[9, 11]has 2 keys — the minimum. Borrowing from it would leave it with 1, itself underflowing. So we cannot borrow from the left. Right sibling[17, 19]has 2 keys — same story. So we cannot borrow from either. - Merge with the right sibling. New leaf:
[15, 17, 19]. Thenextpointer of the merged leaf now points to whatever the right sibling used to point at. The left sibling[9, 11]'snextpointer already pointed at this leaf, so nothing changes there. - The parent loses the separator key 17 and the child pointer to the old right sibling. The parent becomes
[13]with two children. That is one key, which is above the root's minimum (the root is allowed to have a single key; non-root internal nodes would need \lceil b/2 \rceil - 1 = 1 key, also satisfied here). - If the parent had underflowed, the cascade would go to its parent. In this case the parent is the root, and it still has one key, so we are done.
If we had also deleted 15 after that, the merged leaf [17, 19] would not underflow (still 2 keys), so no further structural change. If we kept deleting we would eventually trigger a cascade all the way to the root, and the root would be left with zero keys — at which point delete replaces the root with its only remaining child and the tree's height drops by one.
>>> t.delete(13)
>>> list(t.range(0, 100))
[(9, 'v9'), (11, 'v11'), (15, 'v15'), (17, 'v17'), (19, 'v19')]
Common confusions
-
"What is the difference between a B-tree and a B+ tree?" In a plain B-tree, data lives in every node — internal and leaf. A lookup can stop at an internal node if the key is found there. In a B+ tree, all data lives in the leaves; internal nodes hold only keys-as-routing-signposts plus child pointers. Two consequences: internal nodes are smaller (no values) so fanout is higher and the tree is shallower, and range scans are a single descent plus a sequential walk along the leaf linked list instead of a zig-zag through internal nodes. Every serious relational database uses B+.
-
"Branching factor, fanout, order — are these the same thing?" They are used interchangeably in most texts, but careful writing distinguishes them. Order (or branching factor) b is the maximum number of child pointers an internal node can have. Fanout is the average number of children actual nodes have in a populated tree — typically around b \times 0.7 because nodes are rarely perfectly full. On a 4 KiB page with 8 B keys and 8 B pointers, b = 255 and the average fanout is roughly 170. A four-level tree therefore indexes 170^4 \approx 800 million records comfortably, or up to 255^4 \approx 4.2 billion at theoretical maximum.
-
"Why does the leaf split put the right half's first key in the parent but the internal split puts the median itself in the parent?" Because leaves must hold every key — the key is the data. If we lifted the median out of the leaf, we would lose the record. Internal nodes hold only routing keys, which do not belong to any particular child, so lifting the median out entirely is fine. This asymmetry is small but it is the single most common bug in first-time B+ implementations.
-
"What happens on duplicate keys?" In a unique B+ tree (a primary key index), an
insertof an existing key does an in-place update — no structural change. That is what theif k == key: updatebranch in the code does. In a non-unique B+ tree (a secondary index), the implementation allows the same key to appear in multiple leaves and the search returns all of them; the routing keys still work as long as the comparison is well-defined. Postgres, InnoDB, and SQLite all support both modes. -
"Is the leaf linked list double-linked or single-linked?" In almost every production implementation it is double-linked, so range scans can go in either direction and so merges on delete can fix both neighbours' pointers in one step. The code above uses a single-link for brevity; adding a
prevpointer doubles the metadata cost of each leaf but is free at query time. -
"What about concurrent readers and writers?" Nothing in this chapter handles concurrency — every operation assumes exclusive access. Real B+ trees use latch-coupling ("crabbing"): acquire a lock on a node, descend to the child, release the parent. For writes, use a B-link tree variant, which keeps a high-key on each node so readers can follow a right-pointer if a concurrent splitter has moved their target. Build 4 chapter 26 covers this — for now, we are single-threaded.
Going deeper
If the picture is clear — nodes are pages, splits go up, merges bubble up, leaves are a linked list — the below are the performance knobs professional engines tune after the basic algorithm is working.
Branching factor tuning for cache lines
The fanout calculation in the previous chapter gave b = 254 for a 4 KiB page with 8 B keys and 8 B pointers. That is per page, and each page fetch from disk is one unit of I/O. But once the page is in RAM, the binary search across 254 keys has its own cost measured in cache lines.
A CPU cache line is 64 bytes. A run of 8 B keys packs 8 per cache line. A binary search over 254 keys performs \lceil \log_2 254 \rceil = 8 comparisons, each of which probes roughly one cache line on average — but the probes are at unpredictable locations, so most of them miss L1 and hit L2 or L3. On a modern CPU each L3 hit is ~30 cycles. The node's binary search can cost 300+ cycles of RAM access, which is a large fraction of the entire operation on a warm tree.
Optimisations that help:
- Eytzinger layout. Re-order the keys in the node so that an Eytzinger (level-order) ordering replaces the natural sorted ordering. A binary search over an Eytzinger array has better branch prediction and prefetching behaviour, and runs 1.5–2× faster at the same key count. A few production engines (HyperLogLog-era ClickHouse, some Postgres forks) use this.
- SIMD scan. For small b (say b = 16), linear scans with SIMD (one
_mm512_cmpgt_epi64instruction compares 8 keys at once) beat binary search on wall time. Some engines use a smaller b than the page size would allow, precisely to keep scans inside one SIMD instruction. - Larger pages. Going from 4 KiB to 16 KiB pages quadruples b and roughly halves the tree's height. Each page read is more expensive (4× the bytes) but there are 2× fewer of them on a lookup, so wall time improves when disk is the bottleneck. Postgres uses 8 KiB; InnoDB uses 16 KiB.
The tradeoff: larger pages waste more space on half-full nodes, increase write amplification on small updates (you rewrite 16 KiB for a one-byte change), and cost more CPU per page. 4–16 KiB is the sweet spot for general-purpose databases; analytical engines push higher (ClickHouse uses 1 MiB "granules").
Prefix compression
A B+ tree on string keys — say, URLs or sorted UUIDs — has enormous redundancy inside each node. Adjacent keys share long prefixes: /user/00001, /user/00002, /user/00003, ... Storing each key in full wastes most of the node on the same repeated bytes.
Prefix compression stores each key as a pair (shared_length, suffix). The first key in the node is stored in full. Each subsequent key shares some prefix with the previous one and stores only the number of shared bytes plus the diverging suffix. Binary search works the same way — you still compare the target key against the reconstructed keys — but the node holds 3× to 10× more keys, which means 3× to 10× higher fanout, which means a shallower tree.
InnoDB uses prefix compression on secondary indexes. LevelDB and RocksDB use it on every SSTable block. The cost is decompression time on every page access: for large values of b this is amortised across many keys, so the cost per lookup is small.
Right-only inserts optimisation
A huge fraction of real-world B+ tree inserts have a monotonically increasing key — think timestamps, auto-increment primary keys, or log sequence numbers. Every insert goes to the rightmost leaf. Without optimisation, that leaf splits every b-1 inserts, and each split allocates a new leaf, promotes a key, and possibly cascades.
Right-only split (also called biased split or bulk-loading mode) detects this pattern and splits the full rightmost leaf at the very end instead of the middle — leaving the left half full and the right half empty, ready for more right-side inserts. The full left half cannot be inserted into again without splitting, but it is immutable henceforth, so that is fine. The new empty right half absorbs the next b-1 inserts with no further structural change. The effect is that a bulk insert of N sorted rows does O(N / b) splits instead of O(N), and every leaf ends up 100% full instead of 50–70% full — half the disk usage for the same data.
Every production B+ tree has this optimisation. Postgres calls it fastpath insertion; InnoDB has INSERT INTO ... LOG_AT_ONCE; SQLite uses it implicitly when it detects ordered primary-key inserts. If your schema has an auto-increment primary key, every one of your INSERT statements is on the fastpath.
The mirror optimisation exists for left-side deletes (cursor-based purge of old rows, TTL queues) but is rarer in practice.
Where this leads next
You now have an in-RAM B+ tree. Every concept that matters for the disk version — pages as nodes, the split-and-promote recurrence, leaf linked lists for range scans, underflow by borrow-or-merge — is something you can run in sixty lines. The next chapter replaces the RAM _Node objects with real 4 KiB page buffers read from a file, and introduces the buffer pool — the cache that holds recently-touched pages in RAM so that the second access to a hot node does not hit disk.
- Buffer pool design: LRU, Clock, 2Q — chapter 24: a fixed-size RAM cache of page buffers, with a replacement policy (LRU, Clock, 2Q, or LRU-K) that decides which page to evict when the cache fills. This is what makes the B+ tree's four-hop worst case actually four disk hops on cold data and zero hops on hot data.
After that, Build 4 assembles the rest of the storage engine: write-ahead log for crash recovery, latch-coupling and B-link trees for concurrency, copy-on-write for snapshot isolation. Every one of them is a layer on top of the tree you just built.
The tree is the atom; everything else is the machinery that keeps it durable, concurrent, and fast.
References
- Rudolf Bayer, Edward McCreight, Organization and Maintenance of Large Ordered Indexes, Acta DataFlow (1972) — the original B-tree paper.
- Douglas Comer, The Ubiquitous B-Tree, ACM Computing Surveys (1979) — the classic survey that named and catalogued the family, including B+.
- Goetz Graefe, Modern B-Tree Techniques, Foundations and Trends in Databases (2011) — exhaustive treatment of every optimisation in this chapter plus a few dozen more.
- Philip L. Lehman, S. Bing Yao, Efficient Locking for Concurrent Operations on B-Trees, ACM TODS (1981) — the B-link tree paper, the foundation of almost every modern concurrent B+ implementation.
- PostgreSQL source,
src/backend/access/nbtree/— a clean, well-commented production B+ tree implementation in C. - Viktor Leis, Alfons Kemper, Thomas Neumann, The Adaptive Radix Tree, ICDE (2013) — a modern alternative to the B+ tree for in-memory indexes, and a useful frame of comparison.