In short
You have a B+ tree that lives on disk, one page per node, and a worst case of four disk reads per lookup. In practice you do not pay four disk reads on every query — because the top of the tree is touched by every query, the root and the second level belong in RAM, always. That RAM is the buffer pool: a fixed-size array of page-sized slots ("frames"), a hash map from page_id to frame, and a replacement policy that decides which frame to evict when a new page needs to come in and every frame is full.

The naive policy is LRU — evict whichever frame was touched longest ago. It works well on typical workloads and catastrophically badly on sequential scans: one full-table scan reads every page of the table exactly once, and because "least recently used" is now the entire hot working set, the scan evicts everything you care about in exchange for pages you will never read again.

Clock fixes the metadata cost: instead of a doubly-linked list (expensive to maintain under concurrent access), it uses a circular array with one reference bit per frame. A hand sweeps the circle, clearing reference bits; when it finds a frame whose bit was already zero, that frame is evicted. Clock is what Postgres actually runs, and the scheme Linux's page cache descends from; it is LRU made cheap enough for a multi-threaded buffer pool.

2Q solves the scan problem directly: keep two queues, a probationary FIFO for pages seen only once and a protected LRU for pages seen twice or more. New pages enter the probationary queue and get promoted only if they are touched again, so a scan cycles through the small probationary queue without ever touching the large protected working set. LRU-K (K=2 is the most common) keeps the timestamps of the last K accesses to each page and evicts the one whose K-th-most-recent access is oldest — a more principled way to distinguish pages touched once from pages touched repeatedly.
Every frame has two pieces of bookkeeping the policy does not know about: the pin count (how many threads currently need this frame not to move) and the dirty flag (is this frame's content newer than the disk copy, so eviction must write first?). Pin-and-unpin is the contract between the buffer pool and the B+ tree; the dirty flag is the contract between the buffer pool and the write-ahead log. In this chapter you write a buffer pool in about sixty lines of Python with LRU and pin/unpin, then modify it into Clock and 2Q to see why each of the three policies exists.
You finished the last chapter with a B+ tree that works — insertion, search, range, split, merge — but every node is a live Python object in RAM. The promise of the B+ tree is that the nodes are pages on disk, so the tree can be arbitrarily larger than memory. To get there you need one new thing: a cache, sitting between the tree's "give me node 473" call and the disk's pread that actually fetches 4 KiB from a file. That cache is the buffer pool, and it is not a detail. In every production database you have heard of, the buffer pool is the component whose tuning knobs — shared_buffers in Postgres, innodb_buffer_pool_size in MySQL, SGA in Oracle — are the first thing a DBA touches and the last thing they stop worrying about.
This chapter is about three decisions, each of which got its own two-decade argument in the literature. First: LRU or something else? Second: LRU is too expensive to implement literally — what cheap approximation do you use? Third: how does the buffer pool cooperate with the code that is actually using pages — the pin/unpin contract, the dirty flag? You will write the whole thing in Python, run it against a synthetic workload, and see each policy's failure mode in real numbers.
The buffer pool — a fixed-size array of frames
The shape of a buffer pool is almost embarrassingly simple. You decide up front how much RAM the database gets — say, 1 GiB — and divide that by the page size — 4 KiB — to get the number of frames: 262,144 slots, each one the size of one page, each one either empty or holding a copy of some disk page. Alongside the frames you keep a page table, a hash map from page_id to the frame that currently holds it. Alongside the page table you keep whatever metadata the replacement policy needs — a linked list for LRU, a reference bit and a hand pointer for Clock, two queues for 2Q — and two numbers per frame: the pin count and the dirty flag.
Why is the buffer pool not just the OS page cache? Linux already caches file pages in RAM for free — every pread is served from the page cache if the page is hot. The answer has three parts. First, eviction policy: the kernel uses a general-purpose LRU that does not know "this page is a B+ tree root that every query touches" from "this page is a one-off scan of a log table." The database knows, and a database-specific policy beats the kernel's. Second, write control: the kernel flushes dirty pages whenever it wants, which breaks the write-ahead log's ordering guarantees (WAL before the page it protects). The database needs to decide when a dirty page can be written back — after the WAL record has reached disk, never before. Third, access control: the buffer pool needs to pin pages (prevent eviction while a read is in progress) and control concurrency (latch a page while modifying it); neither is expressible through the kernel's API. The database runs its own buffer pool for the same reason JVMs run their own garbage collector: the layer below does not know enough.
The interface the B+ tree sees is four methods. You will implement all four in the next section.
- fetch(page_id) — get me the frame holding page page_id. If it is already in the pool, return its frame and increment the pin count. If not, pick a victim, write it back if dirty, read the new page from disk into the victim's frame, and return.
- unpin(page_id, dirty=False) — I am done with this page. Decrement the pin count. If I modified it, mark it dirty so the next eviction of this frame writes it back.
- flush(page_id) — write this page's frame back to disk right now (used by the WAL before commit).
- new_page() — allocate a fresh page on disk, load a zeroed frame for it, return both.
Four methods. The whole rest of the chapter is the replacement policy.
Strict LRU — the obvious policy, in sixty lines
Let us write it. The naive LRU buffer pool is the canonical reference against which every other policy is measured. In Python, with a dict for the page table and an OrderedDict to simulate the recency list, you can have it working in about sixty lines.
# bufferpool_lru.py — a minimal strict-LRU buffer pool with pin/unpin.
import os
from collections import OrderedDict

PAGE = 4096

class Frame:
    __slots__ = ("page_id", "data", "pin_count", "dirty")

    def __init__(self):
        self.page_id = None
        self.data = bytearray(PAGE)
        self.pin_count = 0
        self.dirty = False

class BufferPool:
    def __init__(self, path, n_frames):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        self.frames = [Frame() for _ in range(n_frames)]
        self.page_table: dict[int, int] = {}  # page_id -> frame index
        self.lru = OrderedDict()              # frame index -> None (LRU order)
        for i in range(n_frames):
            self.lru[i] = None                # all frames start as free

    # ---- core API ----
    def fetch(self, page_id: int) -> Frame:
        # Hit: frame already holds the page.
        if page_id in self.page_table:
            idx = self.page_table[page_id]
            self.frames[idx].pin_count += 1
            self.lru.move_to_end(idx)         # mark most recently used
            return self.frames[idx]
        # Miss: find a victim frame.
        idx = self._pick_victim()
        frame = self.frames[idx]
        if frame.page_id is not None:
            self.page_table.pop(frame.page_id)
            if frame.dirty:
                os.pwrite(self.fd, frame.data, frame.page_id * PAGE)
                frame.dirty = False
        # Load the requested page into the victim frame.
        data = os.pread(self.fd, PAGE, page_id * PAGE)
        if len(data) < PAGE:
            data = data + bytes(PAGE - len(data))  # zero-fill beyond EOF
        frame.data[:] = data
        frame.page_id = page_id
        frame.pin_count = 1
        self.page_table[page_id] = idx
        self.lru.move_to_end(idx)
        return frame

    def unpin(self, page_id: int, dirty: bool = False) -> None:
        idx = self.page_table[page_id]
        f = self.frames[idx]
        f.pin_count -= 1
        if dirty:
            f.dirty = True

    def flush(self, page_id: int) -> None:
        idx = self.page_table[page_id]
        f = self.frames[idx]
        if f.dirty:
            os.pwrite(self.fd, f.data, f.page_id * PAGE)
            f.dirty = False

    def _pick_victim(self) -> int:
        # Walk the LRU from oldest to newest, skip pinned frames.
        for idx in self.lru:
            if self.frames[idx].pin_count == 0:
                self.lru.move_to_end(idx)  # will be repurposed
                return idx
        raise RuntimeError("all frames pinned — increase buffer pool size")

    def close(self) -> None:
        for f in self.frames:
            if f.dirty and f.page_id is not None:
                os.pwrite(self.fd, f.data, f.page_id * PAGE)
        os.close(self.fd)
That is a buffer pool. Sixty lines, including imports and whitespace, with the four-method interface exactly as advertised and a strict LRU replacement policy implemented as an OrderedDict.
Why an OrderedDict and not a plain deque or a list? Because LRU needs three operations: "mark frame X as most recently used" (move to tail), "find the oldest unpinned frame" (iterate from head, skipping pinned), and "record a new entry" (insert at tail). OrderedDict.move_to_end is O(1), iteration from head is O(\text{position}), and insertion is O(1). A plain list would be O(n) for move-to-end. A doubly-linked list hand-rolled in Python would be O(1) everywhere but is clumsier than the standard-library OrderedDict.
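The three operations can be seen in miniature at the REPL, with frame indices standing in for frames:

```python
from collections import OrderedDict

lru = OrderedDict.fromkeys([0, 1, 2, 3])  # four frames, 0 is the oldest
lru.move_to_end(1)         # frame 1 just got a hit: order is now 0, 2, 3, 1
oldest = next(iter(lru))   # eviction scans from here: frame 0
lru.pop(oldest)            # evict it
lru[4] = None              # admit a new frame at the MRU end
print(list(lru))           # [2, 3, 1, 4]
```

The whole recency list is three O(1) dict operations plus one iteration, which is exactly why `_pick_victim` above can afford to skip pinned frames by walking from the head.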
Two things in this code deserve a second look.
The pin/unpin contract. Every call to fetch increments the pin count; every call to unpin decrements it. The eviction loop in _pick_victim skips any frame with pin_count > 0. This is how the B+ tree code tells the buffer pool "do not evict this page while I am using it" without taking out any kind of lock: the pin count is a reference count. Code that calls fetch must call unpin with the same page_id before the page can be evicted, just as code that allocates memory must free it. Forgetting to unpin is the buffer-pool equivalent of a memory leak — the frame is stuck, eventually every frame is pinned, and the pool runs out of victims.
The dirty flag. Writes to the page are done by the caller, not by the buffer pool (the caller holds the frame.data bytearray and modifies it in place). When the caller unpins, they pass dirty=True if they wrote. The buffer pool remembers that the frame is now dirty. At eviction time, if the frame is dirty, the pool writes it back to disk before reusing the frame. This is the write-back policy: dirty pages are written lazily, piggybacked on eviction, not immediately. The alternative (write-through) would pwrite on every unpin — simpler, but you would lose most of the buffer pool's benefit on write-heavy workloads, because every write would still hit the disk.
LRU's failure mode — the sequential scan catastrophe
LRU is good at typical database workloads and bad at one specific workload. Consider a 10,000-page table. Your buffer pool has 1,000 frames. Most of the time, 800 of those frames hold the "hot" pages — frequently-touched indexes, small lookup tables, the B+ tree's top three levels — and the rest cycle among various leaves.
Now someone runs SELECT * FROM big_table. The query engine reads every one of the 10,000 pages once, in order. Each read is a miss, so each one picks an LRU victim. Because the scan is the thing most recently used, every page in the scan becomes "most recently used" at the moment it is read, and every one of your hot pages gets pushed down the LRU list. Within 1,000 page reads (one buffer-pool-full), all 1,000 frames are scan pages. Within 10,000 page reads, you have cycled the scan through the buffer pool ten times, each page used once and then discarded — and every hot page is long gone.
The scan just evicted your entire working set. The next query — a simple primary-key lookup — now takes four disk reads instead of zero, because the top of the B+ tree has to be re-read from disk.
This is not a theoretical concern. Every production database has a story about a background analytics job blowing up the buffer pool and causing a latency spike on the main workload. The technical name is the scan problem or the sequential-flooding problem. It has two possible fixes: change the policy (use Clock, 2Q, LRU-K, ARC — they all help in different ways) or change the eviction behaviour for scans specifically (hint to the buffer pool that "these reads are scan reads, do not let them evict hot pages"). Postgres does the latter (ring buffer for sequential scans). InnoDB does the former (its LRU has a "midpoint insertion" rule that is essentially 2Q). Both are reasonable; both acknowledge that pure LRU alone is not enough.
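You can watch the catastrophe happen in a few lines of simulation. This is not the buffer pool above, just an LRU cache of page ids counting hits and misses; the sizes (100 frames, an 80-page hot set, a 10,000-page scan) are illustrative:

```python
from collections import OrderedDict

class LRUSim:
    """Count hits/misses for an LRU cache of page ids (no data, no disk)."""
    def __init__(self, capacity):
        self.cap = capacity
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def access(self, page_id):
        if page_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(page_id)       # recency bump
        else:
            self.misses += 1
            if len(self.cache) >= self.cap:
                self.cache.popitem(last=False)    # evict LRU
            self.cache[page_id] = None

sim = LRUSim(capacity=100)
hot = list(range(80))               # hot working set: fits in the cache
for _ in range(10):
    for p in hot:
        sim.access(p)
warm_hits = sim.hits                # pass 1 misses, passes 2-10 all hit: 720

for p in range(1000, 11000):        # a 10,000-page sequential scan
    sim.access(p)

before = sim.misses
for p in hot:                       # the hot set again
    sim.access(p)
print(warm_hits, sim.misses - before)   # 720 80
```

Before the scan, the hot set hits every time. After the scan, all 80 hot pages miss: the scan evicted the entire working set in exchange for pages it read once each.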
Clock — LRU approximation with a reference bit
Clock solves a different problem than the scan problem: implementation cost. Every call to fetch moves a frame to the end of the LRU list. In a single-threaded buffer pool that is fine; in a multi-threaded one, every fetch is now taking a lock on the shared LRU list. The lock becomes the bottleneck — sometimes more expensive than the disk I/O the buffer pool was meant to avoid.
Clock replaces the linked list with a circular array of frames and one reference bit per frame. On a hit, instead of moving the frame, you set its reference bit to 1 — a single-byte atomic store, no locking. Eviction uses a clock hand that sweeps the circle:
for each candidate (starting from hand position, wrapping around):
    if frame is pinned: skip
    elif frame's reference bit is 1: clear it and advance hand (a second chance)
    else: this frame is the victim; advance hand
A frame's reference bit is set whenever the frame is touched. The clock hand walks around clearing bits; when it finds a frame whose bit is already zero — meaning the frame has not been touched since the last sweep — that frame gets evicted. The policy approximates LRU: frequently-touched frames have their bits re-set before the hand comes around again and are spared; rarely-touched frames get cleared on one sweep and evicted on the next.
# bufferpool_clock.py — Clock replacement policy. Same interface as LRU.
class ClockPool(BufferPool):
    def __init__(self, path, n_frames):
        super().__init__(path, n_frames)
        self.ref_bits = [0] * n_frames
        self.hand = 0

    def fetch(self, page_id: int) -> Frame:
        if page_id in self.page_table:
            idx = self.page_table[page_id]
            self.frames[idx].pin_count += 1
            self.ref_bits[idx] = 1            # set instead of move
            return self.frames[idx]
        idx = self._pick_victim()
        frame = self.frames[idx]
        if frame.page_id is not None:
            self.page_table.pop(frame.page_id)
            if frame.dirty:
                os.pwrite(self.fd, frame.data, frame.page_id * PAGE)
                frame.dirty = False
        data = os.pread(self.fd, PAGE, page_id * PAGE)
        if len(data) < PAGE:
            data = data + bytes(PAGE - len(data))
        frame.data[:] = data
        frame.page_id = page_id
        frame.pin_count = 1
        self.ref_bits[idx] = 1
        self.page_table[page_id] = idx
        return frame

    def _pick_victim(self) -> int:
        n = len(self.frames)
        for _ in range(2 * n):                # at most two full sweeps
            idx = self.hand
            self.hand = (self.hand + 1) % n
            if self.frames[idx].pin_count > 0:
                continue
            if self.ref_bits[idx] == 1:
                self.ref_bits[idx] = 0        # second chance
                continue
            return idx
        raise RuntimeError("all frames pinned")
On a typical workload Clock gives you within a few percent of LRU's hit rate, with none of LRU's locking overhead. Clock-style second-chance eviction is what Linux's page cache was long built on. Clock is what Postgres runs (with the variant described below). Clock is what every multi-core database reaches for when strict LRU's contention becomes the bottleneck.
Why two full sweeps maximum? In the worst case, every frame's reference bit is 1 when the hand starts. The first sweep clears all of them to 0. The second sweep finds the first unpinned frame (bits are now 0) and evicts it. If every frame is pinned, even two sweeps will not find a victim — that is the "all frames pinned" error, and it means your buffer pool is too small for your concurrent workload.
Clock with usage counts (Postgres's variant)
Postgres uses a refinement called Clock with usage counts. Instead of a single reference bit per frame, each frame has a small counter (Postgres uses 0 to 5). A hit increments the counter (capped at 5). The clock hand decrements the counter; only when the counter reaches zero does the hand consider the frame for eviction. This means a frame touched five times needs five hand-passes before it becomes evictable — heavily-used pages stay in the pool longer, lightly-used pages leave quickly. It is LRU-K ideas grafted onto Clock, and it is what shared_buffers runs in every Postgres installation you have used.
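The usage-count sweep is small enough to sketch in isolation. This is the idea, not Postgres's actual code; `usage` and `pinned` are hypothetical per-frame arrays standing in for the pool's metadata:

```python
def pick_victim(usage, pinned, hand, cap=5):
    """Usage-count clock sweep (a sketch in the spirit of Postgres's
    clock-sweep). usage[i] is 0..cap, incremented on each hit to frame i;
    the sweep decrements it. Returns (victim_index, new_hand)."""
    n = len(usage)
    for _ in range((cap + 1) * n):       # enough passes to drain any counter
        i = hand
        hand = (hand + 1) % n
        if pinned[i]:
            continue
        if usage[i] > 0:
            usage[i] -= 1                # decrement instead of clearing a bit
            continue
        return i, hand                   # counter already zero: evict
    raise RuntimeError("all frames pinned")

usage = [5, 0, 2]
victim, hand = pick_victim(usage, [False, False, False], hand=0)
print(victim, usage)   # 1 [4, 0, 2]
```

Frame 0, touched five times, merely loses one count as the hand passes; frame 1, with a drained counter, is the victim. A heavily used frame needs five full hand passes with no intervening hits before it becomes evictable.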
2Q — two queues, one for probation, one for protection
Clock fixes the locking cost of LRU but does not fix the scan problem. 2Q does the opposite: it fixes the scan problem and stays at Clock-ish metadata cost.
The insight is simple. In pure LRU, there is one "most recently used" slot, and a one-off scan jumps every scanned page straight to it. 2Q separates pages that have been touched once from pages that have been touched twice or more, on the theory that pages touched only once are scan pages (use it, drop it) and pages touched twice or more are working-set pages (keep them).
The structure: two queues.
- A1 (or "probationary"): a FIFO of pages that have been seen exactly once. New pages enter A1 at the tail. When A1 fills, pages fall off the head — evicted without ever being promoted.
- Am (or "protected" / "main"): an LRU of pages that have been seen at least twice. When a page in A1 is accessed again, it is promoted to Am — removed from A1, inserted at the tail of Am. When Am fills, its head (least recently used) is evicted.
The two queues share the buffer pool's frames. A common split is 25% of frames for A1 and 75% for Am — the probationary queue is small, the protected queue is big.
fetch(page_id):
    if page_id in Am:
        move to tail of Am (recency bump)
    elif page_id in A1:
        remove from A1, insert at tail of Am (promotion)
    else:
        load from disk into a fresh frame, insert at tail of A1
Now run the scan. The scan reads 10,000 pages, each one exactly once. Every page enters A1. A1 holds at most 250 frames (25% of 1000), so after the first 250 pages the scan is cycling through A1 — each page enters at the tail, gets pushed to the head, and falls off. Am is completely untouched. The working set in Am survives the scan unharmed.
This is why 2Q exists. It fixes the exact failure mode LRU has on the exact workload (analytics scans mixed with transactional queries) that made LRU unworkable for real databases.
# Sketch of 2Q. Full implementation is an exercise; the structural pieces are:
class TwoQPool:
    def __init__(self, path, n_frames, a1_ratio=0.25):
        self.a1_size = int(n_frames * a1_ratio)
        self.am_size = n_frames - self.a1_size
        self.a1 = OrderedDict()  # page_id -> frame_idx, FIFO
        self.am = OrderedDict()  # page_id -> frame_idx, LRU
        # ... pin/unpin/dirty same as LRU ...

    def fetch(self, page_id):
        if page_id in self.am:
            self.am.move_to_end(page_id)      # Am is LRU
            return self.frames[self.am[page_id]]
        if page_id in self.a1:
            # Promotion: A1 -> Am. Evict Am's LRU if full.
            idx = self.a1.pop(page_id)
            if len(self.am) >= self.am_size:
                self._evict_from_am()
            self.am[page_id] = idx
            return self.frames[idx]
        # New page: enter A1. Evict A1's FIFO head if full.
        if len(self.a1) >= self.a1_size:
            self._evict_from_a1()
        idx, frame = self._load_fresh(page_id)
        self.a1[page_id] = idx
        return frame
The two-queue structure pays for itself on mixed workloads. For pure transactional workloads with no scans, 2Q is roughly as good as LRU; the 25% of frames spent on A1 is wasted, but the protected queue functions as a slightly smaller LRU. For workloads with any scan activity, 2Q massively outperforms LRU. That is the exact tradeoff you want: "not worse on the easy case, hugely better on the hard case."
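To see the scan resistance in numbers, here is the same hit/miss simulation used for LRU earlier, with the two-queue admission rule. Page ids only, no frames; the sizes are illustrative, and the hot set is deliberately small enough to fit in A1 so its pages survive until their second touch (a real caveat of 2Q: a loop wider than A1 never gets promoted):

```python
from collections import OrderedDict

class TwoQSim:
    """Hit/miss simulation of 2Q over page ids (no data, no disk)."""
    def __init__(self, capacity, a1_ratio=0.25):
        self.a1_size = max(1, int(capacity * a1_ratio))
        self.am_size = capacity - self.a1_size
        self.a1 = OrderedDict()   # probationary FIFO
        self.am = OrderedDict()   # protected LRU
        self.hits = self.misses = 0

    def access(self, p):
        if p in self.am:
            self.hits += 1
            self.am.move_to_end(p)
        elif p in self.a1:
            self.hits += 1
            del self.a1[p]                    # second touch: promote to Am
            if len(self.am) >= self.am_size:
                self.am.popitem(last=False)
            self.am[p] = None
        else:
            self.misses += 1
            if len(self.a1) >= self.a1_size:
                self.a1.popitem(last=False)   # FIFO, no promotion
            self.a1[p] = None

sim = TwoQSim(capacity=100)       # A1 = 25, Am = 75
hot = list(range(20))
for _ in range(10):
    for p in hot:
        sim.access(p)             # hot pages promoted into Am, stay there

for p in range(1000, 11000):
    sim.access(p)                 # scan churns through the 25-slot A1 only

before = sim.misses
for p in hot:
    sim.access(p)
print(sim.misses - before)        # 0 -- the working set in Am survived
```

Under LRU the same scan forced 20 fresh misses on the hot set; under 2Q it forces zero, because the scan never touched Am.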
LRU-K — remember the last K accesses, not the last one
The most principled of the four policies. Instead of tracking "when was this page last touched?", LRU-K tracks "when were the last K touches to this page?" and evicts the page whose K-th-most-recent touch is oldest. K=2 is the usual choice.
Why does K=2 matter? Because it captures the same distinction 2Q captures — "has this page been touched more than once?" — in a single ordering. A page touched once has no K-th-most-recent access; it is infinitely old, evicted first. A page touched many times with long gaps between touches has a more recent K-th-most-recent access than a page touched many times in quick succession — which is exactly the "frequency, not just recency" signal you want.
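The eviction rule reduces to one comparison over per-page access histories. A sketch, where `history` is a hypothetical map from page_id to its recorded access times (most recent last); ties and the paper's correlated-reference refinements are ignored:

```python
import math

def lru_k_victim(history, K=2):
    """Evict the page whose K-th-most-recent access is oldest.
    Pages with fewer than K recorded accesses count as infinitely old."""
    def kth_recent(times):
        return times[-K] if len(times) >= K else -math.inf
    return min(history, key=lambda p: kth_recent(history[p]))

hist = {
    7:  [100, 101],   # touched twice, both recently
    12: [5, 99],      # second-most-recent touch is ancient
    31: [102],        # touched once: backward K-distance is infinite
}
print(lru_k_victim(hist))   # 31
```

Page 31 goes first despite being the most recently touched: one access is not evidence of reuse. Among the remaining two, page 12 would go next, because its second-most-recent access (time 5) is far older than page 7's (time 100).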
LRU-K achieves slightly better hit rates than 2Q on most workloads. It is used in a few high-end engines (the original LRU-K paper was motivated by a relational engine). Its cost — you need to store K timestamps per page — is small but not zero, and the bookkeeping is more complex than 2Q's two-queue scheme. Most production engines chose 2Q or a Clock variant instead for the simplicity.
Comparison of the policies
| policy | hit rate (typical) | hit rate under scan | metadata per frame | concurrency cost | used by |
|---|---|---|---|---|---|
| Strict LRU | baseline | catastrophic | 2 pointers (list node) | high (list lock on every hit) | teaching examples, SQLite |
| Clock (1 bit) | within 2–5% of LRU | catastrophic | 1 bit | low (atomic bit set) | Linux page cache |
| Clock with usage count | better than LRU | partial protection | ~3 bits (0–5 counter) | low | Postgres shared_buffers |
| 2Q | similar to LRU | protected (scan isolated to A1) | 2 pointers + queue tag | medium (two lists) | InnoDB (midpoint-insertion variant) |
| LRU-K (K=2) | slightly better than LRU/2Q | protected | K timestamps | medium | O'Neil et al.'s paper, a few research engines |
| ARC (adaptive) | matches or beats LRU-K | protected (tuned adaptively) | 4 queues + ghost entries | high | ZFS, originally IBM DB2 |
| CLOCK-Pro | matches ARC | protected | Clock + history list | low | NetBSD, early Linux variants |
Read across the rows: every policy except strict LRU is designed to protect the working set against one-off scans, and every policy except strict LRU pays in extra metadata or extra structure. There is no free lunch; what changes between policies is which cost you are paying.
Pin/unpin and the dirty flag — the protocol beneath every policy
Everything above was about the policy — which frame to evict. There is a separate concern: the protocol between the buffer pool and its caller. That protocol has two pieces, and both are easy to get wrong.
The pin count. Every fetch returns a pinned frame. The caller is promising: I am looking at this data. Do not let the buffer pool move the page while I am still reading it. Every caller must unpin the page when they are done. The pin count is incremented on each fetch and decremented on each unpin, so a page fetched twice must be unpinned twice. The buffer pool's eviction logic skips any frame with pin_count > 0. If every frame is pinned (more concurrent readers than the pool has frames), fetch cannot find a victim and the system is stuck — so buffer pools are typically sized at 10× to 100× the concurrent thread count.
# The pin/unpin contract in use.
frame = pool.fetch(473) # pin_count: 0 → 1
key = read_key(frame.data, 12) # use the data
pool.unpin(473) # pin_count: 1 → 0, eligible for eviction again
# For a write:
frame = pool.fetch(473)
write_key(frame.data, 12, new_key)
pool.unpin(473, dirty=True) # same as above, plus marks frame dirty
The pin count is the only thing preventing the buffer pool's replacement policy from swapping out a page under a reader's feet. It is not a mutex (two readers can both pin the same page — that is fine) and it is not a latch (two writers pinning the same page is a concurrency bug, but the buffer pool does not check that; the tree's own latching does). It is a reference count, nothing more.
The dirty flag. When a caller modifies a frame's data and unpins with dirty=True, the buffer pool remembers that the page's RAM copy is newer than its disk copy. At eviction time the frame is written to disk before the new page is loaded. This is the write-back policy; its benefit is that a page repeatedly modified (a hot leaf in a write-heavy index) is written to disk once per eviction rather than once per modification.
But the dirty flag alone is not enough for durability. The WAL (chapter 27) layers on top of the dirty flag: before a dirty page can be written to disk, the WAL record describing the change must already be on disk. The buffer pool therefore exposes flush(page_id) so the WAL code can force a specific page out at a specific moment. And the buffer pool's eviction code, in a full implementation, consults the WAL to make sure it does not evict a dirty page whose WAL record has not yet been persisted. That interplay is the subject of the next two chapters.
A transaction through the buffer pool
Here is the flow of one B+ tree lookup, start to finish, against the buffer pool. Pool size: 4 frames. Disk has 8 pages. No page is cached yet; the working set is page 1 (root), pages 2 and 3 (internal nodes), pages 4–8 (leaves).
>>> pool = BufferPool("tree.db", n_frames=4)
>>> # Query 1: search for key that lives in page 5.
>>> root = pool.fetch(1) # miss, load p1 into frame 0. frames: [p1*,_,_,_]
>>> decide_child(root.data, 42)
>>> pool.unpin(1) # frames: [p1,_,_,_]
>>> mid = pool.fetch(2) # miss, load p2 into frame 1. frames: [p1, p2*,_,_]
>>> decide_child(mid.data, 42)
>>> pool.unpin(2)
>>> leaf = pool.fetch(5) # miss, load p5 into frame 2. frames: [p1, p2, p5*,_]
>>> value = find_key(leaf.data, 42)
>>> pool.unpin(5)
>>> # Three misses, three pread calls. Hit rate: 0%.
>>> # Query 2: search for key that lives in page 4.
>>> root = pool.fetch(1) # HIT. frames unchanged, LRU bumps p1.
>>> pool.unpin(1)
>>> mid = pool.fetch(2) # HIT.
>>> pool.unpin(2)
>>> leaf = pool.fetch(4) # miss. frames: [p1, p2, p5, p4*]. Hit rate so far: 2/6.
>>> pool.unpin(4)
>>> # Query 3: search for key in page 7. Pool is now full.
>>> root = pool.fetch(1) # HIT.
>>> pool.unpin(1)
>>> mid = pool.fetch(3) # miss. Pool full → evict LRU (p5).
# frames: [p1, p2, p3*, p4]
>>> pool.unpin(3)
>>> leaf = pool.fetch(7) # miss. Evict LRU (p2, last touched before p4 was).
# frames: [p1, p7*, p3, p4]
>>> pool.unpin(7)
Look at what happened. The first query cost three disk reads (cold cache). The second query cost only one: the root and the internal node were cached from the first query. The third query hit the root from cache and paid two misses (one internal node, one leaf). Across three queries we did 6 disk reads instead of 9 — a 33% saving, which grows as the working set fits better into the pool. With a buffer pool sized for the top two levels of a billion-record tree (a few megabytes), a point lookup is one disk read, not four.
The buffer pool is not magic. It is just an array of page slots with a replacement policy. But the difference between "one disk read per query" and "four disk reads per query" is the difference between a database that serves 100,000 queries per second and one that serves 25,000. That factor of four is where every buffer-pool engineering hour goes.
Common confusions
- "Why not just use the OS page cache? Linux already caches file pages for free." Three reasons, each of which matters. First, the kernel's LRU does not know which pages the database considers hot; it treats your B+ tree root the same as a one-off log read. Second, the kernel flushes dirty pages on its own schedule, which breaks the WAL's ordering requirement (WAL record before page write). Third, the buffer pool needs to pin pages to prevent eviction during concurrent modification; the kernel has no such API. Postgres, interestingly, does also rely on the OS page cache — its shared_buffers is a first-level cache on top of the kernel's cache, a debated design. InnoDB, Oracle, and SQL Server all run O_DIRECT and skip the kernel cache entirely.
- "How big should the buffer pool be?" Big enough to hold the working set — the pages your queries actually touch over a second or so — plus the top levels of every index. For a billion-record table with a four-level B+ tree and a fanout of 254, the top three levels are roughly 254^2 ≈ 64,500 pages, or about 256 MiB; the leaves are 1 GB to 16 GB. Sizing the pool for the top three levels gives three-out-of-four hops in RAM; sizing for the whole index gives every lookup in RAM. Postgres typically runs with shared_buffers at 25% of system RAM (leaving the rest for the OS page cache). InnoDB typically runs at 60–80% of system RAM (because it skips the OS cache and needs it all itself).
- "Dirty flag vs pin count — are they related?" No. The pin count says "is this frame currently in use by any thread"; the dirty flag says "is this frame's content newer than disk". A frame can be dirty and unpinned (modified and released; waiting for eviction or a checkpoint to write it back). It can be pinned and clean (actively being read, nothing written). It can be dirty and pinned (being modified right now). It can be clean and unpinned (just loaded from disk, nobody using it, an LRU victim).
- "What if all frames are pinned?" fetch cannot find a victim and fails. In a real database this is an error that either deadlocks or kills the query. The defence is to size the buffer pool larger than the maximum number of pages any single operation can pin at once — typically "buffer-pool size > 10 × (max concurrent threads) × (max tree depth)". With 64 threads and a four-level tree, that is 2,560 frames minimum. Real pools have tens of thousands.
- "Why is write-back better than write-through?" Because hot pages are modified many times between evictions. Write-through writes on every modification: N disk writes for N modifications. Write-back writes once per eviction: one disk write covers every modification made to the page since it was loaded. On a write-heavy workload that is orders of magnitude less disk I/O. The cost of write-back is that a crash loses in-memory modifications — which is why the WAL exists, exactly to provide the durability that write-back sacrifices.
- "Can I just use functools.lru_cache for this?" No. functools.lru_cache has no concept of pinning, dirty flags, eviction-with-writeback, or a fixed size in bytes. It is a decorator, not a cache manager. The machinery above exists because database caches have requirements that application caches do not.
Going deeper
If the picture is clear — buffer pool is frames + page table + policy, pin/unpin is a reference count, dirty flag triggers writeback on eviction — the rest of this section is the twenty years of argument about which policy is best.
ARC — the adaptive replacement cache
Published by Megiddo and Modha at IBM in 2003, ARC (Adaptive Replacement Cache) is widely considered the best general-purpose replacement policy ever designed. Its insight: workloads vary between "recency-biased" (hot pages stay hot) and "frequency-biased" (pages touched many times over long intervals matter most). 2Q picks one ratio of recency-protection to frequency-protection at design time. ARC tunes the ratio on the fly.
ARC maintains four lists. Two are real caches — one for pages seen once (T1, recency), one for pages seen twice or more (T2, frequency). Two are ghost lists — B1 and B2 — that hold the page identifiers (not the data) of pages recently evicted from T1 and T2. When a request hits a ghost list, ARC concludes that the corresponding real list was too small, and shifts the recency/frequency split in that direction. The ratio drifts toward whatever the workload actually needs.
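The adaptive part is surprisingly small. Here is a sketch of the learning rule alone, simplified from the Megiddo and Modha paper; `p` is the target size of T1, and the list lengths are assumptions about surrounding bookkeeping (the full ARC replacement logic is not shown):

```python
def adapt(p, ghost_hit_in, b1_len, b2_len, capacity):
    """ARC's adaptation rule (simplified sketch). A hit in ghost list B1
    means T1 was too small, so grow the recency target p; a hit in B2
    means T2 was too small, so shrink p."""
    if ghost_hit_in == "B1":
        delta = max(1, b2_len // b1_len) if b1_len else 1
        return min(p + delta, capacity)     # favour recency
    else:  # "B2"
        delta = max(1, b1_len // b2_len) if b2_len else 1
        return max(p - delta, 0)            # favour frequency

# B1 much smaller than B2: a B1 ghost hit moves p sharply toward recency.
print(adapt(10, "B1", b1_len=5, b2_len=20, capacity=100))   # 14
```

Every ghost hit is evidence that one side of the cache was undersized, and the step size is proportional to how lopsided the ghost lists are. That feedback loop is the whole "adaptive" in ARC.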
ARC's hit rate is measurably better than 2Q and LRU-K on workloads that change character — OLTP during the day, analytics at night. ZFS uses it. IBM DB2 originally used it. A patent on ARC (held by IBM) kept it out of Linux for about a decade; CLOCK-Pro (a Clock-based approximation) was developed partly to sidestep the patent while capturing the idea.
CLOCK-Pro and the modern Clock family
CLOCK-Pro, published by Jiang and Zhang in 2005, takes ARC's "cold page, hot page, history" distinction and implements it on top of a Clock-style circular array. Three kinds of pages: hot (frequently accessed, treated like Am in 2Q), cold resident (currently in cache but only seen once — like A1), and cold non-resident (ghost entries like B1). The hand sweep does the work of promotion, demotion, and eviction, all in one pass. Storage cost: a few bits per frame.
NetBSD uses CLOCK-Pro. Early Linux versions used a CLOCK-Pro-ish algorithm before settling on the current multi-list LRU-approximation. Every major file system or database built after 2010 that writes its own cache manager considers CLOCK-Pro (or its descendant, TwoHandClock) a default starting point.
The Postgres shared_buffers debate
Postgres's shared_buffers setting has been debated for two decades. The dominant advice in 2010 was "25% of RAM, no more" — because beyond that, Postgres's buffer pool starts fighting the kernel's page cache (both cache the same pages; you lose the benefit of the larger cache for no gain). The dominant advice in 2024 is closer to "50% of RAM" — because Postgres's buffer pool has gotten smarter (the usage-count Clock) and because modern SSDs make the kernel's prefetching less critical than it used to be.
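The usage-count Clock mentioned above is a small twist on plain Clock: each frame carries a capped counter instead of a single bit, so a frequently-touched frame survives several sweeps. A sketch, under the assumption of single-threaded access (`UsageCountClock` is an invented name; the cap of 5 matches Postgres's `BM_MAX_USAGE_COUNT`):

```python
class UsageCountClock:
    """Sketch of a Postgres-style clock sweep: a per-frame usage count
    (capped at 5, as in Postgres) replaces the single reference bit.
    The hand decrements; a frame is evictable only at zero."""

    MAX_USAGE = 5

    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = []                # [page_id, usage_count]
        self.index = {}                 # page_id -> frame position
        self.hand = 0

    def access(self, page_id):
        if page_id in self.index:       # hit: bump the count, capped
            f = self.frames[self.index[page_id]]
            f[1] = min(self.MAX_USAGE, f[1] + 1)
            return "hit"
        if len(self.frames) < self.capacity:
            self.index[page_id] = len(self.frames)
            self.frames.append([page_id, 1])
            return "miss"
        # Sweep: decrement counts until one reaches zero, reuse that frame.
        while True:
            f = self.frames[self.hand]
            if f[1] == 0:
                del self.index[f[0]]
                f[0], f[1] = page_id, 1
                self.index[page_id] = self.hand
                self.hand = (self.hand + 1) % self.capacity
                return "miss"
            f[1] -= 1
            self.hand = (self.hand + 1) % self.capacity
```

A page touched five times needs five full sweeps past it before it becomes evictable — a cheap, graded notion of "hot" that a one-bit Clock cannot express.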
But there is a deeper argument: should Postgres skip the kernel cache entirely (like InnoDB does, with O_DIRECT)? Doing so would let Postgres size shared_buffers at 90% of RAM and stop worrying about double-caching. The reason Postgres has not: its buffer pool's algorithms and its I/O layer were designed assuming the kernel did the prefetching, the write-coalescing, and a lot of the tricky bits. Rewriting to use O_DIRECT properly has been on the wishlist for a decade; it may ship in Postgres 18 or 19. When it does, the entire shared_buffers sizing conversation will change.
Scan resistance without policy changes — ring buffers
Postgres's approach to scans is different from 2Q's. Rather than redesigning the whole replacement policy, Postgres uses a ring buffer for sequential scans: a small fixed-size set of frames (typically 256 KiB) that sequential scans use exclusively. The scan fills the ring, cycles it, and never promotes its pages to the main shared_buffers LRU. The main buffer pool is untouched.
The ring buffer is effectively a one-line hack that captures the 2Q idea for one specific workload (sequential scans) without changing the underlying LRU. It works very well for the workload it is designed for and does nothing for other scan-like patterns (hash joins spilling to disk, index-only scans of a secondary index). The B+ tree we are building will not include a ring buffer; instead, Build 5 will implement 2Q or a variant, which generalises the scan-resistance idea beyond sequential scans.
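A sketch of the ring-buffer idea, assuming the caller tells the pool when an access belongs to a sequential scan (`PoolWithScanRing` is a name invented here, and frame contents are elided — only page ids are tracked):

```python
from collections import OrderedDict

class PoolWithScanRing:
    """Sketch of the ring-buffer idea: sequential scans cycle through
    a small private set of frames instead of the main pool, so a
    full-table scan cannot evict the hot working set."""

    def __init__(self, main_capacity, ring_capacity):
        self.main = OrderedDict()       # main pool, plain LRU
        self.main_capacity = main_capacity
        self.ring = []                  # small fixed-size ring for scans
        self.ring_capacity = ring_capacity
        self.ring_hand = 0              # next ring slot to overwrite

    def fetch(self, page_id, sequential_scan=False):
        if page_id in self.main:        # hot pages always served from main
            self.main.move_to_end(page_id)
            return "main-hit"
        if sequential_scan:
            if page_id in self.ring:
                return "ring-hit"
            if len(self.ring) < self.ring_capacity:
                self.ring.append(page_id)
            else:                       # cycle: overwrite the oldest slot
                self.ring[self.ring_hand] = page_id
                self.ring_hand = (self.ring_hand + 1) % self.ring_capacity
            return "ring-miss"          # never promoted to the main pool
        # Normal access: plain LRU on the main pool.
        if len(self.main) >= self.main_capacity:
            self.main.popitem(last=False)
        self.main[page_id] = None
        return "main-miss"
```

The scan churns only its own ring slots; the hot set in `main` is never consulted for eviction, which is the whole trick.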
Concurrent buffer pools — beyond LRU's list lock
The biggest engineering challenge in modern buffer pool design is concurrency. A buffer pool serving 256 threads cannot afford a global lock on every fetch. Three common strategies:
- Sharded page tables. Split the hash from page_id to frame into N independent tables, each with its own lock. Lock contention drops by roughly a factor of N. Used by nearly every modern engine.
- Lock-free Clock. The reference bit is an atomic bit. The hand pointer is an atomic integer. Multiple threads can sweep simultaneously; contention is limited to the rare eviction event. Used by RocksDB's block cache and a few others.
- Versioned reads. Each frame has a version number; readers record the version before using the frame's data and check that it did not change. If it did, retry. No read locks at all. Used in high-end research systems (LeanStore, Umbra) and in some columnar engines.
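The first strategy is the simplest to sketch. Assuming CPython's `threading.Lock` stands in for a real engine's spinlocks or latches (`ShardedPageTable` is a name invented here):

```python
import threading

class ShardedPageTable:
    """Sketch of a sharded page table: the page_id -> frame map is
    split into N independent dicts, each guarded by its own lock,
    so threads touching different shards never contend."""

    def __init__(self, num_shards=16):
        self.num_shards = num_shards
        self.shards = [{} for _ in range(num_shards)]
        self.locks = [threading.Lock() for _ in range(num_shards)]

    def _shard(self, page_id):
        return hash(page_id) % self.num_shards

    def lookup(self, page_id):
        s = self._shard(page_id)
        with self.locks[s]:             # only this shard is locked
            return self.shards[s].get(page_id)

    def insert(self, page_id, frame_id):
        s = self._shard(page_id)
        with self.locks[s]:
            self.shards[s][page_id] = frame_id

    def remove(self, page_id):
        s = self._shard(page_id)
        with self.locks[s]:
            self.shards[s].pop(page_id, None)
```

Two threads fetching pages in different shards proceed in parallel; only same-shard collisions serialise, and with N shards a uniform workload collides rarely.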
Build 4's buffer pool is single-threaded for pedagogical clarity. Production buffer pools are thousand-line modules of concurrent code. The algorithms in this chapter are the skeleton those modules hang on.
Where this leads next
You now have a working buffer pool — four methods, a replacement policy of your choice, a pin/unpin contract, a dirty flag. Together with the B+ tree from chapter 23, you have the top half of a real storage engine: data structure and cache, operating on disk pages, bounded in memory.
What is missing is the second half of durability. The buffer pool writes dirty pages back at eviction time (and on flush), which is good for throughput but bad for crash safety. If the process dies between modifying a page in RAM and writing it to disk, the modification is lost. And the modification that reaches disk might be a torn write — half old, half new — which corrupts the B+ tree in ways that range from "lost last row" to "tree is now unreadable."
The next chapter addresses the write half of the problem by introducing copy-on-write, an alternative to the buffer pool's in-place update that makes torn writes literally impossible and gives you snapshots for free:
- In-place update vs copy-on-write (LMDB) — chapter 25: instead of modifying a page in place and worrying about crash safety, allocate a new page, write the new content, and atomically swap the pointer in the parent. The tree becomes immutable; old versions linger as long as any reader holds them; crashes cannot corrupt anything. This is what LMDB does, and it is beautiful. The tradeoff is write amplification — you rewrite every ancestor of every modification — but for read-heavy workloads it is an excellent design.
After that, the write-ahead log (chapter 27) gives us the in-place counterpart: modify the page in RAM, log the modification first, and recover from the log if the process dies before the page is flushed. That is what every relational database does. Both designs — copy-on-write and WAL — depend on the buffer pool you just built.
The buffer pool is the hinge. Every access goes through it; every modification goes through it; every policy decision about RAM versus disk is resolved inside it. Get it right and everything above feels instant on hot data; get it wrong and the rest of the stack is carrying the buffer pool's mistakes.
References
- Elizabeth J. O'Neil, Patrick E. O'Neil, Gerhard Weikum, The LRU-K Page Replacement Algorithm For Database Disk Buffering, SIGMOD (1993) — the paper that introduced LRU-K and made buffer-pool policy a first-class engineering topic.
- Theodore Johnson, Dennis Shasha, 2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm, VLDB (1994) — 2Q, and the clearest statement of why scan resistance matters.
- Nimrod Megiddo, Dharmendra S. Modha, ARC: A Self-Tuning, Low Overhead Replacement Cache, FAST (2003) — the adaptive replacement cache, still the benchmark.
- Song Jiang, Xiaodong Zhang, CLOCK-Pro: An Effective Improvement of the CLOCK Replacement, USENIX ATC (2005) — ARC-level hit rates with Clock-level overhead.
- PostgreSQL source, src/backend/storage/buffer/ — a clean, production implementation of Clock-with-usage-count plus the ring-buffer scan optimisation.
- Viktor Leis, Michael Haubenschild, Alfons Kemper, Thomas Neumann, LeanStore: In-Memory Data Management Beyond Main Memory, ICDE (2018) — a modern lock-free buffer pool design that points at where this field is going next.