In short
Build 3 is finished. Your LSM has a memtable, sorted SSTables, Bloom filters, leveled and tiered compaction, snapshots, and a merging iterator — a real engine. On a clean benchmark it hums. Then you put it under a 20 000-writes-per-second load for an hour, plot the read-latency histogram, and see the graph that defines an LSM's reputation in production: a p50 under a millisecond, a p95 around two milliseconds, and a p99 that spikes to tens of milliseconds every few minutes. The spikes are not random — they line up with compaction passes. This chapter names the three walls of Build 3 that are not about what the engine can do but about how it behaves under sustained load. Wall 1 — compaction I/O contends with foreground I/O: leveled compaction writes 10–30× more bytes than the user does (chapter 17), all to the same SSD that reads and writes use, and a sustained 200 MB/s compaction leaves only half of a 400 MB/s SSD for your foreground traffic. Wall 2 — tail latency: a read issued during a compaction pass waits behind the compaction's queued I/Os; p99 balloons from ~1 ms to 20–40 ms; p99.9 can hit 100 ms. Wall 3 — write stalls: if the workload sustains writes faster than compaction can drain L0, the engine starts throttling and eventually stops accepting writes entirely (RocksDB's level0_stop_writes_trigger). Three walls, one root cause: compaction is a background process competing with the foreground for the same scarce resource (SSD bandwidth), and the LSM design has baked that contention into its architecture. This chapter draws the spike profile, quantifies each wall, contrasts with a B+ tree's much smoother latency curve, and hands the baton to Build 4. It also — importantly — names when LSM is still the right choice: write-heavy, SSD-friendly, dataset much larger than RAM. The walls are real, but so are the strengths.
You have finished Build 3. Ten chapters of it. The engine in your head right now looks like: a memtable absorbs writes, flushes produce SSTables into L0, leveled compaction drains L0 down through a pyramid of levels, Bloom filters keep reads from drowning in SSTable probes, snapshots let long-running scans coexist with ongoing compactions, and universal compaction gives you three tuning knobs for the write/read/space trilemma.
Spin up a benchmark. 64-byte keys, 256-byte values, a read-write mix of 80/20 — the "YCSB B" shape, which is the closest thing the industry has to a standard social-media workload. On your laptop SSD, with a cold cache, you see:
- p50 read latency: 380 µs
- p95 read latency: 1.2 ms
- p99 read latency: 2.5 ms
- throughput: 38 000 ops/sec
Good numbers. You commit, push, and your engineering manager nods approvingly.
Now run the same benchmark for an hour. Same workload, same hardware. Plot the p99 latency not as a single number but as a time series — p99 over each 10-second window. The graph has spikes.
The spikes are not noise, not flaky hardware, not a bug in your code. They are the designed behaviour of an LSM under load. Compaction is a background process — but "background" does not mean "free." The compactor reads SSTables and writes new ones, and those reads and writes go to the same SSD your foreground traffic uses. When a compaction pass fires, it reserves a chunk of the SSD's I/O bandwidth for itself, and every read or write that happens to land during that window waits in the OS's block-layer queue behind the compaction's in-flight I/Os.
This chapter names three walls. Not functional walls like Build 2's two — those were about things Bitcask couldn't do at all. These are operational walls: things the LSM can do, but pays for in latency, throughput, or smoothness every time it does them. Stare at them long enough and the motivation for Build 4 (B+ trees, in-place updates, no compaction) becomes obvious.
Wall 1 — compaction competes with foreground writes for I/O bandwidth
Your SSD has a bandwidth number on its datasheet. Call it 500 MB/s for a mid-range NVMe drive in 2026. That is the total read-plus-write bandwidth the drive can sustain at saturation, across all queues, shared by every process on the machine.
The LSM's foreground write path is tiny. A 20 000-writes-per-second workload at 256-byte values pushes only 20 000 × 256 = 5 MB/s of user data. Plus WAL writes (same 5 MB/s), plus the occasional memtable flush (~5 MB/s averaged). Call it 15 MB/s of foreground write traffic. 3% of the SSD's bandwidth.
Now add compaction. Leveled compaction's write amplification is ~10–30× in practice (see the leveled compaction derivation). At 15× amp, 5 MB/s of user writes produce 75 MB/s of compaction writes. Plus the compaction reads — to merge, the compactor must read the input SSTables, which is another ~75 MB/s (roughly equal to the write side, since compaction rewrites what it reads plus dedup deltas). Total compaction I/O: ~150 MB/s, 30% of your SSD's bandwidth, sustained.
Why the 10× compaction-to-user-write ratio hits SSD bandwidth so hard: the SSD does not care whether the bytes are "user" or "compaction." Every block write counts against the same bandwidth budget, and SSDs enforce a roughly-symmetric read/write ceiling — if your writes already use 40% of the bandwidth, reads compete for the remaining 60%. Compaction's 150 MB/s of sustained I/O displaces 150 MB/s of potential foreground traffic, one-for-one.
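The bandwidth arithmetic above is worth making concrete. A back-of-envelope sketch (in Python; the 500 MB/s drive, the 15× amplification, and the WAL/flush accounting are this section's running assumptions, not measured values):

```python
def io_budget(user_write_mbps, write_amp, ssd_mbps):
    """Back-of-envelope LSM I/O budget on a shared SSD, in MB/s."""
    wal = user_write_mbps              # WAL mirrors user writes
    flush = user_write_mbps            # memtable flushes, averaged
    foreground = user_write_mbps + wal + flush
    compaction_writes = user_write_mbps * write_amp
    compaction_reads = compaction_writes   # compactor re-reads what it rewrites
    compaction = compaction_writes + compaction_reads
    left_for_reads = ssd_mbps - foreground - compaction
    return foreground, compaction, left_for_reads

fg, comp, reads = io_budget(user_write_mbps=5, write_amp=15, ssd_mbps=500)
print(fg, comp, reads)   # 15 MB/s foreground, 150 MB/s compaction, 335 MB/s left
```

The striking ratio is the middle number: compaction's sustained I/O is 10× the foreground's, all charged against the same drive.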
Plot it.
Compaction is not competing with reads and writes in the sense of holding a lock; it is competing for the one resource nobody can timeshare cleanly — SSD I/O. The Linux block layer gives every I/O request a queue position, and a compaction that has 32 outstanding writes to the SSD pushes every foreground read to position 33.
The RocksDB team measured this directly. In a 2017 paper [1], they showed that a workload with 40 MB/s of user writes on a 500 MB/s SSD saw foreground read throughput collapse from 80 000 reads/sec to 25 000 reads/sec the moment a leveled L2→L3 compaction kicked in — a 3× read-throughput reduction driven entirely by I/O contention. Their fix was the rate limiter (see Going deeper), which caps compaction's bandwidth at a configured fraction of the SSD's total — but the fix just moves the trade-off; it does not dissolve it. Limit compaction's bandwidth and it falls behind, L0 grows, and you hit Wall 3.
Wall 2 — tail latency: a background compaction stall stalls a user read
Wall 1 is about average throughput. Wall 2 is about the tails — the p99 and p99.9 latencies that turn a benchmark-friendly engine into a production horror story.
Here is what happens, step by step, when a read arrives during a compaction pass.
- The read reaches the read path and starts checking SSTables (memtable → L0 → L1 → ... in newest-first order).
- The read hits a Bloom-filter positive in some SSTable at L3 and needs to load a 16 KB block from the SSD.
- It issues a `pread()` to the kernel.
- The kernel's block layer sees the compaction's I/O queue — say 32 writes and 16 reads in flight for the compactor thread — and queues the user's `pread()` behind them.
- The SSD processes the compaction's I/Os in order. Each takes ~100 µs. The user read waits ~32 × 100 µs = 3.2 ms in queue before its own I/O even starts.
- The user read's I/O completes. Total latency: ~3.2 ms queue + ~100 µs SSD + ~50 µs software overhead ≈ 3.35 ms. This single read has eaten your millisecond SLO budget.
Zoom out. At p50 a typical read hits the memtable or block cache and is a 0.3 ms operation. At p99, during a compaction pass, the same read can see 30+ ms. That is a 100× tail-to-median ratio, and it is the signature of an LSM under load.
Why p50 is unaffected but p99 explodes: compaction passes are rare but long. Most reads happen when no compaction is active, so p50 and p95 look clean. A small fraction of reads — the ones that happen to land during a compaction — pay the queue penalty, and that fraction lands at the tail of the histogram. If compactions fire every 10 minutes and each lasts 30 seconds, then 30/(10 × 60) = 5% of reads overlap a compaction — exactly the tail where p99 and p99.9 live.
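The two numbers that drive the tail — the queue wait and the overlap fraction — are each a one-line product. A sketch (Python; the 32-deep queue, ~100 µs service time, and 30-seconds-every-10-minutes compaction cadence are this section's assumptions):

```python
def queue_wait_ms(queued_ios, service_us=100):
    """Time a foreground read waits behind a compactor's in-flight I/Os."""
    return queued_ios * service_us / 1000

def overlap_fraction(compaction_secs, period_secs):
    """Fraction of reads that land during an active compaction pass."""
    return compaction_secs / period_secs

print(queue_wait_ms(32))              # 3.2 ms before the read's own I/O starts
print(overlap_fraction(30, 10 * 60))  # 0.05 → 5% of reads pay the queue penalty
```

Five percent of reads paying a multi-millisecond penalty is invisible at p50 and p95, and is exactly the population that p99 and p99.9 sample.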
Quantify it across production-tier engines. The numbers below come from published benchmarks (RocksDB [1], Cassandra [2], ScyllaDB [3]) on NVMe SSDs with sustained write-heavy workloads.
| Engine | p50 read | p99 read | p99.9 read | p50:p99 ratio |
|---|---|---|---|---|
| RocksDB (leveled, default) | ~0.3 ms | ~15 ms | ~60 ms | 50× |
| RocksDB (leveled, rate-limited) | ~0.4 ms | ~6 ms | ~25 ms | 15× |
| Cassandra (STCS tiered) | ~0.5 ms | ~30 ms | ~120 ms | 60× |
| ScyllaDB (DTCS / shard-per-core) | ~0.2 ms | ~4 ms | ~15 ms | 20× |
| Postgres (B+ tree, for reference) | ~0.3 ms | ~1.5 ms | ~4 ms | 5× |
Two things stand out. First, every LSM entry has a p50:p99 ratio of 15× or worse; the B+ tree (Postgres) has 5×. Second, mitigations help but don't dissolve the wall — rate limiting cuts RocksDB's p99 from 15 ms to 6 ms (a 2.5× improvement) at the cost of ~30% throughput headroom. ScyllaDB's shard-per-core design (one thread per CPU, one compaction queue per shard) avoids cross-shard contention and brings p99 down to 4 ms — impressive, but still 3× worse than the B+ tree.
The 2020 USENIX FAST paper "Characterising, Modelling, and Benchmarking RocksDB Key-Value Workloads at Facebook" [4] traced 14 days of Facebook's production RocksDB fleet and found that compaction caused 60–85% of all p99 read latency spikes above 10 ms. The engineers' mitigations — rate limiting, I/O priority classes, sub-compactions, separate SSDs for logs and data — reduced the spikes but could not eliminate them. Compaction contention is not a bug you can fix; it is a property of the design.
Wall 3 — write stalls and back-pressure when compaction can't keep up
Wall 1 is average throughput; Wall 2 is tail latency; Wall 3 is the nightmare scenario — the engine stops accepting writes entirely.
Leveled compaction has a hard invariant: L0 file count must stay below some threshold (default 4). If flushes keep arriving faster than L0→L1 compactions can drain L0, the file count rises. At 20 files, RocksDB starts throttling writes — artificially sleeping each put() for a few hundred microseconds to give compaction time to catch up. At 36 files, RocksDB stops accepting writes entirely until L0 falls below the threshold. This is the level0_slowdown_writes_trigger / level0_stop_writes_trigger pair from the leveled compaction chapter.
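The slowdown/stop trigger pair behaves like a three-state check on the write path. A minimal sketch (Python; the thresholds 20 and 36 are RocksDB's defaults as quoted above, and `write_pressure` is an illustrative name, not RocksDB API):

```python
def write_pressure(l0_files, slowdown_trigger=20, stop_trigger=36):
    """Classify the write path's back-pressure state from the L0 file count."""
    if l0_files >= stop_trigger:
        return "stop"        # writes block until compaction drains L0
    if l0_files >= slowdown_trigger:
        return "slowdown"    # each put() sleeps to let compaction catch up
    return "ok"

assert write_pressure(4) == "ok"         # normal operation
assert write_pressure(25) == "slowdown"  # throttled puts
assert write_pressure(36) == "stop"      # stop-writes trigger fires
```

Real engines also scale the sleep duration with how far past the slowdown trigger L0 has grown, which is why the throttle delays in the timeline below ramp up rather than jumping straight to a block.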
The write path, from the user's perspective, looks like this in the bad case:
```
t=0 ms:    put("k", "v")       → OK, 0.1 ms
t=1 ms:    put("k2", "v2")     → OK, 0.1 ms
...
t=30 s:    put("kN", "vN")     → OK, 0.1 ms
t=30.1 s:  put("kN+1", "vN+1") → OK, 0.5 ms  (throttle: +400 µs sleep)
t=31.0 s:  put("kN+2", "vN+2") → OK, 1.5 ms  (throttle: +1.4 ms sleep)
t=32.0 s:  put("kN+3", "vN+3") → OK, 5 ms    (throttle: +4.9 ms sleep)
t=33.0 s:  put("kN+4", "vN+4") → BLOCKED — 800 ms (stop-writes trigger)
t=33.8 s:  put("kN+4", "vN+4") → OK, 0.1 ms  (retry: compaction drained L0)
```
To a caller that expected sub-millisecond writes, an 800 ms block is catastrophic. Request timeouts fire, queues back up, retries amplify load, and the cascading-failure pattern that every postmortem ends up describing begins.
Three remarks on Wall 3.
Stall is correct behaviour. The alternative is worse — if L0 is allowed to grow unboundedly, reads get unboundedly slow (every read must probe every L0 file), Bloom filters exhaust RAM, and eventually a single key lookup costs tens of seconds. Stopping writes until L0 drains is the designed response; the designers chose "reject writes temporarily" over "degrade reads permanently." That is the right call, but the user feels it as an outage.
Stall is preventable — at a cost. Tune compaction to keep up with any realistic burst (more threads, higher bandwidth allocation, deeper L0 threshold). Each knob you turn eats into foreground capacity; you are rebalancing the Wall-1 vs Wall-3 trade-off, not escaping it.
Stall tells you your capacity ceiling. If your workload sustains enough writes to cause stalls, you have discovered your engine's actual write-throughput ceiling — it is the rate at which the compactor can drain L0, minus a safety margin. RocksDB's community documentation says a reasonable rule of thumb is: sustained write throughput = 0.2 × SSD write bandwidth / write amplification. On a 500 MB/s SSD with 15× amp: 0.2 × 500 / 15 = 6.7 MB/s of sustained user writes. Peak bursts can go higher (the memtable absorbs them), but sustained any higher than that and you eventually stall.
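That rule of thumb is one line of arithmetic. A sketch (Python; the 0.2 safety factor is the community rule of thumb quoted above, not a measured constant):

```python
def sustained_write_ceiling(ssd_write_mbps, write_amp, safety=0.2):
    """Rule-of-thumb sustained user-write ceiling (MB/s) before stalls set in."""
    return safety * ssd_write_mbps / write_amp

ceiling = sustained_write_ceiling(ssd_write_mbps=500, write_amp=15)
print(f"{ceiling:.1f} MB/s")   # ≈ 6.7 MB/s of sustained user writes
```

Note how the ceiling moves: halving write amplification (say, by switching to a tiered policy) doubles it, and a faster SSD raises it linearly — which is exactly the knob space the trilemma chapter explored.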
Why this motivates an in-place engine
Stare at the three walls together and a hypothesis forms. All three are caused by the same thing: compaction. Remove compaction, and all three walls dissolve.
But you cannot just "remove compaction" from an LSM — compaction is what turns the log-structured appends into a queryable sorted structure. The design requires it. To remove compaction you have to change the design: don't write data out-of-place and compact it into position, write it in-place to begin with.
That is exactly what a B+ tree does.
A B+ tree put writes exactly two things: a WAL record (sequential append, cheap), and a modified leaf page (one random 4 KB page write). Occasionally a page splits, cascading a few more page writes up the tree. Total amplification: ~2–5×. No compaction. No Wall 1, no Wall 2, no Wall 3.
What the B+ tree pays for this is different. Every page write is a random write (the leaf to be modified lives at some arbitrary SSD location, not at the tail of an append-only log), so the raw I/O pattern is less SSD-friendly. Every read must traverse the tree from root to leaf — 3–5 page reads instead of a single memtable probe + 1 SSTable block. And in-place updates make crash recovery harder: a torn page write (half the page old, half new) can corrupt the index if not protected by a careful WAL protocol (ARIES, Build 4 chapters 7–8).
These are real costs. But they are smooth costs — a B+ tree's p99 is a small multiple of its p50 because there is no bursty background process. Contrast with the LSM's 50× p99/p50 ratio under load. Different workloads, different winners. That is the whole point of Build 4 existing: a second engine family with a genuinely different latency profile.
When LSM is still the right choice
The three walls are real. None of them kills the LSM — they constrain it. For a large and important class of workloads, the LSM's strengths outweigh them by a wide margin.
Use LSM when:
- The workload is write-heavy. If writes outnumber reads by more than 5:1, the LSM's append-mostly write path (no in-place random writes, no page splits) dominates any B+ tree. Time-series databases, event logs, telemetry, write-ahead fact tables — all of these are natural LSM territory. InfluxDB, Prometheus, and most of the observability stack are LSMs for this reason.
- The dataset is much larger than RAM. LSM's sparse index (one entry per SSTable block, ~4 KB of index per MB of data) is far more RAM-efficient than a B+ tree's buffer pool. A 10 TB LSM needs ~40 GB of sparse-index RAM; a 10 TB B+ tree with a 10% hot set needs ~1 TB of buffer pool to avoid disk reads on every request. For very large datasets on a single node, the LSM wins on RAM economy even when its latency tail is worse.
- You can pay for SSDs (not HDDs). Compaction is a sequential-read, sequential-write workload, and SSDs deliver 500–3000 MB/s of sequential I/O with sub-millisecond block latency. On HDDs (100 MB/s sequential, 10 ms seek), compaction is far more painful — a random-access B+ tree with a strong buffer pool can outperform an LSM on spinning rust. LSM is the SSD-era design; the B+ tree predates SSDs by decades and works on both.
- The workload tolerates p99 spikes. Analytics pipelines, bulk loaders, user-facing feeds where a 30 ms spike is invisible inside a 2-second page load — all of these tolerate Wall 2. Transactional workloads (payments, order books, inventory) where a 30 ms tail becomes a user-visible slowdown do not, and prefer B+ trees.
- The workload has spatial locality in writes. If recently written keys are likely to be read soon (which is true for most caching, session, and working-set workloads), the memtable and L0 absorb nearly all reads with no SSTable probes. The LSM's read path is cheapest exactly when the read hits recent data, which is the common case for most applications.
Production LSM deployments at global scale in 2026: Facebook/Meta's entire social graph is on RocksDB. Google's Bigtable (Spanner's bottom layer), Apple's FoundationDB, Cassandra at Netflix (for viewing history, ~10 PB), DynamoDB (AWS's serverless KV), TiDB, CockroachDB, YugabyteDB — all LSMs. The walls are visible to every one of these teams, and every one has mitigation tooling (rate limiters, I/O priority classes, careful compaction scheduling, SSD-local storage). None has switched to B+ trees for their primary storage, because for their workloads the LSM's strengths — write throughput at scale, RAM efficiency, SSD-friendliness — win over its tail latency.
Production B+ tree deployments: Postgres, MySQL (InnoDB), SQLite, SQL Server, Oracle — every major transactional relational database. When the workload is read-mixed-with-write, OLTP-shaped, latency-sensitive, and datasets fit in reasonable hardware, the B+ tree wins. Build 4 develops this engine.
The choice between LSM and B+ tree is not a "which is better" — it is a "which shape of trade-off fits my workload." Build 3 and Build 4 are the two canonical answers.
Common confusions
- "Can't I just give compaction a lower OS priority so it doesn't interfere with reads?" You can, and production systems do. Linux has ionice classes (`ionice -c 3` for idle, `ionice -c 2 -n 7` for best-effort low-priority), and RocksDB's `env->LowerThreadPoolIOPriority()` sets compaction threads to this class. This helps somewhat — the scheduler does defer compaction I/Os when foreground I/Os are queued — but the SSD itself does not honour priority classes. Once the block layer dispatches an I/O to the drive, the drive processes requests in the order it received them. So priority classes reduce but do not eliminate the queue-behind-compaction problem. The real knob is bandwidth-based rate limiting (Going deeper), which throttles compaction's I/O rate directly.
- "Doesn't the OS page cache absorb compaction reads?" Partially. The compactor's input files are often warm from the previous compaction, so the OS page cache can serve compaction's reads without hitting the SSD — the read side is often cheap in practice. The write side, however, is not cacheable: compaction must flush its outputs to disk for durability (else a crash loses data), and those flushes are real I/Os against the SSD's write bandwidth. This asymmetry is why "compaction writes contend; compaction reads partially don't" is a real tuning observation.
- "Universal compaction (Build 3's best compactor) should have solved the write-amp problem. Why are the walls still there?" Universal compaction reduces leveled's write amplification by relaxing the non-overlap invariant, at the cost of higher read amp and transient space amp. But the total I/O budget compaction consumes still scales with the workload's write rate — you have moved the trade-off point on the trilemma triangle, not escaped it. Walls 1 and 3 are about total compaction I/O relative to SSD bandwidth, which universal compaction reduces but does not eliminate.
- "Why not just add more SSDs?" Adding SSDs (one for the WAL, one for SSTables, one for compaction scratch) is a standard production mitigation, and it does help — compaction I/O on a dedicated disk does not contend with foreground I/O on the main disk. It raises the effective bandwidth ceiling by roughly 2×, at the cost of twice the hardware spend and extra operational complexity around per-device failure modes; not every deployment can justify it. And for workloads that saturate even two SSDs, the underlying trade-off is still there.
- "My workload is small. Do I have to worry about any of this?" No. On a 10 GB dataset with 1000 writes/sec, compaction is a few seconds of activity per hour — negligible. The walls become visible at sustained multi-thousand-writes-per-second workloads on datasets that exceed RAM, which is a real operational regime (the one every distributed-database team faces) but not where most small services live. If your workload fits the walls' preconditions, worry about them; if it doesn't, an LSM like RocksDB is a near-perfect default.
- "Does the B+ tree have tail-latency issues of its own?" Yes, different ones. Page splits cascade up the tree when a leaf is full, causing a brief write-amplification burst (sometimes called a "cascade split"). Checkpointing — the process of writing dirty pages from the buffer pool to disk — is a periodic background I/O event that does contend with foreground traffic and can cause small tail spikes (typically 3–10× p99 over p50, versus the LSM's 50×). The B+ tree has milder operational walls, not zero walls. Build 4 names them explicitly.
Going deeper
The main text gave you the three walls and the LSM-versus-B-plus-tree contrast. This section goes a level down into the production mitigations that every serious LSM deployment runs — RocksDB's rate limiter, I/O priority classes, sub-compactions, and the newer SSD-specific write-stall patterns.
RocksDB's rate limiter
The RocksDB::RateLimiter is a token-bucket-shaped throttle that caps compaction and flush I/O at a configured bytes-per-second rate. Each compaction thread, before issuing a write, requests tokens proportional to the write size; if the bucket is empty the thread sleeps until tokens refill. The token refill rate is the configured bandwidth cap — say, 100 MB/s — which means compaction cannot exceed that on any SSD.
```cpp
// RocksDB: cap compaction and flush I/O at 100 MB/s.
// NewGenericRateLimiter returns a raw RateLimiter*; Options::rate_limiter
// is a std::shared_ptr, so wrap the pointer rather than assigning directly.
rocksdb::Options options;
options.rate_limiter.reset(
    rocksdb::NewGenericRateLimiter(100 * 1024 * 1024));  // bytes/s
```
The cap is usually set at 2–4× the sustained user write rate, which gives compaction headroom to catch up during bursts but prevents it from monopolising the SSD. Get the cap wrong in either direction and you hit a wall: too low and compaction falls behind, L0 grows, Wall 3 fires; too high and foreground I/O starves, Wall 2 fires. The cap is the single most impactful operational knob in a RocksDB deployment.
Advanced rate limiters are auto-tuning — they watch L0 file count and adjust the cap dynamically. When L0 is below its target, the rate limiter is strict (compaction is slow, foreground gets the bandwidth); when L0 is growing, the rate limiter relaxes (compaction speeds up to drain L0). RocksDB's kAutoTuned mode does this with a PID-style control loop.
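The token-bucket shape described above fits in a few lines. A minimal single-threaded sketch (Python; the class and method names are illustrative, not RocksDB's API, and a real implementation refills on a timer and serialises requests across threads):

```python
import time

class RateLimiter:
    """Token-bucket throttle: callers request tokens before issuing I/O."""
    def __init__(self, bytes_per_sec):
        self.rate = bytes_per_sec            # refill rate == bucket capacity
        self.tokens = float(bytes_per_sec)   # start with one second of budget
        self.last_refill = time.monotonic()

    def request(self, nbytes):
        assert nbytes <= self.rate, "a single request must fit the bucket"
        while True:
            now = time.monotonic()
            self.tokens = min(float(self.rate),
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes        # grant: caller may issue the I/O
                return
            deficit = nbytes - self.tokens
            time.sleep(deficit / self.rate)  # wait for enough tokens to refill

limiter = RateLimiter(100 * 1024 * 1024)     # cap compaction at 100 MB/s
limiter.request(1024 * 1024)                 # a 1 MB compaction write: granted
```

A compaction thread that calls `request()` before every write can never exceed the configured bytes-per-second rate averaged over any window longer than the bucket's one-second capacity.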
I/O priority classes
The Linux block layer has I/O priority classes that the kernel uses to order requests in the dispatch queue:
- `IOPRIO_CLASS_RT`: real-time. Almost nothing uses this.
- `IOPRIO_CLASS_BE`: best-effort, the default, with eight priority levels 0–7.
- `IOPRIO_CLASS_IDLE`: only dispatched when no other class has pending I/Os.
Setting compaction threads to IOPRIO_CLASS_IDLE makes them yield to any foreground I/O — a compaction write is scheduled only when the drive has no user I/O pending. This is aggressive and can starve compaction entirely under heavy read load. More commonly, compaction runs at BE 7 (best-effort, lowest priority) so it shares the queue with foreground I/O but gets served last.
The limitation, already noted: the SSD itself does not honour priority classes. Once the OS dispatches an I/O to the drive, the drive's internal scheduler processes it in the order received. So the ioprio knob is useful for shaping the OS-level queue but does not help when the drive's own queue is saturated.
Sub-compactions
A single compaction of one L1 file against ~10 L2 files is a serial merge. For large files (256 MB inputs), the merge takes seconds. During those seconds, a single thread is driving all the I/O and all the dedup work. That thread is a bottleneck for compaction throughput and a single point of I/O contention.
Sub-compactions (RocksDB's max_subcompactions) split the compaction's key range into N disjoint slices and run each slice on its own thread. The slices share the input files (reading different byte ranges) but produce independent output files. This parallelises the compaction and — crucially — spreads its I/O across multiple queues, which behaves better with multi-queue SSDs (modern NVMe drives have 8–64 submission queues, and multiple threads can use them simultaneously).
Typical sub-compaction settings: 4–8 threads per compaction, which cuts wall-clock compaction time in half or better. The cost is a 4–8× higher instantaneous peak I/O rate during the compaction, which can worsen the Wall 2 spike if the rate limiter is not set carefully. Modern RocksDB deployments enable sub-compactions together with an auto-tuning rate limiter so the total I/O stays bounded even as parallelism rises.
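Slicing the compaction's key range into disjoint sub-ranges is the core of the trick. A sketch (Python; real RocksDB picks cut points from index-block statistics so slices carry roughly equal bytes — this illustrative version just splits a sorted list of candidate boundary keys):

```python
def subcompaction_slices(boundary_keys, n):
    """Split a sorted list of candidate boundary keys into n disjoint
    (start, end) slices; each slice can be merged on its own thread."""
    step = max(1, len(boundary_keys) // n)
    cuts = boundary_keys[step::step][: n - 1]   # n-1 interior cut points
    edges = [None] + cuts + [None]              # None = open-ended range
    return list(zip(edges, edges[1:]))

keys = ["a", "d", "g", "k", "p", "t", "w", "z"]
print(subcompaction_slices(keys, 4))
# [(None, 'g'), ('g', 'p'), ('p', 'w'), ('w', None)]
```

Because the slices never overlap, the per-slice merges need no coordination: each produces its own output files, and the manifest installs them all in one atomic edit.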
SSD write stalls
A surprising production issue is that SSDs themselves can stall. Modern SSDs use garbage collection internally — they erase large blocks (~256 KB) before they can be rewritten, and GC runs when the drive's free-block pool is low. During SSD GC the drive's write latency can spike from 100 µs to 10 ms or more, independent of what the LSM is doing.
Worse, SSD GC is correlated with LSM compaction: compaction's sustained high write rate eats through the SSD's free-block pool, triggering GC, which then slows down compaction, which makes L0 grow, which triggers LSM write stalls (Wall 3). The two stall types — LSM stall and SSD stall — can feed each other, and the cascade is visible in production as bimodal latency distributions where a small fraction of writes take 100+ ms.
Mitigations: enterprise SSDs with larger GC-reserve pools, write-amplification-aware storage layers that migrate cold data to cheaper storage (Meta's XDP layer does this), and NVMe ZNS (Zoned Namespace) drives that expose the SSD's internal zones directly to the LSM, allowing the LSM to align its SSTable file sizes with the drive's zone size and avoid triggering GC. ZNS is a 2022-era technology and is increasingly used by Percona-tuned RocksDB deployments [5].
Separation of logs and SSTables
A deployment pattern common in production: put the WAL on one SSD and the SSTables on another. The WAL's I/O pattern is sequential-append, which is the SSD's best case (high throughput, low amplification). The SSTable I/O pattern is mixed — reads for foreground, reads+writes for compaction. Separating them:
- Gives the WAL a dedicated SSD with low latency and no compaction contention. WAL fsyncs are fast and predictable — helpful for write-path p99.
- Halves the effective bandwidth contention on the SSTable SSD (no WAL traffic competing).
- Costs: two SSDs instead of one, plus operational complexity around per-device failure modes.
Cassandra, MongoDB, and Postgres all support separate WAL/data disks, and at scale every one of them recommends it.
Where this leads next
You have Build 3 in your head. A full LSM engine, built from first principles over ten chapters, with snapshots and universal compaction and the trilemma you can tune inside of. And you have the three walls — the operational ceilings that show up under sustained load and that no amount of LSM tuning fully dissolves.
Build 4 — B+ trees answers the walls from a different architectural starting point. Instead of "write appends then compact them into position," the B+ tree writes updates directly to the position they should live at. The engine has:
- No compaction. Walls 1, 2, and 3 evaporate. The background has no I/O burden to compete with the foreground.
- In-place updates. A put finds the leaf page for the key and modifies it; write amplification drops to ~3×.
- Smooth latency. p99 is a small multiple of p50 — usually 3–5×, not 15–50×.
- Range queries for free. Leaf pages are linked left-to-right by sibling pointers; a range scan is a binary search to the start key followed by a leaf-to-leaf walk. No merging across levels.
But the B+ tree takes on its own costs: random page writes (worse sequential-I/O efficiency than the LSM), buffer pool management (careful LRU eviction, dirty-page tracking, checkpointing), and a far more delicate crash-recovery protocol because in-place updates break if interrupted mid-write. Build 4 spends ten chapters developing all of this.
The next chapter — Pages: the fundamental unit of disk I/O — opens Build 4 by naming the shift in unit of work. An LSM thinks in SSTables (~64 MB files, sorted, immutable). A B+ tree thinks in pages (~4–16 KB fixed-size blocks, mutable, indexed by page-id). That one change in the unit — from file to page, from append-only to in-place — is the whole architectural pivot. The chapter derives the page size from the hardware (SSD page size, DRAM cache line, filesystem block), shows how the buffer pool becomes the central data structure of the engine, and begins laying the foundation for the tree-of-pages structure that Build 4 will build up to a full relational-grade storage engine.
Close the hood on Build 3. It is a real, production-shaped LSM — the architecture of LevelDB, RocksDB, Cassandra, and half the modern data infrastructure. The walls are known, the mitigations are known, and for write-heavy workloads with datasets much larger than RAM, it is still the engine of choice.
Now turn the page. Build 4 starts with one idea: what if every I/O the engine does is one page, and the engine's main job is to decide which pages to keep in RAM?
References
- Siying Dong, Mark Callaghan, Leonidas Galanis, Dhruba Borthakur, Tony Savor, and Michael Strum, Optimizing Space Amplification in RocksDB (CIDR 2017) — measures compaction I/O contention and foreground throughput collapse on production RocksDB; introduces the rate limiter analysis used in this chapter.
- Apache Cassandra, Read and Write Latency Under Compaction — Cassandra's operational documentation on STCS p99 spikes and mitigations, including throughput throttles and compaction schedulers.
- Avi Kivity et al., ScyllaDB's Shard-per-Core Architecture — how per-core shards with independent compaction queues cut p99 tail latency 3–5× relative to a shared-compactor LSM.
- Zhichao Cao, Siying Dong, Sagar Vemuri, and David H.C. Du, Characterising, Modelling, and Benchmarking RocksDB Key-Value Workloads at Facebook (FAST 2020) — 14-day trace of Meta's production RocksDB fleet; quantifies the fraction of p99 spikes caused by compaction.
- Matias Bjørling et al., ZNS: Avoiding the Block Interface Tax for Flash-based SSDs (USENIX ATC 2021) — NVMe Zoned Namespace design, and how LSMs align SSTable sizes to SSD zones to eliminate internal GC-induced write stalls.
- Mark Callaghan, Write Amplification, Space Amplification, and Read Amplification (Small Datum blog, 2015) — the canonical practitioner essay on the LSM trilemma, with production-derived amplification numbers for RocksDB, WiredTiger, and InnoDB.