In short

Build 3 is finished. Your LSM has a memtable, sorted SSTables, Bloom filters, leveled and tiered compaction, snapshots, and a merging iterator — a real engine. On a clean benchmark it hums. Then you put it under a 20 000-writes-per-second load for an hour, plot the read-latency histogram, and see the graph that defines an LSM's reputation in production: a p50 under a millisecond, a p95 around two milliseconds, and a p99 that spikes to tens of milliseconds every few minutes. The spikes are not random — they line up with compaction passes. This chapter names the three walls of Build 3 that are not about what the engine can do but about how it behaves under sustained load. Wall 1 — compaction I/O contends with foreground I/O: leveled compaction writes 10–30× more bytes than the user does (chapter 17), all to the same SSD that reads and writes use, and a sustained 200 MB/s compaction leaves only half of a 400 MB/s SSD for your foreground traffic. Wall 2 — tail latency: a read issued during a compaction pass waits behind the compaction's queued I/Os; p99 balloons from ~1 ms to 20–40 ms; p99.9 can hit 100 ms. Wall 3 — write stalls: if the workload sustains writes faster than compaction can drain L0, the engine starts throttling and eventually stops accepting writes entirely (RocksDB's level0_stop_writes_trigger). Three walls, one root cause: compaction is a background process competing with the foreground for the same scarce resource (SSD bandwidth), and the LSM design has baked that contention into its architecture. This chapter draws the spike profile, quantifies each wall, contrasts with a B+ tree's much smoother latency curve, and hands the baton to Build 4. It also — importantly — names when LSM is still the right choice: write-heavy, SSD-friendly, dataset much larger than RAM. The walls are real, but so are the strengths.

You have finished Build 3. Ten chapters of it. The engine in your head right now looks like: a memtable absorbs writes, flushes produce SSTables into L0, leveled compaction drains L0 down through a pyramid of levels, Bloom filters keep reads from drowning in SSTable probes, snapshots let long-running scans coexist with ongoing compactions, and universal compaction gives you three tuning knobs for the write/read/space trilemma.

Spin up a benchmark. 64-byte keys, 256-byte values, a read-write mix of 80/20 — close to the read-mostly "YCSB B" shape, which is the closest thing the industry has to a standard social-media workload. On your laptop SSD, with a cold cache, you see:

  p50 read latency:   380 µs
  p95 read latency:  1.2 ms
  p99 read latency:  2.5 ms
  throughput:        38 000 ops/sec

Good numbers. You commit, push, and your engineering manager nods approvingly.

Now run the same benchmark for an hour. Same workload, same hardware. Plot the p99 latency not as a single number but as a time series — p99 over each 10-second window. The graph has spikes.

[Figure: p99 read latency over one hour of steady 80/20 YCSB load. A time-series chart, time on the x-axis (0 to 60 minutes), p99 latency on the y-axis (0 to 50 ms). A baseline line wiggles between 2 and 4 ms; five shaded bands, each a few minutes wide and labelled with the compaction pass that caused it (L1→L2, L0→L1, L2→L3, L1→L2, L0→L1), mark windows where the line jumps to 20–45 ms before falling back. A dashed horizontal line at 10 ms marks the SLO target, and the latency crosses it during every spike.]
The hallmark latency graph of an LSM under sustained load. Baseline p99 sits comfortably around 2–4 milliseconds. Every few minutes a compaction pass fires and p99 spikes to 20–45 ms, easily overshooting a typical 10 ms SLO. The engine is not broken — the spikes are compaction contending with foreground I/O for the SSD, exactly as designed. This is Wall 2 in miniature.

The spikes are not noise, not flaky hardware, not a bug in your code. They are the designed behaviour of an LSM under load. Compaction is a background process — but "background" does not mean "free." The compactor reads SSTables and writes new ones, and those reads and writes go to the same SSD your foreground traffic uses. When a compaction pass fires, it reserves a chunk of the SSD's I/O bandwidth for itself, and every read or write that happens to land during that window waits in the OS's block-layer queue behind the compaction's in-flight I/Os.

This chapter names three walls. Not functional walls like Build 2's two — those were about things Bitcask couldn't do at all. These are operational walls: things the LSM can do, but pays for in latency, throughput, or smoothness every time it does them. Stare at them long enough and the motivation for Build 4 (B+ trees, in-place updates, no compaction) becomes obvious.

Wall 1 — compaction competes with foreground writes for I/O bandwidth

Your SSD has a bandwidth number on its datasheet. Call it 500 MB/s for a mid-range NVMe drive in 2026. That is the total read-plus-write bandwidth the drive can sustain at saturation, across all queues, shared by every process on the machine.

The LSM's foreground write path is tiny. A 20 000-writes-per-second workload at 256-byte values pushes only 20 000 × 256 = 5 MB/s of user data. Plus WAL writes (same 5 MB/s), plus the occasional memtable flush (~5 MB/s averaged). Call it 15 MB/s of foreground write traffic. 3% of the SSD's bandwidth.

Now add compaction. Leveled compaction's write amplification is ~10–30× in practice (see the leveled compaction derivation). At 15× amp, 5 MB/s of user writes produce 75 MB/s of compaction writes. Plus the compaction reads — to merge, the compactor must read the input SSTables, which is another ~75 MB/s (roughly equal to the write side, since compaction rewrites what it reads plus dedup deltas). Total compaction I/O: ~150 MB/s, 30% of your SSD's bandwidth, sustained.
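The budget arithmetic above generalises. A small sketch (an illustrative helper, not part of the engine) computes the split for any user write rate and amplification factor, assuming, as the text does, that WAL and flush each roughly mirror the user write rate and that compaction reads roughly equal compaction writes:

```cpp
#include <cassert>

// Illustrative helper for the Wall-1 budget: user write rate in, bandwidth
// split out. Assumes WAL and flush each roughly mirror the user write rate,
// and compaction reads roughly equal compaction writes, as in the text.
struct BandwidthBudget {
    double foreground_mbps;   // user writes + WAL + flush
    double compaction_mbps;   // compaction reads + compaction writes
    double headroom_mbps;     // left over for foreground reads and bursts
};

BandwidthBudget estimate_budget(double user_write_mbps, double write_amp,
                                double ssd_total_mbps) {
    double foreground        = 3.0 * user_write_mbps;        // user + WAL + flush
    double compaction_writes = user_write_mbps * write_amp;
    double compaction_reads  = compaction_writes;            // merge inputs
    double compaction        = compaction_writes + compaction_reads;
    return {foreground, compaction, ssd_total_mbps - foreground - compaction};
}
```

With the chapter's numbers (5 MB/s of user writes, 15× amplification, a 500 MB/s drive) it reproduces the 15 / 150 / 335 MB/s split.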

Why the 15× compaction-to-user-write ratio hits SSD bandwidth so hard: the SSD does not care whether the bytes are "user" or "compaction." Every block write counts against the same bandwidth budget, and the drive enforces one shared read-plus-write ceiling — if compaction already uses 40% of the bandwidth, foreground I/O competes for the remaining 60%. Compaction's 150 MB/s of sustained I/O displaces 150 MB/s of potential foreground traffic, one-for-one.

Plot it.

[Figure: SSD 500 MB/s bandwidth budget, two workload regimes. A horizontal stacked bar splits the 500 MB/s total into 15 MB/s of foreground writes (user + WAL), 75 MB/s of compaction writes at 15× amplification, 75 MB/s of compaction reads (merge inputs), and 335 MB/s of headroom for foreground reads and bursts. A second bar shows the same split at 50× amplification, a pathological workload: compaction consumes the full 500 MB/s, foreground writes stall, and the bar is labelled "SATURATED — writes must stall".]
Two regimes of SSD bandwidth sharing. At a healthy 15× write amplification, compaction consumes 30% of the drive; 335 MB/s of headroom remains for foreground reads and bursts. If the workload or configuration pushes write amplification up toward 50× — a long tail of overwrites, deep levels, poorly-tuned compaction — the drive saturates: compaction consumes everything, writes stall, and the engine enters backpressure. Wall 1 is the shape of this budget.

Compaction is not competing with reads and writes in the sense of holding a lock; it is competing for the one resource nobody can timeshare cleanly — SSD I/O. The Linux block layer gives every I/O request a queue position, and a compaction that has 32 outstanding writes to the SSD pushes every foreground read to position 33.

The RocksDB team measured this directly. In their CIDR 2017 paper [1], they showed that a workload with 40 MB/s of user writes on a 500 MB/s SSD saw foreground read throughput collapse from 80 000 reads/sec to 25 000 reads/sec the moment a leveled L2→L3 compaction kicked in — a 3× read-throughput reduction driven entirely by I/O contention. Their fix was the rate limiter (see Going deeper), which caps compaction's bandwidth at a configured fraction of the SSD's total — but the fix just moves the trade-off; it does not dissolve it. Limit compaction's bandwidth and it falls behind, L0 grows, and you hit Wall 3.

Wall 2 — tail latency: a background compaction stall stalls a user read

Wall 1 is about average throughput. Wall 2 is about the tails — the p99 and p99.9 latencies that turn a benchmark-friendly engine into a production horror story.

Here is what happens, step by step, when a read arrives during a compaction pass.

  1. The read reaches the read path and starts checking SSTables (memtable → L0 → L1 → ... in newest-first order).
  2. The read hits a Bloom-filter-positive in some SSTable at L3 and needs to load a 16 KB block from the SSD.
  3. It issues a pread() to the kernel.
  4. The kernel's block layer sees the compaction's I/O queue — say 32 writes and 16 reads in flight for the compactor thread — and queues the user's pread() behind them.
  5. The SSD processes the compaction's I/Os in order. Each takes ~100 µs. The user read waits ~32 × 100 µs = 3.2 ms in queue before its I/O even starts.
  6. The user read's I/O completes. Total latency: ~3.2 ms queue + ~100 µs SSD + ~50 µs software overhead ≈ 3.35 ms. This single read has eaten your millisecond-SLO budget.
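The queue arithmetic in steps 4–6 is worth making explicit. A toy model, using the walkthrough's assumed numbers rather than measurements:

```cpp
#include <cassert>
#include <cmath>

// Toy model of steps 4-6: a user read queued behind a compactor's in-flight
// I/Os. All numbers are the walkthrough's assumptions, not measurements.
double read_latency_ms(int queued_ios, double per_io_us,
                       double ssd_us, double software_us) {
    double queue_wait_us = queued_ios * per_io_us;  // wait behind compaction I/Os
    return (queue_wait_us + ssd_us + software_us) / 1000.0;
}
```

With 32 queued compaction I/Os at ~100 µs each, plus ~100 µs of device time and ~50 µs of software overhead, this gives the ~3.35 ms of step 6.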

Zoom out. At p50 a typical read hits the memtable or block cache and is a 0.3 ms operation. At p99, during a compaction pass, the same read can see 30+ ms. That is a 100× tail-to-median ratio, and it is the signature of an LSM under load.

Why p50 is unaffected but p99 explodes: compaction passes are rare but long. Most reads happen when no compaction is active, so p50 and p95 look clean. A small fraction of reads — the ones that happen to land during a compaction — pay the queue penalty, and that fraction lands at the tail of the histogram. If compactions fire every 10 minutes and each lasts 30 seconds, then 30/(10 × 60) = 5% of reads overlap a compaction — exactly the tail where p99 and p99.9 live.
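That overlap fraction is the whole story of the tail. As a sketch:

```cpp
#include <cassert>
#include <cmath>

// Fraction of reads that land inside a compaction window, assuming reads
// arrive uniformly: duration of a pass divided by the period between passes.
double compaction_overlap_fraction(double pass_duration_s, double pass_period_s) {
    return pass_duration_s / pass_period_s;
}
```

Thirty-second passes every 10 minutes put 5% of reads inside a pass, which is why the damage shows up at p95 and beyond rather than at the median.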

Quantify it across production-tier engines. The numbers below come from published benchmarks (RocksDB [1], Cassandra [2], ScyllaDB [3]) on NVMe SSDs with sustained write-heavy workloads.

  Engine                              p50 read   p99 read   p99.9 read   p50:p99 ratio
  RocksDB (leveled, default)           ~0.3 ms    ~15 ms     ~60 ms       50×
  RocksDB (leveled, rate-limited)      ~0.4 ms    ~6 ms      ~25 ms       15×
  Cassandra (STCS tiered)              ~0.5 ms    ~30 ms     ~120 ms      60×
  ScyllaDB (DTCS / shard-per-core)     ~0.2 ms    ~4 ms      ~15 ms       20×
  Postgres (B+ tree, for reference)    ~0.3 ms    ~1.5 ms    ~4 ms        5×

Two things stand out. First, every LSM entry has a p50:p99 ratio of 15× or worse; the B+ tree (Postgres) sits at 5×. Second, mitigations help but don't dissolve the wall — rate limiting cuts RocksDB's p99 from 15 ms to 6 ms (a 2.5× improvement) at the cost of ~30% of the throughput headroom. ScyllaDB's shard-per-core design (one thread per CPU, one compaction queue per shard) avoids cross-shard contention and brings p99 down to 4 ms — impressive, but still roughly 3× worse than the B+ tree.

The FAST 2020 paper "Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook" [4] traced 14 days of Facebook's production RocksDB fleet and found that compaction caused 60–85% of all p99 read latency spikes above 10 ms. The engineers' mitigations — rate limiting, I/O priority classes, sub-compactions, separate SSDs for logs and data — reduced the spikes but could not eliminate them. Compaction contention is not a bug you can fix; it is a property of the design.

Wall 3 — write stalls and back-pressure when compaction can't keep up

Wall 1 is average throughput; Wall 2 is tail latency; Wall 3 is the nightmare scenario — the engine stops accepting writes entirely.

Leveled compaction has a hard invariant: L0 file count must stay below some threshold (default 4). If flushes keep arriving faster than L0→L1 compactions can drain L0, the file count rises. At 20 files, RocksDB starts throttling writes — artificially sleeping each put() for a few hundred microseconds to give compaction time to catch up. At 36 files, RocksDB stops accepting writes entirely until L0 falls below the threshold. This is the level0_slowdown_writes_trigger / level0_stop_writes_trigger pair from the leveled compaction chapter.
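In RocksDB these thresholds are ordinary options. A configuration sketch using RocksDB's real option names, with the values cited above (the wrapper function is illustrative):

```cpp
#include <rocksdb/options.h>

rocksdb::Options MakeL0TriggerOptions() {
  rocksdb::Options options;
  // L0 -> L1 compaction is scheduled once L0 holds this many files.
  options.level0_file_num_compaction_trigger = 4;
  // Past 20 L0 files, each write gets an artificial sleep...
  options.level0_slowdown_writes_trigger = 20;
  // ...and at 36 files writes block until compaction drains L0.
  options.level0_stop_writes_trigger = 36;
  return options;
}
```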

The write path, from the user's perspective, looks like this in the bad case:

  t=0 ms:     put("k", "v")      →  OK, 0.1 ms
  t=1 ms:     put("k2", "v2")    →  OK, 0.1 ms
  ...
  t=30 s:     put("kN", "vN")    →  OK, 0.1 ms
  t=30.1 s:   put("kN+1", "vN+1") → OK, 0.5 ms       (throttle: +400 µs sleep)
  t=31.0 s:   put("kN+2", "vN+2") → OK, 1.5 ms       (throttle: +1.4 ms sleep)
  t=32.0 s:   put("kN+3", "vN+3") → OK, 5 ms         (throttle: +4.9 ms sleep)
  t=33.0 s:   put("kN+4", "vN+4") → BLOCKED — 800 ms (stop-writes trigger)
  t=33.8 s:   put("kN+4", "vN+4") → OK, 0.1 ms       (compaction drained L0)

To a caller that expected sub-millisecond writes, an 800 ms block is catastrophic. Request timeouts fire, queues back up, retries amplify load, and the cascading-failure pattern that every postmortem ends up describing begins.

[Figure: L0 file count → throttle → stall → recovery. A time series from 0 to 5 minutes with L0 file count (0 to 40) on the y-axis. During the first healthy minute a sawtooth oscillates between 2 and 5 files as flushes drain faster than they arrive; a sustained load beginning at 1 min drives the count up through the slowdown trigger (dashed line at 20) and the stop-writes trigger (dashed line at 36) at around 3 minutes, where the line flattens while writes are blocked, then drops sharply back to 4 as a burst compaction drains L0. A second track below shows the user's write latency on a log scale: sub-millisecond in the healthy zone, milliseconds in the throttled zone, and an 800 ms block in the stopped zone.]
The write-stall cascade. During the first minute of healthy load, flushes and compactions stay in balance and L0 oscillates between 2 and 5 files. A sustained load burst beginning at t=1 min drives L0 past the slowdown trigger at 20, where each write gets an artificial sleep, then past the stop trigger at 36, where writes block entirely. Below: the corresponding user-visible write latency, jumping from 0.1 ms to 5 ms to 800 ms as the cascade progresses. Recovery begins when a burst compaction drains L0 back below the threshold — but the damage (timeouts, retries, upstream queue backup) is already done.

Three remarks on Wall 3.

Stall is correct behaviour. The alternative is worse — if L0 is allowed to grow unboundedly, reads get unboundedly slow (every read must probe every L0 file), Bloom filters exhaust RAM, and eventually a single key lookup costs tens of seconds. Stopping writes until L0 drains is the designed response; the designers chose "reject writes temporarily" over "degrade reads permanently." That is the right call, but the user feels it as an outage.

Stall is preventable — at a cost. Tune compaction to keep up with any realistic burst (more threads, higher bandwidth allocation, deeper L0 threshold). Each knob you turn eats into foreground capacity; you are rebalancing the Wall-1 vs Wall-3 trade-off, not escaping it.

Stall tells you your capacity ceiling. If your workload sustains enough writes to cause stalls, you have discovered your engine's actual write-throughput ceiling — it is the rate at which the compactor can drain L0, minus a safety margin. RocksDB's community documentation says a reasonable rule of thumb is: sustained write throughput = 0.2 × SSD write bandwidth / write amplification. On a 500 MB/s SSD with 15× amp: 0.2 × 500 / 15 = 6.7 MB/s of sustained user writes. Peak bursts can go higher (the memtable absorbs them), but sustained any higher than that and you eventually stall.
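The rule of thumb is one line of arithmetic. As a sketch:

```cpp
#include <cassert>

// Rule-of-thumb ceiling from the text: sustained user write throughput
// = 0.2 x SSD write bandwidth / write amplification.
double sustained_write_ceiling_mbps(double ssd_write_mbps, double write_amp) {
    return 0.2 * ssd_write_mbps / write_amp;
}
```

On a 500 MB/s drive at 15× amplification this comes out to roughly 6.7 MB/s of sustained user writes, matching the worked number above.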

Why this motivates an in-place engine

Stare at the three walls together and a hypothesis forms. All three are caused by the same thing: compaction. Remove compaction, and all three walls dissolve.

But you cannot just "remove compaction" from an LSM — compaction is what turns the log-structured appends into a queryable sorted structure. The design requires it. To remove compaction you have to change the design: don't write data out-of-place and compact it into position, write it in-place to begin with.

That is exactly what a B+ tree does.

[Figure: LSM versus B+ tree write paths, two panels. Left, "LSM — out-of-place, compact later": a memtable receives a put, flushes to an SSTable in L0, and arrows show compactions merging L0 into L1 and L1 into L2; write amplification ≈ 15×, with the three walls as a consequence. Right, "B+ tree — in-place, write once": a root, two branch pages, and four leaf pages; a put walks root, branch, leaf and modifies one leaf page in place, with a WAL entry written first; write amplification ≈ 3× (leaf + WAL + splits), no compaction, no Walls 1–3.]
Out-of-place versus in-place. The LSM writes every put to the memtable, flushes it to a fresh SSTable, then spends the next minutes or hours rewriting it through compaction — a 15× amplification for a 1× logical operation. The B+ tree walks the tree to find the leaf page that should hold the key and modifies that page in place. One page write for the leaf, one sequential append for the WAL — roughly 3× amplification, and no background compaction process at all. Build 4 develops this path.

A B+ tree put writes exactly two things: a WAL record (sequential append, cheap), and a modified leaf page (one random 4 KB page write). Occasionally a page splits, cascading a few more page writes up the tree. Total amplification: ~2–5×. No compaction. No Wall 1, no Wall 2, no Wall 3.

What the B+ tree pays for this is different. Every page write is a random write (the leaf to be modified lives at some arbitrary SSD location, not at the tail of an append-only log), so the raw I/O pattern is less SSD-friendly. Every read must traverse the tree from root to leaf — 3–5 page reads instead of a single memtable probe + 1 SSTable block. And in-place updates make crash recovery harder: a torn page write (half the page old, half new) can corrupt the index if not protected by a careful WAL protocol (ARIES, Build 4 chapters 7–8).

These are real costs. But they are smooth costs — a B+ tree's p99 is a small multiple of its p50 because there is no bursty background process. Contrast with the LSM's 50× p99/p50 ratio under load. Different workloads, different winners. That is the whole point of Build 4 existing: a second engine family with a genuinely different latency profile.

When LSM is still the right choice

The three walls are real. None of them kills the LSM — they constrain it. For a large and important class of workloads, the LSM's strengths outweigh them by a wide margin.

Use LSM when:

  1. The workload is write-heavy. If writes outnumber reads by more than 5:1, the LSM's append-mostly write path (no in-place random writes, no page splits) dominates any B+ tree. Time-series databases, event logs, telemetry, write-ahead fact tables — all of these are natural LSM territory. InfluxDB, Prometheus, and most of the observability stack are LSMs for this reason.

  2. The dataset is much larger than RAM. LSM's sparse index (one entry per SSTable block, ~4 KB of index per MB of data) is far more RAM-efficient than a B+ tree's buffer pool. A 10 TB LSM needs ~40 GB of sparse-index RAM; a 10 TB B+ tree with a 10% hot set needs ~1 TB of buffer pool to avoid disk reads on every request. For very large datasets on a single node, the LSM wins on RAM economy even when its latency tail is worse.

  3. You can pay for SSDs (not HDDs). Compaction is a sequential-read, sequential-write workload, and SSDs deliver 500–3000 MB/s of sequential I/O with sub-millisecond block latency. On HDDs (100 MB/s sequential, 10 ms seek), compaction is far more painful — a random-access B+ tree with a strong buffer pool can outperform an LSM on spinning rust. LSM is the SSD-era design; B+ tree predates SSDs by decades and works on both.

  4. The workload tolerates p99 spikes. Analytics pipelines, bulk loaders, user-facing feeds where a 30 ms spike is invisible inside a 2-second page load — all of these tolerate Wall 2. Transactional workloads (payments, order books, inventory) where a 30 ms tail becomes a user-visible slowdown do not, and prefer B+ trees.

  5. The workload has spatial locality in writes. If recently written keys are likely to be read soon (which is true for most caching, session, and working-set workloads), the memtable and L0 absorb nearly all reads with no SSTable probes. The LSM's read path is cheapest exactly when the read hits recent data, which is the common case for most applications.
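The RAM-economy claim in point 2 is unit arithmetic. A sketch (at ~4 KB of index per MB, 10 TB of data carries about 40 GB of sparse index, small next to a terabyte-scale buffer pool):

```cpp
#include <cassert>

// Unit arithmetic for the sparse-index claim: index RAM in GB for a dataset
// of `dataset_tb` terabytes at `index_kb_per_mb` KB of index per MB of data.
double lsm_index_ram_gb(double dataset_tb, double index_kb_per_mb) {
    double dataset_mb = dataset_tb * 1024.0 * 1024.0;    // TB -> MB
    double index_kb   = dataset_mb * index_kb_per_mb;
    return index_kb / (1024.0 * 1024.0);                 // KB -> GB
}
```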

Production LSM deployments at global scale in 2026: Facebook/Meta's entire social graph is on RocksDB. Google's Bigtable, Cassandra at Netflix (for viewing history, ~10 PB), DynamoDB (AWS's serverless KV), TiDB, CockroachDB, YugabyteDB — all LSMs. The walls are visible to every one of these teams, and every one has mitigation tooling (rate limiters, I/O priority classes, careful compaction scheduling, SSD-local storage). None has switched to B+ trees for their primary storage, because for their workloads the LSM's strengths — write throughput at scale, RAM efficiency, SSD-friendliness — win over its tail latency.

Production B+ tree deployments: Postgres, MySQL (InnoDB), SQLite, SQL Server, Oracle — every major transactional relational database. When the workload is read-mixed-with-write, OLTP-shaped, latency-sensitive, and datasets fit in reasonable hardware, the B+ tree wins. Build 4 develops this engine.

The choice between LSM and B+ tree is not a "which is better" — it is a "which shape of trade-off fits my workload." Build 3 and Build 4 are the two canonical answers.

Common confusions

Going deeper

The main text gave you the three walls and the LSM-versus-B-plus-tree contrast. This section goes a level down into the production mitigations that every serious LSM deployment runs — RocksDB's rate limiter, I/O priority classes, sub-compactions, and the newer SSD-specific write-stall patterns.

RocksDB's rate limiter

The rocksdb::RateLimiter is a token-bucket-shaped throttle that caps compaction and flush I/O at a configured bytes-per-second rate. Each compaction thread, before issuing a write, requests tokens proportional to the write size; if the bucket is empty the thread sleeps until tokens refill. The token refill rate is the configured bandwidth cap — say, 100 MB/s — which means compaction cannot exceed that rate on any SSD.

// RocksDB: cap compaction and flush I/O at 100 MB/s.
// NewGenericRateLimiter returns a raw pointer; Options::rate_limiter
// is a shared_ptr, so wrap it rather than assigning directly.
options.rate_limiter.reset(NewGenericRateLimiter(100 * 1024 * 1024));  // bytes/s

The cap is usually set at 2–4× the sustained user write rate, which gives compaction headroom to catch up during bursts but prevents it from monopolising the SSD. Get the cap wrong in either direction and you hit a wall: too low and compaction falls behind, L0 grows, Wall 3 fires; too high and foreground I/O starves, Wall 2 fires. The cap is the single most impactful operational knob in a RocksDB deployment.

Advanced rate limiters are auto-tuning — they adjust the cap dynamically based on how much pressure compaction is under. While compaction is keeping up, the limiter stays strict (compaction is slow, foreground gets the bandwidth); when the backlog grows, the limiter relaxes (compaction speeds up to drain it). RocksDB's kAutoTuned mode does this with a feedback loop driven by how saturated the limiter itself is.
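The token-bucket mechanics described above can be sketched in a few lines (a toy model, not RocksDB's implementation; elapsed time is passed in explicitly to keep the sketch deterministic):

```cpp
#include <algorithm>
#include <cassert>

// Toy token bucket with the shape described above; not RocksDB's code.
// Tokens are bytes; they refill at the configured cap in bytes per second.
class TokenBucket {
    double tokens_;
    double capacity_;
    double rate_;  // refill rate, bytes per second (the configured cap)
public:
    TokenBucket(double rate_bytes_per_sec, double burst_bytes)
        : tokens_(burst_bytes), capacity_(burst_bytes), rate_(rate_bytes_per_sec) {}

    // A compaction thread calls this before issuing `bytes` of I/O.
    // Returns the seconds it must sleep first (0 if tokens are available).
    double request(double bytes, double elapsed_s) {
        tokens_ = std::min(capacity_, tokens_ + elapsed_s * rate_);
        if (tokens_ >= bytes) {
            tokens_ -= bytes;
            return 0.0;
        }
        double deficit = bytes - tokens_;
        tokens_ = 0.0;
        return deficit / rate_;   // time until the deficit refills
    }
};
```

The sleep returned here is exactly where the cap bites: compaction threads spend it idle, and the SSD bandwidth they would have used goes to foreground I/O instead.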

I/O priority classes

The Linux block layer has three I/O priority classes that the kernel's scheduler uses to order requests in the dispatch queue:

  1. IOPRIO_CLASS_RT (real-time): always served before the other classes.
  2. IOPRIO_CLASS_BE (best-effort): the default, with levels 0 (highest) to 7 (lowest).
  3. IOPRIO_CLASS_IDLE: served only when no other I/O is pending.

Setting compaction threads to IOPRIO_CLASS_IDLE makes them yield to any foreground I/O — a compaction write is scheduled only when the drive has no user I/O pending. This is aggressive and can starve compaction entirely under heavy read load. More commonly, compaction runs at BE 7 (best-effort, lowest priority) so it shares the queue with foreground I/O but gets served last.
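The ioprio value the syscall takes packs the class and the within-class level into one integer. A sketch with the constants mirrored from linux/ioprio.h (the helper name is ours; the syscall shown in the comment is the real interface):

```cpp
#include <cassert>

// How Linux packs an I/O priority: class in the top bits, within-class level
// (0-7, meaningful for RT and BE) in the low bits. Constants mirror
// linux/ioprio.h; the helper name is ours.
constexpr int kIoprioClassShift = 13;
constexpr int kIoprioClassRT    = 1;  // real-time
constexpr int kIoprioClassBE    = 2;  // best-effort (default)
constexpr int kIoprioClassIdle  = 3;  // served only when the queue is empty

constexpr int ioprio_value(int io_class, int level) {
    return (io_class << kIoprioClassShift) | level;
}

// A compaction thread would lower its own priority with the real syscall:
//   syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
//           ioprio_value(kIoprioClassBE, 7));   // "BE 7": lowest best-effort
```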

The limitation, already noted: the SSD itself does not honour priority classes. Once the OS dispatches an I/O to the drive, the drive's internal scheduler processes it in the order received. So the ioprio knob is useful for shaping the OS-level queue but does not help when the drive's own queue is saturated.

Sub-compactions

A single compaction of one L1 file against ~10 L2 files is a serial merge. For large files (256 MB inputs), the merge takes seconds. During those seconds, a single thread is driving all the I/O and all the dedup work. That thread is a bottleneck for compaction throughput and a single point of I/O contention.

Sub-compactions (RocksDB's max_subcompactions) split the compaction's key range into N disjoint slices and run each slice on its own thread. The slices share the input files (reading different byte ranges) but produce independent output files. This parallelises the compaction and — crucially — spreads its I/O across multiple queues, which behaves better with multi-queue SSDs (modern NVMe drives have 8–64 submission queues, and multiple threads can use them simultaneously).

Typical sub-compaction settings: 4–8 threads per compaction. Cuts wall-clock compaction time in half or better. Cost: 4–8× the peak I/O instantaneous rate during the compaction, which can worsen the Wall 2 spike if the rate limiter is not set carefully. Modern RocksDB deployments enable sub-compactions with an auto-tuning rate limiter so the total I/O stays bounded even as parallelism rises.
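In RocksDB this is a single option (the option name is RocksDB's real one; the wrapper function is illustrative):

```cpp
#include <rocksdb/options.h>

rocksdb::Options MakeSubcompactionOptions() {
  rocksdb::Options options;
  // Split each compaction's key range into up to 4 disjoint slices,
  // each merged by its own thread against the shared input files.
  options.max_subcompactions = 4;
  return options;
}
```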

SSD write stalls

A surprising production issue is that SSDs themselves can stall. Modern SSDs use garbage collection internally — they erase large blocks (~256 KB) before they can be rewritten, and GC runs when the drive's free-block pool is low. During SSD GC the drive's write latency can spike from 100 µs to 10 ms or more, independent of what the LSM is doing.

Worse, SSD GC is correlated with LSM compaction: compaction's sustained high write rate eats through the SSD's free-block pool, triggering GC, which then slows down compaction, which makes L0 grow, which triggers LSM write stalls (Wall 3). The two stall types — LSM stall and SSD stall — can feed each other, and the cascade is visible in production as bimodal latency distributions where a small fraction of writes take 100+ ms.

Mitigations: enterprise SSDs with larger GC-reserve pools, write-amplification-aware storage layers that migrate cold data to cheaper storage (Meta's XDP layer does this), and NVMe ZNS (Zoned Namespace) drives that expose the SSD's internal zones directly to the LSM, allowing the LSM to align its SSTable file sizes with the drive's zone size and avoid triggering GC. ZNS drives reached the market around 2021 and are increasingly used by ZNS-aware RocksDB deployments [5].

Separation of logs and SSTables

A deployment pattern common in production: put the WAL on one SSD and the SSTables on another. The WAL's I/O pattern is sequential-append, which is the SSD's best case (high throughput, low amplification). The SSTable I/O pattern is mixed — reads for foreground, reads+writes for compaction. Separating them keeps compaction's mixed traffic off the WAL device, so commit-path fsync latency stays flat even during heavy compaction, and lets each drive run its best-case access pattern.

Cassandra, MongoDB, and Postgres all support separate WAL/data disks, and at scale every one of them recommends it.

Where this leads next

You have Build 3 in your head. A full LSM engine, built from first principles over ten chapters, with snapshots and universal compaction and the trilemma you can tune inside of. And you have the three walls — the operational ceilings that show up under sustained load and that no amount of LSM tuning fully dissolves.

Build 4 — B+ trees answers the walls from a different architectural starting point. Instead of "write appends then compact them into position," the B+ tree writes updates directly to the position they should live at. The engine has no background compaction, no compaction-aligned latency spikes, and no L0 backlog to stall on — the three walls dissolve by construction.

But the B+ tree takes on its own costs: random page writes (worse sequential-I/O efficiency than the LSM), buffer pool management (careful LRU eviction, dirty-page tracking, checkpointing), and a far more delicate crash-recovery protocol because in-place updates break if interrupted mid-write. Build 4 spends ten chapters developing all of this.

The next chapter — Pages: the fundamental unit of disk I/O — opens Build 4 by naming the shift in unit of work. An LSM thinks in SSTables (~64 MB files, sorted, immutable). A B+ tree thinks in pages (~4–16 KB fixed-size blocks, mutable, indexed by page-id). That one change in the unit — from file to page, from append-only to in-place — is the whole architectural pivot. The chapter derives the page size from the hardware (SSD page size, DRAM cache line, filesystem block), shows how the buffer pool becomes the central data structure of the engine, and begins laying the foundation for the tree-of-pages structure that Build 4 will build up to a full relational-grade storage engine.

Close the hood on Build 3. It is a real, production-shaped LSM — the architecture of LevelDB, RocksDB, Cassandra, and half the modern data infrastructure. The walls are known, the mitigations are known, and for write-heavy workloads with datasets much larger than RAM, it is still the engine of choice.

Now turn the page. Build 4 starts with one idea: what if every I/O the engine does is one page, and the engine's main job is to decide which pages to keep in RAM?

References

  1. Siying Dong, Mark Callaghan, Leonidas Galanis, Dhruba Borthakur, Tony Savor, and Michael Stumm, Optimizing Space Amplification in RocksDB (CIDR 2017) — measures compaction I/O contention and foreground throughput collapse on production RocksDB; introduces the rate limiter analysis used in this chapter.
  2. Apache Cassandra, Read and Write Latency Under Compaction — Cassandra's operational documentation on STCS p99 spikes and mitigations, including throughput throttles and compaction schedulers.
  3. Avi Kivity et al., ScyllaDB's Shard-per-Core Architecture — how per-core shards with independent compaction queues cut p99 tail latency 3–5× relative to a shared-compactor LSM.
  4. Zhichao Cao, Siying Dong, Sagar Vemuri, and David H.C. Du, Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook (FAST 2020) — 14-day trace of Meta's production RocksDB fleet; quantifies the fraction of p99 spikes caused by compaction.
  5. Matias Bjørling et al., ZNS: Avoiding the Block Interface Tax for Flash-based SSDs (USENIX ATC 2021) — NVMe Zoned Namespace design, and how LSMs align SSTable sizes to SSD zones to eliminate internal GC-induced write stalls.
  6. Mark Callaghan, Write Amplification, Space Amplification, and Read Amplification (Small Datum blog, 2015) — the canonical practitioner essay on the LSM trilemma, with production-derived amplification numbers for RocksDB, WiredTiger, and InnoDB.