In short

Two engines, two philosophies. An LSM turns random writes into sequential ones by appending to a memtable, flushing to immutable SSTables, and compacting in the background. It is a write-optimised structure: cheap ingest, expensive merges, p99 spikes every few minutes when a compaction pass fires. A B+ tree updates pages in place, keeps the working set in a buffer pool, and pays write amp only in proportion to tree depth — four to five random page writes per update. It is a read-optimised structure: predictable latency, short tails, but every random write hits the disk (modulo the WAL), and fragmentation over time requires a VACUUM that has its own stalls. Neither dominates. The write/read/space trilemma applies to both, but they sit on different faces of the triangle: LSM lives near the low-write-amp vertex; B-tree lives near the low-read-amp vertex; both pay on space. The decision is workload-first and hardware-second. Write-heavy, dataset >> RAM, SSD, tolerant of occasional 40 ms tail → LSM (RocksDB, Cassandra, ScyllaDB). Read-heavy, mixed OLTP, tight tail-latency SLO, complex secondary indexes and range scans → B+ tree (Postgres, MySQL InnoDB, SQL Server). Time-series with TTL → LSM with tiered or FIFO compaction. This chapter draws the decision surface as a triangle with both engines plotted, gives a flowchart that takes a workload description and drops out an engine choice, and walks three real-world scenarios. It closes with the one consideration that trumps all others — operational familiarity: the engine your team can debug at 3 a.m. is usually the right one, regardless of what the triangle says.

You have read eleven chapters about LSM trees. You are in the middle of six about B+ trees. A voice in the back of your head has been asking the whole time: which one wins?

The answer, like most answers in systems engineering, is that the question is the wrong shape. Nothing wins. Both are mature engines deployed at planet scale — RocksDB runs Facebook, Cassandra runs Netflix, Postgres runs Instagram, MySQL runs half the internet. The question an engineer asks is not "which engine is best" but "which engine does my workload want, on my hardware, at my team's operational maturity".

This chapter is a decision chapter. It presumes you have built intuitions for both engines — the LSM walls, the B-tree split mechanics, the compaction trilemma, the torn-write hazard — and compresses them into a decision surface. You should finish the chapter able to look at a workload sketch and name the right engine in thirty seconds.

Recap: the three amplifications, across both engines

Every storage engine pays three taxes. Name them once more, precisely, because the whole decision hinges on how each engine pays them.

Write amplification is the ratio of bytes written to storage over bytes the application asked to write. An LSM pays it during compaction — the same key gets rewritten every time its level compacts upward. A B-tree pays it at split time, when a full page becomes two half-full pages, and at every page write because a 200-byte row update writes a full 8 or 16 KiB page.

Read amplification is the number of I/O operations a point lookup needs. An LSM pays it per level — check the memtable, then L0's files, then one file per deeper level; Bloom filters prune most of those probes to RAM-only work. A B-tree pays it per tree level — root, internal, internal, leaf; typically three or four pages for a billion-row table, and the top levels are almost always cached.

Space amplification is the ratio of bytes on disk to logical bytes of live data. An LSM pays it until the next major compaction clears obsolete versions; space amp sits at 1.1× (leveled) to 2× (tiered) depending on the policy. A B-tree pays it as fragmentation — pages half-full after random deletes, free space that never gets reclaimed until a full rebuild or a background vacuum.

| Amp   | LSM (leveled)                             | LSM (tiered)                     | B+ tree                              |
|-------|-------------------------------------------|----------------------------------|--------------------------------------|
| Write | 10–30×                                    | 3–5×                             | 3–10× (depth × page-to-row ratio)    |
| Read  | ~6 (one per level, mostly bloomed to RAM) | 20–40 (many SSTables per lookup) | ~4 (one per tree level, cache-heavy) |
| Space | ~1.1×                                     | 1.3–2×                           | 1.2–1.5× (fragmented)                |
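The table's ranges fall out of simple back-of-envelope formulas. A minimal sketch; the constants (page size, row size, size ratio, level count) are illustrative assumptions, not measurements:

```python
def btree_write_amp(page_bytes, row_bytes, updates_per_flush):
    # A dirty page absorbs several row updates before the buffer pool
    # flushes it, so write amp ~= page size / bytes of logical change.
    return page_bytes / (row_bytes * updates_per_flush)

def lsm_leveled_write_amp(size_ratio, levels):
    # Leveled: a byte is rewritten roughly size_ratio times at each
    # level it passes through on its way down.
    return size_ratio * levels

def lsm_tiered_write_amp(levels):
    # Tiered: a byte is written once per level -- merged in, never
    # re-merged within the level.
    return levels

print(btree_write_amp(8192, 200, 4))   # 10.24: top of the 3-10x band
print(lsm_leveled_write_amp(10, 3))    # 30: top of the 10-30x band
print(lsm_tiered_write_amp(4))         # 4: inside the 3-5x band
```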

The first takeaway: the B-tree interleaves with the two LSM policies rather than dominating them (slightly better than leveled on read amp, worse than tiered on write amp, comparable on space). That is why both engines coexist — they are not the same point on the trilemma with different names; they occupy genuinely different regions of the design space.

The trilemma, redrawn with a B-tree point

The Monkey paper's triangle was drawn for LSM compaction strategies. Re-draw it with B-tree on the same plot.

[Figure: the trilemma triangle with B-tree and LSM strategies plotted together. Vertices: "low write amp" (cheap ingest), "low read amp" (few I/Os per lookup), "low space amp" (no redundant bytes). FIFO sits at the write vertex; tiered, universal, and leveled LSM trace a dashed tuning curve toward the read/space edge, which operators can slide along by picking a compaction policy; B+ tree is a single fixed point in the lower-left interior, near but not on the leveled point, annotated "in-place updates, cache-friendly".]
The trilemma triangle with both engine families plotted. LSM strategies (tiered, universal, leveled) trace a curve from the low-write-amp vertex toward the low-read-amp / low-space-amp edge, tunable by operator choice. B+ tree is a single fixed point near leveled LSM but slightly better on read amp (fewer I/Os per lookup, cache always hot) and slightly worse on write amp (every row update writes a whole page). The B-tree does not have a tuning curve — you get what the engine gives you, period.

Two observations worth saying out loud.

B-tree sits near leveled LSM, not on it. The two strategies occupy neighbouring regions. A read-heavy workload that works well on Postgres would also work well on RocksDB with leveled compaction — the numbers come out close. The real differences are operational, not amplification-theoretic.

LSM has a tuning curve; B-tree does not. An LSM operator can slide the working point from tiered through universal to leveled by changing a config. A B-tree operator cannot — the engine's position on the triangle is baked into its data structure. The closest a B-tree gets to tuning is changing page size (8 KiB vs 16 KiB) or fill factor, neither of which meaningfully moves the triangle point.

The comparison table — ten axes

The triangle compresses too much. Here is the real surface.

| Axis                   | B+ tree                                                         | LSM tree                                                                |
|------------------------|-----------------------------------------------------------------|-------------------------------------------------------------------------|
| Write path             | in-place page update; random write per dirty page               | append to memtable; sequential write to WAL; bulk flush to SSTable      |
| Read path              | traverse root→leaf; 3–4 page reads (mostly cached)              | check memtable → L0 SSTs → one file per level; bloom-filtered           |
| Write amp              | 3–10× (depth × page-to-row ratio)                               | 3–5× (tiered) to 10–30× (leveled)                                       |
| Read amp (worst)       | 4 I/Os (tree depth)                                             | 20–40 I/Os (tiered, before bloom) or 6 (leveled, after bloom)           |
| Space amp              | 1.2–1.5× (page fragmentation)                                   | 1.1× (leveled) to 2× (tiered)                                           |
| Latency shape          | smooth; tail driven by page-cache misses and checkpoint flushes | bimodal; baseline tiny, p99 spikes during compaction                    |
| Concurrency mechanism  | latch coupling / lock coupling on B-link tree; MVCC overlay     | immutable SSTables + memtable RW lock; MVCC native via sequence numbers |
| Range scans            | cheap; leaf-page linked list → sequential disk reads            | expensive; merging iterator across memtable + N SSTables                |
| Hardware sweet spot    | SSD or HDD; cache-heavy; benefits from large RAM                | SSD; benefits from high sequential write bandwidth                      |
| Operational pain point | vacuum stalls; index bloat; page-level locking under hot update | compaction stalls; write amplification bill; tombstone backlog          |

A few axes are worth drilling into because they drive engine choice more than the amp numbers.

Latency shape. A B-tree's latency histogram is a narrow peak. p50 ≈ 400 µs (one SSD read), p99 ≈ 2 ms (four uncached reads), p99.9 ≈ 10 ms (a checkpoint-driven fsync hiccup). An LSM's latency histogram is bimodal — a narrow peak at p50, and a long tail that stretches to 40–100 ms during compaction windows. For a workload with a 10 ms SLO, the B-tree's narrower histogram is worth more than its marginally worse write amp.
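The difference in histogram shape is easy to see with synthetic numbers. The two populations below are invented for illustration, not measurements; the point is that a small compaction-window mode drags the LSM's p99 out to tens of milliseconds without touching its p50:

```python
import math

def pct(sorted_vals, p):
    """Nearest-rank percentile on a pre-sorted list."""
    return sorted_vals[math.ceil(p / 100 * len(sorted_vals)) - 1]

# Latencies in ms. The B-tree is a narrow peak with a short tail; the
# LSM has a lower baseline plus a 2% compaction-window mode at 50 ms.
btree_ms = sorted([0.4] * 950 + [2.0] * 49 + [10.0])
lsm_ms   = sorted([0.3] * 980 + [50.0] * 20)

print(pct(btree_ms, 50), pct(btree_ms, 99))   # 0.4 2.0
print(pct(lsm_ms, 50), pct(lsm_ms, 99))       # 0.3 50.0 -- better p50, far worse p99
```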

Range scans. A B-tree leaf is a sorted array, and leaves are linked left-to-right. A range scan over a million rows reads a few thousand sequential pages — one of the fastest disk-access patterns there is. An LSM range scan instantiates a merging iterator over the memtable plus every relevant SSTable, and the merge cost is logarithmic in the number of SSTables. For scan-heavy workloads (reporting queries, full-table exports, join operators), the B-tree wins by a wide margin.
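The cost asymmetry is visible in miniature. Below is a sketch of the LSM side, using Python's heapq as the k-way merging iterator; the lists stand in for sorted SSTable runs and are assumed to hold distinct keys (a real engine must also resolve duplicate keys by recency, which this sketch omits):

```python
import heapq

# One pretend memtable plus three pretend SSTables, each sorted by key.
memtable = [(3, "c"), (9, "i")]
sstables = [
    [(1, "a"), (5, "e")],
    [(2, "b"), (7, "g")],
    [(4, "d"), (8, "h")],
]

# The merging iterator: O(n log k) for k runs, paid on every range scan.
# The B-tree equivalent is a single sequential sweep along linked leaves.
merged = list(heapq.merge(memtable, *sstables))

print([k for k, _ in merged])   # [1, 2, 3, 4, 5, 7, 8, 9]
```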

Concurrency. Both engines support MVCC, but reach it differently. The LSM gets MVCC almost for free — every write carries a sequence number, and an iterator pinned at sequence s just filters to writes with seq ≤ s. The B-tree layers MVCC on top of in-place update by storing versions in a separate structure (Postgres: heap tuples with xmin/xmax; InnoDB: undo log). The B-tree's MVCC is more expensive to read but cleaner to garbage-collect; the LSM's is free at write time but requires tombstone-aware compaction.
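The sequence-number mechanism fits in a few lines. A toy sketch of snapshot reads over a versioned write log; the structure is hypothetical and not any engine's actual format:

```python
# (seq, key, value) triples, newest last. None stands in for a tombstone.
writes = [
    (1, "k1", "a"),
    (2, "k2", "b"),
    (3, "k1", "c"),
    (5, "k1", None),
]

def snapshot_get(log, key, snapshot_seq):
    # Visible versions: same key, sequence number at or below the snapshot.
    visible = [(seq, val) for seq, k, val in log if k == key and seq <= snapshot_seq]
    if not visible:
        return None                # key never existed at this snapshot
    _, newest = max(visible)       # highest visible sequence number wins
    return newest                  # None here means "deleted at this snapshot"

print(snapshot_get(writes, "k1", 2))   # a
print(snapshot_get(writes, "k1", 4))   # c
print(snapshot_get(writes, "k1", 9))   # None -- the tombstone is visible
```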

Hardware — what the disk cares about

The engine-vs-engine decision used to depend heavily on whether your disk was a hard drive or an SSD. In 2026 most production workloads run on SSDs, but the hardware still shapes the choice in three ways.

HDDs. Random I/O on a spinning platter is 100–200 IOPS at best, because the head has to physically seek. Sequential I/O is 100+ MB/s. An LSM's sequential-write bias is worth almost everything on HDDs — you turn 20 000 random writes per second (impossible) into a 5 MB/s sequential stream (trivial). B-trees on HDD are painful; every page update is a random seek. This is why log-structured storage was invented in the first place — Rosenblum and Ousterhout's Log-Structured File System paper (1991) argued that spinning disks would become so much worse at random I/O than sequential that every write should be sequential.

SSDs. Random I/O on an SSD is ~100 000 IOPS, within 2–3× of sequential throughput. The LSM's sequential-write advantage shrinks. A B-tree on SSD is perfectly serviceable — a random 16 KiB page write costs microseconds. Write amplification becomes the dominant cost metric, because SSDs have finite write endurance (measured in DWPD — drive writes per day). An LSM with 15× write amp burns through an SSD 15× faster than a B-tree with equal application throughput. For high-ingest workloads on expensive SSDs, this matters for the hardware budget.
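The endurance arithmetic is worth doing explicitly. All constants below (drive size, DWPD rating, warranty period, ingest rate) are illustrative assumptions, not vendor data:

```python
def drive_life_years(capacity_tb, dwpd, warranty_years, app_mb_per_s, write_amp):
    # Total bytes the endurance rating allows, expressed in TB written.
    tbw_budget = capacity_tb * dwpd * 365 * warranty_years
    # Bytes actually hitting flash per year = application bytes x write amp.
    tb_per_year = app_mb_per_s * write_amp * 86_400 * 365 / 1e6
    return tbw_budget / tb_per_year

# A 1.92 TB drive rated 1 DWPD over 5 years, at 20 MB/s of app writes:
print(drive_life_years(1.92, 1, 5, 20, 5))    # B-tree-ish write amp
print(drive_life_years(1.92, 1, 5, 20, 15))   # LSM-ish write amp: exactly 3x shorter life
```

Drive lifetime is inversely proportional to write amp, so a 15× engine burns the same endurance budget three times faster than a 5× one at equal application throughput, whatever the other constants are.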

Persistent memory (Optane, CXL.mem, battery-backed DRAM). Byte-addressable, near-DRAM latency, persistent. It changes the geometry entirely — "page" is no longer a fundamental unit, because a two-byte write is cheap and atomic. Purpose-built engines for this tier (Intel's PMDK, LeanStore, Umbra) look like neither B-tree nor LSM — they are variable-size-page hybrid structures that exploit byte addressability. Intel discontinued Optane in 2022, so this tier is smaller in 2026 than mid-2010s projections suggested, but CXL-attached persistent memory is growing. For storage design today, persistent memory is a footnote that a specialist workload might hit; for most of the production world it is irrelevant.

Operational pathologies — what breaks at 3 a.m.

Both engines have specific failure modes their operators fear. Knowing them by name is how you pick the engine your team can actually run.

LSM — compaction stalls. The engine is ingesting at 100 000 writes/sec. Compaction is running at 80 MB/s and cannot keep up. L0 grows from 4 files to 20 to 40. RocksDB's level0_stop_writes_trigger fires and the engine stops accepting writes. The on-call engineer sees a client-side alert: "database not responding." Recovery is either wait out the compaction backlog (minutes to hours) or throw bigger disks at the cluster. This is a designed failure mode — the engine prefers to stall writes rather than accept them faster than it can durably store them — but "designed" does not mean "acceptable" at 3 a.m.
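The stall is explicit backpressure, not a crash. A simplified sketch of the admission logic; the trigger values mirror RocksDB's defaults for level0_slowdown_writes_trigger and level0_stop_writes_trigger, but the logic here is a cartoon of the real code:

```python
SLOWDOWN_TRIGGER = 20   # L0 file count at which writes start being delayed
STOP_TRIGGER = 36       # L0 file count at which writes are refused outright

def admission(l0_files: int) -> str:
    if l0_files >= STOP_TRIGGER:
        return "stall"      # writers block until compaction drains L0
    if l0_files >= SLOWDOWN_TRIGGER:
        return "throttle"   # per-write delay injected so compaction catches up
    return "accept"

print(admission(5), admission(25), admission(40))   # accept throttle stall
```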

LSM — tombstone backlog. Users delete 10% of rows per day. Deletes become tombstones, which live in SSTables until compaction reclaims them. If the delete rate exceeds the compaction rate, tombstones accumulate; read performance degrades because every read must skip over tombstones; eventually the backlog becomes so large that a full compaction is the only remedy, and that compaction takes the cluster offline.

LSM — GC stalls in JVM-based engines. Cassandra runs on the JVM. A compaction that reads 10 GB of SSTables and produces 10 GB of new ones allocates enormous amounts of short-lived memory. If the GC pauses for a full collection during compaction, the cluster's latency histogram cliff-walks — a 2-second GC pause is not uncommon. This is not an LSM-structural problem; it is a runtime-choice problem. But many production LSMs (Cassandra, HBase, Kafka) run on the JVM, so the on-call engineer lumps the pathologies together.

B-tree — vacuum stalls. Postgres keeps deleted rows visible to old transactions (MVCC) until no transaction can see them, then the VACUUM daemon reclaims the space. If VACUUM falls behind — because the workload never pauses, or because long-running transactions pin old versions — table bloat grows, index bloat grows, and a forced VACUUM FULL eventually rebuilds the table with the server offline. This is Postgres's version of the LSM's tombstone backlog, with the same character: the background reclaimer can lose to the foreground workload, and the recovery is expensive.

B-tree — hot-page contention. A B-tree leaf holds ~100 rows. If one row gets updated 100 000 times per second, every one of those updates takes a latch on the same leaf page. Throughput collapses to what a single page's lock can support. LSMs don't have this mode — a hot key just produces more overwritten versions across SSTables, absorbed by the memtable and merged in the background. This is the worst B-tree workload, and the usual answer is "partition the hot key" (either at the application or by sharding the table) rather than "switch engines".

B-tree — index bloat. Every secondary index is a parallel B-tree. An update to an indexed column requires updating the row and every affected index's B-tree. A table with 10 secondary indexes has 10× the write amp of an unindexed one. An LSM shares this problem but tends to pay it less (each index is its own LSM, and tombstones are cheap), whereas a B-tree's index writes all hit page locks and are serialised with the row update.

The decision flowchart

Compress the decision into a flowchart. The left-to-right path from "what does your workload look like" to "which engine" should take thirty seconds to walk.

[Figure: the B-tree vs LSM decision flowchart. From "Start: describe your workload", four decision diamonds partition the space. Write:read > 1? If yes: dataset > RAM and on SSD? yes → LSM, leveled/universal (RocksDB, Cassandra, ScyllaDB); no → the tie zone (write-heavy but RAM-sized), where either works and operational familiarity decides. If write:read < 1: tight p99 SLO (under 10 ms) and complex queries? yes → B+ tree (Postgres, InnoDB, SQL Server); no → time-series with TTL? yes → LSM with TWCS or FIFO (Cassandra TWCS, RocksDB FIFO); no → B+ tree (default).]

Notes from the figure:
- "Write-heavy" = writes ≥ reads. Most OLTP is read-heavy (80/20 or more skewed).
- "Tight p99 SLO" = user-facing latency budget under 10 ms for a point lookup.
- "Complex queries" = joins, aggregations, range scans — anything beyond point lookup / put.
- Default for uncertain workloads: B+ tree. It is operationally simpler and recovers more predictably. You can always migrate to an LSM later; the reverse migration is harder.
- Rule of thumb: if you have to think about compaction tuning, you probably want a B-tree.
The workload-to-engine decision flowchart. Start at "describe your workload" and walk right. The four decision diamonds — write-heavy? dataset vs RAM? tight SLO? time-series? — partition the workload space into four engine choices. The unlabelled middle region (write-heavy but RAM-sized) is the tie zone, where either engine works and operational familiarity is the tiebreaker. The default for uncertain workloads is B+ tree, because it is operationally simpler.

Four branches, four recommendations. Work through each.

Write-heavy + dataset >> RAM + SSD → LSM. The canonical LSM workload. Writes outnumber reads; the dataset is big enough that caching it is impossible; the disk is an SSD, so sequential writes are only moderately faster than random but write endurance matters and compaction I/O is affordable. Metrics ingestion, event logging, time-series data, write-heavy IoT sensor fleets. Pick RocksDB (as a library), Cassandra (for horizontal scaling), or ScyllaDB (for Cassandra-compatible higher throughput).

Tight p99 SLO + complex queries → B+ tree. The canonical OLTP workload. Read-heavy or balanced; tight latency budgets on user-facing queries; rich query language with joins, secondary indexes, range scans, transactions. Pick Postgres for general-purpose OLTP, MySQL with InnoDB for ubiquity, SQL Server for Microsoft shops.

Time-series with TTL → LSM with TWCS or FIFO. A specialised write-heavy workload where old data becomes worthless. You want wholesale expiry of old data with minimal overhead. Cassandra with TWCS or RocksDB with FIFO compaction are built for this. A plain B-tree can do it but pays a vacuum cost for every TTL expiry; the LSM just drops whole buckets.

Everything else → B+ tree (default). The default for uncertain workloads. Fewer operational pathologies, smoother latency, well-understood failure modes, mature tooling. You can always migrate to an LSM later if the workload evolves; migrating off an LSM to a B-tree is much harder (you rarely realise you need to until the B-tree migration is already blocked by your data volume).
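The four branches reduce to a short function. The field names and thresholds below are ad hoc encodings of the flowchart's notes (write-heavy means writes ≥ reads; tight SLO means a point-lookup budget of 10 ms or less); treat it as a sketch of the decision, not a substitute for judgement:

```python
def pick_engine(w: dict) -> str:
    if w.get("time_series_with_ttl"):
        return "LSM (TWCS/FIFO)"                    # wholesale expiry wins outright
    if w["writes_per_sec"] >= w["reads_per_sec"]:   # write-heavy branch
        if w["dataset_gb"] > w["ram_gb"] and w["disk"] == "ssd":
            return "LSM (leveled/universal)"
        return "either: pick operational familiarity"   # the tie zone
    if w.get("p99_slo_ms", float("inf")) <= 10 and w.get("complex_queries"):
        return "B+ tree"
    return "B+ tree (default)"                      # uncertain workloads default here
```

Walking the three workloads from later in the chapter through it: order tracking (read-heavy, 10 ms SLO, secondary indexes) lands on the B+ tree; the clickstream pipeline (500:1 write-heavy, 15 TB on SSD) lands on the leveled/universal LSM; the sensor fleet (TTL'd time-series) lands on TWCS/FIFO.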

Real-world pairings

The decision flowchart is what theory says. Here is what the industry actually did.

Postgres — B+ tree. General-purpose OLTP, complex SQL, ACID transactions, tight tail latency. Every major Postgres deployment at scale (Instagram, Heroku, Supabase) lives on the B+ tree engine. Postgres does have experimental LSM-like storage (Zheap, pluggable table AM), but the production default is and will remain the B+ tree heap.

MySQL InnoDB — B+ tree (clustered). InnoDB's primary index is a clustered B+ tree — the leaf pages store the rows themselves, not just row pointers. Point lookups are cheap (no secondary hop), but secondary indexes are more expensive (a lookup goes index → primary key → clustered tree). This is why MySQL is especially good at primary-key-heavy workloads (typical web app CRUD) and relatively worse at analytical queries with many joins.

SQL Server / Oracle — B+ tree with sophisticated extensions. Same engine family, with bells and whistles for enterprise workloads: partitioning, columnstore indexes alongside the B-tree, in-memory OLTP (Hekaton in SQL Server). The B-tree is the primary storage engine and has been for 40 years.

RocksDB — LSM, embedded. The default LSM engine for embedded key-value use. Shipped inside Facebook's services, Apple's FoundationDB, LinkedIn's Ambry, CockroachDB, TiDB, YugabyteDB. When you see a distributed SQL database built in the 2015–2025 era, its storage layer is usually RocksDB. Shipped with leveled compaction by default; universal and FIFO are opt-in.

Cassandra / ScyllaDB — LSM, distributed. Write-optimised distributed databases designed for high ingest and wide-row workloads. Runs Netflix's metadata, Apple's iMessage, Instagram's Direct. Default compaction is STCS, but TWCS is common for time-series and LCS for read-heavy tables. ScyllaDB is Cassandra-compatible but rewritten in C++ on Seastar, avoiding the JVM's GC pathology.

CockroachDB, TiDB, YugabyteDB — distributed SQL on LSM. All three built distributed SQL (ACID, Postgres-compatible SQL, or similar) on top of a single-node LSM (RocksDB in all three's case, now Pebble in CockroachDB's case — Pebble is a RocksDB-compatible Go reimplementation). The choice of LSM as the local storage layer was driven by the horizontal scaling story — sharding an LSM across nodes is simpler than sharding a B-tree — not by the amp numbers.

SQLite — B+ tree, embedded. The world's most deployed database, a B+ tree inside a single file. Runs on every phone, every browser, every embedded device. The choice was made in 2000 and has never been revisited because it is correct for the workload (occasional writes, local files, no background processes).

Picking the right engine for three workloads

Workload A: OLTP row-level — e-commerce order tracking.

Traffic: 5000 reads/sec, 500 writes/sec (10:1 read-heavy). Rows are small (~1 KB); typical query is "get order by ID", "list orders for user X", "update order status". Dataset is 500 GB, fits partially in a 64 GB RAM buffer pool. SLO: 10 ms at p99 for any user-facing query. Access pattern: 20% of orders (the recent ones) receive 80% of traffic — a classic hot-set.

Flowchart walk: write-heavy? no (10:1 read-heavy). Tight p99 SLO + complex queries (secondary indexes on user, status, date)? yes. → B+ tree (Postgres or MySQL).

Reasoning: the tight p99 SLO kills LSM because compaction spikes will blow the budget. Complex queries with multiple secondary indexes suit the B-tree's mature query planner. The 80/20 hot-set fits in the buffer pool — nearly every read is a cache hit, giving sub-millisecond p50. Vacuum can be tuned to run during off-peak hours.

Workload B: write-heavy analytics ingest — clickstream pipeline.

Traffic: 50 000 writes/sec sustained, 100 reads/sec (ad-hoc analyst queries). Rows are large and mostly append-only (event records, session replays). Dataset grows by 500 GB/day, retained 30 days. SLO: none for reads (analysts wait seconds or minutes); writes must not fall behind the ingest rate. Disk: NVMe SSDs, write endurance a real concern.

Flowchart walk: write-heavy? yes (500:1 write-heavy!). Dataset >> RAM + SSD? yes (15 TB, 128 GB RAM). → LSM (RocksDB or Cassandra).

Reasoning: the write volume rules out a B-tree — 50 000 random page writes per second would saturate any buffer pool and hit WAL fsync ceilings. The LSM turns every write into a memtable insert plus a sequential WAL append; bulk flushes and compactions absorb the disk I/O. Lack of read SLO means compaction spikes do not matter. Secondary consideration: write amp matters for SSD endurance — pick tiered or universal compaction with size_ratio tuned to keep write amp under 6×.

Workload C: time-series — IoT sensor fleet.

Traffic: 10 000 writes/sec (a million sensors reporting every 100 seconds). Rows are tiny (timestamp + sensor_id + float). Reads are mostly range scans over recent windows ("sensor X over last hour"); historical queries rare. TTL: 90 days, then the data is deleted wholesale.

Flowchart walk: write-heavy? yes. Dataset >> RAM? yes (90 days × 10k/sec × 30 B ≈ 2.5 TB). Is it time-series with TTL? yes. → LSM with TWCS or FIFO (Cassandra with TimeWindowCompactionStrategy, or RocksDB FIFO).

Reasoning: the TTL access pattern is the deciding factor. With TWCS, each day's data lives in its own SSTable bucket; after 90 days, the whole bucket is dropped at once with zero compaction work. A B-tree would pay a vacuum per TTL expiry. A plain leveled LSM would rewrite old data during compaction right up until the moment it expires — pure waste. TWCS is purpose-built for this exact shape.
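Bucket-wise expiry is what makes this cheap. A toy sketch with one bucket per day; the constants are illustrative, and a real TWCS bucket is an SSTable, not a Python list:

```python
from collections import defaultdict

DAY = 86_400          # bucket width in seconds: one bucket per day
TTL = 90 * DAY        # retention window

buckets = defaultdict(list)   # bucket index -> records for that window

def write(ts, record):
    buckets[ts // DAY].append(record)

def expire(now):
    # Drop every bucket whose entire window is past the TTL horizon.
    # O(1) per window: no per-row tombstones, no compaction rewrite.
    for b in [b for b in buckets if (b + 1) * DAY <= now - TTL]:
        del buckets[b]
```

A B-tree holding the same data would delete those rows individually and then vacuum the half-empty pages; here the whole window vanishes in one dictionary delete.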

Going deeper

Two production databases have tried to split the difference. Both are worth knowing as hybrid architectures, because they name the axis the mainstream engines forced you to choose on.

WiredTiger — the optional LSM mode

WiredTiger is MongoDB's storage engine since 2015. By default it is a B+ tree. But it ships with an LSM mode — type=lsm at collection creation — which turns a collection into an LSM tree with leveled compaction on the same engine runtime. The same binary, the same transaction manager, the same cache, different data structure per collection.

The motivating idea: the choice should be per-table, not per-database. An application might have a users collection (B-tree, read-heavy) and an events collection (LSM, write-heavy) in the same cluster, each using the right structure for its workload. MongoDB keeps the B-tree as default because most collections are read-heavy. Very few users enable LSM mode in practice — the observation is that workload separation is possible but operationally more complex, and most teams would rather split the write-heavy workload into a separate cluster running a purpose-built LSM engine.

The Bw-tree — lock-free B-tree with LSM-style deltas

The Bw-tree (Levandoski, Lomet, Sengupta — 2013) is a B-tree variant designed for multi-core CPUs with many-threaded in-memory workloads. Its insight is to avoid latches entirely by making page updates delta-chained: instead of mutating a page in place, you prepend a delta record describing the change to a chain of deltas, and consolidate the chain into a fresh page periodically. The chain-plus-consolidation pattern is lifted directly from LSM design — deltas are the memtable, consolidation is compaction — but applied at the level of a single page rather than the whole tree.
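The delta-chain pattern fits in a toy class. This is an illustration of the prepend-then-consolidate idea only; it has none of the Bw-tree's mapping table, compare-and-swap installs, or lock-freedom:

```python
class DeltaPage:
    """A page as an immutable base plus a chain of prepended deltas."""

    def __init__(self, base: dict):
        self.base = dict(base)
        self.deltas = []                        # newest delta first

    def put(self, key, value):
        self.deltas.insert(0, (key, value))     # never mutate the base in place

    def get(self, key):
        for k, v in self.deltas:                # newest delta wins ("memtable")
            if k == key:
                return v
        return self.base.get(key)

    def consolidate(self):
        for k, v in reversed(self.deltas):      # replay oldest -> newest
            self.base[k] = v
        self.deltas.clear()                     # "compaction" at page granularity
```

Reads get slower as the chain grows (a linear scan over deltas), which is exactly why consolidation exists — the same read-amp-versus-write-amp trade as the LSM, shrunk to a single page.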

The Bw-tree ships in Microsoft's Hekaton (the in-memory OLTP engine inside SQL Server) and Azure Cosmos DB's document engine. For workloads that are B-tree-shaped but multicore-bound, it is dramatically faster than a latch-coupled B-tree. It is the clearest hybrid: a B-tree at the structure level, an LSM at the update level.

Umbra and LeanStore — beyond B-tree and LSM

Two research systems from the TU München database group argue that both B-tree and LSM are legacy architectures designed for a world where RAM was scarce and disks were slow. On modern hardware — large RAM, fast SSDs, many cores — the right structure is neither.

LeanStore (Leis et al., 2018) is a variable-size-page B-tree with a buffer manager optimised for the case where the working set mostly fits in RAM. The innovation is a pointer-swizzling scheme: page references are swizzled between on-disk IDs and in-memory pointers depending on whether the target is resident, eliminating the buffer manager's hash table lookup on cached hits. The net result is a B-tree that performs near-DRAM-speed for cached workloads and like a traditional B-tree for cold ones — the best of both without a separate in-memory engine.

Umbra (Neumann, Freitag — 2020) extends LeanStore's ideas and adds a LLVM-based query compiler. Its storage layer is a variable-size-page B-tree where pages can grow from 64 KiB to 64 MiB adaptively, supporting workloads from OLTP to analytics on the same engine. Umbra treats the classical B-tree-vs-LSM question as obsolete: the right structure is a B-tree whose page size itself is a tunable dimension.

These are research systems — not production engines you would pick for your next deployment — but they are worth knowing because they name the direction the field is moving. In ten years the mainstream engines may no longer look like either Postgres or RocksDB. Until then, the decision is between the two, and the flowchart above is the decision.

Adaptive storage — the self-tuning engine

A newer thread of research (FAST-LSM, ADABase, Siberia) is the adaptive storage engine — one that observes the live workload and shifts its structure in response. Hot data in a B-tree, cold data in an LSM, mediated by a policy that moves pages between representations based on access frequency. No production engine ships this fully, but pieces of it exist: SQL Server's Hekaton is a B-tree-like in-memory engine that coexists with the disk-based B-tree, and the engine transparently moves rows between them.

The common thread with LeanStore/Umbra: the B-tree-vs-LSM decision, which has defined storage engineering for three decades, is converging toward "it depends, and the engine will figure it out". The decision in this chapter is the decision you have to make today, with 2026-era engines. In five years you may not need to make it.

Where this leads next

One wall remains before Build 4 is complete. You have the B+ tree structure, splits and merges, the buffer pool, free-space management, torn-write defences — almost the whole engine. The one piece missing is the most important: what happens when the engine crashes in the middle of a split?

A split is a non-atomic operation that touches three pages — the old leaf being split, the new sibling, and the parent whose pointer now references two children instead of one. If the power cuts between writing the new leaf and updating the parent, the tree on disk is inconsistent: a leaf exists with no pointer to it, or worse, a parent pointer references a page that doesn't exist. No checksum detects this — every page is individually valid. The tree's structural invariant is violated, but no single page can tell you.

This is the final wall of the B+ tree, and its answer is the most foundational piece of database engineering: the write-ahead log (WAL), structured using the ARIES recovery protocol (IBM, 1992). Build 5 is the WAL. It starts by naming the crash-mid-split hazard precisely, then builds the log record, the log-sequence number, the redo-undo distinction, checkpoints, and recovery. By the end of Build 5 the B+ tree is crash-safe — and along the way you will see why every serious database engine has a log sitting between the application and the storage structure.

For now, the takeaway from this chapter: the B-tree-vs-LSM decision is workload-first, hardware-second, operational-third — and most of the time, all three converge on the same answer. Read-heavy OLTP with complex queries and tight SLOs wants a B+ tree. Write-heavy streaming ingest on commodity SSDs wants an LSM. Time-series with TTL wants an LSM with bucketed compaction. And when you cannot decide — when the workload is ambiguous and the numbers are close — pick the engine your team can debug at 3 a.m. That is usually more important than any amplification number on the spec sheet.

References

  1. Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil, The Log-Structured Merge-Tree (LSM-Tree) (Acta Informatica, 1996) — the original LSM paper that defined the architecture whose trade-offs this chapter compares against B-trees. cs.umb.edu.
  2. Rudolf Bayer, Edward McCreight, Organization and Maintenance of Large Ordered Indexes (Acta Informatica, 1972) — the founding paper of the B-tree, which every in-place storage engine in production today descends from. infolab.usc.edu.
  3. Siying Dong, Mark Callaghan, Leonidas Galanis, Dhruba Borthakur, Tony Savor, Michael Strum, Optimizing Space Amplification in RocksDB (CIDR 2017) — Facebook's production report on when to pick leveled, universal, or FIFO compaction, which underlies the LSM-side recommendations in this chapter's flowchart. cidrdb.org.
  4. Viktor Leis, Michael Haubenschild, Alfons Kemper, Thomas Neumann, LeanStore: In-Memory Data Management Beyond Main Memory (ICDE 2018) — shows that a modern B-tree with pointer swizzling can match in-memory engines while retaining on-disk semantics, blurring the B-tree-vs-LSM boundary at the high end. db.in.tum.de.
  5. Justin Levandoski, David Lomet, Sudipta Sengupta, The Bw-Tree: A B-tree for New Hardware Platforms (ICDE 2013) — introduces the delta-chain B-tree used in Microsoft's Hekaton and Azure Cosmos DB, the clearest hybrid of B-tree structure with LSM-style updates. microsoft.com/research.
  6. Thomas Neumann, Michael Freitag, Umbra: A Disk-Based System with In-Memory Performance (CIDR 2020) — argues that variable-page-size B-trees with compiled query plans make the classical engine-choice question obsolete for modern hardware. cidrdb.org.