In short

You now have three places a key can live: the active memtable (taking fresh writes right now), the frozen memtable (sealed, being flushed), and a growing list of SSTables on disk, each one the frozen output of some earlier flush. A user calls get("age"). Where is it? The honest answer is "in at most one of those places — but we do not know which until we look." This chapter writes the algorithm that looks. It is a layer-by-layer loop: check the active memtable first; if miss, check the frozen memtable; if miss, walk the SSTables from newest to oldest, returning on the first hit. The "newest wins" rule is not a heuristic — it is a direct consequence of log-structured semantics. The newest record for a key is, by construction, the one written most recently, and in an LSM the layers are strictly ordered from newest (active memtable) to oldest (the earliest-flushed SSTable). A tombstone hit at any layer also counts as a find — it means "deleted at this point in time, stop searching, return None." The cost of this algorithm in the worst case is O(N_{\text{sstables}} \times \log(\text{sstable-size})) because each SSTable probe is an O(\log) sparse-index lookup plus one block read. Forty SSTables times a 4 KB block read each is 160 KB of disk I/O for a single point lookup that missed everywhere — unusable for a hot read path. That pain is what makes Bloom filters (next chapter) a near-necessity rather than an optimisation.

You have the memtable (chapter 13) — a sorted skip list in RAM that absorbs writes in O(\log n). You have the flush (chapter 14) — a sequential transcription that turns a sealed memtable into a block-aligned SSTable on disk, complete with a sparse index and a footer. You have the active/frozen swap that keeps the write path unblocked during a flush.

What you do not yet have is a coherent get(key).

The write side is easy: every put() goes to exactly one place (the active memtable). The read side is the tricky one, because after a week of operation your store looks like this:

active memtable      : 8 MB, ~40k keys, all very recent writes
frozen memtable      : 64 MB, ~300k keys, currently being flushed to disk
sstable-0042.sst     : 64 MB, ~300k keys, flushed 3 seconds ago
sstable-0041.sst     : 64 MB, flushed a minute ago
sstable-0040.sst     : 64 MB, flushed 10 minutes ago
...
sstable-0001.sst     : 64 MB, flushed a week ago

Call get("age"). The key age was written at some point. It might have been overwritten since. It might have been deleted since. The value you should return is the one associated with the most recent put("age", ...), unless a more recent delete("age") tombstone overrides that. The challenge: the "most recent put" could be sitting in any of forty-four places.

This chapter writes the algorithm that finds it — and then counts its cost, which turns out to be the thing that makes the next chapter unavoidable.

The layered invariant

Every LSM read algorithm rests on one structural invariant: the layers are ordered from newest to oldest, and each layer's contents are entirely more recent than every layer below it.

Concretely:

  1. The active memtable holds the newest writes — everything written since the last flush started.
  2. The frozen memtable, if present, holds writes that were in the active memtable at the moment of the swap. All of them are older than every write in the current active memtable and newer than every write that is already on disk.
  3. The SSTables are ordered by creation time. sstable-0042.sst was flushed after sstable-0041.sst, so every key in it is newer than every key in 0041. Same all the way down.

Why is this invariant airtight? Because the only way a key gets into an SSTable is via a flush, and flushes happen one at a time in order; because a memtable freeze is an atomic swap that partitions the write stream into a before and an after; and because SSTables on disk are immutable — nothing ever rewrites them. The layer ordering is temporal, and temporal ordering is what log-structured storage gives you for free.

This invariant is what makes "newest wins" correct. If you find a value for age in the active memtable, you can stop — no deeper layer can contain a newer value, because deeper layers are by definition older. If you have to look one level deeper, the same argument applies recursively.

[Figure: The layered LSM read, checking newest to oldest with short-circuit. A vertical stack runs from the active memtable (RAM, newest) through the frozen memtable (RAM, flushing) down through sstable-0042 … sstable-0001 (oldest, a week old). A get(key) arrow descends the stack one layer at a time; at each layer a decision node asks "hit?" — yes returns the value (or None for a tombstone), no continues to the next layer below. Dashed arrows mark the deeper SSTables that are never traversed once an earlier layer matches.]
The layered read. Layers are stacked newest-to-oldest, and get(key) walks them in order, returning on the first match. A match at any layer — including a tombstone — ends the search. If no layer matches, the key does not exist and None is returned. The shape is a classic short-circuit chain; the only thing specific to LSM is the ordering rule.

The rule

Stated as one sentence: to read a key in an LSM-tree, check layers from newest to oldest and return the value (or None for a tombstone) from the first layer that has the key; if no layer has it, return None.

That is the whole algorithm. Everything else in this chapter is either code that implements it or an accounting of what it costs.

The rule has three pieces worth naming separately.

Newest first. The layers have a natural order (active memtable → frozen → SSTables newest to oldest) and the read walks them in that order. Reversing the order would still find a record for the key, but it would find the oldest record rather than the newest: a stale read.

Stop on first hit. Once a layer has the key, deeper layers are irrelevant. This is not an optimisation; it is a correctness property. A deeper hit would be strictly older, and returning it would amount to returning stale data.

Tombstone counts as a hit. A tombstone at any layer means "this key has been deleted as of this point in the log." The search stops there and returns None. The presence of an older put for the same key in a deeper SSTable is irrelevant — the tombstone is newer, and newest wins.
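The three pieces compose into a loop small enough to hold in your head. A minimal sketch, using plain dicts as stand-in layers and a sentinel object as the tombstone (the real store swaps in memtables and SSTable readers):

```python
TOMBSTONE = object()   # stand-in sentinel; the real store defines its own

def layered_get(layers, key):
    """Walk `layers` newest-first; the first layer holding `key` wins."""
    for layer in layers:
        if key in layer:
            v = layer[key]
            return None if v is TOMBSTONE else v   # tombstone hit: deleted
    return None                                    # no layer had the key

layers = [{"a": TOMBSTONE}, {"a": "1", "b": "2"}]  # newest layer first
assert layered_get(layers, "a") is None   # tombstone in the newest layer wins
assert layered_get(layers, "b") == "2"    # found one layer down
assert layered_get(layers, "c") is None   # miss everywhere
```

All three rules — newest first, stop on first hit, tombstone counts — fall out of the single loop.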

Why "newest wins" is correct

The temptation at this point is to treat "newest wins" as a convention that the designer picks. It is not; it is forced by the semantics of the put/delete API.

Consider a user who calls:

put("balance", "1000")     # t = 10:00:00
put("balance", "1500")     # t = 10:00:05
get("balance")             # t = 10:00:10

The user expects 1500. Any weaker guarantee would make the store unusable — reads that could return arbitrary past values are not a key-value store, they are a time-travel log.

Now observe what happens inside the LSM. Between 10:00:00 and 10:00:05, a flush happened. The memtable that held 1000 got sealed, written to sstable-0023.sst, and dropped. The value 1500 was written into the fresh memtable that succeeded it. At 10:00:10 the state is:

active memtable : {balance: "1500"}
sstable-0023    : {balance: "1000", ...}

The user's expectation — get("balance") == "1500" — can only be met by checking the active memtable first. Checking the SSTable first (or checking everything and picking randomly) would return 1000, which is a stale read of a value the user already overwrote.
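A sketch makes the ordering argument concrete; two plain dicts stand in for the active memtable and sstable-0023:

```python
def first_hit(layers, key):
    """Walk layers front-to-back, return the first value found (a sketch)."""
    for layer in layers:
        if key in layer:
            return layer[key]
    return None

active = {"balance": "1500"}   # the 10:00:05 write, still in RAM
sst_23 = {"balance": "1000"}   # the 10:00:00 write, flushed to disk

assert first_hit([active, sst_23], "balance") == "1500"  # newest first: correct
assert first_hit([sst_23, active], "balance") == "1000"  # reversed: stale read
```

The second assertion is the bug the ordering rule prevents: same data, same loop, wrong layer order, stale answer.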

This is the read-your-own-writes guarantee, and log-structured storage gives it to you for free provided you respect the layer ordering. The ordering is the whole contract; the algorithm is just reading off the contract.

The same argument applies to deletes. delete("balance") writes a tombstone into the current memtable. Until that memtable is flushed, the tombstone sits in RAM. If a subsequent get("balance") skipped the memtable and went straight to the SSTable, it would find 1500 (the old value, written before the delete) and return it — the delete would silently fail to stick. Checking the memtable first catches the tombstone, returns None, and the delete works as advertised.

Newest wins is not a policy choice. It is the only rule that makes put/get/delete behave like a key-value store.

Tombstones on the read path

The previous chapter (tombstones) built the write-side machinery: delete(key) appends a record with a type byte saying "this is an erasure, not a value." The read path now has to honour those records.

The rule is the same as for any record, with one extra step: a tombstone hit ends the search and returns None. Here is the full decision tree for what get() does when it probes a layer:

probe layer L for key k:
    if L has no record for k:
        continue to next layer
    elif L has a put record for k with value v:
        return v                           # non-null value — real hit
    elif L has a tombstone record for k:
        return None                        # tombstone hit — deleted

The second and third cases both terminate the search. The first is the only case that keeps looking.

Why terminate on a tombstone? Because the tombstone is the newest thing the store knows about this key. Any older record for the same key in a deeper layer is superseded. You are not looking for "the most recent put ignoring tombstones"; you are looking for "the current state of the key," and the current state is "deleted."

This is subtle enough to be worth a concrete example. Imagine:

sstable-0042 : {alice: "35"}    # put at t=100
sstable-0035 : {alice: "34"}    # put at t=50 (overwritten by 0042)
memtable     : {alice: <TOMBSTONE>}    # delete at t=200

A get("alice"):

  1. Check memtable. Hit — tombstone. Return None.

The search stops at the memtable. It does not descend to sstable-0042 or sstable-0035, even though both contain real values for alice. The tombstone in the memtable is newer than both, and newest wins. The user correctly sees the key as deleted.

Contrast with the same state minus the tombstone:

sstable-0042 : {alice: "35"}
sstable-0035 : {alice: "34"}
memtable     : {}                  # alice not touched recently

Now get("alice"):

  1. Check memtable. Miss.
  2. (No frozen memtable.)
  3. Check sstable-0042. Hit — value "35". Return "35".
  4. sstable-0035 is never opened.

The search short-circuits at the newest SSTable that has the key. sstable-0035 still has its stale alice: "34" on disk; that record will not be cleaned up until a compaction scans it and drops it. But from the read path's perspective, it is invisible — no read ever reaches it, because a newer record for the same key always stops the search first.
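Both traces can be reproduced with a sketch that records which layers were actually probed (the dict layers and name tags are illustrative stand-ins for the real memtable and SSTable readers):

```python
TOMBSTONE = object()

def traced_get(layers, key):
    """Like the read path, but also reports which layers were probed (a sketch)."""
    touched = []
    for name, layer in layers:                  # newest first
        touched.append(name)
        if key in layer:
            v = layer[key]
            return (None if v is TOMBSTONE else v), touched
    return None, touched

layers = [
    ("memtable",     {"alice": TOMBSTONE}),     # delete at t=200
    ("sstable-0042", {"alice": "35"}),          # put at t=100
    ("sstable-0035", {"alice": "34"}),          # put at t=50
]

value, touched = traced_get(layers, "alice")
assert value is None                            # tombstone: key reads as deleted
assert touched == ["memtable"]                  # neither SSTable was opened

# same state minus the tombstone: the newest SSTable short-circuits the rest
value, touched = traced_get([("memtable", {})] + layers[1:], "alice")
assert value == "35"
assert touched == ["memtable", "sstable-0042"]  # 0035 is never reached
```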

The code

Forty lines of Python, almost all of them plumbing. The core algorithm is a loop.

# lsm_read.py — the LSM read path
import threading

TOMBSTONE = object()          # a sentinel distinct from every real value

class LsmStore:
    def __init__(self):
        self.active   = SkipListMemtable()   # chapter 13
        self.frozen   = None                 # or a SkipListMemtable being flushed
        self.sstables = []                   # list of SSTableReader, newest at index 0
        self.lock     = threading.RLock()    # coarse lock for layer swaps

    def lsm_get(self, key):
        """Return the current value for `key`, or None if absent/deleted.

        Walks layers newest-to-oldest, short-circuits on first hit.
        A tombstone at any layer terminates the search with None."""
        with self.lock:
            layers = [self.active]
            if self.frozen is not None:
                layers.append(self.frozen)
            layers.extend(self.sstables)     # sstables already newest-first
        for layer in layers:
            hit = layer.probe(key)            # returns: value / TOMBSTONE / None (=miss)
            if hit is None:
                continue                      # miss — try next layer
            if hit is TOMBSTONE:
                return None                   # tombstone — stop searching
            return hit                        # real value — stop searching
        return None                           # no layer had it

The inner layer.probe(key) is polymorphic. For a memtable it is a skip-list lookup; for an SSTable it is a sparse-index binary-search plus a block read. Both return the same tri-valued result: the value, or TOMBSTONE, or None for a miss. The read-path code above does not care which layer it is talking to — the layer tells it whether the key was there and, if so, what kind of record.

The two probe() implementations, sketched:

class SkipListMemtable:
    def probe(self, key):
        node = self._lookup(key)              # existing get() from chapter 13
        if node is None:
            return None
        return TOMBSTONE if node.is_tombstone else node.value

class SSTableReader:
    def probe(self, key):
        """Use the sparse index to find the block that could hold `key`, read it,
        scan for the record. Returns value / TOMBSTONE / None."""
        block_offset = self._sparse_index_lookup(key)   # O(log) binary search
        if block_offset is None:
            return None                                  # key is outside this file's range
        block = self._read_block(block_offset)           # one 4 KB disk read
        rec = self._scan_block_for(block, key)
        if rec is None:
            return None
        return TOMBSTONE if rec.type == RECORD_TOMB else rec.value

Why copy the layer list under the lock and then release: iterating the live self.sstables list while a background compaction or flush is mutating it would expose the reader to "list changed size during iteration" races. Copying a short list of references under a brief lock is cheap (nanoseconds); holding the lock for the entire read would serialise every reader against every writer, which is exactly what the LSM design avoids. After the copy, each reader walks its own snapshot; writers can install new SSTables into the canonical list concurrently.

Why TOMBSTONE is a sentinel object, not None: because None means "the layer does not have this key at all" in this API. We need a third value — "the layer has a record, and the record says deleted" — that no real user value can collide with. Using an object() sentinel is Python-idiomatic; in typed languages you would use an enum or a tagged union.

Why sstables is kept newest-first in the list: the read loop walks the list front-to-back and wants newest first. Keeping the list already sorted means no per-read sort. When a flush completes, the new SSTable is inserted at index 0 (self.sstables.insert(0, new_sst)); the list invariant is trivial to maintain.
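The _sparse_index_lookup that SSTableReader.probe relies on can be sketched with the standard-library bisect module, assuming the index is an in-RAM sorted list of (first key in block, block offset) pairs built at flush time. The class name and entry layout here are illustrative, not chapter 14's exact structures:

```python
import bisect

class SparseIndex:
    """In-RAM index: one (first_key, block_offset) entry per data block (a sketch)."""
    def __init__(self, entries):
        # entries sorted by first_key, e.g. [("apple", 0), ("melon", 4096), ...]
        self.keys    = [k for k, _ in entries]
        self.offsets = [off for _, off in entries]

    def lookup(self, key):
        """Offset of the only block that could hold `key`, or None if out of range."""
        i = bisect.bisect_right(self.keys, key) - 1   # rightmost first_key <= key
        if i < 0:
            return None                 # key sorts before every block in the file
        return self.offsets[i]

idx = SparseIndex([("apple", 0), ("melon", 4096), ("tomato", 8192)])
assert idx.lookup("banana") == 0       # falls inside the first block
assert idx.lookup("melon")  == 4096    # exact match on a block boundary
assert idx.lookup("aaa")    is None    # before the first key: definite miss
assert idx.lookup("zebra")  == 8192    # last block could hold it; the scan decides
```

Note the asymmetry: a None from lookup is a definite miss, but a returned offset only means "could be in this block" — the block scan has the final word.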

Two reads, traced

A read served from the memtable vs. a read served from a cold SSTable

Set up a small store with three layers.

# demo
store = LsmStore()
# populate some state (for illustration; normally these would come from real put/flushes)
store.sstables = [
    SSTableReader("sstable-0003.sst"),   # newest on disk — contains city, email, name
    SSTableReader("sstable-0002.sst"),   # contains age, locale, zip
    SSTableReader("sstable-0001.sst"),   # oldest — contains user_id, role, age (stale)
]
store.active.insert("score", "A9F")      # fresh write in active memtable
store.active.insert("views", "421")

The layer list, newest-to-oldest:

[active memtable, sstable-0003, sstable-0002, sstable-0001]

Read 1: store.lsm_get("views")

probe active memtable    hit: "421"       STOP — return "421"

One layer touched. Zero disk I/O. Roughly 50–200 nanoseconds depending on cache. This is the hot-path case — keys that were written recently are served entirely from RAM, and that covers a large fraction of real workloads by the 80/20 rule.

Read 2: store.lsm_get("user_id")

probe active memtable    miss
probe sstable-0003       (opens file, sparse-index binary-search, reads one 4 KB block)
                         scans block for "user_id" — not present  -> miss
probe sstable-0002       (reads one 4 KB block)
                         scans block for "user_id" — not present  -> miss
probe sstable-0001       (reads one 4 KB block)
                         scans block — found "user_id": "dipti-001"
                         STOP — return "dipti-001"

Four layers touched. Three 4 KB disk reads (12 KB). Roughly 300 microseconds on a fast SSD, 30 milliseconds on a spinning disk. This is the cold-path case: the key lives in the oldest SSTable and the read had to probe every newer one first, just to confirm they did not have it.

Notice how asymmetric the costs are. The hot read was sub-microsecond; the cold read was thousands of times slower. That ratio gets worse every time a new SSTable is added to the stack — a store with 40 SSTables can take 40 disk reads to serve a miss. This is read amplification, and it is the pain that the next chapter (Bloom filters) exists to fix.

Read 3 (for completeness): store.lsm_get("mobile"), where "mobile" has never been written.

probe active memtable    miss
probe sstable-0003       miss
probe sstable-0002       miss
probe sstable-0001       miss
return None

Four layers, three disk reads, zero hits. The full pessimal case. Bloom filters will turn this from three disk reads into zero with very high probability.

Cost analysis

Break the read cost into its two probe types:

Each memtable probe is O(\log S_m) where S_m is the memtable size — a few microseconds in practice, all in RAM.

Each SSTable probe is:

  1. Binary-search the sparse index: O(\log N_{\text{blocks}}) where N_{\text{blocks}} is the number of data blocks in the SSTable (~16k for a 64 MB SSTable at 4 KB blocks). This is a pure RAM operation because the sparse index is typically cached (it is small relative to the file, at most a few hundred KB).
  2. Read one block from disk: one 4 KB I/O.
  3. Scan the block for the target key: O(\text{block-size}) linear scan, a few microseconds.

The dominant cost is the 4 KB disk read. On SSD that is ~100 µs; on a spinning disk it is ~10 ms (seek dominates).

Worst-case cost of a single lsm_get():

T_{\text{worst}} = M \cdot T_{\text{memtable-probe}} + N \cdot T_{\text{sstable-probe}}

where M is the number of memtables probed (at most 2: active plus frozen) and N is the number of SSTables, with T_{\text{sstable-probe}} \approx 100 µs on SSD. For N = 40 SSTables: 4 ms per read, or 250 reads/sec from a single thread. That is two orders of magnitude slower than a B-tree's ~40 µs per read.
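A back-of-envelope check of those numbers (the probe costs are the assumed ballpark figures from above, not measurements):

```python
SSTABLE_PROBE_SSD = 100e-6   # ~100 µs: one 4 KB block read on SSD
MEMTABLE_PROBE    = 2e-6     # ~2 µs: skip-list lookup in RAM

def worst_case_read(n_sstables, n_memtables=2):
    """T_worst = M * T_memtable + N * T_sstable, per the formula above."""
    return n_memtables * MEMTABLE_PROBE + n_sstables * SSTABLE_PROBE_SSD

t = worst_case_read(40)
print(f"{t * 1e3:.1f} ms per read")    # ~4.0 ms
print(f"{1 / t:.0f} reads/sec")        # ~250 from a single thread
```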

The cost is dominated by the O(N) term. If we can reduce the number of SSTables that have to be probed — ideally to 1 on average — the LSM read path becomes competitive with a B-tree. That is exactly what Bloom filters achieve, and it is why every production LSM engine has one.

Best-case cost: 1 memtable probe, 0 disk I/O, returns the hot key. Roughly 100 ns. Faster than any B-tree because the B-tree has to walk a few levels of its own tree even when the working set is hot in page cache.

The LSM read path is bimodal — very fast for hot keys, very slow for cold ones. The goal of every subsequent chapter in Build 3 is to narrow this gap: Bloom filters cut the false-positive probes, block caches cut the I/O cost, and compaction bounds N by merging old SSTables together. With all three in place, the worst-case looks like O(\log N) probes at ~1 block cache miss per probe, which is within a factor of 2-3 of a B-tree.

Going deeper

Everything above is enough to build a correct read path. The sections that follow fill in the operational concerns — read amplification as a metric, why Bloom filters are a near-necessity rather than an optimisation, block caches, and negative caching.

Read amplification as a metric

Read amplification is the ratio of disk operations performed to application-level reads requested. A single get("alice") that touches 40 SSTables and performs 40 block reads has a read amplification of 40. A B-tree's read amplification is typically 1-2 (one or two levels of the tree miss cache on average).

Read amplification is the headline disadvantage of LSM-trees. The write amplification is low (flushes are sequential and cheap; compactions batch I/O); the space amplification is low (compaction reclaims dead bytes); but the read amplification is linear in the number of SSTables unless you fight it actively.

There are three knobs that control read amplification:

  1. Number of SSTables: bounded by compaction policy. Tiered compaction keeps N small by merging aggressively; levelled compaction keeps N bounded per level but can have many levels. Chapter 19 compares them.
  2. Probability that a given SSTable actually contains the key: made small by Bloom filters (next chapter). A well-tuned Bloom filter gives a ~1% false-positive rate, so 40 SSTables reduce to ~0.4 actual block reads on average.
  3. Probability that the target block is in cache: handled by the block cache. A hot working set fits in cache and the "disk" read is actually a memory read.

With all three dialled in, the LSM read path can achieve read amplification under 2 — comparable to a B-tree. Without them, it is unusable.
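The three knobs multiply. A sketch of the expected disk reads for a full miss, with illustrative parameter values:

```python
def expected_block_reads(n_sstables, bloom_fpr, cache_hit_rate):
    """Expected disk reads for a key that misses everywhere (a sketch).

    Each SSTable probe survives the Bloom filter with probability bloom_fpr,
    and a surviving probe touches disk only on a block-cache miss."""
    return n_sstables * bloom_fpr * (1 - cache_hit_rate)

# 40 SSTables, no defences: 40 disk reads per miss
assert expected_block_reads(40, bloom_fpr=1.0, cache_hit_rate=0.0) == 40.0
# 1% Bloom FPR alone: ~0.4 reads
assert abs(expected_block_reads(40, bloom_fpr=0.01, cache_hit_rate=0.0) - 0.4) < 1e-9
# add a 90% block-cache hit rate: ~0.04 reads
assert abs(expected_block_reads(40, bloom_fpr=0.01, cache_hit_rate=0.9) - 0.04) < 1e-9
```

The model assumes independence between the knobs, which is optimistic (cache hit rates are workload-dependent), but it captures why the combined effect is multiplicative rather than additive.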

Bloom filters as near-necessity

Bloom filters are the single most important optimisation for LSM reads. The logic: before opening an SSTable's file and reading a block, first consult a small in-memory bit-array (the Bloom filter) that was built when the SSTable was flushed. The filter can definitively say "this key is not in this SSTable"; it can also say "this key might be in this SSTable," with a small controllable false-positive rate (typically 1%).

The impact: of 40 SSTables probed for a miss, 39.6 are skipped on average (0 I/O each), and only 0.4 are opened speculatively. That is a 100× reduction in disk reads for miss-heavy workloads — which, in systems with lots of unique keys, is most workloads.

Bloom filters cost memory (about 10 bits per key at 1% FPR, so ~125 KB per 100k-key SSTable) and cost a small constant CPU time per probe (a few hash computations and bit-array lookups). Both costs are negligible compared to what they save. Every production LSM engine enables them by default, and many have tunable FPR per level (tighter at the bottom, looser at the top, to balance memory across the tree).
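The 10-bits-per-key figure comes from the standard Bloom filter sizing result, bits per key = -ln(p) / (ln 2)^2 for a target false-positive rate p. The next chapter derives it; here is a quick sanity check:

```python
import math

def bloom_bits_per_key(fpr):
    """Optimally sized Bloom filter: m/n = -ln(p) / (ln 2)^2 (standard result)."""
    return -math.log(fpr) / (math.log(2) ** 2)

bpk = bloom_bits_per_key(0.01)
print(f"{bpk:.1f} bits/key at 1% FPR")   # ~9.6, usually rounded up to 10
print(f"{bpk * 100_000 / 8 / 1024:.0f} KB for a 100k-key SSTable")
```

Tightening the FPR by 10× (to 0.1%) costs only ~4.8 more bits per key — the logarithmic relationship is why per-level FPR tuning is cheap.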

The next chapter builds Bloom filters from scratch — the bit-array, the hash functions, the FPR/memory trade-off, and the integration with the SSTable reader.

Block caches

The SSTable on disk is organised into 4 KB data blocks (chapter 14's flush). Once a block has been read from disk, it is a candidate for block caching — keeping it in RAM for subsequent reads.

A block cache is an LRU (or ARC, or 2Q) map from (sstable_id, block_offset) to the block bytes. The read path consults it before issuing a disk read; a hit saves an I/O, a miss loads the block from disk and inserts it into the cache.

Block caches are much more memory-efficient than caching whole SSTables because they exploit locality: a hot key causes its block to be cached, and the ~30 neighbouring keys in the same block are cached with it for free. In practice, a block cache sized at 10-20% of working-set size catches 90%+ of reads on typical workloads.
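A minimal LRU block cache is a few lines with collections.OrderedDict. This is a sketch of the shape, not any engine's implementation:

```python
from collections import OrderedDict

class BlockCache:
    """LRU cache keyed by (sstable_id, block_offset) -> block bytes (a sketch)."""
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()

    def get(self, sstable_id, offset):
        key = (sstable_id, offset)
        if key not in self.blocks:
            return None                       # miss — caller reads from disk
        self.blocks.move_to_end(key)          # mark as most recently used
        return self.blocks[key]

    def put(self, sstable_id, offset, block):
        key = (sstable_id, offset)
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used

cache = BlockCache(capacity_blocks=2)
cache.put("sst-0042", 0, b"block-a")
cache.put("sst-0042", 4096, b"block-b")
cache.get("sst-0042", 0)                      # touch block-a: now most recent
cache.put("sst-0041", 0, b"block-c")          # over capacity: evicts block-b
assert cache.get("sst-0042", 4096) is None    # evicted
assert cache.get("sst-0042", 0) == b"block-a" # survived because it was touched
```

Production caches add sharding (to reduce lock contention), charging by byte size rather than block count, and pinning for index/filter blocks, but the keying scheme and eviction logic are the same shape.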

RocksDB's default block cache is 8 MB (tunable up to dozens of GB in production); Cassandra's "chunk cache" is similar. The one gotcha is interaction with the OS page cache — both are trying to cache disk blocks, and their combined memory usage can exceed either one's budget if not carefully managed. RocksDB offers cache_index_and_filter_blocks to explicitly decide what lives in the user-space block cache versus relying on the kernel.

Negative caching

A subtle read-path optimisation: cache misses, not just hits. If a get("nonexistent-key") walks 40 SSTables and finds nothing, it is expensive — but the same miss is likely to repeat (maybe the application is retrying, maybe the key is being polled). A negative cache stores "I recently looked for this key and found nothing," short-circuiting future identical reads.

Negative caching is powerful but dangerous: if the key is created after the negative-cache entry, reads will keep returning None until the entry expires. Most LSM engines do not do negative caching at the read-path level and instead rely on Bloom filters to make cold misses cheap in the first place. Cassandra has an explicit "key cache" that stores "this key is present in this SSTable at this offset" — positive caching with a different granularity than block caching. The design space is rich; most engines pick Bloom filters + block cache and call it a day.

Read path and compaction interplay

One last operational note. Compaction merges many SSTables into one, which reduces read amplification directly (fewer layers to probe) and indirectly (better Bloom filter accuracy, more concentrated block cache). Conversely, a workload with heavy write pressure creates SSTables faster than compaction can merge them, which causes read amplification to spike. Monitoring the number of SSTables per level is a key operational signal — if it climbs past the engine's target, reads get slower and slower until compaction catches up.

This is why LSM engines have "stall" mechanisms: if the SSTable count exceeds a threshold, the write path slows down (or blocks entirely) to let compaction catch up. The stall hurts writes but protects read latency; the alternative is unbounded degradation of both. Chapter 19 covers the policies.

Where this leads next

You now have a working LSM-tree store end-to-end. Writes go to a memtable. The memtable flushes to an SSTable when full. Reads check the memtable, then the frozen memtable, then the SSTables newest-to-oldest. Deletes tombstone keys; reads honour the tombstones. The algorithm is correct.

The algorithm is also slow. For a cold key that misses in the memtable, the read touches every SSTable on disk — 40 disk reads to conclude a key does not exist. That is the read-amplification problem, and without a fix, an LSM engine is strictly worse than a B-tree for point lookups.

The fix is Bloom filters: a tiny (10 bits per key) probabilistic data structure that tells you, with high confidence, whether a key is in an SSTable without opening the file. A well-tuned Bloom filter skips 99% of the false SSTable probes, cutting the 40 reads down to under 1 on average. Suddenly the LSM read path is competitive with B-trees — and Bloom filters are cheap enough that every production engine uses them by default.

The next chapter builds Bloom filters from zero: the bit array, the hash functions, the false-positive analysis, and the integration into SSTableReader.probe(). After that, block caches and compaction policies round out Build 3, and the LSM storage engine is a complete, production-shaped machine.

References

  1. Patrick O'Neil et al., The Log-Structured Merge-Tree (LSM-Tree) (Acta Informatica, 1996) — the foundational paper; §3 introduces the layered read rule and its amortised cost. Springer.
  2. Sanjay Ghemawat and Jeff Dean, LevelDB Implementation Notes — the real-world read path in the LSM engine that inspired RocksDB, Cassandra, and many others. github.com/google/leveldb.
  3. Siying Dong et al., Optimizing Space Amplification in RocksDB (CIDR 2017) — discusses read amplification, block caches, and Bloom filter configuration across levels. cidrdb.org.
  4. Chen Luo and Michael J. Carey, LSM-based Storage Techniques: A Survey (VLDB Journal, 2020) — comprehensive survey; §3.2 covers read path variants and their read amplification. arxiv.org/abs/1812.07527.
  5. Avinash Lakshman and Prashant Malik, Cassandra — A Decentralized Structured Storage System (LADIS 2009) — Cassandra's read path across memtable and SSTables, with key cache and Bloom filter integration. cassandra.apache.org.
  6. Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 — the classroom-friendly walkthrough of LSM read paths, tombstones, and Bloom filter motivation. dataintensive.net.