In short

Flushes (chapter 14) create SSTables. Every flush creates one more. After a few hours of writes you have hundreds. Each one can hold an older version of a key that a newer SSTable has already overwritten, or a tombstone whose corresponding value is still sitting in a deeper file. Reads pay for this pile-up — even with Bloom filters (chapter 16) the filters themselves cost RAM in proportion to the file count, and the probability of at least one false positive per read climbs with the number of files. Compaction is the background process that periodically picks some SSTables, merges them, drops the overridden and tombstoned records, and writes the result back out as fewer (and larger) SSTables. This chapter covers leveled compaction, the design LevelDB invented and RocksDB inherited. Files are organised into levels L0, L1, L2, ... where each level Lk holds roughly 10× as much data as Lk-1. Level 0 is special — flushes land there, and L0 files may overlap in key range. Every level below L0 has the non-overlap invariant: within any single level Lk (k ≥ 1), no two files share a key. Compaction picks one file from Lk, finds every file in Lk+1 that overlaps its key range (usually a handful), merges them with the classic sorted merge from chapter 12, splits the output into new Lk+1 files at roughly the target size, and atomically swaps them into place. The payoff: reads touch at most one file per level below L0 (non-overlap → at most one candidate), so read amplification is O(L) where L is the number of levels (typically 6 or 7 for a few hundred GB of data); write amplification is O(L · log_T N) where T is the size multiplier (~10) — every byte gets rewritten about 10 times over its lifetime as it migrates from L1 to L6; and space amplification is tiny, about 1.1× the live-data size, because leveled compaction is aggressive about reclaiming dead bytes. This chapter builds the pyramid, draws the picker, traces a full pass, and derives the numbers.

You built the write path. Writes land in the memtable (chapter 13), the memtable flushes to an SSTable (chapter 14), reads stitch the memtable and the SSTable stack together (chapter 15), and Bloom filters skip most of the unhelpful SSTable probes (chapter 16).

The machine runs. But run it for a day.

$ ls -lh /var/lib/lsm/
-rw-r--r-- 64M sstable-0001.sst
-rw-r--r-- 64M sstable-0002.sst
-rw-r--r-- 64M sstable-0003.sst
...
-rw-r--r-- 64M sstable-1342.sst

A thousand SSTables. Each one is a historical snapshot of some 64 MB slice of the memtable as it existed at flush time. Keys are everywhere — user:42:email might sit in sstable-0412 (the value you care about), and also in sstable-0087 (an older put for the same key, overwritten since), and also in sstable-0029 (the original put from a week ago). The newest one wins for reads, but the older copies are still on disk, eating space, slowing things down.

You have a garbage problem. On-disk garbage, specifically — bytes that nothing will ever read, because newer records have superseded them. The machinery to clean it up is compaction, and this chapter builds the specific policy that LevelDB, RocksDB, and a dozen other engines use: leveled compaction.

What compaction has to do

Compaction is just a merge. You already have the merge primitive — chapter 12 built the streaming k-way merge that walks multiple sorted inputs in lockstep and emits the combined sorted output. What changes is the choice of inputs and the organisation of the output.

Three things any compaction must do, regardless of policy:

  1. Pick some input SSTables. Fewer is faster; more reclaims more garbage in a single pass. The picker is the heart of the policy.
  2. Merge them. A standard k-way merge. When the same key appears in multiple inputs, keep only the newest version (by the SSTable ordering). When a tombstone is the newest version for a key, drop the whole key from the output (assuming no older SSTable outside the merge still holds that key — more on this caveat later).
  3. Write the output. One or more new SSTables replace the inputs. The inputs are deleted. Reads see the new files instead.

Leveled compaction is one answer to question (1). Tiered compaction (next chapter) is a different answer. The merge itself is the same.
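The three steps fit in a few lines when the SSTables are modelled as in-memory sorted lists. A sketch, not engine code — records here are (key, -seq, is_tomb, value) tuples, with the sequence number negated so heapq.merge yields the newest version of each key first:

```python
import heapq

def compact(inputs, bottommost=False):
    """Merge sorted runs, keep only the newest version per key, and drop
    tombstones when there is no deeper level left for them to shadow."""
    out, last_key = [], object()
    for key, neg_seq, is_tomb, value in heapq.merge(*inputs):
        if key == last_key:
            continue                      # older version of a key already emitted
        last_key = key
        if is_tomb and bottommost:
            continue                      # nothing below — tombstone can vanish
        out.append((key, -neg_seq, is_tomb, value))
    return out

newer = [("a", -7, False, "A2"), ("b", -8, True, None)]   # seqs 7 and 8
older = [("a", -3, False, "A1"), ("c", -2, False, "C1")]  # seqs 3 and 2
print(compact([newer, older], bottommost=True))
# "a" keeps seq 7, the tombstoned "b" is dropped entirely, "c" survives
```

At a non-bottommost level the same call with bottommost=False keeps the "b" tombstone, exactly the caveat in step 2.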

The pyramid

Leveled compaction organises SSTables into levels, indexed L0, L1, L2, and so on. Each level has a byte budget that is T times the budget of the level above — typically T = 10.

[Figure: the leveled compaction pyramid — each Lk holds ~10× the data of Lk-1. Seven levels stacked vertically, each wider than the one above. L0 (top): four small SSTables with overlapping key ranges (sst-A a..z, sst-B c..w, sst-C m..z, sst-D b..p), ~256 MB total, overlap allowed. L1: four non-overlapping files covering a..h, h..o, o..v, v..z, ~256 MB. L2: ~40 files of 64 MB each, ~2.5 GB. L3: ~400 files, ~25 GB. L4: ~4 000 files, ~250 GB. L5: ~40 000 files, ~2.5 TB. L6 (bottommost, oldest data): ~25 TB. Every level below L0 partitions the key space into non-overlapping ~64 MB SSTables; compaction picks a file from Lk, finds the overlapping files in Lk+1, merges them, and splits the output into new Lk+1 files.]
The leveled pyramid. Level 0 receives flushes and allows overlap — files there are just whatever came out of the memtable, one after another. Below L0, each level is strictly non-overlapping: the key space is partitioned into ~64 MB chunks, one chunk per SSTable. Each level holds ~10× the data of the one above, so a 25 TB database needs about 7 levels. Compaction's job is to keep each level within its budget by pushing overflow down to the next level.

Two invariants do all the work.

Invariant A (size): Each level Lk has a byte budget, budget(Lk) = T · budget(Lk-1). When a level exceeds its budget, compaction kicks in to push data down to Lk+1.

Invariant B (non-overlap, k ≥ 1): Within any level Lk with k ≥ 1, no two files share any key in their key range. The whole level together partitions some subset of the key space into disjoint intervals.

Invariant A is what bounds the number of levels — the total capacity through level L is budget(L0) · (1 + T + T² + ... + T^L) ≈ budget(L0) · T^L, so for 25 TB of data with a 256 MB L0 budget (four 64 MB files), L ≈ log_10(25 TB / 256 MB) ≈ 5. Six levels cover it.
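The arithmetic is easy to check mechanically — a sketch assuming a 256 MB base budget and T = 10, the chapter's running constants:

```python
def levels_needed(total_bytes, base_budget=256 * 2**20, T=10):
    """Smallest L such that base_budget * (1 + T + ... + T^(L-1)) >= total_bytes."""
    L, capacity = 1, base_budget
    while capacity < total_bytes:
        L += 1
        capacity += base_budget * T ** (L - 1)
    return L

print(levels_needed(25 * 10**12))    # 25 TB → 6 levels below L0
```

The geometric series is dominated by its last term, which is why the answer barely moves when the data doubles — it takes a full 10× growth to add a level.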

Invariant B is what makes reads fast: to find a key in a level with non-overlapping files, you do one binary search on the level's file-range index to find the single candidate file, and probe that file (or skip it via its Bloom filter). One file per level, at most. Across L levels plus L0's up-to-4 overlapping files, a read looks at at most 4 + L files — say 4 + 6 = 10 for a healthy configuration.
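The per-level probe can be sketched as a binary search over the level's sorted file ranges (a toy version — rebuilding the key list on every call is O(n); a real engine keeps it materialised):

```python
import bisect

def candidate_file(files, key):
    """files: sorted, non-overlapping (smallest_key, largest_key) pairs.
    Return the index of the only file that could contain key, or None."""
    i = bisect.bisect_left([f[1] for f in files], key)  # first file whose largest >= key
    if i < len(files) and files[i][0] <= key:
        return i
    return None

level = [("a", "h"), ("i", "o"), ("p", "z")]
print(candidate_file(level, "m"))    # → 1
print(candidate_file(level, "hz"))   # → None (falls in the gap between files)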

The level picker

Compaction is triggered whenever some level exceeds its budget. The scheduler needs to pick, among all over-budget levels, which one to compact, and which file within it to push down. LevelDB uses a priority score per level:

# leveled_compaction.py — pick the next compaction
def compaction_score(level, levels_state):
    """Return a score per level. Higher score = more urgent to compact."""
    if level == 0:
        return len(levels_state[0].files) / 4.0          # target: 4 files
    else:
        budget = L0_BUDGET_BYTES * (T ** (level - 1))    # 256 MB × 10^(k-1) for k ≥ 1
        return levels_state[level].total_bytes / budget

def pick_compaction(levels_state):
    """Return (src_level, src_file) to compact next, or None if nothing is over budget."""
    scores = [(compaction_score(k, levels_state), k) for k in range(MAX_LEVEL + 1)]
    best_score, best_level = max(scores)
    if best_score < 1.0:
        return None                                      # nothing over budget
    level = levels_state[best_level]
    src_file = pick_file_in_level(level, best_level)
    return (best_level, src_file)

def pick_file_in_level(level, k):
    """Which file from level k should be compacted down?"""
    if k == 0:
        return level.files[0]                            # oldest L0 file
    # For Lk, k>=1: round-robin the key space so no single key range
    # gets hammered repeatedly. LevelDB remembers the last compacted key
    # per level and picks the next file whose largest key > last_key.
    for f in level.files_sorted_by_key:
        if f.largest_key > level.last_compacted_key:
            return f
    return level.files_sorted_by_key[0]                  # wrap around

Why score by ratio to budget rather than raw bytes: ratios are comparable across levels. L0 at 5 files is 25% over budget; L3 at 26 GB (budget 25 GB) is 4% over budget. Both are over, but the L0 pressure is more urgent — reads go through L0 on every query, so letting L0 bloat slows every read. Scoring by used/budget naturally sorts by urgency.

Why pick_file_in_level rotates through the key space: if the picker always picked the file with the most data, say, you would compact the same "hot" region of the key space over and over, wearing out its SSDs (on SSDs, write amplification becomes endurance amplification) and starving the "cold" regions of compaction. Round-robin by key range spreads the wear evenly. LevelDB stores the "last compacted key" per level in its MANIFEST so the rotation is durable across restarts.

The compaction pass

Picking (src_level, src_file) is half the job. The other half is: find every file in src_level + 1 that overlaps src_file, and merge them all together.

def overlapping_files(target_level, src_key_range):
    """Return all files in target_level whose key range overlaps src_key_range."""
    lo, hi = src_key_range
    return [f for f in target_level.files_sorted_by_key
            if f.largest_key >= lo and f.smallest_key <= hi]

def do_compaction(src_level_idx, src_file, levels_state):
    """Compact src_file from Lk into its overlapping files in Lk+1."""
    dst_level = levels_state[src_level_idx + 1]
    src_range = (src_file.smallest_key, src_file.largest_key)
    dst_overlaps = overlapping_files(dst_level, src_range)

    inputs = [src_file] + dst_overlaps                   # all files to merge
    out_files = []
    writer = SSTableWriter.new(level=src_level_idx + 1)

    is_bottommost = (src_level_idx + 1 == MAX_LEVEL)
    last_key = None
    for key, value, seq, is_tomb in merge_sorted(inputs):
        # dedupe: only emit the newest version of each key
        if key == last_key:
            continue                                     # older version, drop
        last_key = key
        # tombstone optimisation: drop tombstones at the bottommost level
        if is_tomb and is_bottommost:
            continue                                     # nothing below to hide
        writer.append(key, value, seq, is_tomb)
        if writer.size_bytes >= TARGET_FILE_BYTES:
            out_files.append(writer.close())
            writer = SSTableWriter.new(level=src_level_idx + 1)
    if writer.size_bytes > 0:
        out_files.append(writer.close())

    # atomic swap: delete the inputs, install the outputs
    with levels_state.lock:
        for f in inputs:
            levels_state[f.level].remove(f)
            schedule_delete(f.path)
        for f in out_files:
            levels_state[src_level_idx + 1].add(f)
    return out_files

Why include src_file itself in inputs: it is one of the sorted streams. The merge walks the union of all inputs — src_file from Lk and the overlapping files from Lk+1 — and emits the newest version of each key. The src file's records will typically be newer (they are migrating down from a higher level), but the merge does not care; it uses per-record sequence numbers to decide newest, which is the robust answer.

Why split the output at TARGET_FILE_BYTES: the non-overlap invariant says files within a level must be disjoint in key range, but nothing forces them to be one single giant file. Splitting keeps individual file sizes bounded (~64 MB), which bounds the next compaction's input size when one of these files gets picked to be pushed down to Lk+2. If you let a single file grow to be the entire level, every future compaction at this level would have to rewrite the whole level.

Why the "bottommost" tombstone drop is correct only at the bottommost level: a tombstone's job is to shadow older puts for the same key in lower levels. If there is no lower level, the tombstone has nothing to shadow and can be deleted entirely. At non-bottommost levels, the tombstone must be preserved — a put for the same key might exist in Lk+2 or deeper, and dropping the tombstone would expose that old put to reads (silent data resurrection). This is the single most error-prone detail in writing a compactor.

The merge itself uses the same merge_sorted primitive from chapter 12. Because every input file is individually sorted, and the merge iterator walks them in lockstep pulling the smallest key at each step, the output is sorted too. Two records with the same key arrive consecutively from the merge — the one with the higher sequence number (newer) comes first, and the loop drops the older duplicates with if key == last_key: continue.

A full compaction, traced

Compacting one L1 file into L2

Suppose the store has this state:

L0: (4 files, allowed to overlap — not involved in this compaction)

L1 (256 MB target, 320 MB used — 25% over budget, score 1.25):
    L1-a:  keys a..e   64 MB
    L1-e:  keys e..h   64 MB           (file IDs reflect creation order, not key order)
    L1-b:  keys h..o   64 MB           ← picked: largest_key > last_compacted_key
    L1-c:  keys o..v   64 MB
    L1-d:  keys v..z   64 MB           (five non-overlapping files — Invariant B holds;
                                         the level is simply one file's worth of data
                                         over its budget, and compaction will drain it)

L2 (2.5 GB target, 1.8 GB used, not over budget):
    L2-1:  keys a..c   64 MB
    L2-2:  keys c..g   64 MB
    L2-3:  keys g..k   64 MB
    L2-4:  keys k..n   64 MB
    L2-5:  keys n..r   64 MB
    L2-6:  ... etc

The picker scores L1 at 1.25 and L2 at 0.72; L1 wins. Within L1 it picks L1-b (keys h..o). Now compact it:

pick_compaction() →  ("L1", L1-b)
src_range = ("h", "o")
overlapping_files(L2, ("h","o")) = [L2-3 (g..k), L2-4 (k..n), L2-5 (n..r)]
    (L2-5 qualifies because its smallest key "n" ≤ "o" and its largest
     key "r" ≥ "h" — the two ranges share the span n..o)

inputs = [L1-b, L2-3, L2-4, L2-5]    # 4 × 64 MB = 256 MB of input

merge_sorted(inputs) walks all four in lockstep, emitting the newest version
of every key:
    for each key k in the union of the four files' key sets:
        emit (k, newest_value_for_k)

Output: one logical sorted stream covering keys roughly g..r
        (spans the union of the inputs' ranges).

writer splits output at 64 MB boundaries:
    L2-3':  keys g..k   64 MB
    L2-4':  keys k..n   64 MB
    L2-5':  keys n..r   64 MB
    (the output is slightly smaller than the input total because duplicate
     keys have been deduped and the tombstones have dropped some records)

Atomic swap (inside levels_state.lock):
    L1 loses L1-b           → L1 now has a, c, d, e (256 MB, at budget)
    L2 loses L2-3, L2-4, L2-5 and gains L2-3', L2-4', L2-5'
    → L2 now has L2-1, L2-2, L2-3', L2-4', L2-5', L2-6, ...
       (still non-overlapping by construction, still ~1.8 GB total)

Delete the old files on disk (once no reader holds a reference).

Four input files, three output files, one net file deletion from L1 (good — that's the point; it drained one file's worth of data down). L1's score drops back below 1.0 and the picker stops waking up about L1 for a while. The next compaction might be triggered by L0 (flushes keep filling it), by a different L1 file if L1 goes over budget again, or by some deeper level.

What happened to reads during all this? They saw a consistent view throughout. Before the atomic swap, L1-b and the three L2 files were the authoritative source; after the swap, the three L2-3'/L2-4'/L2-5' files are. No read ever sees "key temporarily missing" because the swap installs the new files before deleting references to the old ones, and readers holding a reference to an old file continue to see it until they finish their read (the reference-count keeps the old file alive; actual disk deletion is deferred).
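The deferred-deletion scheme in that paragraph amounts to a reference count per file. A minimal single-threaded sketch (a real engine guards this with its version-set lock; the names here are illustrative):

```python
class FileHandle:
    def __init__(self, path):
        self.path = path
        self.refs = 1            # the level's own reference
        self.deleted = False     # removed from the level, pending unlink
        self.on_disk = True

    def acquire(self):           # a reader pins the file before probing it
        self.refs += 1

    def release(self):
        self.refs -= 1
        if self.refs == 0 and self.deleted:
            self.on_disk = False          # the actual unlink happens only now

    def mark_deleted(self):      # compaction swaps the file out of the level
        self.deleted = True
        self.release()                    # drop the level's own reference

f = FileHandle("sstable-0412.sst")
f.acquire()          # reader pins the file mid-lookup
f.mark_deleted()     # compaction removes it from the level
print(f.on_disk)     # True — the in-flight reader keeps it alive
f.release()          # reader finishes
print(f.on_disk)     # False — disk space reclaimed
```

The invariant: a file's bytes outlive its level membership for exactly as long as some reader can still reach it.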

Write, read, and space amplification

Leveled compaction has three numbers that matter, and they all fall out of the level structure.

Write amplification. Every byte written to the database is (on average) rewritten once per level as it migrates from L1 down to L6. At each level below L0, compaction takes ~1 file of size F from Lk and merges it with the overlapping ~T files in Lk+1 (because Lk+1 is T times wider in key space than a single Lk file), so the compaction writes (T+1)·F bytes of output per F bytes of input. The per-level amplification is T + 1. Over L levels:

\text{WA}_{\text{leveled}} \approx (T + 1) \cdot L

With T = 10 and L = 6, that is 11 × 6 = 66 — every byte is written to disk about 66 times over its lifetime. That is a lot, and it is leveled compaction's main weakness. RocksDB's compaction_readahead_size, batched I/O, and sub-compactions (Going deeper) exist to keep this from killing SSDs.

(The more careful derivation treats L0 as ~4 times worse because L0 files overlap and must each be merged with every other L0 file when L0 → L1 happens. LevelDB's L0 → L1 compaction reads all L0 files plus the overlapping L1 files, yielding an L0→L1 amplification that is ~(#L0 + overlap_in_L1). For typical #L0 = 4 and T = 10, it is still dominated by the (T+1)·L term.)

Read amplification. A point lookup touches:

\text{RA}_{\text{leveled, worst-case}} = \underbrace{4}_{\text{L0 files}} + L = 4 + 6 = 10 \text{ file probes}

With 1% FPR Bloom filters, expected disk reads per point lookup for a key that exists in some level k: one confirmed read at level k, plus ~0.1 false-positive reads elsewhere. Total: ~1.1 disk reads per point lookup. This is why leveled compaction is the default choice for read-heavy workloads.

Space amplification. The ratio of on-disk bytes to logical live-data bytes. Leveled's non-overlap invariant is crucial here: within any level Lk (k ≥ 1), a given key appears at most once. The same key can appear in multiple levels (an older version in L5, a newer one in L2), but compaction is constantly draining older copies down and deduplicating them on the way. The steady-state space overhead is dominated by the bottommost level's size relative to everything above it:

\text{SA}_{\text{leveled}} \approx 1 + \frac{1}{T} + \frac{1}{T^2} + \dots \approx \frac{T}{T - 1} \approx 1.11 \text{ (for } T = 10\text{)}

That is astonishingly tight — 10% space overhead. For a 100 GB logical dataset you need about 110 GB of disk. Contrast with tiered compaction (next chapter), which can temporarily need 2-3× the live data size.
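All three numbers fall out of a few constants. A quick check using the chapter's running figures — T = 10, L = 6 levels below L0, up to 4 L0 files, 1% Bloom FPR:

```python
T, L, L0_FILES, BLOOM_FPR = 10, 6, 4, 0.01

write_amp = (T + 1) * L                     # each migrating byte rewritten per level
read_amp_files = L0_FILES + L               # file probes per point lookup
expected_disk_reads = 1 + (read_amp_files - 1) * BLOOM_FPR  # Bloom skips the rest
space_amp = T / (T - 1)                     # geometric series 1 + 1/T + 1/T² + ...

print(write_amp, read_amp_files, round(expected_disk_reads, 2), round(space_amp, 2))
# 66 10 1.09 1.11
```

Changing T moves all three at once — a smaller T lowers write amplification per level but adds levels, which is exactly the knob the Going deeper section turns.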

Summary table:

Policy                   Write amp   Read amp (files)   Space amp
Leveled (T=10, L=6)      ~66         ~10                ~1.11
Tiered (next chapter)    ~20         ~30                ~2.0

Leveled pays for its excellent read and space numbers in write amplification. That trade is the whole story of the next two chapters.

Going deeper

You have the pyramid and the picker. This section covers the four tuning knobs and the one optimisation that make production leveled compaction actually work — level_size_multiplier tuning, compaction priority heuristics, subcompactions, and auto-compaction backpressure.

Tuning level_size_multiplier

The multiplier T (typically 10) controls the trade-off between the number of levels and the per-level compaction cost.

RocksDB exposes the level structure through max_bytes_for_level_base (the base-level byte budget) and max_bytes_for_level_multiplier — the multiplier is T itself, default 10. Tuning it down to 8 or up to 15 is common; going outside [5, 20] is unusual. (The level0_file_num_compaction_trigger and level0_slowdown_writes_trigger knobs govern L0 and are covered under backpressure below.)

A subtle trick: variable T per level. RocksDB supports max_bytes_for_level_multiplier_additional, which lets you give each level its own multiplier. Increasing T at deeper levels (say, 8 at L1→L2, 10 at L2→L3, 15 at L5→L6) makes the pyramid steeper at the bottom, shrinking the upper levels relative to the bottommost one — and since space amplification is roughly total size over bottommost size, that shrinks space amplification too. Facebook's MyRocks deployment uses this tuning to keep space overhead under 10% on multi-TB databases.
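The effect is easy to see numerically — a sketch with illustrative multipliers (not production values), treating the bottommost level as the live data:

```python
def budgets(l1_budget, multipliers):
    """Per-level byte budgets for L1..Ln given a per-step multiplier list."""
    out = [l1_budget]
    for m in multipliers:
        out.append(out[-1] * m)
    return out

for name, mults in (("uniform T=10 ", [10] * 5),
                    ("steeper bottom", [8, 10, 10, 12, 15])):
    b = budgets(256, mults)                       # MB, levels L1..L6
    print(name, "SA ≈", round(sum(b) / b[-1], 3))  # total / bottommost
```

The steeper pyramid concentrates a larger fraction of all bytes in the bottommost level, so the ratio of total to live data drops.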

Compaction priority heuristics

The naive "compact the level with the highest score" rule is good enough most of the time, but production engines add a second layer of choice: within a level, which file should be picked? RocksDB exposes this as compaction_pri, with several strategies (kByCompensatedSize, kOldestLargestSeqFirst, kOldestSmallestSeqFirst, kMinOverlappingRatio):

kMinOverlappingRatio is the one that makes the biggest difference on real workloads. A file that overlaps 20 Lk+1 files is 2× as expensive to compact as one that overlaps 10. Picking the cheap ones first reduces total I/O. The cost is a less-uniform wear pattern on the key space, but in practice the distribution of overlap counts is wide enough that you still visit every key region eventually.

Subcompactions

A compaction of one Lk file against ~T Lk+1 files is a single-threaded merge by default. For large files (say, 256 MB files at deep levels, compacting against 10 of them → 2.5 GB of input), the merge takes long enough that it becomes a bottleneck.

Subcompaction splits the compaction work across multiple threads. The key range of the compaction is divided into N disjoint sub-ranges (typically by looking at the sparse index of the source file), and each sub-range is compacted by a dedicated worker thread. Since the sub-ranges are disjoint, there is no coordination overhead — each worker reads its portion of each input file, merges them, and writes its portion of the output files. The outputs from the sub-workers are already sorted (by sub-range), so they can be concatenated into the final output files without re-merging.
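The range-splitting step can be sketched as follows, assuming the source file's sparse index is available as a sorted list of keys (helper names are illustrative; None stands for an unbounded end):

```python
def split_into_subranges(index_keys, n_workers):
    """Divide the compaction's key span into up to n disjoint (lo, hi)
    half-open sub-ranges by sampling boundaries from the sparse index."""
    step = max(1, len(index_keys) // n_workers)
    bounds = index_keys[step:len(index_keys) - 1:step][:n_workers - 1]
    edges = [None] + bounds + [None]       # None = open-ended at that side
    return list(zip(edges, edges[1:]))

keys = [chr(c) for c in range(ord("a"), ord("z") + 1)]
print(split_into_subranges(keys, 4))
# → [(None, 'g'), ('g', 'm'), ('m', 's'), ('s', None)]
```

Because the sub-ranges partition the key space, each worker's output is disjoint from the others', which is what lets the results be concatenated without a final re-merge.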

RocksDB's max_subcompactions controls this; default is 1 (no parallelism), production values are 4-8. Subcompactions can halve the wall-clock time of a large compaction at the cost of proportional CPU and I/O peak load. They are essential for multi-GB SSTables.

Auto-compaction backpressure

If writes arrive faster than compactions can drain L0, L0 grows unboundedly. That kills reads (every read probes all L0 files), and eventually L0 outgrows RAM for its Bloom filters. RocksDB has three backpressure levels:

  1. level0_file_num_compaction_trigger (default 4): L0 has more than this many files → compaction gets scheduled.
  2. level0_slowdown_writes_trigger (default 20): L0 is uncomfortably full → write path artificially slows down (each put sleeps for a few hundred microseconds) to give compactions time to catch up.
  3. level0_stop_writes_trigger (default 36): L0 is about to blow up → write path blocks completely until a compaction completes.

This is "write stall" — users see write throughput collapse, sometimes to zero for hundreds of milliseconds, while the system digests its backlog. The design philosophy: better to slow writes temporarily than to let the read path degrade unpredictably. Cassandra, HBase, and every other LSM engine has analogous backpressure, though with different triggers and different default values.

Tuning these three values is one of the core operational tasks for a RocksDB deployment. Too tight → writes stall often. Too loose → reads get slow under write pressure. The defaults (4, 20, 36) are tuned for a specific workload shape; production deployments usually tune them based on measured compaction throughput and L0 build-up rate.
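The three thresholds condense into a small decision function — a sketch using the default values quoted above (the function name is illustrative):

```python
def write_state(l0_file_count, trigger=4, slowdown=20, stop=36):
    """Map the current L0 file count to the write path's state."""
    if l0_file_count >= stop:
        return "stop"        # writes block until a compaction completes
    if l0_file_count >= slowdown:
        return "slowdown"    # each put sleeps briefly to let compaction catch up
    if l0_file_count > trigger:
        return "compact"     # schedule L0 → L1 compaction; writes unaffected
    return "ok"

print([write_state(n) for n in (3, 5, 20, 36)])
# ['ok', 'compact', 'slowdown', 'stop']
```

The two upper thresholds are deliberately far apart: the slowdown band gives compaction a long runway before the hard stop ever fires.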

Move compaction — the optimisation worth knowing

Sometimes a compaction picks an Lk file that has no overlap at all in Lk+1. The L1 file covers h..o, and no L2 file covers any key in h..o. In this case there is nothing to merge — the compaction output is identical to the input, byte for byte.

Leveled compaction optimises this: instead of reading and rewriting the file, it moves the file from Lk to Lk+1 with just a metadata change. No I/O. No rewrite. The file's level counter goes up, and the MANIFEST is updated.
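That pre-merge check is tiny. A sketch, reimplementing the overlap test so the block stands alone (the dict shape is illustrative):

```python
def overlaps(files, lo, hi):
    """Files in the next level whose key range intersects [lo, hi]."""
    return [f for f in files if f["largest"] >= lo and f["smallest"] <= hi]

def plan(src_file, next_level_files):
    """Decide between a metadata-only move and a full merge."""
    if not overlaps(next_level_files, src_file["smallest"], src_file["largest"]):
        return "move"    # no overlap: just bump the file's level in the MANIFEST
    return "merge"       # overlap: full read-merge-write compaction

l2 = [{"smallest": "a", "largest": "c"}, {"smallest": "p", "largest": "z"}]
print(plan({"smallest": "h", "largest": "o"}, l2))   # → move
print(plan({"smallest": "b", "largest": "o"}, l2))   # → merge
```

A move also preserves both invariants for free: the file was disjoint from its old level-mates, and the guard just proved it disjoint from its new ones.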

Move compactions are surprisingly common in practice — about 10-20% of compactions in typical workloads. They are the reason the "66× write amplification" number is an upper bound; real write amplification is more like 15-25× because move compactions skip the rewrite for a large fraction of levels.

LevelDB implements this as the "trivial move" optimisation, and RocksDB's leveled compactor performs the same check before starting a merge. The machinery is small — one overlap test in the compactor — but the savings are enormous.

Where this leads next

You now have leveled compaction: the pyramid of levels, the non-overlap invariant, the picker, the per-compaction merge, and the three amplification numbers that define the policy's character. Reads are fast, space is tight, writes pay for both.

The next chapter builds the opposite extreme — tiered compaction (Cassandra's default, originally from Google's Bigtable). Tiered compaction does not maintain a non-overlap invariant. Instead, each level is a "tier" of same-size SSTables that are all overlapping, and compaction merges an entire tier's worth of files at once into a single file at the next tier. The effect: much lower write amplification (each byte is written ~L times total, not L·T), but much worse read amplification (every tier has many files, all possibly overlapping, all must be probed) and worse space amplification (during the merge, the old tier and the new file coexist, temporarily doubling the level's size).

Leveled and tiered sit at opposite ends of a three-way trade-off: write amplification, read amplification, space amplification. You can pick any two, but not all three. The next two chapters make this explicit, and the final chapter of Build 3 — universal compaction — shows how RocksDB's production compactor tries to let you dial the trade-off point at runtime, rather than committing to one corner.

For now, take a moment with the shape you have. A memtable absorbs writes. Flushes drop SSTables into L0. Leveled compaction pushes data down the pyramid, keeping every level below L0 non-overlapping and every level within its size budget. Reads pick the newest version of each key by walking memtables → L0 → L1 → ... → L6, gated by Bloom filters and Invariant B. This is the full LevelDB architecture, and it is the architecture RocksDB, CockroachDB, TiDB, and much of the modern database landscape is built on. A hundred lines of policy code — the picker, the merger, the scheduler — turn a chaotic pile of SSTables into an organised, read-efficient, space-efficient storage engine.

References

  1. Sanjay Ghemawat and Jeff Dean, LevelDB Implementation Notes — the original leveled-compaction design, including the file-picker heuristic, the "trivial move" optimisation, and the non-overlap invariant. github.com/google/leveldb.
  2. Siying Dong et al., Optimizing Space Amplification in RocksDB (CIDR 2017) — quantitative analysis of leveled vs tiered write/read/space amplification, and RocksDB's tuning knobs. cidrdb.org.
  3. Niv Dayan, Manos Athanassoulis, and Stratos Idreos, Monkey: Optimal Navigable Key-Value Store (SIGMOD 2017) — proves the optimal per-level Bloom filter sizing in a leveled LSM and quantifies read amplification. stratos.seas.harvard.edu.
  4. Chen Luo and Michael J. Carey, LSM-based Storage Techniques: A Survey (VLDB Journal, 2020) — comprehensive survey of leveled, tiered, and hybrid compaction policies. arxiv.org/abs/1812.07527.
  5. Facebook RocksDB team, Leveled Compaction (RocksDB wiki) — the production documentation for RocksDB's leveled policy, including max_bytes_for_level_multiplier, compaction_pri, and subcompactions. github.com/facebook/rocksdb/wiki.
  6. Patrick O'Neil et al., The Log-Structured Merge-Tree (LSM-Tree) (Acta Informatica, 1996) — the foundational paper; §3 derives the multi-level merge cost that leveled compaction implements. Springer.