In short
Flushes (chapter 14) create SSTables. Every flush creates one more. After a few hours of writes you have hundreds. Each one can hold an older version of a key that a newer SSTable has already overwritten, or a tombstone whose corresponding value is still sitting in a deeper file. Reads pay for this pile-up — even with Bloom filters (chapter 16) the filters themselves cost RAM in proportion to the file count, and the probability of at least one false positive per read climbs with the number of files. Compaction is the background process that periodically picks some SSTables, merges them, drops the superseded and tombstoned records, and writes the result back out as fewer (and larger) SSTables. This chapter covers leveled compaction, the design LevelDB invented and RocksDB inherited. Files are organised into levels L0, L1, L2, ... where each level Lk holds roughly 10× as much data as Lk-1. Level 0 is special — flushes land there, and L0 files may overlap in key range. Every level below L0 has the non-overlap invariant: within any single level Lk (k ≥ 1), no two files share a key. Compaction picks one file from Lk, finds every file in Lk+1 that overlaps its key range (usually a handful), merges them with the classic sorted merge from chapter 12, splits the output into new Lk+1 files at roughly the target size, and atomically swaps them into place. The payoff: reads touch at most one file per level below L0 (non-overlap → at most one candidate), so read amplification is O(L) where L is the number of levels (typically 6 or 7 for a few hundred GB of data); write amplification is O(T · log_T N) where T is the size multiplier (~10) — every byte gets rewritten about T + 1 times per level, roughly 66 times in total, as it migrates from L1 down to L6; and space amplification is tiny, about 1.1× the live-data size, because leveled compaction is aggressive about reclaiming dead bytes. This chapter builds the pyramid, draws the picker, traces a full pass, and derives the numbers.
You built the write path. Writes land in the memtable (chapter 13), the memtable flushes to an SSTable (chapter 14), reads stitch the memtable and the SSTable stack together (chapter 15), and Bloom filters skip most of the unhelpful SSTable probes (chapter 16).
The machine runs. But run it for a day.
$ ls -lh /var/lib/lsm/
-rw-r--r-- 64M sstable-0001.sst
-rw-r--r-- 64M sstable-0002.sst
-rw-r--r-- 64M sstable-0003.sst
...
-rw-r--r-- 64M sstable-1342.sst
A thousand SSTables. Each one is a historical snapshot of some 64 MB slice of the memtable as it existed at flush time. Keys are everywhere — user:42:email might sit in sstable-0412 (the value you care about), and also in sstable-0087 (an older put for the same key, overwritten since), and also in sstable-0029 (the original put from a week ago). The newest one wins for reads, but the older copies are still on disk, eating space, slowing things down.
You have a garbage problem. On-disk garbage, specifically — bytes that nothing will ever read, because newer records have superseded them. The machinery to clean it up is compaction, and this chapter builds the specific policy that LevelDB, RocksDB, and a dozen other engines use: leveled compaction.
What compaction has to do
Compaction is just a merge. You already have the merge primitive — chapter 12 built the streaming k-way merge that walks multiple sorted inputs in lockstep and emits the combined sorted output. What changes is the choice of inputs and the organisation of the output.
Three things any compaction must do, regardless of policy:
- Pick some input SSTables. Fewer is faster; more reclaims more garbage in a single pass. The picker is the heart of the policy.
- Merge them. A standard k-way merge. When the same key appears in multiple inputs, keep only the newest version (by the SSTable ordering). When a tombstone is the newest version for a key, drop the whole key from the output (assuming no older SSTable outside the merge still holds that key — more on this caveat later).
- Write the output. One or more new SSTables replace the inputs. The inputs are deleted. Reads see the new files instead.
Leveled compaction is one answer to the first question — which inputs to pick. Tiered compaction (next chapter) is a different answer. The merge itself is the same.
The pyramid
Leveled compaction organises SSTables into levels, indexed L0, L1, L2, and so on. Each level has a byte budget that is T times the budget of the level above — typically T = 10.
- L0 is special. It is where flushes land. Files arrive here directly from the memtable, one per flush, each ~64 MB. L0 files may overlap in key range — the memtable at flush time held whatever keys were written recently, which is a random slice of the key space, so consecutive L0 files can and do share keys. L0 has a budget expressed in file count (typically 4), not bytes.
- L1's budget matches one L0's worth of data — with 4 × 64 MB L0 files, L1's target is ~256 MB. Files in L1 are the same ~64 MB size as L0 files, but they are non-overlapping in key range. Any given key lives in at most one L1 file.
- L2 is 10× L1 → ~2.5 GB. Non-overlapping.
- L3 is ~25 GB. Non-overlapping.
- ... and so on up to L6 or L7, depending on how big the database gets.
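The budget schedule is worth seeing concretely. A quick sketch of the geometry, with illustrative constants (T, the L0 file limit, and the file size are the chapter's running numbers, not universal defaults):

```python
T = 10                      # size multiplier between adjacent levels
L0_FILE_LIMIT = 4           # L0's budget is a file count, not bytes
FILE_BYTES = 64 * 2**20     # ~64 MB per SSTable

def level_budget_bytes(k):
    """Byte budget for level k (k >= 1): L1 holds four files' worth, ×T per level after."""
    l1 = L0_FILE_LIMIT * FILE_BYTES          # ~256 MB
    return l1 * T ** (k - 1)

for k in range(1, 7):
    print(f"L{k}: {level_budget_bytes(k) / 2**30:.2f} GiB")
# L1 ≈ 256 MB, L2 ≈ 2.5 GB, L3 ≈ 25 GB, ... L6 ≈ 25 TB
```

Six levels already span from a quarter-gigabyte to tens of terabytes — that is why the level count stays small.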
Two invariants do all the work.
Invariant A (size): Each level Lk has a byte budget, budget(Lk) = T · budget(Lk-1). When a level exceeds its budget, compaction kicks in to push data down to Lk+1.
Invariant B (non-overlap, k ≥ 1): Within any level Lk with k ≥ 1, no two files share any key in their key range. The whole level together partitions some subset of the key space into disjoint intervals.
Invariant A is what bounds the number of levels — the total data is budget(L0) · (1 + T + T² + ... + T^L) ≈ budget(L0) · T^L, so for 25 TB of data with 64 MB L0 files, L ≈ log_10(25 TB / 256 MB) ≈ 5. Six levels cover it.
Invariant B is what makes reads fast: to find a key in a level with non-overlapping files, you do one binary search on the level's file-range index to find the single candidate file, and probe that file (or skip it via its Bloom filter). One file per level, at most. Across L levels plus L0's up-to-4 overlapping files, a read looks at at most 4 + L files — say 4 + 6 = 10 for a healthy configuration.
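That binary search can be sketched in a few lines, assuming the level keeps its files as (smallest_key, largest_key) pairs sorted by key — a stand-in for the real engine's file metadata:

```python
import bisect

def candidate_file(files, key):
    """Return the index of the one file whose range may contain key, or None.

    files: list of (smallest_key, largest_key), sorted, non-overlapping.
    """
    # First file whose largest_key >= key is the only possible candidate.
    largests = [hi for _, hi in files]
    i = bisect.bisect_left(largests, key)
    if i < len(files) and files[i][0] <= key:
        return i
    return None     # key falls in a gap between files: definitely absent

level = [("a", "c"), ("d", "g"), ("h", "o"), ("p", "z")]
candidate_file(level, "e")   # → 1 (the d..g file)
candidate_file(level, "cc")  # → None (gap between c and d)
```

One binary search per level, at most one file probed — that is Invariant B doing its work.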
The level picker
Compaction is triggered whenever some level exceeds its budget. The scheduler needs to pick, among all over-budget levels, which one to compact, and which file within it to push down. LevelDB uses a priority score per level:
# leveled_compaction.py — pick the next compaction

T = 10                              # size multiplier between adjacent levels
MAX_LEVEL = 6                       # deepest level
L0_FILE_TARGET = 4                  # L0's budget is a file count
L1_BUDGET_BYTES = 256 * 2**20       # ~256 MB: L1's byte budget

def compaction_score(level, levels_state):
    """Return a score per level. Higher score = more urgent to compact."""
    if level == 0:
        return len(levels_state[0].files) / L0_FILE_TARGET   # target: 4 files
    budget = L1_BUDGET_BYTES * (T ** (level - 1))   # 256 MB × 10^(k-1) for k ≥ 1
    return levels_state[level].total_bytes / budget

def pick_compaction(levels_state):
    """Return (src_level, src_file) to compact next, or None if nothing is over budget."""
    scores = [(compaction_score(k, levels_state), k) for k in range(MAX_LEVEL + 1)]
    best_score, best_level = max(scores)
    if best_score < 1.0:
        return None                 # nothing over budget
    level = levels_state[best_level]
    src_file = pick_file_in_level(level, best_level)
    return (best_level, src_file)

def pick_file_in_level(level, k):
    """Which file from level k should be compacted down?"""
    if k == 0:
        return level.files[0]       # oldest L0 file
    # For Lk, k >= 1: round-robin the key space so no single key range
    # gets hammered repeatedly. LevelDB remembers the last compacted key
    # per level and picks the next file whose largest key > last_key.
    for f in level.files_sorted_by_key:
        if f.largest_key > level.last_compacted_key:
            return f
    return level.files_sorted_by_key[0]   # wrap around
Why score by ratio to budget rather than raw bytes: ratios are comparable across levels. L0 at 5 files is 25% over budget; L3 at 26 GB (budget 25 GB) is 4% over budget. Both are over, but the L0 pressure is more urgent — reads go through L0 on every query, so letting L0 bloat slows every read. Scoring by used/budget naturally sorts by urgency.
Why pick_file_in_level rotates through the key space: if the picker always picked the file with the most data, say, you would compact the same "hot" region of the key space over and over, wearing out its SSDs (on SSDs, write amplification becomes endurance amplification) and starving the "cold" regions of compaction. Round-robin by key range spreads the wear evenly. LevelDB stores the "last compacted key" per level in its MANIFEST so the rotation is durable across restarts.
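The rotation is easy to see in isolation. A toy version with (smallest, largest) ranges standing in for file metadata:

```python
def next_file(files_sorted, last_compacted_key):
    """Round-robin: first file past the last compacted key, wrapping at the end."""
    for lo, hi in files_sorted:
        if hi > last_compacted_key:
            return (lo, hi)
    return files_sorted[0]          # past the end of the key space: wrap around

files = [("a", "g"), ("h", "o"), ("p", "z")]
next_file(files, "g")   # → ("h", "o")
next_file(files, "z")   # → ("a", "g")  (wrapped)
```

Each compaction advances last_compacted_key, so over many compactions every key region of the level gets its turn.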
The compaction pass
Picking (src_level, src_file) is half the job. The other half is: find every file in src_level + 1 that overlaps src_file, and merge them all together.
def overlapping_files(target_level, src_key_range):
"""Return all files in target_level whose key range overlaps src_key_range."""
lo, hi = src_key_range
return [f for f in target_level.files_sorted_by_key
if f.largest_key >= lo and f.smallest_key <= hi]
def do_compaction(src_level_idx, src_file, levels_state):
"""Compact src_file from Lk into its overlapping files in Lk+1."""
dst_level = levels_state[src_level_idx + 1]
src_range = (src_file.smallest_key, src_file.largest_key)
dst_overlaps = overlapping_files(dst_level, src_range)
inputs = [src_file] + dst_overlaps # all files to merge
out_files = []
writer = SSTableWriter.new(level=src_level_idx + 1)
is_bottommost = (src_level_idx + 1 == MAX_LEVEL)
last_key = None
for key, value, seq, is_tomb in merge_sorted(inputs):
# dedupe: only emit the newest version of each key
if key == last_key:
continue # older version, drop
last_key = key
# tombstone optimisation: drop tombstones at the bottommost level
if is_tomb and is_bottommost:
continue # nothing below to hide
writer.append(key, value, seq, is_tomb)
if writer.size_bytes >= TARGET_FILE_BYTES:
out_files.append(writer.close())
writer = SSTableWriter.new(level=src_level_idx + 1)
if writer.size_bytes > 0:
out_files.append(writer.close())
# atomic swap: delete the inputs, install the outputs
with levels_state.lock:
for f in inputs:
levels_state[f.level].remove(f)
schedule_delete(f.path)
for f in out_files:
levels_state[src_level_idx + 1].add(f)
return out_files
Why include src_file itself in inputs: it is one of the sorted streams. The merge walks the union of all inputs — src_file from Lk and the overlapping files from Lk+1 — and emits the newest version of each key. The src file's records will typically be newer (they are migrating down from a higher level), but the merge does not care; it uses per-record sequence numbers to decide newest, which is the robust answer.
Why split the output at TARGET_FILE_BYTES: the non-overlap invariant says files within a level must be disjoint in key range, but nothing forces them to be one single giant file. Splitting keeps individual file sizes bounded (~64 MB), which bounds the next compaction's input size when one of these files gets picked to be pushed down to Lk+2. If you let a single file grow to be the entire level, every future compaction at this level would have to rewrite the whole level.
Why the "bottommost" tombstone drop is correct only at the bottommost level: a tombstone's job is to shadow older puts for the same key in lower levels. If there is no lower level, the tombstone has nothing to shadow and can be deleted entirely. At non-bottommost levels, the tombstone must be preserved — a put for the same key might exist in Lk+2 or deeper, and dropping the tombstone would expose that old put to reads (silent data resurrection). This is the single most error-prone detail in writing a compactor.
The merge itself uses the same merge_sorted primitive from chapter 12. Because every input file is individually sorted, and the merge iterator walks them in lockstep pulling the smallest key at each step, the output is sorted too. Two records with the same key arrive consecutively from the merge — the one with the higher sequence number (newer) comes first, and the loop drops the older duplicates with if key == last_key: continue.
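A minimal stand-in for that primitive, assuming each input yields (key, seq, is_tomb, value) records in sorted key order and that a higher sequence number means newer — ordering by (key, -seq) is what makes the newest version arrive first:

```python
import heapq

def merge_sorted(inputs):
    """Yield records from all sorted inputs, ordered by key with newest-first ties."""
    yield from heapq.merge(*inputs, key=lambda r: (r[0], -r[1]))

older = [("a", 3, False, "a@3"), ("c", 1, False, "c@1")]
newer = [("a", 7, False, "a@7"), ("b", 5, True, None)]

out, last_key = [], None
for key, seq, tomb, val in merge_sorted([older, newer]):
    if key == last_key:
        continue                 # older duplicate of a key already emitted — drop
    last_key = key
    out.append((key, seq))
# out == [("a", 7), ("b", 5), ("c", 1)] — a@3 was shadowed and dropped
```

heapq.merge pulls lazily from each input, so the merge streams: memory use is proportional to the number of inputs, not their total size.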
A full compaction, traced
Compacting one L1 file into L2
Suppose the store has this state:
L0: (4 files, allowed to overlap — not involved in this compaction)
L1 (256 MB target, 320 MB used — 25% over budget, score 1.25):
    L1-a:  keys a..g   64 MB
    L1-b:  keys h..o   64 MB   ← picked: largest_key > last_compacted_key
    L1-c:  keys p..s   64 MB
    L1-d:  keys t..v   64 MB
    L1-e:  keys w..z   64 MB   (a fifth file arrived from an earlier
                                L0 → L1 compaction; the level is over
                                budget, but non-overlap still holds)
L2 (2.5 GB target, 1.8 GB used, not over budget):
L2-1: keys a..c 64 MB
L2-2: keys c..g 64 MB
L2-3: keys g..k 64 MB
L2-4: keys k..n 64 MB
L2-5: keys n..r 64 MB
L2-6: ... etc
The picker scores L1 at 1.25 and L2 at 0.72; L1 wins. Within L1 it picks L1-b (keys h..o). Now compact it:
pick_compaction() → ("L1", L1-b)
src_range = ("h", "o")
overlapping_files(L2, ("h","o")) = [L2-3 (g..k), L2-4 (k..n), L2-5 (n..r)]
    (L2-5 qualifies because its smallest key "n" is <= "o" —
     the predicate is largest_key >= lo and smallest_key <= hi)
inputs = [L1-b, L2-3, L2-4, L2-5]   # 4 × 64 MB = 256 MB of input
merge_sorted(inputs) walks all four in lockstep, emitting the newest version
of every key:
for each key k in the union of the four files' key sets:
emit (k, newest_value_for_k)
Output: one logical sorted stream covering keys roughly g..r
(spans the union of the inputs' ranges).
writer splits output at 64 MB boundaries:
L2-3': keys g..k 64 MB
L2-4': keys k..n 64 MB
L2-5': keys n..r 64 MB
(the output is slightly smaller than the input total because duplicate
keys have been deduped and the tombstones have dropped some records)
Atomic swap (inside levels_state.lock):
L1 loses L1-b → L1 now has a, c, d, e (256 MB, at budget)
L2 loses L2-3, L2-4, L2-5 and gains L2-3', L2-4', L2-5'
→ L2 now has L2-1, L2-2, L2-3', L2-4', L2-5', L2-6, ...
(still non-overlapping by construction, still ~1.8 GB total)
Delete the old files on disk (once no reader holds a reference).
Four input files, three output files, one net file deletion from L1 (good — that's the point; it drained one file's worth of data down). L1's score drops back below 1.0 and the picker stops waking up about L1 for a while. The next compaction might be triggered by L0 (flushes keep filling it), by a different L1 file if L1 goes over budget again, or by some deeper level.
What happened to reads during all this? They saw a consistent view throughout. Before the atomic swap, L1-b and the three L2 files were the authoritative source; after the swap, the three L2-3'/L2-4'/L2-5' files are. No read ever sees "key temporarily missing" because the swap installs the new files before deleting references to the old ones, and readers holding a reference to an old file continue to see it until they finish their read (the reference-count keeps the old file alive; actual disk deletion is deferred).
Write, read, and space amplification
Leveled compaction has three numbers that matter, and they all fall out of the level structure.
Write amplification. Every byte written to the database is (on average) rewritten once per level as it migrates from L1 down to L6. At each level below L0, compaction takes ~1 file of size F from Lk and merges it with the overlapping ~T files in Lk+1 (because Lk+1 is T times wider in key space than a single Lk file), so the compaction writes (T+1)·F bytes of output per F bytes of input. The per-level amplification is T + 1. Over L levels:
write_amp ≈ (T + 1) · L. With T = 10 and L = 6, that is 11 × 6 = 66 — every byte is written to disk about 66 times over its lifetime. That is a lot, and it is leveled compaction's main weakness. RocksDB's compaction_readahead_size, batched I/O, and sub-compactions (Going deeper) exist to keep this from killing SSDs.
(The more careful derivation treats L0 as ~4 times worse because L0 files overlap and must each be merged with every other L0 file when L0 → L1 happens. LevelDB's L0 → L1 compaction reads all L0 files plus the overlapping L1 files, yielding an L0→L1 amplification that is ~(#L0 + overlap_in_L1). For typical #L0 = 4 and T = 10, it is still dominated by the (T+1)·L term.)
Read amplification. A point lookup touches:
- The memtables (0 disk I/O, just RAM).
- Up to 4 L0 files (overlapping, each must be checked individually, each gated by its Bloom filter).
- At most 1 file per level L1..L6 (non-overlap invariant — one binary search in the level's index finds the sole candidate, Bloom filter gates the block read).
With 1% FPR Bloom filters, expected disk reads per point lookup for a key that exists in some level k: one confirmed read at level k, plus ~0.1 false-positive reads elsewhere. Total: ~1.1 disk reads per point lookup. This is why leveled compaction is the default choice for read-heavy workloads.
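The expectation is a one-liner. Under the stated assumptions — the key exists in exactly one of the ~10 candidate files, and each of the other files has a 1% Bloom false-positive rate:

```python
fpr, candidate_files = 0.01, 10

# One guaranteed read at the level that holds the key, plus an expected
# fpr false-positive block read at each of the other candidate files.
expected_disk_reads = 1 + (candidate_files - 1) * fpr
# → 1.09, i.e. ~1.1 disk reads per point lookup
```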
Space amplification. The ratio of on-disk bytes to logical live-data bytes. Leveled's non-overlap invariant is crucial here: within any level Lk (k ≥ 1), a given key appears at most once. The same key can appear in multiple levels (an older version in L5, a newer one in L2), but compaction is constantly draining older copies down and deduplicating them on the way. The steady-state space overhead is dominated by the bottommost level's size relative to everything above it:
space_amp ≈ 1 + 1/T + 1/T² + ... = T / (T − 1) ≈ 1.11 for T = 10. That is astonishingly tight — about 10% space overhead. For a 100 GB logical dataset you need about 110 GB of disk. Contrast with tiered compaction (next chapter), which can temporarily need 2-3× the live data size.
Summary table:
| Policy | Write amp | Read amp (files) | Space amp |
|---|---|---|---|
| Leveled (T=10, L=6) | ~66 | ~10 | ~1.11 |
| Tiered (next chapter) | ~20 | ~30 | ~2.0 |
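The leveled row of the table can be checked in a few lines, using the chapter's T and L:

```python
T, L, L0_FILES = 10, 6, 4

write_amp = (T + 1) * L            # bytes rewritten per byte, over its lifetime
read_amp_files = L0_FILES + L      # files probed per point lookup, worst case
space_amp = T / (T - 1)            # geometric series 1 + 1/T + 1/T^2 + ...

print(write_amp, read_amp_files, round(space_amp, 2))   # 66 10 1.11
```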
Leveled pays for its excellent read and space numbers in write amplification. That trade is the whole story of the next two chapters.
Common confusions
- "Why allow L0 to overlap at all? The non-overlap invariant is so clean — why not enforce it everywhere?" Because enforcing non-overlap at L0 would make flushes slow. A flush produces an SSTable that covers whatever keys happened to be in the memtable at flush time — which is a random slice of the whole key space. To fit that file into a non-overlapping L0, you would have to merge it with the existing L0 files on the flush path itself, turning every flush into a compaction. Instead, L0 absorbs the flush in one quick rename, and a separate background compaction later does L0 → L1 (which is the one place where leveled compaction has to handle overlapping inputs). The overlap at L0 is contained to at most 4 files (the L0 file-count limit), which is a small constant read-amp penalty for a big flush-path win.
- "What actually triggers a compaction? Is it size, is it a timer, is it the write path?" All three in production engines, and the policy is a priority queue rather than a single trigger. The primary signal is size-over-budget (the compaction_score >= 1.0 check). But there are also seek-based triggers (LevelDB counts how many seeks a file has caused and compacts it if the count is high — files that get probed a lot but rarely hit are wasting Bloom filter memory and read time), age-based triggers (a file that has sat in Lk for a long time gets pushed down on general principles), and pressure triggers (if L0 hits its hard limit, writes stall until a compaction finishes). The scheduler runs in a dedicated background thread (one or more) and picks the highest-priority pending compaction each time it wakes up. Idle systems get idle compactions; loaded systems get pressure-driven ones.
- "What if two compactions want to touch the same file?" They can't. The compactor holds a lock on each file it picks as input, and the picker skips files that are already locked. Real engines do more than just lock — they maintain a data structure (LevelDB calls it a VersionEdit) that describes the pending file set, so multiple compactions can run in parallel on disjoint file sets across different parts of the tree. You can have an L1→L2 compaction on keys a..h running at the same time as an L3→L4 compaction on keys p..t; they share no files and do not conflict.
- "Why don't compactions rewrite the Bloom filters?" They do. Every output SSTable is a fresh file, built from scratch, including its Bloom filter. The flush machinery from chapter 14 is reused — the compactor is effectively a flush from a merged stream rather than from a memtable. This is why compaction is expensive in CPU as well as I/O: every output byte goes through encoding, block packing, checksumming, Bloom-filter insertion, and sparse-index construction.
- "A tombstone in L2 shadows a put in L5 — but leveled compaction only touches two adjacent levels at a time. Does the tombstone ever actually meet the put to cancel it?" Eventually, yes. The tombstone migrates down over time: it moves from L2 to L3 to L4 to L5 via successive compactions. Once it reaches L5, the put in L5 is its merge partner (same key range) and the merge drops both from the output. Before that, the tombstone sits in whatever level it is at, shadowing the put on the read path (reads see the tombstone first, return None). The time for a tombstone to "catch" a deep put can be hours or days on a slowly-written database; this is why "deletes are slow to reclaim space" is a common operational complaint in LSM systems. RocksDB has compaction_pri = kOldestSmallestSeqFirst and kMinOverlappingRatio heuristics to accelerate tombstone propagation.
- "Can a compaction make things worse by creating a file that violates the non-overlap invariant?" Only if it has bugs. The compactor's output-split logic cuts the merged stream at file-size boundaries that are also key boundaries — a split never happens in the middle of a key. Since the stream is sorted, consecutive output files have disjoint key ranges by construction. The picker must also ensure that the compaction's input src_file truly overlaps only the chosen dst_overlaps and no other L2 files — which it does by the overlapping_files query. If the picker screws this up (picks too narrow a set of overlaps), the output would overlap with some un-merged L2 file, violating the invariant. Production engines add cross-level invariant checks in debug builds to catch this.
- "The math says ~66× write amplification. That sounds catastrophic — how is leveled compaction usable at all?" Two mitigations. First, write amplification is lifetime, not per-second — a byte is written, then rewritten 66 times over its lifetime in the database, which is usually days or weeks. The per-second write rate is much lower than the peak. Second, all of those rewrites are sequential I/O (compaction reads sorted input files and writes sorted output files), which is an order of magnitude faster on spinning disks and several-times-faster on SSDs than random I/O. Compaction effectively trades random writes at the user level for sequential writes at the disk level, and the 66× factor is multiplied against the sequential-write bandwidth, which is typically 500 MB/s or more on modern SSDs. A workload writing 100 MB/s of user data incurs 6.6 GB/s of compaction I/O — still within SSD bandwidth if you have multiple devices.
Going deeper
You have the pyramid and the picker. This section covers the four tuning knobs and the one optimisation that make production leveled compaction actually work — level_size_multiplier tuning, compaction priority heuristics, subcompactions, and auto-compaction backpressure.
Tuning level_size_multiplier
The multiplier T (typically 10) controls the trade-off between the number of levels and the per-level compaction cost.
- Higher T (say, T = 20): fewer levels, lower read amplification. But each compaction is bigger — at L1→L2 with T = 20, a single L1 file overlaps ~20 L2 files instead of ~10, so each compaction writes 21 files instead of 11. Per-byte write amplification is (T + 1) · log_T N, which grows slowly with T (the fatter merges are not fully offset by the smaller level count), and individual compactions are larger and harder to parallelise.
- Lower T (say, T = 5): more levels, higher read amplification. Compactions are smaller. Useful for memory-constrained systems where reading ~T overlapping files at once strains the RAM budget.
RocksDB exposes this as level0_file_num_compaction_trigger, level0_slowdown_writes_trigger, max_bytes_for_level_base, and max_bytes_for_level_multiplier. The last one is T itself. Default is 10. Tuning it down to 8 or up to 15 is common; going outside [5, 20] is unusual.
A subtle trick: variable T per level. RocksDB supports max_bytes_for_level_multiplier_additional which lets you give each level its own multiplier. Increasing T at deeper levels (say, 8 at L1→L2, 10 at L2→L3, 15 at L5→L6) makes the pyramid steeper at the bottom, reducing the bottommost level's share of total size and indirectly reducing space amplification. Facebook's MyRocks deployment uses this tuning to keep space overhead under 10% on multi-TB databases.
Compaction priority heuristics
The naive "compact the level with the highest score" rule is good enough most of the time, but production engines add a second layer of choice: within a level, which file should be picked? RocksDB has several compaction_pri strategies:
- kByCompensatedSize (default in many configs): pick the file whose size is compensated by the number of tombstones and deletes it contains. A file full of tombstones shadowing lots of data in deeper levels is high-priority — compacting it reclaims a lot of space.
- kOldestLargestSeqFirst: pick the file whose largest sequence number is oldest. This moves data through the levels in roughly arrival-time order, which is friendly to time-series workloads.
- kOldestSmallestSeqFirst: pick the file whose smallest sequence number is oldest. Similar motivation, different tie-breaker.
- kMinOverlappingRatio: pick the file whose overlap with the next level is smallest. Minimises the amount of work per compaction — the fewer Lk+1 files we touch, the less I/O we do.
kMinOverlappingRatio is the one that makes the biggest difference on real workloads. A file that overlaps 20 Lk+1 files is 2× as expensive to compact as one that overlaps 10. Picking the cheap ones first reduces total I/O. The cost is a less-uniform wear pattern on the key space, but in practice the distribution of overlap counts is wide enough that you still visit every key region eventually.
Subcompactions
A compaction of one Lk file against ~T Lk+1 files is a single-threaded merge by default. For large files (say, 256 MB files at deep levels, compacting against 10 of them → 2.5 GB of input), the merge takes long enough that it becomes a bottleneck.
Subcompaction splits the compaction work across multiple threads. The key range of the compaction is divided into N disjoint sub-ranges (typically by looking at the sparse index of the source file), and each sub-range is compacted by a dedicated worker thread. Since the sub-ranges are disjoint, there is no coordination overhead — each worker reads its portion of each input file, merges them, and writes its portion of the output files. The outputs from the sub-workers are already sorted (by sub-range), so they can be concatenated into the final output files without re-merging.
RocksDB's max_subcompactions controls this; default is 1 (no parallelism), production values are 4-8. Subcompactions can halve the wall-clock time of a large compaction at the cost of proportional CPU and I/O peak load. They are essential for multi-GB SSTables.
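The range partitioning is the interesting part; the worker threads and per-range merges are the same machinery as before, just restricted to a sub-range. A sketch of the split only, where boundaries stands in for keys sampled from the source file's sparse index (an assumption for illustration):

```python
def split_subranges(boundaries, n_workers):
    """Divide a sorted key list into n_workers disjoint (lo, hi) sub-ranges.

    None at either end means 'unbounded': the first range starts at the
    smallest input key, the last runs to the largest.
    """
    step = max(1, len(boundaries) // n_workers)
    cuts = boundaries[step:-1:step][:n_workers - 1]   # evenly spaced cut keys
    edges = [None] + cuts + [None]
    return list(zip(edges, edges[1:]))

split_subranges(list("hijklmno"), 4)
# → [(None, 'j'), ('j', 'l'), ('l', 'n'), ('n', None)]
```

Because the sub-ranges are disjoint and each input file is sorted, each worker's output is a sorted run over its own range, and the runs concatenate directly.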
Auto-compaction backpressure
If writes arrive faster than compactions can drain L0, L0 grows unboundedly. That kills reads (every read probes all L0 files), and eventually L0 outgrows RAM for its Bloom filters. RocksDB has three backpressure levels:
- level0_file_num_compaction_trigger (default 4): L0 has more than this many files → a compaction gets scheduled.
- level0_slowdown_writes_trigger (default 20): L0 is uncomfortably full → the write path artificially slows down (each put sleeps for a few hundred microseconds) to give compactions time to catch up.
- level0_stop_writes_trigger (default 36): L0 is about to blow up → the write path blocks completely until a compaction completes.
This is "write stall" — users see write throughput collapse, sometimes to zero for hundreds of milliseconds, while the system digests its backlog. The design philosophy: better to slow writes temporarily than to let the read path degrade unpredictably. Cassandra, HBase, and every other LSM engine has analogous backpressure, though with different triggers and different default values.
Tuning these three values is one of the core operational tasks for a RocksDB deployment. Too tight → writes stall often. Too loose → reads get slow under write pressure. The defaults (4, 20, 36) are tuned for a specific workload shape; production deployments usually tune them based on measured compaction throughput and L0 build-up rate.
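The three-tier policy fits in a few lines. A sketch of the write-path side, with the default thresholds from above (the sleep duration and function names are illustrative):

```python
TRIGGER, SLOWDOWN, STOP = 4, 20, 36   # L0 file-count thresholds

def throttle_delay(l0_file_count):
    """Seconds to sleep before admitting a put, given L0 pressure.

    Returns None when the caller must block until a compaction finishes.
    """
    if l0_file_count >= STOP:
        return None                   # hard stall: wait for compaction
    if l0_file_count >= SLOWDOWN:
        return 0.0005                 # a few hundred microseconds per put
    return 0.0                        # healthy: no delay

# In the put path (sketch):
#   delay = throttle_delay(len(levels_state[0].files))
#   if delay is None: wait_for_compaction()
#   elif delay:       time.sleep(delay)
```

The TRIGGER threshold never touches the write path at all — it only wakes the background scheduler.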
Move compaction — the optimisation worth knowing
Sometimes a compaction picks an Lk file that has no overlap at all in Lk+1. The L1 file covers h..o, and no L2 file covers any key in h..o. In this case there is nothing to merge — the compaction output is identical to the input, byte for byte.
Leveled compaction optimises this: instead of reading and rewriting the file, it moves the file from Lk to Lk+1 with just a metadata change. No I/O. No rewrite. The file's level counter goes up, and the MANIFEST is updated.
Move compactions are surprisingly common in practice — about 10-20% of compactions in typical workloads. They are the reason the "66× write amplification" number is an upper bound; real write amplification is more like 15-25× because move compactions skip the rewrite for a large fraction of levels.
LevelDB implements this as the "trivial move" optimisation. RocksDB has it too, as compaction_style = kCompactionStyleLevel with allow_trivial_move = true (default). The machinery is small — one check in the compactor before starting the merge — but the savings are enormous.
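A self-contained sketch of that check, with toy Level / SSTFile stand-ins for the engine's metadata; a real engine records the move in its MANIFEST rather than mutating shared state in place:

```python
class SSTFile:
    def __init__(self, smallest_key, largest_key, level):
        self.smallest_key, self.largest_key, self.level = smallest_key, largest_key, level

class Level:
    def __init__(self, files=None):
        self.files = files or []

def maybe_trivial_move(src_idx, src_file, levels):
    """Move src_file from L{src_idx} to L{src_idx+1} without I/O if nothing overlaps."""
    dst = levels[src_idx + 1]
    lo, hi = src_file.smallest_key, src_file.largest_key
    if any(f.largest_key >= lo and f.smallest_key <= hi for f in dst.files):
        return False                  # overlap exists: a full merge is required
    levels[src_idx].files.remove(src_file)
    src_file.level = src_idx + 1      # metadata-only change — no bytes rewritten
    dst.files.append(src_file)
    return True

levels = [Level(), Level([SSTFile("h", "o", 1)]), Level([SSTFile("p", "z", 2)])]
maybe_trivial_move(1, levels[1].files[0], levels)   # → True: "h..o" misses "p..z"
```

Non-overlap in the destination is preserved automatically: the moved file did not overlap anything there, by the very check that allowed the move.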
Where this leads next
You now have leveled compaction: the pyramid of levels, the non-overlap invariant, the picker, the per-compaction merge, and the three amplification numbers that define the policy's character. Reads are fast, space is tight, writes pay for both.
The next chapter builds the opposite extreme — tiered compaction (Cassandra's default, originally from Google's Bigtable). Tiered compaction does not maintain a non-overlap invariant. Instead, each level is a "tier" of same-size SSTables that are all overlapping, and compaction merges an entire tier's worth of files at once into a single file at the next tier. The effect: much lower write amplification (each byte is written ~L times total, not L·T), but much worse read amplification (every tier has many files, all possibly overlapping, all must be probed) and worse space amplification (during the merge, the old tier and the new file coexist, temporarily doubling the level's size).
Leveled and tiered sit at opposite ends of a three-way trade-off: write amplification, read amplification, space amplification. You can pick any two, but not all three. The next two chapters make this explicit, and the final chapter of Build 3 — universal compaction — shows how RocksDB's production compactor tries to let you dial the trade-off point at runtime, rather than committing to one corner.
- Tiered compaction (Cassandra) — the opposite trade-off: low write amp, high read amp, worse space amp.
- Universal compaction and the write/read/space trilemma — the tunable middle ground, and the fundamental impossibility that forces the trade-off.
For now, take a moment with the shape you have. A memtable absorbs writes. Flushes drop SSTables into L0. Leveled compaction pushes data down the pyramid, keeping every level below L0 non-overlapping and every level within its size budget. Reads pick the newest version of each key by walking memtables → L0 → L1 → ... → L6, gated by Bloom filters and Invariant B. This is the full LevelDB architecture, and it is the architecture RocksDB, CockroachDB, TiDB, and much of the modern database landscape is built on. A hundred lines of policy code — the picker, the merger, the scheduler — turn a chaotic pile of SSTables into an organised, read-efficient, space-efficient storage engine.
References
- Sanjay Ghemawat and Jeff Dean, LevelDB Implementation Notes — the original leveled-compaction design, including the file-picker heuristic, the "trivial move" optimisation, and the non-overlap invariant. github.com/google/leveldb.
- Siying Dong et al., Optimizing Space Amplification in RocksDB (CIDR 2017) — quantitative analysis of leveled vs tiered write/read/space amplification, and RocksDB's tuning knobs. cidrdb.org.
- Niv Dayan, Manos Athanassoulis, and Stratos Idreos, Monkey: Optimal Navigable Key-Value Store (SIGMOD 2017) — proves the optimal per-level Bloom filter sizing in a leveled LSM and quantifies read amplification. stratos.seas.harvard.edu.
- Chen Luo and Michael J. Carey, LSM-based Storage Techniques: A Survey (VLDB Journal, 2020) — comprehensive survey of leveled, tiered, and hybrid compaction policies. arxiv.org/abs/1812.07527.
- Facebook RocksDB team, Leveled Compaction (RocksDB wiki) — the production documentation for RocksDB's leveled policy, including max_bytes_for_level_multiplier, compaction_pri, and subcompactions. github.com/facebook/rocksdb/wiki.
- Patrick O'Neil et al., The Log-Structured Merge-Tree (LSM-Tree) (Acta Informatica, 1996) — the foundational paper; §3 derives the multi-level merge cost that leveled compaction implements. Springer.