In short
Leveled compaction keeps reads cheap (~1 SSTable per level) and space tight (~1.1×), but every byte is rewritten 10–30 times. Tiered compaction keeps writes cheap (~3–5× amp), but reads touch many SSTables and space amp can spike to 2× during merges. A natural question: is there a scheme that wins on all three? Dayan, Athanassoulis, and Idreos answered it in the 2017 Monkey paper — no. Write amp, read amp, and space amp form a trilemma: any compaction strategy is a point inside a triangle, and improving one axis costs you on at least one other. Universal compaction, RocksDB's tiered-family policy, is the most honest embodiment of this trilemma in a production engine. Unlike Cassandra's STCS, which has a single knob (min_threshold), universal exposes three independent triggers — the size-ratio trigger (merge when the newest SSTable is significantly smaller than the running sum of the older ones), the sorted-run count trigger (cap total SSTable count to bound read amp), and the space-amp trigger (force a full major compaction when redundant bytes exceed a threshold). Each trigger moves along one axis of the triangle. This chapter draws the trilemma, builds universal's three triggers, compares leveled / tiered / universal / FIFO in a single table, and gives the workload-to-strategy decision tree. The engineering takeaway: stop looking for the best compaction strategy. Pick a point on the triangle that matches your workload, and accept that the other two corners are paying for it.
You have leveled. You have tiered. You might reasonably ask: why not combine them? Do tiered at the top (cheap writes, fresh data) and leveled at the bottom (cheap reads, cold data), and get the best of both worlds.
The instinct is correct. That scheme exists, and it is called hybrid compaction. But before you can make that trade honestly, you need to see why there is a trade at all — why picking a compaction policy is always picking a point in a space, not finding a winner. The geometry of that space is what the last two chapters have been circling around, and it has a name: the write/read/space trilemma.
This chapter does three things. First, it draws the trilemma as a triangle and places every compaction strategy you know on it. Second, it builds RocksDB's universal compaction — the scheme that gives you three independent knobs, one per trilemma axis, and forces you to choose your point in the triangle explicitly. Third, it walks the decision: given your workload, which strategy wins?
The trilemma, as a triangle
The Monkey paper [1] phrased it in one sentence: for any LSM compaction policy, the product of write amplification, read amplification, and space amplification is bounded below by a constant determined by the dataset size and memtable size. You cannot drive all three to 1 simultaneously. Pick two.
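The asymmetry becomes concrete if you write down the standard asymptotic accounting for the two classic extremes, with T the dataset size, M the memtable size, N the fanout (leveled) or tier width (tiered), and W, R, S the three amplifications. These are the usual textbook estimates, consistent with the concrete numbers later in this chapter:

```latex
\begin{aligned}
\text{Leveled:}\quad & W = O\!\left(N \log_N \tfrac{T}{M}\right), &
R &= O\!\left(\log_N \tfrac{T}{M}\right), &
S &\approx 1.1\times \\
\text{Tiered:}\quad & W = O\!\left(\log_N \tfrac{T}{M}\right), &
R &= O\!\left(N \log_N \tfrac{T}{M}\right), &
S &\approx 1.3\text{--}2\times
\end{aligned}
```

Swapping policies moves the factor of N from W to R, or back; no policy removes it from both. That is the trilemma in miniature.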
Draw it.
Three observations make the trilemma visceral.
Observation 1. Write amp and space amp pull against each other. Reducing write amp means not rewriting bytes; but the bytes you don't rewrite are the overwritten ones, and they accumulate on disk as redundant copies — that is space amp. To drive space amp to 1× you must eagerly rewrite overwrites away, which costs writes.
Observation 2. Write amp and read amp pull against each other. Reducing write amp means leaving many SSTables around; a read then must check all of them (via bloom filter, but still RAM + CPU + occasional false-positive disk seek). Merging aggressively collapses the SSTable count, but that is rewriting, which is write amp.
Observation 3. Read amp and space amp are usually allies, not enemies. Merging collapses the SSTable count (good for reads) and drops obsolete versions (good for space). That's why the triangle is not a straight line — leveled can be good on both read and space simultaneously, and tiered is bad on both simultaneously. The real tension is writes-vs-everything-else, with a secondary twist: leveled's reads-and-space come together only if you are willing to rewrite often.
Where each strategy sits
Before building universal, line up what you already have.
Tiered (STCS). Near the write-amp corner. Merges happen rarely (every N SSTables in a tier); a byte is rewritten \log_N(T/M) times. Reads are expensive (every tier must be probed). Space amp is 1.3–2× because obsolete versions linger between merges.
Leveled. On the read-amp / space-amp edge. Aggressive partitioning-and-rewriting per level means every byte is touched many times over its lifetime (write amp 10–30×), but a read touches one file per level (read amp ≈ levels) and space amp settles near 1.1× because overwrites are reclaimed within one level's compaction cycle.
FIFO. At the write-amp corner (trivially). Do no merging at all — just delete the oldest SSTable when disk fills up. Write amp is exactly 1 (every byte is written once and never rewritten). Read amp is awful (every SSTable must be checked); space amp is whatever the operator tolerates (typically capped at k \cdot dataset for some k). FIFO is useful for strict time-series log retention — you want the last 7 days of data and don't care about point-lookup latency. RocksDB ships a FIFO compaction mode [2] for exactly this workload.
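FIFO's entire "compaction" logic is a deletion loop. A minimal sketch, with an illustrative SSTable record (the field names are ours, not RocksDB's):

```python
from dataclasses import dataclass

@dataclass
class SSTable:
    name: str
    size_bytes: int

def fifo_enforce(sstables_oldest_first, max_total_bytes):
    """Unlink the oldest SSTables until total size fits under the cap.
    No merging ever happens, so write amp stays at exactly 1x."""
    total = sum(s.size_bytes for s in sstables_oldest_first)
    doomed = []
    while sstables_oldest_first and total > max_total_bytes:
        victim = sstables_oldest_first.pop(0)  # oldest first
        doomed.append(victim)
        total -= victim.size_bytes
    return doomed  # caller unlinks these files from disk
```

The same loop works with an age threshold instead of a byte cap; RocksDB's FIFO mode supports both forms of retention.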
Universal. A tunable interior point, closer to the tiered side of the triangle. It is the scheme this chapter builds.
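A back-of-envelope calculator reproduces these placements. The formulas are the usual order-of-magnitude estimates (a byte is rewritten roughly once per tier level under tiered, and fanout-many times per level under leveled), not exact bounds:

```python
import math

def amplification(strategy, dataset_gb, memtable_gb=0.064, fanout=10, tier_size=4):
    """Rough write/read/space amplification per strategy.
    Order-of-magnitude estimates in the spirit of the chapter's
    comparison table, not guarantees."""
    levels = max(1, math.ceil(math.log(dataset_gb / memtable_gb, fanout)))
    tiers = max(1, math.ceil(math.log(dataset_gb / memtable_gb, tier_size)))
    if strategy == "leveled":
        # each byte is rewritten ~fanout times at each level it passes through
        return {"write": fanout * levels, "read": levels, "space": 1.1}
    if strategy == "tiered":
        # each byte is rewritten once per tier level; reads probe every run
        return {"write": tiers, "read": tier_size * tiers, "space": 2.0}
    if strategy == "fifo":
        # never rewritten; reads must check every SSTable ever flushed
        return {"write": 1, "read": float("inf"), "space": 1.0}
    raise ValueError(strategy)
```

For a 100 GB dataset with a 64 MB memtable this yields leveled at write amp ~40 and read amp ~4 levels, versus tiered at write amp ~6 and read amp ~24 runs: the same corners the triangle shows.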
Universal compaction — the size-ratio trigger
RocksDB's universal compaction is a tiered-family policy with a cleaner, more tunable picker. The rule that replaces STCS's "bucket by size ratio" is the size-ratio trigger:
Walk SSTables from newest to oldest, accumulating sizes. Keep pulling the next (older) SSTable into the candidate merge set as long as
accumulated_size × (1 + size_ratio/100) ≥ next_size — that is, as long as the next older SSTable is comparable in size to everything newer combined. Stop at the first SSTable that is decisively larger: from there on the size series is "regular", and the candidate set (if it has at least two members) is merged.
Reworded: universal wants the SSTable sizes to form a roughly geometric series, newest smallest, oldest largest. Whenever a fresh flush lands and breaks that geometric shape, it gets merged with its neighbours to restore it.
The size_ratio parameter (default 1 in RocksDB, expressed in percent: the default stops merge-set growth as soon as the next SSTable exceeds 1.01× the accumulated total, a very tight requirement that produces many small merges) is the first of universal's three knobs.
```python
# universal_compaction.py -- size-ratio trigger in the spirit of RocksDB
SIZE_RATIO = 1  # percent; default 1 means "within 1%": stop growing the merge
                # set when the next older SSTable exceeds 1.01x the accumulated total

def size_ratio_pick(sstables_newest_first):
    """
    Walk newest to oldest, accumulating sizes. Keep pulling the next
    (older) SSTable into the merge set while
        accumulated * (1 + SIZE_RATIO / 100) >= next SSTable's size,
    i.e. while the next SSTable is comparable to everything newer combined.
    Stop at the first SSTable that is decisively larger -- the series is
    "regular" from there on.
    """
    accumulated = 0
    merge_set = []
    for idx, sst in enumerate(sstables_newest_first):
        merge_set.append(sst)
        accumulated += sst.size_bytes
        # Peek at the next (older) SSTable to decide whether to keep growing.
        if idx + 1 >= len(sstables_newest_first):
            break
        next_sst = sstables_newest_first[idx + 1]
        if accumulated * (1 + SIZE_RATIO / 100) < next_sst.size_bytes:
            break
    # A merge set of one SSTable means there is nothing to merge.
    return merge_set if len(merge_set) >= 2 else None
```
Why a size-ratio trigger instead of a count trigger: STCS merges when N similar-sized SSTables accumulate. That is a step function — nothing happens, then suddenly a big merge. Universal's size-ratio rule is smoother: as soon as the newest file breaks the geometric shape, a small merge happens. The result is many small-to-medium merges rather than occasional huge ones, which spreads I/O more evenly and keeps peak disk usage lower.
The three triggers
Universal does not stop at size-ratio. RocksDB adds two more independent triggers, each pinned to a different axis of the trilemma. [2]
Trigger 1 — size-ratio (write-amp axis). The rule above. Fires most often. Keeps write amp low because merges are triggered by shape, not count.
Trigger 2 — sorted-run count (read-amp axis). A hard ceiling on the number of SSTables (in universal's vocabulary, "sorted runs"), configured via level0_file_num_compaction_trigger (default 4). If size-ratio alone isn't firing often enough and the sorted-run count climbs above the limit, a count-triggered merge fires: merge whatever's needed to get back under the limit. This caps read amp from above.
Trigger 3 — space-amp bound (space-amp axis). The knob max_size_amplification_percent (default 200%). Universal estimates space amp as:

\text{estimated space amp} \approx \frac{\text{total size of all sorted runs except the largest}}{\text{size of the largest}}

(the intuition: the largest SSTable is approximately the live dataset; everything smaller is recent updates, many of which shadow rows in the big one). If that ratio exceeds 200%, universal fires a full major compaction — merge every SSTable into one — to reclaim the redundant bytes. This caps space amp from above, at the cost of one expensive full rewrite.
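The triggers compose as a priority chain: space-amp overrides the sorted-run count, which overrides size-ratio. A sketch of that selection order (illustrative only, not RocksDB's actual picker; the size-ratio step described earlier is elided here):

```python
from collections import namedtuple

SSTable = namedtuple("SSTable", "size_bytes")  # minimal stand-in

def pick_compaction(sstables_newest_first, max_sorted_runs=4,
                    max_space_amp_percent=200):
    """Consult the triggers in priority order; return (reason, merge_set)
    or None. Sketch only: a real picker also respects min/max merge
    widths and compactions already in flight."""
    sizes = sorted(s.size_bytes for s in sstables_newest_first)
    if len(sizes) >= 2:
        largest, rest = sizes[-1], sum(sizes[:-1])
        # Estimated space amp: everything except the largest run is assumed
        # to be redundant updates shadowing rows in it.
        if 100 * rest / largest >= max_space_amp_percent:
            return ("space-amp", list(sstables_newest_first))  # full compaction
    if len(sstables_newest_first) > max_sorted_runs:
        # Merge just enough of the newest runs to get back under the cap.
        excess = len(sstables_newest_first) - max_sorted_runs + 1
        return ("sorted-run-count", sstables_newest_first[:excess])
    return None  # the size-ratio trigger would be consulted here
```

Note how the space-amp branch returns every sorted run: that is the full major compaction, the most expensive thing universal ever does.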
Each trigger pins one axis. Tuning them moves the working point around the triangle:
| Knob | Decrease → | Increase → |
|---|---|---|
| `size_ratio` | more, smaller merges: higher write amp, fewer sorted runs, smoother I/O | fewer, larger merges: lower write amp, more sorted runs (toward STCS) |
| sorted-run count limit | fewer sorted runs (better read amp, worse write amp) | more sorted runs (worse read amp, better write amp) |
| `max_size_amplification_percent` | frequent major compactions (low space amp, high write amp) | more tolerated redundancy (high space amp, low write amp) |
Leveled vs. tiered vs. universal vs. FIFO — the table
Here is the comparison table this two-chapter arc has been aiming for. Numbers are order-of-magnitude typical in production, not exact bounds. T is dataset size, M is memtable size, N is the tiered/universal bucket size (typically 4).
| Scheme | Write amp | Read amp (point, worst) | Space amp | Peak compaction I/O | Best workload |
|---|---|---|---|---|---|
| Leveled | 10\text{–}30\times | ~L files (≈ 6–7) | \sim 1.1\times | small (one L_k file + overlapping L_{k+1} files) | read-heavy, point lookups, scans |
| Tiered (STCS) | \sim \log_N(T/M) \approx 3\text{–}5\times | up to N \cdot \log_N(T/M) \approx 20\text{–}40 | 1.3\text{–}2\times (transient 2×) | large (N whole SSTables) | write-heavy, rarely read |
| Universal | \sim 3\text{–}6\times (similar to tiered; smoother) | ~max sorted runs (configurable, 4–10) | \le 1 + \text{max\_size\_amplification\_percent}/100 \approx 2\times (hard cap) | medium; occasional full compaction when space-amp trigger fires | mixed read-write with space budget |
| FIFO | 1\times | all SSTables (dozens+) | whatever retention allows | 0 (just unlink the oldest) | time-series, strict retention |
Three things jump out.
Universal is tiered with safety nets. The write amp is close to STCS; the advantage is the hard caps on read amp (via count trigger) and space amp (via the forced major compaction). In practice that turns universal into "tiered, but without the pathological failure modes" — you no longer wake up to find 60 SSTables in one bucket or space usage at 3× because of a delete storm.
Leveled wins on reads and space. Tiered and universal win on writes. FIFO is in its own corner — it only makes sense when you are literally going to let old data fall off the back of the truck.
Peak compaction I/O is a separate dimension. Leveled merges are small (bounded to one source file plus overlapping targets), tiered merges are large (all N SSTables in a tier), universal merges are medium except when the space-amp trigger fires a full compaction. Large merges can stall foreground traffic if they saturate the disk; this is why universal and tiered sometimes need rate-limiters that leveled does not.
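Such a rate-limiter is usually a token bucket over bytes: a compaction thread asks for permission before each chunk of I/O, and a big merge gets stretched out instead of saturating the disk. A minimal sketch (the clock is passed in explicitly for testability; a real limiter would read a monotonic clock and sleep):

```python
class TokenBucket:
    """Byte-rate limiter for background I/O: refill at rate_bytes_per_sec,
    burst up to capacity. request() returns the seconds the caller should
    wait before issuing that many bytes of I/O."""
    def __init__(self, rate_bytes_per_sec, capacity, now=0.0):
        self.rate = rate_bytes_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def request(self, nbytes, now):
        # Refill tokens for elapsed time, clamped to the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0                    # proceed immediately
        return -self.tokens / self.rate   # sleep this long first
```

Foreground flushes typically bypass the bucket or get a separate, larger one; the point is to cap background compaction, not user writes.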
Hybrid and adaptive compaction
The natural next step: run different strategies at different levels — tiered near the top, where fresh data lands, and leveled below, where reads and space matter. RocksDB's leveled style already works this way at the very top: L0 is effectively a tier of overlapping flush files, and L1 onward is leveled.
- L0: tiered-family (memtable flushes land here as independent files; no sense partitioning the key space yet).
- L1 through L_{k-1}: leveled (mid-levels are where reads and space benefit most).
- L_k (bottommost, largest): optionally universal or a looser leveled — because the bottom level is 90% of the dataset, keeping its space amp at 1.1× costs enormous write amp on the full-size partition.
Variants exist. Cassandra shipped TimeWindowCompactionStrategy (TWCS) — a time-bucketed tiered, specifically for TTL workloads where old data is useless and the goal is to keep all of it in one bucket so it can be dropped wholesale. ScyllaDB shipped incremental compaction strategy (ICS), which decomposes large SSTables into "fragments" to keep peak disk usage low during merges. [3]
Adaptive compaction is the research-paper frontier: the engine measures the live workload (writes per second, read amp observed, space amp observed) and continuously tunes the triggers — sliding the working point around the triangle to chase the observed load. The Monkey paper [1] and its descendants (Dostoevsky [4], ELSM, RocksDB's FAST-LSM proposals) all aim in this direction. No production engine ships a fully adaptive compactor yet — the tuning loop is hard to make robust — but the knobs universal exposes are the interface that an adaptive loop would drive.
How to choose — workload-driven decision
The decision is not which compaction strategy is best. It is which point in the trilemma triangle matches your workload.
Picking a strategy for three workloads
Workload A: metrics ingestion (Prometheus-style, 1M writes/sec, reads 1% of ingestion).
Write amp dominates cost. Every extra factor of 2\times write amp is 2\times more disk IOPS, 2\times more SSDs, 2\times the AWS bill. Reads are rare and usually range scans over recent data, which sit in the top tier anyway (cache-friendly). Space amp at 1.5× is tolerable — disk is cheaper than IOPS.
Winner: Tiered or Universal. Cassandra with STCS, or RocksDB with universal (max_size_amplification_percent = 200, generous sorted-run limit). Write amp ~3–5×; reads pay but rarely.
Workload B: user profile store (Bigtable-like, 10K writes/sec, 100K reads/sec, 99th percentile SLO 10ms).
Reads dominate. Write amp of 20\times means every 1 MB/s of application writes turns into 20 MB/s of disk writes, which is fine because ingest is modest. Every point read must pay one seek per level; at L=6, that's 6 seeks on a cold cache. Tiered would blow the SLO with 30+ SSTables and multiple bloom false positives per read.
Winner: Leveled. RocksDB's LCS or LevelDB. Read amp ≈ 6; space amp ≈ 1.1 (storage cost stays linear).
Workload C: 7-day application log retention (write-only, read only for forensics).
No read-latency SLO. Old data becomes worthless on day 8. You are paying for disk, not for IOPS or read performance.
Winner: FIFO. RocksDB FIFO mode with a 7-day TTL. Write amp 1×; old SSTables are unlinked when total size or age crosses the threshold. Reads are awful — but nobody reads this data unless a postmortem demands it, and then a 2-minute scan is acceptable.
The same engine (RocksDB) handles all three by choosing a different compaction mode per column family. The workload dictates the mode; the mode is not an engine property but a per-table knob.
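The three verdicts collapse into one heuristic. A hedged sketch — the thresholds are illustrative, not tuning advice:

```python
def pick_strategy(writes_per_sec, reads_per_sec, has_read_latency_slo,
                  data_expires_by_age):
    """Map coarse workload traits to a compaction mode.
    Threshold values are made up for illustration."""
    if data_expires_by_age and not has_read_latency_slo:
        return "fifo"                     # workload C: retention-only logs
    read_write_ratio = reads_per_sec / max(writes_per_sec, 1)
    if has_read_latency_slo and read_write_ratio >= 1:
        return "leveled"                  # workload B: read-dominated with an SLO
    if read_write_ratio < 0.1:
        return "tiered-or-universal"      # workload A: ingestion-dominated
    return "universal"                    # mixed: tunable interior point
```

Running the three workloads above through it returns the same answers the prose gives, which is the point: the decision is mechanical once the workload traits are named.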
Common confusions
- "Isn't universal just tiered with more knobs?" Partly yes. Universal's size-ratio trigger behaves very much like STCS's bucketing, and for a write-heavy workload with default settings the two produce similar SSTable geometries. The practical difference is the space-amp and count triggers, which act as safety nets. STCS, left alone, can drift into pathological SSTable layouts (many files, huge space amp, one giant file that never merges). Universal can't — one of its three triggers will always fire before things get that bad.
- "Why is the space-amp calculation so approximate?" Universal estimates space amp as (sum of all SSTables except the largest) / (largest). This overestimates when the largest SSTable holds many deletes (its "live" size is smaller than its file size) and underestimates when many small SSTables hold overwrites of rows in the big one. A true measurement requires scanning every file and merging logically, which would itself be a compaction. The approximation is good enough: it is off by a factor of 2 in bad cases but rarely wrong about whether a major compaction should fire.
- "Can I have more than one universal knob above the hard limit?" No — that is the point of the trilemma. If the size-ratio trigger and the count trigger disagree (size-ratio says "don't merge yet", count says "too many files"), count wins, and universal does the merge. If the space-amp trigger says "too much redundant space", it overrides both. The triggers are a priority chain, not a consensus.
- "Does universal work with leveled at the bottom?" RocksDB's default in some configurations is `kCompactionStyleUniversal`, which runs universal at every level. There is a separate mode, `kCompactionStyleLevel`, which is leveled everywhere, and a hybrid mode discussed in the "Going deeper" section. The reason universal-at-every-level sometimes loses to leveled-at-the-bottom is that the bottommost level holds 90%+ of the dataset, and universal's space-amp trigger forces periodic full rewrites of that monster — very expensive. Leveled handles the bottom by rewriting it in partitioned chunks, which is gentler.
- "Is FIFO really a 'compaction strategy'?" It is the degenerate case: compaction with zero work. Calling it a strategy reinforces the point that compaction is a policy choice, and "do nothing, delete the oldest files" is a legitimate choice for some workloads. It also makes the trilemma corners explicit — FIFO is the anchor at the write-amp corner, the same way a perfectly-merged single file would anchor the read-amp corner.
Going deeper
Universal is one name for a family of tunable tiered policies. This section walks the specific RocksDB options, compares with Cassandra's TWCS/DTCS, and previews the adaptive-compaction research.
RocksDB's universal compaction options
The full configuration surface of kCompactionStyleUniversal in RocksDB is about a dozen knobs, but four carry most of the weight [2].
- `size_ratio` (default 1): the percentage threshold for the size-ratio trigger, expressed in percent. A value of 1 means "fire when older_sum > newest × (1 + 1/100)" — a very tight geometric requirement that causes many small merges. Production deployments often raise this to 10–20 to allow more slack and fewer merges.
- `min_merge_width` / `max_merge_width` (defaults 2 and UINT_MAX): minimum and maximum number of SSTables to pull into any single compaction. Bounding the max is important on big datasets — a run that would merge 40 SSTables at once will saturate the disk for hours.
- `max_size_amplification_percent` (default 200): the space-amp ceiling. When exceeded, a full major compaction fires. Lowering to 100 forces aggressive space-amp control at the cost of more full compactions; raising to 400 tolerates 4× space amp in exchange for near-zero full compactions.
- `compression_size_percent` (default -1): when \ge 0, the top fraction of SSTables (by size) is not compressed, keeping write amp on the hot path low. The cold bottom is compressed to reclaim space. This is the tunable-per-level-compression interface that universal exposes.
The operator's task is to match these knobs to the workload's position on the trilemma. A "configure once, leave alone" strategy is rare; most teams iterate on the universal options for weeks after going to production, watching write-amp, read-amp, and disk-usage metrics.
Cassandra's DateTieredCompactionStrategy (DTCS) and TWCS
Cassandra shipped DateTieredCompactionStrategy in 2.0.11 as a time-aware variant of STCS for time-series workloads. The bucketing was done by write-time rather than size: SSTables written in the same time window went into the same bucket. The idea was clean — old data lives in old SSTables, new data in new ones, merges respect that boundary — but the implementation had edge cases that caused SSTable fragmentation under common workloads (out-of-order writes, repair streams backfilling old data). DTCS was deprecated in Cassandra 3.8.
TWCS (TimeWindowCompactionStrategy) replaced it with a simpler design: bucket by write window (e.g. 1-day buckets), run STCS within each bucket, and never merge across buckets. Once a bucket's window is past, its SSTables are frozen — no more compaction, just waiting for the TTL to expire them all at once. The net effect is that TTL-expiry is nearly free (an entire bucket drops at once) and space amp is deterministic (bounded by the bucket size). [5]
TWCS is the right answer for metrics, logs, and any workload where the primary data lifecycle is "ingest → expire after T". It sits even further toward the low-write-amp corner of the triangle than STCS because old buckets never get rewritten.
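TWCS's core rule is small enough to sketch: group SSTables by the window containing their newest write, compact only inside a window, and drop whole frozen buckets once the TTL passes. This is an illustrative model, not Cassandra's implementation; `max_write_ts` is a hypothetical per-SSTable field:

```python
from collections import namedtuple

SSTable = namedtuple("SSTable", "max_write_ts")  # hypothetical record

def twcs_buckets(sstables, window_secs):
    """Group SSTables into time windows by their newest write timestamp."""
    buckets = {}
    for sst in sstables:
        key = sst.max_write_ts // window_secs
        buckets.setdefault(key, []).append(sst)
    return buckets

def twcs_expired(buckets, now, window_secs, ttl_secs):
    """Whole buckets whose window ended more than ttl_secs ago are dropped
    wholesale; they are never merged with anything."""
    return [b for key, b in buckets.items()
            if (key + 1) * window_secs + ttl_secs <= now]
```

STCS-style merging would run only inside the newest bucket; everything older is frozen, which is why TTL expiry under TWCS is nearly free.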
Adaptive compaction proposals
The research frontier has produced several adaptive schemes. Three worth knowing.
Monkey (Dayan, Athanassoulis, Idreos — SIGMOD 2017). [1] Proves the trilemma formally and derives the optimal allocation of Bloom-filter bits-per-key across levels (not uniform — deeper levels get fewer bits). Shows that for the same bloom-filter memory budget, Monkey's allocation gives lower read amp than RocksDB's uniform allocation. Not a compaction strategy per se, but the bloom-filter allocation is tightly coupled to the chosen compaction policy.
Dostoevsky (Dayan and Idreos — SIGMOD 2018). [4] Observes that leveled's write amp is dominated by the cost of rewriting the bottom level (which holds 90%+ of data). Proposes lazy leveling: leveled everywhere except the bottom, which uses tiered. Reclaims much of leveled's read-amp advantage while paying only tiered's write amp on the bottom. RocksDB's per-level compaction-style option supports exactly this hybrid, and operators sometimes deploy it.
Wacky / ELSM / FAST-LSM (2019–2023). A family of proposals where the engine measures observed read-ratio and write-ratio during operation and automatically adjusts N, fanout, and per-level compaction style. None are in production engines yet, but RocksDB's tuning telemetry (rocksdb.compaction-pending, per-level statistics) is the groundwork that an adaptive loop would consume.
The common thread: every adaptive proposal tries to let the engine slide along the dashed curve in the trilemma triangle in response to load, instead of requiring operators to pick a point and pin it there. Until such loops ship, universal compaction is the most tunable policy a production engine offers, and operators manually do the sliding.
Major compactions in universal — the ticking bomb
The space-amp trigger is what gives universal its space-amp guarantee, but it is also the most operationally disruptive moment in a universal-mode database. When max_size_amplification_percent is crossed, universal merges every SSTable into one. For a 500 GB database that's a 500 GB I/O event that can take hours and saturate the disk. During that time, foreground writes keep arriving, piling up in the memtable and then in new L0 SSTables that can't be merged (the compaction thread is busy); the memtable may fill and block writes; the cluster's tail-latency SLO may temporarily collapse.
Production deployments work around this by (a) setting max_size_amplification_percent generously (400+) to make the trigger fire rarely, (b) running compaction on a dedicated thread pool with a rate limit, and (c) monitoring rocksdb.estimate-pending-compaction-bytes to predict and pre-empt the big merge. It is the one place where universal's "no pathological states" claim is weakest — the strategy guarantees the state won't persist, but the transition out of it can itself cause an outage.
Where this leads next
Three chapters, one pattern: flushes create SSTables, SSTables pile up, compaction merges them. Leveled, tiered, and universal are three answers to "how and when". The trilemma tells you that no single answer wins — you pick your point on the triangle.
One piece of the puzzle is still missing. Compaction is a background process that rewrites and deletes SSTables underneath the foreground read path. What happens when a long-running read — say, a full-table scan started ten minutes ago — is in the middle of iterating, and the compactor deletes an SSTable that the scan was about to read?
The answer is snapshots and iterators, the next chapter. The mechanism (reference counts on SSTable handles, a version number stamped into every iterator at creation) is simple. The consequences — consistent range scans, repeatable reads, and how they interact with compaction's reclamation — are where the LSM engine finally starts looking like a database rather than a file store.
After snapshots, Build 3 closes with the bloom-filter / block-cache / write-buffer triad — the RAM-side scaffolding every serious LSM engine layers on top of the compaction machine. Then Build 4 starts: the transactional interface that application code actually talks to, built on top of the storage engine we just finished.
For now, one takeaway: compaction is not a solved problem — it is a family of policies, each pinned to a corner of a trilemma. Leveled, tiered, universal, FIFO, TWCS, hybrid — each is a different bet about what your workload looks like. The job of the operator is to match the bet to the workload. The job of the engine is to make the bets tunable. The Monkey paper's contribution was to formalise what practitioners had known for a decade — you cannot have all three — and give the community a vocabulary for the trade-off. Every engine since has shipped with that trade-off exposed as knobs, not hidden behind a "default strategy".
References
- Niv Dayan, Manos Athanassoulis, Stratos Idreos, Monkey: Optimal Navigable Key-Value Store (SIGMOD 2017) — proves the write/read/space amplification trilemma for LSM compaction and derives optimal Bloom-filter allocation per level. stratos.seas.harvard.edu.
- RocksDB Team, Universal Compaction — the authoritative documentation for RocksDB's `kCompactionStyleUniversal`, including the size-ratio trigger, sorted-run count trigger, `max_size_amplification_percent`, and all tuning options. github.com/facebook/rocksdb/wiki/Universal-Compaction.
- Raphael S. Carvalho et al., Incremental Compaction Strategy — ScyllaDB's variant that fragments large SSTables to reduce peak disk usage during merges, keeping the write/read/space tradeoffs of leveled but with smaller per-compaction I/O. scylladb.com/2020/11/17/incremental-compaction-strategy.
- Niv Dayan and Stratos Idreos, Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging (SIGMOD 2018) — introduces lazy leveling, where leveled is used at every level except the bottom, which uses tiered. stratos.seas.harvard.edu.
- Apache Cassandra Documentation, Time Window Compaction Strategy — TWCS's design rationale as a replacement for DTCS, its bucketing rules, and guidance on window-size selection for time-series workloads. cassandra.apache.org/doc/latest/cassandra/operating/compaction/twcs.html.
- Siying Dong et al., Optimizing Space Amplification in RocksDB (CIDR 2017) — Facebook's production report on the trade-offs among compaction strategies at scale, including measured numbers for universal vs. leveled. cidrdb.org.