In short
Chapter 17 built leveled compaction: non-overlapping files per level, each level ten times the previous, a read touches at most one SSTable per level, and the price is roughly 10× write amplification. That bargain is wrong for write-heavy workloads. Tiered compaction — the scheme Cassandra shipped with and still defaults to, under the name size-tiered compaction strategy (STCS) — is the opposite trade. Each "tier" holds up to N SSTables of roughly the same size; the moment the N-th one lands, they are merged into a single SSTable that graduates to the next tier. Writes pay far less (a record is rewritten about \log_N(\text{dataset}/\text{memtable}) times total, not once per level), but reads pay more (a key may live in any SSTable in any tier, so bloom filters have to cover many more files), and space amp can spike to 2\times during a merge because you hold both the inputs and the output on disk until the swap. Cassandra was built for Facebook's inbox search — append-heavy, rarely read, hundreds of machines. Tiered was the right default. This chapter writes the tier picker, traces a sequence of flushes and merges, and lines the three amps up side by side with leveled so the workload \to strategy mapping becomes obvious.
Cassandra was built for writes.
The original Cassandra paper [1] describes the workload: Facebook Inbox Search, 2008. Every chat message, every comment, every wall post — a firehose of small writes that almost nobody ever re-read. The system had to swallow a hundred thousand writes per second per machine and do it on commodity disks where a sequential write was 100 MB/s and a random write was a cruel joke.
A write-heavy system needs cheap writes. And leveled compaction — the scheme from the last chapter — is many things, but cheap on writes it is not. Every SSTable that leaves L_0 gets rewritten roughly once per level, and with ten levels in a production deployment that is a write amplification of somewhere between 10 and 30. A 10 MB/s write rate from the application turns into 100–300 MB/s of disk I/O. For Facebook's inbox load, the disks would have melted.
So Cassandra picked the other door. It accepted more work on reads (which were rare anyway) and more peak disk usage (disks were cheap compared to IOPS) in exchange for a write path that did almost no rewriting. That scheme is tiered compaction, and it is the subject of this chapter.
The rule, in one sentence
Group SSTables by size. When N SSTables of roughly the same size accumulate, merge them into one SSTable roughly N\times larger. The merged file joins the next tier and waits for N{-}1 siblings to arrive before being merged again.
That is the whole algorithm. N is typically 4 (Cassandra's min_threshold default). The "roughly the same size" rule is made precise by bucketing: an SSTable joins a bucket if its size falls within a ratio band — typically 0.5× to 1.5× — of the bucket's representative size. Everything else — the compaction queue, the tombstone handling, the major-compaction knob — is implementation decoration on top of this rule.
The picture is worth more than the words.
Leveled compaction thinks in terms of levels that partition the key space — one big L_n shared across the entire key range. Tiered compaction thinks in terms of tiers of whole SSTables — many files per tier, each covering the whole key range independently. That is the one-sentence intuition behind everything that follows. Leveled slices the key space and constrains overlap per level; tiered slices time and constrains overlap per size bucket.
Writing the tier picker
The tier picker is the heart of a tiered compactor: it scans the current list of SSTables and decides which subset (if any) to merge next. Cassandra's actual picker is a couple of hundred lines of Java, but the skeleton fits in under fifty lines of Python.
```python
# tiered_compaction.py — size-tiered picker in the spirit of Cassandra's STCS
from collections import namedtuple

# Stand-in for the SSTable handle from earlier chapters: a name and a size.
SSTable = namedtuple("SSTable", "name size_bytes")

MIN_THRESHOLD = 4    # N — merge when this many similar SSTables pile up
MAX_THRESHOLD = 32   # cap on how many we merge at once (for huge tiers)
BUCKET_LOW = 0.5     # "similar size" = size ratio between BUCKET_LOW
BUCKET_HIGH = 1.5    # and BUCKET_HIGH of the bucket's median

def bucket_sstables(sstables):
    """
    Group SSTables into buckets of 'similar size'. Same logic as Cassandra's
    SizeTieredCompactionStrategy.getBuckets().
    """
    sstables = sorted(sstables, key=lambda s: s.size_bytes)
    buckets = []
    for sst in sstables:
        placed = False
        for bucket in buckets:
            median = bucket[len(bucket) // 2].size_bytes
            if BUCKET_LOW * median <= sst.size_bytes <= BUCKET_HIGH * median:
                bucket.append(sst)
                placed = True
                break
        if not placed:
            buckets.append([sst])
    return buckets

def pick_compaction(sstables):
    """
    Return a list of SSTables to compact together, or None if no bucket
    currently has enough similar-sized files to be worth merging.
    """
    for bucket in bucket_sstables(sstables):
        if len(bucket) >= MIN_THRESHOLD:
            # Merge up to MAX_THRESHOLD of the smallest in the bucket first;
            # that keeps the merged output aligned with the tier below it.
            bucket.sort(key=lambda s: s.size_bytes)
            return bucket[:MAX_THRESHOLD]
    return None  # nothing to do right now
```
Why bucket by size ratio instead of by "tier number": SSTables don't have a tier tag written into them. They just have a size. After a delete-heavy merge, an output SSTable may be noticeably smaller than the N inputs it was made from, so it might not neatly fit the next tier. Bucketing by size ratio lets the scheme recover automatically — the small output joins whatever bucket it matches now, and the algorithm is stateless in the tier numbering.
Why a MAX_THRESHOLD: with a long-running cluster, Tier 4 or Tier 5 may briefly hold dozens of similar-sized SSTables (e.g. after a repair storm delivers many parallel streams). Merging all of them in one compaction would mean a multi-hundred-GB merge, which stalls the compaction thread for hours and spikes disk usage. Capping at 32 breaks the giant merge into a few medium ones.
The actual compaction, once the picker has chosen a set, is exactly the k-way merge from chapter 15 — keep the newest value per key, drop tombstones older than the gc-grace, write the result into one new SSTable, atomically install it, and delete the inputs. That part is identical across tiered and leveled; only the picker differs.
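That merge loop can be sketched in a few lines. A minimal version, assuming each input run yields (key, seqno, value) tuples in key order, with None standing in for a tombstone marker (the gc-grace check is omitted here):

```python
import heapq

TOMBSTONE = None  # stand-in delete marker; a real engine tags the record itself

def merge_runs(runs):
    """k-way merge of sorted runs of (key, seqno, value) tuples. Emits the
    newest record per key, in key order, and drops tombstoned keys."""
    # Merge on (key, -seqno): for equal keys, the newest record surfaces first.
    streams = [(((k, -s), v) for (k, s, v) in run) for run in runs]
    merged, last_key = [], object()
    for (key, _), value in heapq.merge(*streams):
        if key == last_key:
            continue                      # older, shadowed version of this key
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged

old_run = [("a", 1, "v1"), ("b", 1, "v2")]
new_run = [("a", 2, TOMBSTONE), ("c", 2, "v3")]
print(merge_runs([old_run, new_run]))     # [('b', 'v2'), ('c', 'v3')]
```

Note the trick of negating the sequence number inside the merge key: it makes "newest version first" fall out of the ordinary tuple ordering, so "keep the newest per key" reduces to "keep the first occurrence of each key".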
Tracing a sequence of flushes
Sixteen flushes, five compactions
Start with an empty store and MIN_THRESHOLD = 4. The memtable flushes every 4 MB. Trace what the SSTable set looks like as writes pour in.
```
After flush 1:  [ S1(4) ]
After flush 2:  [ S1(4), S2(4) ]
After flush 3:  [ S1(4), S2(4), S3(4) ]
After flush 4:  [ S1(4), S2(4), S3(4), S4(4) ]
   → bucket at ~4 MB has 4 members → compact S1..S4
   → S1..S4 merge into M1(16). Delete S1..S4.
After compact:  [ M1(16) ]

After flush 5:  [ M1(16), S5(4) ]        # S5 is in its own bucket (not similar to M1)
After flush 6:  [ M1(16), S5(4), S6(4) ]
After flush 7:  [ M1(16), S5(4), S6(4), S7(4) ]
After flush 8:  [ M1(16), S5(4), S6(4), S7(4), S8(4) ]
   → bucket at ~4 MB now has 4 members → compact S5..S8
   → M2(16). Delete S5..S8.
After compact:  [ M1(16), M2(16) ]

... repeat ...

After flush 16: [ M1(16), M2(16), M3(16), M4(16) ]
   → bucket at ~16 MB now has 4 members → compact M1..M4
   → L1(64). Delete M1..M4.
After compact:  [ L1(64) ]
```
Count the work. Sixteen flushes wrote 16 \times 4 = 64 MB to disk (the cost that is inherent to ingestion; no compactor can skip it). The five compactions rewrote:
- Four \text{Tier-0} \to \text{Tier-1} rounds, triggered after flushes 4, 8, 12, and 16: each 4×4 = 16 MB in, 16 MB out.
- One final \text{Tier-1} \to \text{Tier-2} round: 4×16 = 64 MB in, 64 MB out.
Total compaction I/O = 4 \times 16 + 64 = 128 MB of writes, on top of the 64 MB of flushes. Write amplification = (64 + 128) / 64 = 3\times at this data size. A record in the 64 MB dataset has been copied, on average, three times: once by the flush, once by the \text{Tier-0} \to \text{Tier-1} merge, and once by the \text{Tier-1} \to \text{Tier-2} merge. Leveled compaction on the same 64 MB would pay more: each record is rewritten roughly once per level boundary, plus the overlap rewrites within each level, so the same workload climbs past 3\times as soon as the tree is more than a couple of levels deep.
And notice the peak disk usage right before the last merge: both M1..M4 (64 MB total) and the output L1 (64 MB) are on disk simultaneously. That is space amplification spiking to 2\times — characteristic of every tiered merge.
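The whole trace can be reproduced mechanically. A minimal simulation, with exact-size buckets standing in for the ratio bucketing and dedup shrinkage ignored, flushes sixteen 4 MB SSTables and tallies flush writes, compaction writes, and peak disk:

```python
MIN_THRESHOLD = 4

def compact_fully(sizes, stats):
    """Merge any group of MIN_THRESHOLD equal-sized SSTables, repeatedly,
    until no bucket is full. Sizes are in MB; dedup shrinkage is ignored."""
    while True:
        buckets = {}
        for s in sizes:
            buckets.setdefault(s, []).append(s)
        full = next((s for s, b in buckets.items() if len(b) >= MIN_THRESHOLD), None)
        if full is None:
            return sizes
        for _ in range(MIN_THRESHOLD):
            sizes.remove(full)
        merged = full * MIN_THRESHOLD
        sizes.append(merged)
        stats["compaction_mb"] += merged          # bytes rewritten by this merge
        # Peak disk during the merge: output installed + inputs not yet deleted.
        stats["peak_mb"] = max(stats["peak_mb"], sum(sizes) + full * MIN_THRESHOLD)

stats = {"flush_mb": 0, "compaction_mb": 0, "peak_mb": 0}
tables = []
for _ in range(16):
    tables.append(4)                              # memtable flush: one 4 MB SSTable
    stats["flush_mb"] += 4
    tables = compact_fully(tables, stats)

wa = (stats["flush_mb"] + stats["compaction_mb"]) / stats["flush_mb"]
print(tables, stats, wa)
# [64] {'flush_mb': 64, 'compaction_mb': 128, 'peak_mb': 128} 3.0
```

The tallies match the hand count: 128 MB of compaction writes on top of 64 MB of flushes gives 3× write amp, and the 128 MB peak for a 64 MB dataset is the 2× space spike from the final merge.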
Tiered vs. leveled — the amps
Here is the table the chapter has been building up to. T is the total dataset size; M is the memtable size. Numbers are order-of-magnitude typical in production deployments, not exact bounds.
| quantity | leveled | tiered (STCS) |
|---|---|---|
| write amplification | \sim 10\text{–}30\times (a few rewrites per level at 10× fanout) | \sim \log_N(T/M) \approx 3\text{–}5\times |
| read amplification (point lookup, worst case) | ~1 SSTable per level; total ≈ levels | up to N per tier; total ≈ N \cdot \log_N(T/M) |
| space amplification | \lesssim 1.1\times at steady state | \sim 1.3\text{–}2\times; transient 2\times during major merges |
| best workload | read-heavy, update-in-place, scan-friendly | append-heavy, write-heavy, rarely-read |
| Bloom filter pressure | low (~1 filter hit per level) | high (one per SSTable in each tier the key might live in) |
| peak disk during a single compaction | one L_n partition + one L_{n+1} partition (small, partitioned) | N whole SSTables + their merged output (can be large) |
The pattern falls out of the geometry. Leveled amortises rewrites across many levels, each touching a small key range, so writes are expensive and reads are cheap (only one SSTable per level needs checking, because per-level files are non-overlapping). Tiered batches rewrites into a few big merges that happen rarely, so writes are cheap, but every SSTable in the store might hold any key and a read needs a bloom-filter probe per file.
Why is leveled's write amp so much worse than tiered's? In leveled, a record sitting in L_k is rewritten every time a compaction from L_{k-1} overlaps its key range. Because L_{k-1} is 1/10 the size of L_k, the contents of L_{k-1} drain through L_k repeatedly before L_k itself spills, and each drain rewrites the overlapping slices of L_k. The net effect is that every byte is rewritten at least once per level boundary it traverses, and usually more than once per level once overlap is counted. With L_0 through L_6 and 10× fanout, that is a floor of 6–7 rewrites per byte from flushes alone, and higher in practice because deletes and overwrites cascade — which is where the 10–30× figure comes from. Tiered only rewrites a byte when its tier fills up, which happens \log_N times over the life of the dataset.
Workload match — which one wins where
The short version.
- Write-heavy, rarely-read (event logs, time-series, Cassandra's original metrics/inbox workload). Tiered. Every extra byte of write amp costs real IOPS; read amp costs very little because there are no reads. Cassandra still defaults to STCS for exactly this reason. [6]
- Read-heavy with point lookups (Bigtable, LevelDB as a metadata store, Kafka Streams' state store). Leveled. Every point lookup pays O(\text{levels}) disk seeks; keeping that number small and every level's files non-overlapping is worth the rewrites. Also: leveled's steady-state space amp is close to 1\times, which matters when the dataset is a large fraction of the machine's disk.
- Mixed with heavy deletes (TTL data, CDC, Kafka log-compacted topics). Neither default is great. Tombstones in tiered can linger for a long time in the big tier (a record is only actually deleted when the tombstone meets the final live copy in the same merge, which may be many tiers away). Cassandra added TimeWindowCompactionStrategy (TWCS) [6] as a time-bucketed variant of STCS specifically for TTL workloads; leveled with "tombstone-triggered compaction" is RocksDB's answer.
- Everything else (general-purpose KV, DynamoDB-style workloads). Most engines offer both and let the operator pick per-table. Cassandra's CQL exposes `compaction = {'class': 'LeveledCompactionStrategy'}` per table; ScyllaDB follows the same API, and RocksDB lets you configure per column family. The default is rarely the right answer for every table.
Common confusions
- "Why does tiered read worse than leveled?" Because a key can live in any SSTable. With leveled, each level has one candidate file (partitioned by key range, non-overlapping); a point lookup is `levels` bloom probes and at most `levels` SSTable seeks. With tiered, each tier has up to N candidate files, and all tiers must be checked — so the worst case is N \cdot \text{tiers} files to probe. Bloom filters help a lot (99% of files are skipped without I/O), but the filter lookups themselves aren't free, and the tail latency of a lookup that hits false positives in several filters is noticeably worse.
- "How does tiered interact with Bloom filters?" Tightly. Because every SSTable is a potential home for every key, each SSTable carries a bloom filter and a read must consult all of them. A 1% false positive rate per filter means a read with 40 SSTables in the store has about a one-in-three chance of at least one false positive — a wasted disk seek each time. Cassandra defaults to a tighter 0.1\% false positive rate for STCS tables to keep this under control; it costs about 10\times more bloom-filter RAM than leveled, which is the real hidden cost of tiered on large-dataset nodes. [4]
- "Why is space amp worse in tiered?" Two reasons. (1) Transient doubling: during a merge, both the inputs and the output sit on disk until the inputs are deleted, which spikes disk usage to 2\times the merged size. (2) Slow obsolescence: an overwrite in a lower tier doesn't actually reclaim space until the two tiers are merged, which may be many flushes away. Leveled, in contrast, touches every key in its key-range partition every time the partition is compacted, so obsolete versions are reclaimed quickly.
- "What if all my SSTables are different sizes — do they ever compact?" Yes, and this is the bucketing rule at work. An SSTable that doesn't fit any existing bucket within the [0.5\times, 1.5\times] window starts a new bucket of its own. Eventually that bucket fills up (through continued flushes/compactions) or the operator triggers a major compaction that merges every bucket together regardless. In practice, steady-state Cassandra clusters have 5–15 SSTables per table per shard, distributed across 3–5 buckets.
- "Is STCS the same thing as 'universal compaction' in RocksDB?" Close but not identical. RocksDB's universal compaction [5] is a variant of tiered that adds a few more triggers (read amplification trigger, space amplification trigger) and more nuanced bucket selection. The next chapter — Universal compaction and the write/read/space trilemma — works through the differences and uses RocksDB's universal mode as the vehicle for the full trilemma story.
Going deeper
Tiered is a default with knobs. This section walks the ones that matter in production.
Tuning min_threshold and max_threshold
Cassandra exposes both as table properties. min_threshold (default 4) sets N: how many similar-sized SSTables must accumulate before a compaction triggers. max_threshold (default 32) caps the size of a single compaction.
Lowering min_threshold to 2 turns tiered into something that looks more like universal compaction: fewer SSTables per tier, lower read amp, but write amp climbs because each record is rewritten more often per tier. Raising it to 8 pushes in the other direction: more SSTables per tier, higher read amp, lower write amp. The default of 4 is a compromise that has held up well because it keeps the total SSTable count (summed over tiers) within the zone where bloom filters still buy you cheap skips.
A common failure mode: setting min_threshold very high on a write-heavy table to squeeze write amp lower, then finding that point reads have gone from 1 ms to 50 ms because a cold-cache read now touches 40 bloom filters and 2–3 SSTables. The fix is always the same — bring min_threshold back down, or (better) switch that table to TWCS if it's TTL data or leveled if it's read-heavy.
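The direction of both knobs can be sanity-checked with the \log_N model. This is a back-of-envelope sketch, not a claim about any engine's exact behaviour: a record is flushed once, then rewritten roughly once per tier it graduates through, and a worst-case point read probes up to N files per tier.

```python
import math

def tiers(dataset_mb, memtable_mb, n):
    """Number of size tiers between a memtable flush and the full dataset."""
    return math.ceil(math.log(dataset_mb / memtable_mb, n))

def write_amp(dataset_mb, memtable_mb, n):
    """One flush plus roughly one rewrite per tier graduated through."""
    return 1 + tiers(dataset_mb, memtable_mb, n)

def worst_read_amp(dataset_mb, memtable_mb, n):
    """Worst case: up to n candidate SSTables in every tier."""
    return n * tiers(dataset_mb, memtable_mb, n)

# 100 GB dataset, 64 MB memtable, min_threshold N in {2, 4, 8}:
for n in (2, 4, 8):
    print(n, write_amp(100_000, 64, n), worst_read_amp(100_000, 64, n))
# 2 12 22   /   4 7 24   /   8 5 32
```

Lower N means more, thinner tiers (write amp climbs, fewer candidate files per tier); higher N means fewer, fatter tiers (write amp falls, more files a read might touch). The defaults sit where the model's curves cross sensibly.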
Major compactions — the big red button
Cassandra's nodetool compact on a table triggers a major compaction: merge every SSTable for that table into one. The output is a single huge file that holds the entire (live) dataset, maximally defragmented, with all obsolete versions and past-grace tombstones reclaimed.
Major compactions feel great operationally ("just squish everything into one file!") and are a terrible idea in production. The single huge SSTable no longer fits any normal bucket, so no future compaction will merge it with anything until several other files grow to be similar-sized — which can take weeks. During that time, overwrites and tombstones pile up in smaller SSTables and never reach the big file to obsolete its stale rows. Space amp stays high and never recovers. Cassandra's docs have warned against nodetool compact on production tables for a decade [6] for exactly this reason.
The correct remedy when STCS has "gotten messy" is usually to run single-SSTable compactions with nodetool garbagecollect, which rewrites one SSTable at a time to drop tombstones and overwritten cells without disturbing the size-tier geometry. Or: switch the table to a different compaction strategy.
Incremental repair — where tiered and anti-entropy collide
Cassandra is a distributed system, and replicas occasionally drift out of sync. The tool that re-syncs them is called repair, and it compares Merkle trees of the replicas' SSTable contents to find differences.
Classic repair treats every SSTable as a candidate for difference-finding — an O(\text{all data}) operation that can saturate a cluster for hours. Cassandra 2.2 introduced incremental repair: once an SSTable has participated in a successful repair, mark it as "repaired" and skip it next time. [6]
This interacts badly with STCS. The bucketing rule mixes repaired and unrepaired SSTables freely based on size alone. If a size-tier merge combines a repaired SSTable with an unrepaired one, the output can't be marked repaired (because some of its rows haven't yet been verified against other replicas). In the worst case, repeatedly mixing repaired and unrepaired SSTables delays the "repaired" flag on any given data indefinitely, and every repair cycle scans everything.
The fix, shipped with CASSANDRA-9143, is to keep repaired and unrepaired SSTables in separate compaction spaces: two parallel STCS stacks, one for each. A compaction never crosses the boundary. It is extra bookkeeping, but it preserves the incremental repair invariant. [6] This is one of the places where "tiered is simple" stops being true — a distributed write-heavy engine has to layer in per-SSTable provenance and segregate its compactor accordingly.
Tombstone-gc interaction
The specific tombstone problem is worth making concrete. A DELETE writes a tombstone record. A tombstone only becomes eligible for actual removal (a) after it is older than gc_grace_seconds (default 10 days, chosen so it survives any plausible replica-repair window) and (b) the compaction reading it is merging with the SSTable that holds the live row being shadowed.
In tiered, the live row and the tombstone might be sitting in different tiers, waiting for a rare multi-tier merge to meet. A row deleted on day 1 can visibly hold disk space until a major merge happens on day 100. Cassandra added unchecked tombstone compaction and single-SSTable tombstone compaction triggers to counteract this — if a single SSTable has more than some fraction (default 20%) of droppable tombstones and it has sat unmerged for tombstone_compaction_interval (default 1 day), compact just that one file to drop its tombstones. [6] It is a workaround built into STCS specifically because tiered's lazy merging leaves tombstones stranded.
Where this leads next
Leveled and tiered are two points in a larger design space. Leveled privileges read amp and space amp (both low); tiered privileges write amp (low). The general observation — that write, read, and space amplification form a trilemma where optimising any two hurts the third — is due to Dayan, Athanassoulis, and Idreos [3], and they proved there is no compaction strategy that beats the trilemma; you always pay somewhere.
The next chapter builds RocksDB's universal compaction — a variant of tiered with three independent triggers (size ratio, SSTable count, space amplification) and uses it as the vehicle to draw the full trilemma. The goal is to graduate the "pick one" attitude of leveled-vs-tiered into "pick your point on the curve" — the compaction strategy becomes a policy with tunable knobs, each of which moves along one axis of the trilemma.
After that, Build 3 finishes with a pair of chapters on snapshots (iterators over a dataset that is being actively compacted under your feet) and the bloom-filter / block-cache / write-buffer triad — the three data structures that every serious LSM engine layers on top of the compaction machine. Then Build 4 moves from a single-node storage engine to the transactional interface that application code actually talks to.
For now, the takeaway is one observation: Cassandra was built for writes, and its compaction reflects that. The size-tiered scheme is not a clever optimisation someone added later — it is the shape that falls out naturally when "make writes cheap" is the overriding constraint. Every other property of the system — the high bloom-filter RAM cost, the eventual tombstone lag, the major-compaction footgun, the incremental-repair bookkeeping — is a consequence of that one design decision. Once you see why STCS is the natural answer to write-heavy, the rest of the machinery falls into place.
References
1. Avinash Lakshman and Prashant Malik, Cassandra — A Decentralized Structured Storage System (LADIS 2009) — the original Cassandra paper; §5.1 describes the memtable/SSTable/compaction lifecycle and the write-heavy Inbox Search workload that motivated size-tiered compaction. cassandra.apache.org.
2. Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil, The Log-Structured Merge-Tree (LSM-Tree) (Acta Informatica, 1996) — the foundational LSM paper; derives the write-amplification geometry that both leveled and tiered schemes descend from. citeseerx.ist.psu.edu.
3. Niv Dayan, Manos Athanassoulis, Stratos Idreos, Monkey: Optimal Navigable Key-Value Store (SIGMOD 2017) — formalises the write/read/space amplification trilemma for LSM compaction and shows there is no strategy that wins on all three axes. stratos.seas.harvard.edu.
4. Siying Dong et al., Optimizing Space Amplification in RocksDB (CIDR 2017) — discusses the write/read/space trade-offs explicitly, with production numbers from Facebook for both tiered-like and leveled configurations. cidrdb.org.
5. RocksDB Team, Universal Compaction — RocksDB's documentation for its tiered-family compaction mode, with bucket selection, size-ratio triggers, and space-amp bounds. github.com/facebook/rocksdb/wiki/Universal-Compaction.
6. Apache Cassandra Documentation, How is data maintained? — Compaction strategies — describes STCS, LCS, TWCS, min_threshold/max_threshold, major compaction caveats, tombstone-triggered compaction, and the repaired/unrepaired SSTable split. cassandra.apache.org/doc/latest/cassandra/operating/compaction/stcs.html.