In short
The segmented Bitcask from chapter 7 solves one problem and ignores another. Write the same key ten times and the log holds ten records; the index points at only the last one. The first nine are dead bytes — reachable by nobody, deletable in principle, yet taking up disk space indefinitely. After a day of traffic most of your disk is dead. Compaction is the background job that scans the closed segments in write order, keeps only the records the index currently points at, writes them to a fresh compacted segment, and atomically swaps it in. Fifty lines of Python do it. The atomic swap is the only hard part — you must update the index pointers and rename the file without a concurrent get() ever reading a half-swapped state. This chapter writes the code, draws the before/after, walks through a concrete compaction, and previews how LSM-trees (Build 3) take the same idea and push it orders of magnitude further.
Imagine a single counter key being incremented once a second. views=1, views=2, views=3, ... After one day the log holds 86,400 records for that one key and the index points at exactly one of them — the last. The other 86,399 are still on disk, still sealed in closed segment files, still consuming bytes, and nothing in the world can reach them. They are garbage the filesystem does not know is garbage. The get("views") path does not scan them (the index sends it straight to the live record), so they never slow a read. They just sit there taking space.
This is the structural problem the previous chapter left open. Segmenting the log turned one ever-growing file into a series of fixed-size files, which made deletion of old data physically possible — you can now os.remove() a whole segment in one call. But nothing yet decides which segments are safe to delete, because every segment holds a mix of live and dead records. You cannot delete segment 4 wholesale; it still contains the most-recent record for some keys. You need a way to sift each old segment, keep the live records, and drop the dead ones.
That sifting process has a name. It is called compaction, and it is one of the most important operations in log-structured storage. Every real system in this family runs it — Bitcask, LevelDB, RocksDB, Cassandra, Kafka log-compacted topics, BigQuery's columnar reorgs, even Redis's AOF rewrites. The shape is always the same.
The problem, drawn
Count the records across the three closed segments laid out in the walkthrough below (seg-0, seg-1, seg-2). Seventeen in total. Live ones — the ones the index currently points at — are four. The space amplification is 17/4 \approx 4.25\times; you are paying for more than four bytes on disk for every live byte of data. For a counter-incrementing workload the ratio grows without bound. For a typical web app with a mix of reads and updates, steady-state space amplification before compaction is commonly 3–10×.
If you could walk through seg-0, seg-1, seg-2 and copy only the live records into a new file, you would end up with one compacted segment holding exactly 4 records, and you could delete the other three segments. Space amplification drops to 1\times. That is compaction, in one sentence.
The algorithm
It is three lines of English.
- Freeze a set of closed segments you want to compact. They must be immutable — never the active segment being written to.
- Scan them in write order. For each record, ask the index: "does this key currently point at this exact (segment, offset)?" If yes, it is live — copy it to a new compacted segment. If no, it is dead — skip it.
- Swap. Update the in-memory index so every live key points at its new location in the compacted segment. Delete the old segment files. Done.
The test in step 2 is the crux of the algorithm, and it is delightfully simple. The in-memory index already knows, for every live key, exactly where its latest record sits — which segment, which byte offset. A record in some old segment is live iff the index entry for its key points exactly here. Not "at this key", not "at a newer record for this key" — at this exact location. Any other location means some later put() overwrote this record, and it is dead.
Why (seg, offset) and not just "key exists in the index": a record for key age in seg-0 at offset 130 is dead if the index says age → (seg-2, 52). The key is live in the store, but this particular record is stale. The index entry is the single source of truth about which record for each key is current.
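Stated as code, the liveness rule is a one-line predicate (a sketch; the dict-based index mirrors the chapter's {key: (segment_id, offset)} shape):

```python
def is_live(index, key, seg_id, offset):
    # a record is live iff the index entry for its key is exactly this location
    return index.get(key) == (seg_id, offset)

index = {"age": ("seg-2", 52)}
assert not is_live(index, "age", "seg-0", 130)  # stale: the key lives elsewhere
assert is_live(index, "age", "seg-2", 52)       # this exact record is current
```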
Writing it — compact() on BitcaskSegmented
In the previous chapter you built BitcaskSegmented, which tracks closed segments in a list and an active segment being appended to. The index is now {key: (segment_id, offset)} instead of {key: offset}. We extend that class with a compact() method.
```python
# bitcask_compact.py — compaction extends the segmented Bitcask
import os

class BitcaskSegmented:
    # ... __init__, put, get, _rotate_segment, _rebuild_index from chapter 7 ...

    def compact(self):
        """Rewrite all closed segments into one compacted segment.

        The active segment is untouched — writes continue normally throughout."""
        frozen_ids = list(self._closed_ids)  # snapshot — the active segment is excluded
        if len(frozen_ids) < 2:
            return  # nothing worth compacting yet

        # 1. open a new segment with a temp name so a crash mid-rewrite leaves no trail
        new_id = self._next_segment_id()
        tmp_path = os.path.join(self.dir, f"compact-{new_id}.tmp")
        new_path = os.path.join(self.dir, f"seg-{new_id:06d}.log")

        # 2. build the new segment and a fresh index fragment for the keys we copy
        new_index = {}
        with open(tmp_path, "wb") as out:
            for seg_id in frozen_ids:
                with open(self._seg_path(seg_id), "rb") as src:
                    offset = 0
                    for line in src:
                        key_bytes, eq, _ = line.partition(b"=")
                        if not eq:
                            offset += len(line)
                            continue
                        key = key_bytes.decode("utf-8")
                        # liveness test: does the index still point at *this* record?
                        if self._index.get(key) == (seg_id, offset):
                            new_offset = out.tell()
                            out.write(line)
                            new_index[key] = (new_id, new_offset)
                        offset += len(line)
            out.flush()
            os.fsync(out.fileno())

        # 3. atomic swap — rename temp to final name, then redirect the index
        os.rename(tmp_path, new_path)  # atomic on POSIX within one filesystem
        self._closed_ids.append(new_id)
        for key, loc in new_index.items():
            # only redirect entries that still point at a frozen segment
            if self._index.get(key, (None,))[0] in frozen_ids:
                self._index[key] = loc

        # 4. unlink the old segment files — the index no longer points at them
        for seg_id in frozen_ids:
            old_path = self._seg_path(seg_id)
            self._closed_ids.remove(seg_id)
            os.remove(old_path)
```
About fifty lines. Every line earns its place.
frozen_ids = list(self._closed_ids). A snapshot. Between this line and the end of the method, the active segment is still receiving writes, and it will eventually close and join _closed_ids. We only compact segments that were closed at the moment compaction started — anything that closes during the compaction waits for the next round. This is what makes compaction safe without a write lock.
The liveness test — self._index.get(key) == (seg_id, offset). This one comparison is the entire algorithm. A tuple equality in Python; a few nanoseconds; and it decides whether the record at this position in this segment is live or dead. No reference counting, no mark-and-sweep, no background bookkeeping. The index is the liveness oracle, because by construction it always points at the latest record for every key.
Writing to a temp name first. If the process crashes halfway through the rewrite, the partial file is called compact-007.tmp, not seg-000007.log, and on restart _rebuild_index() ignores anything that does not match the seg-*.log pattern. You can delete stray .tmp files on startup with no risk. Crash safety for free.
os.rename(tmp_path, new_path). On POSIX, rename within a single filesystem is atomic — either the old name exists or the new name exists, never both, never neither. This is the swap primitive. After this line the new segment is visible to anyone walking the directory; before this line it is invisible under a name nobody looks up.
Conditional redirect — if self._index.get(key, (None,))[0] in frozen_ids. Subtle, important. Between step 2 (building new_index) and step 3 (applying it), another thread could have called put(key, newvalue), which appended to the active segment and updated the index to point there. That write is newer than anything compaction saw. We must not trample it. So before redirecting the index, we check: does the key still point at a frozen segment? If yes, redirect. If no, a concurrent write already made the key live somewhere newer — leave it alone.
os.remove(old_path) last. Only after the index has been fully redirected. If we deleted first and then tried to redirect, any concurrent reader could race us and fail its open(old_path) call. Order matters: publish the new, then unpublish the old.
A concrete compaction, step by step
The store starts in the following state — three closed segments and the index:
```
seg-0:  offset  0   name=dipti\n      (dead — overwritten in seg-1)
        offset 11   views=1\n         (dead)
        offset 19   views=2\n         (dead)
        offset 27   city=chennai\n    (LIVE)
        offset 40   views=3\n         (dead)
        offset 48   age=15\n          (dead)

seg-1:  offset  0   views=4\n         (dead)
        offset  8   age=16\n          (dead)
        offset 15   views=5\n         (dead)
        offset 23   name=dipti\n      (LIVE)
        offset 34   views=6\n         (dead)
        offset 42   age=17\n          (dead)

seg-2:  offset  0   views=7\n         (dead)
        offset  8   views=8\n         (dead)
        offset 16   age=18\n          (LIVE)
        offset 23   views=9\n         (dead)
        offset 31   views=10\n        (LIVE)

index:  name  → (seg-1, 23)
        age   → (seg-2, 16)
        city  → (seg-0, 27)
        views → (seg-2, 31)
```

compact() runs. It iterates seg-0, seg-1, seg-2 in that order. For each record it asks the index.

```
seg-0 @  0   key="name"    index["name"]=(seg-1,23)   ≠ (seg-0,0)    → skip
seg-0 @ 11   key="views"   index["views"]=(seg-2,31)  ≠ (seg-0,11)   → skip
...
seg-0 @ 27   key="city"    index["city"]=(seg-0,27)   == (seg-0,27)  → COPY → new offset 0
seg-0 @ 40   key="views"   ...                                       → skip
...
seg-1 @ 23   key="name"    index["name"]=(seg-1,23)   == (seg-1,23)  → COPY → new offset 13
...
seg-2 @ 16   key="age"     index["age"]=(seg-2,16)    == (seg-2,16)  → COPY → new offset 24
seg-2 @ 31   key="views"   index["views"]=(seg-2,31)  == (seg-2,31)  → COPY → new offset 31
```

The compacted segment (called seg-3) now holds, in write order:

```
seg-3:  offset  0   city=chennai\n
        offset 13   name=dipti\n
        offset 24   age=18\n
        offset 31   views=10\n
```

The index is redirected to {name: (seg-3, 13), age: (seg-3, 24), city: (seg-3, 0), views: (seg-3, 31)}. The files seg-0, seg-1, seg-2 are unlinked. 17 records on disk became 4. Space amplification dropped from 4.25\times to 1\times.
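The walkthrough is small enough to run. The sketch below is a deliberately minimal, single-threaded stand-in for the chapter-7 class (the name MiniBitcask, record-count rotation, and the six-records-per-segment cap are all illustrative choices), with this chapter's compact() logic bolted on:

```python
import os
import tempfile

class MiniBitcask:
    """Minimal single-threaded sketch: segmented log + index + compact()."""

    def __init__(self, dirpath, records_per_segment=6):
        self.dir = dirpath
        self.cap = records_per_segment
        self._index = {}              # key -> (seg_id, offset)
        self._closed_ids = []
        self._active_id, self._next_id = 0, 1
        self._active_count = 0

    def _seg_path(self, seg_id):
        return os.path.join(self.dir, f"seg-{seg_id:06d}.log")

    def put(self, key, value):
        if self._active_count >= self.cap:        # rotate the active segment
            self._closed_ids.append(self._active_id)
            self._active_id, self._next_id = self._next_id, self._next_id + 1
            self._active_count = 0
        with open(self._seg_path(self._active_id), "ab") as f:
            offset = f.tell()
            f.write(f"{key}={value}\n".encode("utf-8"))
        self._index[key] = (self._active_id, offset)
        self._active_count += 1

    def get(self, key):
        seg_id, offset = self._index[key]
        with open(self._seg_path(seg_id), "rb") as f:
            f.seek(offset)
            return f.readline().decode("utf-8").rstrip("\n").partition("=")[2]

    def compact(self):
        frozen = list(self._closed_ids)
        if len(frozen) < 2:
            return
        new_id, self._next_id = self._next_id, self._next_id + 1
        tmp = os.path.join(self.dir, f"compact-{new_id}.tmp")
        new_index = {}
        with open(tmp, "wb") as out:
            for seg_id in frozen:
                with open(self._seg_path(seg_id), "rb") as src:
                    offset = 0
                    for line in src:
                        key = line.partition(b"=")[0].decode("utf-8")
                        if self._index.get(key) == (seg_id, offset):  # live?
                            new_index[key] = (new_id, out.tell())
                            out.write(line)
                        offset += len(line)
            out.flush()
            os.fsync(out.fileno())
        os.rename(tmp, self._seg_path(new_id))    # atomic publish
        self._closed_ids.append(new_id)
        for key, loc in new_index.items():
            if self._index.get(key, (None,))[0] in frozen:
                self._index[key] = loc
        for seg_id in frozen:
            self._closed_ids.remove(seg_id)
            os.remove(self._seg_path(seg_id))

# replay the chapter's 17-record workload, then compact
store = MiniBitcask(tempfile.mkdtemp())
ops = [("name", "dipti"), ("views", 1), ("views", 2), ("city", "chennai"),
       ("views", 3), ("age", 15), ("views", 4), ("age", 16), ("views", 5),
       ("name", "dipti"), ("views", 6), ("age", 17), ("views", 7),
       ("views", 8), ("age", 18), ("views", 9), ("views", 10)]
for k, v in ops:
    store.put(k, v)
store.compact()    # seg-0 and seg-1 collapse; the active segment is untouched
```

One difference from the walkthrough: with a six-record cap, seg-2 is still the active segment when compact() runs, so only seg-0 and seg-1 are merged — exactly what the frozen-snapshot rule requires.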
The atomic-swap problem
The line that does all the real work is os.rename(tmp_path, new_path). Everything else is bookkeeping around it. The question this section answers is: why is that rename safe in the middle of a busy get workload?
A get() on our store does three things: look up the key in the index, open the file for the segment named in the index entry, seek to the offset and read. Compaction's job is to change which segment that entry points at, without any of those three steps seeing an inconsistent view.
There are two things that could go wrong and do not.
A reader could try to open a segment we just deleted. They cannot — we delete the old segment only after redirecting every index entry that pointed at it. A reader that loaded the index entry before the redirect got the old (segment, offset) and can still open the old file; Unix lets them, because an open file descriptor survives unlink. A reader that loads the index entry after the redirect gets the new (segment, offset) and opens the new file. There is no window where a reader sees the old index entry and the old file is already gone.
A reader could see the index entry mid-update. In CPython, a single dictionary assignment self._index[key] = loc is atomic under the GIL — the reader sees either the old entry or the new entry, never a torn one. So individual key redirects are safe. In a multi-threaded C implementation you would guard the dict with a mutex; Bitcask's real implementation in Erlang uses per-key message passing, which serialises the update.
Why rename is atomic: the POSIX rename(2) syscall is specified to be atomic with respect to the target name — on a single filesystem, either a process looking up the new name finds nothing (the rename hasn't happened) or it finds the fully written new file (the rename has happened). Half-written renames are not observable. The implementation lives inside the filesystem driver (ext4, xfs, apfs) and uses a filesystem-journal transaction to commit the directory-entry change. Windows semantics differ; MoveFileEx with MOVEFILE_REPLACE_EXISTING is the rough equivalent. This is why production Bitcask clones almost always target POSIX.
The pattern — write-to-temp, fsync, rename — is the single most common "safe publish" idiom in all of systems programming. You will see it in editor autosave, in package managers, in git (every object is written under a random name and renamed into .git/objects/xx/yy...), in the Kubernetes atomic configmap mount, in how journalctl commits its data files. Learning it in Bitcask is learning it for everywhere.
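As a standalone helper, the idiom is a dozen lines (a generic sketch, not part of the Bitcask class; atomic_write is an illustrative name):

```python
import os
import tempfile

def atomic_write(path, data):
    """Publish `data` at `path` so readers see either the old file or the
    new one, never a partial write. POSIX-only: the rename must stay on
    one filesystem, so the temp file is created in the target directory."""
    dir_ = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_)      # same filesystem as `path`
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())              # bytes durable before publish
        os.rename(tmp, path)                  # the atomic commit point
    except BaseException:
        os.unlink(tmp)                        # crash-cleanup: no half-files
        raise
```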
When to run it
You do not want to compact too often — it is a background I/O storm, reading and rewriting every closed byte. You do not want to compact too rarely — dead bytes pile up on disk the whole time. There are three common triggers.

- By segment count. "If there are more than 10 closed segments, compact the oldest 8." Cheap to check; a segment count is one integer. This is the policy Basho's Bitcask ships with. It keeps the total file count bounded regardless of workload, which matters because many filesystems slow down past tens of thousands of files in one directory.
- By dead-space ratio. Track, per segment, the fraction of bytes that are dead. When the total dead bytes across all closed segments exceed some threshold (say 50% of their combined size), trigger compaction. More adaptive — a workload with very few overwrites produces little dead space and rarely compacts. It needs you to count dead bytes; the cleanest way is to maintain a per-key "previous location" in the index and update a counter on every overwrite. RocksDB calls a similar metric "space amplification" and builds its leveling strategy around it.
- By manual trigger or schedule. The operator runs compaction at 3 AM when traffic is low. Crude but common in small deployments. Riak's default setup exposes a riak-admin bitcask compact command for exactly this.

Real systems combine all three — a high-water trigger based on dead-space ratio, a daily cron as a safety net, and a hard cap on file count so a pathological workload cannot run you out of inodes.
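To make the dead-space trigger concrete, here is one way to maintain the counters (a sketch; DeadSpaceTracker is an illustrative name, and a real store would fold this bookkeeping into put()):

```python
class DeadSpaceTracker:
    """Track dead bytes across segments. Call record_write(key, nbytes) on
    every put; the previous record for that key (if any) becomes dead."""

    def __init__(self):
        self._last_len = {}   # key -> size of its current live record
        self.dead_bytes = 0
        self.total_bytes = 0

    def record_write(self, key, nbytes):
        prev = self._last_len.get(key)
        if prev is not None:
            self.dead_bytes += prev      # the old record is now unreachable
        self._last_len[key] = nbytes
        self.total_bytes += nbytes

    def reset_after_compaction(self):
        # after compaction only live records remain on disk
        self.total_bytes = sum(self._last_len.values())
        self.dead_bytes = 0
```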
```python
def _should_compact(self):
    # a tiny combined policy: run if many segments, or a lot is dead
    if len(self._closed_ids) >= 10:
        return True
    total = sum(os.path.getsize(self._seg_path(s)) for s in self._closed_ids)
    dead = total - self._live_bytes  # maintained on every put/overwrite
    return total > 0 and dead / total > 0.5
```
Drop this into a background thread that wakes every few seconds, checks _should_compact(), and calls compact() when it returns true; that is the complete policy loop.
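Sketched as code (stop_event and the five-second cadence are illustrative; this assumes compact() and _should_compact() are only ever called from this one background thread):

```python
import threading

def start_compactor(store, stop_event, interval=5.0):
    """Wake periodically, check the policy, compact if needed."""
    def loop():
        # Event.wait sleeps for `interval` but returns immediately on stop
        while not stop_event.wait(interval):
            if store._should_compact():
                store.compact()
    t = threading.Thread(target=loop, name="compactor", daemon=True)
    t.start()
    return t
```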
Common confusions
- "Why not just overwrite the record in place when it gets overwritten?" Because the new value is usually a different size from the old one. views=999 is shorter than views=1000. Overwriting in place would require leaving padding bytes, or shifting every byte after it, or maintaining a free-list of holes in the file. All of those are exactly the complications that log-structured storage was invented to avoid. The whole point of the append-only design is that writes go to the end and nothing in the middle ever moves. Compaction preserves that property: it does not rewrite records in place; it writes a whole new file and deletes the old one.
- "Isn't writing every live record again a waste?" Yes, and the name for the waste is write amplification. If you write 1 GB of user data and compaction later rewrites 400 MB of it (the live portion of closed segments), your disk actually wrote 1.4 GB — a 1.4× write-amp. Bitcask's amp is typically 1.5–3× over a long workload. LSM-trees, which compact in layers, can reach 10–30× in the worst case; a lot of RocksDB tuning is about bounding this. B-trees avoid it (they overwrite pages) but pay in other ways. No design wins on all three of read-amp, write-amp, and space-amp — you trade along this axis. This is the "RUM" conjecture if you want the academic name.
- "What if a read races a compaction?" Nothing bad happens. Either the reader resolves the index entry before the redirect (reads from the old segment — fine) or after (reads from the new segment — fine). The unlink of the old segment happens after all redirects, so the reader's open file descriptor is always valid; on POSIX the file's bytes stay allocated until every fd is closed, even if the directory entry is gone. A slow reader holding a seg-2 fd while compaction unlinks seg-2 still reads the correct bytes; the kernel defers actual reclamation.
- "What if the machine crashes during compaction?" The temp file compact-007.tmp exists but is incomplete; no closed segment has been deleted because the rename has not fired yet. On restart, _rebuild_index() ignores files not matching seg-*.log, and we delete stray .tmp files. The store is in its pre-compaction state, no data lost, and compaction will retry on the next trigger. The whole operation is effectively transactional, with the rename as the commit point.
- "Can two compactions run at once?" No — you serialise compaction with a lock or by structuring it as a single background thread. Two concurrent compactions would both claim the same frozen segments and could produce duplicate output. Bitcask's implementation uses an Erlang process per storage directory; only one runs at a time.
- "Does compaction block writes?" It does not, which is the whole reason for the active/closed distinction. New put() calls land in the active segment the whole time compaction is chewing on the old ones. Only the final index-redirect step needs synchronisation with writes; that step is microseconds.
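The write-amplification arithmetic from the waste question above is worth pinning down as code (a two-line sketch):

```python
def write_amp(user_bytes, compaction_bytes):
    # total bytes the disk physically wrote, per byte of user data
    return (user_bytes + compaction_bytes) / user_bytes

# 1 GB of user writes plus 400 MB rewritten by compaction
assert write_amp(1_000_000_000, 400_000_000) == 1.4
```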
Going deeper
Everything above is enough to write a correct compactor and understand its place in a Bitcask-family store. The three sections that follow contextualise it: lock-step versus background designs, the formal space-amplification metric, and a sneak preview of how LSM-trees generalise this one operation into their entire structural policy.
Background-thread vs lock-step compaction
Two designs exist for when compaction runs.
Lock-step (inline) compaction runs on the put() path itself — every so many writes, the writer stops and runs a compaction pass. This is what you get for free if you call compact() at the end of _rotate_segment. Pros: simple, no threads, predictable memory. Cons: write latency spikes whenever compaction fires. A 200 ms rewrite pause inside a write-heavy request is a disaster for tail latency. Almost no production system does this; teaching code sometimes does.
Background-thread compaction runs on a dedicated thread that wakes up, checks the policy, and compacts if needed while the main writer keeps going. This is what real Bitcask does and what RocksDB does (with multiple compaction threads, in fact). Pros: hot-path latency is not affected. Cons: you need thread-safety on the index and the closed-segments list, and you compete for I/O bandwidth with foreground reads and writes. Most systems let you configure compaction I/O throttling (e.g. max_background_compactions and rate_limiter_bytes_per_sec in RocksDB).
A middle-ground approach is to run compaction on the user's own thread but only during an idle window (no put() in the last few seconds); SQLite's vacuum machinery is in a loosely similar spirit. It fits single-threaded toy deployments but breaks down under continuous traffic.
The space-amplification metric
Space amplification is the ratio of bytes on disk to bytes of live data:

\mathrm{SA} = \frac{\text{total bytes on disk}}{\text{bytes of live data}}
\mathrm{SA} = 1 means zero dead bytes — you just compacted, or your workload never overwrites. \mathrm{SA} = 2 means half your disk is garbage. \mathrm{SA} = 10 means your 1 TB disk has 100 GB of real data and 900 GB of zombie records.
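Measured directly against a store's files, the metric is a few lines (a sketch that assumes the chapter's internals: a dir attribute, an _index mapping key → (seg_id, offset), and a _seg_path() helper):

```python
import os

def space_amplification(store):
    # bytes on disk: every segment file in the store directory
    disk = sum(os.path.getsize(os.path.join(store.dir, name))
               for name in os.listdir(store.dir) if name.endswith(".log"))
    # live bytes: the length of the one record each index entry points at
    live = 0
    for key, (seg_id, offset) in store._index.items():
        with open(store._seg_path(seg_id), "rb") as f:
            f.seek(offset)
            live += len(f.readline())
    return disk / live if live else 1.0
```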
Bitcask-family stores under a uniform overwrite workload converge to \mathrm{SA} = 1 + c where c is the fraction of the active segment that has not yet turned over — typically \mathrm{SA} \approx 1.5 steady state with aggressive compaction, \mathrm{SA} \approx 3 with lazy compaction. LSM-trees with leveled compaction approach \mathrm{SA} \approx 1.1; they pay for that lower space amplification with much higher write amplification. B-trees run at \mathrm{SA} \approx 1.3 (page fill factor is typically 60–70%) without any compaction thread, paying in random writes instead.
The numbers matter for hardware sizing. If your workload produces 100 GB of live data per node, you buy a 200 GB disk for a Bitcask deployment, a 150 GB disk for an LSM deployment, and a 140 GB disk for a Postgres deployment. Multiplied over a fleet, the choice of engine is a real line item.
How LSM-trees generalise this
The compaction you just wrote has one pool of closed segments and collapses them all into one new one whenever policy fires. Every compact() call is a full rewrite of every dead byte.
LSM-trees — log-structured merge trees, Build 3 of this track — split the closed segments into levels. Level 0 is the youngest (the closed segments straight from the write buffer). Level 1 is roughly 10× bigger. Level 2 is 10× bigger than that. And so on. When any level fills up, it is compacted into the next level, not into a single destination. Each compaction is much smaller than a full rewrite; across levels, the total work per byte is bounded by a constant times the number of levels — typically \log_{10} of the dataset size.
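The level arithmetic can be sketched in a few lines (the 64 MB level-0 size and fanout of 10 are illustrative defaults, not anything fixed by Build 3):

```python
import math

def levels_needed(dataset_bytes, level0_bytes=64 * 2**20, fanout=10):
    # each level holds ~fanout times the one above; count levels until they fit
    ratio = max(1.0, dataset_bytes / level0_bytes)
    return max(1, math.ceil(math.log(ratio, fanout)))

# a 1 TB dataset over a 64 MB level 0 with fanout 10 needs about 5 levels
assert levels_needed(2**40) == 5
```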
This is the same idea as Bitcask's compact(), run partitioned and in layers instead of globally. It buys you scale — LSM handles multi-terabyte datasets per node — at the cost of read amplification (a get() must check each level in order until it finds the key), which Bloom filters and level summaries then paper over.
Build 3 develops LSM from scratch, starting from exactly the BitcaskSegmented.compact() you just wrote. The mental move is small; the leverage is enormous.
Compaction as garbage collection
Zoom out one more step. A log-structured store is morally a heap. Records are objects. Overwriting a key is "allocating a new object and dropping the reference to the old one." The index is the GC root. Dead records are unreachable objects. compact() is a tracing collector — it walks the live set, copies it to a fresh arena (the new segment), and frees the old arena (os.remove).
This is not an analogy; it is the same algorithm. Go's garbage collector, the JVM's G1 GC, a Lisp compacting GC — they all do precisely what compact() does, on a different substrate. Even the vocabulary transfers: "write amplification" in databases and "GC overhead" in runtimes are the same notion. "Compacting collector" in the GC literature and "compaction" in the database literature are the same verb. You have just learned garbage collection by writing a database.
Where this leads next
Compaction handles overwrites cleanly — record superseded, index pointer moves, old record skipped at compaction time. It does not handle deletes. If you want to remove a key entirely, you cannot just drop its index entry: after a crash and restart, _rebuild_index() would scan the log, find the old record, and re-add the key.
The fix is another append: a special tombstone record that says "this key is deleted." The index stores the tombstone like any other record, get() learns to return None when it resolves to a tombstone, and compaction learns to drop tombstones (and any records for their key) that are old enough. That is chapter 9.
After that, chapter 10 makes restart instant by writing a hint file next to each closed segment, so _rebuild_index() does not have to scan gigabytes on startup.
Two more chapters and your code is the full Bitcask design — about 300 lines of Python that run real production workloads. Then Build 3 will take this same engine and crank it into an LSM-tree.
References
- Justin Sheehy and David Smith, Bitcask: A Log-Structured Hash Table for Fast Key/Value Data (Basho Technologies, 2010) — §4 describes merge (their word for compaction) and the atomic-rename swap. riak.com/assets/bitcask-intro.pdf.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 Storage and Retrieval, "Making an LSM-tree out of SSTables" — compaction as the unifying idea behind Bitcask and LevelDB. dataintensive.net.
- Manos Athanassoulis et al., Designing Access Methods: The RUM Conjecture (EDBT 2016) — formal statement that read-amp, update-amp (write-amp), and memory/space-amp cannot all be minimised at once. stratos.seas.harvard.edu/publications/rum-conjecture.
- Siying Dong et al., Optimizing Space Amplification in RocksDB (CIDR 2017) — production-scale analysis of compaction strategies and the dials that matter. cidrdb.org/cidr2017/papers/p82-dong-cidr17.pdf.
- Mendel Rosenblum and John Ousterhout, The Design and Implementation of a Log-Structured File System (ACM TOCS, 1992) — the original log-structured paper; its segment cleaner is the direct ancestor of every compactor discussed here. web.stanford.edu/~ouster/cgi-bin/papers/lfs.pdf.
- Patrick O'Neil et al., The Log-Structured Merge-Tree (LSM-Tree) (Acta Informatica, 1996) — leveled compaction formalised; the structural generalisation this chapter previewed. Springer.