In short
You have built a store whose writes only ever go forward: an append-only log, segmented, with a background compactor that reclaims dead bytes. Now a user calls db.delete("alice"). The obvious implementation — "remove the index entry and go home" — is wrong in two ways. First, the old alice=... record is still sitting in some closed segment under fsync; a process restart would scan it and put alice right back. Second, in a distributed setting, a neighbour replica that missed the delete would gossip its stale value back to you and resurrect the key. A delete that leaves no durable mark is a delete that does not stick. The fix is the oldest trick in log-structured storage: write a tombstone. A special record whose payload is not a value but a marker saying "whoever reads this, the key is gone." get() learns to return None when it lands on one; compaction learns not to drop it until every older record for the same key is also gone; and after some grace period, the tombstone itself is collected. This chapter writes the code, draws the log with a tombstone in it, walks through a complete put-delete-get-compact trace, and then spends the rest of its pages on the subtle failure modes — zombie records, tombstone TTLs, per-cell and range tombstones in Cassandra, and the notorious gc_grace_seconds parameter that has deleted production data at more than one company.
In the previous chapter you taught the store to garbage-collect overwrites. Write age=15, later write age=16, and compaction eventually drops the first record because the index no longer points at it. The old value evaporates from disk.
That mechanism handles overwrites cleanly. It does not handle deletes at all. There is no way, with the vocabulary you have so far, for a user to say "this key should no longer exist." Your API has put(k, v) and get(k). It has no delete(k). Try to add one the naive way and watch it fall apart.
def delete(self, key):        # the wrong implementation
    del self._index[key]      # forget the key exists
    return
Run this, restart the process, and the deleted key comes back from the dead. _rebuild_index() scans every segment from the oldest, sees the alice=... record that is still sitting there under fsync, and helpfully puts it in the index. The delete was a lie. It existed only inside the running process's RAM and died with it.
You cannot fix this by rewriting the old record. You cannot fix it by poking a hole in the old segment. Append-only is a contract; the whole rest of the engine — compaction's liveness test, segmented backups, crash recovery — depends on it. So if you want delete to work, the contract forces a specific shape on the fix: a delete must itself be an append.
That append is called a tombstone.
The delete problem, stated plainly
A tombstone is a record that says, by its mere presence, "the key named here is deleted as of this point in the log." The machinery that reads the log must understand the marker. get() checks the index, finds an entry, opens the segment, reads the record, sees the tombstone byte, and returns None — just as if the key had never been written. Compaction sees the tombstone during its liveness scan and treats it as live (it still points to the tombstone) until it can prove that no older record for the same key survives anywhere in the store. Only then is it safe to drop.
age was written twice and then deleted. The two old age=... records are dead (older than the delete). The tombstone record — drawn with a skull — is the newest record for the key and is pointed at by the index. A get("age") follows the index to the tombstone and returns None.

Notice what this diagram makes visible. The delete is not the absence of a record. It is the presence of a specific kind of record. Deletion in an append-only store is a form of writing. The marker is durable (it is in the log, it survives restart), it has an offset (compaction can reason about it), and it has a shape that get() recognises (a sentinel byte or string that no valid value can ever equal).
Encoding a tombstone
You need a byte pattern in the record's payload that cannot be mistaken for a real value. Two approaches are common.
- A sentinel string. Write key=\x00__TOMB__\x00\n. Readers compare the value bytes against the sentinel and return None if they match. Cheap and text-framing-friendly. Relies on the sentinel never appearing in genuine data, which is safe if you escape or restrict user input but can bite you in ways you did not expect.
- A type byte in the record header. Prefix every record with one byte: 0x00 for a normal put, 0x01 for a tombstone. Readers dispatch on the byte; they never look at the value for a tombstone record. Robust, language-agnostic, and sets you up for more record types later (range deletes, expiring keys, transactional markers).
Mature log-structured engines mostly take the structural route: RocksDB and Cassandra carry an explicit record type, and Kafka's log-compacted topics mark deletion at the framing level with a null value rather than a payload sentinel. (Bitcask itself is a notable exception — it reserves a sentinel value as its tombstone.) For the teaching code we will do option 2 — the cost is one byte per record, paid by everyone, and it cleanly separates "what kind of record is this" from "what is its payload."
Our existing segment records have the text shape key=value\n. Extending them to carry a type byte keeps the same framing readable by hand:
normal put: \x00 name=dipti\n
tombstone: \x01 age=\n
The leading byte is the record type. For a tombstone we leave the value empty — there is no meaningful payload — but the key= prefix remains, because every record must carry its key (compaction and recovery scan records and look up keys without any external hint).
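As a standalone sketch of this framing (the helper names encode_put, encode_tombstone, and decode are mine, not part of the class), the two record shapes round-trip like this:

```python
RECORD_PUT = b"\x00"   # type byte: normal put
RECORD_TOMB = b"\x01"  # type byte: tombstone

def encode_put(key, value):
    # \x00 key=value\n
    return RECORD_PUT + f"{key}={value}\n".encode("utf-8")

def encode_tombstone(key):
    # \x01 key=\n  -- empty payload, but the key stays for scanning
    return RECORD_TOMB + f"{key}=\n".encode("utf-8")

def decode(record):
    """Return (key, value) for a put, (key, None) for a tombstone."""
    type_byte, line = record[:1], record[1:]
    key, _, value = line.rstrip(b"\n").partition(b"=")
    if type_byte == RECORD_TOMB:
        return key.decode("utf-8"), None
    return key.decode("utf-8"), value.decode("utf-8")
```

Note that decode never inspects the payload of a tombstone — the type byte alone decides, which is the whole point of option 2.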
Writing it — delete() on BitcaskSegmented
We extend the class from the previous chapters. Two small surgeries.
# bitcask_tombstone.py — adds delete() to the compact-aware BitcaskSegmented

RECORD_PUT = b"\x00"
RECORD_TOMB = b"\x01"

class BitcaskSegmented:
    # ... __init__, _rotate_if_needed, _rebuild_index, compact() from earlier ...

    def put(self, key, value):
        record = RECORD_PUT + f"{key}={value}\n".encode("utf-8")
        offset = self._active.tell()
        self._active.write(record)
        self._active.flush()
        self._index[key] = (self._active_id, offset)
        self._rotate_if_needed()

    def delete(self, key):
        """Append a tombstone record. The key is now logically absent.

        The index still tracks the tombstone's location so compaction can
        reason about it — you only *reclaim* a tombstone once every older
        record for the same key has also been compacted away.
        """
        record = RECORD_TOMB + f"{key}=\n".encode("utf-8")
        offset = self._active.tell()
        self._active.write(record)
        self._active.flush()
        self._index[key] = (self._active_id, offset)  # <-- point at the tombstone
        self._rotate_if_needed()

    def get(self, key):
        entry = self._index.get(key)
        if entry is None:
            return None
        seg_id, offset = entry
        f = self._handles[seg_id]
        f.seek(offset)
        type_byte = f.read(1)  # <-- dispatch on type
        line = f.readline()
        if type_byte == RECORD_TOMB:
            return None  # the key was deleted
        _, _, value = line.rstrip(b"\n").partition(b"=")
        return value.decode("utf-8")
The whole change is about fifteen lines. delete() is structurally identical to put() — append a record, update the index to point at it. The difference is entirely in the type byte and in what get() does when it reads it.
Why the tombstone lives in the index: if we dropped the index entry instead of pointing it at the tombstone, we would have to make get() scan the log for deleted keys. That is exactly the O(n) cost we added the index to avoid. Keeping the tombstone in the index — treating it as "just another record, whose value happens to mean absence" — preserves O(1) reads for deleted keys too. A get() on a deleted key is one dict lookup, one seek, one byte compare. Fast and correct.
_rebuild_index() — the startup scan that walks every segment and populates the index — also needs to know about type bytes. The change is two lines:
    def _scan_segment(self, seg_id, path):
        offset = 0
        with open(path, "rb") as f:
            while True:
                type_byte = f.read(1)
                if not type_byte:
                    break
                line = f.readline()
                if not line:
                    break
                key_bytes, eq, _ = line.partition(b"=")
                if eq and key_bytes:
                    # tombstones overwrite puts, last-write-wins — same as any record
                    self._index[key_bytes.decode("utf-8")] = (seg_id, offset)
                offset += 1 + len(line)  # +1 for the type byte
Two things are worth staring at here. First, on startup we keep the tombstone's location in the index, same as any other record. On the very next get() for that key the type byte will resolve to "deleted" and the user gets None. Second, we do not do del self._index[key] on seeing a tombstone. Tempting — after all, the key is logically gone — but wrong. If we removed the entry, we would lose track of the tombstone, and compaction might later delete the tombstone record without realising that an even older age=15 record is still sitting in a deeper segment. On the next restart, _scan_segment() would walk that deep segment, find age=15, and happily add it back to the index. The dead rises. The tombstone has to stay in the index for as long as any older record for the same key could exist on disk.
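To see the restart behaviour end to end, here is a self-contained miniature — standalone scan and read functions that mirror _scan_segment() and get() over a single segment file, assuming the framing above:

```python
import os
import tempfile

RECORD_PUT, RECORD_TOMB = b"\x00", b"\x01"

def scan(path):
    """Miniature _rebuild_index: walk one segment; the last record per key wins."""
    index = {}
    offset = 0
    with open(path, "rb") as f:
        while True:
            type_byte = f.read(1)
            if not type_byte:
                break
            line = f.readline()
            key, _, _ = line.partition(b"=")
            # a tombstone overwrites the put's entry, same as any record
            index[key.decode("utf-8")] = offset
            offset += 1 + len(line)
    return index

def read(path, index, key):
    """Miniature get(): one dict lookup, one seek, one type-byte dispatch."""
    if key not in index:
        return None
    with open(path, "rb") as f:
        f.seek(index[key])
        type_byte = f.read(1)
        line = f.readline()
    if type_byte == RECORD_TOMB:
        return None  # deleted key reads as absent
    return line.rstrip(b"\n").partition(b"=")[2].decode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "seg-0")
with open(path, "wb") as f:
    f.write(RECORD_PUT + b"alice=35\n")
    f.write(RECORD_PUT + b"alice=36\n")
    f.write(RECORD_TOMB + b"alice=\n")  # the delete, durable on disk

index = scan(path)                 # simulated process restart
print(read(path, index, "alice"))  # None -- the delete stuck
```

The index after scan() points at the tombstone (offset 20, after two 10-byte puts), not at either put — exactly the invariant the paragraph above insists on.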
A concrete put-delete-get-compact trace
Walking through a full delete cycle
Start with a fresh store. Shrink SEGMENT_SIZE to 32 bytes so rotations and compactions fire quickly. Then run:
db = BitcaskSegmented("demo")
db.put("alice", "35") # record in seg-0 at offset 0
db.put("bob", "42") # record in seg-0
db.put("alice", "36") # alice overwritten — still in seg-0
db.delete("alice") # tombstone written — rotation likely fires next
db.put("carol", "28") # probably goes into seg-1
# state so far
db._index == {
    "alice": (0, 28),  # points at the TOMBSTONE, not at alice=36
    "bob":   (0, 10),
    "carol": (1, 0),
}
print(db.get("alice"))  # None (reads the tombstone)
print(db.get("bob"))    # "42" (reads the put)
print(db.get("carol"))  # "28"
print(db.get("dave"))   # None (never written)
seg-0 on disk looks like this (each record preceded by its 1-byte type; the offsets count that byte):
offset 0    \x00 alice=35\n   <- dead (overwritten)
offset 10   \x00 bob=42\n     <- live
offset 18   \x00 alice=36\n   <- dead (superseded by tombstone)
offset 28   \x01 alice=\n     <- TOMBSTONE, live (index points here)
Now run db.compact(). Compaction walks seg-0 record by record. For each record it does the usual liveness test: does the index for this key point at this exact (segment, offset)?
seg-0 @ 0    key="alice"  index["alice"]=(0,28) ≠ (0,0)    -> skip
seg-0 @ 10   key="bob"    index["bob"]  =(0,10) == (0,10)  -> COPY (put)
seg-0 @ 18   key="alice"  index["alice"]=(0,28) ≠ (0,18)   -> skip
seg-0 @ 28   key="alice"  index["alice"]=(0,28) == (0,28)  -> COPY (tombstone)
Both the live put for bob and the live tombstone for alice survive. The two dead alice=... records do not. The compacted output segment seg-2 contains:
offset 0 \x00 bob=42\n
offset 8 \x01 alice=\n
The tombstone is still there. It has to be: if compaction dropped it now, and the log were replayed or another replica shipped us its older alice=35 from its segment, alice would rise from the grave. The tombstone is the proof of absence that the rest of the system relies on.
What about dropping the tombstone later? That is exactly the question the next section answers.
The "when can we drop a tombstone" problem
Compaction's liveness test says a record is live if the index points at it. A tombstone satisfies that test — the index does point at it — so compaction preserves it. Fine. The question is: once the tombstone is the only record for its key left anywhere on disk, can we finally drop it?
In our single-node BitcaskSegmented, with no replication and no backups, the answer is yes, the moment compaction has finished sweeping every segment older than the tombstone. After that sweep, there is no alice=... record anywhere in the store. The index entry can be removed and the tombstone's bytes reclaimed.
The tricky word in that paragraph is older. Tombstones are positioned in the log at a specific point. Any record that appeared earlier in the log is older than the tombstone; any record that appeared later is newer. Compaction must process segments in a way that respects this ordering, and the "when can I drop this tombstone" predicate is: every segment that existed when the tombstone was written has been compacted, with the tombstone's key not appearing in any of them.
The easy case: the tombstone landed in seg-N, and by the time the compactor gets around to reconsidering it, every earlier segment seg-0...seg-N-1 has been compacted away, and no compacted output carries an alice=... put. The tombstone can be dropped on the next compaction pass.
The hard case is the one that has bitten real systems. Consider:
- alice=35 is written into seg-0. The bytes hit disk and are fsync-ed.
- delete(alice) is written into seg-1 — a tombstone.
- The compactor runs on seg-1's era first (perhaps because segments are compacted out of order under some policies, perhaps because seg-0 is cold-tiered and wasn't visible to the compactor, perhaps because of a bug). It sees the tombstone, notes "no put for alice exists in the segments I just scanned," decides the tombstone is harmless and drops it.
- Later, seg-0 is compacted. alice=35 is the live record for alice (nothing overwrites it — the tombstone is gone). Compaction happily copies alice=35 into a new segment.
- Now the store thinks alice exists with value 35. The delete silently unhappened.
That is the zombie-record bug, and a close cousin of it is why Cassandra has a parameter called gc_grace_seconds that production operators talk about with real feeling.
Tombstone TTLs and the zombie-record bug
The fix, in one sentence: do not drop a tombstone until you can prove no older record for its key could possibly exist anywhere in the system.
In a single-node store the proof is easy — you own every segment and you can check. In a distributed store with replicas, write buffers, snapshots, and asynchronous repair, the proof is hard, and the systems that got this wrong have paid for it. The pragmatic strategy used by Cassandra and its cousins is simply: keep every tombstone for at least as long as any realistic repair or recovery window. That window is the tombstone TTL — a wall-clock duration, not a log-position argument.
# sketch: "drop this tombstone only if it is older than TTL and all earlier
# segments have been compacted with no puts for this key"

def _tombstone_safe_to_drop(self, seg_id, offset, age_seconds):
    if age_seconds < TOMBSTONE_TTL:
        return False
    # every segment older than this one has been compacted through at least once
    older = [s for s in self._closed_ids if s < seg_id]
    return all(s in self._compacted_through for s in older)
TOMBSTONE_TTL is a knob. In single-node Bitcask-family code, it can be zero — the moment all older segments are compacted, the tombstone is redundant. In Cassandra the default is ten days (gc_grace_seconds = 864000), chosen to be much longer than the longest plausible gap between replica repairs. LevelDB, being single-node, uses the log-position argument and drops tombstones as soon as compaction into the deepest level has seen them.
Why time, not just log position, in a distributed store: a replica that has been offline for five minutes does not know anything happened while it was away. When it rejoins, it will exchange data with its peers; if one of them has a pre-delete version of alice and the tombstone has already been collected, the pre-delete version wins (there is nothing on the peer to say it was deleted). The TTL buys time for anti-entropy repair to propagate the tombstone to every replica before any one of them is allowed to forget it. This is a property of the global system, not of any one node's log, which is why it has to be wall-clock time.
The zombie bug is not purely theoretical. It has shown up in Cassandra clusters where an operator set gc_grace_seconds = 0 to reclaim space faster and a late-rejoining replica then resurrected deleted rows; in Riak deployments where a cold-tiered segment was restored from a backup that predated a compacted-away tombstone; in Dynamo-family write paths where a retried put landed after a tombstone in one replica but before an already-collected tombstone in another, leaving the two replicas unable to agree the key was gone. Every one of them is "the tombstone was dropped too eagerly." The fix, universally, is to keep tombstones around longer than you think you need them — at the cost of some disk space.
gc_grace_seconds would have kept A's tombstone around until after repair.

What else can be a tombstone
Once you have the "record whose type means absence" idea, you can extend it. A tombstone is not a single thing; it is a category of markers. Three variants show up in production systems.
Range tombstones. Instead of marking one key as deleted, a record marks an entire key range — "delete everything from users:1000 to users:1999 as of this time." get() on any key in the range returns None until a later put for that key overrides the range tombstone. This is how Cassandra implements DELETE FROM users WHERE id >= 1000 AND id < 2000 without needing to write a million individual tombstones. RocksDB calls them "range deletions" and added them in 2016 specifically to make prefix deletes efficient.
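A toy model of the read-side check, assuming in-memory state and sequence numbers standing in for write timestamps (the names here are illustrative, not Cassandra's or RocksDB's API):

```python
# Each write gets a monotonically increasing sequence number. A range
# tombstone hides any put in [start, end) with a SMALLER seqno; a later
# put for a key in the range overrides it.

puts = {}         # key -> (seqno, value)
range_tombs = []  # list of (start, end, seqno)

def put(key, value, seqno):
    puts[key] = (seqno, value)

def delete_range(start, end, seqno):
    range_tombs.append((start, end, seqno))  # one record, many keys hidden

def get(key):
    if key not in puts:
        return None
    seqno, value = puts[key]
    for start, end, tomb_seqno in range_tombs:
        if start <= key < end and tomb_seqno > seqno:
            return None  # shadowed by a newer range tombstone
    return value

put("users:1200", "old", seqno=1)
put("users:1500", "old", seqno=1)
delete_range("users:1000", "users:2000", seqno=2)
put("users:1500", "new", seqno=3)  # written after the delete: wins
print(get("users:1200"))  # None -- hidden by the range tombstone
print(get("users:1500"))  # new  -- the later put overrides it
```

Real engines index the range endpoints so a read checks O(log n) tombstones rather than scanning a list, but the shadowing rule is the same.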
Expiring keys (TTLs). A record is written with a wall-clock expiry timestamp in its header. After the timestamp passes, get() treats the record as if it were a tombstone — returns None — and compaction drops the record (and any older puts for its key) the next time it runs. No explicit delete() call is ever made; the key "tombstones itself" when its time is up. Redis's EXPIRE, Cassandra's USING TTL 3600, and Memcached's default behaviour all work this way. It is the same mechanism as a tombstone with the wall clock as trigger rather than an API call.
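A minimal sketch of the "tombstones itself" behaviour, assuming the expiry timestamp is stored alongside the value rather than in a record header (class and parameter names are mine):

```python
import time

class TTLStore:
    """Toy store in which a record past its expiry reads as deleted."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def put(self, key, value, ttl_seconds=None):
        expires = time.time() + ttl_seconds if ttl_seconds else None
        self._data[key] = (value, expires)

    def get(self, key, now=None):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        now = time.time() if now is None else now
        if expires is not None and now >= expires:
            return None  # expired: behaves exactly like a tombstone
        return value

db = TTLStore()
db.put("session:9", "token", ttl_seconds=3600)
print(db.get("session:9"))                          # "token" while fresh
print(db.get("session:9", now=time.time() + 7200))  # None after expiry
```

The now parameter is there only to make expiry testable without sleeping; a log-structured version would additionally let compaction drop expired records and their older puts, as the paragraph above describes.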
Partition / row / cell tombstones. In wide-row stores like Cassandra, "the key" is really a three-level address: partition key, clustering key (sort key inside the partition), column name. A delete can target any of these levels — delete one cell in one row, delete an entire row, delete the entire partition. Each level has its own tombstone record type, and compaction reasons about them at the right granularity. A single partition tombstone can logically delete millions of rows without a million record writes.
The underlying mechanism is always the same: an append-only write of a marker record, whose type byte tells the reader "this is not a value, this is an erasure." The only thing that changes across variants is the scope the marker applies to.
Common confusions
- "Why can't we just drop the tombstone immediately after the last put for its key is compacted?" Because "the last put" is a global property, and any one compactor run sees only a subset of segments. A segment you haven't scanned yet — maybe it is on another disk, maybe it came back from a backup, maybe it is a cold-tiered segment that paged in last week — can still hold a put. Until you have proof no such segment exists, the tombstone must stay. In a single-node store proof is cheap (walk every segment); in a distributed store proof is impossible without a time bound, which is why TTLs exist.
- "Doesn't keeping all those tombstones bloat the store?" Yes, which is the known tax on this design. A workload that deletes and re-inserts frequently — e.g. a queue modelled on a KV store — can reach a steady state where a large fraction of the log is tombstones. Cassandra has a dedicated warning (tombstone_warn_threshold) that fires when a read scans more than 1,000 tombstones, because that many dead markers on a read path is usually a symptom of a modelling mistake. The fix is usually not to shorten gc_grace_seconds but to change the data model so the hot path does not involve a lot of deletes — for example, bucketing by date instead of deleting old rows.
- "Why not just have a null value instead of a tombstone type byte?" You could — and some systems do — but then you cannot tell "the user wrote the literal bytes for null" from "the user deleted this key." If your store only carries opaque byte values it is fine; if you want any richer type system, you want deletion to be its own kind of record, not a sentinel value.
- "What if a put for the same key arrives after the delete, in the same segment?" It is just a record. _scan_segment() processes records in order, so the later put overwrites the tombstone in the index; get() then returns the put's value. That is the correct semantics: put; delete; put leaves the key live. The tombstone itself is still on disk, now dead-lettered by the later put; it becomes eligible for compaction along with any other superseded record.
- "Is the process of writing a tombstone atomic?" Yes, to the extent that any single append is atomic — the tombstone record is either fully in the log (and fsync-ed, on the next flush) or not. If the process crashes mid-write, the trailing bytes are torn and the scanner on recovery will skip them, leaving the old puts in the index. The delete, in that case, silently did not happen; the user is expected to retry. This is the same guarantee as any other append — delete and put have identical durability semantics because they are the same mechanism under the hood.
- "Can two deletes collide?" They just both write tombstones. The second one updates the index to point at itself; the first one is superseded and becomes dead. Compaction will drop the earlier tombstone the same way it drops any overwritten record.
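The put-delete-put point above can be checked with a ten-line replay, assuming in-order processing of a single log:

```python
# Replay a log in order; the last record per key wins, tombstone or not.
log = [
    ("put", "age", "15"),
    ("delete", "age", None),
    ("put", "age", "16"),  # a put AFTER the delete revives the key
]

index = {}
for position, (kind, key, value) in enumerate(log):
    index[key] = (kind, position, value)  # later records overwrite earlier

kind, _, value = index["age"]
result = None if kind == "delete" else value
print(result)  # 16
```

Drop the final put from the log and the same replay yields None — the tombstone is the latest record and wins instead.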
Going deeper
Everything above is enough to build a correct delete() in Bitcask. The sections that follow put the idea in its broader context — Cassandra's four kinds of tombstones, real-world incidents, and why gc_grace_seconds is not a knob you want to lower.
Tombstones in Cassandra — four levels
Cassandra's data model is hierarchical: a partition contains many rows, each row has many cells (columns). A DELETE statement can target any level, and Cassandra records a different tombstone for each.
- Cell tombstone. The smallest unit — one column in one row is deleted. The row itself still exists; other columns are still live. DELETE value FROM users WHERE id = 42.
- Row tombstone. A whole row is deleted. DELETE FROM users WHERE id = 42. Equivalent to a cell tombstone for every known column plus a marker that says "do not materialise this row even if a later insert adds columns to it" — at least, until the tombstone is GC'd.
- Range tombstone. A contiguous slice of clustering keys within a partition is deleted. DELETE FROM timeline WHERE user_id = 7 AND ts >= '2026-01-01' AND ts < '2026-02-01'. Recorded as a pair of endpoints plus a timestamp; during reads, any row in that range is suppressed. One record, potentially millions of rows hidden.
- Partition tombstone. The whole partition is deleted. DELETE FROM users WHERE id = 42 where id is the partition key. All rows in the partition are suppressed, including any that might be replicated in later. The shadow lasts until gc_grace_seconds has elapsed everywhere.
On a read, Cassandra has to merge data from multiple SSTables and honour every tombstone that covers any piece of the requested key. A read that hits a pathological volume of tombstones — the famous case is a queue where rows are inserted and deleted continuously — can take thousands of times longer than a normal read, to the point of timing out. The tombstone is an anti-entry, and anti-entries on the hot read path cost just as much as entries.
The gc_grace_seconds disaster class
Cassandra's tombstone TTL is gc_grace_seconds, default 864000 seconds = 10 days. Operators sometimes lower it under disk pressure from accumulated tombstones without realising the repair-window invariant it protects. Three failure modes dominate the post-mortems: a replica offline longer than gc_grace rejoins with stale data and anti-entropy repair copies its pre-delete values back to peers that have already GC'd the tombstones; a backup restored that predates a tombstone's GC reintroduces the deleted values; a partition that becomes hot unexpectedly produces tombstone pressure, operators "fix" it by lowering gc_grace, and the first two bugs fire.
The right response to tombstone pressure is almost always to model data differently — shorter-lived partitions, explicit buckets, no hot-delete patterns — rather than to weaken the tombstone contract. "Never lower gc_grace_seconds" is, in many Cassandra shops, a rule on par with "never rm -rf the data directory."
Kafka, WALs, and the same pattern everywhere
Kafka's log-compacted topics use the same mechanism with different vocabulary: a record with a null value is a tombstone, and log compaction preserves the latest record per key while dropping earlier ones. Kafka calls its gc_grace_seconds analogue delete.retention.ms (default 24 hours). Setting it too low produces the same "deletes didn't stick" class of stories, in a streaming context rather than a storage one.
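That sweep can be modelled in a few lines (a toy, not the Kafka API — just the last-writer-wins rule with None standing in for a null-value record):

```python
def compact(records, retain_tombstones=True):
    """Keep only the latest record per key, like a log-compacted topic.

    A None value is a tombstone. retain_tombstones=False models the
    tombstone itself expiring (Kafka's delete.retention.ms elapsing).
    """
    latest = {}
    for key, value in records:  # oldest to newest
        latest[key] = value     # later records overwrite earlier ones
    compacted = list(latest.items())
    if not retain_tombstones:
        compacted = [(k, v) for k, v in compacted if v is not None]
    return compacted

topic = [("alice", "35"), ("bob", "42"), ("alice", None)]  # None = tombstone
print(compact(topic))                           # tombstone kept for consumers
print(compact(topic, retain_tombstones=False))  # tombstone aged out
```

A consumer that reads the first output learns alice was deleted; a consumer that only sees the second output cannot — which is exactly the "deletes didn't stick" failure mode when the retention window is set shorter than the slowest consumer's lag.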
The pattern lifts one more level, to write-ahead logs for B-trees and row stores. A transaction that deletes a row writes a DELETE record into the WAL with the primary key; crash recovery replays it against the heap. Drop that WAL record before the heap has been checkpointed past it and you have a parallel bug. Postgres, MySQL, and SQL Server all have WAL-retention policies that are morally identical to gc_grace_seconds. Every system with "append-only log + periodic garbage collection" has a tombstone-TTL knob, whether it calls it that or not. The design space is small; the landmines are shared.
Where this leads next
Your Bitcask-family store now handles the three operations it needs: put, get, delete. Writes are O(1), reads are O(1), deletes are O(1), and compaction slowly reclaims dead bytes (including dead tombstones) in the background. The disk can grow or shrink. Restart reconstructs the index correctly, tombstones and all.
One last problem remains: restart is slow. On a 10 GB store, _rebuild_index() has to read every segment and re-derive the index. That takes minutes. Production Bitcask solves it with hint files — a small companion file per closed segment that stores the keydir entries directly, so startup is reading a megabyte of hints per segment instead of scanning gigabytes of data. That is the next chapter.
- Hint files: rebuilding the index in milliseconds — write a compact index snapshot to a companion file when a segment is closed; on startup, read the hint files instead of scanning the segments.
After that, Build 2 closes with chapter 11, which names the two walls Bitcask cannot climb — working sets larger than RAM, and range queries — and motivates Build 3's introduction of LSM-trees.
References
- Justin Sheehy and David Smith, Bitcask: A Log-Structured Hash Table for Fast Key/Value Data (Basho Technologies, 2010) — §3.3 covers the tombstone record format and the merge-time reaping rules. riak.com/assets/bitcask-intro.pdf.
- Apache Cassandra documentation, About deletes and tombstones — the canonical explanation of gc_grace_seconds, the four tombstone levels, and the read-path cost of tombstone-heavy partitions. cassandra.apache.org/doc/latest/cassandra/operating/compaction/index.html.
- Peter Bailis et al., Feral Concurrency Control (SIGMOD 2015) — background on why eventually-consistent deletes need explicit TTL semantics, not just log ordering. arxiv.org/abs/1503.00107.
- Siying Dong, Range Deletes in RocksDB (Facebook Engineering, 2016) — the design and implementation of range tombstones in RocksDB, with the motivation and edge cases. rocksdb.org/blog/2018/11/21/delete-range.html.
- Jay Kreps et al., Kafka: a Distributed Messaging System for Log Processing (NetDB 2011) — log compaction, null-value tombstones, and delete.retention.ms. notes.stephenholiday.com/Kafka.pdf.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 — the "tombstones and tombstoning" discussion that ties Bitcask, LSM-trees, and Cassandra's model together. dataintensive.net.
delete.retention.ms. notes.stephenholiday.com/Kafka.pdf. - Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 — the "tombstones and tombstoning" discussion that ties Bitcask, LSM-trees, and Cassandra's model together. dataintensive.net.