In short

The previous chapter gave you a working read path — check the memtable, then every SSTable from newest to oldest, return the first hit. It works, but it has one nasty cost: if the key you want is not in the store, or if it is only in the newest memtable, the reader still opens every SSTable, seeks to its index, and probes. For a store with a hundred SSTables, that is a hundred seeks per miss. On spinning disk, that is a second. On SSD it is ten milliseconds. Both are orders of magnitude more than one hash lookup, and almost all of that work is wasted — the key was never in those files. The fix is a Bloom filter: a tiny in-RAM bitmap, about ten bits per key, that sits in front of each SSTable and answers one question — "could this key be in the table?" — with two possible replies: definitely not (skip the seek) or maybe (do the seek). False positives are allowed (you do an unnecessary seek occasionally); false negatives are not (you never miss a key that is actually there). This chapter implements the filter in forty lines of Python with k hash functions and m bits, derives the false-positive rate formula (1 - e^{-kn/m})^k and uses it to pick optimal k and m for a target 1% FPR (~10 bits per key, 7 hash functions), walks through a worked example with a 100-bit filter holding 10 keys, and wires a Bloom filter per SSTable into the read path so a typical miss is now one RAM probe per filter instead of one disk seek per SSTable. This single widget is the biggest single-line speedup in the LSM read path, and every production engine — LevelDB, RocksDB, Cassandra, HBase — uses it.

You have a read path. It is short and it is correct: check the memtable, then each SSTable from newest to oldest, stop at the first hit. The previous chapter built it in about fifteen lines of Python.

It is also, under one very common workload, dreadful.

Consider what happens when an app calls store.get("user:42:email") and the key does not exist. Maybe the user was never created. Maybe the caller is checking "does this key exist yet?" before an insert. Whatever the reason, the read path has no way to know the key is absent without looking — so it checks the memtable (miss), then opens SSTable 1, binary-searches its sparse index, seeks to the right block, scans it (miss), then SSTable 2 (miss), then SSTable 3... every one of them a disk seek that was guaranteed to fail.

If a store has 100 SSTables on disk, a single missing-key lookup costs 100 seeks. On a decent NVMe SSD each seek is about 100 microseconds, so that is 10 milliseconds per miss. On a rotating disk each seek is about 10 milliseconds, so that is a full second per miss. For a database that is supposed to serve tens of thousands of requests per second, this is catastrophic.

The annoying part is that most of those SSTables obviously do not contain the key. If SSTable 4 only holds keys starting with session: and you asked for user:42:email, opening and probing it is pure waste. You just need a very fast way to ask each SSTable "do you contain this key?" — cheap enough to do for every SSTable on every get(), and conservative enough that if it says "no" you can believe it.

That widget is a Bloom filter. It is a probabilistic set membership test that uses about ten bits per key, lives in RAM, and answers in two hash computations' worth of time. It is the single biggest optimisation in the LSM read path, and it is small enough that you will have it built and integrated by the end of this chapter.

The idea

A Bloom filter is a fixed-size bit array of length m and a fixed set of k hash functions that each map a key to a bit position in [0, m). To insert a key, compute its k positions and set those bits to 1. To query, compute the same k positions and inspect them: if any bit is 0, the key was definitely never inserted; if all k bits are 1, the key is maybe present — either it was inserted, or other keys happened to set those same bits.

The asymmetry is the whole point. A "definitely not" is free — no disk I/O, no further work. A "maybe" is exactly as expensive as the old read path. The Bloom filter only has to say "definitely not" often enough, for most missing keys, to save the bulk of the wasted seeks.

[Figure: a Bloom filter with m = 20 bits and k = 3 hash functions. Keys x, y, and z are inserted, each setting three bits. A query for q hits bits 1, 2, and 6; bit 2 is 0, so q is definitely not in the set. A query for r hits bits 10, 13, and 17, all 1, so r is maybe in the set — possibly a false positive.]
A Bloom filter of `m = 20` bits with `k = 3` hash functions. Inserts set three bits per key. A query for "q" hits one bit that is 0, which is proof that "q" was never inserted — no seek needed. A query for "r" hits three bits that are all 1, so "r" might be in the set — the caller must now actually look at the SSTable to be sure. If "r" is not in fact in the set, this is a false positive; the math below tells you how often that happens.

No false negatives. That is the property that makes a Bloom filter safe to gate a real lookup: if it says "no", the read path can skip the SSTable with full confidence. If it says "maybe", the read path does what it would have done anyway — read the SSTable, find the key (or not). You never lose data to a Bloom filter.

The code

A Bloom filter is one of those data structures where the code is shorter than the explanation. Here are the forty lines:

# bloom.py — a Bloom filter with k hash functions and m bits
import hashlib, math

class BloomFilter:
    def __init__(self, m_bits: int, k_hashes: int):
        """
        m_bits  : size of the bit array (pick ~10 * expected_n for 1% FPR)
        k_hashes: number of hash functions (pick ~0.693 * (m/n), rounded)
        """
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)      # one bit per position, 8 per byte

    def _positions(self, key: str):
        """Yield k bit positions for this key, in [0, m)."""
        kb = key.encode("utf-8")
        # Double-hashing trick: one MD5, split into two 64-bit halves,
        # combine as h_i(x) = (a + i*b) mod m for i in 0..k-1.
        h = hashlib.md5(kb).digest()
        a = int.from_bytes(h[:8], "big")
        b = int.from_bytes(h[8:], "big")
        for i in range(self.k):
            yield (a + i * b) % self.m

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)        # set bit p

    def might_contain(self, key: str) -> bool:
        for p in self._positions(key):
            if not (self.bits[p >> 3] & (1 << (p & 7))):
                return False                          # definitely not
        return True                                   # maybe

    @staticmethod
    def optimal_params(n: int, target_fpr: float) -> tuple[int, int]:
        """Pick (m, k) that minimise memory at the given FPR."""
        m = math.ceil(-n * math.log(target_fpr) / (math.log(2) ** 2))
        k = max(1, round((m / n) * math.log(2)))
        return m, k

Forty lines, counting the parameter helper. Five pieces deserve a look.

Why pack bits into a bytearray instead of using a list of booleans: a Python list of True/False stores an 8-byte pointer per entry on 64-bit CPython (every pointer aims at one of the two shared bool singletons). A bytearray uses one bit per entry after the >> 3 and & 7 packing — 64× denser. For a filter holding a million keys with 10 bits each, that is 1.25 MB versus 80 MB. Bloom filters are supposed to be tiny; using a list defeats the purpose.

Why the double-hashing trick: you need k independent hash functions, but computing k different cryptographic hashes per key is wasteful. Kirsch and Mitzenmacher showed in 2006 that h_i(x) = (h_a(x) + i \cdot h_b(x)) \bmod m behaves statistically like k independent hashes, as long as h_a and h_b are themselves good hashes. [1] MD5 gives you 128 bits in one call; split into two 64-bit halves and you have h_a and h_b for free. RocksDB uses this exact pattern; Cassandra uses MurmurHash3's 128-bit output the same way.

Why MD5 and not SHA-256: MD5 is cryptographically broken, but Bloom filters do not need cryptographic hashes — they need uniform hashes. MD5 distributes bits uniformly at a fraction of the cost of SHA-256. Any non-cryptographic hash with good avalanche (MurmurHash3, xxHash, FarmHash) is even better; MD5 is just the one in the Python standard library with a large enough output to split in two.

Why optimal_params returns two numbers, not one: a Bloom filter has two independent knobs. m controls the raw memory. k controls the trade-off between how many bits each insert sets (higher k means more bits set per key, so the array fills up faster) and how strict the query is (higher k means more bits to check, so a false positive needs more accidental 1s). For a given memory budget m/n, there is an optimal k — derived below — that minimises the false-positive rate. Picking k badly can double the FPR at the same memory.

Why the method is named might_contain, not contains: the API has to loudly tell the caller "this is not a set, do not trust a True". Python's in operator would invite mistakes (if key in bloom: would look authoritative). might_contain reads as "this might be a yes, go verify" — which is exactly the contract.
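That contract is easy to exercise end to end. A short harness — with a condensed restatement of the class so the snippet runs on its own — inserts a thousand keys and checks that every single one comes back as a "maybe":

```python
import hashlib

class BloomFilter:
    """Condensed copy of the class above, so this snippet runs standalone."""
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key):
        h = hashlib.md5(key.encode("utf-8")).digest()
        a = int.from_bytes(h[:8], "big")
        b = int.from_bytes(h[8:], "big")
        return [(a + i * b) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, key):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(key))

# 1,000 keys into a 10,000-bit filter: ten bits per key, k = 7.
bf = BloomFilter(m_bits=10_000, k_hashes=7)
keys = [f"user:{i}:email" for i in range(1_000)]
for key in keys:
    bf.add(key)

# The contract: every inserted key answers "maybe" — no false negatives.
assert all(bf.might_contain(key) for key in keys)
```

Probing keys that were never inserted returns False almost always, with the occasional false positive — at ten bits per key and k = 7, roughly one probe in 120.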

The math

The false-positive rate of a Bloom filter is one of the few formulas in systems design that is both short and actually useful. The derivation is three steps.

Step 1. The probability that a specific bit is still 0 after all inserts.

A single insert picks k bits uniformly at random out of m (to a good approximation — the hashes are uniform). The probability that a specific bit is not set by that one insert is (1 - 1/m)^k — each of the k hashes misses the bit with probability (1 - 1/m).

After n inserts (each with its own k hashes), the probability that a specific bit is still 0 is:

P(\text{bit still } 0) = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}

The approximation uses (1 - 1/m)^m \approx e^{-1} for large m, which is accurate to within a fraction of a percent for any filter you would build in practice.

Step 2. The probability that a specific bit is 1.

Just the complement:

P(\text{bit is } 1) \approx 1 - e^{-kn/m}

Step 3. The probability of a false positive.

A false positive on query q means: q was never inserted, but all k of its hashed positions happen to be 1 anyway. Assuming the bit-set events are independent (a small lie, but close enough for these sizes), this is just the k-th power of Step 2:

\boxed{\text{FPR}(k, n, m) \approx \left(1 - e^{-kn/m}\right)^k}

This is the formula that gets quoted in every textbook, and it is the formula that lets you pick m and k with confidence.
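It is also easy to sanity-check by simulation. The sketch below models the hashes as ideal uniform random positions (the sizes m = 10,000, n = 1,000, k = 7 are illustrative) and compares the measured rate against the formula:

```python
import math, random

random.seed(42)
m, n, k = 10_000, 1_000, 7

# Idealised filter: each of n inserts sets k uniformly random positions.
bits = set()
for _ in range(n):
    for _ in range(k):
        bits.add(random.randrange(m))

# Probe keys that were never inserted; a false positive needs all k
# of its positions to be set.
trials = 20_000
false_pos = sum(
    all(random.randrange(m) in bits for _ in range(k))
    for _ in range(trials)
)
measured = false_pos / trials
predicted = (1 - math.exp(-k * n / m)) ** k
print(f"measured {measured:.4f}, predicted {predicted:.4f}")
```

At these sizes the prediction is about 0.0082, and the measured rate lands within a fraction of a percentage point of it.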

Tuning k for a given m and n. For a fixed memory budget (fixed m/n), the FPR is minimised by picking:

k^* = \frac{m}{n} \ln 2 \approx 0.693 \cdot \frac{m}{n}

Plug k^* back in and the optimal FPR becomes:

\text{FPR}^* \approx (0.5)^{k^*} = (0.6185)^{m/n}

So ten bits per key — m/n = 10 — gives k^* \approx 7 and FPR \approx 0.82\%. That is the classic "Bloom filter: 1% miss rate at ten bits per key" number every LSM engine quotes.

Sizing m for a target FPR. Flipping the formula to solve for m given n and a target \varepsilon:

m = -\frac{n \ln \varepsilon}{(\ln 2)^2}

For \varepsilon = 0.01 (1% FPR), this gives m \approx 9.585 \cdot n — just under ten bits per key. For \varepsilon = 0.001 (0.1%), it climbs to m \approx 14.38 \cdot n. For \varepsilon = 0.0001 (one in ten thousand), m \approx 19.17 \cdot n. Every factor-of-10 improvement in FPR costs roughly five extra bits per key — a beautifully gentle curve.
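Feeding those three targets through the optimal_params helper from the code section (its two formulas repeated here so this runs standalone) reproduces the numbers:

```python
import math

def optimal_params(n, target_fpr):
    # Same two formulas as BloomFilter.optimal_params above.
    m = math.ceil(-n * math.log(target_fpr) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

n = 1_000_000
for eps in (0.01, 0.001, 0.0001):
    m, k = optimal_params(n, eps)
    print(f"FPR {eps}: {m / n:.2f} bits/key, k = {k}")
# FPR 0.01: 9.59 bits/key, k = 7
# FPR 0.001: 14.38 bits/key, k = 10
# FPR 0.0001: 19.17 bits/key, k = 13
```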

[Figure: line chart of false-positive rate (log scale, 100% down to 0.01%) against bits per key m/n from 0 to 20, at optimal k. Annotations mark 10 bits → 0.82% FPR (RocksDB default), 14 bits → 0.1% (Cassandra default), and "each +5 bits/key ≈ 10× lower FPR".]
FPR as a function of bits per key at optimal `k`. The curve is straight on a log scale — every five extra bits per key divides the false-positive rate by ten. This is the single chart that guides Bloom filter sizing: pick a target FPR on the y-axis, read off bits-per-key on the x-axis. RocksDB defaults to ten bits per key (~1% FPR, the classic choice). Cassandra defaults to a tighter 14 bits per key (~0.1%) because its read path amplifies false positives across a cluster.

A worked example

A 100-bit filter holding 10 keys

Build a Bloom filter with m = 100 bits and n = 10 expected keys. The optimal k is \lceil 0.693 \cdot 10 \rceil = 7, and the expected FPR is (0.6185)^{10} \approx 0.82\%.

Insert ten SSTable keys (imagine this is the Bloom filter attached to a real SSTable on disk):

bf = BloomFilter(m_bits=100, k_hashes=7)
for key in ["age", "city", "email", "locale", "name",
            "phone", "role", "state", "views", "zip"]:
    bf.add(key)

After all ten inserts, approximately 100 \cdot (1 - e^{-7 \cdot 10 / 100}) = 100 \cdot (1 - 0.4966) \approx 50 of the 100 bits are 1. The filter is half-full — which is exactly what theory says is optimal. (At the optimal k, each bit has a 50% chance of being 1; stray further in either direction and the FPR climbs.)
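The fill-ratio figure is just Step 1 of the derivation evaluated at these parameters:

```python
import math

m, n, k = 100, 10, 7
p_zero = math.exp(-k * n / m)       # Step 1: P(a given bit is still 0)
expected_ones = m * (1 - p_zero)    # expected number of set bits
print(round(expected_ones, 1))      # → 50.3
```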

Now probe:

bf.might_contain("age")      # True  (inserted — all 7 bits are 1 for sure)
bf.might_contain("name")     # True  (inserted)
bf.might_contain("user:42")  # False (some of its 7 bits are 0 → definitely not)
bf.might_contain("score")    # False (some of its 7 bits are 0 → definitely not)
bf.might_contain("views")    # True  (inserted)
bf.might_contain("age2")     # True  ← FALSE POSITIVE
                             #         (all 7 of its bits happen to be 1 because
                             #          other inserts set them; verify by reading
                             #          the SSTable and finding no "age2" there)

What does the read path do with each answer? Three cases:

  1. True positive (age): filter says maybe → reader opens SSTable → key is there → return value. This is the normal hot path for keys that exist.
  2. True negative (user:42): filter says definitely not → reader skips this SSTable entirely. Zero disk I/O. This is the payoff — it used to cost a full seek.
  3. False positive (age2): filter says maybe → reader opens SSTable → key is not there → skip to next SSTable. Wasted seek, but it happens on fewer than 1% of lookups at ten bits per key.

With ten bits per key and 1% FPR, in a store with 100 SSTables, a single missing-key lookup now does on average about one wasted seek (1% × 100 SSTables), down from 100 wasted seeks with no filter — a 100× improvement in the worst case, at a memory cost of ten bits per stored key.
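Spelled out as arithmetic, for a lookup of a key that exists in no SSTable:

```python
sstables = 100
fpr = 0.0082            # ten bits per key at optimal k, from the math above

seeks_without_filters = sstables          # every SSTable gets probed on disk
seeks_with_filters = fpr * sstables       # only false positives cost a seek
print(seeks_without_filters, round(seeks_with_filters, 2))  # → 100 0.82
```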

Integration into the LSM read path

A Bloom filter per SSTable, loaded at startup, probed before any disk I/O. The change to the read path is one if statement per SSTable:

# sstable.py — SSTable with a Bloom filter sidecar
class SSTableReader:
    def __init__(self, path):
        self.path = path
        self.index = load_sparse_index(path)         # from chapter 14
        self.bloom = load_bloom(path + ".bloom")     # built during flush

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None                              # skip, zero I/O
        # normal read path: binary-search index, seek to block, scan block
        block = read_block_for(key, self.index, self.path)
        return scan_block(block, key)

The Bloom filter for each SSTable is built during the flush — iterating the sorted memtable also iterates exactly the keys going into that SSTable, so for key, value in mt.items(): bloom.add(key) adds zero extra passes to the flush. The bitmap is then written to a sidecar file <sstable>.bloom (or appended as a "meta block" inside the SSTable; both layouts are common).

At startup, every .bloom sidecar is loaded into RAM. With ten bits per key, a store holding a billion keys keeps all its Bloom filters in about 1.25 GB of RAM. RocksDB and Cassandra load them lazily (first access of each SSTable's filter) to avoid paying for SSTables that never get read.
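The chapter leaves load_bloom and the sidecar layout unspecified. One plausible format — an assumption for illustration, not a canonical layout — is a 16-byte header holding m and k followed by the raw bitmap; the class is restated in condensed form so the sketch runs standalone:

```python
import hashlib, struct

class BloomFilter:
    """Condensed copy of the class above, so this snippet runs standalone."""
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)
    def _positions(self, key):
        h = hashlib.md5(key.encode("utf-8")).digest()
        a = int.from_bytes(h[:8], "big")
        b = int.from_bytes(h[8:], "big")
        return [(a + i * b) % self.m for i in range(self.k)]
    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)
    def might_contain(self, key):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(key))

def save_bloom(bloom, path):
    # Hypothetical layout: 8-byte big-endian m and k, then the raw bitmap.
    # Real engines add a magic number, a format version, and a checksum.
    with open(path, "wb") as f:
        f.write(struct.pack(">QQ", bloom.m, bloom.k))
        f.write(bytes(bloom.bits))

def load_bloom(path):
    with open(path, "rb") as f:
        m, k = struct.unpack(">QQ", f.read(16))
        bf = BloomFilter(m, k)
        bf.bits = bytearray(f.read())
        return bf
```

A round-trip preserves every bit, so the loaded filter answers might_contain identically to the one built during the flush.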

In the overall LSM read path, the Bloom filter sits between the SSTable selection loop and the block-level I/O:

# store.py — LSM get with Bloom filters
def get(self, key):
    # 1. check memtable(s) — no Bloom filter, just O(log n) skip-list lookup
    if (v := self.active.get(key)) is not None:
        return None if is_tombstone(v) else v
    if self.frozen and (v := self.frozen.get(key)) is not None:
        return None if is_tombstone(v) else v

    # 2. check SSTables newest first, but ASK THE BLOOM FILTER FIRST
    for sst in self.sstables:                        # newest first
        if not sst.bloom.might_contain(key):
            continue                                 # definitely not here
        if (v := sst.get(key)) is not None:          # maybe — verify on disk
            return None if is_tombstone(v) else v
    return None

This is the only change to the read path from the previous chapter. One line — if not sst.bloom.might_contain(key): continue — and the worst case collapses from a hundred seeks to one.

[Figure: LSM read path with Bloom filters. A get("user:42:email") flows left to right through the memtable (skip-list miss, no I/O) and then past five SSTables, each gated by an in-RAM filter: SSTable-0 and SSTable-1 say "definitely not" and are skipped; SSTable-2 says "maybe", costs one disk seek, and holds the key.]
A `get("user:42:email")` with Bloom filters on every SSTable. The memtable is checked first (skip-list miss, no I/O). Each SSTable is then gated by its filter, newest first. Two filters say "definitely not" and those files are skipped with zero disk I/O. SSTable-2's filter says "maybe"; the file is opened, the key is found, and the read stops without consulting SSTable-3 or SSTable-4. Total cost: three in-RAM filter probes and one disk seek. The real savings show up on misses — a key absent from all five tables costs five RAM probes and, at 1% FPR, almost never a seek, where the unfiltered read path would pay five.

Going deeper

You have a Bloom filter wired into the LSM. This section covers three upgrades — counting and cuckoo filters for delete support, per-level filter sizing for multi-level LSMs, and RocksDB's per-data-block and partitioned filters — plus a rule of thumb for the overall filter memory budget.

Counting Bloom filters and cuckoo filters

A counting Bloom filter replaces each bit with a small counter (typically 4 bits). Insert increments every counter at the k positions; delete decrements them. The filter now supports removal, at a 4× memory cost and the same FPR. The 4-bit width is chosen to make counter overflow essentially impossible — with k = 7 and a half-full filter, the chance any counter reaches 15 is astronomically small.
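A minimal sketch of the idea (illustrative, not any engine's implementation — and it spends a whole byte per counter for clarity, rather than packing two 4-bit counters per byte as the memory figure above assumes):

```python
import hashlib

class CountingBloomFilter:
    """Bloom filter variant with per-position counters, so delete works.
    Sketch only: one byte per counter; production designs pack 4-bit
    counters, two per byte, which is where the 4x memory cost comes from."""
    def __init__(self, m_counters, k_hashes):
        self.m, self.k = m_counters, k_hashes
        self.counts = bytearray(m_counters)

    def _positions(self, key):
        h = hashlib.md5(key.encode("utf-8")).digest()
        a = int.from_bytes(h[:8], "big")
        b = int.from_bytes(h[8:], "big")
        return [(a + i * b) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            if self.counts[p] < 255:      # saturate instead of overflowing
                self.counts[p] += 1

    def remove(self, key):
        # Only ever remove keys that were actually added; otherwise
        # counters for unrelated keys get decremented and false
        # negatives — the one thing a Bloom filter must never produce —
        # become possible.
        for p in self._positions(key):
            if self.counts[p] > 0:
                self.counts[p] -= 1

    def might_contain(self, key):
        return all(self.counts[p] > 0 for p in self._positions(key))
```

Removing a key decrements only its own k counters, so other keys' "maybe" answers survive; once the filter's only key is removed, every counter returns to zero and the filter correctly answers "definitely not".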

Cuckoo filters do the same job more efficiently. Instead of setting bits, each key stores a small fingerprint (typically 8 bits) at one of two candidate buckets; on collision, one occupant is evicted and re-hashed into its other candidate bucket — the cuckoo-hashing kick. Delete is supported by removing the fingerprint from whichever bucket holds it. At low load factors a cuckoo filter uses less space than a Bloom filter at the same FPR, and supports deletion natively; at high load factors it can fail to insert (and has to rehash).

LSM engines almost never use counting Bloom filters — deletion is not needed because SSTables are immutable, and the 4× memory cost is wasted. Cuckoo filters are starting to appear in experimental branches (e.g. RocksDB's CuckooTable) for the specific case where get() is highly dominated by misses. [2]

Per-level filter sizing in a tiered LSM

A real LSM organises its SSTables into levels (chapter 17). The top level (L0) holds a handful of small, freshly flushed files; each deeper level holds roughly ten times as much data as the one above it. Since a read checks L0 first, then L1, then L2, down to L6 or L7, probe traffic skews toward the shallow levels — L0 is consulted on every read, while L6 is only reached when every shallower level misses, typically for old keys or keys that do not exist at all.

Production engines tune Bloom filter bits-per-key level by level, and the right numbers depend on workload. A write-heavy store that absorbs many lookups for missing keys wants tight filters everywhere; a read-heavy store whose lookups almost always hit can save RAM by loosening the filters on the deepest, largest levels.

RocksDB's per-block filters and partitioned filters

The Bloom filter described in this chapter is per-SSTable — one filter per file. RocksDB goes finer. Its legacy block-based filters keep one small filter per data block, so only the filters for blocks a query actually touches matter; its partitioned filters slice one whole-file filter into fixed-size pieces behind a small top-level index, so filter pieces can live in the block cache and be evicted like any other block.

Both optimisations are described in the RocksDB blog and docs. [3] They do not change the math — a block-level filter uses the same FPR formula — but they change the failure mode of "Bloom filter eats too much RAM" into "Bloom filter cached like any other block, evicted under pressure, re-read on demand".

Sizing the filter memory budget

A practical rule for an LSM designer: the Bloom filter memory budget is a fixed fraction of the data size, and that fraction is directly controlled by bits-per-key.

For a 100 GB database with ~100-byte entries (a billion keys), 10 bits per key is 1.25 GB of RAM for Bloom filters. That is usually less than the sparse-index memory, and usually less than the OS's page cache for hot blocks. In other words: Bloom filters are cheap enough that "just turn them on everywhere" is the default operational posture. The decision is not "do I want them?" but "how many bits per key?" — and the answer is "~10 unless measurements say otherwise".
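The budget as a function of its two real inputs, entry size and bits per key (filter_budget_gb is a hypothetical helper, and the 100-byte average entry is an assumption):

```python
def filter_budget_gb(data_gb, avg_entry_bytes, bits_per_key):
    """RAM needed for Bloom filters, given total data size (decimal GB)
    and average entry size. Illustrative helper, not part of the store."""
    n_keys = data_gb * 1e9 / avg_entry_bytes
    return n_keys * bits_per_key / 8 / 1e9

print(filter_budget_gb(100, 100, 10))   # → 1.25 (the figure quoted above)
print(filter_budget_gb(100, 100, 14))   # → 1.75 (Cassandra-style 14 bits/key)
```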

Where this leads next

You now have a read path that is fast for hits and fast for misses. A Bloom filter per SSTable, ten bits per key, ~1% FPR, and a single line of code in the read loop — that is the widget that turns a hundred-SSTable lookup from a hundred seeks into roughly one.

But there is a problem lurking just behind all this cleverness: the number of SSTables keeps growing. Every flush adds one more file. Within a few hours of sustained writes you can have thousands of SSTables, and even at ~1% FPR per filter, a query that truly does not exist will pay ten wasted seeks across ten unlucky filters. More importantly, each SSTable may hold an older version of a key that a newer SSTable overrode — a delete or an update — and those stale records are pure waste, consuming disk space and dragging the read path.

The answer is compaction: periodically pick a set of SSTables and merge them into a smaller number of larger, more up-to-date SSTables, dropping overridden keys and tombstones in the process. You already have the merge primitive from chapter 12 — the streaming k-way merge. What you do not yet have is the policy: when to compact, which files to compact together, and how to organise the output to keep read amplification bounded.

Chapter 17 builds the first of the two main policies — leveled compaction, the LevelDB/RocksDB design — and derives the read-amplification and write-amplification numbers that make it the default choice for read-heavy workloads. Chapter 18 covers the opposite extreme, tiered compaction (Cassandra's default), which trades read amplification for write amplification.

For now, take a moment with the whole machine. You have a memtable, a flush, an SSTable layout with a sparse index, a read path that checks all of them newest-first, and a Bloom filter that cheaply skips the ones that cannot hold your key. That is a working LSM-tree key-value store. It absorbs writes in O(log n) per operation, serves point reads in ~1 disk seek even across hundreds of SSTables, and is ready to be made durable (Build 5's WAL) and space-efficient (Build 3's compaction chapters). The Bloom filter is forty lines of Python, but it is also the difference between "an LSM that works" and "an LSM that is fast enough to ship".

References

  1. Adam Kirsch and Michael Mitzenmacher, Less Hashing, Same Performance: Building a Better Bloom Filter (ESA 2006) — the double-hashing trick that lets one pair of hashes stand in for k independent hashes with no measurable FPR loss. people.seas.harvard.edu/~michaelm.
  2. Bin Fan et al., Cuckoo Filter: Practically Better Than Bloom (CoNEXT 2014) — the cuckoo-filter design, with delete support and better space efficiency at common FPRs. dl.acm.org.
  3. Facebook RocksDB team, Full Filter Block and Partitioned Index/Filters (RocksDB wiki) — per-data-block and partitioned Bloom filter implementations in production. github.com/facebook/rocksdb/wiki.
  4. Huanchen Zhang et al., SuRF: Practical Range Query Filtering with Fast Succinct Tries (SIGMOD 2018) — range-query-aware filters, which plain Bloom filters cannot support. www.cs.cmu.edu/~huanche1.
  5. Burton H. Bloom, Space/Time Trade-offs in Hash Coding with Allowable Errors (CACM 1970) — the original paper. Still readable, still correct. dl.acm.org/doi/10.1145/362686.362692.
  6. Sanjay Ghemawat and Jeff Dean, LevelDB Implementation Notes — the per-SSTable Bloom filter design, sidecar-block layout, and the FilterPolicy interface that RocksDB inherited. github.com/google/leveldb.