In short
The previous chapter gave you a working read path — check the memtable, then every SSTable from newest to oldest, return the first hit. It works, but it has one nasty cost: if the key you want is not in the store, or if it is only in the newest memtable, the reader still opens every SSTable, seeks to its index, and probes. For a store with a hundred SSTables, that is a hundred seeks per miss. On spinning disk, that is a second. On SSD it is ten milliseconds. Both are orders of magnitude more than one hash lookup, and almost all of that work is wasted — the key was never in those files. The fix is a Bloom filter: a tiny in-RAM bitmap, about ten bits per key, that sits in front of each SSTable and answers one question — "could this key be in the table?" — with two possible replies: definitely not (skip the seek) or maybe (do the seek). False positives are allowed (you do an unnecessary seek occasionally); false negatives are not (you never miss a key that is actually there). This chapter implements the filter in forty lines of Python with k hash functions and m bits, derives the false-positive rate formula (1 - e^{-kn/m})^k and uses it to pick optimal k and m for a target 1% FPR (~10 bits per key, 7 hash functions), walks through a worked example with a 100-bit filter holding 10 keys, and wires a Bloom filter per SSTable into the read path so a typical miss is now one RAM probe per filter instead of one disk seek per SSTable. This single widget is the biggest single-line speedup in the LSM read path, and every production engine — LevelDB, RocksDB, Cassandra, HBase — uses it.
You have a read path. It is short and it is correct: check the memtable, then each SSTable from newest to oldest, stop at the first hit. The previous chapter built it in about fifteen lines of Python.
It is also, under one very common workload, dreadful.
Consider what happens when an app calls store.get("user:42:email") and the key does not exist. Maybe the user was never created. Maybe the caller is checking "does this key exist yet?" before an insert. Whatever the reason, the read path has no way to know the key is absent without looking — so it checks the memtable (miss), then opens SSTable 1, binary-searches its sparse index, seeks to the right block, scans it (miss), then SSTable 2 (miss), then SSTable 3... every one of them a disk seek that was guaranteed to fail.
If a store has 100 SSTables on disk, a single missing-key lookup costs 100 seeks. On a decent NVMe SSD each seek is about 100 microseconds, so that is 10 milliseconds per miss. On a rotating disk each seek is about 10 milliseconds, so that is a full second per miss. For a database that is supposed to serve tens of thousands of requests per second, this is catastrophic.
The annoying part is that most of those SSTables obviously do not contain the key. If SSTable 4 only holds keys starting with session: and you asked for user:42:email, opening and probing it is pure waste. You just need a very fast way to ask each SSTable "do you contain this key?" — cheap enough to do for every SSTable on every get(), and conservative enough that if it says "no" you can believe it.
That widget is a Bloom filter. It is a probabilistic set membership test that uses about ten bits per key, lives in RAM, and answers in two hash computations' worth of time. It is the single biggest optimisation in the LSM read path, and it is small enough that you will have it built and integrated by the end of this chapter.
The idea
A Bloom filter is a fixed-size bit array of length m and a fixed set of k hash functions that each map a key to a bit position in [0, m).
- Insert. To add key x, compute h_1(x), h_2(x), \ldots, h_k(x) modulo m and set all k bits to 1.
- Query. To ask "is x in the set?", compute the same k hashes. If any of the k bits is 0, return definitely not — no insert path could have left that bit clear. If all k bits are 1, return maybe — those bits might have been set by some combination of other keys (a false positive), so the caller must go verify by reading the actual SSTable.
The asymmetry is the whole point. A "definitely not" is free — no disk I/O, no further work. A "maybe" is exactly as expensive as the old read path. The Bloom filter only has to say "definitely not" often enough, for most missing keys, to save the bulk of the wasted seeks.
No false negatives. That is the property that makes a Bloom filter safe to gate a real lookup: if it says "no", the read path can skip the SSTable with full confidence. If it says "maybe", the read path does what it would have done anyway — read the SSTable, find the key (or not). You never lose data to a Bloom filter.
The code
A Bloom filter is one of those data structures where the code is shorter than the explanation. Here are the forty lines:
# bloom.py — a Bloom filter with k hash functions and m bits
import hashlib, math

class BloomFilter:
    def __init__(self, m_bits: int, k_hashes: int):
        """
        m_bits  : size of the bit array (pick ~10 * expected_n for 1% FPR)
        k_hashes: number of hash functions (pick ~0.693 * (m/n), rounded)
        """
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)  # one bit per slot, packed

    def _positions(self, key: str):
        """Yield k bit positions for this key, in [0, m)."""
        kb = key.encode("utf-8")
        # Double-hashing trick: one MD5, split into two 64-bit halves,
        # combine as h_i(x) = (a + i*b) mod m for i in 0..k-1.
        h = hashlib.md5(kb).digest()
        a = int.from_bytes(h[:8], "big")
        b = int.from_bytes(h[8:], "big")
        for i in range(self.k):
            yield (a + i * b) % self.m

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)     # set bit p

    def might_contain(self, key: str) -> bool:
        for p in self._positions(key):
            if not (self.bits[p >> 3] & (1 << (p & 7))):
                return False                       # definitely not
        return True                                # maybe

    @staticmethod
    def optimal_params(n: int, target_fpr: float) -> tuple[int, int]:
        """Pick (m, k) that minimise memory at the given FPR."""
        m = math.ceil(-n * math.log(target_fpr) / (math.log(2) ** 2))
        k = max(1, round((m / n) * math.log(2)))
        return m, k
Forty lines, counting the parameter helper. Five pieces deserve a look.
Why pack bits into a bytearray instead of using a list of booleans: a Python list of True/False stores an 8-byte pointer per entry on CPython (the bool objects themselves are shared singletons). A bytearray uses one bit per entry after the >> 3 and & 7 packing — 64× denser. For a filter with ten million bits (a million keys at 10 bits each), that is 1.25 MB versus 80 MB. Bloom filters are supposed to be tiny; using a list defeats the purpose.
Why the double-hashing trick: you need k independent hash functions, but computing k different cryptographic hashes per key is wasteful. Kirsch and Mitzenmacher showed in 2006 that h_i(x) = h_a(x) + i \cdot h_b(x) \bmod m behaves statistically like k independent hashes, as long as h_a and h_b are themselves good hashes. [1] MD5 gives you 128 bits in one call; split into two 64-bit halves and you have h_a and h_b for free. RocksDB uses this exact pattern; Cassandra uses MurmurHash3's 128-bit output the same way.
Why MD5 and not SHA-256: MD5 is cryptographically broken, but Bloom filters do not need cryptographic hashes — they need uniform hashes. MD5 distributes bits uniformly at a fraction of the cost of SHA-256. Any non-cryptographic hash with good avalanche (MurmurHash3, xxHash, FarmHash) is even better; MD5 is just the one in the Python standard library with a large enough output to split in two.
Why optimal_params returns two numbers, not one: a Bloom filter has two independent knobs. m controls the raw memory. k controls the trade-off between how many bits each insert sets (higher k means more bits set per key, so the array fills up faster) and how strict the query is (higher k means more bits to check, so a false positive needs more accidental 1s). For a given memory budget m/n, there is an optimal k — derived below — that minimises the false-positive rate. Picking k badly can double the FPR at the same memory.
Why the might_contain return values are named that way, not contains: the API has to loudly tell the caller "this is not a set, do not trust a True". Python's in operator would invite mistakes (if key in bloom: would look authoritative). might_contain reads as "this might be a yes, go verify" — which is exactly the contract.
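Before moving on to the math, it is worth convincing yourself empirically that the class delivers its two promises: no false negatives, and a false-positive rate near the target. The sketch below repeats the same double-hashing scheme inline (a plain set of bit positions stands in for the packed bytearray, purely to keep the demo self-contained) and measures the FPR on keys that were never inserted:

```python
import hashlib, math

def positions(key: str, m: int, k: int) -> list[int]:
    # Same double-hashing scheme as bloom.py: h_i(x) = (a + i*b) mod m.
    d = hashlib.md5(key.encode("utf-8")).digest()
    a = int.from_bytes(d[:8], "big")
    b = int.from_bytes(d[8:], "big")
    return [(a + i * b) % m for i in range(k)]

n, target = 1000, 0.01
m = math.ceil(-n * math.log(target) / (math.log(2) ** 2))  # 9586 bits
k = max(1, round((m / n) * math.log(2)))                   # 7 hashes

bits: set[int] = set()
for i in range(n):
    bits.update(positions(f"key:{i}", m, k))

# Promise 1: no false negatives — every inserted key answers "maybe".
misses = [i for i in range(n)
          if not all(p in bits for p in positions(f"key:{i}", m, k))]
assert misses == []

# Promise 2: empirical FPR on 10,000 keys that were never inserted.
trials = 10_000
fps = sum(all(p in bits for p in positions(f"absent:{i}", m, k))
          for i in range(trials))
print(f"m={m}, k={k}, empirical FPR = {fps / trials:.4f}")  # close to 0.01
```

The exact FPR wobbles with the key set, but it lands near the 1% target — which is the whole point of the sizing formula derived next.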
The math
The false-positive rate of a Bloom filter is one of the few formulas in systems design that is both short and actually useful. The derivation is three steps.
Step 1. The probability that a specific bit is still 0 after all inserts.
A single insert picks k bits uniformly at random out of m (to a good approximation — the hashes are uniform). The probability that a specific bit is not set by that one insert is (1 - 1/m)^k — each of the k hashes misses the bit with probability (1 - 1/m).
After n inserts (each with its own k hashes), the probability that a specific bit is still 0 is:

p_0 = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}

The approximation uses (1 - 1/m)^m \approx e^{-1} for large m, which is accurate to within a fraction of a percent for any filter you would build in practice.
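That accuracy claim is easy to check numerically for a realistically sized filter:

```python
import math

m = 9586                  # e.g. a filter sized for 1,000 keys at 1% FPR
exact = (1 - 1 / m) ** m
approx = math.exp(-1)
print(exact, approx)      # agree to four decimal places
```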
Step 2. The probability that a specific bit is 1.
Just the complement:

p_1 = 1 - p_0 \approx 1 - e^{-kn/m}
Step 3. The probability of a false positive.
A false positive on query q means: q was never inserted, but all k of its hashed positions happen to be 1 anyway. Assuming the bit-set events are independent (a small lie, but close enough for these sizes), this is just the k-th power of Step 2:

\text{FPR} = p_1^k \approx \left(1 - e^{-kn/m}\right)^k
This is the formula that gets quoted in every textbook, and it is the formula that lets you pick m and k with confidence.
Tuning k for a given m and n. For a fixed memory budget (fixed m/n), the FPR is minimised by picking:

k^* = \frac{m}{n} \ln 2 \approx 0.693 \cdot \frac{m}{n}
Plug k^* back in and the optimal FPR becomes:

\text{FPR}^* = \left(\tfrac{1}{2}\right)^{k^*} = e^{-(\ln 2)^2 \cdot m/n} \approx 0.6185^{\,m/n}
So ten bits per key — m/n = 10 — gives k^* \approx 7 and FPR \approx 0.82\%. That is the classic "Bloom filter: 1% miss rate at ten bits per key" number every LSM engine quotes.
Sizing m for a target FPR. Flipping the formula to solve for m given n and a target \varepsilon:

m = -\frac{n \ln \varepsilon}{(\ln 2)^2}
For \varepsilon = 0.01 (1% FPR), this gives m \approx 9.585 \cdot n — just under ten bits per key. For \varepsilon = 0.001 (0.1%), it climbs to m \approx 14.38 \cdot n. For \varepsilon = 0.0001 (one in ten thousand), m \approx 19.17 \cdot n. Every factor-of-10 improvement in FPR costs roughly five extra bits per key — a beautifully gentle curve.
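The bits-per-key numbers quoted above fall straight out of that formula:

```python
import math

def bits_per_key(target_fpr: float) -> float:
    # m/n = -ln(eps) / (ln 2)^2
    return -math.log(target_fpr) / (math.log(2) ** 2)

for eps in (0.01, 0.001, 0.0001):
    print(f"FPR {eps:>6}: {bits_per_key(eps):.2f} bits per key")
# → 9.59, 14.38, 19.17 bits per key
```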
A worked example
A 100-bit filter holding 10 keys
Build a Bloom filter with m = 100 bits and n = 10 expected keys. The optimal k is \lceil 0.693 \cdot 10 \rceil = 7, and the expected FPR is (0.6185)^{10} \approx 0.82\%.
Insert ten SSTable keys (imagine this is the Bloom filter attached to a real SSTable on disk):
bf = BloomFilter(m_bits=100, k_hashes=7)
for key in ["age", "city", "email", "locale", "name",
            "phone", "role", "state", "views", "zip"]:
    bf.add(key)
After all ten inserts, approximately 100 \cdot (1 - e^{-7 \cdot 10 / 100}) = 100 \cdot (1 - 0.4966) \approx 50 of the 100 bits are 1. The filter is half-full — which is exactly what theory says is optimal. (At the optimal k, each bit has a 50% chance of being 1; stray further in either direction and the FPR climbs.)
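You can watch the filter fill up for yourself. The snippet below repeats the chapter's double-hashing scheme inline so it runs on its own (a set of positions stands in for the packed bitmap):

```python
import hashlib

def positions(key: str, m: int = 100, k: int = 7) -> set[int]:
    # Same double-hashing scheme as bloom.py, specialised to m=100, k=7.
    d = hashlib.md5(key.encode("utf-8")).digest()
    a = int.from_bytes(d[:8], "big")
    b = int.from_bytes(d[8:], "big")
    return {(a + i * b) % m for i in range(k)}

bits: set[int] = set()
for key in ["age", "city", "email", "locale", "name",
            "phone", "role", "state", "views", "zip"]:
    bits |= positions(key)

print(f"{len(bits)} of 100 bits set")  # roughly half, as theory predicts
```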
Now probe:
bf.might_contain("age") # True (inserted — all 7 bits are 1 for sure)
bf.might_contain("name") # True (inserted)
bf.might_contain("user:42") # False (some of its 7 bits are 0 → definitely not)
bf.might_contain("score") # False (some of its 7 bits are 0 → definitely not)
bf.might_contain("views") # True (inserted)
bf.might_contain("age2") # True ← FALSE POSITIVE
# (all 7 of its bits happen to be 1 because
# other inserts set them; verify by reading
# the SSTable and finding no "age2" there)
What does the read path do with each answer? Three cases:

- True positive (age): filter says maybe → reader opens SSTable → key is there → return value. This is the normal hot path for keys that exist.
- True negative (user:42): filter says definitely not → reader skips this SSTable entirely. Zero disk I/O. This is the payoff — it used to cost a full seek.
- False positive (age2): filter says maybe → reader opens SSTable → key is not there → skip to next SSTable. Wasted seek, but it happens on fewer than 1% of lookups at ten bits per key.
With ten bits per key and 1% FPR, in a store with 100 SSTables, a single missing-key lookup now does on average about one wasted seek (1% × 100 SSTables), down from 100 wasted seeks with no filter — a 100× improvement in the worst case, at a memory cost of ten bits per stored key.
Integration into the LSM read path
A Bloom filter per SSTable, loaded at startup, probed before any disk I/O. The change to the read path is one if statement per SSTable:
# sstable.py — SSTable with a Bloom filter sidecar
class SSTableReader:
    def __init__(self, path):
        self.path = path
        self.index = load_sparse_index(path)      # from chapter 14
        self.bloom = load_bloom(path + ".bloom")  # built during flush

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None                           # skip, zero I/O
        # normal read path: binary-search index, seek to block, scan block
        block = read_block_for(key, self.index, self.path)
        return scan_block(block, key)
The Bloom filter for each SSTable is built during the flush — iterating the sorted memtable also iterates exactly the keys going into that SSTable, so for key, value in mt.items(): bloom.add(key) adds zero extra passes to the flush. The bitmap is then written to a sidecar file <sstable>.bloom (or appended as a "meta block" inside the SSTable; both layouts are common).
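As a concrete sketch of the sidecar, here is one possible on-disk layout — the eight-byte m/k header is an assumption for illustration, not the format any particular engine uses — with a build function for flush time and a loader for startup:

```python
import hashlib, math, os, struct, tempfile

def _positions(key: str, m: int, k: int) -> list[int]:
    # Same double-hashing scheme as bloom.py.
    d = hashlib.md5(key.encode("utf-8")).digest()
    a = int.from_bytes(d[:8], "big")
    b = int.from_bytes(d[8:], "big")
    return [(a + i * b) % m for i in range(k)]

def write_bloom_sidecar(keys, path: str, target_fpr: float = 0.01) -> None:
    """Size a filter for exactly len(keys), fill it, write header + bitmap."""
    n = max(1, len(keys))
    m = math.ceil(-n * math.log(target_fpr) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    bits = bytearray((m + 7) // 8)
    for key in keys:
        for p in _positions(key, m, k):
            bits[p >> 3] |= 1 << (p & 7)
    with open(path, "wb") as f:
        f.write(struct.pack("<II", m, k))  # header: m and k, little-endian
        f.write(bits)

def read_bloom_sidecar(path: str):
    """Load a sidecar; returns a might_contain(key) closure."""
    with open(path, "rb") as f:
        m, k = struct.unpack("<II", f.read(8))
        bits = f.read()
    def might_contain(key: str) -> bool:
        return all(bits[p >> 3] & (1 << (p & 7))
                   for p in _positions(key, m, k))
    return might_contain

# Round-trip demo in a temp directory.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "sstable-0001.bloom")
    write_bloom_sidecar(["age", "city", "email"], path)
    probe = read_bloom_sidecar(path)
    print(probe("age"), probe("city"))  # True True — no false negatives
```

Because the filter is sized after the memtable is sealed, m and k are always right for the actual key count — there is no "overfilled filter" failure mode at flush time.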
At startup, every .bloom sidecar is loaded into RAM. With ten bits per key, a store holding a billion keys keeps all its Bloom filters in about 1.25 GB of RAM. RocksDB and Cassandra load them lazily (first access of each SSTable's filter) to avoid paying for SSTables that never get read.
In the overall LSM read path, the Bloom filter sits between the SSTable selection loop and the block-level I/O:
# store.py — LSM get with Bloom filters
def get(self, key):
    # 1. check memtable(s) — no Bloom filter, just O(log n) skip-list lookup
    if (v := self.active.get(key)) is not None:
        return None if is_tombstone(v) else v
    if self.frozen and (v := self.frozen.get(key)) is not None:
        return None if is_tombstone(v) else v
    # 2. check SSTables newest first, but ASK THE BLOOM FILTER FIRST
    for sst in self.sstables:                   # newest first
        if not sst.bloom.might_contain(key):
            continue                            # definitely not here
        if (v := sst.get(key)) is not None:     # maybe — verify on disk
            return None if is_tombstone(v) else v
    return None
This is the only change to the read path from the previous chapter. One line — if not sst.bloom.might_contain(key): continue — and the worst case collapses from a hundred seeks to one.
Common confusions
- "Why not just use a hash set of every key in every SSTable?" Memory. A Python set uses about 100 bytes per string entry (pointer, hash cache, string overhead). For a billion keys that is 100 GB of RAM — more than the whole database is supposed to occupy in memory. A Bloom filter with ten bits per key is 1.25 GB for the same billion keys, an 80× saving. The cost you pay for the memory reduction is the occasional false positive (and the "no delete" restriction below). For the LSM read path, that trade is a landslide win.
- "What about deletes? If I delete a key, shouldn't I clear its bits?" You cannot. The bits set by a given key are shared with other keys — any of the k positions might also have been set by some unrelated insert. Clearing them would create false negatives (the other keys' queries now see a 0 where the filter promised a 1), which is the one failure mode a Bloom filter is not allowed to have. The standard answer is: Bloom filters are add-only, and in the LSM context that is exactly what you want — an SSTable is immutable, and its Bloom filter is built once at flush time and never modified. Deletes in the LSM world are handled by writing tombstones to a newer SSTable; the old SSTable's Bloom filter does not need to know the key was deleted. If you do need a "true delete" Bloom filter — say, for a streaming deduplication job — you want a counting Bloom filter (each slot is a small counter instead of a single bit) or a cuckoo filter (covered in Going deeper), both of which support deletion at the cost of extra memory.
- "What happens at the FPR ceiling? If I keep inserting, does the filter silently fail?" Not silently, but gracefully. As n grows past the m the filter was sized for, more bits flip to 1 and the FPR climbs according to the formula — it does not suddenly flip to "always says maybe". At kn/m = \ln 2 (i.e., half the bits are 1) you are at the optimal FPR; beyond that, every extra key raises the FPR monotonically. LSM engines avoid this by sizing the filter during flush, after the memtable is sealed and n is known exactly. You never overfill a filter; you always build it with the right m for the actual key count.
- "Two SSTables may contain the same key (one with a tombstone). Do both Bloom filters say 'maybe'?" Yes. Both filters see the key during their own flush and insert it. On query, both say "maybe", the read path opens the newest first, finds the tombstone, and returns None. This is correct behaviour — compaction (next chapter) is what will eventually merge them into one SSTable with just the tombstone, after which only that one filter will say "maybe".
- "Why not use an adaptive filter that updates as queries arrive?" The academic literature has "learned Bloom filters" and various adaptive variants that use the query distribution to improve FPR. They are real and they work, but the complexity cost is large: you need to remember which queries have been made, update the structure under concurrency, and you lose the "build once at flush, read-only forever" property that makes plain Bloom filters so operationally simple. Every major LSM engine uses the plain version. [5] [6]
- "What about range queries — does the Bloom filter help?" No. A Bloom filter answers point membership — "is this specific key in the set?" — and has no notion of "any key between A and Z". Range queries in LSM trees still merge-iterate every SSTable (which is fine, because range iteration is already sequential I/O, not the random-seek case Bloom filters address). There are specialised "range filters" for this — the Succinct Range Filter (SuRF) is the main one — but they are rare outside research prototypes. [4]
Going deeper
You have a Bloom filter wired into the LSM. This section covers three upgrades: counting and cuckoo filters for delete support, the sizing math for multi-level LSMs, and how RocksDB's per-data-block filters interact with block-cache and partitioned indexes.
Counting Bloom filters and cuckoo filters
A counting Bloom filter replaces each bit with a small counter (typically 4 bits). Insert increments every counter at the k positions; delete decrements them. The filter now supports removal, at a 4× memory cost and the same FPR. The 4-bit width is chosen to make counter overflow essentially impossible — with k = 7 and a half-full filter, the chance any counter reaches 15 is astronomically small.
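A minimal sketch of the idea — a plain list of integers stands in for packed 4-bit counters, and positions come from the same double-hashing trick as bloom.py:

```python
import hashlib

class CountingBloom:
    """Bloom filter variant with per-slot counters, supporting remove()."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counts = [0] * m          # conceptually 4-bit counters

    def _positions(self, key: str):
        d = hashlib.md5(key.encode("utf-8")).digest()
        a = int.from_bytes(d[:8], "big")
        b = int.from_bytes(d[8:], "big")
        return [(a + i * b) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.counts[p] = min(self.counts[p] + 1, 15)  # saturate at 4 bits

    def remove(self, key: str) -> None:
        # Only safe for keys that were actually added (and whose counters
        # never saturated) — otherwise false negatives become possible.
        for p in self._positions(key):
            if self.counts[p] > 0:
                self.counts[p] -= 1

    def might_contain(self, key: str) -> bool:
        return all(self.counts[p] > 0 for p in self._positions(key))

cb = CountingBloom(m=1000, k=7)
cb.add("session:1")
cb.add("session:2")
cb.remove("session:1")
print(cb.might_contain("session:2"))   # True — its counters survived the delete
```

Note how delete works: removing "session:1" decrements only its own seven counters, so any slot it shared with "session:2" drops from 2 to 1 rather than to 0 — which is exactly why counters, not bits, are needed for safe deletion.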
Cuckoo filters do the same job more efficiently. Instead of setting bits, each key stores a small fingerprint (typically 8 bits) at one of two candidate buckets; on collision, one occupant is evicted and re-hashed into its other candidate bucket — the cuckoo-hashing kick. Delete is supported by removing the fingerprint from whichever bucket holds it. At low load factors a cuckoo filter uses less space than a Bloom filter at the same FPR, and supports deletion natively; at high load factors it can fail to insert (and has to rehash).
LSM engines almost never use counting Bloom filters — deletion is not needed because SSTables are immutable, and the 4× memory cost is wasted. Cuckoo filters are starting to appear in experimental branches (e.g. RocksDB's CuckooTable) for the specific case where get() is highly dominated by misses. [2]
Per-level filter sizing in a tiered LSM
A real LSM organises its many SSTables into levels (chapter 17). The first level (L0) holds a handful of freshly flushed SSTables; each deeper level holds roughly ten times more data. Since a read checks L0 first, then L1, then L2, down to L6 or L7, the probe traffic skews toward the early levels — L0 is checked on every read, L6 only when every earlier level misses.
Production engines tune the Bloom filter size per level:
- L0 filters: oversized. More bits per key (12–15), tighter FPR. L0 is checked on every single read; every FP there is a wasted seek.
- Higher levels: standard. Ten bits per key is fine; these levels are checked rarely.
- Last level: sometimes no filter at all. In RocksDB, setting optimize_filters_for_hits = true disables the filter on the largest (last) level. The reasoning: the last level is huge and rarely queried for missing keys (most missing-key queries terminate earlier via one of the upper-level filters), so the RAM cost of a Bloom filter for the last level often outweighs the I/O it saves.
The numbers change with workload. A write-heavy store with many misses wants tight filters everywhere; a read-heavy store with almost no misses can save RAM by loosening them.
RocksDB's per-block filters and partitioned filters
The Bloom filter described in this chapter is per-SSTable — one filter per file. RocksDB goes finer: per-data-block filters, optionally partitioned.
- Per-data-block filter. Each 4 KB data block inside an SSTable gets its own small Bloom filter. On a point lookup, the reader binary-searches the sparse index to find the candidate block, then checks the block's filter. If it says "no", the reader skips the block read itself — not just the SSTable. This is a second layer of filter, inside the first.
- Partitioned filter. For large SSTables, the per-SSTable filter itself is too big to fit in the block cache. RocksDB partitions it into sub-filters, one per ~128 data blocks, and stores them as separate blocks inside the SSTable. The read path loads only the relevant sub-filter into the cache. This keeps the working-set cache pressure low and makes very large SSTables (hundreds of MB) viable.
Both optimisations are described in the RocksDB blog and docs. [3] They do not change the math — a block-level filter uses the same FPR formula — but they change the failure mode of "Bloom filter eats too much RAM" into "Bloom filter cached like any other block, evicted under pressure, re-read on demand".
Sizing the filter memory budget
A practical rule for an LSM designer: the Bloom filter memory budget is a fixed fraction of the data size, and that fraction is directly controlled by bits-per-key. For an average entry size of around 100 bytes:
- 10 bits per key → ~1.25% memory overhead, ~1% FPR.
- 12 bits per key → ~1.5% overhead, ~0.3% FPR.
- 14 bits per key → ~1.75% overhead, ~0.1% FPR.
For a 100 GB database, 10 bits per key is 1.25 GB of RAM for Bloom filters. That is usually less than the sparse-index memory, and usually less than the OS's page cache for hot blocks. In other words: Bloom filters are cheap enough that "just turn them on everywhere" is the default operational posture. The decision is not "do I want them?" but "how many bits per key?" — and the answer is "~10 unless measurements say otherwise".
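The FPR figures in that list come straight from the optimal-k formula of the math section; a three-line check (treating k as continuous for simplicity, so the numbers are slightly better than an integer-k filter achieves):

```python
import math

def fpr_at(bits_per_key: float) -> float:
    # Optimal-k FPR for a budget of m/n bits per key:
    # FPR* = (1/2)^(ln 2 * m/n) ≈ 0.6185^(m/n)
    return 0.5 ** (bits_per_key * math.log(2))

for bpk in (10, 12, 14):
    print(f"{bpk} bits/key -> FPR {fpr_at(bpk):.3%}")
# 10 -> ~0.82%, 12 -> ~0.31%, 14 -> ~0.12%
```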
Where this leads next
You now have a read path that is fast for hits and fast for misses. A Bloom filter per SSTable, ten bits per key, ~1% FPR, and a single line of code in the read loop — that is the widget that turns a hundred-SSTable lookup from a hundred seeks into roughly one.
But there is a problem lurking just behind all this cleverness: the number of SSTables keeps growing. Every flush adds one more file. Within a few hours of sustained writes you can have thousands of SSTables, and even at ~1% FPR per filter, a query that truly does not exist will pay ten wasted seeks across ten unlucky filters. More importantly, each SSTable may hold an older version of a key that a newer SSTable overrode — a delete or an update — and those stale records are pure waste, consuming disk space and dragging the read path.
The answer is compaction: periodically pick a set of SSTables and merge them into a smaller number of larger, more up-to-date SSTables, dropping overridden keys and tombstones in the process. You already have the merge primitive from chapter 12 — the streaming k-way merge. What you do not yet have is the policy: when to compact, which files to compact together, and how to organise the output to keep read amplification bounded.
Chapter 17 builds the first of the two main policies — leveled compaction, the LevelDB/RocksDB design — and derives the read-amplification and write-amplification numbers that make it the default choice for read-heavy workloads. Chapter 18 covers the opposite extreme, tiered compaction (Cassandra's default), which trades read amplification for write amplification.
For now, take a moment with the whole machine. You have a memtable, a flush, an SSTable layout with a sparse index, a read path that checks all of them newest-first, and a Bloom filter that cheaply skips the ones that cannot hold your key. That is a working LSM-tree key-value store. It absorbs writes in O(log n) per operation, serves point reads in ~1 disk seek even across hundreds of SSTables, and is ready to be made durable (Build 5's WAL) and space-efficient (Build 3's compaction chapters). The Bloom filter is forty lines of Python, but it is also the difference between "an LSM that works" and "an LSM that is fast enough to ship".
References
[1] Adam Kirsch and Michael Mitzenmacher, Less Hashing, Same Performance: Building a Better Bloom Filter (ESA 2006) — the double-hashing trick that lets one pair of hashes stand in for k independent hashes with no measurable FPR loss. people.seas.harvard.edu/~michaelm.
[2] Bin Fan et al., Cuckoo Filter: Practically Better Than Bloom (CoNEXT 2014) — the cuckoo-filter design, with delete support and better space efficiency at common FPRs. dl.acm.org.
[3] Facebook RocksDB team, Full Filter Block and Partitioned Index/Filters (RocksDB wiki) — per-data-block and partitioned Bloom filter implementations in production. github.com/facebook/rocksdb/wiki.
[4] Huanchen Zhang et al., SuRF: Practical Range Query Filtering with Fast Succinct Tries (SIGMOD 2018) — range-query-aware filters, which plain Bloom filters cannot support. www.cs.cmu.edu/~huanche1.
[5] Burton H. Bloom, Space/Time Trade-offs in Hash Coding with Allowable Errors (CACM 1970) — the original paper. Still readable, still correct. dl.acm.org/doi/10.1145/362686.362692.
[6] Sanjay Ghemawat and Jeff Dean, LevelDB Implementation Notes — the per-SSTable Bloom filter design, sidecar-block layout, and the FilterPolicy interface that RocksDB inherited. github.com/google/leveldb.