In short
You finished Build 2. Three hundred lines of Python are a real Bitcask — durable, compacted, hint-file-indexed, restarting in milliseconds. Then a product manager walks up with two requirements: "we need to store fifty million sensor readings" and "we need to list all readings between 9:00 and 10:00 this morning." Both requirements kill Bitcask. The first hits Wall A — the keydir must fit in RAM. Fifty million keys at roughly 50 bytes per keydir entry is 2.5 GB just for the index; push to five hundred million keys and you need 25 GB of RAM on every replica, forever, regardless of how cold the data is. The second hits Wall B — hash indexes have no order. A hash scatters keys uniformly across buckets on purpose; asking "what keys lie between 2026-04-24T09:00 and 2026-04-24T10:00" forces you to scan every bucket, which is every key, which is exactly the O(n) wall Build 2 was meant to knock down. These are not bugs. They are the two design pressures Bitcask's architecture cannot dissolve from the inside. Build 3 (LSM-trees) and Build 4 (B+ trees) are two radically different answers to the same two walls. This chapter names the walls with the index-memory arithmetic, draws the RAM line being crossed, shows a concrete workload that hits each wall, maps out the trade-space in a four-axis table, and — crucially — closes by telling you when Bitcask is still exactly the right choice, and why Riak shipped it for a decade.
Your BitcaskSegmented store works. Chapter 10 was the capstone — hint files collapsed restart from forty seconds to forty milliseconds, which means you can deploy this thing behind a load balancer without the ops team sending you angry messages about readiness probes. Build 2 is done. The design you have is the one Justin Sheehy and David Smith published in 2010. You have written a real database.
Then two users walk up.
User one is an IoT startup. They have fifty million temperature sensors across rural Karnataka. Each sensor reports once a minute. They want to keep the latest reading per sensor, indexed by sensor ID. "Can your database hold fifty million keys?" They mean it honestly — not a trick question. It is a question your design cannot answer yes to on the hardware they can afford.
User two is a payments company. They have a timestamp-keyed log of transactions and a real-time fraud dashboard that constantly asks "give me every transaction from the last five minutes." They want to do SELECT * WHERE timestamp BETWEEN '10:00' AND '10:05'. "Can your database do range queries?" This is also an honest question. It is also one your design cannot answer.
Two users, two walls. Both walls are baked into the shape of Bitcask — not bugs, not missing features, structural properties of the design choice "append-only log plus in-memory hash index." To get rid of them you have to change the shape. That is what Builds 3 and 4 do. This chapter is the bridge: name the walls, do the arithmetic, point at the two doors out.
Wall A — the keydir must fit in RAM
Bitcask's entire speed story rests on one fact: every live key has an entry in the in-memory keydir, and that entry names the exact (segment_id, offset, size) of the latest record. No entry, no get(). There is no fallback path that says "if the keydir does not know about this key, scan the log to find it" — scanning the log would be the O(n) wall chapter 5 named, which is exactly what the keydir exists to knock down.
So the keydir is not optional. It is total. Every live key, every replica, always in RAM.
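That read path can be sketched in a dozen lines. The names here (`put`, `get`, a plain dict standing in for the keydir) are illustrative, not the book's `BitcaskSegmented` API — the point is only that the keydir entry names the exact location, so a read is one seek:

```python
# Minimal sketch of Bitcask's read path (illustrative names, not the
# book's BitcaskSegmented API). The keydir entry gives the exact
# location, so a get() is one seek plus one read — and a key with no
# keydir entry simply does not exist; there is no scan-the-log fallback.
keydir = {}  # key -> (segment_path, value_offset, value_size)

def put(segment_path, key, value):
    # Append the value to the segment, record its location in RAM.
    with open(segment_path, "ab") as f:
        offset = f.tell()
        data = value.encode()
        f.write(data)
        keydir[key] = (segment_path, offset, len(data))

def get(key):
    if key not in keydir:
        return None            # no entry, no key — by design
    path, offset, size = keydir[key]
    with open(path, "rb") as f:
        f.seek(offset)         # the one seek
        return f.read(size).decode()

put("seg0.log", "sensor:42", "22.3")
print(get("sensor:42"))        # 22.3
print(get("sensor:99"))        # None
```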
Now do the arithmetic. A single keydir entry in a production Bitcask is roughly:
| Field | Size | Notes |
|---|---|---|
| key string | variable | ~50 bytes for a 20-char key in Python |
| segment_id | 4 bytes | uint32, which segment holds the record |
| value_offset | 8 bytes | uint64, byte offset in the segment |
| value_size | 4 bytes | uint32, size of the value |
| timestamp | 8 bytes | uint64, for conflict resolution |
| hash-table overhead | — | ~40 bytes in CPython, ~16 in a C hash |
| **total** | — | ~50 bytes in C, ~120 bytes in pure Python |
Call it 50 bytes per key for a production-quality C or Rust implementation (the number Basho quotes for real Bitcask). In pure Python, multiply by two or three.
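The sizing arithmetic is worth having as a function — this sketch just multiplies key count by the per-entry figures above:

```python
# Back-of-envelope keydir sizing, using this chapter's per-entry
# estimates: ~50 bytes/key for a C/Rust Bitcask, ~130 in CPython.
BYTES_PER_KEY = {"C/Rust": 50, "CPython": 130}

def keydir_gb(n_keys, bytes_per_key):
    """RAM the keydir alone needs, in GB, for n_keys live keys."""
    return n_keys * bytes_per_key / 1e9

for n in (50_000_000, 500_000_000, 5_000_000_000):
    sizes = ", ".join(f"{impl}: {keydir_gb(n, b):,.1f} GB"
                      for impl, b in BYTES_PER_KEY.items())
    print(f"{n:>13,} keys -> {sizes}")
# 50 M keys  -> C/Rust: 2.5 GB,  CPython: 6.5 GB
# 500 M keys -> C/Rust: 25.0 GB, CPython: 65.0 GB
# 5 B keys   -> C/Rust: 250.0 GB, CPython: 650.0 GB
```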
Plot keydir memory against key count and draw the horizontal line at your server's RAM: keydir growth is a straight line through the origin, and wherever it crosses the RAM line is the largest key count that node can ever hold.
Why the keydir can't just "spill to disk": the whole speed argument collapses. A get() on a keydir entry that lives on disk costs a disk seek to find the entry, then another disk seek to read the value. Two seeks per read instead of one is a 2× slowdown on cached workloads and much worse on cold ones. More fundamentally, as soon as the index lives on disk, you need a data structure to find entries in it — a hash table, a sorted file, a tree. And if you have that, you no longer have "Bitcask." You have whichever-engine-that-index-structure-implies: B+ tree, LSM, or a custom hash-on-disk scheme. The design has changed.
The RAM line is real. Cloud pricing in 2026 is about $0.01 per GB-hour for RAM on a general-purpose instance. Sixteen GB costs around $120 per month of server rent; sixty-four GB costs around $500; five hundred GB is a specialty instance at $4000 a month. Scaling Bitcask vertically is possible and gets expensive fast — and even the biggest single-machine RAM (a few TB on the largest cloud SKUs) has a hard ceiling. The number of keys you can hold in a single Bitcask node has a last rung.
Put a number on the IoT startup. 50 million sensors × 50 bytes = 2.5 GB, fits fine. 500 million sensors × 50 bytes = 25 GB — does not fit on the 16 GB box, starts to be an expensive 32 GB or 64 GB instance. 5 billion sensor-owned sub-sensors × 50 bytes = 250 GB — you need a specialty RAM-heavy instance per replica, in every region, forever. The moment your data model says "one key per (sensor, day)" and your fleet grows, you are making hardware choices that scale with the full lifetime key set, not with the working set.
Here is a one-page benchmark that makes the wall touchable on your laptop. It writes a growing number of keys and prints the RSS of the Python process.
```python
# wall_a_keydir.py — watch keydir memory grow linearly with key count
import os, psutil, gc
from bitcask_hints import BitcaskSegmented

db = BitcaskSegmented("growram")
proc = psutil.Process(os.getpid())
baseline_mb = proc.memory_info().rss / 1e6

for target in (100_000, 1_000_000, 5_000_000):
    while len(db._index) < target:
        k = f"sensor:{len(db._index):09d}"
        db.put(k, "22.3")  # tiny value — keydir dominates RSS
    gc.collect()
    rss_mb = proc.memory_info().rss / 1e6 - baseline_mb
    print(f"keys={target:>10,} keydir RSS={rss_mb:>7,.0f} MB "
          f"per-key={rss_mb*1e6/target:>4.0f} bytes")
```
On a CPython 3.12 run you see roughly:
```
keys=   100,000 keydir RSS=     14 MB per-key= 140 bytes
keys= 1,000,000 keydir RSS=    132 MB per-key= 132 bytes
keys= 5,000,000 keydir RSS=    650 MB per-key= 130 bytes
```
Pure-Python overhead is about 130 bytes per keydir entry — a little under 3× the production 50-bytes-per-key figure, entirely the tax of CPython's boxed ints and string headers. The shape is what matters: a perfectly linear RSS-to-keys curve. Double the keys, double the RAM. No caching, no "pages only loaded when touched." Every key is resident, forever, from the moment you write it. This is Wall A.
Wall B — hash indexes have no order
The payments company's fraud dashboard wants WHERE timestamp BETWEEN '10:00:00' AND '10:05:00'. On Bitcask, this is not slow — it is impossible to do well.
Here is why, and the answer sits in the word "hash." The keydir is a hash table. Hash tables achieve O(1) lookup by scattering keys uniformly across buckets. Two keys that are lexicographically adjacent — "txn:10:00:01" and "txn:10:00:02" — almost certainly land in buckets far apart in memory. That scattering is not a bug; it is the whole mechanism that keeps the table collision-free. But it destroys something you did not know you had in a sorted-index world: locality.
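You can watch the scattering happen. This sketch hashes five lexicographically consecutive keys into 1,024 buckets — md5 is used here only so the demo is deterministic (CPython's own string hash is salted per process via PYTHONHASHSEED), but any good hash shows the same effect:

```python
import hashlib

# Adjacent keys, distant buckets: lexicographically consecutive keys
# land in unrelated buckets — the scattering that makes hash lookup
# O(1) is exactly what destroys key order.
def bucket(key, n_buckets=1024):
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

for second in range(5):
    key = f"txn:10:00:{second:02d}"
    print(key, "-> bucket", bucket(key))
```

Run it and the bucket numbers jump around with no relation to the key order — neighbouring timestamps share nothing.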
Why you can't just "sort the keydir": you can sort the entries at some point in time (put them in a Python list and call sorted()), but then every put that arrives with a key that sorts in the middle forces you to shift entries around, turning O(1) writes into O(n) writes. Or you switch the keydir's data structure from a hash table to a sorted tree — and now you have a tree-based index, which is what Build 4 builds, and you no longer have Bitcask. The fundamental trade is: hash gives you O(1) point reads and no order; trees give you O(\log n) point reads and full order. You cannot have both from one structure.
The only ways to implement a range query on Bitcask are:
- Scan every keydir entry, keep those whose keys fall in the range. O(n_\text{keys}) — but at least the scan is in-memory, so it is fast-ish per-entry. For a hundred-million-key keydir, a range query is a second or two of pure CPU loop. Not competitive with a B+ tree doing the same query in milliseconds by walking a sorted tree.
- Maintain a parallel sorted index alongside the hash keydir. Now every write has to update two data structures. Your memory footprint roughly doubles. You have reinvented a tree-plus-hash hybrid, and you are writing Build 4 the hard way.
- Give up and tell the user to store the range key differently — e.g. bucket transactions into minute-sized keys and list the minute. This works sometimes; it does not work when the user wants arbitrary ranges, which is most of the time.
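The first option is easy to make concrete. A plain dict stands in for the keydir here (a sketch, not the book's store); the thing to notice is that the scan touches every entry no matter how narrow the range is:

```python
# Range query on a hash keydir: the only move is to test every key.
# O(n_keys) per query, even when the answer is five entries.
keydir = {f"txn:2026-04-24T10:{m:02d}:{s:02d}": ("seg0", 0, 0)
          for m in range(60) for s in range(60)}      # 3,600 keys

def range_scan(lo, hi):
    # Touches all 3,600 entries to find the handful in range.
    return sorted(k for k in keydir if lo <= k <= hi)

hits = range_scan("txn:2026-04-24T10:05:00", "txn:2026-04-24T10:05:04")
print(len(hits))   # 5
```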
Production Bitcask does not do range queries. The Bitcask paper says so in two sentences: "Bitcask does not support queries based on value. Keys are the only way in." Riak wrapped Bitcask in a KV API (get(key), put(key, value), delete(key)) and told users: if you need ranges, use the leveldb backend. Which is — yes — an LSM.
What the solution needs to look like
Stare at the two walls and a recipe for their shared cure starts to form.
To beat Wall A (keydir > RAM) you have to get the per-key memory footprint down. The lever you have is simple: do not keep an entry in RAM for every key. Keep a sparse index — only every thousandth key, say — and resolve the others by a short binary search within a block. If you only need one index entry per 1,000 keys, your RAM use drops by 1000×. A billion keys no longer needs 50 GB; they need 50 MB. The catch is that a sparse index only works if the keys within a block are sorted, because only sorted keys let you binary-search a block by looking at its first and last keys.
To beat Wall B (no range queries) you have to give the index order. A range [start, end] then becomes: binary-search for start, walk forward until you pass end. Natural, fast, exactly the SELECT * WHERE ... BETWEEN pattern your database user wanted.
The two walls have the same fix. Sort the data on disk. Once segments are sorted by key, you get sparse indexes (Wall A fix) and range queries (Wall B fix) in one move.
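Here is that one move in sketch form — a sorted run of keys, a sparse index holding only every thousandth one, and both walls gone. (The in-block resolution below is a linear scan for brevity; a real engine binary-searches within the block too.)

```python
import bisect

# One sorted run of keys; a sparse index keeps only every BLOCK-th key.
# Point reads binary-search the sparse index then scan one block;
# range reads seek once and walk forward. RAM cost: 1/BLOCK of Bitcask's.
BLOCK = 1000
keys = sorted(f"sensor:{i:09d}" for i in range(100_000))
values = {k: "22.3" for k in keys}   # stand-in for on-disk blocks
sparse = keys[::BLOCK]               # 100 index entries for 100,000 keys

def get(key):
    # Find the block whose first key is <= key, then scan that block.
    i = bisect.bisect_right(sparse, key) - 1
    if i < 0:
        return None
    start = i * BLOCK
    for k in keys[start:start + BLOCK]:   # short in-block scan
        if k == key:
            return values[k]
    return None

def range_scan(lo, hi):
    # Sortedness makes ranges trivial: seek to lo, walk until past hi.
    start = bisect.bisect_left(keys, lo)
    out = []
    for k in keys[start:]:
        if k > hi:
            break
        out.append(k)
    return out

print(get("sensor:000042000"))                                  # 22.3
print(len(range_scan("sensor:000000100", "sensor:000000199")))  # 100
```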
That insight is the pivot of the whole track. It is why Build 3 exists. A log-structured merge tree is what you get when you sort every segment as it goes to disk. You keep every good idea from Bitcask — append-only writes, immutable segments, background compaction — and add one property, sortedness, which dissolves both walls at once.
Comparing Bitcask, LSM, B+ tree
Put the three engines side by side on the four axes that matter for this decision.
| Axis | Bitcask | LSM-tree | B+ tree |
|---|---|---|---|
| Dataset > RAM (values) | ✓ values on disk | ✓ values on disk | ✓ values on disk |
| Dataset > RAM (keys) | ✗ every key must fit in RAM | ✓ sparse index, 1 entry per block | ✓ tree on disk, only hot pages in buffer |
| Range queries (`BETWEEN a AND b`) | ✗ hash scrambles order | ✓ sorted SSTables, merge-iterate across levels | ✓ sorted leaves, follow sibling links |
| Write amplification | ~1.5–3× (one compaction layer) | ~5–15× (level-to-level compaction) | ~2–5× (random page writes + WAL) |
| Read amplification (worst case) | 1 seek (the best) | O(\text{levels}) + Bloom filters | O(\log_B n) page traversals |
Read the table as a menu, not a ranking. Every axis has a winner, and the winner is different for each axis. Bitcask wins on write-amp and read-amp — it is, by this measure, the fastest engine in the comparison for point workloads that fit in RAM. LSM wins on dataset-size versus RAM and on sorted-access at the cost of write-amp. B+ tree wins on balanced read/write/range performance at the cost of random I/O.
The "best" engine is whichever row your workload can afford to lose. If your key set fits in RAM and you only do point queries, Bitcask's losses are invisible. If your key set is huge and you do sorted scans, LSM's wins dominate.
A workload for each wall
Workload 1 — the session store that outgrows Bitcask. A gaming company keeps a session record per logged-in player: session:{player_id} → {last_seen, region, loadout_json}. Initially, 3 million monthly active players. Keydir: 3 M × 50 B = 150 MB. Bitcask is perfect — reads in one disk seek, writes append, whole keydir in L3 cache. Five years later, the game is huge. Ninety million monthly actives, plus a year of inactive accounts retained for comebacks, plus in-game guild keys, plus per-character saves — half a billion keys. Keydir: 500 M × 50 B = 25 GB. The 16 GB instance is now a 64 GB specialty instance per replica, in every region. Ops files a ticket: "can we switch storage engines?" This is Wall A. The team ports to RocksDB (LSM) and RAM use collapses to 200 MB of sparse index, because most keys sit in sorted SSTables whose first-key/last-key summaries are all the memory the engine needs.
Workload 2 — the audit log that needs range queries. A hospital records every patient interaction as audit:{timestamp}:{event_id} → json. Bitcask serves individual get(audit:2026-04-24T14:22:19Z:e_837473) in one seek. Then regulators show up with an audit requirement: list every record from 9:00 AM to 5:00 PM yesterday. There is no way to ask the keydir that question without scanning all 400 million entries. The team ports to Postgres (B+ tree) whose leaf pages are chained by pointer — SELECT ... WHERE timestamp BETWEEN ... becomes "binary-search for 9:00 in the tree, follow leaf-sibling pointers until 17:00," which takes milliseconds regardless of the table size. This is Wall B.
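You can reproduce Workload 2's fix in miniature with `sqlite3` from the standard library, whose indexes are B-tree-backed. The query planner confirms that a `BETWEEN` over an indexed timestamp is an index search, not a table scan (the table and column names below are invented for the demo):

```python
import sqlite3

# Workload 2 in miniature: an indexed timestamp column turns BETWEEN
# into a seek-and-walk over sorted leaf pages.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE audit (ts TEXT, event_id TEXT, body TEXT)")
con.execute("CREATE INDEX audit_ts ON audit (ts)")
con.executemany(
    "INSERT INTO audit VALUES (?, ?, ?)",
    [(f"2026-04-23T{h:02d}:{m:02d}:00Z", f"e_{h}_{m}", "{}")
     for h in range(24) for m in range(60)],   # one row per minute
)

# The plan's detail column names the index, not a full-table scan.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM audit "
    "WHERE ts BETWEEN '2026-04-23T09:00' AND '2026-04-23T17:00'"
).fetchall()
print(plan[0][-1])   # e.g. SEARCH audit USING INDEX audit_ts (ts>? AND ts<?)

n = con.execute(
    "SELECT COUNT(*) FROM audit "
    "WHERE ts BETWEEN '2026-04-23T09:00:00Z' AND '2026-04-23T17:00:00Z'"
).fetchone()[0]
print(n)   # 481 — 9:00 through 17:00 inclusive, one row per minute
```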
Notice the contrast. Workload 1's migration was a resource decision — the design worked, the machine couldn't carry it. Workload 2's migration was a capability decision — the design could not answer the question at any price. Both are the same shape as "Bitcask chose a specific point in the design space that does not include this workload."
When Bitcask is still exactly right
Build 2 is not a cautionary tale. Bitcask is a real tool that has shipped in production for fifteen years. The moment to stop using it is precise; until that moment it is excellent.
Bitcask is the right choice when three conditions hold at once:
- The live key set fits in RAM with a comfortable margin. You should be at less than 25% of the RAM you are willing to pay for — that is the growth headroom you need before the next engine migration becomes urgent.
- You only need point queries — `get(key)`, `put(key, value)`, `delete(key)`. No ranges, no scans by prefix, no secondary indexes on the value.
- You want latency predictability more than anything else. A Bitcask `get()` is one seek, full stop. No level-hopping, no Bloom filter lookups, no page splits, no vacuum pauses. If your p99 matters more than your p50, Bitcask's boring uniformity is a feature.
Riak ran Bitcask at exactly this point for over a decade. Heroku used it to store routing state: millions of keys, each a small piece of metadata per app, point lookup at every HTTP request. Comcast used it for subscriber state. The UK Government Digital Service used it for session storage on gov.uk. In all of these, the key set was bounded by the number of users (millions), the values were small (tens of bytes to a few KB), and the read pattern was purely point-wise — exactly the shape Bitcask was born for.
Even today, Bitcask-style engines show up whenever a team wants a fast local cache that survives restart. Netflix's EVCache backs its Memcached cluster with a Bitcask-like persistent layer so cache warmup after a deploy is seconds, not minutes. Some BPF-based observability tools use a Bitcask flavour for their on-disk event spool. The design is alive and well — just not as the main database.
Common confusions
- **"Can't I just use a bigger machine?"** You can, and sometimes should. The largest AWS / GCP instances in 2026 go to about 24 TB of RAM on specialty SKUs. At 50 bytes per key that is half a trillion keys — more than enough for any single application in existence. But the cost scales linearly with RAM, and you pay it on every replica in every region. At the ~$0.01 per GB-hour rate above, a three-replica deployment spread across three regions at 64 GB per node is roughly a $1,400/month bill, doable; the same three nodes at 2 TB are about $45,000/month; at 24 TB you are past half a million dollars a month. At some point the economics flip — it is cheaper to run an LSM across ten commodity nodes than a Bitcask on one giant one. The inflection point is workload-specific, but it typically lands somewhere between 100 M and 1 B keys.
- **"Can't I keep only hot keys in RAM and cold ones on disk?"** You are reinventing a cache in front of a slower primary store. That is a perfectly valid architecture — and it is not Bitcask any more, because by definition some keys do not have a RAM entry. As soon as you add the "check disk if RAM misses" path, you need an on-disk index to make the miss-path fast, and you are building either an LSM (if the on-disk index is sorted) or a B+ tree (if the on-disk index is a tree). The "hot keys only in RAM" pattern is exactly what caches in front of LSM/B+ tree stores do — it is what Postgres's shared buffers do, what RocksDB's block cache does. It is not what Bitcask does; Bitcask is total-in-RAM by design.
- **"What if I just don't need range queries?"** Then Wall B is not a wall for you, and the only wall you need to worry about is Wall A. That is a real and common situation — most session stores, most feature flags, most secondary caches — and is why Bitcask remains a reasonable choice. The "dataset > RAM" wall is the harder one because it scales with success: your app grows, your key set grows, the wall gets closer. If your key set is naturally bounded (one entry per active user, max millions) and bounded for the life of the product, Bitcask is fine forever.
- **"Does 'keydir in RAM' mean I lose the index on every crash?"** Yes, but it costs almost nothing — the index is reconstructed from hint files at startup, which is fast. What "keydir in RAM" really constrains is the running memory footprint, not the durability story. The keydir is rebuilt on every process start but must fit in RAM while the process is running, and that is the fact Wall A is about.
- **"Why not sort the log as you write it?"** Because writes are coming in whatever order the users generate them; sorting the log online would require rewriting records in the middle of the file, which destroys the append-only invariant that makes writes fast and crash-safe. The LSM answer is more subtle: buffer writes in a sorted in-memory structure (the memtable), and flush it as a sorted file (an SSTable) only when the memtable is full. Writes are still sequential to disk — but the memtable sorts them on the way. That is the trick the next chapter opens with.
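That trick fits in a sketch (illustrative names and a toy buffer size, not Build 3's actual code): writes arrive in arbitrary order, sit in a small in-RAM buffer, and are flushed as one sorted run — a baby SSTable — when the buffer fills.

```python
# The memtable move in miniature: the sort happens in RAM, the disk
# write stays sequential, and the file comes out sorted anyway.
MEMTABLE_LIMIT = 4
memtable = {}   # in-RAM buffer of the newest writes
sstables = []   # each entry stands in for one sorted file on disk

def flush():
    global memtable
    run = [(k, memtable[k]) for k in sorted(memtable)]  # sort in RAM
    sstables.append(run)       # one sequential "file" write
    memtable = {}

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

for key in ["b", "d", "a", "c", "f", "e"]:   # arbitrary arrival order
    put(key, "v")
print([k for k, _ in sstables[0]])   # ['a', 'b', 'c', 'd'] — a sorted run
print(sorted(memtable))              # ['e', 'f'] still buffered in RAM
```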
Going deeper
The main text has named the walls and pointed at the two doors out. This section spends time on the less-asked questions: where Bitcask-style stores still live in the 2026 stack, and what the hybrid designs look like.
Real-world Bitcask deployments — Riak, Basho, and the post-Riak diaspora
The best-documented production Bitcask was Riak, shipped by Basho Technologies from 2009 until Basho's bankruptcy in 2017. Riak nodes routinely handled 50–200 GB of data and 10–200 million keys per node, with Bitcask as the default backend for point-lookup workloads. Notable deployments included:
- Heroku, for router state — every deployed app had session and routing keys in Riak; single-digit-millisecond reads at request time were the whole product.
- The Weather Channel, for user preferences at scale — tens of millions of users, point-read per page load.
- The UK Government Digital Service, for the gov.uk session cache — user sessions, expire-on-disk-cleanup, purely point access.
- Best Buy, for product-catalogue metadata on a heavily-cached path.
After Basho folded, Riak was open-sourced and maintained by the community (now riak.com). Many former Riak users migrated to Cassandra or DynamoDB — both LSM-based — once their datasets outgrew a single machine. Bitcask itself as a library remains in active use for smaller stores; several modern projects (Erlang's rocksdb_store, Elixir's CubDB) carry on the design directly or in spirit.
Hybrid designs — hash index + sorted log
Several production stores split the difference between Bitcask and LSM. Faster (Microsoft Research) keeps an in-memory hash index and a log-structured on-disk component, but the hash entries can spill to disk when the in-memory portion is full. RocksDB in its "hash-table memtable" mode uses a hash structure in RAM for the newest data and falls back to sorted SSTables on disk, giving point-read throughput that approaches Bitcask while still being able to grow past RAM. LMDB is a B+ tree that memory-maps its file, giving nearly zero-copy reads for hot data while the cold pages stay on disk.
The lesson is that "hash-on-RAM vs sort-on-disk" is not a binary. Real engines are usually a hybrid, with a hash or memtable at the top of the memory hierarchy and a sorted disk structure below. Bitcask sits at one extreme (only hash, no sort) and pure B+ trees sit at the other (only sort, no hash). The middle is crowded.
When modern KV stores still use hash-on-RAM
A few 2020s-era systems still ship a Bitcask-style design as their core, because the "total hash in RAM" pattern is very hard to beat for the specific workloads that fit:
- Aerospike's primary index is an in-memory hash from key to record location on flash, identical in spirit to Bitcask's keydir — but with a 64-byte entry that includes the record's storage device, offset, and generation. Aerospike scales a single "Bitcask with a bigger keydir" up to petabytes of SSD-backed storage.
- Scylla keeps an in-memory cache of recently accessed partition keys with LRU eviction, on top of an LSM — the hot path is hash-like, the cold path is sorted.
- Kafka's log-compacted topics are essentially per-partition Bitcasks: the active segment accepts appends, the closed segments compact overwritten keys away, and an in-memory offset index is the authoritative "where is the latest value for this key" structure. The keydir for a Kafka log-compacted topic is literally a `HashMap<Key, Offset>` in the broker.
The common thread: "hash in RAM, values on append-only disk" is a pattern, not a single product. Bitcask was the first place anyone wrote it down clearly. Anywhere you see a constant-latency point-lookup layer in 2026, Bitcask's fingerprints are visible.
Where this leads next
You have a working Bitcask and a clear-eyed understanding of what it can and cannot do. Two walls are left: keydir > RAM, no ranges. The rest of the track is the two engines that answer them.
Build 3 — LSM-trees opens with the single insight that makes both walls dissolve. If the segments on disk are sorted by key, you can replace the fat per-key in-memory index with a tiny sparse-per-block index, and range queries become a merge-walk across sorted files. Build 3 spends ten chapters developing this: the memtable (your in-memory sorted buffer), the SSTable (your sorted file on disk), Bloom filters (the trick that keeps read amplification low), level-based compaction (the organising principle that bounds space amplification), and the manifest file (the global accountant). By the end of Build 3 your code is roughly a teaching-grade LevelDB.
Build 4 — B+ trees opens with a different insight: what if the sorted structure itself lives on disk, and RAM is just a cache for hot pages? The B+ tree has been the default relational-database index since 1972, and for good reason — it absorbs mixed read/write/range workloads better than almost anything else. Build 4 develops the on-disk tree, page splits, crash recovery via ARIES-style WAL, and the buffer pool that makes it all fast. By the end of Build 4 your code is roughly a teaching-grade SQLite.
The path from here to there is a straight line. Next chapter — sorted strings and the merge insight — is where you start Build 3 by seeing, concretely, what changes when a segment is sorted. One property, one huge consequence: the two walls of Bitcask turn into a doorway.
Close the hood on your Bitcask. It is a finished, shipped-quality engine. Now go build the next one.
References
- Justin Sheehy and David Smith, Bitcask: A Log-Structured Hash Table for Fast Key/Value Data (Basho Technologies, 2010) — §5, "Advantages and Disadvantages of Bitcask," names the two walls of this chapter verbatim.
- Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil, The Log-Structured Merge-Tree (LSM-Tree) (Acta Informatica, 1996) — the paper that answers Wall A + Wall B with sorted segments plus sparse indexes.
- Douglas Comer, The Ubiquitous B-Tree (ACM Computing Surveys, 1979) — the other answer to both walls, via a disk-resident sorted tree.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 Storage and Retrieval — the "comparing hash indexes to SSTables" section is a crisp formalisation of this chapter's comparison table. dataintensive.net.
- Badrish Chandramouli et al., FASTER: A Concurrent Key-Value Store with In-Place Updates (SIGMOD 2018) — a hybrid hash-plus-log design that pushes past Wall A while keeping Bitcask's point-read speed.
- Aerospike documentation, Primary Index — how a production 2020s database still uses a Bitcask-shaped RAM index at petabyte scale.