In short

You finished Build 2. Three hundred lines of Python are a real Bitcask — durable, compacted, hint-file-indexed, restarting in milliseconds. Then a product manager walks up with two requirements: "we need to store fifty million sensor readings" and "we need to list all readings between 9:00 and 10:00 this morning." Both requirements kill Bitcask. The first hits Wall A — the keydir must fit in RAM. Fifty million keys at roughly 50 bytes per keydir entry is 2.5 GB just for the index; push to five hundred million keys and you need 25 GB of RAM on every replica, forever, regardless of how cold the data is. The second hits Wall B — hash indexes have no order. A hash scatters keys uniformly across buckets on purpose; asking "what keys lie between 2026-04-24T09:00 and 2026-04-24T10:00" forces you to scan every bucket, which is every key, which is exactly the O(n) wall Build 2 was meant to knock down. These are not bugs. They are the two design pressures Bitcask's architecture cannot dissolve from the inside. Build 3 (LSM-trees) and Build 4 (B+ trees) are two radically different answers to the same two walls. This chapter names the walls with the index-memory arithmetic, draws the RAM line being crossed, shows a concrete workload that hits each wall, maps out the trade-space in a four-axis table, and — crucially — closes by telling you when Bitcask is still exactly the right choice, and why Riak shipped it for a decade.

Your BitcaskSegmented store works. Chapter 10 was the capstone — hint files collapsed restart from forty seconds to forty milliseconds, which means you can deploy this thing behind a load balancer without the ops team sending you angry messages about readiness probes. Build 2 is done. The design you have is the one Justin Sheehy and David Smith published in 2010. You have written a real database.

Then two users walk up.

User one is an IoT startup. They have fifty million temperature sensors across rural Karnataka. Each sensor reports once a minute. They want to keep the latest reading per sensor, indexed by sensor ID. "Can your database hold fifty million keys?" They mean it honestly — not a trick question. It is a question your design cannot answer yes to on the hardware they can afford.

User two is a payments company. They have a timestamp-keyed log of transactions and a real-time fraud dashboard that constantly asks "give me every transaction from the last five minutes." They want to do SELECT * WHERE timestamp BETWEEN '10:00' AND '10:05'. "Can your database do range queries?" This is also an honest question. It is also one your design cannot answer.

Two users, two walls. Both walls are baked into the shape of Bitcask — not bugs, not missing features, structural properties of the design choice "append-only log plus in-memory hash index." To get rid of them you have to change the shape. That is what Builds 3 and 4 do. This chapter is the bridge: name the walls, do the arithmetic, point at the two doors out.

Wall A — the keydir must fit in RAM

Bitcask's entire speed story rests on one fact: every live key has an entry in the in-memory keydir, and that entry names the exact (segment_id, offset, size) of the latest record. No entry, no get(). There is no fallback path that says "if the keydir does not know about this key, scan the log to find it" — scanning the log would be the O(n) wall chapter 5 named, which is exactly what the keydir exists to knock down.

So the keydir is not optional. It is total. Every live key, every replica, always in RAM.

Now do the arithmetic. A single keydir entry in a production Bitcask is roughly:

  key string       variable   — in Python, ~50 bytes for a 20-char key
  segment_id       4 bytes    — uint32, which segment holds the record
  value_offset     8 bytes    — uint64, byte offset in the segment
  value_size       4 bytes    — uint32, size of the value
  timestamp        8 bytes    — uint64, for conflict resolution
  hash-table overhead         — ~40 bytes in CPython, ~16 in a C hash
  ─────────────────────────────
  total                       — ~50 bytes in C, ~120 bytes in pure Python

Call it 50 bytes per key for a production-quality C or Rust implementation (the number Basho quotes for real Bitcask). In pure Python, multiply by two or three.
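The arithmetic is short enough to run. A quick sketch using the 50-bytes-per-key production figure and the 16 GB server from the text (the constants and the function name are illustrative, not part of any build):

```python
# Back-of-envelope keydir sizing at the ~50 B/key production figure.
# (Pure Python is ~2-3x worse, as the benchmark below shows.)
BYTES_PER_KEY = 50
RAM_BYTES = 16 * 10**9          # a 16 GB mid-range server

def keydir_gb(n_keys: int) -> float:
    """RAM the keydir needs, in GB, at 50 bytes per live key."""
    return n_keys * BYTES_PER_KEY / 1e9

for n in (50_000_000, 500_000_000, 5_000_000_000):
    print(f"{n:>13,} keys -> {keydir_gb(n):>6.1f} GB")

# Where the RAM line is crossed: the largest key count a 16 GB box holds.
max_keys = RAM_BYTES // BYTES_PER_KEY
print(f"16 GB wall at {max_keys:,} keys")   # 320,000,000
```

The last line is the wall the chart below draws: 16 GB divided by 50 bytes is 320 million keys, no matter how cold the data is.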

Plot the keydir memory against the number of keys and draw the RAM line your server has, and you get this.

Figure: keydir memory versus key count, crossing the RAM line. Key count runs from 1 million to 10 billion on a log x-axis, keydir memory from 10 MB to 1 TB on a log y-axis. A straight diagonal shows keydir memory at fifty bytes per key rising linearly with key count; a horizontal dashed red line marks 16 GB (typical server RAM) and cuts the diagonal at the 320-million-key mark. Left of the intersection the line is safe green; right of it is red and labelled "Bitcask impossible — keydir exceeds RAM". Drop-lines mark three example points: 50 million keys at 2.5 GB, 320 million keys at the 16 GB wall, and 1 billion keys at 50 GB.
Keydir memory grows linearly with the number of live keys at roughly 50 bytes per entry. On a 16 GB server — a perfectly reasonable mid-range box — you hit the wall at about 320 million keys. The IoT startup's fifty million sensors fit with headroom; their planned expansion to five hundred million does not. The arithmetic is boring and merciless.

Why the keydir can't just "spill to disk": the whole speed argument collapses. A get() on a keydir entry that lives on disk costs a disk seek to find the entry, then another disk seek to read the value. Two seeks per read instead of one is a 2× slowdown on cached workloads and much worse on cold ones. More fundamentally, as soon as the index lives on disk, you need a data structure to find entries in it — a hash table, a sorted file, a tree. And if you have that, you no longer have "Bitcask." You have whichever-engine-that-index-structure-implies: B+ tree, LSM, or a custom hash-on-disk scheme. The design has changed.

The RAM line is real. Cloud pricing in 2026 is about $0.01 per GB-hour for RAM on a general-purpose instance. Sixteen GB costs around $120 per month of server rent; sixty-four GB costs around $500; five hundred GB is a specialty instance at $4000 a month. Scaling Bitcask vertically is possible and gets expensive fast — and even the biggest single-machine RAM (a few TB on the largest cloud SKUs) has a hard ceiling. The number of keys you can hold in a single Bitcask node has a last rung.

Put a number on the IoT startup. 50 million sensors × 50 bytes = 2.5 GB — fits fine. 500 million sensors × 50 bytes = 25 GB — does not fit on the 16 GB box, and starts to demand an expensive 32 GB or 64 GB instance. 5 billion keys × 50 bytes = 250 GB — you need a specialty RAM-heavy instance per replica, in every region, forever. The moment your data model says "one key per (sensor, day)" and your fleet grows, you are making hardware choices that scale with the full lifetime key set, not with the working set.

Here is a one-page benchmark that makes the wall touchable on your laptop. It writes a growing number of keys and prints the RSS of the Python process.

# wall_a_keydir.py — watch keydir memory grow linearly with key count
import os, psutil, gc
from bitcask_hints import BitcaskSegmented

db = BitcaskSegmented("growram")
proc = psutil.Process(os.getpid())
baseline_mb = proc.memory_info().rss / 1e6

for target in (100_000, 1_000_000, 5_000_000):
    while len(db._index) < target:
        k = f"sensor:{len(db._index):09d}"
        db.put(k, "22.3")                  # tiny value — keydir dominates RSS
    gc.collect()
    rss_mb = proc.memory_info().rss / 1e6 - baseline_mb
    print(f"keys={target:>10,}  keydir RSS={rss_mb:>7,.0f} MB  "
          f"per-key={rss_mb*1e6/target:>4.0f} bytes")

On a CPython 3.12 run you see roughly:

keys=   100,000   keydir RSS=     14 MB   per-key= 140 bytes
keys= 1,000,000   keydir RSS=    132 MB   per-key= 132 bytes
keys= 5,000,000   keydir RSS=    650 MB   per-key= 130 bytes

Pure-Python overhead is about 130 bytes per keydir entry — a little under 3× the production 50-bytes-per-key figure, entirely the tax of CPython's boxed ints and string headers. The shape is what matters: a perfectly linear RSS-to-keys curve. Double the keys, double the RAM. No caching, no "pages only loaded when touched." Every key is resident, forever, from the moment you write it. This is Wall A.

Wall B — hash indexes have no order

The payments company's fraud dashboard wants WHERE timestamp BETWEEN '10:00:00' AND '10:05:00'. On Bitcask, this is not slow — it is impossible to do well.

Here is why, and the answer sits in the word "hash." The keydir is a hash table. Hash tables achieve O(1) lookup by scattering keys uniformly across buckets. Two keys that are lexicographically adjacent — "txn:10:00:01" and "txn:10:00:02" — almost certainly land in buckets far apart in memory. That scattering is not a bug; it is the whole mechanism that keeps the table collision-free. But it destroys something you did not know you had in a sorted-index world: locality.

Figure: sorted order versus hash order for a range of keys. The top row, a sorted file on disk (an LSM SSTable or a B+ tree leaf), holds txn:10:00:01 through txn:10:00:05 in five contiguous slots, covered by one contiguous range scan of five reads. The bottom row, the Bitcask keydir's hash table, scatters the same five keys across separated buckets, so retrieving the range must touch every bucket in the table — O(n) in the whole keyspace.
A sorted file holds the range keys contiguously; the range scan is one sequential read of five adjacent entries. The hash keydir scatters the same keys across the table by design; retrieving the range requires visiting every bucket to find which ones fall into the interval. Hashing optimises point queries at the exact cost of making range queries infeasible.

Why you can't just "sort the keydir": you can sort the entries at some point in time (put them in a Python list and call sorted()), but then every put that arrives with a key that sorts in the middle forces you to shift entries around, turning O(1) writes into O(n) writes. Or you switch the keydir's data structure from a hash table to a sorted tree — and now you have a tree-based index, which is what Build 4 builds, and you no longer have Bitcask. The fundamental trade is: hash gives you O(1) point reads and no order; trees give you O(log n) point reads and full order. You cannot have both from one structure.

The only ways to implement a range query on Bitcask are:

  1. Scan every keydir entry, keep those whose keys fall in the range. O(n) in the number of live keys — but at least the scan is in-memory, so it is fast-ish per-entry. For a hundred-million-key keydir, a range query is a second or two of pure CPU loop. Not competitive with a B+ tree doing the same query in milliseconds by walking a sorted tree.
  2. Maintain a parallel sorted index alongside the hash keydir. Now every write has to update two data structures. Your memory footprint roughly doubles. You have reinvented a tree-plus-hash hybrid, and you are writing Build 4 the hard way.
  3. Give up and tell the user to store the range key differently — e.g. bucket transactions into minute-sized keys and list the minute. This works sometimes; it does not work when the user wants arbitrary ranges, which is most of the time.
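A toy comparison makes option 1's cost concrete. This is a minimal sketch — the key format and counts are made up for illustration — but the shape is the real one: the hash side must test every key in the store, while a sorted structure finds the range with two binary searches and a slice.

```python
import bisect

# Toy keyspace: one key per second over a day, held hash-ordered in a dict.
keys = [f"txn:2026-04-24T{h:02d}:{m:02d}:{s:02d}"
        for h in range(24) for m in range(60) for s in range(60)]
hash_index = {k: None for k in keys}          # stand-in for the keydir

lo, hi = "txn:2026-04-24T10:00:00", "txn:2026-04-24T10:05:00"

# Option 1 on Bitcask: scan every keydir entry -- O(n) in the whole keyspace.
via_scan = sorted(k for k in hash_index if lo <= k <= hi)

# The same query on a sorted structure: two binary searches, then a slice.
sorted_keys = sorted(keys)
i = bisect.bisect_left(sorted_keys, lo)
j = bisect.bisect_right(sorted_keys, hi)
via_sorted = sorted_keys[i:j]

assert via_scan == via_sorted
print(len(via_sorted), "keys in range")       # 301 (five minutes, inclusive)
```

Zero-padded timestamps sort lexicographically in time order, which is why the slice works at all — and why the sorted-file engines of Builds 3 and 4 can serve this query without touching the other 86,099 keys.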

Production Bitcask does not do range queries. The Bitcask paper says so in two sentences: "Bitcask does not support queries based on value. Keys are the only way in." Riak wrapped Bitcask in a KV API (get(key), put(key, value), delete(key)) and told users: if you need ranges, use the leveldb backend. Which is — yes — an LSM.

What the solution needs to look like

Stare at the two walls and a recipe for their shared cure starts to form.

To beat Wall A (keydir > RAM) you have to get the per-key memory footprint down. The lever you have is simple: do not keep an entry in RAM for every key. Keep a sparse index — only every thousandth key, say — and resolve the others by a short search within a block. If you only need one index entry per 1,000 keys, your RAM use drops by 1000×. A billion keys no longer need 50 GB; they need 50 MB. The catch is that a sparse index only works if the keys within a block are sorted, because only sorted keys let you binary-search a block by looking at its first and last keys.
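The sparse-index idea fits in a dozen lines. A minimal sketch, assuming sorted on-disk blocks — here a plain sorted Python list stands in for the file, and BLOCK and the key format are illustrative:

```python
import bisect

BLOCK = 1_000                                  # keys per on-disk block

# Pretend this sorted list is a sorted segment on disk.
disk = [f"sensor:{i:09d}" for i in range(1_000_000)]

# Sparse index: the first key of each block. 1,000 entries in RAM
# instead of 1,000,000 -- the 1000x reduction from the text.
sparse = [disk[i] for i in range(0, len(disk), BLOCK)]

def get(key: str) -> bool:
    """Point lookup: pick the block via the sparse index, then search it."""
    b = bisect.bisect_right(sparse, key) - 1   # block whose first key <= key
    block = disk[b * BLOCK:(b + 1) * BLOCK]    # one "disk read" of the block
    j = bisect.bisect_left(block, key)
    return j < len(block) and block[j] == key

print(get("sensor:000123456"), get("sensor:999999999"))  # True False
```

Every lookup now costs one binary search over a tiny in-RAM list plus one block read — and the whole trick depends on the blocks being sorted, which is exactly the property Build 3 adds.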

To beat Wall B (no range queries) you have to give the index order. A range [start, end] then becomes: binary-search for start, walk forward until you pass end. Natural, fast, exactly the SELECT * WHERE ... BETWEEN pattern your database user wanted.

The two walls have the same fix. Sort the data on disk. Once segments are sorted by key, you get sparse indexes (Wall A fix) and range queries (Wall B fix) in one move.

That insight is the pivot of the whole track. It is why Build 3 exists. A log-structured merge tree is what you get when you sort every segment as it goes to disk. You keep every good idea from Bitcask — append-only writes, immutable segments, background compaction — and add one property, sortedness, which dissolves both walls at once.

Figure: from Bitcask to LSM and B+ tree, a triptych. Left, Bitcask (Build 2): a keydir in RAM holding one full ~50 B hash entry per live key, over unsorted, append-only, compacted log segments; Wall A (keydir > RAM) and Wall B (no range queries) both apply. Middle, LSM-tree (Build 3): a memtable plus a sparse in-RAM index of one entry per 1,000 keys, with Bloom filters and summaries, over SSTables sorted by key and merged by leveled compaction; the sparse index fits in RAM and the sorted runs give range scans. Right, B+ tree (Build 4): a buffer pool caching only hot pages, over a sorted on-disk tree of root, branch, and leaf pages with in-place updates under a WAL; the disk-resident tree fits any RAM and the leaf links give range scans.
The two walls, two doors out. Build 3 sorts the segments and shrinks the RAM index to a sparse summary. Build 4 puts the sorted structure entirely on disk and uses RAM only as a page cache. Both strategies trade the hash table's O(1) point reads for O(log n) reads over a sorted structure — and unlock the two properties Bitcask could never deliver.

Comparing Bitcask, LSM, B+ tree

Put the three engines side by side on the four axes that matter for this decision.

  Axis                           Bitcask                          LSM-tree                                  B+ tree
  Dataset > RAM (values)         ✓ values on disk                 ✓ values on disk                          ✓ values on disk
  Dataset > RAM (keys)           ✗ every key must fit in RAM      ✓ sparse index, 1 entry per block         ✓ tree on disk, hot pages in buffer pool
  Range queries (BETWEEN a, b)   ✗ hash scrambles order           ✓ sorted SSTables, merged across levels   ✓ sorted leaves, follow sibling links
  Write amplification            ~1.5–3× (one compaction layer)   ~5–15× (level-to-level compaction)        ~2–5× (random page writes + WAL)
  Read amplification (worst)     1 seek (the best)                O(levels) + Bloom filter checks           O(log_B n) page traversals

Read the table as a menu, not a ranking. Every axis has a winner, and the winner is different for each axis. Bitcask wins on write-amp and read-amp — it is, by this measure, the fastest engine in the comparison for point workloads that fit in RAM. LSM wins on dataset-size versus RAM and on sorted-access at the cost of write-amp. B+ tree wins on balanced read/write/range performance at the cost of random I/O.

The "best" engine is whichever row your workload can afford to lose. If your key set fits in RAM and you only do point queries, Bitcask's losses are invisible. If your key set is huge and you do sorted scans, LSM's wins dominate.

A workload for each wall


Workload 1 — the session store that outgrows Bitcask. A gaming company keeps a session record per logged-in player: session:{player_id} → {last_seen, region, loadout_json}. Initially, 3 million monthly active players. Keydir: 3 M × 50 B = 150 MB. Bitcask is perfect — reads in one disk seek, writes append, whole keydir in L3 cache. Five years later, the game is huge. Ninety million monthly actives, plus a year of inactive accounts retained for comebacks, plus in-game guild keys, plus per-character saves — half a billion keys. Keydir: 500 M × 50 B = 25 GB. The 16 GB instance is now a 64 GB specialty instance per replica, in every region. Ops files a ticket: "can we switch storage engines?" This is Wall A. The team ports to RocksDB (LSM) and RAM use collapses to 200 MB of sparse index, because most keys sit in sorted SSTables whose first-key/last-key summaries are all the memory the engine needs.

Workload 2 — the audit log that needs range queries. A hospital records every patient interaction as audit:{timestamp}:{event_id} → json. Bitcask serves individual get(audit:2026-04-24T14:22:19Z:e_837473) in one seek. Then regulators show up with an audit requirement: list every record from 9:00 AM to 5:00 PM yesterday. There is no way to ask the keydir that question without scanning all 400 million entries. The team ports to Postgres (B+ tree) whose leaf pages are chained by pointer — SELECT ... WHERE timestamp BETWEEN ... becomes "binary-search for 9:00 in the tree, follow leaf-sibling pointers until 17:00," which takes milliseconds regardless of the table size. This is Wall B.

Notice the contrast. Workload 1's migration was a resource decision — the design worked, the machine couldn't carry it. Workload 2's migration was a capability decision — the design could not answer the question at any price. Both are the same shape as "Bitcask chose a specific point in the design space that does not include this workload."

When Bitcask is still exactly right

Build 2 is not a cautionary tale. Bitcask is a real tool that has shipped in production for fifteen years. The moment to stop using it is precise; until that moment it is excellent.

Bitcask is the right choice when three conditions hold at once:

  1. The live key set fits in RAM with a comfortable margin. You should be at less than 25% of the RAM you are willing to pay for — that is the growth headroom you need before the next engine migration becomes urgent.
  2. You only need point queries — get(key), put(key, value), delete(key). No ranges, no scans by prefix, no secondary indexes on the value.
  3. You want latency predictability more than anything else. A Bitcask get() is one seek, full stop. No level-hopping, no Bloom filter lookups, no page splits, no vacuum pauses. If your p99 matters more than your p50, Bitcask's boring uniformity is a feature.
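Conditions 1 and 2 reduce to a quick capacity check. A sketch using the text's 50 B/key estimate and the 25% headroom rule — the function name and thresholds are illustrative, and condition 3 (latency taste) stays a human judgment:

```python
BYTES_PER_KEY = 50           # production keydir entry (the text's estimate)
HEADROOM = 0.25              # condition 1: stay under 25% of paid-for RAM

def bitcask_fits(live_keys: int, ram_gb: float,
                 point_queries_only: bool) -> bool:
    """Conditions 1 and 2 from the list above, as a boolean."""
    keydir_gb = live_keys * BYTES_PER_KEY / 1e9
    return point_queries_only and keydir_gb <= HEADROOM * ram_gb

print(bitcask_fits(3_000_000, 16, True))     # session store, year one: True
print(bitcask_fits(500_000_000, 16, True))   # five years later: False
print(bitcask_fits(3_000_000, 16, False))    # needs range queries: False
```

The three print lines are the two workloads from the previous section, compressed: the same engine is right, then wrong, depending only on key count and query shape.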

Riak ran Bitcask at exactly this point for over a decade. Heroku used it to store routing state: millions of keys, each a small piece of metadata per app, point lookup at every HTTP request. Comcast used it for subscriber state. The UK Government Digital Service used it for session storage on gov.uk. In all of these, the key set was bounded by the number of users (millions), the values were small (tens of bytes to a few KB), and the read pattern was purely point-wise — exactly the shape Bitcask was born for.

Even today, Bitcask-style engines show up whenever a team wants a fast local cache that survives restart. Netflix's EVCache backs its Memcached cluster with a Bitcask-like persistent layer so cache warmup after a deploy is seconds, not minutes. Some BPF-based observability tools use a Bitcask flavour for their on-disk event spool. The design is alive and well — just not as the main database.


Going deeper

The main text has named the walls and pointed at the two doors out. This section spends time on the less-asked questions: where Bitcask-style stores still live in the 2026 stack, and what the hybrid designs look like.

Real-world Bitcask deployments — Riak, Basho, and the post-Riak diaspora

The best-documented production Bitcask was Riak, shipped by Basho Technologies from 2009 until Basho's bankruptcy in 2017. Riak nodes routinely handled 50–200 GB of data and 10–200 million keys per node, with Bitcask as the default backend for point-lookup workloads. The notable deployments are the ones the main text already named: Heroku's routing state, Comcast's subscriber state, and session storage on gov.uk.

After Basho folded, Riak was open-sourced and maintained by the community (now riak.com). Many former Riak users migrated to Cassandra or DynamoDB — both LSM-based — once their datasets outgrew a single machine. Bitcask itself as a library remains in active use for smaller stores; several modern projects (Erlang's rocksdb_store, Elixir's CubDB) carry on the design directly or in spirit.

Hybrid designs — hash index + sorted log

Several production stores split the difference between Bitcask and LSM. FASTER (Microsoft Research) keeps an in-memory hash index and a log-structured on-disk component, but the hash entries can spill to disk when the in-memory portion is full. RocksDB in its "hash-table memtable" mode uses a hash structure in RAM for the newest data and falls back to sorted SSTables on disk, giving point-read throughput that approaches Bitcask while still being able to grow past RAM. LMDB is a B+ tree that memory-maps its file, giving nearly zero-copy reads for hot data while the cold pages stay on disk.

The lesson is that "hash-on-RAM vs sort-on-disk" is not a binary. Real engines are usually a hybrid, with a hash or memtable at the top of the memory hierarchy and a sorted disk structure below. Bitcask sits at one extreme (only hash, no sort) and pure B+ trees sit at the other (only sort, no hash). The middle is crowded.

When modern KV stores still use hash-on-RAM

A few 2020s-era systems still ship a Bitcask-style design as their core, because the "total hash in RAM" pattern is very hard to beat for the specific workloads that fit.

The common thread: "hash in RAM, values on append-only disk" is a pattern, not a single product. Bitcask was the first place anyone wrote it down clearly. Anywhere you see a constant-latency point-lookup layer in 2026, Bitcask's fingerprints are visible.

Where this leads next

You have a working Bitcask and a clear-eyed understanding of what it can and cannot do. Two walls are left: keydir > RAM, no ranges. The rest of the track is the two engines that answer them.

Build 3 — LSM-trees opens with the single insight that makes both walls dissolve. If the segments on disk are sorted by key, you can replace the fat per-key in-memory index with a tiny sparse-per-block index, and range queries become a merge-walk across sorted files. Build 3 spends ten chapters developing this: the memtable (your in-memory sorted buffer), the SSTable (your sorted file on disk), Bloom filters (the trick that keeps read amplification low), level-based compaction (the organising principle that bounds space amplification), and the manifest file (the global accountant). By the end of Build 3 your code is roughly a teaching-grade LevelDB.

Build 4 — B+ trees opens with a different insight: what if the sorted structure itself lives on disk, and RAM is just a cache for hot pages? The B+ tree has been the default relational-database index since 1972, and for good reason — it absorbs mixed read/write/range workloads better than almost anything else. Build 4 develops the on-disk tree, page splits, crash recovery via ARIES-style WAL, and the buffer pool that makes it all fast. By the end of Build 4 your code is roughly a teaching-grade SQLite.

The path from here to there is a straight line. Next chapter — sorted strings and the merge insight — is where you start Build 3 by seeing, concretely, what changes when a segment is sorted. One property, one huge consequence: the two walls of Bitcask turn into a doorway.

Close the hood on your Bitcask. It is a finished, shipped-quality engine. Now go build the next one.

References

  1. Justin Sheehy and David Smith, Bitcask: A Log-Structured Hash Table for Fast Key/Value Data (Basho Technologies, 2010) — §5, "Advantages and Disadvantages of Bitcask," names the two walls of this chapter verbatim.
  2. Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil, The Log-Structured Merge-Tree (LSM-Tree) (Acta Informatica, 1996) — the paper that answers Wall A + Wall B with sorted segments plus sparse indexes.
  3. Douglas Comer, The Ubiquitous B-Tree (ACM Computing Surveys, 1979) — the other answer to both walls, via a disk-resident sorted tree.
  4. Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 Storage and Retrieval — the "comparing hash indexes to SSTables" section is a crisp formalisation of this chapter's comparison table. dataintensive.net.
  5. Badrish Chandramouli et al., FASTER: A Concurrent Key-Value Store with In-Place Updates (SIGMOD 2018) — a hybrid hash-plus-log design that pushes past Wall A while keeping Bitcask's point-read speed.
  6. Aerospike documentation, Primary Index — how a production 2020s database still uses a Bitcask-shaped RAM index at petabyte scale.