In short

Build 2 left you against a wall: Bitcask needs every key in RAM, and no amount of cleverness with hint files fixes that. The escape is one sleight of hand that changes the whole design — write each on-disk file in key-sorted order. A sorted file is called an SSTable (Sorted String Table), and three beautiful properties fall out of that single decision. First, you can merge two sorted files in linear time with O(1) extra memory — the merge step of mergesort, reused as a storage primitive, with "newer wins" picked by file age. Second, you no longer need every key in RAM: a sparse index that samples one key per block of the file plus a binary search gets you to within a page of any key, and a short sequential scan finishes the job. Third, range queries — get everything between key_A and key_M — become a single seek plus a sequential read, which is exactly what every disk is best at. This chapter lays out those three ideas with 50 lines of Python (a merge, a sparse-index builder, a sparse-index probe), two diagrams, and a worked example. The next chapter answers the one obvious question it leaves open: how do you keep an on-disk file sorted while accepting random writes?

Build 2 ended with a wall and a question. The wall — stated crisply in chapter 11 — is that Bitcask needs every key in its in-memory keydir. A billion keys means 40 GB of RAM just for the index, before you store a single value. The question was: what do you give up to get past it?

The answer in this chapter is surprisingly small. You give up the hash that makes Bitcask's keydir work and replace it with sort order on disk. One choice. Everything else follows.

Hold that in your head — sort the files on disk — and watch what falls out.

The observation

Take your Bitcask store. Its log looks like this, in insertion order:

name=dipti
age=15
city=coimbatore
age=16
city=chennai
age=17
city=delhi

Unsorted. Keys scattered. The only way to find age without the RAM index is to read every byte — the wall you hit in chapter 2.

Now imagine the same set of live records, written instead in key-sorted order to disk:

age=17
city=delhi
name=dipti

Three records, sorted. If I hand you this file and ask you to find city, you do not need an index that knows where city is. You can binary-search the file: jump to the middle, read a key, decide whether city is earlier or later, jump again. In a 1 GB file that finishes in roughly 30 seeks — a millisecond of SSD time — regardless of how many records the file holds.

Sort order replaces the hash. And it does more than that.
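The binary search itself can be sketched without any index at all. This is an illustration, not code the chapter builds on (real engines probe fixed-size blocks rather than resyncing on newlines): seek to the middle, skip the partial record to resync on a newline, compare, halve, and finish the last small window with a short forward scan.

```python
# file_bsearch.py — index-free lookup on a sorted "key=value\n" file
def file_bsearch(path, target):
    with open(path, "rb") as f:
        f.seek(0, 2)                       # jump to the end to learn the size
        lo, hi = 0, f.tell()
        while hi - lo > 64:                # narrow the window by probing
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                   # discard the partial record: resync
            pos = f.tell()
            line = f.readline()
            if not line or pos >= hi:      # probe fell off the window: go left
                hi = mid
                continue
            key = line.split(b"=", 1)[0].decode("utf-8")
            if key <= target:
                lo = pos                   # target is at or after this record
            else:
                hi = mid                   # target starts before this probe
        f.seek(lo)                         # window is small: scan it forward
        for line in f:
            k, _, v = line.rstrip(b"\n").partition(b"=")
            key = k.decode("utf-8")
            if key == target:
                return v.decode("utf-8")
            if key > target:
                return None                # sorted: passed it, not present
        return None
```

Each probe lands mid-record, which is why the resync step exists; the invariant that matters is that lo only ever advances to a record whose key is ≤ the target, so the final forward scan is always correct.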

[Figure: a single file, sstable-0001.sst, holding nine records in key-sorted order with their byte offsets — age=17@0, city=delhi@7, email=x@y.io@18, locale=en-IN@31, name=dipti@44, phone=9900011122@55, role=admin@72, state=TN@83, zip=600001@92. Beside the file, a small RAM-side sparse index holds three of the nine keys: age → 0, locale → 31, zip → 92. A note reads "sorted on disk: binary search finds any key; range scans are sequential reads."]
An SSTable on disk. Records are stored in key-sorted order. The RAM-side index is sparse — only one entry per block (here, three entries for nine records). A lookup binary-searches the sparse index, then scans the file forward from the pinned offset. The index is a hundred to a thousand times smaller than Bitcask's keydir, which is the whole point.

The diagram shows the three ideas you will build in this chapter, sitting next to each other:

  1. The sorted file itself — the Sorted String Table, or SSTable.
  2. The sparse index — one entry per block, kept in RAM; a thousand times smaller than a full key→offset table.
  3. The merge — not in the picture yet, but the operation that combines two such files into one without loading either into memory.

The merge is the hidden load-bearing piece. Sort order would not be enough on its own; writes still arrive in random order, and no disk file can be kept sorted under random insertion without a lot of shuffling. The merge step of mergesort, lifted out of its parent algorithm and used as a storage primitive, is what makes the whole design feasible.

The merge step

You saw it once in your first algorithms class. Two sorted sequences can be combined into one sorted sequence by reading the head of each, emitting whichever is smaller, and advancing. Linear time. Constant extra memory. No random access.

It is so simple that people underestimate it. Here it is, for two sorted streams of (key, value) pairs, with a "newer wins" rule for duplicate keys:

# merge.py — merge two sorted KV streams; newer file wins on key tie
def merge(new_stream, old_stream):
    """
    Yield (key, value) pairs in sorted key order from two already-sorted
    input streams. When both streams hold the same key, emit the value
    from new_stream (the "newer" one) and skip the old_stream entry.
    """
    new_it = iter(new_stream)
    old_it = iter(old_stream)
    new = next(new_it, None)
    old = next(old_it, None)
    while new is not None and old is not None:
        if new[0] < old[0]:            # new key is smaller -> emit it
            yield new
            new = next(new_it, None)
        elif new[0] > old[0]:          # old key is smaller -> emit it
            yield old
            old = next(old_it, None)
        else:                          # same key -> newer wins, drop old
            yield new
            new = next(new_it, None)
            old = next(old_it, None)
    # drain whichever side still has records (at most one will)
    while new is not None: yield new; new = next(new_it, None)
    while old is not None: yield old; old = next(old_it, None)

Twenty lines. Read it once and notice the shape of the work.

Why the newer stream's argument goes first: the merge is not just a sort merge — it is a conflict-resolving merge. In a storage engine, when two SSTables both contain the key city, the one written more recently has the latest value; the older one's city is stale. Putting the newer stream first in the argument list, and always yielding new on tie, encodes "newer wins" directly in the call site. You will see exactly this argument order in real LSM compactors.

Three properties to notice, because they are what make this primitive strong enough to carry a storage engine:

  1. It runs in O(m+n) time — one comparison per emitted record.
  2. It uses O(1) extra memory — only the two head records are ever held.
  3. It reads both inputs and writes its output strictly sequentially — no seeks, no random access.

The merge step of mergesort was designed in 1945 by von Neumann for exactly this shape of problem — sorting data larger than memory by reading in streams. Seventy years later, LevelDB, RocksDB, Cassandra, and BigTable all use it unchanged as the core operation of their storage layer. The interface has not moved because the underlying idea is already at the bottom.

[Figure: two input files — sstable-A (newer: age=19, city=delhi, name=dipti) and sstable-B (older: age=17, city=chennai, phone=9900, role=admin) — feed an arrow labelled "merge(), O(m+n) time, O(1) memory, newer wins on tie" into the output sstable-merged (age=19, city=delhi, name=dipti, phone=9900, role=admin; note: "city=chennai dropped; city=delhi kept"). A bar underneath marks one sequential pass over each input and one sequential write of the output.]
Two already-sorted SSTables merged into one. The merge reads each input exactly once, front to back, and produces the output in a single sequential write. On key collisions, the newer table's value wins — here, city=delhi from the newer table supersedes city=chennai from the older one. This is the operation that makes the whole LSM design work.
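The figure's example runs as written. A self-contained sketch — the merge() body restates the listing above, with the drain loops unrolled, so the snippet stands on its own:

```python
# merge_demo.py — the figure's two tables through merge(); newer wins
def merge(new_stream, old_stream):
    new_it, old_it = iter(new_stream), iter(old_stream)
    new, old = next(new_it, None), next(old_it, None)
    while new is not None and old is not None:
        if new[0] < old[0]:
            yield new
            new = next(new_it, None)
        elif new[0] > old[0]:
            yield old
            old = next(old_it, None)
        else:                              # same key: newer wins, drop old
            yield new
            new = next(new_it, None)
            old = next(old_it, None)
    while new is not None:                 # drain whichever side remains
        yield new
        new = next(new_it, None)
    while old is not None:
        yield old
        old = next(old_it, None)

newer = [("age", "19"), ("city", "delhi"), ("name", "dipti")]
older = [("age", "17"), ("city", "chennai"),
         ("phone", "9900011122"), ("role", "admin")]
merged = list(merge(newer, older))
print(merged)
# [('age', '19'), ('city', 'delhi'), ('name', 'dipti'),
#  ('phone', '9900011122'), ('role', 'admin')]
```

Five sorted records out: city=chennai is dropped on the tie, and age=17 is shadowed by age=19, exactly as the figure shows.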

The sparse index

The second property of a sorted file is even more powerful for a laptop-scale engineer: you do not need every key in RAM. A sparse index that samples one key in every N is enough.

Why does that work? Because between any two sampled keys sits a contiguous run of records, also sorted, and small. If you sample every 128th record in a 1 GB file with 10 million records, each "gap" is 128 records, roughly 13 KB of data. Once binary search narrows your target to the right gap, scanning those 13 KB to find the exact key takes tens of microseconds on an SSD. The RAM cost of the index drops from 10 million entries (Bitcask's keydir) to ~78,000 — a 128× reduction with no loss of correctness.
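The same arithmetic as runnable back-of-envelope code. The 40-bytes-per-entry figure follows the keydir estimate at the top of the chapter (a billion keys ≈ 40 GB); it is an estimate, not a measurement:

```python
# back-of-envelope: sparse index vs. full keydir for one 1 GB file
records     = 10_000_000
stride      = 128
entry_bytes = 40                          # rough RAM cost per (key, offset) entry

sparse_entries = -(-records // stride)    # ceil(records / stride)
print(sparse_entries)                     # 78125 entries instead of 10,000,000
print(records // sparse_entries)          # 128x fewer
print(sparse_entries * entry_bytes)       # 3,125,000 bytes: ~3 MB instead of ~400 MB
```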

Here is the builder. It walks a sorted file once, sampling every Nth key and recording its byte offset:

# sparse_index.py — build a sparse index over a sorted KV file
def build_sparse_index(path, stride=128):
    """
    Walk a sorted KV file and record every stride-th key with its offset.
    Returns a list of (key, offset) pairs in key order, suitable for binary search.
    """
    index = []
    with open(path, "rb") as f:
        i = 0
        while True:
            offset = f.tell()              # byte offset of the record about to be read
            line = f.readline()
            if not line:
                break
            if i % stride == 0:            # sample only every stride-th record
                key, _, _ = line.partition(b"=")
                index.append((key.decode("utf-8"), offset))
            i += 1
    return index

Fifteen lines, one pass over the file, one list in memory at the end. The list has ceil(total_records / stride) entries — typically a few thousand to a few tens of thousands for a file with millions of records. Sorted by key by construction, because the file is sorted, so binary search on it is cheap.

Why tell() before readline(): you want the offset where the record starts, not where it ends. A later probe will seek() to that offset and begin reading. If you recorded the post-read position instead, you would be one record too late everywhere.

Probing the sparse index

The probe is almost as short. Binary-search the sparse index for the largest pinned key ≤ target, seek() to the corresponding offset, and scan forward until you either find the target or pass it:

# sstable_get.py — point lookup on a sorted KV file via a sparse index
import bisect

def sstable_get(path, sparse_index, target_key):
    """
    Point lookup on a sorted KV file. sparse_index is a list of (key, offset)
    pairs sampled during build_sparse_index. Returns the value string or None.
    """
    # bisect_right on the keys gives the first sample strictly greater than target;
    # the sample just before it is the largest key <= target.
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, target_key) - 1
    if i < 0:
        return None                        # target is before every pinned key
    _, start_offset = sparse_index[i]
    with open(path, "rb") as f:
        f.seek(start_offset)
        while True:
            line = f.readline()
            if not line:
                return None                # end of file, key not found
            k, _, v = line.rstrip(b"\n").partition(b"=")
            key = k.decode("utf-8")
            if key == target_key:
                return v.decode("utf-8")
            if key > target_key:
                return None                # we passed it — not present

Fifteen lines again, counting comments and blanks. Two costs, one of them tunable via the stride: the binary search over the sparse index (nanoseconds; the sparse index is tiny), and the forward scan over at most stride records (microseconds; one page of disk on average). Neither grows meaningfully with the total size of the file.

The quiet cleverness is the if key > target_key: return None line. Because the file is sorted, the moment you see a key bigger than the one you are looking for, you know the target is not there. This is the sort-order invariant paying rent on the read path — you can give up early without having to scan the rest of the file. With an unsorted file there is no such shortcut; you always scan to the end to be sure.

A worked session

Building a tiny SSTable, its sparse index, and probing it

Write a sorted file by hand, build an index with stride 3, and probe a few keys. This is the full lifecycle in miniature — the shape of what LevelDB does millions of times a day.

# demo.py
records = [
    ("age",    "19"),
    ("city",   "delhi"),
    ("email",  "dipti@padho.wiki"),
    ("locale", "en-IN"),
    ("name",   "dipti"),
    ("phone",  "9900011122"),
    ("role",   "admin"),
    ("state",  "TN"),
    ("zip",    "600001"),
]
# assume already sorted by key (LSM guarantees this; see next chapter)
with open("demo.sst", "wb") as f:
    for k, v in records:
        f.write(f"{k}={v}\n".encode("utf-8"))

idx = build_sparse_index("demo.sst", stride=3)
print(idx)
# [('age', 0), ('locale', 31), ('role', 72)]  <-- one pin per 3 records

print(sstable_get("demo.sst", idx, "name"))   # -> 'dipti'
print(sstable_get("demo.sst", idx, "role"))   # -> 'admin'
print(sstable_get("demo.sst", idx, "mobile")) # -> None  (would sit between locale and name)

Trace the probe for name:

bisect_right(keys=['age','locale','role'], 'name')  -> 2
i = 2 - 1 = 1           sparse_index[1] = ('locale', 31)
seek(31), readline()    'locale=en-IN\n'     'locale' < 'name', continue
          readline()    'name=dipti\n'       'name' == 'name', return 'dipti'

Two records read from disk. The total file has nine; the index has three; the scan touches two. On a file with ten million records, the same probe would read at most three records from the gap, because the stride caps the gap size. Binary search over the index plus a bounded scan — constant read cost, regardless of file size, with no full key→offset table in RAM.

Trace the probe for the missing key mobile:

bisect_right(['age','locale','role'], 'mobile') -> 2
i = 1                    sparse_index[1] = ('locale', 31)
seek(31), readline()     'locale=en-IN\n'    'locale' < 'mobile', continue
          readline()     'name=dipti\n'       'name' > 'mobile', return None

Two reads to prove a key is absent. Bitcask can answer "absent" in zero reads (the keydir knows). That is the one small regression; in exchange you removed the whole-dataset RAM requirement.

Range queries fall out for free

Third property, and the one that will turn out to matter enormously in Build 3. Because the file is sorted on disk, a query like "give me every key between age and name" is just:

  1. Binary-search the sparse index for the first gap that could contain age.
  2. seek() there.
  3. Read forward, emitting every record, until you see a key > name or hit end-of-file.

One seek plus a sequential read. No random access. No index walk. The exact access pattern spinning disks and SSDs are happiest with.
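Those three steps as a sketch. The name sstable_range and the inclusive-bounds choice are mine, but it reuses the same "key=value\n" file format and sparse-index shape as sstable_get above:

```python
# sstable_range.py — range scan on a sorted KV file via the sparse index
import bisect

def sstable_range(path, sparse_index, lo, hi):
    """Yield (key, value) for every record with lo <= key <= hi."""
    keys = [k for k, _ in sparse_index]
    i = max(bisect.bisect_right(keys, lo) - 1, 0)  # step 1: gap that could hold lo
    _, start = sparse_index[i]
    with open(path, "rb") as f:
        f.seek(start)                              # step 2: one seek
        for line in f:                             # step 3: sequential read
            k, _, v = line.rstrip(b"\n").partition(b"=")
            key = k.decode("utf-8")
            if key > hi:
                return                             # sorted: nothing later can match
            if key >= lo:
                yield key, v.decode("utf-8")
```

The early return on key > hi is the same sort-order shortcut the point lookup used; the only new idea is that everything between the two bounds is emitted instead of skipped.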

Bitcask cannot do this at all. Keys are hashed (effectively); the keydir is a hash table with no order; there is no "next key" operation. Range queries in Bitcask require scanning every segment start-to-end. The moment you build even a naive analytics panel — "show all users created in the last week" — Bitcask falls down and an SSTable-based store sails through.

Range queries are not a bonus; they are the entire reason BigTable keeps its tables in key-sorted order. Half of what a database is for is finding ranges: a WHERE clause, an index scan, a time-window aggregation. Bitcask was consciously designed to give up range queries in exchange for a simpler hot path. LSM gives them back, and this chapter shows the mechanism.

What you are about to build

Step back and look at what you have so far. You can:

  1. Merge two sorted files into one in a single linear, constant-memory pass, with "newer wins" on duplicate keys.
  2. Build a sparse index over a sorted file in one pass, one entry per stride records.
  3. Answer a point lookup with a binary search over that tiny index plus a bounded forward scan.

Fifty lines total. A tiny, complete storage primitive. The missing piece — the one big question this chapter does not answer — is how you get a sorted file in the first place when your writes arrive in random order.

That is what the rest of Build 3 builds. Sketch:

  1. Writes land in an in-memory, sorted memtable — random insertion is cheap in RAM.
  2. When the memtable fills, flush it to disk as a new SSTable in one sequential write.
  3. Reads check the memtable first, then the SSTables from newest to oldest.
  4. A background process merges SSTables together, discarding stale values as it goes.

That structure — memtable + stack of SSTables + background merge — is the Log-Structured Merge-Tree, the LSM-tree. LevelDB, RocksDB, Cassandra, HBase, ScyllaDB, InfluxDB, ClickHouse's parts engine: all of them are variations of it. The merge step you wrote in 20 lines is the load-bearing operation of all of them.

Common confusions

Going deeper

You have the three pieces. This section fills in the "why" questions — why sorted beats unsorted for merging, what sorting costs you, what changes when keys are large.

Why sorted is the right invariant for merging

Merging is not only possible on sorted data; it is only cheap on sorted data. If the inputs are unsorted, merging two n-record files is either quadratic or requires a full external sort as a pre-step — each merge becomes an O(n log n) operation with temporary files the size of the inputs.

With the sort invariant, merges are linear and streaming. That linearity compounds: in a real LSM-tree a given key may pass through five levels of merges on its way from the memtable to the coldest SSTable. If each merge were O(n log n) the write amplification would be astronomical. Because each merge is O(n), the total work for a key is bounded by a small constant times its final resting-place level.

There is a deeper point here. Invariants on data enable cheap invariants on operations. Sorted data enables linear merges. Immutable files enable lock-free reads. Append-only files enable trivially crash-safe writes. Constrain the data shape; the code shape follows.

The cost of maintaining order

Order is not free. In Bitcask's unsorted log the write path is one memcpy and one write(). In an LSM-tree the same write goes through a memtable insert (O(log k)), an eventual flush to a sorted SSTable, and multiple rounds of compaction as the key moves through levels.

The total bytes written to disk per user-visible byte is called write amplification. A well-tuned LSM-tree has write amplification of 10–30×; Bitcask's is roughly 1×. You trade writes for the ability to store more data than RAM — almost always a worthwhile trade, but worth pricing explicitly.

The punchline: sort order is what makes merges cheap, but merges are what maintaining sort order costs you. There is no free lunch; there is a very useful lunch that is explicitly priced.

What happens when keys are long

Our examples have 3-to-6-character keys. Real-world keys are often longer — a UUID is 36 bytes, a composite row key in a time-series store can run to 500 bytes or more. Long keys do not break the algorithm; they change the constants.

The underlying shape — sorted on disk, sparse in RAM, merged in bulk — is the same whether keys are 3 bytes or 3 kilobytes.

Why LSM and not "sorted Bitcask"?

A plausible design: keep Bitcask's log, but on every put() also insert the record into a sorted data structure on disk. That is exactly what a B-tree is, and it is expensive — every insert does at least one random disk write to update an interior node. You lose the cheap sequential write of the log and rebuild a traditional RDBMS storage engine.

LSM's cleverness is that it batches the sort. Writes land in an in-memory memtable, sorted in RAM for free. Only when the memtable is full does a big sorted file hit disk, in one sequential write. Between flushes, the disk never sees a random write. The sort invariant is maintained lazily by piggybacking on flushes and compactions, not eagerly on every write.
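A minimal sketch of that batching. A plain dict sorted only at flush time stands in for the real thing — an assumption for brevity, since actual memtables (LevelDB's skip list, for instance) stay ordered incrementally:

```python
# memtable_sketch.py — absorb random writes in RAM, flush sorted
class Memtable:
    def __init__(self, limit=1024):
        self.data = {}
        self.limit = limit

    def put(self, key, value):
        self.data[key] = value                 # random-order write, RAM only
        return len(self.data) >= self.limit    # True -> caller should flush

    def flush(self, path):
        # one sequential write of a fully key-sorted file: a new SSTable
        with open(path, "wb") as f:
            for k in sorted(self.data):
                f.write(f"{k}={self.data[k]}\n".encode("utf-8"))
        self.data.clear()
```

Between flushes the disk sees nothing; at flush time it sees one sequential write. The read-path consequence — a get() must now check the memtable and every SSTable — is exactly the cost the optimisations below exist to contain.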

This laziness — the willingness to accept a read having to check multiple places — is what makes LSM cheaper to write than a B-tree, and more expensive to read. Every LSM optimisation after this chapter (Bloom filters, compaction, per-SSTable indexes) exists to buy back read speed without losing the cheap writes.

Where this leads next

The pieces you have are enough to think about the design but not yet enough to run it. The next several chapters build the missing machinery:

  1. A memtable that absorbs random writes in sorted order in RAM, and the flush that turns it into an SSTable.
  2. Compaction — when and how to run the merge in the background as SSTables pile up.
  3. Bloom filters and per-SSTable indexes, to keep reads fast as the number of files grows.
  4. Crash recovery, so nothing sitting in the memtable is lost when the process dies.

By the end of Build 3 you will have something like a toy LevelDB: a few hundred lines of Python that can hold more data than RAM, answer range queries in milliseconds, and recover cleanly from crashes. All of it rests on the twenty-line merge function, the fifteen-line index builder, and the fifteen-line probe that live in this chapter. Everything else is scaffolding.

References

  1. Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil, The Log-Structured Merge-Tree (LSM-Tree) (Acta Informatica, 1996) — the paper that named and formalised the design. Springer.
  2. Fay Chang et al., Bigtable: A Distributed Storage System for Structured Data (OSDI 2006) — introduces the SSTable name and layout. §4 describes the sparse index and the merge-based compaction. research.google.
  3. Sanjay Ghemawat and Jeff Dean, LevelDB Implementation Notes — walks through SSTable layout, the block-level sparse index, and compaction. github.com/google/leveldb/blob/main/doc/impl.md.
  4. Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 — the clearest textbook presentation of SSTables, memtables, and the LSM pattern. dataintensive.net.
  5. John von Neumann, First Draft of a Report on the EDVAC (1945) — contains the original description of merge-sort, including the linear-time merge step that this chapter lifts into a storage primitive. archive.org.
  6. Siying Dong et al., Optimizing Space Amplification in RocksDB (CIDR 2017) — empirical study of compaction and write/space amplification, with production numbers from Facebook's RocksDB deployment. cidrdb.org.