In short
Build 2 left you against a wall: Bitcask needs every key in RAM, and no amount of cleverness with hint files fixes that. The escape is one sleight of hand that changes the whole design — write each on-disk file in key-sorted order. A sorted file is called an SSTable (Sorted String Table), and three beautiful properties fall out of that single decision. First, you can merge two sorted files in linear time with O(1) extra memory — the merge step of mergesort, reused as a storage primitive, with "newer wins" picked by file age. Second, you no longer need every key in RAM: a sparse index that samples one key per block of the file plus a binary search gets you to within a page of any key, and a short sequential scan finishes the job. Third, range queries — get everything between key_A and key_M — become a single seek plus a sequential read, which is exactly what every disk is best at. This chapter lays out those three ideas with 50 lines of Python (a merge, a sparse-index builder, a sparse-index probe), two diagrams, and a worked example. The next chapter answers the one obvious question it leaves open: how do you keep an on-disk file sorted while accepting random writes?
Build 2 ended with a wall and a question. The wall — stated crisply in chapter 11 — is that Bitcask needs every key in its in-memory keydir. A billion keys means 40 GB of RAM just for the index, before you store a single value. The question was: what do you give up to get past it?
The answer in this chapter is surprisingly small. You give up the hash that makes Bitcask's keydir work and replace it with sort order on disk. One choice. Everything else follows.
Hold that in your head — sort the files on disk — and watch what falls out.
The observation
Take your Bitcask store. Its log looks like this, in insertion order:
name=dipti
age=15
city=coimbatore
age=16
city=chennai
age=17
city=delhi
Unsorted. Keys scattered. The only way to find age without the RAM index is to read every byte — the wall you hit in chapter 11.
Now imagine the same set of live records, written instead in key-sorted order to disk:
age=17
city=delhi
name=dipti
Three records, sorted. If I hand you this file and ask you to find city, you do not need an index that knows where city is. You can binary-search the file: jump to the middle, read a key, decide whether city is earlier or later, jump again. In a 1 GB file that finishes in roughly 30 seeks — a millisecond of SSD time — regardless of how many records the file holds.
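That file-level binary search can be sketched directly, before any index enters the picture. This is illustrative only, not part of the chapter's 50 lines; the name file_binary_search and the seek-to-mid-minus-one realignment trick are my own sketch of one common way to land on a line boundary:

```python
# file_bsearch.py — index-free binary search over a sorted 'key=value\n'
# file. An illustrative sketch, not the chapter's code.
import os

def file_binary_search(path, target_key):
    """Return the value for target_key, or None if absent. Probes byte
    offsets, realigning each probe to the next line boundary."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            if mid == 0:
                f.seek(0)              # start of file is a line boundary
            else:
                f.seek(mid - 1)
                f.readline()           # realign to the first line start >= mid
            pos = f.tell()
            line = f.readline()
            if not line:               # no record starts at or after mid
                hi = mid
                continue
            key, _, value = line.rstrip(b"\n").partition(b"=")
            k = key.decode("utf-8")
            if k == target_key:
                return value.decode("utf-8")
            if k < target_key:
                lo = pos + len(line)   # target, if present, starts later
            else:
                hi = mid               # target, if present, starts before mid
    return None
```

Each probe costs one seek, and the range halves every iteration, which is where the "roughly 30 seeks for a 1 GB file" figure comes from.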
Sort order replaces the hash. And it does more than that.
The diagram shows the three ideas you will build in this chapter, sitting next to each other:
- The sorted file itself — the Sorted String Table, or SSTable.
- The sparse index — one entry per block, kept in RAM; at the default stride of 128, more than a hundred times smaller than a full key→offset table.
- The merge — not in the picture yet, but the operation that combines two such files into one without loading either into memory.
The merge is the hidden load-bearing piece. Sort order would not be enough on its own; writes still arrive in random order, and no disk file can be kept sorted under random insertion without a lot of shuffling. The merge step of mergesort, lifted out of its parent algorithm and used as a storage primitive, is what makes the whole design feasible.
The merge step
You saw it once in your first algorithms class. Two sorted sequences can be combined into one sorted sequence by reading the head of each, emitting whichever is smaller, and advancing. Linear time. Constant extra memory. No random access.
It is so simple that people underestimate it. Here it is, for two sorted streams of (key, value) pairs, with a "newer wins" rule for duplicate keys:
# merge.py — merge two sorted KV streams; newer file wins on key tie
def merge(new_stream, old_stream):
    """
    Yield (key, value) pairs in sorted key order from two already-sorted
    input streams. When both streams hold the same key, emit the value
    from new_stream (the "newer" one) and skip the old_stream entry.
    """
    new_it = iter(new_stream)
    old_it = iter(old_stream)
    new = next(new_it, None)
    old = next(old_it, None)
    while new is not None and old is not None:
        if new[0] < old[0]:       # new key is smaller -> emit it
            yield new
            new = next(new_it, None)
        elif new[0] > old[0]:     # old key is smaller -> emit it
            yield old
            old = next(old_it, None)
        else:                     # same key -> newer wins, drop old
            yield new
            new = next(new_it, None)
            old = next(old_it, None)
    # drain whichever side still has records (at most one will)
    while new is not None: yield new; new = next(new_it, None)
    while old is not None: yield old; old = next(old_it, None)
Twenty lines. Read it once and notice the shape of the work.
Why the newer stream's argument goes first: the merge is not just a sort merge — it is a conflict-resolving merge. In a storage engine, when two SSTables both contain the key city, the one written more recently has the latest value; the older one's city is stale. Putting the newer stream first in the argument list, and always yielding new on tie, encodes "newer wins" directly in the call site. You will see exactly this argument order in real LSM compactors.
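To see the tie rule in action, here is a minimal standalone demo; the chapter's merge() is restated inside the snippet so it runs on its own:

```python
# merge_demo.py — 'newer wins' on the duplicate keys 'age' and 'city'
def merge(new_stream, old_stream):
    # the chapter's merge(), restated so this demo is self-contained
    new_it, old_it = iter(new_stream), iter(old_stream)
    new, old = next(new_it, None), next(old_it, None)
    while new is not None and old is not None:
        if new[0] < old[0]:
            yield new; new = next(new_it, None)
        elif new[0] > old[0]:
            yield old; old = next(old_it, None)
        else:                     # tie: newer wins, old entry is dropped
            yield new
            new, old = next(new_it, None), next(old_it, None)
    while new is not None: yield new; new = next(new_it, None)
    while old is not None: yield old; old = next(old_it, None)

newer = [("age", "17"), ("city", "delhi")]
older = [("age", "16"), ("city", "chennai"), ("name", "dipti")]
print(list(merge(newer, older)))
# -> [('age', '17'), ('city', 'delhi'), ('name', 'dipti')]
```

Both duplicate keys resolve to the newer table's values; name survives from the older table because nothing newer shadows it.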
Three properties to notice, because they are what make this primitive strong enough to carry a storage engine:
- Linear time. The while loop advances at least one cursor per iteration (both on a tie), so for input sizes m and n the work is O(m + n). Merging two 1 GB SSTables takes two sequential reads and one sequential write of 2 GB total — about a second on an NVMe SSD.
- Constant extra memory. At any moment only the two current heads (new, old) sit in RAM. You can merge a 100 GB file with a 200 GB file on a laptop with 8 GB of memory. Random-access sorting algorithms cannot do that; merge can, because it only ever looks at the front of each input.
- Sequential I/O. Both reads and the write are front-to-back. This is the access pattern every storage medium — spinning disk, SSD, memory — is fastest at. A merge is literally the friendliest workload you can hand a disk.
The merge step of mergesort was designed in 1945 by von Neumann for exactly this shape of problem — sorting data larger than memory by reading in streams. Seventy years later, LevelDB, RocksDB, Cassandra, and BigTable all use it unchanged as the core operation of their storage layer. The interface has not moved because the underlying idea is already at the bottom.
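Production compactors generalise the same primitive from two inputs to many. A k-way variant can be sketched with the standard library's heapq.merge; merge_many and the age-tagging trick are my illustration, not code from this chapter:

```python
# merge_many.py — k-way 'newer wins' merge, a sketch using heapq.merge.
# streams must each be key-sorted and the list ordered newest-first.
import heapq
from itertools import groupby

def _tag(stream, age):
    # attach the stream's age so key ties sort newest (lowest age) first
    for key, value in stream:
        yield (key, age, value)

def merge_many(streams):
    """Merge k key-sorted (key, value) streams, newest first on ties."""
    merged = heapq.merge(*[_tag(s, age) for age, s in enumerate(streams)])
    for key, group in groupby(merged, key=lambda t: t[0]):
        _, _, value = next(group)   # first record in each group is the newest
        yield key, value
```

Still streaming, still O(1) heads per input in memory; the heap adds a log k factor per record, which is why real engines bound how many files one compaction touches.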
city=delhi from the newer table supersedes city=chennai from the older one. This is the operation that makes the whole LSM design work.
The sparse index
The second property of a sorted file is even more powerful for a laptop-scale engineer: you do not need every key in RAM. A sparse index that samples one key in every N is enough.
Why does that work? Because between any two sampled keys sits a contiguous run of records, also sorted, and small. If you sample every 128th record in a 1 GB file with 10 million records, each "gap" is 128 records, roughly 13 KB of data. Once binary search narrows your target to the right gap, scanning 13 KB to find the exact key takes about 40 microseconds on an SSD. The RAM cost of the index drops from 10 million entries (Bitcask's keydir) to ~78,000 — a 128× reduction with no loss of correctness.
Here is the builder. It walks a sorted file once, sampling every Nth key and recording its byte offset:
# sparse_index.py — build a sparse index over a sorted KV file
def build_sparse_index(path, stride=128):
    """
    Walk a sorted KV file and record every stride-th key with its offset.
    Returns a list of (key, offset) pairs in key order, suitable for binary search.
    """
    index = []
    with open(path, "rb") as f:
        i = 0
        while True:
            offset = f.tell()        # byte offset of the record about to be read
            line = f.readline()
            if not line:
                break
            if i % stride == 0:      # sample only every stride-th record
                key, _, _ = line.partition(b"=")
                index.append((key.decode("utf-8"), offset))
            i += 1
    return index
Fifteen lines, one pass over the file, one list in memory at the end. The list has ceil(total_records / stride) entries — typically a few thousand to a few tens of thousands for a file with millions of records. It is sorted by key by construction, because the file is sorted, so binary search on it is cheap.
Why tell() before readline(): you want the offset where the record starts, not where it ends. A later probe will seek() to that offset and begin reading. If you recorded the post-read position instead, you would be one record too late everywhere.
Probing the sparse index
The probe is almost as short. Binary-search the sparse index for the largest pinned key <= target, seek() to the corresponding offset, and scan forward until you either find the target or pass it:
# sstable_get.py — point lookup on a sorted KV file via a sparse index
import bisect

def sstable_get(path, sparse_index, target_key):
    """
    Point lookup on a sorted KV file. sparse_index is a list of (key, offset)
    pairs sampled during build_sparse_index. Returns the value string or None.
    """
    # bisect_right on the keys gives the first sample strictly greater than target;
    # the sample just before it is the largest key <= target.
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, target_key) - 1
    if i < 0:
        return None                  # target is before every pinned key
    _, start_offset = sparse_index[i]
    with open(path, "rb") as f:
        f.seek(start_offset)
        while True:
            line = f.readline()
            if not line:
                return None          # end of file, key not found
            k, _, v = line.rstrip(b"\n").partition(b"=")
            key = k.decode("utf-8")
            if key == target_key:
                return v.decode("utf-8")
            if key > target_key:
                return None          # we passed it — not present
Fifteen lines again, counting comments and blanks. Two costs per lookup: the binary search over the sparse index (nanoseconds; the sparse index is tiny), and the forward scan over at most stride records (microseconds; about one page of disk on average). Only the second is tunable, via stride; neither grows with the total size of the file.
The quiet cleverness is the if key > target_key: return None line. Because the file is sorted, the moment you see a key bigger than the one you are looking for, you know the target is not there. This is the sort-order invariant paying rent on the read path — you can give up early without having to scan the rest of the file. With an unsorted file there is no such shortcut; you always scan to the end to be sure.
A worked session
Building a tiny SSTable, its sparse index, and probing it
Write a sorted file by hand, build an index with stride 3, and probe a few keys. This is the full lifecycle in miniature — the shape of what LevelDB does millions of times a day.
# demo.py
records = [
    ("age", "19"),
    ("city", "delhi"),
    ("email", "dipti@padho.wiki"),
    ("locale", "en-IN"),
    ("name", "dipti"),
    ("phone", "9900011122"),
    ("role", "admin"),
    ("state", "TN"),
    ("zip", "600001"),
]
# assume already sorted by key (LSM guarantees this; see next chapter)
with open("demo.sst", "wb") as f:
    for k, v in records:
        f.write(f"{k}={v}\n".encode("utf-8"))

idx = build_sparse_index("demo.sst", stride=3)
print(idx)
# [('age', 0), ('locale', 41), ('role', 82)] <-- one pin per 3 records

print(sstable_get("demo.sst", idx, "name"))    # -> 'dipti'
print(sstable_get("demo.sst", idx, "role"))    # -> 'admin'
print(sstable_get("demo.sst", idx, "mobile"))  # -> None (would sit between locale and name)
Trace the probe for name:
bisect_right(keys=['age','locale','role'], 'name') -> 2
i = 2 - 1 = 1          sparse_index[1] = ('locale', 41)
seek(41), readline()   'locale=en-IN\n'   'locale' < 'name', continue
readline()             'name=dipti\n'     'name' == 'name', return 'dipti'
Two records read from disk. The total file has nine; the index has three; the scan touches two. On a file with ten million records, the same probe would read at most three records from the gap, because the stride caps the gap size. Binary search over the index plus a bounded scan — constant read cost, regardless of file size, with no full key→offset table in RAM.
Trace the probe for the missing key mobile:
bisect_right(['age','locale','role'], 'mobile') -> 2
i = 1                  sparse_index[1] = ('locale', 41)
seek(41), readline()   'locale=en-IN\n'   'locale' < 'mobile', continue
readline()             'name=dipti\n'     'name' > 'mobile', return None
Two reads to prove a key is absent. Bitcask can answer "absent" in zero reads (the keydir knows). That is the one small regression; in exchange you removed the whole-dataset RAM requirement.
Range queries fall out for free
Third property, and the one that will turn out to matter enormously in Build 3. Because the file is sorted on disk, a query like "give me every key between age and name" is just:
- Binary-search the sparse index for the first gap that could contain age, and seek() there.
- Read forward, emitting every record, until you see a key > name or hit end-of-file.
One seek plus a sequential read. No random access. No index walk. The exact access pattern spinning disks and SSDs are happiest with.
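Those two steps can be sketched against the same file format and sparse index used above; sstable_range is my name for it, not part of the chapter's 50 lines:

```python
# sstable_range.py — range scan via a sparse index; the name sstable_range
# and the details are mine, the file and index formats are this chapter's.
import bisect

def sstable_range(path, sparse_index, lo_key, hi_key):
    """Yield (key, value) pairs with lo_key <= key <= hi_key, in key order."""
    if not sparse_index:
        return
    keys = [k for k, _ in sparse_index]
    # largest pinned key <= lo_key marks the gap where the range can start
    i = max(bisect.bisect_right(keys, lo_key) - 1, 0)
    _, start_offset = sparse_index[i]
    with open(path, "rb") as f:
        f.seek(start_offset)           # the one seek
        while True:
            line = f.readline()        # then a purely sequential read
            if not line:
                return
            k, _, v = line.rstrip(b"\n").partition(b"=")
            key = k.decode("utf-8")
            if key > hi_key:
                return                 # sorted order: nothing later can match
            if key >= lo_key:
                yield key, v.decode("utf-8")
```

Note how little differs from the point lookup: the early-exit comparison moves from the target key to the upper bound, and the middle becomes a yield instead of a return.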
Bitcask cannot do this at all. Keys are hashed (effectively); the keydir is a hash table with no order; there is no "next key" operation. Range queries in Bitcask require scanning every segment start-to-end. The moment you build even a naive analytics panel — "show all users created in the last week" — Bitcask falls down and an SSTable-based store sails through.
Range queries are not a bonus; they are the reason BigTable keeps its tables in sorted order. Half of what a database is for is finding ranges: a WHERE clause, an index scan, a time-window aggregation. Bitcask was consciously designed to give up range queries in exchange for a simpler hot path. LSM gives them back, and this chapter shows the mechanism.
What you are about to build
Step back and look at what you have so far. You can:
- Merge two sorted KV streams into one, in linear time, with "newer wins" conflict handling (20 lines).
- Build a sparse index over a sorted file in one pass (15 lines).
- Probe a sorted file with a sparse index in constant time, no full RAM index (15 lines).
Fifty lines total. A tiny, complete storage primitive. The missing piece — the one big question this chapter does not answer — is how you get a sorted file in the first place when your writes arrive in random order.
That is what the rest of Build 3 builds. Sketch:
- Memtable (next chapter): an in-memory, sorted structure (a skip list) that absorbs random writes and keeps them sorted as they arrive. Cheap enough that every write updates it.
- Flush: when the memtable gets big, serialise it to an SSTable on disk in one sorted pass. Now you have your first sorted file.
- Read path: every get() checks the memtable first, then the SSTables newest-to-oldest. First hit wins.
- Compaction: periodically, merge older SSTables together using the merge() you just wrote. Fewer files means fewer places to look per get().
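As a preview, the read path in that sketch fits in a few lines. Here lsm_get, the dict memtable, and the callable-per-SSTable shape are my stand-ins for the real structures built in later chapters:

```python
# lsm_get.py — preview sketch of the LSM read path; the shapes here
# (dict memtable, one lookup callable per SSTable) are my stand-ins.
def lsm_get(key, memtable, sstable_probes):
    """memtable holds the freshest writes; sstable_probes are lookup
    callables ordered newest-first, each returning a value or None."""
    if key in memtable:
        return memtable[key]           # freshest data always wins
    for probe in sstable_probes:       # newest-to-oldest; first hit wins
        value = probe(key)
        if value is not None:
            return value
    return None                        # absent everywhere

memtable = {"city": "delhi"}
older_sstable = lambda k: {"city": "chennai", "name": "dipti"}.get(k)
print(lsm_get("city", memtable, [older_sstable]))   # -> 'delhi' (memtable wins)
print(lsm_get("name", memtable, [older_sstable]))   # -> 'dipti'
```

The "newest wins" rule shows up twice in the design: as probe order on reads, and as argument order in merge() during compaction.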
That structure — memtable + stack of SSTables + background merge — is the Log-Structured Merge-Tree, the LSM-tree. LevelDB, RocksDB, Cassandra, HBase, ScyllaDB, InfluxDB, ClickHouse's parts engine: all of them are variations of it. The merge step you wrote in 20 lines is the load-bearing operation of all of them.
Common confusions
- "How do we keep the file sorted while accepting random writes?" We do not. The on-disk files are immutable — once written, never modified. Random writes go to the memtable (an in-memory sorted structure, built in the next chapter), which sorts them as they arrive because it is small and in RAM. When the memtable is big enough, it is flushed to a new on-disk SSTable in a single sorted pass, and a fresh empty memtable starts accepting writes again. The on-disk files are always sorted because they were written that way; they are never updated in place. The set of files is what grows — and the merge() you wrote is how the set is pruned.
- "What happens when two SSTables disagree on the same key?" The newer SSTable wins. "Newer" is defined by when the file was created, not by any version number inside the records. During merge(), the newer stream is the first argument and its value is emitted on any tie. During reads, SSTables are probed newest-first and the first hit wins. This is the same "latest write wins" semantics Bitcask had, just moved into a comparison of files instead of a comparison of offsets.
- "Is stride=128 magic?" No, it is a knob. A larger stride shrinks the RAM index at the cost of a longer forward scan per lookup. LevelDB uses 4 KB data blocks with one index entry per block; RocksDB defaults to 4 KB data blocks plus a second-level "index block." Something between 64 and 1024 records per pin is the normal range. The tradeoff is explicit: pick your point on the RAM-versus-scan curve.
- "What about deletes?" Deletes are a tombstone record — a special value that says "this key is no longer present." Tombstones participate in the merge like any other record; a key that is tombstoned in the newer stream stays tombstoned in the output and is eventually dropped once all older SSTables have been rewritten past it. You saw tombstones already in chapter 9; the LSM version is the same idea, with one subtlety — a tombstone cannot be dropped while there is still an older SSTable that contains a live record for the same key, or the delete would "un-happen." Later chapters of Build 3 handle this carefully.
- "Binary search on disk seems slow." It is, if you do it naively — each probe is a random read. That is why the sparse index lives in RAM, not on disk. The binary search is in-memory (nanoseconds); the single seek() that follows is the only disk I/O. Real SSTables also cache the sparse index at multiple granularities (top-level block index, mid-level partition index) but the principle is the same: disk does one sequential read per query, memory does the binary search.
- "Why sort the data and not just use a B-tree?" A B-tree can do everything an SSTable can do, and is what every SQL database uses. The difference is the write path: a B-tree updates blocks in place, requiring locks, write-ahead logs, and random writes. An SSTable is append-only — the on-disk file is written exactly once, sequentially, and never touched again except to be read or to participate in a merge that produces a new file. That matches log-structured storage's strengths (cheap sequential writes, trivially crash-safe) and is why LSM and B-tree coexist: they are optimised for different points on the read-vs-write tradeoff. Build 4 of this track will implement a B-tree for comparison.
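The tombstone rule above can be sketched in a few lines. Here TOMBSTONE as a sentinel value is my simplification; real engines flag deletes in the record header rather than using a magic value:

```python
# tombstones.py — sketch of the tombstone drop rule. TOMBSTONE is a
# stand-in sentinel; real engines use a flag bit, not a magic value.
TOMBSTONE = "__tombstone__"

def drop_tombstones(merged_stream, at_bottom_level):
    """Filter a merged (key, value) stream. A tombstone may be dropped only
    at the bottom level, where no older SSTable can still hold a live value
    for its key — otherwise the delete would 'un-happen'."""
    for key, value in merged_stream:
        if value == TOMBSTONE and at_bottom_level:
            continue                   # the delete has fully taken effect
        yield key, value               # keep live records and live tombstones
```

Tombstones flow through merge() like any record; this filter is the extra step a compactor applies to the merged output.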
Going deeper
You have the three pieces. This section fills in the "why" questions — why sorted beats unsorted for merging, what sorting costs you, what changes when keys are large.
Why sorted is the right invariant for merging
Merging is not only possible on sorted data; it is only cheap on sorted data. If the inputs are unsorted, merging two n-record files is either quadratic or requires a full external sort as a pre-step — each merge becomes an O(n log n) operation with temporary files the size of the inputs.
With the sort invariant, merges are linear and streaming. That linearity compounds: in a real LSM-tree a given key may pass through five levels of merges on its way from the memtable to the coldest SSTable. If each merge were O(n log n) the write amplification would be astronomical. Because each merge is O(n), the total work for a key is bounded by a small constant times its final resting-place level.
There is a deeper point here. Invariants on data enable cheap invariants on operations. Sorted data enables linear merges. Immutable files enable lock-free reads. Append-only files enable trivially crash-safe writes. Constrain the data shape; the code shape follows.
The cost of maintaining order
Order is not free. In Bitcask's unsorted log the write path is one memcpy and one write(). In an LSM-tree the same write goes through a memtable insert (O(log k)), an eventual flush to a sorted SSTable, and multiple rounds of compaction as the key moves through levels.
The total bytes written to disk per user-visible byte is called write amplification. A well-tuned LSM-tree has write amplification of 10–30×; Bitcask's is roughly 1×. You trade writes for the ability to store more data than RAM — almost always a worthwhile trade, but worth pricing explicitly.
The punchline: sort order is what makes merges cheap, but merges are what maintaining sort order costs you. There is no free lunch; there is a very useful lunch that is explicitly priced.
What happens when keys are long
Our examples have 3-to-6-character keys. Real-world keys are often longer — a UUID is 36 bytes, a composite row key in a time-series store can be 500+. Long keys do not break the algorithm, they change the constants.
- The sparse index grows in proportion to key length. With 100 B keys, a stride of 128, and a 1 GB file of about a million records (so roughly 1 KB per record), the sparse index is about 780 KB; with 500 B keys it is 3.9 MB. Still tractable; still dwarfed by what Bitcask's full keydir would be.
- Prefix compression becomes worthwhile. In a sorted file, adjacent keys share prefixes — user:00001, user:00002, ... LevelDB and RocksDB store each key as "p bytes shared with previous key; q bytes new." The sparse index pins the full key at restart points so probes can still binary-search; only the records between pins use prefix encoding. This halves or quarters the on-disk size for keys with common prefixes.
- Binary search over the sparse index costs more for giant keys. Comparing two 500 B keys is a 500 B memcmp. If keys are kilobytes, the standard trick is a two-level index bucketed by the first few bytes.
The underlying shape — sorted on disk, sparse in RAM, merged in bulk — is the same whether keys are 3 bytes or 3 kilobytes.
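The prefix-compression scheme described above can be sketched as an encoder/decoder pair. This shows the idea only, with restart points omitted, so it is not the real LevelDB block format:

```python
# prefix_codec.py — sketch of LevelDB-style prefix compression; restart
# points are omitted for brevity, so this is the idea, not the real format.
import os

def prefix_encode(sorted_keys):
    """Encode each key as (bytes shared with the previous key, new suffix)."""
    prev = ""
    for key in sorted_keys:
        shared = len(os.path.commonprefix([prev, key]))
        yield shared, key[shared:]
        prev = key

def prefix_decode(pairs):
    """Invert prefix_encode, rebuilding the full keys in order."""
    prev = ""
    for shared, suffix in pairs:
        prev = prev[:shared] + suffix
        yield prev

keys = ["user:00001", "user:00002", "user:1"]
print(list(prefix_encode(keys)))
# -> [(0, 'user:00001'), (9, '2'), (5, '1')]
```

Decoding is strictly sequential from the previous key, which is exactly why the real format inserts periodic restart points: they give binary search somewhere to land.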
Why LSM and not "sorted Bitcask"?
A plausible design: keep Bitcask's log, but on every put() also insert the record into a sorted data structure on disk. That is exactly what a B-tree is, and it is expensive — every insert does at least one random disk write to update an interior node. You lose the cheap sequential write of the log and rebuild a traditional RDBMS storage engine.
LSM's cleverness is that it batches the sort. Writes land in an in-memory memtable, sorted in RAM for free. Only when the memtable is full does a big sorted file hit disk, in one sequential write. Between flushes, the disk never sees a random write. The sort invariant is maintained lazily by piggybacking on flushes and compactions, not eagerly on every write.
This laziness — the willingness to accept a read having to check multiple places — is what makes LSM cheaper to write than a B-tree, and more expensive to read. Every LSM optimisation after this chapter (Bloom filters, compaction, per-SSTable indexes) exists to buy back read speed without losing the cheap writes.
Where this leads next
The pieces you have are enough to think about the design but not yet enough to run it. The next several chapters build the missing machinery:
- The memtable: a sorted in-memory structure (skip list) — chapter 13: the dynamic sorted structure that absorbs random writes and stays sorted in RAM.
- Flushing a memtable to an SSTable — chapter 14: when the memtable is full, turn it into an immutable on-disk file using the sparse-index builder from this chapter.
- Read path: memtable → SSTables, newest wins — chapter 15: how get() walks the memtable and then the stack of SSTables in the correct order.
- Bloom filters — chapter 16: a 10-bit-per-key filter that lets get() skip most SSTables without touching disk.
- Compaction strategies — chapters 17–18: size-tiered and leveled compaction, and when to pick which. Both are just clever ways to call the merge() you wrote in this chapter.
By the end of Build 3 you will have something like a toy LevelDB: a few hundred lines of Python that can hold more data than RAM, answer range queries in milliseconds, and recover cleanly from crashes. All of it rests on the twenty-line merge function, the fifteen-line index builder, and the fifteen-line probe that live in this chapter. Everything else is scaffolding.
References
- Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil, The Log-Structured Merge-Tree (LSM-Tree) (Acta Informatica, 1996) — the paper that named and formalised the design. Springer.
- Fay Chang et al., Bigtable: A Distributed Storage System for Structured Data (OSDI 2006) — introduces the SSTable name and layout. §4 describes the sparse index and the merge-based compaction. research.google.
- Sanjay Ghemawat and Jeff Dean, LevelDB Implementation Notes — walks through SSTable layout, the block-level sparse index, and compaction. github.com/google/leveldb/blob/main/doc/impl.md.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 — the clearest textbook presentation of SSTables, memtables, and the LSM pattern. dataintensive.net.
- John von Neumann, First Draft of a Report on the EDVAC (1945) — contains the original description of merge-sort, including the linear-time merge step that this chapter lifts into a storage primitive. archive.org.
- Siying Dong et al., Optimizing Space Amplification in RocksDB (CIDR 2017) — empirical study of compaction and write/space amplification, with production numbers from Facebook's RocksDB deployment. cidrdb.org.