In short

The append-only log you wrote in chapter 2 and hardened in chapters 3 and 4 is a legitimate database — durable, crash-safe, scannable. But the moment it holds more than a few megabytes it hits three walls, in this order. Wall 1: reads are O(n) in the file size, because every get scans from the top. Throughput collapses as the log grows; at a gigabyte the store is unusable. The fix is an in-memory index (Build 2, Bitcask). Wall 2: a crash mid-write leaves a half-written record at the tail. Text framing cannot tell a torn record from a real one; the scanner hands the caller garbage. The fix is length-prefixed records with per-record checksums (CRC32C). Wall 3: the file grows forever. Every overwrite and every delete only adds bytes; dead data piles up without limit. The fix is segment files plus tombstones plus background compaction. These are not three unrelated problems — they are the three axes every real storage engine is designed around. Bitcask knocks down Wall 1 by keeping an in-memory hash index. LSM-trees knock down Wall 1 and Wall 3 by keeping sorted in-memory tables and merging them on disk. B+ trees knock down all three by keeping a disk-resident sorted index and updating it in place under a WAL. This chapter names the walls, draws them, reproduces each one in Python, and lays out the three-engine map you will spend the rest of this track building.

You have a database.

It is thirty lines of Python. It calls fsync at the right moments. It survives a power cord yanked from the wall because you tested it with the crash harness from chapter 4. You put in a key and a value; you get the value back. The acceptance tests from chapter 1 are green for durability. One of the five promises is kept.

Now load it up. Write 10 million records. Write 100 million. Watch what happens.

[Figure: log-log chart of read rate against log size for the chapter-2 store. A single line falls with slope −1: roughly 35,000 reads/s at 10³ records, 4,000 at 10⁴, 400 at 10⁵, 40 at 10⁶, 4 at 10⁷, and under 1 at 10⁸. Annotations along the line mark where the store turns slow, then unusable — "a page load is a sip of tea."]
Read throughput of the chapter-2 append-only log, measured across six orders of magnitude in log size. The slope is a clean $-1$ on log-log paper — the signature of an $O(n)$ scan. At a hundred million records, a single get takes most of a second. No website survives this.

Something has gone wrong, and it is not a bug. It is a property of the design. An append-only log with no index has read cost proportional to the file size, period. Fix the slope and you have changed the design.

This chapter is the moment Build 1 ends and the rest of the subject begins. You have built the floor — a durable, crash-safe log — and now you stand on it and look at the three walls. Each wall is a property of the design you just finished, a place where the log stops being enough. Each wall has a name, a reproducer, and a fix. The fixes are not in this chapter — they are the rest of the track. What is in this chapter is the map.

The three walls, at a glance

[Figure: three panels, one per wall. Wall 1 — linear reads: every get() walks the whole O(n) log; fix → in-memory index. Wall 2 — torn tail: a crash mid-write leaves torn bytes at the end of the log; fix → length prefix + CRC. Wall 3 — unbounded growth: no overwrites, no deletes, so dead data stays forever; fix → segments + tombstones + compaction.]
The three walls of the append-only log. Every real storage engine is, in one form or another, the answer to exactly these three pressures.

Three walls. Three fixes. Three engines in the rest of the subject. Let us take them one at a time.

Wall 1 — every read scans the whole log

The symptom is the graph you saw. At a thousand records, get returns in 30 microseconds. At a million, it takes 25 milliseconds. At a hundred million, it takes seconds. Every tenfold growth in the log divides the read rate by ten.

The cause is in the scanner. Here it is, unchanged from chapter 2:

def get(self, key):
    latest = None
    with open(self.path, "r", encoding="utf-8") as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition("=")
            if k == key:
                latest = v
    return latest

Notice what this loop does not have: any information about where in the file the value for key lives. It cannot seek. It cannot skip. It walks every line from the top, because the only way to know which line is the latest is to see all of them. If the key is updated on line 1 and never touched again, the scanner still reads every other line to confirm.

Why you cannot cheat: the log is unordered. If the keys were sorted alphabetically, you could binary-search. But the writer appends in arrival order, not key order, and you cannot sort the file without moving records — which would break the durability guarantee that said "appends never touch old bytes." The price of crash-safety at write time is a hard floor on read time.

Reproduce the wall in ten lines. Run this script on your chapter-2 store.

# wall1_reads.py — measure how get() scales with log size
import os, time, random
from appendkv import AppendOnlyKV

for n in (10_000, 100_000, 1_000_000):
    path = f"w1_{n}.log"
    if os.path.exists(path): os.remove(path)
    db = AppendOnlyKV(path)
    for i in range(n):
        db.put(f"k{i}", f"v{i}")
    t0 = time.time()
    for _ in range(200):
        db.get(f"k{random.randint(0, n-1)}")
    rate = 200 / (time.time() - t0)
    size_mb = os.path.getsize(path) / 1e6
    print(f"n={n:>10,}  size={size_mb:>6.1f} MB  reads/s={rate:>8,.0f}")
    db.close()

On a 2025 laptop you see something like:

n=    10,000  size=   0.1 MB  reads/s=   4,200
n=   100,000  size=   1.7 MB  reads/s=     410
n= 1,000,000  size=  18.5 MB  reads/s=      39

Tenfold more records, tenfold slower reads. A clean O(n) signature. The store is not broken; the design is.

The name of the fix, which you will build in Build 2: a table, living in memory, that maps each key to the byte offset where its latest value starts in the log. On a get, you look up the offset in the table (one hash lookup, a microsecond), seek to that offset in the file (one syscall, a few microseconds), and read one record. The read cost drops from proportional to the file size to constant.

# preview — Build 2 chapter 6 builds this
import os

class IndexedKV:
    def __init__(self, path):
        self.path = path
        self._f = open(path, "a", encoding="utf-8")   # append-only writer
        self._offsets = {}    # key -> byte offset of latest record
        self._rebuild_index() # rebuild by scanning the log once, at startup

    def _rebuild_index(self):
        with open(self.path, "r", encoding="utf-8") as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                self._offsets[line.partition("=")[0]] = offset

    def put(self, k, v):
        offset = self._f.tell()
        self._f.write(f"{k}={v}\n")
        self._f.flush(); os.fsync(self._f.fileno())
        self._offsets[k] = offset            # point at the new record

    def get(self, k):
        if k not in self._offsets: return None
        with open(self.path, "r", encoding="utf-8") as f:
            f.seek(self._offsets[k])         # straight to the latest record
            return f.readline().rstrip("\n").partition("=")[2]
That small change — a dictionary update in put, a seek in get — takes reads from O(n) to O(1) at the cost of one Python dictionary entry per live key. It also introduces three new problems you now have to think about — the index lives in RAM so it must be rebuilt on startup, it must fit in memory, and the log it points into still grows forever. Those three problems are exactly what Wall 3 and Build 2 chapters 7–10 are about. For now, the slope has been broken. That is the whole point of Wall 1.

Wall 2 — the tail is torn, and text cannot tell

You saw this wall already in chapter 4 when the crash harness produced trials with a mangled last line. Here it is as a first-class limit on the design.

The append-only log relies on a single trick: a crash can only damage the last record, and the scanner can decide what to do about the damage because every other record is intact. The trick works so long as the scanner can recognise a damaged record. With the text format — key=value\n, one record per line — it cannot, reliably.

[Figure: the same log from the writer's view and on disk after a crash. Intended: six complete records ending with key6=fig. On disk: the tail is torn to "key6=fi". A text scanner sees "key6=fi" as a valid-looking record and returns garbage on get("key6"); a length-prefixed, CRC-framed scanner sees a CRC mismatch, discards the tail, and returns None.]
Same crash, two framing choices. Text framing confuses a torn tail with a valid record and silently returns garbage. Length-prefix-plus-CRC framing recognises the tail as torn and stops recovery there.

Reproduce the wall in six lines.

# wall2_torn.py — simulate a torn tail and see what the text scanner does
from appendkv import AppendOnlyKV

db = AppendOnlyKV("torn.log")
db.put("balance", "50000")       # valid, complete record
db.close()

# simulate a torn write: the next put wrote "balance=" but was killed before the value
with open("torn.log", "a") as f:
    f.write("balance=")          # no value, no newline — exactly what a crash leaves

db2 = AppendOnlyKV("torn.log")
print(db2.get("balance"))        # prints "" (empty string) — not 50000, not None

Read that output again. You asked for a balance. A previous run committed it at 50,000 rupees. The next transaction crashed before the new value hit disk. Your scanner hands back the empty string, because "balance=".partition("=") produces the key "balance" and the value "". The acknowledgement from the committed write has been silently erased by the noise of the uncommitted write. Invariant 1 of the chapter-4 crash harness (no lost acks) is violated, and the harness had no way to know unless the record format itself said "this line was torn."

The fix is to frame records so that torn = detectable. Every production storage engine uses the same shape:

[ 4-byte length N | N bytes of payload | 4-byte CRC32 of payload ]

A record is a 4-byte header, N payload bytes, and a 4-byte checksum. The scanner reads the header, learns N, reads N+4 more bytes, and verifies the CRC. Any truncation anywhere in the record produces one of three unambiguous signals:

  1. The file ends before a full 4-byte length is readable — treat as end of log.
  2. The length is readable but the file ends before N+4 more bytes are available — torn tail, stop recovery here.
  3. Everything is readable but the CRC does not match — corruption (torn or bit-flipped), stop recovery here.

There is no fourth case where a torn record masquerades as a valid one, because the length prefix and the checksum have to agree for acceptance. Here is the primitive:

import struct, binascii

def encode(payload: bytes) -> bytes:
    return (struct.pack("<I", len(payload))
            + payload
            + struct.pack("<I", binascii.crc32(payload)))

def read_next(f):
    hdr = f.read(4)
    if len(hdr) < 4: return None                        # clean end of log
    (n,) = struct.unpack("<I", hdr)
    body = f.read(n + 4)
    if len(body) < n + 4: return "TORN"                 # truncated tail
    payload, crc = body[:n], struct.unpack("<I", body[n:])[0]
    if crc != binascii.crc32(payload): return "CORRUPT" # bit flip or torn mid-payload
    return payload

Why the CRC goes at the end and not the beginning: the writer has to compute the CRC over the payload, so the payload has to be complete before the CRC can be written. If the CRC sat at the start, the writer would either have to rewind (impossible on an append-only log) or buffer the whole payload in memory before writing anything. With the CRC at the end, the writer streams the payload straight out and tags the record closed with four bytes. If the crash cuts anywhere in the payload, the CRC never gets written — and the reader knows from the length what was promised and has to check what was delivered.
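
A streaming writer with the CRC at the end can be sketched in a few lines; append_record and its chunks argument are illustrative names for this sketch, not part of the chapter's store:

```python
import struct, binascii

def append_record(f, chunks):
    # Write the promised length first, stream the payload, seal with the CRC.
    # The checksum is folded in incrementally as each chunk goes out, so the
    # payload never has to be buffered whole in memory -- the property the
    # paragraph above relies on.
    total = sum(len(c) for c in chunks)
    f.write(struct.pack("<I", total))
    crc = 0
    for c in chunks:
        f.write(c)
        crc = binascii.crc32(c, crc)   # running CRC over the payload so far
    f.write(struct.pack("<I", crc))
```

If the crash lands anywhere before the final four bytes, the trailing CRC is missing or wrong, and read_next from the previous listing reports TORN or CORRUPT instead of handing back a half-record.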

This is the framing used by Bitcask, LevelDB, RocksDB, Kafka, Postgres's WAL, SQLite's WAL, and essentially every serious storage engine. CRC32 is the historical default; CRC32C (Castagnoli) is the modern choice because Intel CPUs since the Nehalem generation have a hardware instruction for it (crc32 opcode, 1 cycle per 8 bytes), making it roughly the same cost as reading the bytes at all.

The fix is not free. You have added 8 bytes per record and two extra read calls per record on the scan path. For a workload of small records (say 30-byte payloads), the format overhead is 27%. Real engines amortise this by batching — one CRC covers an entire block of many records — but the basic principle stands: Wall 2 costs you a small fraction of your bytes and a small fraction of your read time, and in return it gives you a scanner that cannot be tricked by a torn tail.
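
A sketch of that batching, with hypothetical names and a toy block size rather than any real engine's format: each record keeps a 4-byte length inside the block, and the 4-byte CRC is paid once per block instead of once per record.

```python
import struct, binascii

BLOCK_TARGET = 4096   # illustrative flush threshold, not a real engine's value

class BlockWriter:
    """Batch small records so one CRC covers a whole block (sketch)."""
    def __init__(self, f):
        self.f, self.buf = f, []

    def add(self, payload: bytes):
        # Each record still carries its own length prefix inside the block.
        self.buf.append(struct.pack("<I", len(payload)) + payload)
        if sum(len(b) for b in self.buf) >= BLOCK_TARGET:
            self.flush()

    def flush(self):
        if not self.buf:
            return
        block = b"".join(self.buf)
        # One length and one CRC for the entire block: the checksum cost is
        # amortised over every record the block holds.
        self.f.write(struct.pack("<I", len(block)))
        self.f.write(block)
        self.f.write(struct.pack("<I", binascii.crc32(block)))
        self.buf = []
```

Per-record overhead drops from 8 bytes to 4, plus 8 bytes shared by the whole block; the trade is that one bad CRC now discards the whole block rather than one record.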

Build 2 chapter 6 does this rewrite. For now, Wall 2 is named: text framing cannot distinguish a torn tail from a valid record; length-prefixed records with checksums can.

Wall 3 — the log grows forever

Load the store with realistic traffic. Write a million records. Update each one twenty times. Delete half of them.

Now look at the file. It holds 21.5 million records: the original million puts, 20 million updates, half a million deletes. Only 500,000 of them are live — the latest value of each undeleted key. The other 21 million are dead: superseded versions and deleted keys, unreachable by anything but the scanner, occupying disk space forever.

# wall3_growth.py — watch the log balloon under update and delete traffic
import os
from appendkv import AppendOnlyKV

db = AppendOnlyKV("growth.log")
# 10,000 users, each updated 50 times
for user in range(10_000):
    for update in range(50):
        db.put(f"user{user}", f"balance={1000 + update}")
size_after_updates = os.path.getsize("growth.log")

# then delete every user — the "delete" is just a write, since we have no real delete
for user in range(10_000):
    db.put(f"user{user}", "__DELETED__")
size_after_deletes = os.path.getsize("growth.log")

print(f"live keys:    0")                              # every user deleted
print(f"dead records: {10_000 * 50 + 10_000:,}")       # 510,000 — all of them
print(f"after updates: {size_after_updates/1e6:.1f} MB, "
      f"after deletes: {size_after_deletes/1e6:.1f} MB")

The live data is zero bytes. The file is over ten megabytes. That is Wall 3 in a sentence: in an append-only log, a byte written is a byte kept.

[Figure: a log file before and after compaction. Before: 20 MB on disk, roughly 95% dead records (superseded updates and tombstones) with a thin stripe of live records; growth continues forever because every put and every delete is an append. After: the segment rewritten with only the live records, about 1 MB. Compaction bounds Wall 3 rather than defeating it — the sawtooth of bytes-on-disk over time is the signature of every LSM and Bitcask.]
Wall 3. Without compaction, bytes-on-disk grows monotonically; most of what the disk holds is dead. Compaction is how the file pulls itself back down to the size of the live working set.

The fix has three parts, which together are the whole of Build 2 chapters 7–9.

Segment files. Stop writing to one giant file. After the active log reaches some threshold (say 64 MB), close it and start a new one. Each closed segment is immutable — its bytes never change again. The database now has a directory full of segment files numbered by creation time, plus one "active" segment accepting writes.

Tombstones. "Delete" is a real operation, not a write of a magic string. A tombstone is a record that says this key has no live value after this point. On scan or lookup, a tombstone shadows all earlier values of the key, exactly the way any newer put shadows an older one. The tombstone itself eventually has to be removed — tombstone garbage collection — once no segment older than it still holds a live value for that key.

Compaction. A background process reads old segments and rewrites them, one at a time or in pairs, keeping only the live records — the latest value for each key, or nothing if a tombstone supersedes everything. The rewritten segment is smaller. The old segment is deleted once the new one is durable. Disk usage is now bounded by roughly twice the live working set, not by the total writes ever issued.
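
The three parts can be sketched as one function over the chapter-2 text format, assuming (for this sketch only) that a delete is still the magic string from wall3_growth.py rather than a real tombstone record:

```python
def compact(segment_paths, out_path, tombstone="__DELETED__"):
    # Replay the segments oldest to newest; the last write for a key wins,
    # exactly as the scanner's "latest record" rule says it should.
    latest = {}
    for path in segment_paths:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition("=")
                latest[k] = v
    # Emit only live records; a trailing tombstone erases the key entirely.
    with open(out_path, "w", encoding="utf-8") as f:
        for k, v in latest.items():
            if v != tombstone:
                f.write(f"{k}={v}\n")
```

The old segments are deleted only after the new one is durable, the same fsync discipline the earlier chapters established; until then both copies exist, which is why compaction bounds disk usage at roughly twice the live set.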

These three together are the reason a real Bitcask or LevelDB or RocksDB node running for a year does not consume a petabyte of disk to hold a hundred gigabytes of data. They are also the reason those engines have a background subsystem — a compaction thread — which is the first time in this track we have needed concurrency in the storage layer. Build 2 chapter 10 builds it.

For now, Wall 3 is named: an append-only log grows monotonically; bounding its size needs segments plus tombstones plus compaction.

Why these three walls — and not others

There are many properties a database can fail to deliver. Why do these three particular walls sit at the centre of the subject?

Because they are exactly the three directions an append-only log cannot be improved from the inside. Every other property the chapter-2 log lacks — concurrency, typed values, transactions, queries by predicate — is additive. You can keep the append-only log exactly as it is and add a lock manager, or add a schema, or add a B-tree on a separate file. Those additions live next to the log without changing its shape.

The three walls are different. They are properties of the act of appending to one unindexed file. To fix Wall 1 you must add something outside the log (an index). To fix Wall 2 you must change the bytes inside the log (the record format). To fix Wall 3 you must change the structure from one file to many (segments) and run something in the background (compaction). Each fix is a structural edit to the design itself.

Why all three walls must be addressed, in some order, by every storage engine: a real workload has size, it has crashes, and it lives for more than five minutes. Any storage engine that does not have an answer to all three of these is a toy. This is why the chapter-2 log is a toy even though it keeps one of the five promises — it cannot withstand any one of the walls, let alone all three.

Bitcask, LSM, B+ tree — three answers to the same three walls

Here is the payoff. Every production-grade storage engine you will ever meet is designed around these three walls, and can be classified by which wall it puts in the centre of its design.

[Figure: a three-by-three grid — the three engines down the side, the three walls across the top.]

Bitcask (Riak backend; Build 2) — Wall 1: in-memory hash index, key → (segment, offset); O(1) reads, all keys in RAM. Wall 2: CRC-framed records; the scanner stops at the first bad CRC and discards the torn tail on open. Wall 3: segments + background compaction, with hint files for fast restart.

LSM-tree (LevelDB, RocksDB, Cassandra, Scylla; Build 3) — Wall 1: MemTable + SSTables, sorted files with a sparse index; O(log n) with Bloom filters. Wall 2: CRC-framed WAL replayed into the MemTable; immutable SSTable blocks with per-block CRCs. Wall 3: leveled compaction, merging sorted runs and dropping dead keys; write amplification is the cost.

B+ tree (Postgres, SQLite, MySQL InnoDB; Build 4) — Wall 1: disk-resident sorted tree, root → branch → leaf pages; O(log n) with few disk seeks. Wall 2: WAL + page checksums, replayed into pages on recovery; double-write buffer for torn pages. Wall 3: in-place updates reuse freed pages; vacuum / page reclaim; growth bounded by the live set.
The three-engine map of the rest of this subject. Same three walls; three radically different answers. You will build all three before the course ends.

Bitcask — the Riak backend, spelled out in a famously short paper (Sheehy & Smith, 2010). Bitcask's answer to Wall 1 is to keep the whole index in memory — no other trick. Every live key has a hash-map entry pointing at the byte offset of its latest record in its segment file. Reads are one map lookup, one seek, one read — effectively O(1). Wall 2 is handled by CRC-framed records. Wall 3 is handled by segment files plus background compaction, with hint files — small summaries of each compacted segment — so that the index can be rebuilt from disk in milliseconds rather than by scanning every record. The restriction is that the whole key set must fit in RAM. If it is bigger than memory, Bitcask is out. Riak used this for over a decade. Build 2 is Bitcask.

LSM-tree — Log-Structured Merge-tree, the design behind LevelDB, RocksDB, Cassandra, ScyllaDB, CockroachDB's storage layer, and countless others. Its answer to Wall 1 is keep the newest data sorted in memory, periodically flush it as a sorted immutable file on disk (an SSTable), and search across those files at read time. A sparse per-block index plus Bloom filters make each file's lookup close to O(1), and the total lookup grows only with the logarithm of the number of files. Wall 2 is handled by a write-ahead log (CRC-framed) protecting the in-memory table, plus per-block CRCs on each SSTable. Wall 3 is handled by leveled compaction: SSTables at level L are merged with SSTables at level L+1 in the background, producing fewer, larger, deduplicated files at the deeper level. The cost is write amplification — each byte written may be rewritten several times during compaction. LSM engines dominate the NoSQL world because their write path is almost as fast as a pure append-only log. Build 3 is LSM.

B+ tree — the classical relational database design, used by Postgres, SQLite, MySQL InnoDB, Oracle, DB2, and most SQL databases that predate the LSM era. Its answer to Wall 1 is keep a sorted tree of fixed-size pages on disk; each internal page points at children by page number; each leaf page holds a range of sorted keys. Reads traverse root → branch → leaf in O(log n) disk accesses. Wall 2 is handled by a write-ahead log plus per-page checksums, with the WAL replayed into the pages on crash recovery (the ARIES algorithm). Wall 3 is handled by in-place updates: when a page has free space, a new version of a key overwrites the old one rather than appending. Split and merge operations keep pages balanced. Deleted space is reclaimed by vacuum or garbage collection. Growth is bounded roughly by the live set, not by total writes. B+ trees trade slower writes (because every write is a random page update, requiring a seek on a spinning disk) for faster reads and better range scans. Build 4 is B+ trees.

Three engines. Three answers to the same three walls. Each engine is a different point in the trade-space between read speed, write speed, memory footprint, and compaction cost.

Going deeper

If you just wanted the names of the three walls and the shape of the fixes, you have them. The rest of this section takes three tempting shortcuts — "just keep it in memory," "just mmap the file," "just segment the log and skip compaction" — and shows why each of them fails to dissolve the walls, as preparation for why the real answers are what they are.

Why "keep everything in memory" does not dissolve Wall 3

The naive response to Wall 1 and Wall 3 is to keep the whole database in RAM. Reads are microseconds. Writes are microseconds. No scan cost, no compaction, no segments. Call it Redis.

Wall 3 does not care. A pure in-memory database that serves a write-heavy workload still has to do one of two things: snapshot periodically to disk, or append every write to a log on disk. The first is what Redis's RDB mode does — a full dump of the memory contents to a file, once every few minutes. The second is what Redis's AOF mode does — an append-only log of every command. Either way, you are back in disk-land, and the disk layer is subject to all three walls.

Snapshotting postpones Wall 3: you never have a log of dead records, because each snapshot is just the live state. But a snapshot costs time and I/O proportional to the live set, and you cannot take one after every write. So there is always a window of writes since the last snapshot that lives only in RAM; if the machine dies, that window is lost. Redis addresses this by combining RDB snapshots with an AOF (append-only file), at which point the AOF is an append-only log and you own Wall 3 again: rewriting the AOF — "BGREWRITEAOF" — is compaction by another name.

The rule: any durable database that is not willing to write every change to disk has to accept some recent-data loss on a crash. Any durable database that does write every change to disk has an append-only log, and the log has Wall 3.

Why mmap alone does not dissolve Wall 1

Another tempting shortcut: mmap the log file. Now the file is a region of the process's address space; every read is a memory access; the kernel pages in disk blocks on demand. No seeks in your code. Is that an O(1) read?

No. mmap changes the mechanism of access but not the amount. A get(key) still has to find the latest record for key, and it still has no information about where that record lives. The scan still happens; it is now a sequence of memory loads rather than read() calls, but every byte of the file still has to be examined. On a log larger than RAM, mmap forces the kernel to page blocks in and out of memory, which is often slower than streaming them through read() syscalls, because now you also pay for page faults and TLB misses.

Where mmap does help is as a trick inside a different design. A B+ tree keeps its pages in a file; mmap lets the tree treat the file as an array of pages with no explicit read logic. Many embedded databases (LMDB being the famous example) are basically "B+ tree inside an mmap'd file," and they are very fast. But the speedup comes from the B+ tree's O(\log n) access pattern. mmap on a linear scan is still a linear scan.

Log segments as a partial answer — and why they need compaction

A half-measure that is popular enough to deserve a name: segment the log (Wall 3 step 1) but do not yet compact. The active log is bounded to 64 MB. When it fills, close it, start a new one. Now the database is a directory full of segment files plus an active writer.

This alone does two useful things. First, old segments are immutable — you can hand them to a backup tool, compress them, copy them to cloud storage, without coordinating with the writer. Second, the active segment (where all writes land) stays small, so the "last segment might be torn" recovery story only has to scan 64 MB, not 64 GB.

But Wall 3 is not solved, only postponed. The total disk usage still grows monotonically until you start deleting segments. And you cannot delete a segment just because it is old — you have to be sure it contains no live records. That check is itself a scan proportional to the segment size, and its result is usually "yes, some records are still live," so you cannot delete the whole file.

The fix is merge compaction: read several old segments, write one new segment containing only the live records, delete the old segments. This is the real answer, and it is what Build 2 chapter 8 builds. Segments alone are a prerequisite; compaction is the mechanism.

Write amplification — the hidden cost of LSM compaction

An LSM engine's compactor may rewrite each logical byte several times as it is promoted from level to level. If the engine has 6 levels and each level is 10× the size of the previous one, a byte written at level 0 might be rewritten at levels 1, 2, 3, 4, 5 before it reaches its final resting place — a write amplification factor of roughly 6. On a workload that writes 100 MB/s of application data, the actual disk I/O is 600 MB/s. This is why LSM engines tune their level sizing carefully, and why modern variants like Universal compaction (RocksDB) or leveled-N compaction trade read amplification for lower write amplification. The walls have fixes, but the fixes have bills, and the bills are parameters in the engine config.
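
The arithmetic under that simplified one-rewrite-per-level model can be sketched directly; the 64 MB level-0 size and fanout of 10 are illustrative defaults, not any particular engine's configuration:

```python
def leveled_lsm_io(app_mb_s, levels, l0_size_mb=64, fanout=10):
    # Simplified model from the text: each byte is rewritten once per level
    # on its way down. Real leveled compaction also rewrites overlapping data
    # at the destination level, so true amplification is higher than this.
    capacities = [l0_size_mb * fanout ** i for i in range(levels)]  # 64 MB, 640 MB, ...
    write_amp = levels                    # one rewrite per level
    return capacities, write_amp, app_mb_s * write_amp

caps, wa, disk_mb_s = leveled_lsm_io(app_mb_s=100, levels=6)
# 100 MB/s of application writes, write amplification 6 -> 600 MB/s of disk I/O
```

Tuning l0_size_mb and fanout moves the level count for a given dataset size, which is exactly the knob the paragraph above describes.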

Why Wall 2 is not "solved" by fsync alone

fsync guarantees that the bytes you asked to persist are on the media. It does not tell you that every byte on the media belongs to a complete, committed record. If the process died mid-write() — before fsync was called on that record — the kernel still flushes the dirty pages it has, and the disk still accepts them, and the bytes are durable. They are just incomplete. The record format is the only thing that can tell a subsequent scanner "these bytes were a torn tail, not a real record." fsync and framing are orthogonal guarantees: fsync protects what you committed; framing protects recognising what was committed.

Where this leads next

You have finished Build 1. You have a 30-line Python class that is durable, crash-tested, and completely unsuitable for anything but a log. You also have a map of what is wrong with it — three walls — and you know the names of the three engines that knock them down.

Build 2 picks up the hammer for the first wall.

By the end of Build 2, you will have a working Bitcask — a log-plus-hash-index KV store that passes the crash harness, compacts in the background, and does tens of thousands of reads per second regardless of log size. Then Build 3 opens the second wall-knocker — sorted in-memory tables that flush to disk — and you meet the LSM. Build 4 opens the third — B+ trees. By then the three walls are not walls any more; they are the three axes of a design space you can navigate.

Start with the index.

References

  1. Justin Sheehy and David Smith, Bitcask: A Log-Structured Hash Table for Fast Key/Value Data (Riak, 2010) — the seven-page paper that introduced Bitcask. The cleanest possible answer to Wall 1 on top of an append-only log.
  2. Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil, The Log-Structured Merge-Tree (LSM-Tree) (Acta Informatica, 1996) — the original LSM paper, foundation of LevelDB, RocksDB, Cassandra.
  3. Douglas Comer, The Ubiquitous B-Tree (ACM Computing Surveys, 1979) — the classical B-tree/B+-tree reference, foundation of Postgres, MySQL InnoDB, SQLite.
  4. C. Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging (ACM TODS, 1992) — the WAL-plus-B+-tree recovery algorithm every modern RDBMS descends from.
  5. Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 Storage and Retrieval — a working-engineer-level tour of hash indexes, LSMs, and B-trees that pairs directly with this chapter. dataintensive.net.
  6. Mark Callaghan, Read, write & space amplification (Small Datum, 2015) — a compact note on the LSM/B-tree trade-space in terms of the three amplifications, by one of the engineers who built MyRocks.