In short
The append-only log you wrote in chapter 2 and hardened in chapters 3 and 4 is a legitimate database — durable, crash-safe, scannable. But the moment it holds more than a few megabytes it hits three walls, in this order. Wall 1: reads are O(n) in the file size, because every get scans from the top. Throughput collapses as the log grows; at a gigabyte the store is unusable. The fix is an in-memory index (Build 2, Bitcask). Wall 2: a crash mid-write leaves a half-written record at the tail. Text framing cannot tell a torn record from a real one; the scanner hands the caller garbage. The fix is length-prefixed records with per-record checksums (CRC32C). Wall 3: the file grows forever. Every overwrite and every delete only adds bytes; dead data piles up without limit. The fix is segment files plus tombstones plus background compaction. These are not three unrelated problems — they are the three axes every real storage engine is designed around. Bitcask knocks down Wall 1 by keeping an in-memory hash index. LSM-trees knock down Wall 1 and Wall 3 by keeping sorted in-memory tables and merging them on disk. B+ trees knock down all three by keeping a disk-resident sorted index and updating it in place under a WAL. This chapter names the walls, draws them, reproduces each one in Python, and lays out the three-engine map you will spend the rest of this track building.
You have a database.
It is thirty lines of Python. It calls fsync at the right moments. It survives a power cord yanked from the wall because you tested it with the crash harness from chapter 4. You put in a key, you get out the value. The acceptance tests from chapter 1 are green for durability. One of the five promises is kept.
Now load it up. Write 10 million records. Write 100 million. Watch what happens.
get takes most of a second. No website survives this. Something has gone wrong, and it is not a bug. It is a property of the design. An append-only log with no index has read cost proportional to the file size, period. Fix the slope and you have changed the design.
This chapter is the moment Build 1 ends and the rest of the subject begins. You have built the floor — a durable, crash-safe log — and now you stand on it and look at the three walls. Each wall is a property of the design you just finished, a place where the log stops being enough. Each wall has a name, a reproducer, and a fix. The fixes are not in this chapter — they are the rest of the track. What is in this chapter is the map.
The three walls, at a glance
Three walls. Three fixes. Three engines in the rest of the subject. Let us take them one at a time.
Wall 1 — every read scans the whole log
The symptom is the slowdown you just saw. At a thousand records, get returns in 30 microseconds. At a million, it takes 25 milliseconds. At a hundred million, it takes seconds. Every tenfold growth in the log divides the read rate by ten.
The cause is in the scanner. Here it is, unchanged from chapter 2:
def get(self, key):
    latest = None
    with open(self.path, "r", encoding="utf-8") as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition("=")
            if k == key:
                latest = v
    return latest
Notice what this loop does not have: any information about where in the file the value for key lives. It cannot seek. It cannot skip. It walks every line from the top, because the only way to know which line is the latest is to see all of them. If the key is updated on line 1 and never touched again, the scanner still reads every other line to confirm.
Why you cannot cheat: the log is unordered. If the keys were sorted alphabetically, you could binary-search. But the writer appends in arrival order, not key order, and you cannot sort the file without moving records — which would break the durability guarantee that said "appends never touch old bytes." The price of crash-safety at write time is a hard floor on read time.
Reproduce the wall in ten lines. Run this script on your chapter-2 store.
# wall1_reads.py — measure how get() scales with log size
import os, time, random
from appendkv import AppendOnlyKV

for n in (10_000, 100_000, 1_000_000):
    path = f"w1_{n}.log"
    if os.path.exists(path): os.remove(path)
    db = AppendOnlyKV(path)
    for i in range(n):
        db.put(f"k{i}", f"v{i}")
    t0 = time.time()
    for _ in range(200):
        db.get(f"k{random.randint(0, n-1)}")
    rate = 200 / (time.time() - t0)
    size_mb = os.path.getsize(path) / 1e6
    print(f"n={n:>10,} size={size_mb:>6.1f} MB reads/s={rate:>8,.0f}")
    db.close()
On a 2025 laptop you see something like:
n= 10,000 size= 0.1 MB reads/s= 4,200
n= 100,000 size= 1.7 MB reads/s= 410
n= 1,000,000 size= 18.5 MB reads/s= 39
Tenfold more records, tenfold slower reads. A clean O(n) signature. The store is not broken; the design is.
The name of the fix, which you will build in Build 2: a table, living in memory, that maps each key to the byte offset where its latest value starts in the log. On a get, you look up the offset in the table (one hash lookup, a microsecond), seek to that offset in the file (one syscall, a few microseconds), and read one record. The read cost drops from proportional to the file size to constant.
# preview — Build 2 chapter 6 builds this
import os

class IndexedKV:
    def __init__(self, path):
        self.path = path
        self._f = open(path, "a", encoding="utf-8")  # append handle for writes
        self._offsets = {}  # key -> byte offset of latest record
        # rebuild by scanning the log once, at startup
        self._rebuild_index()

    def put(self, k, v):
        offset = self._f.tell()
        self._f.write(f"{k}={v}\n")
        self._f.flush(); os.fsync(self._f.fileno())
        self._offsets[k] = offset  # point at the new record

    def get(self, k):
        if k not in self._offsets: return None
        with open(self.path, "r", encoding="utf-8") as f:
            f.seek(self._offsets[k])  # straight to the latest record
            return f.readline().rstrip("\n").partition("=")[2]
That small change — one dictionary, one seek — takes reads from O(n) to O(1) at the cost of one Python dictionary entry per live key. It also introduces three new problems you now have to think about — the index lives in RAM so it must be rebuilt on startup, it must fit in memory, and the log it points into still grows forever. Those three problems are exactly what Wall 3 and Build 2 chapters 7–10 are about. For now, the slope has been broken. That is the whole point of Wall 1.
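The startup rebuild that the preview's comment alludes to can be sketched in a few lines. This is a hypothetical helper, not the Build 2 implementation; it assumes the chapter-2 key=value text format and opens the log in binary mode so offsets are exact byte positions:

```python
def rebuild_index(path):
    """Scan the log once, recording the byte offset of the latest record per key."""
    offsets = {}
    pos = 0
    with open(path, "rb") as f:  # binary mode: offsets are exact byte counts
        for line in f:
            key = line.split(b"=", 1)[0].decode()
            offsets[key] = pos   # later records for the same key overwrite earlier ones
            pos += len(line)
    return offsets
```

One pass, O(n) once at startup, and every get after that is O(1). Note the scan cost itself is why Wall 3 matters: the rebuild touches every byte ever written, live or dead.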
Wall 2 — the tail is torn, and text cannot tell
You saw this wall already in chapter 4 when the crash harness produced trials with a mangled last line. Here it is as a first-class limit on the design.
The append-only log relies on a single trick: a crash can only damage the last record, and the scanner can decide what to do about the damage because every other record is intact. The trick works so long as the scanner can recognise a damaged record. With the text format — key=value\n, one record per line — it cannot, reliably.
Reproduce the wall in six lines.
# wall2_torn.py — simulate a torn tail and see what the text scanner does
from appendkv import AppendOnlyKV

db = AppendOnlyKV("torn.log")
db.put("balance", "50000")  # valid, complete record
db.close()

# simulate a torn write: the next put wrote "balance=" but was killed before the value
with open("torn.log", "a") as f:
    f.write("balance=")  # no value, no newline — exactly what a crash leaves

db2 = AppendOnlyKV("torn.log")
print(db2.get("balance"))  # prints "" (empty string) — not 50000, not None
Read that output again. You asked for a balance. A previous run committed it at 50,000 rupees. The next transaction crashed before the new value hit disk. Your scanner hands back the empty string, because "balance=".partition("=") produces the key "balance" and the value "". The acknowledgement from the committed write has been silently erased by the noise of the uncommitted write. Invariant 1 of the chapter-4 crash harness (no lost acks) is violated, and the harness had no way to know unless the record format itself said "this line was torn."
The fix is to frame records so that torn = detectable. Every production storage engine uses the same shape:
[ 4-byte length N | N bytes of payload | 4-byte CRC32 of payload ]
A record is a 4-byte header, N payload bytes, and a 4-byte checksum. The scanner reads the header, learns N, reads N+4 more bytes, and verifies the CRC. Any truncation anywhere in the record produces one of three unambiguous signals:
- The file ends before a full 4-byte length is readable — treat as end of log.
- The length is readable but the file ends before N+4 more bytes are available — torn tail, stop recovery here.
- Everything is readable but the CRC does not match — corruption (torn or bit-flipped), stop recovery here.
There is no fourth case where a torn record masquerades as a valid one, because the length prefix and the checksum have to agree for acceptance. Here is the primitive:
import struct, binascii

def encode(payload: bytes) -> bytes:
    return (struct.pack("<I", len(payload))
            + payload
            + struct.pack("<I", binascii.crc32(payload)))

def read_next(f):
    hdr = f.read(4)
    if len(hdr) < 4: return None            # clean end of log
    (n,) = struct.unpack("<I", hdr)
    body = f.read(n + 4)
    if len(body) < n + 4: return "TORN"     # truncated tail
    payload, crc = body[:n], struct.unpack("<I", body[n:])[0]
    if crc != binascii.crc32(payload): return "CORRUPT"  # bit flip or torn mid-payload
    return payload
Why the CRC goes at the end and not the beginning: the writer has to compute the CRC over the payload, so the payload has to be complete before the CRC can be written. If the CRC sat at the start, the writer would either have to rewind (impossible on an append-only log) or buffer the whole payload in memory before writing anything. With the CRC at the end, the writer streams the payload straight out and tags the record closed with four bytes. If the crash cuts anywhere in the payload, the CRC never gets written — and the reader knows from the length what was promised and has to check what was delivered.
This is the framing used by Bitcask, LevelDB, RocksDB, Kafka, Postgres's WAL, SQLite's WAL, and essentially every serious storage engine. CRC32 is the historical default; CRC32C (Castagnoli) is the modern choice because Intel CPUs since the Nehalem generation have a hardware instruction for it (crc32 opcode, 1 cycle per 8 bytes), making it roughly the same cost as reading the bytes at all.
The fix is not free. You have added 8 bytes per record and two extra read calls per record on the scan path. For a workload of small records (say 30-byte payloads), the format overhead is 27%. Real engines amortise this by batching — one CRC covers an entire block of many records — but the basic principle stands: Wall 2 costs you a small fraction of your bytes and a small fraction of your read time, and in return it gives you a scanner that cannot be tricked by a torn tail.
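A sketch of that amortisation, with a hypothetical block format (not the exact layout of any of the engines named above): one length prefix and one trailing CRC cover a whole batch of records, so the fixed overhead is paid once per block rather than once per record.

```python
import struct, binascii

def encode_block(records):
    """Pack many small records under one length prefix and one trailing CRC32."""
    body = b"".join(struct.pack("<I", len(r)) + r for r in records)
    return struct.pack("<I", len(body)) + body + struct.pack("<I", binascii.crc32(body))

def decode_block(buf):
    """Return the records, or None if the block is torn or corrupt."""
    if len(buf) < 8:
        return None                      # not even a full header
    (n,) = struct.unpack_from("<I", buf, 0)
    if len(buf) < 4 + n + 4:
        return None                      # torn tail: block promised, not delivered
    body = buf[4:4 + n]
    (crc,) = struct.unpack_from("<I", buf, 4 + n)
    if crc != binascii.crc32(body):
        return None                      # bit flip or mid-body tear
    records, pos = [], 0
    while pos < len(body):               # walk the inner length-prefixed records
        (m,) = struct.unpack_from("<I", body, pos)
        records.append(body[pos + 4:pos + 4 + m])
        pos += 4 + m
    return records
```

The trade is granularity: one flipped bit now rejects the whole block, not one record, which is why engines that batch also keep blocks modest in size.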
Build 2 chapter 6 does this rewrite. For now, Wall 2 is named: text framing cannot distinguish a torn tail from a valid record; length-prefixed records with checksums can.
Wall 3 — the log grows forever
Load the store with realistic traffic. Write a million records. Update each one twenty times. Delete half of them.
Now look at the file. It has 21.5 million records in it: one initial write and twenty updates for each of a million keys, plus half a million delete markers. Only half a million keys are still live — the other 21 million records are dead, each a superseded version or a deleted key, none reachable by any logic but the scanner, all occupying disk space forever.
# wall3_growth.py — watch the log balloon under update and delete traffic
import os
from appendkv import AppendOnlyKV

db = AppendOnlyKV("growth.log")

# 10,000 users, each updated 50 times
for user in range(10_000):
    for update in range(50):
        db.put(f"user{user}", f"balance={1000 + update}")
size_after_updates = os.path.getsize("growth.log")

# then delete every user — the "delete" is just a write, since we have no real delete
for user in range(10_000):
    db.put(f"user{user}", "__DELETED__")
size_after_deletes = os.path.getsize("growth.log")

print("live keys: 0")            # every user was deleted
print("dead records: ~510,000")  # 500,000 superseded updates + 10,000 delete markers
print(f"file size: {size_after_deletes/1e6:.1f} MB")
The live data is zero bytes. The file is over ten megabytes. That is Wall 3 in a sentence: in an append-only log, a byte written is a byte kept.
The fix has three parts, which together are the whole of Build 2 chapters 7–9.
Segment files. Stop writing to one giant file. After the active log reaches some threshold (say 64 MB), close it and start a new one. Each closed segment is immutable — its bytes never change again. The database now has a directory full of segment files numbered by creation time, plus one "active" segment accepting writes.
Tombstones. "Delete" is a real operation, not a write of a magic string. A tombstone is a record that says this key has no live value after this point. On scan or lookup, a tombstone shadows all earlier values of the key, exactly the way any newer put shadows an older one. The tombstone itself eventually has to be removed — a cleanup folded into compaction — once no segment older than it still has a live value for that key.
Compaction. A background process reads old segments and rewrites them, one at a time or in pairs, keeping only the live records — the latest value for each key, or nothing if a tombstone supersedes everything. The rewritten segment is smaller. The old segment is deleted once the new one is durable. Disk usage is now bounded by roughly twice the live working set, not by the total writes ever issued.
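The three parts compose into a surprisingly small pass. Here is a minimal single-shot sketch, assuming the chapter-2 text format and a magic tombstone value; the real Build 2 compactor runs in the background and uses proper record framing:

```python
import os

TOMBSTONE = "__TOMBSTONE__"  # sketch only: a real engine uses a flag bit, not a magic string

def compact(segment_paths, out_path):
    """Merge old segments into one, keeping only the latest live value per key."""
    latest = {}
    for path in segment_paths:  # oldest first: later writes shadow earlier ones
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition("=")
                latest[k] = v
    with open(out_path, "w", encoding="utf-8") as out:
        for k, v in latest.items():
            if v != TOMBSTONE:            # tombstoned keys are dropped entirely
                out.write(f"{k}={v}\n")
        out.flush(); os.fsync(out.fileno())  # new segment must be durable...
    for path in segment_paths:
        os.remove(path)                   # ...before the old ones may be deleted
```

Note the ordering at the end: fsync the new segment, then delete the old ones. Reverse it and a crash between the two steps loses data.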
These three together are the reason a real Bitcask or LevelDB or RocksDB node running for a year does not consume a petabyte of disk to hold a hundred gigabytes of data. They are also the reason those engines have a background subsystem — a compaction thread — which is the first time in this track we have needed concurrency in the storage layer. Build 2 chapter 10 builds it.
For now, Wall 3 is named: an append-only log grows monotonically; bounding its size needs segments plus tombstones plus compaction.
Why these three walls — and not others
There are many properties a database can fail to deliver. Why do these three particular walls sit at the centre of the subject?
Because they are exactly the three directions an append-only log cannot be improved from the inside. Every other property the chapter-2 log lacks — concurrency, typed values, transactions, queries by predicate — is additive. You can keep the append-only log exactly as it is and add a lock manager, or add a schema, or add a B-tree on a separate file. Those additions live next to the log without changing its shape.
The three walls are different. They are properties of the act of appending to one unindexed file. To fix Wall 1 you must add something outside the log (an index). To fix Wall 2 you must change the bytes inside the log (the record format). To fix Wall 3 you must change the structure from one file to many (segments) and run something in the background (compaction). Each fix is a structural edit to the design itself.
Why all three walls must be addressed, in some order, by every storage engine: a real workload has size, it has crashes, and it lives for more than five minutes. Any storage engine that does not have an answer to all three of these is a toy. This is why the chapter-2 log is a toy even though it keeps one of the five promises — it cannot withstand any one of the walls, let alone all three.
Bitcask, LSM, B+ tree — three answers to the same three walls
Here is the payoff. Every production-grade storage engine you will ever meet is designed around these three walls, and can be classified by which wall it puts in the centre of its design.
Bitcask — the Riak backend, spelled out in a famously short paper (Sheehy & Smith, 2010). Bitcask's answer to Wall 1 is keep the entire index in memory, and no other trick. Every live key has a hash-map entry pointing at its latest record's byte offset in the current segment file. Reads are one map lookup, one seek, one read — effectively O(1). Wall 2 is handled by CRC-framed records. Wall 3 is handled by segment files plus background compaction, with hint files — small summaries of each compacted segment — so that the index can be rebuilt from disk in milliseconds rather than by scanning every record. The restriction is that every key must fit in RAM. If your key set is bigger than memory, Bitcask is over. Riak used this for over a decade. Build 2 is Bitcask.
LSM-tree — Log-Structured Merge-tree, the design behind LevelDB, RocksDB, Cassandra, ScyllaDB, CockroachDB's storage layer, and countless others. Its answer to Wall 1 is keep the newest data sorted in memory, periodically flush it as a sorted immutable file on disk (an SSTable), and search across those files at read time. A sparse per-block index plus Bloom filters make each file's lookup close to O(1), and the total lookup is O(log number-of-files). Wall 2 is handled by a write-ahead log (CRC-framed) protecting the in-memory table, plus per-block CRCs on each SSTable. Wall 3 is handled by leveled compaction: SSTables at level L are merged with SSTables at level L+1 in the background, producing fewer, larger, deduplicated files at the deeper level. The cost is write amplification — each byte written may be rewritten several times during compaction. LSM engines dominate the NoSQL world because their write path is almost as fast as a pure append-only log. Build 3 is LSM.
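The LSM write path just described can be sketched in toy form. This is the shape only — no WAL, no Bloom filters, no compaction, and the "SSTables" are in-memory sorted lists rather than files:

```python
import bisect

class TinyLSM:
    """Toy LSM sketch: sorted in-memory table, flushed as immutable sorted runs;
    reads check the memtable first, then runs from newest to oldest."""
    def __init__(self, flush_at=4):
        self.memtable = {}
        self.runs = []            # sorted (key, value) lists, newest last
        self.flush_at = flush_at

    def put(self, k, v):
        self.memtable[k] = v
        if len(self.memtable) >= self.flush_at:
            self.runs.append(sorted(self.memtable.items()))  # one immutable "SSTable"
            self.memtable = {}

    def get(self, k):
        if k in self.memtable:
            return self.memtable[k]
        for run in reversed(self.runs):                # newest run wins
            i = bisect.bisect_left(run, (k,))          # binary search: run is sorted
            if i < len(run) and run[i][0] == k:
                return run[i][1]
        return None
```

Notice that get degrades as runs pile up — that is read amplification, and compaction (merging runs) is what keeps it bounded.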
B+ tree — the classical relational database design, used by Postgres, SQLite, MySQL InnoDB, Oracle, DB2, and most SQL databases that predate the LSM era. Its answer to Wall 1 is keep a sorted tree of fixed-size pages on disk; each internal page points at children by page number; each leaf page holds a range of sorted keys. Reads traverse root → branch → leaf in O(log n) disk accesses. Wall 2 is handled by a write-ahead log plus per-page checksums, with the WAL replayed into the pages on crash recovery (the ARIES algorithm). Wall 3 is handled by in-place updates: when a page has free space, a new version of a key overwrites the old one rather than appending. Split and merge operations keep pages balanced. Deleted space is reclaimed by vacuum or garbage collection. Growth is bounded roughly by the live set, not by total writes. B+ trees trade slower writes (because every write is a random page update, requiring a seek on a spinning disk) for faster reads and better range scans. Build 4 is B+ trees.
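The B+ tree read path can likewise be sketched in toy form, with dicts standing in for fixed-size disk pages. One "page" is touched per level, and each touch is a binary search — the source of the O(log n) bound:

```python
import bisect

def bplus_get(root, key):
    """Descend a toy B+ tree: internal nodes hold separator keys and child
    pointers; leaves hold sorted (key, value) pairs. Shape only, not Build 4."""
    node = root
    while node["kind"] == "internal":
        i = bisect.bisect_right(node["keys"], key)   # which child covers this key
        node = node["children"][i]
    pairs = node["pairs"]
    i = bisect.bisect_left(pairs, (key,))
    if i < len(pairs) and pairs[i][0] == key:
        return pairs[i][1]
    return None
```

In a real engine each node is a 4–16 KB page read from disk by page number; here the pointer chase is free, but the traversal shape is the same.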
Three engines. Three answers to the same three walls. Each engine is a different point in the trade-space between read speed, write speed, memory footprint, and compaction cost.
Common confusions
- "The walls are bugs in the chapter-2 log." No. They are properties of the design. The chapter-2 log is a perfectly fine write-ahead log — a component inside every production database in the world. What it is not is a whole database. The walls name what has to be added to turn it into one.
- "The walls are independent — I can fix one and ignore the others." In practice, no. Fixing Wall 1 with an in-memory index gives you O(1) reads, but the index grows with the live key set, which only stays bounded if you solve Wall 3 (compaction). Fixing Wall 2 with length-prefixed records changes the scanner, which changes what "clean recovery" means, which changes how segments are recognised for Wall 3. All three fixes interact, which is why real storage engines are not three independent modules but one tangled whole.
- "Wall 3 is fine, I have infinite disk." Two things break before the disk does. First, startup scan time is proportional to total log bytes — if Wall 1's index is rebuilt by scanning the log at boot (which is what Bitcask does by default, and what hint files mitigate), unbounded growth means unbounded startup time, and you eventually cannot reboot in a working lifetime. Second, backups, replication, and compaction bandwidth all scale with bytes on disk; infinite disk does not give you infinite network.
- "LSM is 'just append-only compaction' and B+ tree is 'just in-place' — they are obviously competitors, one must be better." They are not competitors; they are different points on a trade-off curve. LSM optimises sequential write throughput at the cost of read amplification and background compaction work. B+ tree optimises read latency and range queries at the cost of random-write I/O. The right answer depends entirely on the read/write mix, the hardware, and the queries. Cassandra and Postgres both make sensible choices for their target workloads.
- "Bitcask is too simple to be a 'real' engine." Bitcask was the storage layer of Riak, a distributed database that handled production workloads at Heroku, Verizon, the UK government, and others for over a decade. Simple is not synonymous with toy; it is synonymous with tractable. The whole Bitcask paper is 7 pages.
- "If I add a hash index to the chapter-2 log, I have built Bitcask." Almost — but missing the CRC framing, the segments, the compactor, the hint files, and the single-writer concurrency model. Bitcask's simplicity is the sum of many small decisions that this chapter has only hinted at. Build 2 is where those decisions are made explicit.
Going deeper
If you just wanted the names of the three walls and the shape of the fixes, you have them. The rest of this section takes three tempting shortcuts — "just keep it in memory," "just mmap," "just use one giant log segment" — and shows why each of them fails to dissolve the walls, as preparation for why the real answers are what they are.
Why "keep everything in memory" does not dissolve Wall 3
The naive response to Wall 1 and Wall 3 is to keep the whole database in RAM. Reads are microseconds. Writes are microseconds. No scan cost, no compaction, no segments. Call it Redis.
Wall 3 does not care. A pure in-memory database that serves a write-heavy workload still has to do one of two things: snapshot periodically to disk, or append every write to a log on disk. The first is what Redis's RDB mode does — a full dump of the memory contents to a file, once every few minutes. The second is what Redis's AOF mode does — an append-only log of every command. Either way, you are back in disk-land, and the disk layer is subject to all three walls.
Snapshotting postpones Wall 3: you never have a log of dead records, because each snapshot is just the live state. But snapshots are O(live set) in both time and I/O, and you cannot take them after every write. So there is always a window of writes since the last snapshot that lives only in RAM; if the machine dies, that window is lost. Redis addresses this by combining RDB snapshots with an AOF (append-only file), at which point the AOF is an append-only log and you own Wall 3 again: rewriting the AOF — "BGREWRITEAOF" — is compaction by another name.
The rule: any durable database that is not willing to write every change to disk has to accept some recent-data loss on a crash. Any durable database that does write every change to disk has an append-only log, and the log has Wall 3.
Why mmap alone does not dissolve Wall 1
Another tempting shortcut: mmap the log file. Now the file is a region of the process's address space; every read is a memory access; the kernel pages in disk blocks on demand. No seeks in your code. Is that an O(1) read?
No. mmap changes the mechanism of access but not the amount. A get(key) still has to find the latest record for key, and it still has no information about where that record lives. The scan still happens; it is now a sequence of memory loads rather than read() calls, but every byte of the file still has to be examined. On a log larger than RAM, mmap forces the kernel to page blocks in and out of memory, which is often slower than streaming them through read() syscalls because you also pay for minor page faults and TLB misses.
Where mmap does help is as a trick inside a different design. A B+ tree keeps its pages in a file; mmap lets the tree treat the file as an array of pages with no explicit read logic. Many embedded databases (LMDB being the famous example) are basically "B+ tree inside an mmap'd file," and they are very fast. But the speedup comes from the B+ tree's O(\log n) access pattern. mmap on a linear scan is still a linear scan.
Log segments as a partial answer — and why they need compaction
A half-measure that is popular enough to deserve a name: segment the log (Wall 3 step 1) but do not yet compact. The active log is bounded to 64 MB. When it fills, close it, start a new one. Now the database is a directory full of segment files plus an active writer.
This alone does two useful things. First, old segments are immutable — you can hand them to a backup tool, compress them, copy them to cloud storage, without coordinating with the writer. Second, the active segment (where all writes land) stays small, so the "last segment might be torn" recovery story only has to scan 64 MB, not 64 GB.
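Rotation itself is a few lines. A sketch with a deliberately tiny threshold so the rollover is visible (a real engine would use something like 64 MB, and the class name here is hypothetical):

```python
import os

class SegmentedLog:
    """Sketch of log rotation: bound the active segment, never touch closed ones."""
    def __init__(self, dirpath, max_bytes=64):
        self.dir = dirpath
        self.max_bytes = max_bytes
        self.seg_no = 0
        self._open_active()

    def _open_active(self):
        path = os.path.join(self.dir, f"seg-{self.seg_no:06d}.log")
        self.f = open(path, "a", encoding="utf-8")

    def append(self, k, v):
        if self.f.tell() >= self.max_bytes:   # active segment full:
            self.f.close()                    # close it — its bytes are now immutable
            self.seg_no += 1
            self._open_active()               # start a fresh active segment
        self.f.write(f"{k}={v}\n")
        self.f.flush(); os.fsync(self.f.fileno())
```

Segment names sort by creation order, which is exactly the ordering a later compactor (or a recovery scan of the active tail) needs.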
But Wall 3 is not solved, only postponed. The total disk usage still grows monotonically until you start deleting segments. And you cannot delete a segment just because it is old — you have to be sure it contains no live records. That check is itself an O(segment size) scan, and its result is usually "yes, some records are still live," so you cannot delete the whole file.
The fix is merge compaction: read several old segments, write one new segment containing only the live records, delete the old segments. This is the real answer, and it is what Build 2 chapter 8 builds. Segments alone are a prerequisite; compaction is the mechanism.
Write amplification — the hidden cost of LSM compaction
An LSM engine's compactor may rewrite each logical byte several times as it is promoted from level to level. If the engine has 6 levels and each level is 10× the size of the previous one, a byte written at level 0 might be rewritten at levels 1, 2, 3, 4, 5 before it reaches its final resting place — a write amplification factor of roughly 6. On a workload that writes 100 MB/s of application data, the actual disk I/O is 600 MB/s. This is why LSM engines tune their level sizing carefully, and why modern variants like Universal compaction (RocksDB) or leveled-N compaction trade read amplification for lower write amplification. The walls have fixes, but the fixes have bills, and the bills are parameters in the engine config.
Why Wall 2 is not "solved" by fsync alone
fsync guarantees that the bytes you asked to persist are on the media. It does not tell you that every byte on the media belongs to a complete, committed record. If the process died mid-write() — before fsync was called on that record — the kernel still flushes the dirty pages it has, and the disk still accepts them, and the bytes are durable. They are just incomplete. The record format is the only thing that can tell a subsequent scanner "these bytes were a torn tail, not a real record." fsync and framing are orthogonal guarantees: fsync protects what you committed; framing protects recognising what was committed.
Where this leads next
You have finished Build 1. You have a 30-line Python class that is durable, crash-tested, and completely unsuitable for anything but a log. You also have a map of what is wrong with it — three walls — and you know the names of the three engines that knock them down.
Build 2 picks up the hammer for the first wall.
- The offset table — turning a log into a KV store — Build 2 chapter 6: the in-memory hash index. O(n) reads become O(1). You will discover that the simple preview at the top of Wall 1 is not quite right — atomic updates, concurrent readers, and index-recovery on startup all need care — and you will fix each.
- Segment files and closing old segments — Build 2 chapter 7: the first step of the Wall 3 fix. Rotate logs at a size threshold; learn why the recovery scan only needs to re-read the active segment.
- Compaction — reclaiming deleted and overwritten keys — Build 2 chapter 8: the second step of the Wall 3 fix. A background process that reads old segments and rewrites them. Your first taste of storage-engine concurrency.
- Tombstones and the delete problem — Build 2 chapter 9: the third step. What "delete" really means in an append-only world, and why tombstones cannot be deleted immediately.
- Hint files — rebuilding the index in milliseconds — Build 2 chapter 10: the trick that turns "rescan the whole log on startup" into a bounded, fast operation. The missing piece of Bitcask.
By the end of Build 2, you will have a working Bitcask — a log-plus-hash-index KV store that passes the crash harness, compacts in the background, and does tens of thousands of reads per second regardless of log size. Then Build 3 opens the second wall-knocker — sorted in-memory tables that flush to disk — and you meet the LSM. Build 4 opens the third — B+ trees. By then the three walls are not walls any more; they are the three axes of a design space you can navigate.
Start with the index.
References
- Justin Sheehy and David Smith, Bitcask: A Log-Structured Hash Table for Fast Key/Value Data (Riak, 2010) — the seven-page paper that introduced Bitcask. The cleanest possible answer to Wall 1 on top of an append-only log.
- Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil, The Log-Structured Merge-Tree (LSM-Tree) (Acta Informatica, 1996) — the original LSM paper, foundation of LevelDB, RocksDB, Cassandra.
- Douglas Comer, The Ubiquitous B-Tree (ACM Computing Surveys, 1979) — the classical B-tree/B+-tree reference, foundation of Postgres, MySQL InnoDB, SQLite.
- C. Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging (ACM TODS, 1992) — the WAL-plus-B+-tree recovery algorithm every modern RDBMS descends from.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 Storage and Retrieval — a working-engineer-level tour of hash indexes, LSMs, and B-trees that pairs directly with this chapter. dataintensive.net.
- Mark Callaghan, Read, write & space amplification (Small Datum, 2015) — a compact note on the LSM/B-tree trade-space in terms of the three amplifications, by one of the engineers who built MyRocks.