In short
The memtable from chapter 13 is fast, but it is in RAM and it is bounded in size. Eventually it gets full — typically 32 MB to 256 MB — and something has to move it out. That "something" is a flush: a single sequential pass over the memtable in sorted order, packing records into block-aligned chunks on disk, emitting a tiny sparse index at the tail, and closing with a footer that points back at the index. Because the memtable is already sorted (a skip list walks in order), the flush is not a sort — it is a transcription. The trick that makes it safe under concurrent writes is the active/frozen swap: the moment you decide to flush, the current memtable is marked immutable ("frozen") and a fresh empty one is installed as the "active" writer. Reads check both until the flush completes; writes go only to the active one. When the flush finishes, the frozen memtable is dropped (the WAL has already made it durable; see Build 5), and an SSTable sits on disk, ready to be read. This chapter writes the fifty-line flush_memtable(), draws the SSTable layout and the active/frozen swap, and handles the three boring-but-critical details — block alignment, atomic rename, and what "done" means when a crash can happen at any byte.
You have a memtable. It is a skip list holding sorted key-value pairs in RAM, it accepts writes in O(\log n), and — the non-negotiable part — it has a memory budget. In LevelDB the default is 4 MB. In RocksDB it is 64 MB. In Cassandra it is a fraction of the JVM heap. Pick your number, but there is always a number, and when the memtable crosses it, you have a problem.
You cannot just grow the memtable forever. RAM is finite; a process that never evicts anything from RAM either crashes or drives the host into swap. You cannot insert into an on-disk SSTable in place, because the on-disk files are sorted and immutable by design (that is why chapter 12 cost you the keydir — to gain sort order you gave up in-place update).
So the only move is to move the memtable to disk as a complete, new SSTable. That operation is called a flush, and it is the hinge on which Build 3 turns. Memtable absorbs random writes cheaply. Flush turns those absorbed writes into a sorted, durable file. SSTable holds them forever, until a compaction (next few chapters) decides otherwise.
This chapter builds the flush.
When to flush
Three signals trigger a flush in real systems, and any production engine uses all three:
- Size. The memtable crosses a byte threshold — 32 MB, 64 MB, 256 MB depending on the engine. This is the dominant trigger under heavy write load. It bounds RAM usage and bounds the size of each resulting SSTable.
- Time. Even a nearly empty memtable gets flushed after some maximum age — typically five or ten minutes. This bounds how long a write can sit in RAM before it becomes a durable SSTable (the WAL covers crash-safety in the meantime, but flushing still matters for the read path and for eventually reclaiming WAL space).
- Pressure from new writes. If a write arrives and the memtable is already full or nearly so, the write path itself triggers the flush. This is a back-pressure mechanism: if flushes cannot keep up with writes, the write path stalls briefly until the flush completes, rather than letting RAM usage grow without bound.
You will implement all three as a single should_flush() predicate. The shape:
# flush_policy.py — when to flush

MEMTABLE_MAX_BYTES = 64 * 1024 * 1024      # 64 MB
MEMTABLE_MAX_AGE_SECONDS = 10 * 60         # 10 minutes

def should_flush(memtable, now):
    """Return True if the given memtable should be sealed and written to disk."""
    if memtable.size_bytes >= MEMTABLE_MAX_BYTES:
        return True    # size trigger
    if now - memtable.created_at >= MEMTABLE_MAX_AGE_SECONDS:
        return True    # time trigger
    return False
The pressure trigger is not in this predicate — it is a consequence of a write finding should_flush() already true on entry, and synchronously kicking off the flush before returning. Real engines run the flush on a background thread so writes never block on disk I/O, but the first implementation can be synchronous and still be correct. Async is an optimisation (covered at the end).
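To see the predicate fire, a stub object with the two fields it reads (size_bytes, created_at) is enough — SimpleNamespace here is just a stand-in for the real skip-list-backed memtable:

```python
import time
from types import SimpleNamespace

MEMTABLE_MAX_BYTES = 64 * 1024 * 1024      # 64 MB
MEMTABLE_MAX_AGE_SECONDS = 10 * 60         # 10 minutes

def should_flush(memtable, now):
    if memtable.size_bytes >= MEMTABLE_MAX_BYTES:
        return True    # size trigger
    if now - memtable.created_at >= MEMTABLE_MAX_AGE_SECONDS:
        return True    # time trigger
    return False

# a fresh, small memtable: no trigger fires
mt = SimpleNamespace(size_bytes=1024, created_at=time.time())
print(should_flush(mt, time.time()))       # False

# cross the size threshold: size trigger fires
mt.size_bytes = MEMTABLE_MAX_BYTES
print(should_flush(mt, time.time()))       # True

# a nearly empty memtable that is 11 minutes old: time trigger fires
old = SimpleNamespace(size_bytes=0, created_at=time.time() - 11 * 60)
print(should_flush(old, time.time()))      # True
```

The write path calls this on every put(); the predicate is cheap (two comparisons), so there is no need to amortise it.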
The flush algorithm
Pseudocode first, because the shape is small and the details will look scary if you have not seen the outline:
flush_memtable(mt, path):
  1. seal mt (no more writes)              # active/frozen swap happens here
  2. open a temp file at path.tmp
  3. for each (key, value) in mt.in_sorted_order():
         pack record into the current block buffer
         if block buffer >= BLOCK_SIZE:
             write block to file
             record (first_key_of_block, offset) in sparse_index
             start a new block buffer
  4. flush any partial block
  5. write sparse_index as a single block
  6. write footer: (sparse_index_offset, sparse_index_length, num_records, magic)
  7. fsync the file
  8. rename path.tmp -> path               # atomic on any POSIX filesystem
  9. drop the frozen memtable
Nine steps. Three of them are the real work (3, 5, 6); three are concurrency/durability hygiene (1, 7, 8); three are bookkeeping (2, 4, 9). Let us draw the on-disk shape first, then write the code.
Three design decisions are baked into this layout. Each one earns its keep.
Block alignment. Data is packed into fixed-size blocks (here, 4 KB). A record straddles a block boundary only if it is itself bigger than a block; otherwise each block holds a whole number of records with a little padding at the end. The payoff is that every disk read is one page — the OS page cache, the SSD's FTL, and the filesystem all work in pages, so aligning to page boundaries means one read = one block = one cache line of I/O. You never pay to read half of an adjacent block just because your target happened to span two pages.
Sparse index at the tail, not the head. The index is written after the data because you do not know the block offsets until you have written the blocks. Writing sequentially, top-to-bottom, means you discover offsets as you go and emit the index only once, at the end, in one pass.
Footer at the absolute end. A reader does not know where the index starts — it could be anywhere, because data size varies. The solution is the footer: a fixed-size block at a known position (end minus 32 bytes). Read the footer first, learn the index offset, then seek to it. This makes SSTables self-describing: no external metadata file needed.
The code
Here is the fifty-line flush. It assumes a memtable that supports .items() yielding (key, value) in sorted order (the skip list from chapter 13 does this for free):
# flush.py — turn a sorted memtable into a block-aligned SSTable on disk

import os, struct

BLOCK_SIZE = 4096
MAGIC = b"SSTv1\x00\x00\x00"       # 8 bytes; reader checks this
FOOTER_SIZE = 8 + 8 + 8 + 8        # index_off, index_len, num_records, magic

def _encode_record(key, value):
    kb, vb = key.encode("utf-8"), value.encode("utf-8")
    return struct.pack(">II", len(kb), len(vb)) + kb + vb

def _pad(block):
    # pad to the next block boundary; -len % BLOCK_SIZE also keeps alignment
    # for the rare record that is itself bigger than one block
    return block + b"\x00" * (-len(block) % BLOCK_SIZE)

def flush_memtable(mt, path):
    """
    Write the memtable's sorted contents to a new SSTable at `path`.
    Uses a .tmp suffix and atomic rename so partial files never survive a crash.
    """
    tmp = path + ".tmp"
    sparse_index = []              # list of (first_key, block_offset)
    num_records = 0
    block = bytearray()
    block_first_key = None
    with open(tmp, "wb") as f:
        for key, value in mt.items():            # sorted order
            rec = _encode_record(key, value)
            if block and len(block) + len(rec) > BLOCK_SIZE:
                # current block is full — pad and flush it
                sparse_index.append((block_first_key, f.tell()))
                f.write(_pad(block))
                block, block_first_key = bytearray(), None
            if block_first_key is None:
                block_first_key = key            # first key of new block
            block += rec
            num_records += 1
        if block:                                # flush final partial block
            sparse_index.append((block_first_key, f.tell()))
            f.write(_pad(block))
        # write sparse index as one block
        index_off = f.tell()
        index_buf = bytearray()
        for k, off in sparse_index:
            kb = k.encode("utf-8")
            index_buf += struct.pack(">IQ", len(kb), off) + kb
        f.write(index_buf)
        index_len = len(index_buf)
        # write footer at the very end
        f.write(struct.pack(">QQQ", index_off, index_len, num_records) + MAGIC)
        f.flush()
        os.fsync(f.fileno())       # durable before rename
    os.rename(tmp, path)           # atomic publish
    return {"path": path, "records": num_records, "blocks": len(sparse_index)}
Fifty lines if you count the encoder. Read it once with the layout diagram open, and trace what a single record does: it gets encoded (4 bytes key-length, 4 bytes value-length, key, value), appended to the current block buffer, and — if the block is full — the buffer gets padded to 4096 bytes and flushed to the file. Meanwhile the sparse-index list grows by one entry per block, not one entry per record.
Why pad the block instead of leaving it short: if you write a 3840-byte block followed by the next block, the file positions no longer land on 4 KB boundaries, and the "one disk read = one block" property is gone. Padding is wasted bytes, but at 4 KB blocks the waste is at most a few percent and the alignment is worth every lost byte.
Why fsync() before rename(): write() only copies bytes into the kernel's page cache; until an fsync(), the file could vanish in a crash. And rename() is only atomic with respect to the filesystem metadata — if the data has not been fsynced, the renamed file can exist on disk as a zero-byte (or partially written) entry. Fsync the data, then rename; that is the safe order. [3]
Why the magic bytes at the end, not the beginning: the reader of an SSTable starts at the tail. It reads the last 32 bytes, sanity-checks the magic, and extracts the index offset. Magic at the head would require a second seek. LevelDB, RocksDB, Parquet, and many others put their magic and metadata at the tail for exactly this reason.
During-flush writes: the active/frozen swap
Here is the question the pseudocode hides. A flush takes time — disks are not instantaneous, and a 64 MB memtable dumping to SSD at 500 MB/s still needs 128 milliseconds. During that 128 ms, a write arrives. It cannot go into the memtable that is currently being iterated (you would invalidate the iterator, and the ordering guarantees evaporate). It cannot be dropped. It must go somewhere.
The answer is the active/frozen swap. It is worth drawing because the picture makes the flow obvious.
The swap is the one non-trivial piece of concurrency in the whole design, and it is small. A sketch:
# store.py — active/frozen swap

import time

class LsmStore:
    def __init__(self):
        self.active = Memtable()       # accepts writes
        self.frozen = None             # being flushed, or None
        self.sstables = []             # on-disk, newest-first

    def put(self, key, value):
        if should_flush(self.active, time.time()):
            self._begin_flush()
        self.active.put(key, value)    # always to the active one

    def get(self, key):
        if (v := self.active.get(key)) is not None:
            return v
        if self.frozen is not None:
            if (v := self.frozen.get(key)) is not None:
                return v
        for sst in self.sstables:      # newest first
            if (v := sst.get(key)) is not None:
                return v
        return None

    def _begin_flush(self):
        # atomic swap: seal active, install fresh one
        self.frozen = self.active
        self.active = Memtable()
        # run flush (synchronously for now; see "async flush threads" below)
        path = f"sstable-{len(self.sstables):04d}.sst"
        flush_memtable(self.frozen, path)
        self.sstables.insert(0, SSTableReader(path))
        self.frozen = None             # drop frozen; WAL made it safe
The key invariants:
- Writes only go to active. The frozen one is immutable from the moment of the swap. This is what lets the flush iterate it without taking any lock on the data — by the time the flush runs, no other thread is allowed to touch the frozen memtable.
- Reads check active, then frozen, then on-disk SSTables. The frozen memtable is still live data as far as reads are concerned; just because it is being flushed does not mean the keys in it are unavailable. Only when the flush completes and the SSTable is installed does frozen go to None.
- At most one frozen memtable at a time. If a second flush would be triggered while the first is still running, the write path blocks until the first completes. Real systems queue multiple frozen memtables to avoid this stall; Cassandra's "immutable memtable" pool and RocksDB's max_write_buffer_number are exactly this knob.
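If you do not have the chapter-13 skip list handy, a minimal stand-in with the same surface (put, get, items, size_bytes, created_at) is enough to run the worked example below. This is a sketch: a sorted key list, so inserts are O(n) rather than O(log n) — fine for a demo, wrong for production:

```python
import bisect, time

class Memtable:
    """Dict plus sorted key list; same interface as the skip-list memtable,
    but with O(n) insert. Demo stand-in only."""
    def __init__(self):
        self._keys = []
        self._map = {}
        self.size_bytes = 0
        self.created_at = time.time()

    def put(self, key, value):
        if key in self._map:
            # overwrite: retract the old value's byte count
            self.size_bytes -= len(key.encode()) + len(self._map[key].encode())
        else:
            bisect.insort(self._keys, key)
        self._map[key] = value
        self.size_bytes += len(key.encode()) + len(value.encode())

    def get(self, key):
        return self._map.get(key)

    def items(self):
        # sorted order, like the skip list's in-order walk
        for k in self._keys:
            yield k, self._map[k]
```

Swap in the real skip list later; nothing in flush_memtable() or LsmStore cares which implementation is behind the interface.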
A worked flush
Flushing a 5-key memtable to `demo.sst`
Take a tiny memtable with five records and walk it through the flush step by step. BLOCK_SIZE is lowered to 64 bytes just for the demo, so blocks actually split.
# demo_flush.py
mt = Memtable() # skip list from chapter 13
mt.put("age", "19")
mt.put("city", "delhi")
mt.put("email", "dipti@padho.wiki")
mt.put("locale", "en-IN")
mt.put("name", "dipti")
# BLOCK_SIZE = 64 for the demo
flush_memtable(mt, "demo.sst")
Trace the flush:
items() yields in sorted order (8 bytes of header each — 4 for key length, 4 for value length):

  age=19          encoded: 8 + 3 + 2  = 13 bytes
  city=delhi      encoded: 8 + 4 + 5  = 17 bytes
  email=dipti...  encoded: 8 + 5 + 16 = 29 bytes
  locale=en-IN    encoded: 8 + 6 + 5  = 19 bytes
  name=dipti      encoded: 8 + 4 + 5  = 17 bytes

Block 0 (first_key=age):
  pack age=19       (13 B)   buffer: 13 B
  pack city=delhi   (17 B)   buffer: 30 B
  pack email=...    (29 B)   buffer: 59 B
  pack locale=en-IN (19 B)   buffer would be 78 B → over 64 B!
  → write current block (age + city + email = 59 B, pad to 64 B)
    sparse_index = [("age", 0)]
    file now at offset 64

Block 1 (first_key=locale):
  pack locale=en-IN (19 B)   buffer: 19 B
  pack name=dipti   (17 B)   buffer: 36 B
  end of memtable
  → flush final block (locale + name = 36 B, pad to 64 B)
    sparse_index = [("age", 0), ("locale", 64)]
    file now at offset 128

Sparse index written at offset 128: each entry is (4-byte key length, 8-byte offset, key) — 15 bytes for "age", 18 for "locale" — 33 bytes total.
Footer at the tail: (index_off=128, index_len=33, num_records=5, MAGIC).
fsync. rename demo.sst.tmp → demo.sst.
Five records, two blocks, two sparse-index entries, one footer. The file is 128 + 33 + 32 = 193 bytes. A reader would open it tail-first: read the last 32 bytes, find index_off=128, seek to 128, read 33 bytes of sparse index, binary-search for the target key, seek to the right data block, scan forward. The exact same read path as the sparse-index probe in chapter 12, now running over a self-describing file.
Notice what did not happen: nobody sorted anything. The memtable was already sorted. The flush is a linear transcription — one pass, one write, constant extra memory. That is the whole point of keeping the memtable sorted in RAM: the flush becomes trivial.
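The tail-first read path can be sketched end-to-end. To keep it self-contained, a small in-memory writer (mirroring flush_memtable, helper names invented for the sketch) builds the byte image that sstable_get then parses footer-first:

```python
import struct

BLOCK_SIZE = 64                    # demo-sized blocks, as in the worked example
MAGIC = b"SSTv1\x00\x00\x00"
FOOTER_SIZE = 32                   # three 8-byte fields + 8-byte magic

def _record(key, value):
    kb, vb = key.encode(), value.encode()
    return struct.pack(">II", len(kb), len(vb)) + kb + vb

def build_sstable(sorted_items):
    """In-memory mirror of flush_memtable: blocks, sparse index, footer."""
    out, sparse = bytearray(), []
    block, first, n = bytearray(), None, 0
    for k, v in sorted_items:
        rec = _record(k, v)
        if block and len(block) + len(rec) > BLOCK_SIZE:
            sparse.append((first, len(out)))
            out += block + b"\x00" * (BLOCK_SIZE - len(block))
            block, first = bytearray(), None
        if first is None:
            first = k
        block += rec
        n += 1
    if block:
        sparse.append((first, len(out)))
        out += block + b"\x00" * (BLOCK_SIZE - len(block))
    index_off = len(out)
    for k, off in sparse:
        kb = k.encode()
        out += struct.pack(">IQ", len(kb), off) + kb
    out += struct.pack(">QQQ", index_off, len(out) - index_off, n) + MAGIC
    return bytes(out)

def sstable_get(buf, key):
    """Tail-first lookup: footer -> sparse index -> one block scan."""
    assert buf[-8:] == MAGIC                      # sanity-check the magic
    index_off, index_len, _n = struct.unpack(">QQQ", buf[-FOOTER_SIZE:-8])
    entries, pos = [], index_off
    while pos < index_off + index_len:            # parse the sparse index
        klen, off = struct.unpack_from(">IQ", buf, pos)
        pos += 12
        entries.append((buf[pos:pos + klen].decode(), off))
        pos += klen
    hits = [off for first, off in entries if first <= key]
    if not hits:
        return None                               # sorts before every block
    block = buf[hits[-1]:hits[-1] + BLOCK_SIZE]   # last block that can hold it
    pos = 0
    while pos + 8 <= len(block):
        klen, vlen = struct.unpack_from(">II", block, pos)
        if klen == 0:                             # hit the zero padding
            return None
        pos += 8
        k = block[pos:pos + klen].decode(); pos += klen
        v = block[pos:pos + vlen].decode(); pos += vlen
        if k == key:
            return v
    return None
```

With the five demo records this builds the 193-byte image from the trace, and sstable_get("city") walks footer → index → block 0 → "delhi".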
Common confusions
- "Why block-align? Can I not just stream records one after another?" You can, and it will work — it will just be slower to read. Every disk I/O on every modern storage stack is in 4 KB pages (or 16 KB on some SSDs). If a record crosses a page boundary, a point lookup reads two pages instead of one. If blocks are aligned to 4 KB, each block lookup is exactly one page. The cost is the wasted padding at the end of each block — typically 1–3% of the file. The benefit is that every read is one page, and the sparse index can be one-entry-per-block instead of one-entry-per-record. The block is also the unit of compression (next section) and the unit of caching in the block cache. Every later LSM optimisation works on blocks, so the packing unit matters.
- "What happens to reads during a flush?" They succeed, and they see a consistent view of the data. The read path in get() shows why — it checks the active memtable, then the frozen one, then the on-disk SSTables. A key in the frozen memtable is still findable there until the flush finishes and the SSTable takes its place. The hand-off is two small steps (one list insertion, one nulling-out), and the order matters: install the SSTable first and the same key is briefly findable twice (reads still pick the newer copy first); clear frozen first and a read landing in the gap would miss the key entirely. That is why real engines do it in the order shown — install first, clear second — so the key is always findable. No reader ever sees "key temporarily missing."
- "What if the process crashes mid-flush?" The .tmp suffix saves you. Until the final os.rename(tmp, path), the SSTable does not officially exist — a restart sees demo.sst.tmp in the directory, recognises it as orphaned, and deletes it. The memtable it came from is not lost either: the WAL (Build 5, chapter 21) has a record of every write that was in the memtable, so on restart the memtable is rebuilt from the WAL and the process retries the flush. The tmp + fsync + rename pattern is the standard "atomic file publish" trick and is used in every serious on-disk format.
- "The memtable is 'dropped' after flush — does that mean the data is only on disk afterwards?" Yes, but the data was already on disk before the flush — it was in the WAL. The memtable is a fast read cache of the WAL's contents; the WAL is the durability source of truth. The flush copies data from memtable→SSTable (a cheaper, more compact, indexed representation), and then the WAL segment covering that memtable's writes can be garbage-collected. The memtable drop is a pure RAM release; no durability decision rides on it. (This only works because the WAL already fsynced every write; chapter 21 covers this explicitly. Without a WAL, a write is not durable until the flush's own fsync completes — the trade RocksDB makes when a write sets WriteOptions::disableWAL.)
- "Can the flush happen in the background without blocking writes?" Yes, and real engines do. The synchronous version above is the simplest correct implementation. Making it async is an easy change: spawn a thread at _begin_flush, and let it do the flush while the write path proceeds with the new active memtable. The only subtlety is backpressure — if writes are faster than flushes, the pool of frozen memtables grows, and eventually you must stall writes to avoid unbounded RAM. The "Going deeper" section below sketches the machinery.
- "Does the sparse index have to be one block?" No. Small SSTables have a one-block sparse index; large ones need multiple blocks, and often a second-level index that indexes the first-level one. LevelDB uses a two-level scheme: a top-level "index block" pointing at "data blocks," plus a "meta-index block" pointing at auxiliary blocks such as Bloom filters. The shape grows, but the principle — index at the tail, footer at the end, self-describing — does not change.
Going deeper
You have the flush. This section fills in the three production concerns — compression, async flushes, and how the flush fits into the wider crash-safety story.
Block compression (preview of Build 15)
The 4 KB block is not just an alignment unit; it is also a compression unit. Because keys in a sorted SSTable share long prefixes (user:00001, user:00002, ...) and values often repeat (country codes, enum strings, zero-padded integers), a single block typically compresses by 3–5× with a general-purpose algorithm like LZ4 or Snappy.
Compression is a per-block decision because you want to be able to decompress one block in isolation — a point lookup still reads exactly one block from disk and decompresses it in RAM. If you compressed the whole SSTable as one stream, a point lookup would require decompressing from the start up to the target, which wipes out the O(\log n) access of the sparse index.
The flush algorithm barely changes: instead of writing block directly, you compress it first, and record the compressed size in the sparse index (alongside the original block offset). Each block gets a one-byte "compression type" prefix (0 = none, 1 = snappy, 2 = zstd...) so the reader can dispatch to the right decoder.
Build 15 walks through this in detail with benchmarks; for now it is enough to know the block structure was chosen partly to make compression plug in cleanly.
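The per-block codec dispatch can be sketched with zlib from the standard library as a stand-in codec (real engines ship Snappy, LZ4, or zstd, but the one-byte-type mechanism is the same; the function names here are invented for the sketch):

```python
import zlib

COMPRESSION_NONE, COMPRESSION_ZLIB = 0, 1

def encode_block(block: bytes) -> bytes:
    """Compress one block; fall back to 'none' when compression does not
    help (already-compressed or random data can grow under a second pass)."""
    packed = zlib.compress(block)
    if len(packed) < len(block):
        return bytes([COMPRESSION_ZLIB]) + packed
    return bytes([COMPRESSION_NONE]) + block

def decode_block(buf: bytes) -> bytes:
    """Dispatch on the one-byte type prefix, as the SSTable reader would."""
    ctype, payload = buf[0], bytes(buf[1:])
    if ctype == COMPRESSION_ZLIB:
        return zlib.decompress(payload)
    if ctype == COMPRESSION_NONE:
        return payload
    raise ValueError(f"unknown compression type {ctype}")
```

A block of repetitive sorted keys shrinks dramatically; incompressible bytes pass through with one byte of overhead and no corruption risk.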
Async flush threads and backpressure
Real engines run the flush on a background thread so the write path never blocks on disk I/O. Pseudocode:
# async flush sketch

def _begin_flush(self):
    frozen = self.active
    self.active = Memtable()
    self.frozen_queue.append(frozen)      # multiple may queue
    self.flush_executor.submit(self._do_flush, frozen)

def _do_flush(self, frozen):
    path = f"sstable-{next_id()}.sst"
    flush_memtable(frozen, path)
    with self.lock:
        self.sstables.insert(0, SSTableReader(path))
        self.frozen_queue.remove(frozen)
Three new concerns appear the moment you go async.
- Ordering. Two flushes can be in flight at once. They must install their SSTables in the correct order — the one from the older memtable goes at the older position in the sstables list. Solution: assign sequence numbers at freeze time, install at finish in sequence-number order (reorder with a small priority queue if finishes race).
- Backpressure. If writes are faster than flushes, the frozen queue grows without bound. Solution: bound the queue (typically to 2 or 3), and when it is full, make the next put() block until a flush finishes. This is a smooth degradation — under sustained overload, writes slow to the speed flushes can drain at, but correctness is preserved. RocksDB calls this a "write stall."
- Crash during a pending flush. If the process dies with two frozen memtables queued and one flush complete but not installed, the recovery logic must rebuild the right memtables from the right WAL segments and replay. Chapter 22 covers this.
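The write-stall mechanics in the backpressure bullet reduce to a bounded counter guarded by a condition variable. A sketch (class and method names invented here):

```python
import threading

MAX_FROZEN = 2      # bound on queued immutable memtables

class FlushGate:
    """put() calls wait_for_slot() before freezing a memtable; the flush
    thread calls release_slot() after installing the SSTable."""
    def __init__(self):
        self._cond = threading.Condition()
        self._frozen = 0

    def wait_for_slot(self):
        with self._cond:
            while self._frozen >= MAX_FROZEN:
                self._cond.wait()        # the "write stall"
            self._frozen += 1

    def release_slot(self):
        with self._cond:
            self._frozen -= 1
            self._cond.notify()          # wake one stalled writer
```

When flushes keep pace, the gate is never contended and adds only an uncontended lock acquire per freeze; under sustained overload, writers park in wait_for_slot until a flush drains a slot.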
The synchronous flush in this chapter is a simpler starting point. Graduate to async once the write path exists and measurements show flushes stalling it.
Atomicity of flush — the tmp + fsync + rename trick
The pattern in the flush.py code is a classic atomic-file-publish sequence, and it is worth being pedantic about why each step is where it is:
- Write to path.tmp, not path. Until the rename, no other code in the system treats path.tmp as a real SSTable. This is the "uncommitted" state.
- fsync() the tmp file before rename. On POSIX, rename() is atomic with respect to the filesystem's view of directory entries, but the contents of the file are not guaranteed to be on disk at the moment of rename. If the host loses power between the rename and the contents hitting the disk, you can find path in the directory but empty or torn. Fsync first.
- rename(tmp, path) is the commit point. After this call returns, a reader opening path will see a complete SSTable. Before this call, no reader should have been looking at path.
- fsync() the directory. To be fully paranoid you also have to fsync() the directory after the rename, to make the directory entry itself durable. ext4's default data=ordered mode happens to handle this for you; some filesystems do not. Production code does the dir-fsync explicitly.
This pattern is how every serious on-disk format commits: SQLite, LevelDB, PostgreSQL's base backup, and Git's loose-object writes all use variations of tmp + fsync + rename. It is cheap, it is crash-safe, and it has no moving parts. [1][3]
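The whole sequence fits in one helper — data fsync, rename, then the directory fsync from the last bullet. This sketch assumes a POSIX system where a directory can be opened read-only just to fsync it (true on Linux and macOS, not on Windows):

```python
import os

def atomic_publish(tmp_path: str, final_path: str) -> None:
    """tmp + fsync + rename + dir-fsync: crash-safe publish of a new file."""
    fd = os.open(tmp_path, os.O_RDONLY)
    try:
        os.fsync(fd)                      # 1. file contents durable
    finally:
        os.close(fd)
    os.rename(tmp_path, final_path)       # 2. atomic commit point
    dir_path = os.path.dirname(os.path.abspath(final_path))
    dfd = os.open(dir_path, os.O_RDONLY)  # 3. make the new entry durable
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

flush_memtable already fsyncs through its own open file handle before renaming; a helper like this is how you would factor the pattern out once a second file format (the WAL, the manifest) needs it too.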
Integration with the write-ahead log (preview of Build 5)
The memtable is a RAM-only structure. A process crash wipes it. So how does a put() ever become durable before the next flush?
The answer is the write-ahead log (WAL): every put() first appends (key, value, timestamp) to a log file and fsyncs, and then inserts into the memtable. If the process crashes, the WAL survives; on restart, the WAL is replayed into a fresh memtable and the store is back where it was.
That gives the flush a crucial property: the flush does not have to be fsynced synchronously with the write. The WAL already made the write durable. The flush can run on its own schedule, and when it finishes (with its own fsync, of course), the corresponding WAL segments can be reclaimed — their contents are now in the SSTable, which is the more compact, indexed form.
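A toy version of that loop — length-prefixed appends, an fsync per record, and a replay on restart — shows the shape. Chapter 21 builds the real thing (checksums, segment rotation, reclamation); everything here, including the class name, is a sketch:

```python
import os, struct

class ToyWal:
    """Append-only log of (key, value) records, fsynced per append."""
    def __init__(self, path):
        self.path = path
        self._f = open(path, "ab")

    def append(self, key, value):
        kb, vb = key.encode(), value.encode()
        self._f.write(struct.pack(">II", len(kb), len(vb)) + kb + vb)
        self._f.flush()
        os.fsync(self._f.fileno())   # durable BEFORE the memtable insert

    def close(self):
        self._f.close()

    @staticmethod
    def replay(path):
        """Yield every record in append order, e.g. into a fresh memtable."""
        with open(path, "rb") as f:
            data = f.read()
        pos = 0
        while pos + 8 <= len(data):
            klen, vlen = struct.unpack_from(">II", data, pos)
            pos += 8
            key = data[pos:pos + klen].decode(); pos += klen
            value = data[pos:pos + vlen].decode(); pos += vlen
            yield key, value
```

The write path becomes wal.append(k, v) then memtable.put(k, v), in that order; recovery is replay() feeding a fresh memtable.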
This is why the memtable "can be dropped once durable" in the chapter's TL;DR — the frozen memtable is dropped as soon as the SSTable is renamed, not because the data is no longer valuable, but because the same data is now in two durable places (WAL and SSTable), and the WAL's copy is the one we can reclaim first.
Build 5 builds the WAL from scratch (chapter 21) and shows the exact sequence: write→WAL→fsync→memtable, flush→SSTable→fsync→drop WAL segment. For this chapter, it is enough to know that the flush sits inside a larger durability loop and the flush's own fsync is just one link in that chain.
Where this leads next
You now have the three parts of a working LSM-tree write path: an in-memory sorted memtable that absorbs writes, a flush that turns the memtable into an SSTable, and a stack of SSTables on disk. What you do not yet have is a coherent read across all of them.
A get(key) in this design must check:
- The active memtable (might have the newest value).
- The frozen memtable, if any (slightly older, still in RAM).
- Each SSTable on disk, from newest to oldest, stopping at the first hit.
That is the topic of the next chapter — the read path. The subtlety is not the algorithm itself (which is a four-line loop) but the efficiency: if you have a hundred SSTables and the key you want lives only in the newest memtable, you do not want to open and probe all hundred files. Bloom filters (chapter 16) are the fix, and they turn what looks like an O(n) file-scan into a ~1-file probe with negligible false-positive cost.
For now, though, take a moment with what you have built. A memtable plus a flush plus an SSTable reader is already a working key-value store. It handles more data than RAM. It survives restarts (once the WAL arrives in Build 5). It serves range queries cheaply. And the write path is a single O(\log n) insert plus an occasional background flush — the cheapest write path any durable database can offer.
Fifty lines of flush code bridged the biggest gap in chapter 12's design: how do you keep a file sorted while accepting random writes? Answer: you don't — you sort the writes in RAM, batch them up, and transcribe them to disk in order. The flush is the transcription. The sparse index from chapter 12 sits at the tail of every file the flush produces. The merge from the same chapter is what will later combine the files the flush keeps producing. Everything clicks together, and Build 3's machine is almost complete.
References
- Sanjay Ghemawat and Jeff Dean, LevelDB Implementation Notes — describes the block-aligned SSTable layout, footer-at-the-tail design, and the tmp + fsync + rename atomic-publish pattern used for new table files. github.com/google/leveldb/blob/main/doc/impl.md.
- Fay Chang et al., Bigtable: A Distributed Storage System for Structured Data (OSDI 2006) — §4 introduces the memtable/SSTable split and the flush operation. research.google.
- Pillai et al., All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications (OSDI 2014) — empirical study of how real applications get the fsync/rename ordering wrong and what filesystems guarantee. usenix.org.
- Siying Dong et al., Optimizing Space Amplification in RocksDB (CIDR 2017) — discusses memtable sizing, flush policies, and the interaction between flushes and compactions in production. cidrdb.org.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 — walks through the full memtable→flush→SSTable lifecycle with the same shape as this chapter. dataintensive.net.
- Avinash Lakshman and Prashant Malik, Cassandra — A Decentralized Structured Storage System (LADIS 2009) — §5.1 describes Cassandra's memtable-flush mechanism and the immutable-memtable pool that motivates the active/frozen split. cassandra.apache.org.