In short

The memtable from chapter 13 is fast, but it is in RAM and it is bounded in size. Eventually it gets full — typically 32 MB to 256 MB — and something has to move it out. That "something" is a flush: a single sequential pass over the memtable in sorted order, packing records into block-aligned chunks on disk, emitting a tiny sparse index at the tail, and closing with a footer that points back at the index. Because the memtable is already sorted (a skip list walks in order), the flush is not a sort — it is a transcription. The trick that makes it safe under concurrent writes is the active/frozen swap: the moment you decide to flush, the current memtable is marked immutable ("frozen") and a fresh empty one is installed as the "active" writer. Reads check both until the flush completes; writes go only to the active one. When the flush finishes, the frozen memtable is dropped (the WAL has already made it durable; see Build 5), and an SSTable sits on disk, ready to be read. This chapter writes the fifty-line flush_memtable(), draws the SSTable layout and the active/frozen swap, and handles the three boring-but-critical details — block alignment, atomic rename, and what "done" means when a crash can happen at any byte.

You have a memtable. It is a skip list holding sorted key-value pairs in RAM, it accepts writes in O(\log n), and — the non-negotiable part — it has a memory budget. In LevelDB the default is 4 MB. In RocksDB it is 64 MB. In Cassandra it is a fraction of the JVM heap. Pick your number, but there is always a number, and when the memtable crosses it, you have a problem.

You cannot just grow the memtable forever. RAM is finite; a process that never evicts anything from RAM either crashes or drives the host into swap. You cannot insert into an on-disk SSTable in place, because the on-disk files are sorted and immutable by design (that is why chapter 12 cost you the keydir — to gain sort order you gave up in-place update).

So the only move is to move the memtable to disk as a complete, new SSTable. That operation is called a flush, and it is the hinge on which Build 3 turns. Memtable absorbs random writes cheaply. Flush turns those absorbed writes into a sorted, durable file. SSTable holds them forever, until a compaction (next few chapters) decides otherwise.

This chapter builds the flush.

When to flush

Three signals trigger a flush in real systems, and any production engine uses all three:

  1. Size. The memtable crosses a byte threshold — 32 MB, 64 MB, 256 MB depending on the engine. This is the dominant trigger under heavy write load. It bounds RAM usage and bounds the size of each resulting SSTable.
  2. Time. Even a nearly empty memtable gets flushed after some maximum age — typically five or ten minutes. This bounds how long a write can sit in RAM before it becomes a durable SSTable (the WAL covers crash-safety in the meantime, but flushing still matters for the read path and for eventually reclaiming WAL space).
  3. Pressure from new writes. If a write arrives and the memtable is already full or nearly so, the write path itself triggers the flush. This is a back-pressure mechanism: if flushes cannot keep up with writes, the write path stalls briefly until the flush completes, rather than letting RAM usage grow without bound.

You will implement all three as a single should_flush() predicate. The shape:

# flush_policy.py — when to flush
MEMTABLE_MAX_BYTES = 64 * 1024 * 1024   # 64 MB
MEMTABLE_MAX_AGE_SECONDS = 10 * 60      # 10 minutes

def should_flush(memtable, now):
    """Return True if the given memtable should be sealed and written to disk."""
    if memtable.size_bytes >= MEMTABLE_MAX_BYTES:
        return True                                  # size trigger
    if now - memtable.created_at >= MEMTABLE_MAX_AGE_SECONDS:
        return True                                  # time trigger
    return False

The pressure trigger is not in this predicate — it is a consequence of a write finding should_flush() already true on entry, and synchronously kicking off the flush before returning. Real engines run the flush on a background thread so writes never block on disk I/O, but the first implementation can be synchronous and still be correct. Async is an optimisation (covered at the end).
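To make the triggers concrete, here is the predicate exercised against a stub — SimpleNamespace stands in for the real memtable, and the constants are restated so the snippet runs on its own:

```python
from types import SimpleNamespace

MEMTABLE_MAX_BYTES = 64 * 1024 * 1024        # restated from flush_policy.py
MEMTABLE_MAX_AGE_SECONDS = 10 * 60

def should_flush(memtable, now):
    if memtable.size_bytes >= MEMTABLE_MAX_BYTES:
        return True                          # size trigger
    if now - memtable.created_at >= MEMTABLE_MAX_AGE_SECONDS:
        return True                          # time trigger
    return False

fresh = SimpleNamespace(size_bytes=1024, created_at=1000.0)
full  = SimpleNamespace(size_bytes=64 * 1024 * 1024, created_at=1000.0)
stale = SimpleNamespace(size_bytes=1024, created_at=1000.0)

print(should_flush(fresh, 1001.0))           # False — under both limits
print(should_flush(full, 1001.0))            # True  — size trigger
print(should_flush(stale, 1000.0 + 601))     # True  — time trigger
```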

The flush algorithm

Pseudocode first, because the shape is small and the details will look scary if you have not seen the outline:

flush_memtable(mt, path):
    1. seal mt (no more writes)           # active/frozen swap happens here
    2. open a temp file at path.tmp
    3. for each (key, value) in mt.in_sorted_order():
           pack record into the current block buffer
           if block buffer >= BLOCK_SIZE:
               write block to file
               record (first_key_of_block, offset) in sparse_index
               start a new block buffer
    4. flush any partial block
    5. write sparse_index as a single block
    6. write footer: (sparse_index_offset, sparse_index_length, magic)
    7. fsync the file
    8. rename path.tmp -> path      # atomic on any POSIX filesystem
    9. drop the frozen memtable

Nine steps. Three of them are the real work (3, 5, 6); three are concurrency/durability hygiene (1, 7, 8); three are bookkeeping (2, 4, 9). Let us draw the on-disk shape first, then write the code.

[Figure: SSTable file layout — data blocks, sparse index, footer. sstable-0001.sst laid out left to right: data block 0 (4 KB, first_key = age), data block 1 (4 KB, first_key = locale), data block 2 (4 KB, first_key = role), and so on; then the sparse index block (a few KB, one entry per data block: age → 0, locale → 4096, role → 8192, ...); then a 32 B footer holding index_off, index_len, num_records, and magic. Total size = ceil(data / 4 KB) × 4 KB + index + 32 B footer.]
An SSTable on disk. Data blocks come first (block-aligned, each one 4 KB here), the sparse index follows as a single block, and a fixed-size footer sits at the very end. A reader starts at the tail — read the last 32 bytes, learn where the index lives, read the index, binary-search it, seek to the right data block, scan for the target key. The whole file is laid out for exactly this access pattern.

Three design decisions are baked into this layout. Each one earns its keep.

Block alignment. Data is packed into fixed-size blocks (here, 4 KB). A record straddles a block boundary only if it is itself bigger than a block; otherwise each block holds a whole number of records with a little padding at the end. The payoff is that every disk read is one page — the OS page cache, the SSD's FTL, and the filesystem all work in pages, so aligning to page boundaries means one read = one block = one page of I/O. You never pay to read half of an adjacent block just because your target happened to span two pages.

Sparse index at the tail, not the head. The index is written after the data because you do not know the block offsets until you have written the blocks. Writing sequentially, top-to-bottom, means you discover offsets as you go and emit the index only once, at the end, in one pass.

Footer at the absolute end. A reader does not know where the index starts — it could be anywhere, because data size varies. The solution is the footer: a fixed-size block at a known position (end minus 32 bytes). Read the footer first, learn the index offset, then seek to it. This makes SSTables self-describing: no external metadata file needed.
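A reader's first move can be sketched directly from this layout. The struct format matches the flush code later in the chapter; the read_footer helper and the fake file bytes are illustrative:

```python
import struct

MAGIC = b"SSTv1\x00\x00\x00"                 # same 8-byte magic as flush.py
FOOTER_SIZE = 32                             # 3 x u64 + 8-byte magic

def read_footer(file_bytes):
    footer = file_bytes[-FOOTER_SIZE:]
    if footer[24:] != MAGIC:
        raise ValueError("not an SSTable (bad magic)")
    index_off, index_len, num_records = struct.unpack(">QQQ", footer[:24])
    return index_off, index_len, num_records

# a fake file: 128 B of data blocks, 33 B of index, then the footer
fake = b"\x00" * 128 + b"\x01" * 33 + struct.pack(">QQQ", 128, 33, 5) + MAGIC
print(read_footer(fake))                     # (128, 33, 5)
```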

The code

Here is the fifty-line flush. It assumes a memtable that supports .items() yielding (key, value) in sorted order (the skip list from chapter 13 does this for free):

# flush.py — turn a sorted memtable into a block-aligned SSTable on disk
import os, struct

BLOCK_SIZE = 4096
MAGIC = b"SSTv1\x00\x00\x00"          # 8 bytes; reader checks this
FOOTER_SIZE = 8 + 8 + 8 + 8             # index_off, index_len, num_records, magic

def _encode_record(key, value):
    kb, vb = key.encode("utf-8"), value.encode("utf-8")
    return struct.pack(">II", len(kb), len(vb)) + kb + vb

def flush_memtable(mt, path):
    """
    Write the memtable's sorted contents to a new SSTable at `path`.
    Uses a .tmp suffix and atomic rename so partial files never survive a crash.
    """
    tmp = path + ".tmp"
    sparse_index = []                       # list of (first_key, block_offset)
    num_records = 0
    block = bytearray()
    block_first_key = None
    with open(tmp, "wb") as f:
        for key, value in mt.items():                       # sorted order
            rec = _encode_record(key, value)
            if block and len(block) + len(rec) > BLOCK_SIZE:
                # current block is full — pad and flush it
                block += b"\x00" * (BLOCK_SIZE - len(block))
                sparse_index.append((block_first_key, f.tell()))
                f.write(block)
                block, block_first_key = bytearray(), None
            if block_first_key is None:
                block_first_key = key                       # first key of new block
            block += rec
            num_records += 1
        if block:                                           # flush final partial block
            block += b"\x00" * (BLOCK_SIZE - len(block))
            sparse_index.append((block_first_key, f.tell()))
            f.write(block)
        # write sparse index as one block
        index_off = f.tell()
        index_buf = bytearray()
        for k, off in sparse_index:
            kb = k.encode("utf-8")
            index_buf += struct.pack(">IQ", len(kb), off) + kb
        f.write(index_buf)
        index_len = len(index_buf)
        # write footer at the very end
        f.write(struct.pack(">QQQ", index_off, index_len, num_records) + MAGIC)
        f.flush()
        os.fsync(f.fileno())                                # durable before rename
    os.rename(tmp, path)                                    # atomic publish
    return {"path": path, "records": num_records, "blocks": len(sparse_index)}

Fifty lines if you count the encoder. Read it once with the layout diagram open, and trace what a single record does: it gets encoded (4 bytes key-length, 4 bytes value-length, key, value), appended to the current block buffer, and — if the block is full — the buffer gets padded to 4096 bytes and flushed to the file. Meanwhile the sparse-index list grows by one entry per block, not one entry per record.

Why pad the block instead of leaving it short: if you write a 3840-byte block followed by the next block, the file positions no longer land on 4 KB boundaries, and the "one disk read = one block" property is gone. Padding is wasted bytes, but at 4 KB blocks the waste is at most a few percent and the alignment is worth every lost byte.
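To put a number on the waste — assuming a hypothetical 100-byte average encoded record:

```python
BLOCK_SIZE = 4096
avg_record = 100                               # hypothetical average encoded record
records_per_block = BLOCK_SIZE // avg_record   # 40 whole records fit per block
used = records_per_block * avg_record          # 4000 bytes of records
waste = BLOCK_SIZE - used                      # 96 bytes of padding
print(f"{waste / BLOCK_SIZE:.1%} padding per block")   # 2.3% padding per block
```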

Why fsync() before rename(): write() only copies bytes into the kernel's page cache; until an fsync(), the file could vanish in a crash. And rename() is only atomic with respect to the filesystem metadata — if the data has not been fsynced, the renamed file can exist on disk as a zero-byte (or partially written) entry. Fsync the data, then rename; that is the safe order. [3]

Why the magic bytes at the end, not the beginning: the reader of an SSTable starts at the tail. It reads the last 32 bytes, sanity-checks the magic, and extracts the index offset. Magic at the head would require a second seek. LevelDB, RocksDB, Parquet, and many others put their magic and metadata at the tail for exactly this reason.

During-flush writes: the active/frozen swap

Here is the question the pseudocode hides. A flush takes time — disks are not instantaneous, and a 64 MB memtable dumping to SSD at 500 MB/s still needs 128 milliseconds. During that 128 ms, a write arrives. It cannot go into the memtable that is currently being iterated (you would invalidate the iterator, and the ordering guarantees evaporate). It cannot be dropped. It must go somewhere.

The answer is the active/frozen swap. It is worth drawing because the picture makes the flow obvious.

[Figure: the active/frozen memtable swap, three panels along a time axis. Before the flush, a single active memtable M0 receives all put() calls. When the flush begins, M0 is sealed as the frozen memtable and streamed out to sstable-0001.sst while a fresh, empty active memtable M1 takes the new writes. When the flush is done, M0 is dropped, M1 is the only memtable, and sstable-0001.sst sits durable on disk.]
The active/frozen swap. Before the flush, one memtable (M0) accepts all writes. When the flush starts, M0 is atomically sealed and renamed "frozen"; a fresh M1 is installed as the active writer. New writes go to M1; the flush iterates M0 in peace. When the SSTable is durable on disk, the frozen M0 can be dropped. At most one memtable is being written to at any moment, but reads check both until the flush completes.

The swap is the one non-trivial piece of concurrency in the whole design, and it is small. A sketch:

# store.py — active/frozen swap
import time

class LsmStore:
    def __init__(self):
        self.active = Memtable()             # accepts writes
        self.frozen = None                   # being flushed, or None
        self.sstables = []                   # on-disk, newest-first

    def put(self, key, value):
        if should_flush(self.active, time.time()):
            self._begin_flush()
        self.active.put(key, value)          # always to the active one

    def get(self, key):
        if (v := self.active.get(key)) is not None: return v
        if self.frozen is not None:
            if (v := self.frozen.get(key)) is not None: return v
        for sst in self.sstables:            # newest first
            if (v := sst.get(key)) is not None: return v
        return None

    def _begin_flush(self):
        # atomic swap: seal active, install fresh one
        self.frozen = self.active
        self.active = Memtable()
        # run flush (synchronously for now; see "async flush threads" below)
        path = f"sstable-{len(self.sstables):04d}.sst"
        flush_memtable(self.frozen, path)
        self.sstables.insert(0, SSTableReader(path))
        self.frozen = None                   # drop frozen; WAL made it safe

The key invariants:

  1. Writes go only to the active memtable; the frozen one is immutable.
  2. Reads check the active memtable first, then the frozen one, then the SSTables newest-first — newer data always shadows older.
  3. In this synchronous version, at most one memtable is frozen at a time.
  4. The frozen memtable is dropped only after its SSTable has been fsynced and renamed into place.

A worked flush

Flushing a 5-key memtable to `demo.sst`

Take a tiny memtable with five records and walk it through the flush step by step. BLOCK_SIZE is lowered to 64 bytes just for the demo, so blocks actually split.

# demo_flush.py
mt = Memtable()                             # skip list from chapter 13
mt.put("age",    "19")
mt.put("city",   "delhi")
mt.put("email",  "dipti@padho.wiki")
mt.put("locale", "en-IN")
mt.put("name",   "dipti")
# BLOCK_SIZE = 64 for the demo

flush_memtable(mt, "demo.sst")

Trace the flush:

items() yields in sorted order:
    age=19          encoded: 8 + 3 + 2  = 13 bytes
    city=delhi      encoded: 8 + 4 + 5  = 17 bytes
    email=dipti...  encoded: 8 + 5 + 16 = 29 bytes
    locale=en-IN    encoded: 8 + 6 + 5  = 19 bytes
    name=dipti      encoded: 8 + 4 + 5  = 17 bytes

Block 0 (first_key=age):
    pack age=19 (13 B)       buffer: 13 B
    pack city=delhi (17 B)   buffer: 30 B
    pack email=... (29 B)    buffer: 59 B — still fits
    locale=en-IN (19 B) would make 78 B → over 64 B!
    → write current block (age + city + email = 59 B, pad to 64 B)
       sparse_index = [("age", 0)]
       file now at offset 64
Block 1 (first_key=locale):
    pack locale=en-IN (19 B) buffer: 19 B
    pack name=dipti (17 B)   buffer: 36 B
    end of memtable
    → flush final block (locale + name = 36 B, pad to 64 B)
       sparse_index = [("age", 0), ("locale", 64)]
       file now at offset 128

Sparse index written at offset 128: each entry is 12 bytes of header plus
the key, so 15 B for "age" and 18 B for "locale" — 33 bytes total.
Footer at the tail: (index_off=128, index_len=33, num_records=5, MAGIC).
fsync. rename demo.sst.tmp → demo.sst.

Five records, two blocks, two sparse-index entries, one footer. The file is 128 + 33 + 32 = 193 bytes. A reader would open it tail-first: read the last 32 bytes, find index_off=128, seek to 128, read 33 bytes of sparse index, binary-search for the target key, seek to the right data block, scan forward. The exact same read path as the sparse-index probe in chapter 12, now running over a self-describing file.

Notice what did not happen: nobody sorted anything. The memtable was already sorted. The flush is a linear transcription — one pass, one write, constant extra memory. That is the whole point of keeping the memtable sorted in RAM: the flush becomes trivial.

Common confusions

"The flush sorts the data." It does not. The memtable is already sorted; the flush is a single linear transcription of that order to disk. If your flush contains a sort, something upstream is broken.

"The flush is what makes a write durable." No — the WAL does that, at write time (Build 5). The flush makes the data durable in its compact, indexed, on-disk form, and lets the corresponding WAL space be reclaimed.

"The frozen memtable is unreadable." It is immutable, not invisible. Reads must keep checking it until the flush completes, or a recently written key would briefly disappear.

Going deeper

You have the flush. This section fills in the three production concerns — compression, async flushes, and how the flush fits into the wider crash-safety story.

Block compression (preview of Build 15)

The 4 KB block is not just an alignment unit, it is a compression unit. Because keys in a sorted SSTable have long shared prefixes (user:00001, user:00002, ...) and values often repeat (country codes, enum strings, zero-padded integers), a single block typically compresses by 3–5× with a general-purpose algorithm like LZ4 or Snappy.

Compression is a per-block decision because you want to be able to decompress one block in isolation — a point lookup still reads exactly one block from disk and decompresses it in RAM. If you compressed the whole SSTable as one stream, a point lookup would require decompressing from the start up to the target, which wipes out the O(\log n) access of the sparse index.

The flush algorithm barely changes: instead of writing block directly, you compress it first, and record the compressed size in the sparse index (alongside the original block offset). Each block gets a one-byte "compression type" prefix (0 = none, 1 = snappy, 2 = zstd...) so the reader can dispatch to the right decoder.
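A sketch of the per-block scheme, using stdlib zlib as a stand-in for LZ4/Snappy — the type-prefix values here are illustrative, not a fixed format:

```python
import zlib

COMPRESSION_NONE, COMPRESSION_ZLIB = 0, 1

def compress_block(block: bytes) -> bytes:
    packed = zlib.compress(block)
    if len(packed) < len(block):                 # keep it only if it shrank
        return bytes([COMPRESSION_ZLIB]) + packed
    return bytes([COMPRESSION_NONE]) + block     # incompressible: store raw

def decompress_block(stored: bytes) -> bytes:
    tag, body = stored[0], stored[1:]            # dispatch on the type byte
    return zlib.decompress(body) if tag == COMPRESSION_ZLIB else body

block = b"user:00001=in\x00user:00002=in\x00" * 100   # repetitive, compresses well
stored = compress_block(block)
print(len(block), "->", len(stored))                  # shrinks substantially
print(decompress_block(stored) == block)              # True — lossless round trip
```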

Build 15 walks through this in detail with benchmarks; for now it is enough to know the block structure was chosen partly to make compression plug in cleanly.

Async flush threads and backpressure

Real engines run the flush on a background thread so the write path never blocks on disk I/O. Pseudocode:

# async flush sketch
def _begin_flush(self):
    frozen = self.active
    self.active = Memtable()
    self.frozen_queue.append(frozen)         # multiple may queue
    self.flush_executor.submit(self._do_flush, frozen)

def _do_flush(self, frozen):
    path = f"sstable-{next_id()}.sst"
    flush_memtable(frozen, path)
    with self.lock:
        self.sstables.insert(0, SSTableReader(path))
        self.frozen_queue.remove(frozen)

Three new concerns appear the moment you go async:

  1. Multiple frozen memtables. A write burst can fill the new active memtable before the previous flush finishes, so a queue of frozen memtables builds up — and reads must check every one of them, newest first.
  2. Shared state needs a lock. The sstables list and the frozen queue are now mutated by both the write path and the flush thread; that is what the lock in the sketch guards.
  3. Backpressure. If flushes persistently lag behind writes, the frozen queue grows without bound; past some queue depth the write path has to stall until a flush completes.

The synchronous flush in this chapter is a simpler starting point. Graduating to async is the next step, once the write path exists and measurements show the synchronous stalls actually hurt.
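A minimal sketch of that backpressure, assuming a Condition-guarded count of frozen memtables — MAX_FROZEN and the method names are illustrative:

```python
import threading

MAX_FROZEN = 2                               # illustrative queue-depth limit

class FlushBackpressure:
    def __init__(self):
        self.frozen_count = 0                # memtables sealed but not yet flushed
        self.cond = threading.Condition()

    def before_swap(self):                   # write path calls this before freezing
        with self.cond:
            while self.frozen_count >= MAX_FROZEN:
                self.cond.wait()             # stall the writer until a flush lands
            self.frozen_count += 1

    def flush_done(self):                    # flush thread calls this when finished
        with self.cond:
            self.frozen_count -= 1
            self.cond.notify_all()           # wake any stalled writers

bp = FlushBackpressure()
bp.before_swap()                             # first freeze: proceeds
bp.before_swap()                             # second freeze: proceeds, queue now full
bp.flush_done()                              # a flush completes, freeing a slot
bp.before_swap()                             # third freeze: proceeds again
print(bp.frozen_count)                       # 2
```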

Atomicity of flush — the tmp + fsync + rename trick

The pattern in the flush.py code is a classic atomic-file-publish sequence, and it is worth being pedantic about why each step is where it is:

  1. Write to path.tmp, not path. Until the rename, no other code in the system treats path.tmp as a real SSTable. This is the "uncommitted" state.
  2. fsync() the tmp file before rename. On POSIX, rename() is atomic with respect to the filesystem's view of directory entries, but the contents of the file are not guaranteed to be on disk at the moment of rename. If the host loses power between the rename and the contents hitting the disk, you can find path in the directory but empty or torn. Fsync first.
  3. rename(tmp, path) is the commit point. After this call returns, a reader opening path will see a complete SSTable. Before this call, no reader should have been looking at path.
  4. fsync() the directory. To be thorough you also fsync() the directory after the rename, so the new directory entry itself is durable. ext4 in its default data=ordered mode tends to get this right in practice; some filesystems do not, and POSIX does not guarantee it. Production code does the dir-fsync explicitly.

This pattern is how every serious on-disk format commits: SQLite, LevelDB, PostgreSQL's base backup, and Git's loose-object writes all use variations of tmp + fsync + rename. It is cheap, it is crash-safe, and it has no moving parts. [1][3]
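The four steps fit in one helper — a POSIX-flavoured sketch, with the directory fsync as step 4:

```python
import os, tempfile

def atomic_publish(data: bytes, path: str):
    tmp = path + ".tmp"                      # 1. uncommitted state
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())                 # 2. data durable before rename
    os.rename(tmp, path)                     # 3. the commit point
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)                      # 4. directory entry durable too
    finally:
        os.close(dirfd)

d = tempfile.mkdtemp()
atomic_publish(b"pretend this is an SSTable", os.path.join(d, "table.sst"))
print(sorted(os.listdir(d)))                 # ['table.sst'] — no .tmp survives
```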

Integration with the write-ahead log (preview of Build 5)

The memtable is a RAM-only structure. A process crash wipes it. So how does a put() ever become durable before the next flush?

The answer is the write-ahead log (WAL): every put() first appends (key, value, timestamp) to a log file and fsyncs, and then inserts into the memtable. If the process crashes, the WAL survives; on restart, the WAL is replayed into a fresh memtable and the store is back where it was.
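In miniature, the ordering looks like this — TinyWal and its record framing are illustrative, not Build 5's actual format:

```python
import os, struct, tempfile

class TinyWal:
    """Append-only log; every record is fsynced before put() returns."""
    def __init__(self, path):
        self.f = open(path, "ab")

    def append(self, key: str, value: str):
        kb, vb = key.encode("utf-8"), value.encode("utf-8")
        self.f.write(struct.pack(">II", len(kb), len(vb)) + kb + vb)
        self.f.flush()
        os.fsync(self.f.fileno())            # durable before RAM ever sees it

def put(wal, memtable, key, value):
    wal.append(key, value)                   # 1. write -> WAL -> fsync
    memtable[key] = value                    # 2. then -> memtable

path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal, mem = TinyWal(path), {}
put(wal, mem, "name", "dipti")
print(mem["name"], os.path.getsize(path))    # dipti 17
```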

That gives the flush a crucial property: the flush does not have to be fsynced synchronously with the write. The WAL already made the write durable. The flush can run on its own schedule, and when it finishes (with its own fsync, of course), the corresponding WAL segments can be reclaimed — their contents are now in the SSTable, which is the more compact, indexed form.

This is why the memtable "can be dropped once durable" in the chapter's opening summary — the frozen memtable is dropped as soon as the SSTable is renamed, not because the data is no longer valuable, but because the same data is now in two durable places (WAL and SSTable), and the WAL's copy is the one we can reclaim first.

Build 5 builds the WAL from scratch (chapter 21) and shows the exact sequence: write→WAL→fsync→memtable, flush→SSTable→fsync→drop WAL segment. For this chapter, it is enough to know that the flush sits inside a larger durability loop and the flush's own fsync is just one link in that chain.

Where this leads next

You now have the three parts of a working LSM-tree write path: an in-memory sorted memtable that absorbs writes, a flush that turns the memtable into an SSTable, and a stack of SSTables on disk. What you do not yet have is a coherent read across all of them.

A get(key) in this design must check:

  1. The active memtable (might have the newest value).
  2. The frozen memtable, if any (slightly older, still in RAM).
  3. Each SSTable on disk, from newest to oldest, stopping at the first hit.

That is the topic of the next chapter — the read path. The subtlety is not the algorithm itself (which is a four-line loop) but the efficiency: if you have a hundred SSTables and the key you want lives only in the newest memtable, you do not want to open and probe all hundred files. Bloom filters (chapter 16) are the fix, and they turn what looks like an O(n) file-scan into a ~1-file probe with negligible false-positive cost.

For now, though, take a moment with what you have built. A memtable plus a flush plus an SSTable reader is already a working key-value store. It handles more data than RAM. It survives restarts (once the WAL arrives in Build 5). It serves range queries cheaply. And the write path is a single O(\log n) insert plus an occasional background flush — the cheapest write path any durable database can offer.

Fifty lines of flush code bridged the biggest gap in chapter 12's design: how do you keep a file sorted while accepting random writes? Answer: you don't — you sort the writes in RAM, batch them up, and transcribe them to disk in order. The flush is the transcription. The sparse index from chapter 12 sits at the tail of every file the flush produces. The merge from the same chapter is what will later combine the files the flush keeps producing. Everything clicks together, and Build 3's machine is almost complete.

References

  1. Sanjay Ghemawat and Jeff Dean, LevelDB Implementation Notes — describes the block-aligned SSTable layout, footer-at-the-tail design, and the tmp + fsync + rename atomic-publish pattern used for new table files. github.com/google/leveldb/blob/main/doc/impl.md.
  2. Fay Chang et al., Bigtable: A Distributed Storage System for Structured Data (OSDI 2006) — §4 introduces the memtable/SSTable split and the flush operation. research.google.
  3. Pillai et al., All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications (OSDI 2014) — empirical study of how real applications get the fsync/rename ordering wrong and what filesystems guarantee. usenix.org.
  4. Siying Dong et al., Optimizing Space Amplification in RocksDB (CIDR 2017) — discusses memtable sizing, flush policies, and the interaction between flushes and compactions in production. cidrdb.org.
  5. Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 — walks through the full memtable→flush→SSTable lifecycle with the same shape as this chapter. dataintensive.net.
  6. Avinash Lakshman and Prashant Malik, Cassandra — A Decentralized Structured Storage System (LADIS 2009) — §5.1 describes Cassandra's memtable-flush mechanism and the immutable-memtable pool that motivates the active/frozen split. cassandra.apache.org.