In short
A point lookup (chapter 15) is instantaneous — it reads the memtable and a handful of SSTable blocks and returns in microseconds. A range scan is not instantaneous. scan("user:1000", "user:9999") across a hundred million rows might stream for minutes, emitting keys the whole time. During those minutes the background compactor is rewriting SSTables, deleting old ones, installing new ones. If the iterator is holding a reference to sstable-0042 and the compactor deletes that file mid-scan, the iterator gets a file-not-found error (or worse, silently corrupt data) the next time it tries to read a block. A snapshot is the mechanism that makes long-running reads safe against the moving floor beneath them. Every write in the LSM gets a monotonically increasing sequence number (a 64-bit counter) stamped onto it. When a reader opens a snapshot, it captures the current seq, call it snap = N. The snapshot is a promise: "this reader will only see writes whose seq ≤ N." Two mechanisms make the promise hold. (1) At read time, the merging iterator walks all layers in sorted key order and, for each key, returns the newest version whose seq ≤ snap — ignoring anything newer. (2) At compaction time, the compactor consults the list of live snapshots before dropping any record, and keeps versions that are still visible to some open snapshot; it also refuses to delete an SSTable file if any open snapshot references it. The second rule is the file pin: an open snapshot holds SSTables alive against the compactor. Release the snapshot (the client is done with the scan), and the pinned files become eligible for deletion again. This chapter writes the k-way heap-merge that stitches the memtable and SSTables into one sorted stream, draws the pin across a compaction cycle, and previews how the same seq-number machinery grows into full multi-version concurrency control (MVCC) when Build 7 layers transactions on top.
You have the read path. A get("alice") checks the memtable, then walks SSTables newest-to-oldest, short-circuiting on the first hit. It takes microseconds. It sees a consistent point-in-time view of the store because the whole operation completes before any background activity can reshape the layers under it.
Now consider the other kind of read.
scan("user:1000", "user:9999")
A range scan. The caller wants every key in that range, in sorted order, streamed one at a time. On a hundred-million-row store this might emit ten million keys and take minutes. During those minutes, the compactor is awake. It is picking SSTables, merging them, writing new ones, deleting the old inputs. The database from one second to the next is a different database — different set of files on disk, different distribution of keys across layers.
The iterator started by opening thirty SSTable files. Halfway through the scan, the compactor merges eight of those into two new ones and deletes the eight originals. The iterator is still holding file descriptors to the originals — on Linux those FDs keep the underlying inodes alive until closed, so the read itself works, but on other OSes or if the engine uses file paths instead of FDs, the reader would see a file-not-found error or read garbage from a reused inode. Even on Linux, if the iterator merely remembers the filename ("sstable-0042.sst") and opens it lazily on each block access, the compactor's unlink kills the next block read.
There is a second, subtler problem. Suppose the compactor is allowed to drop records during merge: an overwritten value gets dropped because a newer put exists, a tombstone-plus-live-row pair gets dropped because the tombstone is past gc-grace. Those drops are safe for future reads — nobody will query for a value the user overwrote last week. But they are not safe for the iterator in progress, because the iterator opened before those drops happened and has already committed to showing the caller a view from that earlier moment. If the compactor drops a record the iterator has not yet emitted, the caller sees a hole in their result set where they should have seen a key.
So a range scan in an LSM is not just a merge across layers. It is a merge across layers as they existed at the moment the scan began, continuing to hold that view steady even as the real layers shift underneath. The mechanism that makes this possible is the snapshot.
The sequence number
Every write in an LSM — every put(key, value) or delete(key) — gets a monotonically increasing sequence number stamped onto it. Think of it as a global write counter: the first write in the database's life is seq=1, the second is seq=2, and so on. Production engines use 64-bit counters (the space wraps after 2^{64} writes, which at a million writes per second is 600,000 years; in practice this is treated as "never").
The seq is part of the internal key. Up to now we have been writing records as (key, value, type) where type is put or tombstone. The real on-disk record is (key, seq, type, value) — the seq sits inside the key for sort purposes, not alongside the value. Two records with the same user-facing key but different seqs are different internal keys, and they can coexist in the same SSTable without either overwriting the other.
user-facing write                 internal record
-------------------------------   -----------------------------------
put("alice", "35")  at seq=101    ("alice", seq=101, PUT, "35")
put("alice", "36")  at seq=247    ("alice", seq=247, PUT, "36")
delete("alice")     at seq=389    ("alice", seq=389, DEL, <empty>)
Three internal records, one user-facing key. All three may sit in the memtable together, or spread across different SSTables. The read path's "newest wins" rule is just "highest seq wins for a given user key."
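The "highest seq wins" rule falls out of the internal sort order (user_key ASC, seq DESC). A minimal sketch of that ordering — the tuple encoding here is illustrative, real engines pack it into a byte comparator:

```python
# Order internal records (user_key ASC, seq DESC) by building a sort key
# that inverts the seq. Names here are illustrative, not an engine API.
MAX_SEQ = 2**64 - 1

def internal_sort_key(user_key, seq):
    # Ascending sort on (user_key, MAX_SEQ - seq) puts the newest
    # version of each user key first within that key's group.
    return (user_key, MAX_SEQ - seq)

records = [("alice", 101), ("alice", 389), ("bob", 5), ("alice", 247)]
ordered = sorted(records, key=lambda r: internal_sort_key(*r))
# ordered: alice@389, alice@247, alice@101, bob@5
```

With this ordering, "find the newest visible version" is always a forward scan that can stop at the first match.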
Why 64 bits and not a smaller counter? Because the counter has to be unique across the entire database lifetime, including across restarts. On restart, the engine recovers the last-used seq from the WAL (or from the highest seq in any SSTable) and continues from there. A 32-bit counter wraps in an hour at a million writes per second; 64 bits gives you cosmic-timescale headroom.
The snapshot object
A snapshot is a tiny object — essentially a sequence number and a refcount. Conceptually:
class Snapshot:
    def __init__(self, seq, store):
        self.seq = seq          # the "as-of" sequence number
        self.store = store
        self.released = False

    def release(self):
        """Client signals they are done with this snapshot."""
        if not self.released:
            self.store._release_snapshot(self)
            self.released = True
When a client asks the store for a snapshot, the store takes the current value of its monotonic seq counter and hands it back:
import threading

class LsmStore:
    def __init__(self):
        self.next_seq = 1              # monotonic write counter
        self.live_snapshots = set()    # every open snapshot
        self.lock = threading.RLock()
        # ... memtable, sstables, etc ...

    def create_snapshot(self):
        with self.lock:
            snap = Snapshot(seq=self.next_seq - 1, store=self)
            self.live_snapshots.add(snap)
            return snap

    def _release_snapshot(self, snap):
        with self.lock:
            self.live_snapshots.discard(snap)
            # The compactor may now reclaim files that only this snapshot was pinning.
snap.seq = self.next_seq - 1 means "the highest seq that has already been assigned." Any write that comes in after this point gets a higher seq and is invisible to the snapshot.
live_snapshots is the set of every snapshot the store currently has open. The compactor will consult this set before dropping records or deleting files; more on this in a moment.
Reading through a snapshot
The read path gets one small modification: when it probes a layer, it filters to records with seq ≤ snap.seq, ignoring anything newer.
class LsmStore:
    def get_at(self, key, snap):
        """Point lookup as of the snapshot."""
        layers = self._layer_snapshot()  # memtable + frozen + sstables, newest first
        for layer in layers:
            rec = layer.probe_le(key, snap.seq)  # highest seq ≤ snap.seq for this key
            if rec is None:
                continue
            if rec.type == TOMBSTONE:
                return None
            return rec.value
        return None
The only change from chapter 15's lsm_get is that probe() becomes probe_le() — "probe for the highest-seq record for key whose seq is ≤ the snapshot's seq." Inside the memtable's skip list this is a floor-scan over the internal keys (key, *). Inside an SSTable it is the same sparse-index block read followed by a scan that stops at the first record matching key with seq ≤ snap.seq (internal keys are sorted (user_key ASC, seq DESC), so the scan sees the highest seq first and can stop).
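The floor-scan is easy to see over a sorted in-memory layer. A toy probe_le using the same (user_key ASC, seq DESC) encoding — bisect stands in for the skip list and sparse index, and the record layout is illustrative:

```python
import bisect

MAX_SEQ = 2**64 - 1

# A layer as a sorted list of (user_key, MAX_SEQ - seq, type, value).
# Ascending order on these tuples is exactly user_key ASC, seq DESC.
layer = [
    ("alice", MAX_SEQ - 389, "DEL", None),
    ("alice", MAX_SEQ - 247, "PUT", "36"),
    ("alice", MAX_SEQ - 101, "PUT", "35"),
]

def probe_le(layer, key, snap_seq):
    # Position at the first internal key >= (key, snap_seq in internal order);
    # if it still belongs to `key`, it is the highest seq <= snap_seq.
    i = bisect.bisect_left(layer, (key, MAX_SEQ - snap_seq))
    if i < len(layer) and layer[i][0] == key:
        return layer[i]
    return None

probe_le(layer, "alice", 300)   # the PUT at seq=247 — newest version ≤ 300
probe_le(layer, "alice", 50)    # None — no version existed at seq ≤ 50
```

A probe at snap_seq=400 lands on the seq=389 tombstone, which is how get_at correctly reports the key as deleted for that snapshot.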
This makes point reads snapshot-aware for free. The interesting case is range scans.
The merging iterator
A range scan has to walk every key in [lo, hi] in sorted order, returning the newest-seq-below-snap version of each. The only way to do this with layers scattered across memtable + N SSTables is a k-way merge: open an iterator on each layer, each positioned at the first key ≥ lo, and repeatedly emit the smallest key across all of them.
The classic data structure for k-way merge is a min-heap keyed on the current front of each layer's iterator. Pop the smallest; if it is the same user key as the previous emission (stale duplicate), skip it; otherwise emit. Then advance that layer's iterator by one and push the new front back into the heap.
# merging_iterator.py — a scan that sees a consistent snapshot
import heapq

class MergingIterator:
    """K-way merge across memtable + frozen memtable + SSTables.

    Emits (user_key, value) pairs in sorted user_key order, each one the
    newest version with seq ≤ snap.seq. Skips tombstoned keys. Pins all
    underlying layers alive for its lifetime."""

    def __init__(self, layers, snap, lo, hi):
        self.snap = snap
        self.hi = hi
        # Each layer yields internal records (user_key, seq, type, value)
        # in order: user_key ASC, seq DESC.
        self.sub_iters = [layer.seek(lo) for layer in layers]
        self.heap = []
        for idx, it in enumerate(self.sub_iters):
            rec = next(it, None)
            if rec is not None:
                # Heap key: (user_key, -seq, idx).
                # -seq so the newest version of a user_key comes out first
                # on ties; idx breaks further ties deterministically.
                heapq.heappush(self.heap, (rec.user_key, -rec.seq, idx, rec))
        self.last_emitted_key = None

    def __iter__(self):
        return self

    def __next__(self):
        while self.heap:
            user_key, neg_seq, idx, rec = heapq.heappop(self.heap)
            # Advance the sub-iterator we just drained.
            nxt = next(self.sub_iters[idx], None)
            if nxt is not None and nxt.user_key <= self.hi:
                heapq.heappush(
                    self.heap, (nxt.user_key, -nxt.seq, idx, nxt)
                )
            # Stop the scan once we pass the upper bound.
            if user_key > self.hi:
                raise StopIteration
            # Skip versions newer than the snapshot.
            if rec.seq > self.snap.seq:
                continue
            # Skip duplicate user_keys — we already emitted a newer version.
            if user_key == self.last_emitted_key:
                continue
            self.last_emitted_key = user_key
            # Tombstone: mark this user_key as "seen" so older puts are skipped,
            # but do not emit anything to the caller.
            if rec.type == TOMBSTONE:
                continue
            return (user_key, rec.value)
        raise StopIteration

    def close(self):
        """Close every sub-iterator, releasing file descriptors (and, in
        the full store, the refcounts pinning the SSTables)."""
        for it in self.sub_iters:
            it.close()
Why the heap key is (user_key, -seq, idx): popping the smallest entry must surface the smallest user_key first (that is the sort order the caller expects), and within a user_key the newest version (highest seq) first (so we emit the freshest data and skip older versions on subsequent pops of the same key). Negating the seq flips the natural min-heap ordering into "largest-seq-first within a user_key group." idx is a tiebreaker in case two layers hold a record with identical (user_key, seq) — this should never happen in a well-formed store, but a tiebreaker keeps the heap deterministic instead of crashing on comparisons of rec objects.
Why last_emitted_key is needed even though the heap orders by user_key: after emitting ("alice", "35") from the newest layer, the next heap pop may be ("alice", "34") from an older layer — same user_key, older seq, irrelevant to the caller. The dedup check catches it. Without the check, the caller would see alice twice with decreasing freshness.
Why the tombstone case skips silently instead of emitting (key, None): the contract of a range scan is "show me every live key in [lo, hi]". A deleted key is not live. Emitting it with None would leak internal representation to the caller. The last_emitted_key update before the continue is critical — it records that we have now "spoken for" this user_key, so older put-versions of it in deeper layers get skipped on their turn at the heap.
Pinning files across compaction
The merging iterator is correct as long as every SSTable it opens stays on disk for the iterator's lifetime. The compactor, left alone, has no idea the iterator exists. It looks at sstable-0042.sst, sees that everything it contains has been superseded by newer files, and calls unlink("sstable-0042.sst"). The next time the iterator tries to read a block from that file, it fails.
The fix is a per-SSTable pin count (or, equivalently, a per-snapshot set of referenced SSTables). The compactor is forbidden from deleting any SSTable with a nonzero pin count.
import os

class LsmStore:
    def _release_snapshot(self, snap):
        with self.lock:
            self.live_snapshots.discard(snap)
            # Files that only this snapshot's seq floor was protecting
            # may now be reclaimable.
            self._maybe_reclaim_files()

    def _compact(self, inputs, output):
        # ... standard merge ...
        new_sstable = self._write_sstable(output)
        with self.lock:
            # Install the new SSTable in the manifest first.
            self.sstables.insert(0, new_sstable)
            # Mark the inputs as "superseded" but do not delete yet.
            for sst in inputs:
                sst.superseded = True
            self._maybe_reclaim_files()

    def _maybe_reclaim_files(self):
        """Delete any superseded SSTable that no open snapshot's seq floor
        protects and no iterator references."""
        oldest_snap_seq = min(
            (s.seq for s in self.live_snapshots), default=float("inf")
        )
        for sst in list(self.sstables):
            if (sst.superseded
                    and sst.max_seq < oldest_snap_seq
                    and sst.refcount == 0):
                sst.close()
                os.unlink(sst.path)
                self.sstables.remove(sst)
Two rules compose here. Refcount: when a merging iterator opens, it bumps the refcount on every SSTable it will read; when the iterator closes, it decrements them. Snapshot seq floor: even without any iterators, an open Snapshot with seq=N keeps alive every SSTable whose records might be needed to reconstruct the view as-of N. "Might be needed" is conservative — in practice, an SSTable whose maximum seq is ≤ the oldest open snapshot's seq is fully covered (no future snapshot can ever need newer info), but an SSTable containing records at seqs that straddle a snapshot has to stick around.
The sequence-number floor also affects what the compactor can drop during merge. The normal rule is "two records for the same user key, keep only the newest." Under snapshots the rule is "keep the newest, and keep the newest version that is ≤ seq of each live snapshot." So if snapshots are open at seq=100 and seq=500, and you have alice records at seq=50, 200, 600, 900, the compactor must retain seq=50 (the latest ≤ 100), seq=200 (the latest ≤ 500), and seq=900 (the current tip). seq=600 can be dropped because no snapshot's view window falls between 500 and 900. Most production engines simplify this by tracking only the oldest open snapshot's seq as the retention floor: they keep everything ≥ that seq, drop older duplicates as usual. It is slightly more retention than strictly necessary but far simpler to implement.
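The exact per-key retention rule is small enough to write out directly. A sketch of the precise variant (not the simplified oldest-floor approximation most engines use):

```python
def versions_to_keep(seqs, live_snapshot_seqs):
    """Given every seq for one user key (any order) and the seqs of live
    snapshots, return the versions the compactor must retain: the newest
    overall, plus the newest version visible to each open snapshot."""
    seqs = sorted(seqs, reverse=True)          # newest first
    keep = {seqs[0]}                           # current tip always survives
    for snap in live_snapshot_seqs:
        visible = [s for s in seqs if s <= snap]
        if visible:
            keep.add(max(visible))             # newest version <= this snapshot
    return keep

versions_to_keep([50, 200, 600, 900], [100, 500])
# keeps {50, 200, 900}; seq=600 can be dropped (no snapshot window covers it)
```

This reproduces the worked example above: snapshots at 100 and 500 pin seq=50 and seq=200, the tip at seq=900 always survives, and seq=600 is droppable.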
(Figure: a compaction timeline. At t1 the compactor merges SSTables A and B into AB, but the original A and B cannot be deleted because an open snapshot at seq=500 still references them — and an iterator may be actively reading from them. At t2, the snapshot releases; the inputs become eligible for unlink(), and the on-disk state collapses to the post-merge view.)
A scan traced end-to-end
A range scan taken at snapshot N while the DB keeps writing
A store is up and running. At seq=500 the client opens a snapshot and starts a range scan over [user:1000, user:2000]:
snap = store.create_snapshot() # captures seq=500
it = store.scan("user:1000", "user:2000", snap=snap)
At this moment the layers (newest-first) are:
active memtable : user:1250 @ seq=497, user:1870 @ seq=498, user:2100 @ seq=499
sstable-0042.sst : user:1100..user:1600 at various seqs 300-450
sstable-0041.sst : user:1500..user:1950 at various seqs 150-300
sstable-0040.sst : user:1700..user:2400 at various seqs 50-200
The iterator opens sub-iterators on the memtable and all three SSTables, bumps the refcount on each SSTable from 0 to 1, and starts merging.
The client pulls keys one at a time. While they do:
seq=501: put("user:1300", "...") # lands in active memtable
seq=502: put("user:1250", "new-value") # overwrites a key the iterator will pass
seq=503: delete("user:1870") # tombstones a key the iterator already emitted
compactor: merges sstable-0041 + sstable-0042 into sstable-0043
(sstable-0043 holds user:1100..user:1950, all seqs ≤ 450 compacted down)
sstable-0041 and sstable-0042 are marked "superseded" but NOT deleted,
because the iterator's refcount on each is 1.
What does the client see for their [user:1000, user:2000] scan?
user:1100 → (value as of some seq in 300-450, whichever is newest ≤ 500) (from sstable-0042)
user:1250 → ("...some value..." at seq=497 from the memtable sub-iterator)
the newer put at seq=502 is NOT visible: its seq > snap.seq=500
user:1500 → (value from sstable-0042 if present, else sstable-0041)
user:1870 → (value from sstable-0041 or wherever it was put most recently ≤ seq=500)
the delete at seq=503 is NOT visible: its seq > snap.seq=500
user:1950 → ...
user:2000 → ...
Every key is the version as of seq=500. Writes that happened during the scan (seq 501, 502, 503) are invisible. The compaction that ran mid-scan produces sstable-0043, but the iterator never looks at it — its sub-iterators are on sstable-0041 and sstable-0042, which are still there (pinned).
When the client finishes:
it.close() # releases refcounts on sstable-0040, 0041, 0042, memtable
snap.release() # drops the snapshot seq floor from 500 back to "no snapshots"
The store's _maybe_reclaim_files() runs. sstable-0041 and sstable-0042 now have refcount=0 and are marked superseded, so they are unlink()ed. The on-disk state simplifies to sstable-0040, sstable-0043, sstable-0044, ... — the compaction is finally visible on the filesystem, not just in the manifest.
The client got a consistent view. The compactor reclaimed its space. Nobody blocked anybody.
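Since a leaked snapshot is the classic failure mode here, the close/release pair belongs in a finally block. A sketch of a leak-proof client wrapper, assuming the store interface from this chapter (create_snapshot, scan, the iterator's close):

```python
from contextlib import contextmanager

@contextmanager
def snapshot_scan(store, lo, hi):
    """Yield a snapshot-consistent iterator over [lo, hi] and guarantee
    that both the iterator and the snapshot are released, even if the
    scan body raises. Interface names are assumed from the sketches above."""
    snap = store.create_snapshot()
    it = store.scan(lo, hi, snap=snap)
    try:
        yield it
    finally:
        it.close()       # drop refcounts on the pinned SSTables
        snap.release()   # drop the seq floor; compaction may reclaim
```

Usage is then `with snapshot_scan(store, "user:1000", "user:2000") as it:` — the pin can never outlive the with-block.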
Common confusions
- "Why do we need seq numbers at all — doesn't the layer ordering already give me 'newest wins'?" Layer ordering tells you which SSTable is newer than which. Seq numbers tell you which record is newer than which. These are different: a single SSTable can hold multiple records for the same user key at different seqs (multiple writes that were all in the same memtable at flush time), and two SSTables can even hold overlapping seq ranges after a compaction that rewrote both. The layer rule is a coarse ordering; the seq is the fine-grained ordering that makes snapshots possible. Without seq, you cannot say "give me the view as of one specific write ago" — you can only say "give me the view as of some SSTable."
- "Is a snapshot free? Can I open a million of them?" A snapshot object itself is tiny — a seq number and a pointer. The cost is indirect: every open snapshot holds back compaction's ability to reclaim space. Open one snapshot a day and forget to release it, and the store can never drop records older than that snapshot's seq; space amplification grows without bound. Production engines have alerts for "oldest-snapshot-age > 1 hour" because a long-held snapshot is nearly always a leaked handle.
- "What happens if a snapshot is held across a restart?" It doesn't survive. Snapshots are in-memory concepts — a Snapshot object in the client process. A restart drops all snapshots, and the store's oldest-snapshot floor resets to the current seq. Long-running read clients that need crash-durable views have to use a different mechanism (a checkpoint — a full copy-on-write of the SSTable set to a separate directory — which is a heavier primitive we cover in Build 5).
- "Doesn't pinning files indefinitely cause disk to fill up?" Yes, and that is the operational cost of long snapshots. RocksDB and LevelDB both accumulate "zombie" SSTables — files that are superseded in the manifest but can't be deleted because a snapshot references them. During a one-hour snapshot against a 10 GB/hour write load, you can have 10 GB of zombies. The fix is always "release the snapshot sooner" or "take checkpoints instead of snapshots for read-heavy workloads." There is no magic: holding a view of the past costs the disk space needed to retain that past.
- "How does this interact with the WAL and memtable flush?" The flush is seq-monotonic — the memtable being flushed contains a contiguous range of seqs, and the resulting SSTable's min_seq and max_seq cover that range. A snapshot at seq=N keeps the flushed SSTable alive if N is within the file's seq range (or lower than its max_seq, depending on the exact invariant the engine chooses). The memtable itself is pinned by snapshots the same way SSTables are — a memtable cannot be dropped after flush until every snapshot with seq < min_seq_in_next_memtable has been released. In practice memtables flush quickly and snapshots are short-lived, so this rarely matters, but the invariant is the same.
- "Can I scan without creating a snapshot?" Some engines let you. A "naked" range scan that skips the snapshot machinery sees whatever records happen to be visible at each step — effectively a snapshot that moves forward with the scan. This is cheaper (no pin, no seq-floor retention) but can produce inconsistent results (a key written halfway through the scan may or may not appear, depending on timing). RocksDB calls this an "unmanaged iterator"; most applications should not use it. The default db.NewIterator() implicitly creates a snapshot for you.
- "Why is the heap key (user_key, -seq, idx) and not just user_key?" Because multiple sub-iterators can be positioned at the same user_key simultaneously — the memtable and an SSTable might both hold alice at different seqs. Breaking ties on -seq ensures the newest version of alice comes off the heap first, so the dedup logic (skip if user_key == last_emitted_key) skips the older versions cleanly. Without the tiebreak, the order of emission within a user_key group would be arbitrary, and the dedup would sometimes skip the newest and return an older version.
Going deeper
Snapshots are a load-bearing primitive. The details below are where production systems diverge.
RocksDB's snapshot implementation
RocksDB implements snapshots with a single uint64_t sequence number and a linked list of live snapshots sorted by seq. CreateSnapshot() takes a spinlock, reads the current seq, constructs a SnapshotImpl, and inserts it into the list. ReleaseSnapshot() unlinks it. The overhead per snapshot is a pointer and two list-link nodes — call it 32 bytes.
The compactor's retention check walks this list. For each record it is considering dropping, it finds the smallest snapshot seq that is ≥ the record's seq; if such a snapshot exists and the next-newer version of the same key has a seq above that snapshot, the record is the version that snapshot sees and must be kept. In practice RocksDB uses a simpler approximation — "keep everything ≥ oldest-snapshot-seq" — which over-retains slightly but is a single comparison per record instead of a list walk.
The file-pin mechanism is folded into RocksDB's Version abstraction (from the VersionEdit/manifest machinery inherited from LevelDB). Each Version holds refcounts on all SSTables it references, and each iterator grabs a Version reference at open. When the compactor installs new files, it creates a new Version; old Versions are reclaimed when their refcount hits zero, which triggers the unlink() of their unreferenced SSTables.
The one subtlety: the memtable's seq range is not bounded on the high end at flush time. A memtable being flushed contains seqs up to some value S, but S is not known until the flush finishes. RocksDB works around this by stamping the max-seq-at-freeze-time onto the flushed SSTable and using that as the file's max_seq. Any write with a seq greater than this value can only have landed in a later memtable, so the stamped bound is safe.
Sequence-number wraparound
64-bit counters wrap after 2^{64} \approx 1.8 \times 10^{19} writes. At a million writes per second this is 600,000 years. At a billion writes per second (the fastest single-node LSM benchmarks) it is 600 years. So wraparound is not a concern in practice.
But it does matter at the specification level. RocksDB documents that the seq space is "effectively unbounded" and code paths that compare seqs do not handle wraparound. If you somehow hit 2^{64} writes — or if a bug corrupts the seq counter — the ordering invariant breaks and reads return arbitrary results. The mitigation is the same as for any 64-bit monotonic counter: treat 2^{64} as infinity and make sure the counter is recovered correctly on restart (from the highest seq in any SSTable plus the WAL tail).
A more realistic wraparound risk: a bug where the counter is persisted as a 32-bit truncation. A database that has been running for a few hours with a 32-bit truncated seq has silently introduced ambiguity — older writes may have seqs that compare as "newer" than recent writes, and the read path will return wrong answers. This is the kind of bug that is caught by fuzz testing the seq-counter persistence, not by unit tests.
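The truncation failure mode is easy to demonstrate: once the true counter passes 2^32, truncated seqs stop agreeing with full seqs about which write is newer. A toy illustration:

```python
MASK32 = 0xFFFFFFFF

old_seq = 2**32 - 10       # a write from just before the counter passed 2^32
new_seq = 2**32 + 5        # a later write

assert new_seq > old_seq                         # true ordering
assert (new_seq & MASK32) < (old_seq & MASK32)   # truncated ordering is reversed
```

The truncated comparison says the old write is newer, so a read path using the truncated seqs would return the stale value.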
Snapshot-based isolation (preview of Build 7)
The seq-number machinery in this chapter is the foundation of snapshot isolation — the most common transactional isolation level in modern databases. The observation: a transaction can be modelled as "a snapshot at transaction-start time, plus a set of pending writes that will commit together at transaction-end time." Reads see the snapshot (so the transaction has a stable view); writes are buffered (so partial progress is not visible until commit); commit assigns a new seq and atomically installs all buffered writes at that seq.
The whole MVCC machinery in PostgreSQL, MySQL/InnoDB, Oracle, SQL Server (when using READ_COMMITTED_SNAPSHOT), and every serious distributed store is a dressed-up version of the snapshot mechanism in this chapter. The extra pieces are (a) write-write conflict detection (two transactions that both want to write the same key must not both commit), (b) durable commit (the commit record hits the WAL before it is visible), and (c) — for serialisable isolation — read-write conflict detection or the SSI algorithm.
The important structural point: snapshots are a cheap primitive. Once you have seq-numbered writes and a compactor that respects the snapshot list, you get read-committed and snapshot-isolation "for free" from the same bookkeeping. The hard part of transactions is not the read path, it is the conflict-detection and durable-commit machinery. We will build that in Build 7.
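The "snapshot plus pending write buffer" model fits in a few lines. A sketch under stated assumptions — the store interface follows this chapter's code, and conflict detection plus the durable commit record are deliberately omitted because they are Build 7's subject:

```python
class Transaction:
    """Snapshot isolation in miniature: a read snapshot taken at txn
    start, plus a private write buffer installed atomically at commit.
    Interface names (create_snapshot, get_at, apply_batch) are assumed
    from this chapter's sketches."""

    def __init__(self, store):
        self.store = store
        self.snap = store.create_snapshot()   # stable read view
        self.writes = {}                      # key -> value, invisible until commit

    def get(self, key):
        if key in self.writes:                # read-your-own-writes
            return self.writes[key]
        return self.store.get_at(key, self.snap)

    def put(self, key, value):
        self.writes[key] = value              # buffered, not yet visible

    def commit(self):
        # Build 7 adds conflict detection and the WAL commit record here;
        # this just installs all buffered writes at one new seq.
        self.store.apply_batch(self.writes)
        self.snap.release()
```

Reads through the transaction see the start-time snapshot overlaid with its own buffered writes; nothing leaks to other readers until apply_batch runs.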
Checkpoints vs snapshots
A checkpoint is a snapshot's heavier cousin. Where a snapshot is in-memory and transient, a checkpoint is an on-disk copy-on-write clone of the entire SSTable set (and a manifest pointing at the cloned files). Checkpoints survive restarts, can be copied to other hosts for replication or backup, and do not pin files in the live database — they pin files in their own directory.
RocksDB's Checkpoint API creates a checkpoint by hardlinking SSTables into a fresh directory. Hardlinks are cheap (metadata-only, no block copy) but require the checkpoint to live on the same filesystem. On restart the checkpoint directory is itself a valid RocksDB store that can be opened independently.
Snapshots are right for "give me a consistent view for this minute of scanning." Checkpoints are right for "give me a backup I can ship to another server." Both rest on the same seq-number foundation; the difference is only in where the pinned files live and how long they persist.
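The hardlink trick is a few lines of stdlib. A simplified sketch — a real checkpoint would also flush the memtable and copy the manifest, and the function name here is illustrative:

```python
import os

def checkpoint(sstable_paths, checkpoint_dir):
    """Hardlink every live SSTable into checkpoint_dir. Metadata-only —
    no data blocks are copied — so this is fast even for large stores,
    but it requires checkpoint_dir to be on the same filesystem."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    for path in sstable_paths:
        dst = os.path.join(checkpoint_dir, os.path.basename(path))
        os.link(path, dst)   # same inode, second directory entry
```

Because each link is a second directory entry on the same inode, the live database can later unlink its copy without touching the checkpoint's.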
Iterator refresh and read-your-own-writes
A snapshot is read-only. Writes made by the same client after the snapshot was taken are invisible to reads through that snapshot. This is sometimes desired (consistent scans) and sometimes surprising (the client just wrote a key and can't see it back).
Applications that want read-your-own-writes from an iterator use a different pattern: a "current state" iterator that re-creates its snapshot on every advance, or a transaction's read-uncommitted mode that merges the snapshot with the transaction's own pending write buffer. Both are implementable on top of the primitives in this chapter; the default snapshot is the strict point-in-time one.
Where this leads next
Snapshots make long reads safe. The next problem they create is an operational one: compaction I/O is bursty, and those bursts compete with foreground reads and writes for disk bandwidth. A compaction merging four 256 MB SSTables reads a gigabyte and writes half a gigabyte — 1.5 GB of I/O in a few seconds. During those seconds, the same disk serving foreground get() and put() requests has its service time inflated by a factor of 5-10. The tail of the latency distribution — p99, p999 — is dominated by "your read happened to land during a compaction."
The next chapter (Walls: compaction I/O, tail latency) measures the problem, explains why the p50 can look great while the p999 is a disaster, and walks the three canonical mitigations: rate-limiting compaction I/O (capping MB/s spent on merges), scheduling compaction during low-traffic windows, and spreading compaction across independent disks. The snapshot machinery you just built is part of the story — long-held snapshots delay compaction and amplify the I/O burst when compaction finally runs — but the centre of gravity shifts from correctness to operations.
After the compaction-tail chapter, Build 3 closes with a wrap-up that ties the memtable, SSTable, flush, read path, Bloom filters, compaction policies, and snapshots into one coherent storage engine. The output is everything RocksDB has, implemented from first principles in ~2000 lines of code. That is the platform on which Build 4 (transactions), Build 5 (replication), and Build 7 (MVCC) all rest.
- Walls: compaction I/O, tail latency — the operational cost of the machinery we just built.
References
- Patrick O'Neil et al., The Log-Structured Merge-Tree (LSM-Tree) (Acta Informatica, 1996) — the foundational paper; §3 introduces the layered read and the merge-iterator structure used here. Springer.
- RocksDB Team, Snapshot and Iterator — RocksDB's documentation of CreateSnapshot, ReleaseSnapshot, and the Version/refcount machinery that pins SSTables. github.com/facebook/rocksdb/wiki/Snapshot.
- Sanjay Ghemawat and Jeff Dean, LevelDB Implementation Notes — the original Version/VersionEdit design inherited by RocksDB; explains how manifests and refcounts keep files alive across iterator lifetimes. github.com/google/leveldb.
- Hellerstein, Stonebraker, Hamilton, Architecture of a Database System (Foundations and Trends in Databases, 2007) — §5 covers MVCC and snapshot isolation as foundations for concurrency control; provides the bridge from this chapter to Build 7. db.cs.berkeley.edu.
- Chen Luo and Michael J. Carey, LSM-based Storage Techniques: A Survey (VLDB Journal, 2020) — §4 covers snapshot and iterator semantics across LSM engines; useful comparison of RocksDB, Cassandra, and HBase approaches. arxiv.org/abs/1812.07527.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 7 — the classroom-friendly walkthrough of MVCC, snapshot isolation, and the relationship between storage-level snapshots and transactional isolation. dataintensive.net.