In short
The previous chapter left one large question unanswered. SSTables on disk are sorted — that was the whole point — but the writes a database receives do not arrive sorted. A put("zip", ...) is followed by a put("age", ...), then a put("locale", ...), in any order. You cannot keep a file on disk sorted under random insertion without shuffling bytes around, which is the expensive thing you were trying to avoid. The LSM answer is to add one small layer in front of the disk — the memtable, an in-memory sorted structure that absorbs writes as they arrive and keeps them in key order. When it has accumulated enough (typically a few tens of megabytes), you serialise the whole thing to a new SSTable in one sorted pass and start a fresh empty memtable. Sorting in RAM is cheap; sorting on disk is not; the memtable is the place where that cheap sort happens. This chapter implements a memtable as a skip list — a probabilistic sorted structure with O(\log n) insert and lookup, sorted iteration for free, and (crucially, for later chapters) a dramatically simpler concurrency story than balanced trees. The entire thing fits in about sixty lines of Python. You also wire the memtable into put() and get() so it becomes the top level of the LSM read path — every lookup checks the memtable first before touching disk.
The last chapter gave you a storage primitive — an SSTable with a sparse index, merged with a twenty-line streaming merge — and a question it did not answer: how do you keep a file sorted when writes arrive in any order?
The answer, stated once in the previous chapter's common confusions and unpacked here, is that you do not. On-disk files are immutable. Random writes go to a small, dynamic, sorted structure in RAM called the memtable. When the memtable is big enough, it is flushed to disk in one sequential sorted pass as a new SSTable, and a fresh empty memtable starts accepting writes.
This chapter builds that memtable. The data structure is a skip list — not because it is the only option, but because it is the one that works cleanest for this specific job.
The writes-arrive-out-of-order problem
Picture the write stream hitting your database over one second:
```python
put("zip", "600001")
put("age", "19")
put("city", "delhi")
put("name", "dipti")
put("age", "20")        # overwrite
put("locale", "en-IN")
put("role", "admin")
```
The keys arrive in no particular order — zip before age, an overwrite of age before the unrelated locale and role. A real production workload looks like this, only a million times noisier.
You already decided, in the previous chapter, that on-disk files will be sorted SSTables. The sparse index, the linear-time merge, the cheap range queries — all three depend on the sort invariant. So somewhere between "writes arriving in any order" and "sorted file on disk" a sort has to happen.
Sorting on disk is out. A sorted file cannot accept an insertion in the middle without shifting bytes — a random write across the whole file, in the worst case. That is the opposite of sequential I/O, the thing this whole design was supposed to preserve.
Sorting in RAM is cheap. O(\log n) per insert into any balanced ordered structure; nanoseconds of work; no disk touched at all. If you can hold recent writes in RAM, keep them sorted as they arrive, and only occasionally move them to disk in one pre-sorted batch, you get cheap random writes (from the application's point of view) and cheap sequential writes (from the disk's point of view) simultaneously.
That buffer is the memtable. Its job is narrow and specific:
- Absorb every write immediately, in any order.
- Keep its contents in key-sorted order at all times.
- Support fast point lookup — every get() checks it before touching disk.
- Support sorted iteration — when it is time to flush, walk the contents in key order and write them to a new SSTable.
- Stay small — bounded by a size threshold (typical values are 4 MB, 64 MB, 256 MB depending on the engine).
Five requirements. The data structure you pick is whatever satisfies them most simply.
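Spelled out as an interface sketch, the five requirements look like this. The method names match the implementation later in this chapter; the typing.Protocol wrapper is only an illustration, not part of the build:

```python
# A sketch of the memtable contract. Method names follow this chapter's
# implementation; triggering a flush on approx_bytes() is the caller's job.
from typing import Iterator, Optional, Protocol, Tuple

class Memtable(Protocol):
    def insert(self, key: str, value: str) -> None: ...      # absorb writes in any order
    def get(self, key: str) -> Optional[str]: ...            # fast point lookup
    def iter_sorted(self) -> Iterator[Tuple[str, str]]: ...  # key-ordered walk, for the flush
    def approx_bytes(self) -> int: ...                       # "is it time to flush yet?"
```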
Why a skip list
The obvious choices for an "in-memory sorted structure" are a balanced BST (red-black, AVL), a sorted array with binary search, or a skip list. All three are O(\log n) for lookup; they differ on the other axes.
A sorted array is excellent for lookup (binary search; cache-friendly) and terrible for insertion (shift half the array on every write, O(n)). A memtable processing a thousand writes per second on a million-entry structure would spend all its time shifting bytes. Ruled out.
A balanced BST — red-black or AVL — handles both operations in O(\log n) and is what std::map uses under the hood. It is the right answer if you need worst-case guarantees and you have a single thread. The trouble is concurrency. A red-black tree rebalance after an insert can touch O(\log n) nodes, some on the path from the inserted node up to the root. To support concurrent readers and writers safely, you either lock a large portion of the tree during a rebalance (coarse locks, poor parallelism) or implement elaborate lock-coupling / node-versioning schemes that are notoriously hard to get right. Production LSM engines support thousands of concurrent readers on the memtable, so this matters.
A skip list — William Pugh's 1989 structure — hits the same O(\log n) expected bounds with two structural properties that make the concurrency story clean:
- Insertions are local. A new node is spliced into \le a handful of linked lists at a set of known positions; no other node is touched. Unlike a BST, there is no rebalancing step that can ripple up to the root.
- The structure is probabilistic, not balanced. Each node picks its own "height" (number of levels it participates in) with a coin flip at insertion time. No "left-rotate when the subtree is too heavy" logic. The expected shape of the structure is balanced because of the randomisation, not because of explicit rebalancing.
Those two properties let a skip list be lock-free. Readers walk the bottom-level linked list; writers do a small number of compare-and-swap operations on pointer fields; no global or subtree locks are needed. Java's ConcurrentSkipListMap, LevelDB's MemTable, and RocksDB's default memtable are all skip lists for exactly this reason. The coding is also simpler — this chapter's implementation is under sixty lines; a correct concurrent red-black tree is closer to a thousand.
The mental model is simple. The bottom level is a sorted linked list of every key. Above it sit shorter "express" linked lists that skip over most of the nodes. Search starts at the top-left corner and walks right as far as it can without overshooting the target, then drops down a level and repeats. Each drop-down halves (in expectation) the remaining search window.
That is the whole idea. Now the code.
The implementation — about sixty lines
We need four operations — insert(k, v), get(k), iter_sorted(), and a size-in-bytes estimate so the flusher knows when to trigger. A fifth, delete(k), is modelled as inserting a tombstone value; this chapter keeps the surface small and handles tombstones properly in chapter 15.
```python
# memtable.py — a skip-list-backed memtable for an LSM-tree
import random

_MAX_LEVEL = 16  # supports ~2^16 = 65k entries at p=0.5; more than enough for a memtable
_P = 0.5         # probability of promoting to the next level at insertion


class _Node:
    __slots__ = ("key", "value", "forward")

    def __init__(self, key, value, level):
        self.key, self.value = key, value
        # forward[i] is the next node at level i; one pointer per level this node lives on
        self.forward = [None] * (level + 1)


class SkipListMemtable:
    def __init__(self):
        # the head is a sentinel: its key is None ("-infinity"), it lives on every level
        self._head = _Node(None, None, _MAX_LEVEL)
        self._level = 0   # highest level currently in use (0-indexed)
        self._bytes = 0   # approximate bytes of user data held

    def _random_level(self):
        # geometric distribution: keep flipping coins until tails, capped at _MAX_LEVEL
        lvl = 0
        while random.random() < _P and lvl < _MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key, value):
        update = [self._head] * (_MAX_LEVEL + 1)  # per level, the last node whose next key < target
        cur = self._head
        # top-down descent: at each level walk right while next.key < key, then drop down
        for i in range(self._level, -1, -1):
            while cur.forward[i] is not None and cur.forward[i].key < key:
                cur = cur.forward[i]
            update[i] = cur
        # now cur.forward[0] is either the existing node for this key, or the successor
        nxt = cur.forward[0]
        if nxt is not None and nxt.key == key:  # overwrite: update value in place
            self._bytes += len(value) - len(nxt.value)
            nxt.value = value
            return
        lvl = self._random_level()
        if lvl > self._level:  # structure grew — splice head at new levels
            for i in range(self._level + 1, lvl + 1):
                update[i] = self._head
            self._level = lvl
        new_node = _Node(key, value, lvl)
        for i in range(lvl + 1):  # splice new node in at each of its levels
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node
        self._bytes += len(key) + len(value)

    def get(self, key):
        cur = self._head
        for i in range(self._level, -1, -1):
            while cur.forward[i] is not None and cur.forward[i].key < key:
                cur = cur.forward[i]
        nxt = cur.forward[0]
        return nxt.value if (nxt is not None and nxt.key == key) else None

    def iter_sorted(self):
        # bottom level is a complete sorted linked list — exactly what the flusher wants
        cur = self._head.forward[0]
        while cur is not None:
            yield cur.key, cur.value
            cur = cur.forward[0]

    def approx_bytes(self):
        return self._bytes
```
About sixty lines including the node class and blank lines. Read it once end-to-end and notice three things.
Why the update array: insert has to splice the new node into every level it will participate in. At each level the "splice point" is the last node whose next key is still less than the target. Rather than re-descend from the head once per level, the descent records one splice point per level on its way down, in the update list. One O(\log n) pass instead of O(\log^2 n).
Why the sentinel head: the head is a fake node whose key is treated as less than every real key. It simplifies the splicing code — there is always a node to write forward[i] = new_node into, even for the very first insert. Without a sentinel, every level-0 insert would need a "am I inserting at the head?" branch.
Why _random_level uses a geometric distribution: we want roughly half the nodes at level 0 only, a quarter up to level 1, an eighth up to level 2, and so on. That geometric fall-off is what gives the log-n search. Each node's height is drawn independently — there is no global bookkeeping, no rebalancing, no "too-heavy subtree" concept. You could not write a balanced BST in sixty lines; the probabilistic structure is what buys the simplicity.
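The geometric fall-off is easy to verify empirically. This is a standalone re-statement of the coin-flip loop, not part of the memtable itself:

```python
import random
from collections import Counter

def random_level(p=0.5, max_level=16):
    # same geometric coin-flip loop as _random_level above
    lvl = 0
    while random.random() < p and lvl < max_level:
        lvl += 1
    return lvl

n = 200_000
counts = Counter(random_level() for _ in range(n))
# about half the nodes stop at level 0, a quarter at level 1, an eighth at level 2
print(round(counts[0] / n, 2), round(counts[1] / n, 2), round(counts[2] / n, 2))
```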
Why not a simpler dict + periodic sort?
A tempting shortcut is "just use a Python dict, and when the memtable is full, sort it for the flush." That works — a single-pass sort of n entries is O(n \log n), which amortises to O(\log n) per insert on the flush. But it loses two properties the LSM read path needs:
- Sorted iteration during a flush in progress. Real engines often start flushing while still accepting new writes; you need to walk the structure in sorted order while it is mutating. A dict cannot.
- Ordered range queries on the memtable. get_range("age", "name") on the live memtable is a few lines on a skip list (walk level 0 forward until key > upper bound); on a dict it is a full scan plus sort.
The dict answer is a fine prototype for a single-shot insert-then-dump demo; it is not what a production memtable looks like. The skip list gives you ordered iteration and range queries for free, and puts you on the path to lock-free concurrency later.
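To make the range-query point concrete, here is a sketch of get_range written against any key-sorted iterator such as iter_sorted(). Note that this naive version scans from the front, so it is O(n); a real skip-list implementation would reuse the level descent to seek straight to the lower bound for O(\log n + k):

```python
def get_range(sorted_items, lo, hi):
    # sorted_items: any (key, value) iterator in ascending key order,
    # e.g. the memtable's iter_sorted(). Yields entries with lo <= key <= hi.
    for key, value in sorted_items:
        if key < lo:
            continue   # before the window — a real seek would skip this prefix
        if key > hi:
            break      # past the window — sortedness says nothing else matches
        yield key, value

items = [("age", "20"), ("city", "delhi"), ("locale", "en-IN"),
         ("name", "dipti"), ("role", "admin"), ("zip", "600001")]
print(list(get_range(items, "age", "name")))
# -> [('age', '20'), ('city', 'delhi'), ('locale', 'en-IN'), ('name', 'dipti')]
```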
Skip list vs the alternatives, side by side
| Property | Sorted array | Balanced BST (RB/AVL) | Skip list |
|---|---|---|---|
| Insert (single-threaded) | O(n) — shift | O(\log n) + rebalance | O(\log n) expected |
| Lookup | O(\log n) binary search | O(\log n) | O(\log n) expected |
| Sorted iteration | O(n), sequential memory | O(n), in-order walk | O(n), walk level 0 |
| Range query | O(\log n + k) | O(\log n + k) | O(\log n + k) |
| Concurrent writers | Terrible — single writer only | Hard — rebalances touch many nodes | Clean — local CAS per level |
| Worst-case guarantee | Deterministic | Deterministic | Probabilistic (very tight) |
| Lines of code (simple impl) | ~20 | ~300+ (RB) | ~60 |
The only column where the skip list loses is "worst-case guarantee." With very bad luck every node could roll the maximum height and the structure would degenerate into a stack of full lists with O(n) searches. With this chapter's _MAX_LEVEL of 16, the probability of that on a million-entry structure is 2^{-16 \cdot 10^6} — a number so small it is not worth reasoning about. Pragmatic worst case equals expected case.
Balanced BSTs are the "right" answer if you demand a deterministic O(\log n) bound for every operation, tolerate a thousand lines of code, and can afford the concurrency complexity. Skip lists are the "right" answer for a memtable, which is exactly the workload — short-lived, concurrent, sorted, throw-away — that the probabilistic trade-off is designed for.
Put and get through the memtable
Once the skip list exists, wiring it into the database's hot path is three lines of change. The put() function just inserts into the memtable. The get() function checks the memtable first; only on a miss does it fall through to the on-disk SSTables (which you will build in the next chapter). This "memtable is the top level of the LSM" structure is the whole read path in miniature.
```python
# lsm.py — the shape of put/get once the memtable exists
class LSMStore:
    def __init__(self):
        self.memtable = SkipListMemtable()
        self.sstables = []  # newest-first list; filled in chapter 14 onwards
        self.flush_bytes = 4 * 1024 * 1024  # 4 MB threshold for this chapter's demo

    def put(self, key, value):
        self.memtable.insert(key, value)
        if self.memtable.approx_bytes() >= self.flush_bytes:
            self._flush()  # chapter 14: serialise memtable to a new SSTable

    def get(self, key):
        v = self.memtable.get(key)
        if v is not None:
            return v  # memtable hit — newest value, always wins
        for sst in self.sstables:  # newest-first on-disk probe
            v = sst.get(key)
            if v is not None:
                return v
        return None
```
Two things to notice.
First, the memtable is always consulted first on a read, because it holds the newest values. A key that was overwritten ten seconds ago lives in the memtable with its new value; an older SSTable on disk might still carry the stale value. Checking memtable first preserves "latest write wins" for free, without any version numbers.
Second, put() is unconditionally a memtable insert. There is no "should I write this to disk?" decision on the hot path. The disk is touched only when the memtable crosses its size threshold, which amortises the cost across many writes. A million small put()s cost a million skip-list inserts plus one or two big sequential SSTable writes — not a million random disk writes.
This is the LSM's central trick, condensed: writes go to RAM, flushed in bulk; reads check RAM then disk, newest-first. Every remaining chapter of Build 3 fills in pieces of that sentence.
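The newest-first read order is worth a five-line sanity check. In the sketch below, plain dicts stand in for the skip-list memtable and the on-disk SSTables, purely to keep the example self-contained — TinyStore is an illustrative name, not part of the build:

```python
# Illustration of the read order only: dicts stand in for the skip-list
# memtable and the on-disk SSTables.
class TinyStore:
    def __init__(self):
        self.memtable = {}
        self.sstables = []              # newest-first, as in LSMStore

    def put(self, key, value):
        self.memtable[key] = value      # real code: skip-list insert

    def get(self, key):
        if key in self.memtable:        # RAM first: newest value always wins
            return self.memtable[key]
        for sst in self.sstables:       # then disk, newest-first
            if key in sst:
                return sst[key]
        return None

store = TinyStore()
store.sstables.append({"age": "19"})    # stale value already "on disk"
store.put("age", "20")                  # overwrite lands in the memtable
print(store.get("age"))                 # -> '20': memtable shadows the SSTable
print(store.get("city"))                # -> None: miss everywhere
```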
A trace
A trace of seven writes and four reads through the memtable
Start with an empty memtable. Apply the seven writes from the top of the chapter, in the order they arrived, then read four keys. Watch the skip list sort the data as it arrives.
```python
m = SkipListMemtable()
m.insert("zip", "600001")
m.insert("age", "19")
m.insert("city", "delhi")
m.insert("name", "dipti")
m.insert("age", "20")       # overwrite — same slot, value replaced in place
m.insert("locale", "en-IN")
m.insert("role", "admin")
```
After all seven inserts, iter_sorted() walks level 0 front to back. It yields — regardless of the random levels the individual nodes happened to roll:
```
age    -> 20        (overwrite applied; 19 is gone)
city   -> delhi
locale -> en-IN
name   -> dipti
role   -> admin
zip    -> 600001
```
Six entries — the overwrite of age collapsed two inserts into one slot, exactly what you want. Point lookups:
```python
m.get("name")    # -> 'dipti'  — walks the top levels, drops down, lands on 'name'
m.get("age")     # -> '20'     — newest value, because the overwrite updated in place
m.get("mobile")  # -> None     — descent lands between 'locale' and 'name'; next key > 'mobile'
m.get("zip")     # -> '600001'
```
Trace get("name") in detail, on one plausible set of coin flips (suppose age and zip rolled height 3, name rolled height 1, and city, locale, role stayed at level 0):

```
start: cur = head, i = 3
L3: head.forward[3] = age; 'age' < 'name', advance. cur = age.
    age.forward[3] = zip; 'zip' > 'name', drop. i = 2.
L2: age.forward[2] = zip; 'zip' > 'name', drop. i = 1.
L1: age.forward[1] = name; 'name' < 'name'? No, equal. drop. i = 0.
L0: age.forward[0] = city; 'city' < 'name', advance. cur = city.
    city.forward[0] = locale; 'locale' < 'name', advance. cur = locale.
    locale.forward[0] = name; key matches — return 'dipti'.
```

Seven pointer comparisons to find one key among six. For a million-entry memtable the same search would take a few dozen. That is the log-n shape in action.
Common confusions
- "Why randomised levels instead of explicit balance?" Randomisation replaces an explicit invariant ("heights differ by at most one") with a statistical one ("heights are geometrically distributed"). The statistical version is achievable without looking at the rest of the structure — each node makes its own decision at insert time — which means every operation is local. Every local operation is easier to make concurrent. The cost is that the structure is only probably balanced, but the probability is overwhelming and the constants are small.
- "What about concurrent inserts?" The version in this chapter is single-threaded — all examples assume one writer. Making it concurrent is a surprisingly small change: each forward[i] pointer is updated with a compare-and-swap rather than a plain assignment, readers ignore locks entirely, and a node is "present" once its level-0 pointer is published. Real implementations (Java's ConcurrentSkipListMap, LevelDB's MemTable) add a few subtleties around deletion and memory reclamation, but the core shape is the same as this chapter's code. The point is that it is possible in a few hundred extra lines; a lock-free red-black tree is a research-paper-sized project.
- "What if _random_level() always returns 0?" Then you have a plain singly-linked sorted list and every operation is O(n). The probability of this happening over a million inserts is (1 - p)^{10^6} \approx 10^{-300{,}000}. You will not meet it. If you want a deterministic guarantee, use a red-black tree; if you are happy with an astronomical-odds probabilistic one, use a skip list.
- "Why p = 0.5?" It is the textbook default and the simplest to analyse. A smaller p (say 0.25) promotes fewer nodes, so each node carries a shorter forward array and uses less memory, at the cost of walking farther within each level; the expected search cost, about \log_{1/p} n / p comparisons, changes little. The derivations live in Pugh's original paper. LevelDB actually uses p = 1/4, precisely to save pointer memory.
- "Does the memtable persist across crashes?" Not by itself. A memtable is RAM; a crash wipes it. The LSM story is that every memtable insert is also appended to a write-ahead log — a small sequential file on disk that can be replayed after a crash to reconstruct the in-memory state. That log is built in Build 5 of this track; this chapter leaves the memtable volatile and lets you see the pure data-structure side first. A production LSM without a WAL loses the last few seconds of writes on a crash, which is unacceptable; but the WAL is orthogonal to how the memtable itself is structured, so it is presented later.
- "Why flush at a size threshold, not a time interval?" Either works; size thresholds dominate because they give predictable disk behaviour. If the memtable flushes every 4 MB, SSTables are always about 4 MB, the index-to-data ratio is stable, and compaction can reason about its working set. Time thresholds make the file sizes depend on write rate, which makes every other decision harder. Production engines use size with an optional "and at least every N minutes" guard to bound how much data a crash can lose.
Going deeper
The skip list in sixty lines is enough to run a memtable. This section fills in the bigger picture — the alternatives that real engines use, how the structure becomes lock-free, how flushing decisions get made, and the tie to the crash-recovery story that arrives in Build 5.
Red-black tree alternatives
Some engines have used a balanced BST as the memtable — in the Java world, TreeMap is the red-black sibling of the skip-list-based ConcurrentSkipListMap. The choice between RB tree and skip list is narrow and empirical:
- Cache locality. A red-black tree with pooled node allocation keeps related nodes closer in memory than a skip list's pointer-chased levels. On workloads where memtable misses dominate, the RB tree can be 10–20% faster on lookup.
- Memory overhead. A skip list node carries one forward pointer per level it lives on — an expected 1/(1-p) pointers per node, i.e. two at p = 0.5 and about 1.33 at p = 0.25 — versus a BST node's fixed two child pointers plus balance metadata. The difference is small either way, but on huge memtables it is measurable.
- Concurrency ceiling. The skip list wins by a mile here. Above a dozen concurrent writers, the RB tree's rebalance locking kills throughput; the skip list scales near-linearly.
- Code complexity. The skip list wins again. Correct concurrent rebalancing is a long tail of edge cases; local CAS on forward pointers is not.
Engines that care most about single-threaded read latency (analytics-oriented stores) sometimes use RB trees. Engines that care most about write throughput under concurrency (transactional stores) use skip lists. There is no wrong answer; LevelDB/RocksDB picked the skip list and almost everyone else followed.
Lock-free skip lists
The chapter's implementation is single-threaded — insert() is not safe to call from two threads at once. Promoting it to lock-free takes three changes:
- Atomic pointer updates. Every forward[i] assignment becomes a compare-and-swap. If the CAS fails (someone else inserted first), retry from where you descended.
- Mark-and-sweep deletion. Deletion cannot simply detach a node — a concurrent search might be mid-traversal through it. Instead, the node is first logically deleted by setting a "marked" bit on its forward pointers, then physically removed by the next thread that walks past it. This two-step protocol is due to Harris (2001).
- Memory reclamation. You cannot free() a node while readers might still hold pointers to it. Real implementations use epoch-based reclamation or hazard pointers to delay deallocation until no reader can be inside the node.
The surprise is that none of this requires locking the whole structure — readers never block writers, writers rarely block each other. Doug Lea's ConcurrentSkipListMap implementation (open-source in OpenJDK) is the canonical reference; it fits in about 1500 lines.
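The mark-then-sweep idea is easier to see on a bare sorted linked list (level 0 in isolation). The sketch below is single-threaded and uses a plain boolean where real code packs the mark into the pointer word and swings it with a CAS — it illustrates the shape of the protocol, not a lock-free implementation:

```python
class Node:
    def __init__(self, key, nxt=None):
        self.key, self.next, self.marked = key, nxt, False

def logical_delete(node):
    node.marked = True                  # phase 1: dead, but still linked

def sweep(head):
    # phase 2: any traversal may physically unlink marked nodes it passes
    cur = head
    while cur.next is not None:
        if cur.next.marked:
            cur.next = cur.next.next    # real code: a CAS on cur.next
        else:
            cur = cur.next

def keys(head):
    out, cur = [], head.next
    while cur is not None:
        out.append(cur.key)
        cur = cur.next
    return out

head = Node(None, Node("age", Node("city", Node("zip"))))  # sentinel head
logical_delete(head.next.next)          # mark "city"; searches still traverse it
sweep(head)
print(keys(head))                       # -> ['age', 'zip']
```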
Memtable size thresholds
How big is a memtable? The answer depends on three tensions:
- Flush time. Bigger memtable means a bigger, slower flush, which stalls writes (or requires double-buffering) during the flush.
- SSTable size distribution. Very small SSTables mean lots of files, more places to look during a read, and more compaction overhead. Very large SSTables mean the write-stall cost above, and a bigger "unit" of compaction.
- Crash-recovery cost. On crash, the WAL is replayed into a fresh memtable. A bigger threshold means a longer replay.
Typical production values:
- LevelDB default: 4 MB. Aggressive flushing; lots of small SSTables; aggressive compaction.
- RocksDB default: 64 MB. Balances flush stalls against SSTable count.
- Cassandra default: 256 MB+. Tuned for sustained high-throughput writes on machines with plenty of RAM.
The threshold is a tuning knob, not a law. Picking it well requires watching your own workload — the user's write rate, the machine's IOPS, and how long a flush actually takes.
The memtable needs a WAL friend (preview of Build 5)
A memtable by itself is not durable. If the process crashes between two flushes, the contents of the current memtable vanish — potentially seconds of writes gone. For a system calling itself a database, that is not acceptable.
The fix is the write-ahead log (WAL): a small, append-only file that receives every write before the memtable insert returns. After a crash, the database starts with an empty memtable and replays the WAL to reconstruct it. When the memtable is flushed to an SSTable, the corresponding WAL range is truncated or discarded — the on-disk SSTable has taken over durability for those writes.
The WAL is orthogonal to the memtable structure — you can log writes before or after the skip-list insert; the skip list is none the wiser. That is why this chapter introduces the memtable on its own, in a pure in-memory form. Build 5 of this curriculum returns to durability with the full WAL + fsync + group-commit story, and wires the WAL into this same put() function. For now, accept that in-RAM is fast and sorted, and durability is a concern we are explicitly deferring by one layer.
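As a preview of the ordering Build 5 will enforce, here is a toy sketch. The one-line-per-write text format and the function names are made up for illustration; a real WAL is a binary, fsynced file:

```python
import io

def put(wal, memtable, key, value):
    wal.write(f"{key}\t{value}\n")      # 1. append to the log first (real code: fsync)
    memtable[key] = value               # 2. only then the in-memory insert

def replay(wal_text):
    # after a crash: rebuild the memtable by re-applying the log in order
    memtable = {}
    for line in wal_text.splitlines():
        key, value = line.split("\t", 1)
        memtable[key] = value           # later entries overwrite earlier ones
    return memtable

wal, mem = io.StringIO(), {}
put(wal, mem, "age", "19")
put(wal, mem, "age", "20")
put(wal, mem, "city", "delhi")
print(replay(wal.getvalue()) == mem)    # -> True: the log reconstructs the RAM state
```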
Why "probabilistic" is fine in practice
One lingering worry for engineers trained on deterministic data structures: isn't it dangerous to let a critical piece of the database depend on a coin flip?
The answer, once you put numbers on it, is no. The probability that a skip list's search time exceeds c \log_2 n for any c > 1 decays exponentially in c. For c = 5 (a search taking five times the expected number of steps), the probability is below 10^{-12}. For a memtable that does a million lookups a day, the expected time to encounter even one unusually slow search is on the order of millennia.
Compare with deterministic structures: a red-black tree has a worst case of 2 \log_2 n, which happens routinely on adversarial inputs. The skip list's expected case is better than the RB tree's worst case, and the tails are tighter than any real workload cares about. Randomisation here is not a compromise — it is a genuinely strong guarantee that happens to be stated statistically.
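The tail numbers in this chapter are one-liners to check — a quick arithmetic aid, not part of the build:

```python
import math

# log10 of the probability that all 10^6 inserts roll level 0 at p = 0.5 —
# the degenerate plain-linked-list scenario from the common confusions
log10_prob = 1_000_000 * math.log10(0.5)
print(round(log10_prob))    # -> -301030, i.e. about 10^-301030

# the "millennia" claim: 10^6 lookups/day, each with a 10^-12 chance of a
# 5x-slower-than-expected search; expected days until the first such event
expected_days = 1 / (1e6 * 1e-12)
years = expected_days / 365.25
print(round(years))         # -> 2738 — millennia indeed
```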
Where this leads next
The memtable sits in RAM, sorted, absorbing writes. At some point it crosses its size threshold and has to be moved to disk. That move — one sequential pass over iter_sorted(), writing records to a fresh SSTable and building the sparse index from chapter 12 as it goes — is the subject of the next chapter. A few dozen lines of code, and you will have your first fully sorted on-disk file written entirely from random-order writes.
After that, the rest of Build 3 builds the pieces that make the design practical: the read path across memtable + stack of SSTables, Bloom filters to skip most SSTables without a disk probe, and compaction strategies that call the merge() function from chapter 12 on a schedule. Every remaining chapter treats the memtable as a given — a black box that absorbs writes and iterates sorted — which is why getting it right here is worth sixty lines of care.
References
- William Pugh, Skip Lists: A Probabilistic Alternative to Balanced Trees (CACM, 1990) — the original paper introducing skip lists with the analysis and the suggested implementation. epaperpress.com PDF.
- Timothy Harris, A Pragmatic Implementation of Non-Blocking Linked-Lists (DISC, 2001) — introduces the mark-and-sweep deletion protocol that makes lock-free skip lists practical. cl.cam.ac.uk PDF.
- Keir Fraser, Practical Lock-Freedom (PhD thesis, Cambridge, 2004), Ch. 4 — extended treatment of concurrent skip lists, including the memory-reclamation problem and epoch-based solutions. cam.ac.uk PDF.
- Sanjay Ghemawat and Jeff Dean, LevelDB memtable implementation — the real, readable C++ skip list used in LevelDB and the ancestor of RocksDB's default memtable. github.com/google/leveldb — db/skiplist.h.
- Doug Lea, ConcurrentSkipListMap — the canonical production-grade lock-free skip list in OpenJDK's java.util.concurrent, with extensive commentary. OpenJDK source.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 — concise overview of memtables, SSTables, and where the skip-list choice sits in the LSM design space. dataintensive.net.