In short

The previous chapter left one large question unanswered. SSTables on disk are sorted — that was the whole point — but the writes a database receives do not arrive sorted. A put("zip", ...) is followed by a put("age", ...), then a put("locale", ...), in any order. You cannot keep a file on disk sorted under random insertion without shuffling bytes around, which is the expensive thing you were trying to avoid. The LSM answer is to add one small layer in front of the disk — the memtable, an in-memory sorted structure that absorbs writes as they arrive and keeps them in key order. When it has accumulated enough (typically a few tens of megabytes), you serialise the whole thing to a new SSTable in one sorted pass and start a fresh empty memtable. Sorting in RAM is cheap; sorting on disk is not; the memtable is the place where that cheap sort happens. This chapter implements a memtable as a skip list — a probabilistic sorted structure with O(\log n) insert and lookup, sorted iteration for free, and (crucially, for later chapters) a dramatically simpler concurrency story than balanced trees. The entire thing fits in about sixty lines of Python. You also wire the memtable into put() and get() so it becomes the top level of the LSM read path — every lookup checks the memtable first before touching disk.

The last chapter gave you a storage primitive — an SSTable with a sparse index, merged with a twenty-line streaming merge — and a question it did not answer: how do you keep a file sorted when writes arrive in any order?

The answer, stated once in the previous chapter's common confusions and unpacked here, is that you do not. On-disk files are immutable. Random writes go to a small, dynamic, sorted structure in RAM called the memtable. When the memtable is big enough, it is flushed to disk in one sequential sorted pass as a new SSTable, and a fresh empty memtable starts accepting writes.

This chapter builds that memtable. The data structure is a skip list — not because it is the only option, but because it is the one that works cleanest for this specific job.

The writes-arrive-out-of-order problem

Picture the write stream hitting your database over one second:

put("zip",    "600001")
put("age",    "19")
put("city",   "delhi")
put("name",   "dipti")
put("age",    "20")       # overwrite
put("locale", "en-IN")
put("role",   "admin")

The keys arrive in no particular order — zip before age, an overwrite of age before the unrelated locale and role. A real production workload looks like this, only a million times noisier.

You already decided, in the previous chapter, that on-disk files will be sorted SSTables. The sparse index, the linear-time merge, the cheap range queries — all three depend on the sort invariant. So somewhere between "writes arriving in any order" and "sorted file on disk" a sort has to happen.

Sorting on disk is out. A sorted file cannot accept an insertion in the middle without shifting bytes — a random write across the whole file, in the worst case. That is the opposite of sequential I/O, the thing this whole design was supposed to preserve.

Sorting in RAM is cheap. O(\log n) per insert into any balanced ordered structure; nanoseconds of work; no disk touched at all. If you can hold recent writes in RAM, keep them sorted as they arrive, and only occasionally move them to disk in one pre-sorted batch, you get cheap random writes (from the application's point of view) and cheap sequential writes (from the disk's point of view) simultaneously.
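The shape of that bargain can be sketched in a dozen lines — here a plain dict stands in for the memtable purely to show the buffer-then-flush motion (the file name and record format are invented for the demo; the real structure arrives below):

```python
import os, tempfile

# the out-of-order write stream from above
writes = [("zip", "600001"), ("age", "19"), ("city", "delhi"),
          ("name", "dipti"), ("age", "20"), ("locale", "en-IN"),
          ("role", "admin")]

buf = {}                          # in-RAM buffer; latest write per key wins
for k, v in writes:
    buf[k] = v                    # cheap random-order insert — no disk touched

# one cheap in-RAM sort, then one strictly sequential write to disk
path = os.path.join(tempfile.mkdtemp(), "sst-000.txt")   # invented name
with open(path, "w") as f:
    for k in sorted(buf):
        f.write(f"{k}={buf[k]}\n")

lines = open(path).read().splitlines()
print(lines)   # sorted on disk: age=20 first, zip=600001 last
```

The dict version is exactly the shortcut this chapter dismisses later — it cannot serve sorted reads while live — but it shows where the cheap sort happens.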

That buffer is the memtable. Its job is narrow and specific:

  1. Accept writes in any key order, fast — O(\log n) or better per insert.
  2. Keep its keys sorted at all times.
  3. Serve point lookups, since it holds the newest value for every recent key.
  4. Iterate in sorted order, so a flush is one sequential pass.
  5. Report its approximate size in bytes, so the flusher knows when to trigger.

Five requirements. The data structure you pick is whatever satisfies them most simply.

Why a skip list

The obvious choices for an "in-memory sorted structure" are a balanced BST (red-black, AVL), a sorted array with binary search, or a skip list. All three are O(\log n) for lookup; they differ on the other axes.

A sorted array is excellent for lookup (binary search; cache-friendly) and terrible for insertion (shift half the array on every write, O(n)). A memtable processing a thousand writes per second on a million-entry structure would spend all its time shifting bytes. Ruled out.
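The shift cost is visible in the standard library: bisect.insort finds the slot by binary search but then moves every element after it. A sketch (correctness only, not a benchmark):

```python
import bisect

arr = []
for key in ["zip", "age", "city", "name", "locale", "role"]:
    # O(log n) to find the slot, then an O(n) shift of everything after it
    bisect.insort(arr, key)

print(arr)   # ['age', 'city', 'locale', 'name', 'role', 'zip']
```

On a million-entry array each insert moves half a million slots on average — the ruling-out in the paragraph above.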

A balanced BST — red-black or AVL — handles both operations in O(\log n) and is what std::map uses under the hood. It is the right answer if you need worst-case guarantees and you have a single thread. The trouble is concurrency. A red-black tree rebalance after an insert can touch O(\log n) nodes, some on the path from the inserted node up to the root. To support concurrent readers and writers safely, you either lock a large portion of the tree during a rebalance (coarse locks, poor parallelism) or implement elaborate lock-coupling / node-versioning schemes that are notoriously hard to get right. Production LSM engines support thousands of concurrent readers on the memtable, so this matters.

A skip list — William Pugh's 1989 structure — hits the same O(\log n) expected bounds with two structural properties that make the concurrency story clean:

  1. An insert touches only local state — the forward pointers of the new node's predecessors at each of its levels. There is no rebalancing cascade toward a root.
  2. A node's height is fixed by coin flips at insertion and never changes afterwards, so no operation ever restructures existing nodes.

Those two properties let a skip list be lock-free. Readers walk the bottom-level linked list; writers do a small number of compare-and-swap operations on pointer fields; no global or subtree locks are needed. Java's ConcurrentSkipListMap, LevelDB's MemTable, and RocksDB's default memtable are all skip lists for exactly this reason. The coding is also simpler — this chapter's implementation is under sixty lines; a correct concurrent red-black tree is closer to a thousand.

[Figure: a four-level skip list with fast lanes — four horizontal lanes of pointers over eight sorted keys, with a coin-flip legend "p = 1/2".]
A skip list over eight keys. Level 0 is a complete sorted linked list in order: age, city, email, locale, name, phone, role, zip. Level 1 contains age, email, name, role, zip; level 2 contains age, name, zip; level 3 only age and zip. Each higher level is a random subset of the one below it, forming "fast lanes" that let a search skip large gaps. Each node chooses its height independently by coin flips at insertion: Prob(height ≥ k) = p^k, here p = 1/2. Search starts at the top-left, walks right while the next key is still less than the target, drops down a level when it would overshoot, and repeats. The expected path length is O(\log n).

The mental model is simple. The bottom level is a sorted linked list of every key. Above it sit shorter "express" linked lists that skip over most of the nodes. Search starts at the top-left corner and walks right as far as it can without overshooting the target, then drops down a level and repeats. Each drop-down halves (in expectation) the remaining search window.

That is the whole idea. Now the code.

The implementation — about sixty lines

We need four operations — insert(k, v), get(k), iter_sorted(), and a size-in-bytes estimate so the flusher knows when to trigger. A fifth, delete(k), is modelled as inserting a tombstone value; this chapter keeps the surface small and handles tombstones properly in chapter 15.

# memtable.py — a skip-list-backed memtable for an LSM-tree
import random

_MAX_LEVEL = 16          # supports ~2^16 = 65k entries at p=0.5; more than enough for a memtable
_P         = 0.5         # probability of promoting to the next level at insertion

class _Node:
    __slots__ = ("key", "value", "forward")
    def __init__(self, key, value, level):
        self.key, self.value = key, value
        # forward[i] is the next node at level i; one pointer per level this node lives on
        self.forward = [None] * (level + 1)

class SkipListMemtable:
    def __init__(self):
        # the head is a sentinel: its key is None ("-infinity"), it lives on every level
        self._head  = _Node(None, None, _MAX_LEVEL)
        self._level = 0                   # highest level currently in use (0-indexed)
        self._bytes = 0                   # approximate bytes of user data held

    def _random_level(self):
        # geometric distribution: keep flipping coins until tails, capped at _MAX_LEVEL
        lvl = 0
        while random.random() < _P and lvl < _MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key, value):
        update = [self._head] * (_MAX_LEVEL + 1)   # per level, the last node whose next key < target
        cur = self._head
        # top-down descent: at each level walk right while next.key < key, then drop down
        for i in range(self._level, -1, -1):
            while cur.forward[i] is not None and cur.forward[i].key < key:
                cur = cur.forward[i]
            update[i] = cur
        # now cur.forward[0] is either the existing node for this key, or the successor
        nxt = cur.forward[0]
        if nxt is not None and nxt.key == key:     # overwrite: update value in place
            self._bytes += len(value) - len(nxt.value)
            nxt.value = value
            return
        lvl = self._random_level()
        if lvl > self._level:                      # structure grew — splice head at new levels
            for i in range(self._level + 1, lvl + 1):
                update[i] = self._head
            self._level = lvl
        new_node = _Node(key, value, lvl)
        for i in range(lvl + 1):                   # splice new node in at each of its levels
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node
        self._bytes += len(key) + len(value)

    def get(self, key):
        cur = self._head
        for i in range(self._level, -1, -1):
            while cur.forward[i] is not None and cur.forward[i].key < key:
                cur = cur.forward[i]
        nxt = cur.forward[0]
        return nxt.value if (nxt is not None and nxt.key == key) else None

    def iter_sorted(self):
        # bottom level is a complete sorted linked list — exactly what the flusher wants
        cur = self._head.forward[0]
        while cur is not None:
            yield cur.key, cur.value
            cur = cur.forward[0]

    def approx_bytes(self):
        return self._bytes

About sixty lines including the node class and blank lines. Read it once end-to-end and notice three things.

Why the update array: insert has to splice the new node into every level it will participate in. At each level the "splice point" is the last node whose next key is still less than the target. Rather than re-descend from the head once per level, the descent records one splice point per level on its way down, in the update list. One O(\log n) pass instead of O(\log^2 n).

Why the sentinel head: the head is a fake node whose key is treated as less than every real key. It simplifies the splicing code — there is always a node to write forward[i] = new_node into, even for the very first insert. Without a sentinel, every level-0 insert would need an "am I inserting at the head?" branch.

Why _random_level uses a geometric distribution: we want roughly half the nodes at level 0 only, a quarter up to level 1, an eighth up to level 2, and so on. That geometric fall-off is what gives the log-n search. Each node's height is drawn independently — there is no global bookkeeping, no rebalancing, no "too-heavy subtree" concept. You could not write a balanced BST in sixty lines; the probabilistic structure is what buys the simplicity.
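The fall-off is easy to check empirically. This sketch copies _random_level as a standalone function and measures the height distribution over 100,000 samples (the seed is fixed only to make the run repeatable):

```python
import random

_P, _MAX_LEVEL = 0.5, 16

def random_level():
    # same geometric coin-flip scheme as the memtable's _random_level
    lvl = 0
    while random.random() < _P and lvl < _MAX_LEVEL:
        lvl += 1
    return lvl

random.seed(42)
samples = [random_level() for _ in range(100_000)]
for k in range(1, 5):
    frac = sum(1 for h in samples if h >= k) / len(samples)
    print(f"P(height >= {k}) ~ {frac:.3f}   (expected {0.5 ** k:.3f})")
```

Roughly half the nodes reach level 1, a quarter level 2, and so on — the geometric shape the search analysis relies on.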

Why not a simpler dict + periodic sort?

A tempting shortcut is "just use a Python dict, and when the memtable is full, sort it for the flush." That works — a single-pass sort of n entries is O(n \log n), which amortises to O(\log n) per insert on the flush. But it loses two properties the LSM read path needs:

  1. Sorted iteration at any moment. Reads that span the memtable and the on-disk SSTables merge sorted streams; a dict can only produce sorted order by re-sorting everything on every request.
  2. Cheap range queries. A range scan on a skip list is one descent plus a walk along level 0 — O(\log n + k); on a dict it is a full sort or a full scan.

The dict answer is a fine prototype for a single-shot insert-then-dump demo; it is not what a production memtable looks like. The skip list gives you ordered iteration and range queries for free, and puts you on the path to lock-free concurrency later.
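For contrast, here is the shortcut made concrete — a hypothetical DictMemtable (the name is invented for illustration), fine as a flush demo and exactly as limited as described:

```python
class DictMemtable:
    """Insert-then-dump prototype. Hypothetical name, for illustration only."""
    def __init__(self):
        self._d = {}

    def insert(self, key, value):
        self._d[key] = value                 # O(1), but unsorted

    def get(self, key):
        return self._d.get(key)

    def range(self, lo, hi):
        # a full sort on EVERY range query — the skip list pays O(log n + k)
        return [(k, self._d[k]) for k in sorted(self._d) if lo <= k <= hi]

    def iter_sorted(self):
        # one O(n log n) sort per flush — acceptable once, not per read
        yield from sorted(self._d.items())

m = DictMemtable()
for k, v in [("zip", "600001"), ("age", "20"), ("city", "delhi")]:
    m.insert(k, v)
print(m.range("a", "d"))   # [('age', '20'), ('city', 'delhi')]
```

The flush path is fine; it is the live read path — range and every sorted read — that pays the re-sort.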

Skip list vs the alternatives, side by side

Property                    | Sorted array              | Balanced BST (RB/AVL)              | Skip list
Insert (single-threaded)    | O(n) — shift              | O(\log n) + rebalance              | O(\log n) expected
Lookup                      | O(\log n) binary search   | O(\log n)                          | O(\log n) expected
Sorted iteration            | O(n), sequential memory   | O(n), in-order walk                | O(n), walk level 0
Range query                 | O(\log n + k)             | O(\log n + k)                      | O(\log n + k)
Concurrent writers          | Terrible — single writer  | Hard — rebalances touch many nodes | Clean — local CAS per level
Worst-case guarantee        | Deterministic             | Deterministic                      | Probabilistic (very tight)
Lines of code (simple impl) | ~20                       | ~300+ (RB)                         | ~60

The only column where the skip list loses is "worst-case guarantee." With very bad luck every node could roll a great height and searches would degenerate toward linear scans. But the probability is beyond negligible: the chance that all million nodes of a million-entry structure each win twenty coin flips in a row (ignoring the level cap) is about 2^{-20 \cdot 10^6} — a number so small it is not worth reasoning about. The pragmatic worst case equals the expected case.
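Putting that number on screen requires working in logarithms, since no float can hold it. A back-of-envelope sketch:

```python
import math

n = 1_000_000
height = 20                 # twenty winning coin flips per node at p = 1/2
# P(one node reaches height 20) = 2^-20; P(all n nodes do) = 2^(-20 * n).
# Far below float range, so compute the base-10 exponent instead.
log10_p = -height * n * math.log10(2)
print(f"probability ~ 10^{log10_p:.0f}")
```

An exponent of roughly minus six million — "not worth reasoning about" is an understatement.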

Balanced BSTs are the "right" answer if you demand a deterministic O(\log n) bound for every operation, tolerate a thousand lines of code, and can afford the concurrency complexity. Skip lists are the "right" answer for a memtable, which is exactly the workload — short-lived, concurrent, sorted, throw-away — that the probabilistic trade-off is designed for.

Put and get through the memtable

Once the skip list exists, wiring it into the database's hot path is three lines of change. The put() function just inserts into the memtable. The get() function checks the memtable first; only on a miss does it fall through to the on-disk SSTables (which you will build in the next chapter). This "memtable is the top level of the LSM" structure is the whole read path in miniature.

# lsm.py — the shape of put/get once the memtable exists
class LSMStore:
    def __init__(self):
        self.memtable  = SkipListMemtable()
        self.sstables  = []                   # newest-first list; filled in chapter 14 onwards
        self.flush_bytes = 4 * 1024 * 1024    # 4 MB threshold for this chapter's demo

    def put(self, key, value):
        self.memtable.insert(key, value)
        if self.memtable.approx_bytes() >= self.flush_bytes:
            self._flush()                    # chapter 14: serialise memtable to a new SSTable

    def get(self, key):
        v = self.memtable.get(key)
        if v is not None:
            return v                         # memtable hit — newest value, always wins
        for sst in self.sstables:            # newest-first on-disk probe
            v = sst.get(key)
            if v is not None:
                return v
        return None

Two things to notice.

First, the memtable is always consulted first on a read, because it holds the newest values. A key that was overwritten ten seconds ago lives in the memtable with its new value; an older SSTable on disk might still carry the stale value. Checking memtable first preserves "latest write wins" for free, without any version numbers.

Second, put() is unconditionally a memtable insert. There is no "should I write this to disk?" decision on the hot path. The disk is touched only when the memtable crosses its size threshold, which amortises the cost across many writes. A million small put()s cost a million skip-list inserts plus one or two big sequential SSTable writes — not a million random disk writes.
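Back-of-envelope, with assumed numbers — 100-byte entries are an assumption; the 4 MB threshold is this chapter's demo value:

```python
entry_bytes = 100                    # assumed average key + value size
writes      = 1_000_000
flush_bytes = 4 * 1024 * 1024        # the demo threshold from lsm.py above

total_bytes = writes * entry_bytes           # ~100 MB of user data
flushes     = total_bytes // flush_bytes
print(f"{writes:,} puts -> {flushes} sequential flushes "
      f"instead of {writes:,} random disk writes")   # 23 flushes
```

A million application-level writes become a couple of dozen big sequential disk writes — the amortisation in one line of arithmetic.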

This is the LSM's central trick, condensed: writes go to RAM, flushed in bulk; reads check RAM then disk, newest-first. Every remaining chapter of Build 3 fills in pieces of that sentence.

A trace

Start with an empty memtable. Apply the seven writes from the top of the chapter, in the order they arrived, then read four keys. Watch the skip list sort the data as it arrives.

m = SkipListMemtable()
m.insert("zip",    "600001")
m.insert("age",    "19")
m.insert("city",   "delhi")
m.insert("name",   "dipti")
m.insert("age",    "20")      # overwrite — same slot, value replaced in place
m.insert("locale", "en-IN")
m.insert("role",   "admin")

After all seven inserts, iter_sorted() walks level 0 front to back. It yields — regardless of the random levels the individual nodes happened to roll:

age    -> 20         (overwrite applied; 19 is gone)
city   -> delhi
locale -> en-IN
name   -> dipti
role   -> admin
zip    -> 600001

Six entries — the overwrite of age collapsed two inserts into one slot, exactly what you want. Point lookups:

m.get("name")      # -> 'dipti'      — walks top-level, drops down, lands on 'name'
m.get("age")       # -> '20'         — newest value, because overwrite updated in place
m.get("mobile")    # -> None         — descent lands between 'locale' and 'name'; next key > 'mobile'
m.get("zip")       # -> '600001'

Trace get("name") in detail, on the particular shape shown in the earlier figure (levels: age has height 3, name has height 2, zip has height 3; city, email, locale, phone, role are level 0 only):

start: cur = head, i = 3
L3: head.forward[3] = age; age.key='age' < 'name', advance. cur = age.
    age.forward[3] = zip; zip.key='zip' > 'name', drop. i = 2.
L2: age.forward[2] = name; name.key='name' < 'name'? No, equal. drop. i = 1.
L1: age.forward[1] = email; email.key='email' < 'name', advance. cur = email.
    email.forward[1] = name; equal to target, drop. i = 0.
L0: email.forward[0] = locale; 'locale' < 'name', advance. cur = locale.
    locale.forward[0] = name; key matches — return 'dipti'.

Six pointer hops. For a million-entry memtable the same search would take about twenty. That is the log-n shape in action.

Common confusions

Going deeper

The skip list in sixty lines is enough to run a memtable. This section fills in the bigger picture — the alternatives that real engines use, how the structure becomes lock-free, how flushing decisions get made, and the tie to the crash-recovery story that arrives in Build 5.

Red-black tree alternatives

Some engines — notably early versions of Cassandra — used a balanced BST (specifically ConcurrentSkipListMap's red-black sibling, TreeMap) as the memtable. The choice between RB tree and skip list is narrow and empirical:

  1. Single-threaded, the RB tree is often slightly faster: deterministic depth, no per-insert random numbers, and somewhat better cache behaviour.
  2. Under concurrency, the skip list wins decisively: readers never block, and writers touch a handful of local pointers instead of locking subtrees through a rebalance.

Engines that care most about single-threaded read latency (analytics-oriented stores) sometimes use RB trees. Engines that care most about write throughput under concurrency (transactional stores) use skip lists. There is no wrong answer; LevelDB/RocksDB picked the skip list and almost everyone else followed.

Lock-free skip lists

The chapter's implementation is single-threaded — insert() is not safe to call from two threads at once. Promoting it to lock-free takes three changes:

  1. Atomic pointer updates. Every forward[i] assignment becomes a compare-and-swap. If the CAS fails (someone else inserted first), retry from where you descended.
  2. Mark-and-sweep deletion. Deletion cannot simply detach a node — a concurrent search might be mid-traversal through it. Instead, the node is first logically deleted by setting a "marked" bit on its forward pointers, then physically removed by the next thread that walks past it. This two-step protocol is due to Harris (2001).
  3. Memory reclamation. You cannot free() a node while readers might still hold pointers to it. Real implementations use epoch-based reclamation or hazard pointers to delay deallocation until no reader can be inside the node.

The surprise is that none of this requires locking the whole structure — readers never block writers, writers rarely block each other. Doug Lea's ConcurrentSkipListMap implementation (open-source in OpenJDK) is the canonical reference; it fits in about 1500 lines.

Memtable size thresholds

How big is a memtable? The answer depends on three tensions:

  1. RAM. The active memtable — plus any immutable memtables still being flushed — lives entirely in memory, and its budget competes with block caches and the OS page cache.
  2. Recovery time. Until it is flushed, the memtable's contents exist only in RAM and the WAL, so a bigger memtable means a longer log replay after a crash.
  3. SSTable granularity. A tiny memtable produces many small SSTables, inflating read amplification and compaction work; a huge one makes each flush a long stall.

Typical production values: LevelDB defaults to a 4 MB write_buffer_size; RocksDB defaults to 64 MB, and write-heavy deployments often raise it further; Cassandra budgets memtable space as a fraction of the JVM heap.

The threshold is a tuning knob, not a law. Picking it well requires watching your own workload — the user's write rate, the machine's IOPS, and how long a flush actually takes.
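The interaction between those knobs is plain arithmetic. A sketch with hypothetical numbers — the write rate and disk speed below are assumptions, not measurements:

```python
write_rate_mb_s = 10     # hypothetical sustained ingest rate
flush_mb_s      = 200    # hypothetical sequential disk write throughput
threshold_mb    = 64     # e.g. RocksDB's default write_buffer_size

fill_seconds  = threshold_mb / write_rate_mb_s   # time to fill the memtable
flush_seconds = threshold_mb / flush_mb_s        # time to write it out
print(f"flush triggers every {fill_seconds:.1f}s and runs for {flush_seconds:.2f}s")
# the flusher keeps up as long as flush_seconds stays well under fill_seconds
```

When the two numbers converge, writes begin to stall behind flushes — the signal that the threshold (or the disk) needs revisiting.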

The memtable needs a WAL friend (preview of Build 5)

A memtable by itself is not durable. If the process crashes between two flushes, the contents of the current memtable vanish — potentially seconds of writes gone. For a system calling itself a database, that is not acceptable.

The fix is the write-ahead log (WAL): a small, append-only file that receives every write before the memtable insert returns. After a crash, the database starts with an empty memtable and replays the WAL to reconstruct it. When the memtable is flushed to an SSTable, the corresponding WAL range is truncated or discarded — the on-disk SSTable has taken over durability for those writes.
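A minimal sketch of that protocol — the tab-separated line format and file name are invented for readability; Build 5's real WAL uses length-prefixed binary records, fsync, and group commit:

```python
import os, tempfile

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")   # invented name

def wal_append(key, value):
    # append BEFORE the memtable insert returns; durability comes first
    with open(wal_path, "a") as f:
        f.write(f"{key}\t{value}\n")

def wal_replay():
    # crash recovery: rebuild the lost memtable's contents from the log
    recovered = {}
    with open(wal_path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t", 1)
            recovered[key] = value     # later entries win, as in the memtable
    return recovered

for k, v in [("zip", "600001"), ("age", "19"), ("age", "20")]:
    wal_append(k, v)
print(wal_replay())   # {'zip': '600001', 'age': '20'}
```

Replay applies entries in log order, so overwrites resolve exactly as they did in the memtable before the crash.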

The WAL is orthogonal to the memtable structure — you can log writes before or after the skip-list insert; the skip list is none the wiser. That is why this chapter introduces the memtable on its own, in a pure in-memory form. Build 5 of this curriculum returns to durability with the full WAL + fsync + group-commit story, and wires the WAL into this same put() function. For now, accept that in-RAM is fast and sorted, and durability is a concern we are explicitly deferring by one layer.

Why "probabilistic" is fine in practice

One lingering worry for engineers trained on deterministic data structures: isn't it dangerous to let a critical piece of the database depend on a coin flip?

The answer, once you put numbers on it, is no. The probability that a skip list's search time exceeds c \log_2 n for any c > 1 decays exponentially in c. For c = 5 (a search taking five times the expected number of steps), the probability is below 10^{-12}. For a memtable that does a million lookups a day, the expected time to encounter even one unusually slow search is on the order of millennia.

Compare with deterministic structures: a red-black tree has a worst case of 2 \log_2 n, which happens routinely on adversarial inputs. The skip list's expected case is better than the RB tree's worst case, and the tails are tighter than any real workload cares about. Randomisation here is not a compromise — it is a genuinely strong guarantee that happens to be stated statistically.

Where this leads next

The memtable sits in RAM, sorted, absorbing writes. At some point it crosses its size threshold and has to be moved to disk. That move — one sequential pass over iter_sorted(), writing records to a fresh SSTable and building the sparse index from chapter 12 as it goes — is the subject of the next chapter. A few dozen lines of code, and you will have your first fully sorted on-disk file written entirely from random-order writes.

After that, the rest of Build 3 builds the pieces that make the design practical: the read path across memtable + stack of SSTables, Bloom filters to skip most SSTables without a disk probe, and compaction strategies that call the merge() function from chapter 12 on a schedule. Every remaining chapter treats the memtable as a given — a black box that absorbs writes and iterates sorted — which is why getting it right here is worth sixty lines of care.

References

  1. William Pugh, Skip Lists: A Probabilistic Alternative to Balanced Trees (CACM, 1990) — the original paper introducing skip lists with the analysis and the suggested implementation. epaperpress.com PDF.
  2. Timothy Harris, A Pragmatic Implementation of Non-Blocking Linked-Lists (DISC, 2001) — introduces the mark-and-sweep deletion protocol that makes lock-free skip lists practical. cl.cam.ac.uk PDF.
  3. Keir Fraser, Practical Lock-Freedom (PhD thesis, Cambridge, 2004), Ch. 4 — extended treatment of concurrent skip lists, including the memory-reclamation problem and epoch-based solutions. cam.ac.uk PDF.
  4. Sanjay Ghemawat and Jeff Dean, LevelDB memtable implementation — the real, readable C++ skip list used in LevelDB and the ancestor of RocksDB's default memtable. github.com/google/leveldb — db/skiplist.h.
  5. Doug Lea, ConcurrentSkipListMap — the canonical production-grade lock-free skip list in OpenJDK's java.util.concurrent, with extensive commentary. OpenJDK source.
  6. Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 — concise overview of memtables, SSTables, and where the skip-list choice sits in the LSM design space. dataintensive.net.