In short
Build 1's AppendOnlyKV was fast to write and unbearable to read — every get() scanned the whole log from the top. The fix is one idea: keep a Python dictionary in memory mapping each key to the byte offset of its latest record in the file. Writes still append; they also update one dictionary entry. Reads become a hash-table lookup plus a single seek() — O(1) regardless of how many records the log holds. Sixty lines of Python do this. On a million-record log, reads go from roughly 40 per second to well over a hundred thousand. You have just rebuilt Bitcask, the key-value engine Basho shipped as the default backend for Riak in 2010, and the cleanest real-world example of "append-only log plus in-memory hash index." This chapter shows the design, writes the code, benchmarks the speedup, and lays out the three new problems it creates — problems the next five chapters solve.
In chapter 2 you built a thirty-line key-value store whose writes hit a million records per second but whose reads were O(n) in the size of the log. At a hundred thousand records you saw 400 reads per second. At a million, forty. At ten million, four. That is not a database you can put behind anything — a single web page that does ten lookups takes seconds just on the storage layer.
The previous chapter named this wall. It is the first of the three that any log-structured store eventually hits: linear-time reads. This chapter climbs over it, and the climb costs you only one idea.
Here is the idea. The log already stores, for every record, the full bytes of the record. What it does not tell you, at read time, is where in the file the latest record for a given key lives. If you knew that — if you had, for each key, a small integer saying "the live value for age starts at byte 472" — you could seek() straight to byte 472, read one line, and return. One disk seek. Constant time in the size of the log.
That integer is called a byte offset. A dictionary from keys to byte offsets is called an offset table, or more generally an index. Put one in RAM, keep it in sync with the log on every put(), and your store is suddenly a real one.
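Here is that mechanic in isolation, before any store exists around it — a two-record file written by hand, where we happen to know the second record starts at byte 11 (the real store computes these offsets automatically):

```python
import tempfile, os

# write a tiny two-record log by hand
path = os.path.join(tempfile.mkdtemp(), "demo.log")
with open(path, "wb") as f:
    f.write(b"name=dipti\n")   # occupies bytes 0..10
    f.write(b"age=15\n")       # starts at byte 11

# read the second record directly, without scanning the first
with open(path, "rb") as f:
    f.seek(11)                 # jump straight to the record
    line = f.readline()
print(line)                    # b"age=15\n"
```

One seek, one readline — the cost is the same whether the file holds two records or two billion.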
The insight in one picture
Think about what your brain does when it reads a physical book. You do not scan from page one every time you want to look something up. You open the index at the back, find the word, read the page number, and turn to that page. The body of the book is the log. The index is the small helper table that turns "I know the word" into "I know the page."
A database index does the same thing, with exactly the same motivation and exactly the same shape.
get("age") now skips the first five lines entirely.

Three things to notice before you write a line of code.
The log did not change. The file on disk is exactly what AppendOnlyKV would produce. Every old version of every key is still there. The offset table is a derived, in-memory view. If you delete the table, you rebuild it by scanning the log once. The log is the truth; the index is a convenience.
The table only has one entry per distinct key. There are five records for age and city combined in the log, but only two entries in the table (plus one for name). The table size is the number of distinct keys you have ever written, not the number of writes. This is the property that makes Bitcask usable on a laptop: you can have ten million puts but only fifty thousand unique keys, and your RAM footprint tracks the fifty thousand.
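The bookkeeping can be simulated without touching a file — ten thousand writes cycling over one hundred keys leave exactly one hundred index entries (a sketch of the index logic alone):

```python
index = {}
offset = 0
for i in range(10_000):
    key = f"key{i % 100}"                 # only 100 distinct keys
    record = f"{key}=value{i}\n".encode()
    index[key] = offset                   # latest offset wins
    offset += len(record)

print(len(index))   # 100 entries, despite 10,000 writes
print(offset)       # the log itself keeps growing
```

The log grows with every write; the index grows only with every new key.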
A put() is now two operations, not one. Append the record to the log, and update one dictionary entry. Both operations must happen for every write. We will see below that this costs you almost nothing — a dictionary insert is a nanosecond, a disk append is a microsecond, you are not adding a meaningful fraction.
Why the offset, not the value: you could put the value itself in the RAM table and never read the log on get(). Some caches do this (this is basically what Redis is). But then RAM must hold every value, not just every offset — a blog with a billion 2 KB articles would need two terabytes of RAM. An 8-byte offset per key keeps the table hundreds of times smaller than the data, and the gap widens as values grow, which is the whole trick.
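The arithmetic behind the blog example, with round numbers (illustrative figures, not measurements):

```python
keys = 1_000_000_000          # a billion articles
value_size = 2_000            # ~2 KB per article
offset_size = 8               # one 64-bit byte offset per key

values_in_ram = keys * value_size    # cache the values:  ~2 TB of RAM
offsets_in_ram = keys * offset_size  # cache the offsets: ~8 GB of RAM
print(values_in_ram // offsets_in_ram)   # 250x smaller at this value size
```

The ratio is just value size divided by offset size, so the bigger your values, the more the offset table wins.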
Writing it — BitcaskLite
Here is the full implementation. It extends the Build 1 AppendOnlyKV with an in-memory dict, rebuilt on startup by scanning the log, and kept current on every put().
# bitcasklite.py — an append-only log with an in-memory offset table

class BitcaskLite:
    def __init__(self, path):
        self.path = path
        # offset table: key (str) -> byte offset of the latest record (int)
        self._index = {}
        # make sure the file exists before we open it for reading/appending
        open(path, "a", encoding="utf-8").close()
        self._rebuild_index()
        # binary append mode so we know exactly how many bytes we wrote
        self._f = open(path, "ab")

    def _rebuild_index(self):
        # scan the whole log once, recording the offset of each key's latest record
        offset = 0
        with open(self.path, "rb") as f:
            for line in f:
                # line includes the trailing b"\n"; partition at the first b"="
                key_bytes, eq, _ = line.partition(b"=")
                if eq and key_bytes:  # ignore torn/malformed lines
                    self._index[key_bytes.decode("utf-8")] = offset
                offset += len(line)

    def put(self, key, value):
        # encode the record, note its starting offset, append, update the index
        record = f"{key}={value}\n".encode("utf-8")
        offset = self._f.tell()    # position where this record will start
        self._f.write(record)
        self._f.flush()            # push to kernel (not yet fsynced)
        self._index[key] = offset  # <-- the only new line over Build 1

    def get(self, key):
        offset = self._index.get(key)
        if offset is None:
            return None  # key was never written
        with open(self.path, "rb") as f:
            f.seek(offset)         # jump straight to the record — O(1)
            line = f.readline()
        _, _, value = line.rstrip(b"\n").partition(b"=")
        return value.decode("utf-8")

    def close(self):
        self._f.close()
Sixty lines including comments and blanks. Read it once more, slowly — every choice matters.
self._index = {}. A plain Python dictionary, i.e. a hash table. Lookup, insert, and update are all expected O(1). The keys of the dict are the keys of the store; the values of the dict are byte offsets (plain Python ints). Nothing else lives in RAM about the data itself.
_rebuild_index() at startup. This is the price you pay for keeping the index only in RAM — every time the process starts, it has to scan the log to reconstruct it. The scan is the same O(n) operation AppendOnlyKV.get() used to do, except now it runs once at startup, not on every read. A 1 GB log takes a few seconds to rebuild; a 100 GB log takes a couple of minutes; a 1 TB log takes half an hour. Chapter 10 fixes this with hint files, small companion files that record the index snapshot so startup is instant.
self._f = open(path, "ab"). Binary append mode. Two reasons. First, tell() in binary mode returns a real byte offset; in text mode Python 3 returns an opaque cookie that is not a byte count — it is only valid when passed back to seek() on that same text stream, so you cannot do arithmetic with it or store it in an index. Second, we want byte-exact control over record sizes — any future move to length-prefixed binary records (chapter 5 of Build 1 previewed that) needs binary I/O anyway.
offset = self._f.tell(). The single most important line in put(). It captures where the write is about to land. Because we opened the file in append mode, the write will go exactly there regardless of where the kernel thinks the file pointer is. We record that offset, do the write, then update the index — in that order, so the index can never point past the end of the file on disk.
self._index[key] = offset. One dict insert. A few nanoseconds on a modern CPU. This is the only "index maintenance" code in the whole store. Every other database you will ever look at has more elaborate index-update logic (tree rebalancing, SSTable flushes, B+-tree splits); Bitcask has this one line. That is why it is the ideal introductory design.
get(). Look up the offset. If the key was never written, return None. Otherwise open the file, seek() straight to the offset, read one line, and return the value. One disk seek, one read, one dictionary lookup. The file size is irrelevant — get() does the same amount of work whether the log is 1 KB or 1 TB.
Why the index is kept in sync on every put, never "rebuilt later": consistency. If we updated the log but left the index stale, a subsequent get() would return the wrong value — the old offset pointing to an old record. Index-and-log must commit together. The trick in BitcaskLite is that the log is the source of truth, so if the process crashes after the write but before the index update, the in-memory index is lost anyway; on restart, _rebuild_index() reads the durable log and reproduces the correct table. Crash safety is automatic.
Try it:
db = BitcaskLite("bitcask.log")
db.put("name", "dipti")
db.put("age", "15")
db.put("city", "coimbatore")
db.put("age", "16")
db.put("age", "17")
db.put("city", "delhi")
print(db.get("age")) # 17
print(db.get("city")) # delhi
print(db.get("name")) # dipti
print(db.get("missing")) # None
db.close()
Kill the Python process, start a new one, call BitcaskLite("bitcask.log") again. The _rebuild_index() scan runs on startup, reconstructs the table from the log, and every get() works the same as before. The durable state is in the file; the index is rebuilt on demand.
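You can exercise the recovery path in isolation — hand-write the log a previous "process" would have left behind, then rebuild the table exactly the way _rebuild_index() does (a sketch using the same key=value line format):

```python
import tempfile, os

def rebuild_index(path):
    # the same scan _rebuild_index() performs: last offset per key wins
    index, offset = {}, 0
    with open(path, "rb") as f:
        for line in f:
            key, eq, _ = line.partition(b"=")
            if eq and key:
                index[key.decode("utf-8")] = offset
            offset += len(line)
    return index

# the durable state left over from the "crashed" process
path = os.path.join(tempfile.mkdtemp(), "restart.log")
with open(path, "wb") as f:
    f.write(b"name=dipti\nage=15\nage=17\n")

index = rebuild_index(path)
print(index)   # {'name': 0, 'age': 18} — the later age record wins
```

Note that the stale age=15 record at offset 11 is simply overwritten in the dict as the scan moves forward; no special-casing is needed.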
Benchmarks — the O(n) to O(1) cliff
Re-run the benchmark from chapter 2, this time on BitcaskLite:
# bench_bitcask.py
import os, time, random
from bitcasklite import BitcaskLite
from appendkv import AppendOnlyKV   # chapter 2

def bench(store_cls, path, n, reads=10_000):
    if os.path.exists(path):
        os.remove(path)
    db = store_cls(path)
    t0 = time.time()
    for i in range(n):
        db.put(f"key{i}", f"value{i}")
    w = n / (time.time() - t0)
    t0 = time.time()
    for _ in range(reads):
        db.get(f"key{random.randint(0, n - 1)}")
    r = reads / (time.time() - t0)
    db.close()
    return w, r

for n in (10_000, 100_000, 1_000_000):
    w1, r1 = bench(AppendOnlyKV, "a.log", n, reads=min(1000, n))
    w2, r2 = bench(BitcaskLite, "b.log", n, reads=10_000)
    print(f"n={n:>9,}  AppendOnlyKV writes {w1:>8,.0f}/s reads {r1:>8,.0f}/s  "
          f"BitcaskLite writes {w2:>8,.0f}/s reads {r2:>9,.0f}/s")
Typical output on a 2024-era laptop:
n=   10,000  AppendOnlyKV writes 820,000/s reads 4,100/s  BitcaskLite writes 790,000/s reads 280,000/s
n=  100,000  AppendOnlyKV writes 840,000/s reads   410/s  BitcaskLite writes 770,000/s reads 240,000/s
n=1,000,000  AppendOnlyKV writes 830,000/s reads    39/s  BitcaskLite writes 720,000/s reads 190,000/s
Look at the read column. On the million-record log, AppendOnlyKV manages 39 reads per second. BitcaskLite manages 190,000 — about five thousand times faster. And the BitcaskLite read rate barely drops as n grows; 280 k at ten thousand records, 190 k at a million. A log a hundred times bigger reads only about 30% slower, and that slowdown is all from the CPU cache getting colder and the OS having to fetch different disk pages. The algorithmic cost is still O(1).
The write column tells the other half of the story. BitcaskLite is about 10–15% slower on writes than AppendOnlyKV. That is the cost of the self._index[key] = offset line plus the tell() call — a handful of nanoseconds per write. You gave up almost nothing on the hot write path, and bought back four orders of magnitude on reads.
This is the first real-feeling payoff in the databases track. You have written code that is not merely interesting — it is fast. Your store can now serve a hundred and ninety thousand reads per second off a million-record log, on a laptop, with no tuning. That is already in the neighbourhood of what Redis does for in-process reads. Production Bitcask deployments at Riak-era scale routinely hit a hundred thousand ops per second per node; the core algorithm is what you just wrote.
A session walk-through — watching the index change
Trace through this sequence. After each put(), the log grows by one record and the offset table updates one entry. Stale records sit in the log unreferenced.
put("name", "dipti")       log: [name=dipti\n]                        offset=0
                           index: {name:0}
put("age", "15")           log: [name=dipti\n][age=15\n]              offset=11
                           index: {name:0, age:11}
put("city", "coimbatore")  log: [...][age=15\n][city=coimbatore\n]    offset=18
                           index: {name:0, age:11, city:18}
put("age", "16")           log: [...][city=coimbatore\n][age=16\n]    offset=34
                           index: {name:0, age:34, city:18}
                                           ^^ overwritten, 11 forgotten
put("age", "17")           log: [...][age=16\n][age=17\n]             offset=41
                           index: {name:0, age:41, city:18}
                                           ^^ overwritten again
put("city", "delhi")       log: [...][age=17\n][city=delhi\n]         offset=48
                           index: {name:0, age:41, city:48}

get("age")                 index["age"] = 41  → seek(41) → "age=17\n"      → return "17"
get("city")                index["city"] = 48 → seek(48) → "city=delhi\n"  → return "delhi"
get("name")                index["name"] = 0  → seek(0)  → "name=dipti\n"  → return "dipti"
Three things to notice. One: the log has six records and grew only forward. Two: the index has three entries, one per distinct key. Three: the stale records — the earlier age=15, age=16, city=coimbatore — still live in the log, uselessly, forever, until compaction arrives (chapter 8). They are dead bytes, and they are the price you pay for the simplicity of the design.
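The waste is measurable. Because the index records where every live record starts, you can total the live bytes and subtract from the file size — a sketch (the dead_bytes helper is illustrative, not part of BitcaskLite):

```python
import os, tempfile

def dead_bytes(path, index):
    # total the live records the index points at; everything else is dead
    live = 0
    with open(path, "rb") as f:
        for offset in index.values():
            f.seek(offset)
            live += len(f.readline())
    return os.path.getsize(path) - live

# the six-record session from the walk-through above
path = os.path.join(tempfile.mkdtemp(), "walkthrough.log")
with open(path, "wb") as f:
    f.write(b"name=dipti\nage=15\ncity=coimbatore\nage=16\nage=17\ncity=delhi\n")
index = {"name": 0, "age": 41, "city": 48}

dead = dead_bytes(path, index)
print(dead)   # 30 of the log's 59 bytes are already dead
```

Half the file is garbage after just six writes to three keys. A compactor uses exactly this kind of accounting to decide when a segment is worth rewriting.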
What you have just built
This is not a toy you made up. It is a real, named design, shipped by a real company to real production users. Its name is Bitcask.
Bitcask was designed by Justin Sheehy and David Smith at Basho Technologies in 2010, as the default storage backend for Riak, a distributed key-value database. The Bitcask paper is seven pages long, has no equations, and is one of the clearest systems-design documents you will ever read. Its abstract is essentially one sentence: "we keep an append-only log on disk and a hash index in memory." The rest of the paper explains the additions — segments, compaction, hint files, tombstones — that are the topics of the next five chapters of this Build.
What you have in your sixty-line BitcaskLite is the core of Bitcask with those additions stripped off. Close to the minimum thing that is still a useful database.
Common confusions
- "What happens on restart?" The _rebuild_index() method scans the whole log once and reconstructs the dict in memory. For a 1 GB log this takes a few seconds; for a 100 GB log, minutes. That startup cost is the reason hint files exist — a small companion file per segment that stores the index snapshot directly, so restart becomes loading a few hundred kilobytes per segment instead of scanning the whole data. Chapter 10 builds them. Until then, yes — restart is slow. Runtime is fast.
- "What if the index gets out of sync with the log?" It cannot, because the log is the only durable state. The index is rebuildable from the log at any time. If the process crashes with the index in memory half-updated, the crash also destroys the half-updated index. On restart, _rebuild_index() reads the committed log and produces a consistent table. The log commits first; the index update is "free" because it is ephemeral.
- "Why is the index value an integer offset, not a file pointer or an object?" Because an integer is tiny (8 bytes on a 64-bit machine), hashable, serialisable, and the same whether the file is open or closed. You can print it, send it over a network, store it in a hint file. Storing a file pointer or an object would couple the index to the current process's open files and not survive anything.
- "Why not just keep the value in the dict instead of the offset?" You would then need RAM proportional to the total data size, not just the key set size. For a 500 GB Bitcask store with 10 GB of keys, you need 10 GB of RAM — feasible. For the same store with values in the dict, you need 500 GB of RAM — not feasible. The offset-not-value choice is what lets Bitcask handle data sets much larger than memory, as long as the key set fits. When even the keys do not fit, you need a different design — Build 3 introduces LSM-trees.
- "I called put and get — is my data safe after a crash?" Only approximately. flush() is a Python-buffer flush, not an fsync. Records in BitcaskLite.put() are durable to the kernel but can still be lost in a power cut. The right place to add os.fsync(self._f.fileno()) is inside put(), but it costs a factor of 100 in write latency — chapter 3 of Build 1 is the full story. Real Bitcask exposes a sync_strategy config knob that lets the operator choose per-write, per-second, or never.
- "Is this thread-safe?" No. Two concurrent put() calls on the same file object will have racing tell() and write() operations, and while a single _index[key] = offset assignment is safe under the GIL, the tell-write-update sequence as a whole is not atomic. Real Bitcask serialises writes with a mutex; a single writer is typically enough for its workload profile. Concurrency is Build 5.
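If you need multiple writer threads before Build 5 arrives, the standard stopgap is one mutex around the whole append-and-update sequence. A sketch — the LockedStore wrapper is hypothetical, not how real Bitcask structures it:

```python
import threading

class LockedStore:
    """Serialise writers so tell()/write()/index-update never interleave."""
    def __init__(self, store):
        self._store = store            # any object exposing put()/get()
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:               # one writer at a time
            self._store.put(key, value)

    def get(self, key):
        # reads stay unlocked: each get() opens its own file handle, and a
        # single dict lookup is atomic under the GIL
        return self._store.get(key)
```

Because put() updates the index only after the record is flushed, an unlocked reader either sees the old offset (old value) or the new one (new value) — never a torn state.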
Going deeper
If you just wanted the core idea — "append-only log plus a RAM dict from keys to offsets equals a fast KV store" — you have it. The rest of this section fills in three things you will want when you read the Bitcask paper or the source of any real KV engine: the memory math, the family tree of designs, and what breaks when the index does not fit.
The memory math of Bitcask
The offset table has one entry per distinct key. Each entry stores the key itself (a Python string — roughly 70 bytes for a 20-character ASCII key, since CPython strings carry about 49 bytes of header plus one byte per character) and a byte offset (8 bytes of payload, boxed as a roughly 28-byte Python int object). Dict overhead in CPython adds roughly another 100 bytes per entry. Call it 200 bytes per key in pure Python, or 40 bytes in a C implementation like real Bitcask that packs the entry as a struct.
That means a Bitcask store can hold up to RAM / 40 keys in native code. A 64 GB server can keep about 1.6 billion keys indexed — comfortably more than the active working set of almost any product catalogue, user table, session store, or feature store you will build before needing to shard. The values can be arbitrarily large (gigabytes each, if you like), and Bitcask does not care; values live on disk and each is one seek + read away.
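You can spot-check the pure-Python side of these numbers with sys.getsizeof (exact values vary slightly across CPython versions):

```python
import sys

key = "user:000000000000042"   # a 20-character key
offset = 472                   # a typical byte offset

print(sys.getsizeof(key))      # ~69 bytes: ~49-byte str header + 1 byte per ASCII char
print(sys.getsizeof(offset))   # ~28 bytes for a boxed small int
```

Neither figure counts the dict's own hash-table slots, which is where the remaining ~100 bytes per entry come from.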
This is why Bitcask shipped with Riak. Riak was a dynamo-style distributed KV database, and at the time (2010) it was normal for any one node to hold a few hundred gigabytes of data and a few hundred million keys. That fit the constraint exactly. Bitcask is a design optimised for one specific point in the design space — and it is the best design for that point, which is why it is still shipped and still used fifteen years later.
The family tree of KV designs
Bitcask sits at one corner of a small design triangle every storage engineer learns.
- Bitcask: in-memory hash index + append-only log. O(1) reads, O(1) writes, RAM footprint = distinct keys. No range queries (keys are hashed, not sorted).
- LSM-tree: in-memory sorted buffer + append-only log + stacked on-disk sorted files, merged in the background. O(log n) reads (check each level), O(1) amortised writes, RAM footprint = recent keys only, but supports range queries. This is LevelDB, RocksDB, Cassandra.
- B+-tree: disk-resident sorted tree, updates in place under a WAL. O(log_B n) reads in terms of disk pages (e.g. 3–4 for a billion keys), same for writes, RAM footprint = just a few levels of the tree for caching. Every relational database uses one.
Bitcask is the one whose core you can write in sixty lines. LSM is three hundred lines with effort, B+-tree is several thousand even for a simple teaching version. They each land at a different point on the read-speed / write-speed / range-query / memory-usage axes, and real systems combine them — e.g. SQLite uses a B+-tree with a WAL under it; Cassandra uses LSM with memtables and Bloom filters; Kafka is basically one giant append-only log with offset indexes (a Bitcask cousin) baked into every topic partition.
Build 3 of this track implements the LSM-tree. Build 4 implements the B+-tree. Both will be easier because you have already internalised the log-plus-index pattern here.
What happens when the index won't fit in RAM
This is Bitcask's second wall, the one chapter 11 of this Build names. The RAM-resident index is the whole reason reads are fast. If you have more distinct keys than your memory can hold, the design breaks — you cannot get() for a key whose entry is not in the dict, because there is no fallback to "find it in the log without knowing the offset."
Two escape hatches exist, and both lead to different designs:
- Page the index to disk — keep hot index entries in RAM and the rest on disk in a B-tree. At that point you have built a B+-tree over the log. Many real databases do exactly this (InnoDB is essentially this pattern with heavier transactional machinery).
- Abandon the global hash and sort the data on disk — write on-disk files in sorted order so that "find the key" is a disk binary search, not a hash lookup. Reads become O(log n) instead of O(1), but you no longer need the global in-memory index. You have now invented LSM.
Bitcask chose neither; it just warned users to size their hardware so the key set fits. That is a legitimate engineering choice — many workloads really do have small key sets — and it is why Bitcask's code is short.
Durability of the index itself
The index is not stored on disk in BitcaskLite. It is derived state — every piece of information it contains is reproducible by scanning the log. That is a powerful property: you cannot corrupt the index, because you cannot lose anything you do not have; and your crash-recovery code is trivially the same as your startup code.
Real Bitcask snapshots the index to a hint file (one per segment) so startup is fast, but the hint file is a cache, not the source of truth. If a hint file is missing or corrupt, Bitcask rebuilds the index from the data segment and regenerates the hint. You lose time, not data. Chapter 10 of this Build implements this pattern from scratch.
This is a more general principle: in well-designed systems, the on-disk state is as minimal as possible, and everything else is derived. Postgres's shared buffers, Cassandra's Bloom filters and summary files, Kafka's consumer-group offsets stored inside a special topic — all of them are derived. The log is the floor; everything on top is cache.
Where this leads next
Your BitcaskLite is a real database. It is also a broken one — three new problems have appeared, each of which the next chapters of this Build tackle in sequence.
- Segment files and closing old segments — chapter 7: one log that grows forever is a nightmare for compaction and backups. Split the log into fixed-size segment files, close them as they fill up, open a new one.
- Compaction: reclaiming deleted and overwritten keys — chapter 8: all those stale age=15, age=16 records in the log are dead bytes. Periodically rewrite each segment keeping only live records, and delete the old segment.
- Tombstones and the delete problem — chapter 9: how do you delete a key from an append-only store? With another append — a special "tombstone" record that shadows the key. Its existence is why get() must understand negation.
- Hint files: rebuilding the index in milliseconds — chapter 10: sidestep the O(n) startup scan by saving the index snapshot to a small companion file per segment.
- Walls: dataset > RAM, no range queries — chapter 11: the two design pressures that Bitcask cannot answer, and that motivate moving to LSM-trees in Build 3.
Each of those is a concrete, self-contained engineering problem with a clean Python implementation. Together they take BitcaskLite from "real design" to "publishable design." By the end of this Build your code will have about 300 lines total, and will essentially be Bitcask.
References
- Justin Sheehy and David Smith, Bitcask: A Log-Structured Hash Table for Fast Key/Value Data (Basho Technologies, 2010) — the seven-page original. Required reading; the clearest systems paper of its decade. riak.com/assets/bitcask-intro.pdf.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 Storage and Retrieval — compares Bitcask, LSM-trees, and B-trees with the vocabulary every working engineer uses. dataintensive.net.
- Riak KV source, bitcask/src/bitcask.erl — the production Erlang implementation, under 2,000 lines, still a good read. github.com/basho/bitcask.
- Patrick O'Neil et al., The Log-Structured Merge-Tree (LSM-Tree) (Acta Informatica, 1996) — the paper behind LevelDB and RocksDB; Build 3's foundation. Springer.
- Rudolf Bayer and Edward McCreight, Organization and Maintenance of Large Ordered Indexes (Acta Informatica, 1972) — the original B-tree paper; Build 4's foundation. Springer.
- Jay Kreps, The Log: What every software engineer should know about real-time data's unifying abstraction (LinkedIn Engineering, 2013) — why the log is primary and every index is derived. engineering.linkedin.com.