In short
BitcaskLite from the last chapter has O(1) reads, but its log is one giant file that grows forever. That breaks every practical operation you would ever want to run against a database: you cannot compact it without rewriting the whole file, you cannot back it up incrementally, and a single bad byte anywhere in the middle corrupts the entire store. The fix is one of the oldest ideas in systems: chop the log into fixed-size pieces. Keep exactly one segment open for writing (the active segment). When it reaches a size threshold — say 64 MB — close it, fsync it, mark it read-only, and open a new active segment. The store now has a directory full of immutable closed segments plus one growing active one. The index changes shape by exactly nine characters: key → offset becomes key → (segment_id, offset). That small change unlocks everything the next three chapters need — compaction can rewrite one closed segment at a time, backups can copy closed segments without coordination with the writer, recovery only needs to scan the active segment for torn tails. This chapter builds BitcaskSegmented in about fifty lines, explains why segment size is a three-way trade, and walks through what happens when a crash lands in the middle of a rotation.
In the previous chapter you built BitcaskLite — an append-only log plus an in-memory hash index from key to byte offset. Reads jumped from O(n) to O(1). Writes stayed at a microsecond. A million records on a laptop, 190,000 reads per second. By any fair measure, it is a real database.
It is also doomed.
Open the benchmark output from last chapter and notice the file size column you ignored. A million records produced an 18 MB log. A hundred million would produce a 1.8 GB log. A year of writes on a moderately busy service — a session store, an IoT ingest, a feature store — produces a log that is hundreds of gigabytes, every byte of it a single inseparable file. Think about what you can actually do with such a file.
You cannot compact it without reading the whole thing and rewriting the whole thing in one blocking operation. You cannot back it up incrementally — rsync has to diff every byte to notice that only the last few kilobytes changed. You cannot split it across two disks. You cannot hand a frozen slice of it to a replica. A single-bit corruption at byte 42,000,000,000 makes the scanner throw up its hands somewhere around minute forty of recovery. One file, one crash blast radius, one everything.
The fix is to stop writing one file and start writing many. That is this chapter.
Why one file breaks everything
Think of the log the way an accountant thinks of a ledger. You never erase a ledger entry — that is the whole point of a ledger. But no accountant alive keeps one infinite roll of paper. They keep a volume for this year, a volume for last year, a volume for the year before. Volumes closed long ago sit on a shelf, immutable, labelled by date, available for audit but untouched by the day-to-day pen. The current volume is the only one that gets written in.
That is the idea. In storage-engine language, each volume is a segment file.
Before you write any code, notice what segmentation buys you, and what it costs.
What it buys. Compaction can operate on one closed segment (or a pair of them) at a time, in the background, without pausing writes — because writes always go to the active segment, which compaction never touches. Backups can snapshot the closed segments cheaply: they do not change, so a file-level copy is safe without any locking. A single corrupt byte now blasts only one segment, not the whole store. You can spread segments across disks, tiers, even cloud object storage. And recovery — the startup scan that rebuilds the index — only needs to look for torn tails in the active segment, because closed segments were fsync-ed at close time and have integrity proofs on every record.
What it costs. The in-memory index now has to remember which segment each key lives in, not just where in "the" file. A tiny change in data type: nine characters in the Python — a tuple instead of an integer. And the read path has to open the right segment file before it can seek. Also tiny: file descriptors are cheap, and a production implementation keeps a small LRU cache of open handles. The runtime cost is effectively zero.
That is the trade. The rest of this chapter makes it concrete.
The segmented design, in four rules
Four rules are all you need.
- Segments are numbered. A monotonically increasing integer per segment, written as a zero-padded prefix in the filename: segment-0000.log, segment-0001.log, and so on. The number is the segment's identity, reusable in the index, printable in logs.
- Exactly one segment is active. The one with the highest number. All writes go there. Every other segment is closed, meaning its file has been fsync-ed and its handle is open read-only (or not open at all).
- Rotate at a size threshold. After each put(), if the active segment is bigger than SEGMENT_SIZE, close it and open a new active segment. A typical threshold is 64 MB — big enough that rotation is rare, small enough that one segment is a workable unit of compaction.
- The index maps key → (segment_id, offset). Not just offset. This is the one data-structure change in the whole chapter, and it propagates into every piece of code the index touches — which turns out to be exactly two places: put() (writes the tuple) and get() (reads the tuple and uses segment_id to pick the file).
Put those together and you have BitcaskSegmented.
BitcaskSegmented — the code
Here is the full implementation, extending BitcaskLite with segmentation. About fifty lines of real logic, plus some bookkeeping.
# bitcask_segmented.py — append-only KV store with rotating segment files
import os, glob

SEGMENT_SIZE = 64 * 1024 * 1024  # 64 MB — rotate once the active segment exceeds this

class BitcaskSegmented:
    def __init__(self, data_dir):
        self.data_dir = data_dir
        os.makedirs(data_dir, exist_ok=True)
        # keydir: key (str) -> (segment_id: int, offset: int)
        self._index = {}
        # open file handles for every known segment, keyed by segment_id
        self._handles = {}
        # discover existing segments, load them, then open the active one
        self._load_existing_segments()
        self._open_active_segment()

    # ---- segment discovery and rotation ---------------------------------

    def _segment_path(self, seg_id):
        return os.path.join(self.data_dir, f"segment-{seg_id:04d}.log")

    def _load_existing_segments(self):
        # find every segment-NNNN.log on disk, in numerical order
        paths = sorted(glob.glob(os.path.join(self.data_dir, "segment-*.log")))
        for path in paths:
            seg_id = int(os.path.basename(path)[8:12])
            self._scan_segment(seg_id, path)
            self._handles[seg_id] = open(path, "rb")
        # the next active segment is one past the largest existing id, or 0
        self._active_id = (max(self._handles) + 1) if self._handles else 0

    def _scan_segment(self, seg_id, path):
        # rebuild the keydir entries from this segment's bytes
        offset = 0
        with open(path, "rb") as f:
            for line in f:
                key_bytes, eq, _ = line.partition(b"=")
                if eq and key_bytes:
                    self._index[key_bytes.decode("utf-8")] = (seg_id, offset)
                offset += len(line)

    def _open_active_segment(self):
        path = self._segment_path(self._active_id)
        # binary append for exact byte offsets
        self._active = open(path, "ab")
        # keep a read handle too, so get() can read from the active segment
        self._handles[self._active_id] = open(path, "rb")

    def _rotate_if_needed(self):
        if self._active.tell() < SEGMENT_SIZE:
            return
        # close the active segment: flush Python buffer, fsync, mark immutable
        self._active.flush()
        os.fsync(self._active.fileno())
        self._active.close()
        # open the next segment as the new active one
        self._active_id += 1
        self._open_active_segment()

    # ---- public API -----------------------------------------------------

    def put(self, key, value):
        record = f"{key}={value}\n".encode("utf-8")
        offset = self._active.tell()  # byte offset within the active segment
        self._active.write(record)
        self._active.flush()
        self._index[key] = (self._active_id, offset)  # the one-line structural change
        self._rotate_if_needed()

    def get(self, key):
        entry = self._index.get(key)
        if entry is None:
            return None
        seg_id, offset = entry
        f = self._handles[seg_id]  # pick the right segment
        f.seek(offset)
        line = f.readline()
        _, _, value = line.rstrip(b"\n").partition(b"=")
        return value.decode("utf-8")

    def close(self):
        self._active.flush()
        os.fsync(self._active.fileno())
        for f in self._handles.values():
            f.close()
        self._active.close()
Read through it once, and notice how little is different from BitcaskLite.
SEGMENT_SIZE = 64 * 1024 * 1024. The threshold at which the active segment rotates. 64 MB is a sensible default — real Bitcask defaults to 2 GB, LevelDB to 4 MB per SSTable, RocksDB tunes per level. The right number is a three-way trade covered below.
self._index[key] = (seg_id, offset) in put(). The only structural change. An integer becomes a tuple. Everything else about the index — it lives in RAM, it has one entry per live key, it is rebuilt from disk on startup — is unchanged.
_rotate_if_needed(). Called after every put(). Cheap: one integer comparison. Once in a blue moon — roughly every 64 MB of writes — it closes the active file and opens a new one. The rotation itself is three syscalls (flush, fsync, close) plus an open; a millisecond of latency for one write out of every few million. Unnoticeable in any workload.
_handles dict. A cache of open read handles, one per segment, indexed by segment id. This is why get() stays O(1): it never pays an open() syscall on the read path, because the file is already open when the read arrives. A real implementation would bound the cache with LRU eviction because open file descriptors are a kernel resource (the default soft limit is 1024 on most Linux systems), but for the hundreds-of-segments regime of a teaching Bitcask this is fine.
_load_existing_segments() at startup. Walks the data directory, finds every segment-NNNN.log, and scans them in order — lowest ID first. Because segments are scanned in order, a newer write for the same key overrides an older one naturally (the _index[key] = ... assignment at the end of _scan_segment happens after the earlier assignment from the previous segment, so the later wins). This is the same last-write-wins rule that made chapter 2's scanner work — applied across segments.
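The last-write-wins rule can be shown in isolation. This is a standalone sketch of the recovery logic, not the class itself: scan segments in ascending id order, and let plain dict assignment do the overriding. The record offsets here are illustrative.

```python
# Standalone sketch of the recovery rule: scan segments in ascending id
# order and let dict assignment implement last-write-wins.
records_by_segment = {
    0: [("age", 0), ("city", 8)],   # (key, offset) pairs found in segment-0000
    1: [("age", 0)],                # age was rewritten later, in segment-0001
}

index = {}
for seg_id in sorted(records_by_segment):        # lowest id first
    for key, offset in records_by_segment[seg_id]:
        index[key] = (seg_id, offset)            # later segments overwrite earlier

assert index == {"age": (1, 0), "city": (0, 8)}  # the newer age wins
```

No timestamps, no version numbers: scan order alone encodes recency, which is why monotonic segment ids matter.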
fsync on close and rotation. When a segment closes, we fsync it. That is the durability barrier. After _rotate_if_needed() returns, the closed segment is guaranteed to be on disk; everything later belongs to the new active one. If the process crashes, recovery only needs to consider torn tails in the single active segment — because every closed segment has passed through a completed fsync.
Sixty lines, counting comments. The core change from BitcaskLite is maybe fifteen lines of new code. The rest is moving BitcaskLite's single-file assumptions to a directory of files.
What the index now looks like
A picture is worth a lot here. Suppose over the life of the store you have written seven records across three segments. Closed segments: 0000 and 0001. Active: 0002. The index has one entry per distinct key, and each entry points to the latest version.
# after a sequence of puts that rotated twice
db._index == {
"name": (0, 0), # name=dipti in segment-0000 at byte 0
"city": (1, 41_000), # city=chennai in segment-0001 at byte 41,000
"age": (2, 18_400), # age=17 in segment-0002 at byte 18,400 — active segment
"score": (2, 22_100), # score=99 in segment-0002 at byte 22,100 — active segment
}
Four distinct keys, across three segments. name was written once and never updated, so its entry still points into the oldest segment. city was last updated during the middle segment's lifetime. age and score are recent. Each entry points to exactly one segment, and the segment-id tells you which file to open.
Every stale value — the age=15, age=16 records in segment-0001, an earlier city in segment-0000 — still sits in its segment file, unreferenced by the keydir, taking up bytes. That is dead data, and it will not shrink until compaction (chapter 8) arrives.
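The dead-byte arithmetic for an example like the one above is worth doing once by hand. This sketch uses illustrative record sizes (they are assumptions, not the exact bytes of the example): every overwrite leaves its predecessor on disk, so dead space is total bytes written minus the bytes of the latest version of each key.

```python
# Dead-byte arithmetic: each overwrite strands its predecessor on disk.
# Record sizes below are illustrative assumptions.
records = [                        # (key, size_bytes) in write order
    ("name", 11), ("city", 12), ("age", 7),
    ("age", 7), ("city", 13), ("age", 7), ("score", 9),
]

total = sum(size for _, size in records)   # every byte ever appended
live = {}
for key, size in records:
    live[key] = size                        # last write wins; older sizes die

dead = total - sum(live.values())
print(f"{total} B on disk, {sum(live.values())} B live, {dead} B dead")
```

Seven records, four live: everything else is reclaimable only by compaction.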
A rotation in slow motion
Watching the log roll over
Shrink SEGMENT_SIZE to 200 bytes so you can trigger a rotation in a handful of puts instead of half a million, and walk through the state after each write.
# tiny_rotation_demo.py
import bitcask_segmented as bs
bs.SEGMENT_SIZE = 200 # absurdly small — just for the demo
db = bs.BitcaskSegmented("demo-data")
db.put("name", "dipti") # 11 bytes -> segment-0000, size 11
db.put("age", "15") # 7 bytes -> segment-0000, size 18
db.put("city", "coimbatore") # 16 bytes -> segment-0000, size 34
db.put("score", "A" * 200) # 207 bytes -> segment-0000 grows to 241 B, rotation fires
db.put("year", "2026") # goes into segment-0001 at offset 0
Trace what the store looks like after each operation.
after put("name","dipti"):
active = segment-0000 (11 B)
index = {"name": (0, 0)}
after put("age","15"):
active = segment-0000 (18 B)
index = {"name": (0, 0), "age": (0, 11)}
after put("city","coimbatore"):
active = segment-0000 (34 B)
index = {"name": (0, 0), "age": (0, 11), "city": (0, 18)}
after put("score","AAA...A") which pushes active past 200 B:
1. record (207 B) is written into segment-0000 (size now 241 B)
2. _rotate_if_needed() detects 241 > 200
3. segment-0000 is flushed, fsynced, closed
4. segment-0001 is opened as the new active segment
active = segment-0001 (0 B)
index = {"name": (0,0), "age": (0,11), "city": (0,18), "score": (0, 34)}
^^^^ note: score points into segment-0000, not the new active one
after put("year","2026"):
active = segment-0001 (10 B)
index += {"year": (1, 0)} # first record in the new segment
get("score") -> index["score"] = (0, 34) -> open segment-0000,
seek(34), readline -> "score=AAA...A"
get("year") -> index["year"] = (1, 0) -> open segment-0001,
seek(0), readline -> "year=2026"
Three things to notice.
The rotation fires after the write that crossed the threshold. That is the correct order: the record belongs in the old segment, because that is where we wrote it and where the index entry points. Only the next write goes into the new segment. Flipping this order (rotating first, then writing) is a common bug — it leaves a zero-byte gap at the end of the old segment and an off-by-one in the index.
The keydir entry for score is (0, 34) — the id of the segment it was actually written to, not the id of the newly-opened active segment. The new segment is empty at that moment; its first tenant is year.
get() is segment-oblivious. It looks up the tuple, picks the right file handle, seeks, and reads. The same code path works regardless of whether the requested key lives in the newest segment or the oldest.
That walk-through is the whole behaviour. Reads do not care about rotations; writes trigger them at a predictable threshold; the keydir is the only thing that knows which segment each key is in.
Common confusions
- "Why not just mmap the whole log instead of segmenting?" mmap gives you a single contiguous address-space view of a file, which is convenient, but it does not solve any of the problems segmentation solves. You still have one giant file with one giant blast radius on corruption; you still cannot rewrite parts of it while writes continue; backups still have to diff the whole thing; a 100 GB store still needs 100 GB of virtual address space (not a problem on 64-bit, but it is ugly). Worse, mmap on a log is subtly dangerous because dirty pages flush on the kernel's schedule, not yours, and you lose the fine-grained fsync control that makes rotation a reliable durability boundary. Every production log-structured store (LevelDB, RocksDB, Kafka, Bitcask, Postgres WAL) segments its log. mmap is a page-layout tool, not a log-lifecycle tool.
- "When is it safe to delete a closed segment?" Only when no key in the keydir points into it. If you delete a segment that has even one live pointer, the next get() for that key will fail to open the file. The usual sequence is: compaction (chapter 8) reads a closed segment, writes the live records into a fresh segment, updates the keydir to point at the new segment, and only then deletes the old one. Never delete a closed segment by hand; let the compactor do it, because the compactor is the one that can prove no live pointers remain.
- "What if a record is bigger than SEGMENT_SIZE?" The simplest answer — the one this code adopts — is to let it be: the active segment is allowed to grow past SEGMENT_SIZE for the duration of one record, and rotation fires immediately after that write completes. That means in practice the true maximum segment size is SEGMENT_SIZE + max_record_size. For 64 MB segments and multi-megabyte values, that overshoot is a few percent. Real Bitcask rejects records larger than the segment size entirely; LSM engines handle oversized values with a separate blob file per oversized record. Choose your SEGMENT_SIZE comfortably larger than any reasonable value.
- "Does rotation block writes?" In this simple implementation, yes, for roughly one fsync — a few milliseconds on a spinning disk, a hundred microseconds on SSD. Production implementations do the fsync off the write path in a background thread, and the rotation on the write path is just "swap file handles" — sub-microsecond. For a teaching database that writes a few thousand times a second, the blocking fsync is fine.
- "Won't I run out of file descriptors?" Yes, eventually. A store with 1,000 segments has 1,000 open read handles. Linux's default soft limit is 1,024 file descriptors per process, so you crash at about that many segments. The fix is an LRU cache: keep the N most-recently-used read handles open, close the rest, reopen on demand. Real Bitcask uses a cache of 64 handles by default. The keydir itself does not care; it keeps segment IDs, and open/close is the file system's problem.
- "Why a global segment ID instead of a UUID or a timestamp?" Because monotonic integers sort naturally (so sorted(glob(...)) gives correct chronological order), are compact in the keydir (one Python int per entry vs a 16-byte UUID), and are trivially reproducible on startup. Timestamps are fine in principle but tie segment identity to clock correctness, which is a mess nobody needs.
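The reject-oversized-records policy mentioned above can be sketched as a guard on the write path. The function and exception names here are hypothetical, not part of the chapter's class; this is one way to bound the true maximum segment size at exactly SEGMENT_SIZE.

```python
# Hypothetical guard on the write path: refuse records larger than the
# segment threshold, the policy real Bitcask adopts. With this check the
# active segment can never overshoot by more than one sub-threshold record.
SEGMENT_SIZE = 64 * 1024 * 1024  # 64 MB, matching the chapter's default

class RecordTooLarge(ValueError):
    pass

def check_record_size(key: str, value: str) -> bytes:
    record = f"{key}={value}\n".encode("utf-8")
    if len(record) > SEGMENT_SIZE:
        raise RecordTooLarge(f"record is {len(record)} B, limit is {SEGMENT_SIZE} B")
    return record

assert check_record_size("age", "17") == b"age=17\n"
```

A put() that calls this before writing turns the silent overshoot into an explicit error the caller can handle.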
Going deeper
If you just wanted "break the log into segments, keep one active, update the keydir to name its segment" — you have it. The rest of this section covers the engineering knobs you will meet the moment you stop reading the chapter and start tuning the store.
The three-way trade of segment size
SEGMENT_SIZE is not a free parameter. Its value pushes on three things.
- Small segments (say 1 MB) mean rotations are frequent — every second or so under heavy load. That is fine for rotation latency (rotation is cheap) but catastrophic for two downstream systems: compaction has to merge thousands of tiny files to reclaim dead space, and the directory bloats to tens of thousands of files, which most filesystems handle but none handle cheerfully. File descriptor pressure spikes. Startup scan time grows, because scanning a segment has some per-file overhead (the open/seek/stat cost is amortised away for big segments, not for small ones).
- Medium segments (tens to hundreds of MB) are the usual sweet spot. Rotations happen every few minutes under typical load. Compaction has a manageable number of files to juggle. The directory listing is readable by humans. Real Bitcask defaults to 2 GB, which is towards the large end; LevelDB's SSTables default to 2–4 MB but are layered (so you have many levels, each with different sizes). 64 MB is a fine teaching default because it stresses none of the limits on a laptop.
- Large segments (gigabytes) mean fewer files on disk — attractive for backups and filesystem metadata. But they reduce compaction granularity: the smallest unit of "rewrite, discarding dead bytes" is one segment, so a huge segment cannot be partially compacted. They also worsen the crash blast radius: a single bad byte takes out a bigger chunk of the log. And they delay durability: a record written to the active segment is only guaranteed on disk when either (a) the active segment is fsync-ed on a timer, or (b) rotation closes it. Bigger segments mean rarer rotations, and under a naive configuration that means a longer unsynced window.
The right choice depends on your write rate, your expected working-set size, your filesystem's tolerance for directories with many files, and your compaction strategy. Bitcask's 2 GB default is tuned for Riak's expected node footprint (hundreds of GB per node); a small embedded KV store would pick 16 or 32 MB; Kafka's log segments are 1 GB by default. No one default is right; the knob exists so you can pick per deployment.
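Back-of-envelope arithmetic makes the trade concrete. The write rate and record size below are illustrative assumptions, not numbers from the chapter's benchmarks; the point is how quickly segment size translates into rotation frequency and file count.

```python
# Back-of-envelope: how often rotation fires and how many segments a day
# of writes produces. Write rate and record size are illustrative.
writes_per_sec = 5_000
avg_record_bytes = 100
segment_bytes = 64 * 1024 * 1024        # the chapter's 64 MB default

bytes_per_sec = writes_per_sec * avg_record_bytes   # 500 KB/s ingest
secs_per_rotation = segment_bytes / bytes_per_sec   # ~134 s between rotations
segments_per_day = 86_400 / secs_per_rotation       # ~644 new files per day

print(f"rotation every {secs_per_rotation:.0f} s, "
      f"{segments_per_day:.0f} segments/day")
```

Swap in 1 MB segments and the same workload rotates every two seconds and produces over forty thousand files a day, which is exactly the small-segment pathology described above.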
File-descriptor budgets
A store with 100,000 keys spread across 1,000 closed segments needs at most 1,000 open read handles, plus one active write handle, plus a few for the process itself (stdin/stdout/stderr, a config file, logging). That is inside Linux's default 1,024 soft limit, but only barely.
Production Bitcask ships with an LRU cache of read handles. The default is 64 — a number chosen because most read workloads are Zipfian, with a hot set of segments receiving the overwhelming majority of accesses, and the tail of cold segments rarely getting reopened. When a get() arrives for a segment whose handle is not cached, you open() it, read, and either keep the handle (evicting the LRU victim) or close it immediately. The sophistication is small; the operational consequence is huge — you can run a store with a million segments on a default-configured Linux box without ever tripping EMFILE.
Our teaching implementation keeps every handle open. That is fine for hundreds of segments; do not ship it.
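An LRU handle cache is small enough to sketch in full. The class and method names here are illustrative, not the chapter's code or Bitcask's; the shape is an OrderedDict used as an LRU, with reopen-on-demand for cache misses.

```python
# Minimal LRU cache of segment read handles, sketched with OrderedDict.
# Names are illustrative; real Bitcask's cache is structured differently.
import os
from collections import OrderedDict

class HandleCache:
    def __init__(self, path_for, max_open=64):   # 64 mirrors Bitcask's default
        self._path_for = path_for                # callable: seg_id -> file path
        self._max_open = max_open
        self._open = OrderedDict()               # seg_id -> file, LRU order

    def get(self, seg_id):
        if seg_id in self._open:
            self._open.move_to_end(seg_id)       # mark most-recently-used
            return self._open[seg_id]
        f = open(self._path_for(seg_id), "rb")   # cache miss: reopen on demand
        self._open[seg_id] = f
        if len(self._open) > self._max_open:
            _, victim = self._open.popitem(last=False)  # evict the LRU handle
            victim.close()
        return f

# Tiny demo: with max_open=2, touching a third segment evicts the oldest.
import tempfile
tmp = tempfile.mkdtemp()
def _path(i):
    return os.path.join(tmp, f"segment-{i:04d}.log")
for i in range(3):
    with open(_path(i), "wb") as f:
        f.write(b"stub")

cache = HandleCache(_path, max_open=2)
cache.get(0); cache.get(1); cache.get(2)         # segment 0 gets evicted
assert 0 not in cache._open and len(cache._open) == 2
```

Dropping this in place of the chapter's unbounded _handles dict would mean get() calls cache.get(seg_id) instead of indexing the dict directly; nothing else changes.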
Crashing in the middle of a rotation
What happens if the process dies between "wrote record to segment-N, it pushed past SEGMENT_SIZE" and "fsync segment-N, close it, open segment-N+1"? Walk through it.
- Case 1: the record is already in the kernel page cache but not fsynced. On a process crash (not a machine crash), the kernel keeps the page cache; it will flush on its own schedule. On a machine crash, the record may be lost. Either way, _load_existing_segments() scans segment-N on startup, sees whatever was durable at the crash moment, and rebuilds the keydir from that. If the record was lost, the keydir does not contain it; that is the correct recovery for a write that never completed its durability contract.
- Case 2: the record is fsynced but segment-N+1 has not yet been created. Startup sees only segment-N, scans it, and sets _active_id = N+1 (one past the largest existing id). Segment-N, now past SEGMENT_SIZE, is simply treated as closed, and writing resumes in N+1. The store is consistent.
- Case 3: segment-N+1 has been created (zero bytes) but nothing has been written to it. Startup finds N+1 as an empty file, scans it (finding nothing), and — because this chapter's code always starts one past the largest existing id — opens N+2 as the new active segment, leaving an empty N+1 on disk. Harmless, if slightly untidy; a fussier implementation would reuse the empty file. Either way, correct behaviour.
- Case 4: segment-N+1 has been created and a few bytes have been written. Those bytes are either a complete record or a torn one. The text-framing scanner ignores torn records (the line has no =); the eventual CRC-framed scanner rejects them by checksum. Either way, recovery is clean.
None of these cases corrupt data that was previously committed. The worst case is losing the most recent write, which is the same guarantee an un-fsync-ed append-only log gives you. Rotation does not add any new failure modes; it just adds places where a crash can land. The recovery logic (scan every segment in order, rebuild keydir) handles all of them uniformly.
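Torn-tail recovery is easy to simulate without crashing anything. This standalone sketch replays the scanner's logic over a byte string that ends mid-record; the endswith check here is slightly stricter than the chapter's scanner, which would also accept a torn line that happens to contain an "=".

```python
# Simulating a torn tail: two complete records, then a crash mid-way
# through writing a third. Rebuild an index the way _scan_segment does.
import io

data = b"name=dipti\nage=15\nci"   # crash landed while writing "city=..."

index = {}
offset = 0
for line in io.BytesIO(data):
    key, eq, _ = line.partition(b"=")
    # Index only complete, parseable records. The newline check is an
    # extra safeguard beyond the chapter's scanner.
    if eq and key and line.endswith(b"\n"):
        index[key.decode("utf-8")] = offset
    offset += len(line)

assert index == {"name": 0, "age": 11}   # the torn tail produced no entry
```

The write that was in flight at the crash simply never happened, from the store's point of view, which is exactly the guarantee an un-fsync-ed append was making all along.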
How real Bitcask does it
The Riak Bitcask implementation is about 2,000 lines of Erlang and differs from ours in three interesting ways.
- CRC-framed records, not text. Every record carries a CRC32 and a length prefix. Torn tails and bit flips are detected and discarded at scan time. Chapter 6 of this Build covered the framing format; a production segment file is a sequence of such framed records, not a sequence of text lines.
- Hint files per segment. For every closed segment segment-NNNN.log, Bitcask writes a companion segment-NNNN.hint containing a compact serialisation of every live key's (key, segment_id, offset, size, timestamp) record. On startup, reading the hint file is 10× to 100× faster than scanning the data segment, because hint files are much smaller (no values, only metadata). Chapter 10 of this Build builds them. Our teaching code skips hints and scans every segment fully on startup, which is fine up to a few gigabytes of total data but painful beyond that.
- Single-writer, many-reader concurrency. Only one process (or one thread) writes to the active segment at a time, protected by a file lock. Many readers can read any segment in parallel because closed segments are immutable and the active segment's append-only writes do not move existing bytes. That concurrency model — a single-writer lock, lock-free readers — is Build 5 territory. For now our teaching code is single-threaded by construction.
The core idea is the same as ours, though: segments are numbered, exactly one is active, rotation closes the active segment, the keydir names the segment. Bitcask and our store are rearranging the same pieces.
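To make the CRC-framing point concrete, here is a generic framed-record sketch using struct and zlib.crc32. The field layout (CRC, key length, value length, key, value) is an illustrative assumption, not Bitcask's wire format or necessarily chapter 6's exact layout.

```python
# A generic CRC-framed record: [crc32][key_len][val_len][key][value].
# Layout is illustrative, not Bitcask's actual format.
import struct, zlib

def frame(key: bytes, value: bytes) -> bytes:
    body = struct.pack(">II", len(key), len(value)) + key + value
    return struct.pack(">I", zlib.crc32(body)) + body

def unframe(buf: bytes):
    crc, = struct.unpack_from(">I", buf, 0)
    body = buf[4:]
    if zlib.crc32(body) != crc:
        return None                       # torn or corrupt: discard at scan time
    klen, vlen = struct.unpack_from(">II", body, 0)
    return body[8:8 + klen], body[8 + klen:8 + klen + vlen]

rec = frame(b"age", b"17")
assert unframe(rec) == (b"age", b"17")
assert unframe(rec[:-1] + b"\x00") is None   # a single corrupted byte is caught
```

Compared with the text framing used in this chapter, the checksum turns "a torn line happens to lack an =" from a lucky accident into a guarantee.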
Why segmentation is a prerequisite for nearly every later feature
Before leaving this chapter, look forward. Every subsequent feature in this Build — and every storage-engine feature you will build in the rest of your career — assumes a segmented log:
- Compaction (chapter 8) needs a unit of rewrite. That unit is one segment.
- Tombstones (chapter 9) need a place for "delete" records to live until compaction reaps them. They go in the active segment like any other record.
- Hint files (chapter 10) are companion files, one per closed segment.
- Backups and replication are segment-by-segment operations; closed segments can be shipped without any coordination with the writer.
- Tiered storage — keeping hot segments on SSD and cold segments on object storage — is trivial once segments exist, impossible without them.
Segmentation looks like a simple bookkeeping change and it is. But it is the bookkeeping change that makes every other idea in a log-structured database implementable. One file equals no options; many files equals every option. Once your store segments its log, the rest of storage engineering opens up in front of you.
Where this leads next
- Compaction: reclaiming deleted and overwritten keys — chapter 8: the most important operation in the life of a segmented log. Read a closed segment, keep only records whose keys still point to them in the keydir, write the survivors into a new segment, delete the old one. Dead bytes become free space; the store's disk footprint approaches the live-set size.
- Tombstones and the delete problem — chapter 9: how to delete a key from an append-only store. A "delete" is really a write of a special tombstone record that shadows the key until the segment containing the last live value has been compacted away. Why tombstones cannot be deleted the moment they appear.
- Hint files: rebuilding the index in milliseconds — chapter 10: the missing piece of fast startup. A small companion file per closed segment that stores the keydir entries for that segment. Startup becomes reading a few megabytes of hint files instead of scanning every byte of every data segment.
By the end of chapter 10, BitcaskSegmented will be a working, compacting, fast-restarting Bitcask. Then Build 3 opens the second wall — range queries and working-sets larger than RAM — with LSM-trees.
References
- Justin Sheehy and David Smith, Bitcask: A Log-Structured Hash Table for Fast Key/Value Data (Basho Technologies, 2010) — the original seven-page paper. Section on segment files and merging is the blueprint this chapter follows. riak.com/assets/bitcask-intro.pdf.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 Storage and Retrieval — the section on "hash indexes" covers segmented logs with the same vocabulary used here. dataintensive.net.
- Basho, bitcask/src/bitcask_fileops.erl — the production segment-rotation implementation in Erlang, about 600 lines, surprisingly readable. github.com/basho/bitcask.
- Jay Kreps, The Log: What every software engineer should know about real-time data's unifying abstraction (LinkedIn Engineering, 2013) — why segmenting the log is a first-class design decision, not a bookkeeping detail. engineering.linkedin.com.
- Apache Kafka, Log Segments in the documentation — Kafka's segmentation strategy for partition logs, with tunable segment.bytes and segment.ms. kafka.apache.org/documentation/#log.
- Sanjay Ghemawat and Jeff Dean, LevelDB Implementation Notes — LevelDB's SSTable design, which is a sorted-segment cousin of Bitcask's unsorted segments; a good contrast read. github.com/google/leveldb/blob/main/doc/impl.md.