In short

BitcaskLite from the last chapter has O(1) reads, but its log is one giant file that grows forever. That breaks every practical operation you would ever want to run against a database: you cannot compact it without rewriting the whole file, you cannot back it up incrementally, and a single bad byte anywhere in the middle corrupts the entire store. The fix is one of the oldest ideas in systems: chop the log into fixed-size pieces. Keep exactly one segment open for writing (the active segment). When it reaches a size threshold — say 64 MB — close it, fsync it, mark it read-only, and open a new active segment. The store now has a directory full of immutable closed segments plus one growing active one. The index changes shape by exactly nine characters: key → offset becomes key → (segment_id, offset). That small change unlocks everything the next three chapters need — compaction can rewrite one closed segment at a time, backups can copy closed segments without coordination with the writer, recovery only needs to scan the active segment for torn tails. This chapter builds BitcaskSegmented in about fifty lines, explains why segment size is a three-way trade, and walks through what happens when a crash lands in the middle of a rotation.

In the previous chapter you built BitcaskLite — an append-only log plus an in-memory hash index from key to byte offset. Reads jumped from O(n) to O(1). Writes stayed at a microsecond. A million records on a laptop, 190,000 reads per second. By any fair measure, it is a real database.

It is also doomed.

Open the benchmark output from last chapter and notice the file size column you ignored. A million records produced an 18 MB log. A hundred million would produce a 1.8 GB log. A year of writes on a moderately busy service — a session store, an IoT ingest, a feature store — produces a log that is hundreds of gigabytes, every byte of it a single inseparable file. Think about what you can actually do with such a file.

You cannot compact it without reading the whole thing and rewriting the whole thing in one blocking operation. You cannot back it up incrementally — rsync has to diff every byte to notice that only the last few kilobytes changed. You cannot split it across two disks. You cannot hand a frozen slice of it to a replica. A single-bit corruption at byte 42,000,000,000 makes the scanner throw up its hands somewhere around minute forty of recovery. One file, one crash blast radius, one everything.

The fix is to stop writing one file and start writing many. That is this chapter.

Why one file breaks everything

Think of the log the way an accountant thinks of a ledger. You never erase a ledger entry — that is the whole point of a ledger. But no accountant alive keeps one infinite roll of paper. They keep a volume for this year, a volume for last year, a volume for the year before. Volumes closed long ago sit on a shelf, immutable, labelled by date, available for audit but untouched by the day-to-day pen. The current volume is the only one that gets written in.

That is the idea. In storage-engine language, each volume is a segment file.

[Figure: one file versus many segments. Before: a single bitcask.log grows rightward without bound, and every read, write, compaction, and backup touches the whole file. After: segment-0000 through segment-0002 sit closed and read-only while segment-0003 is active and accepts appends; compaction rewrites one closed segment at a time and the writer never stops.]
The disk layout before and after segmentation. The old design lumps everything into one file that all operations must touch. The new design has one active segment (where all writes go) and a growing pile of closed segments, each of which is a frozen, read-only chunk of the log — a proper unit of work for compaction, backup, and replication.

Before you write any code, notice what segmentation buys you, and what it costs.

What it buys. Compaction can operate on one closed segment (or a pair of them) at a time, in the background, without pausing writes — because writes always go to the active segment, which compaction never touches. Backups can snapshot the closed segments cheaply: they do not change, so a file-level copy is safe without any locking. A single corrupt byte now blasts only one segment, not the whole store. You can spread segments across disks, tiers, even cloud object storage. And recovery — the startup scan that rebuilds the index — only needs to look for torn tails in the active segment, because closed segments were fsync-ed at close time and have integrity proofs on every record.

What it costs. The in-memory index now has to remember which segment each key lives in, not just where in "the" file. A tiny change in data type: nine characters in the Python — a tuple instead of an integer. And the read path has to open the right segment file before it can seek. Also tiny: file descriptors are cheap, and a production implementation keeps a small LRU cache of open handles. The runtime cost is effectively zero.

That is the trade. The rest of this chapter makes it concrete.

The segmented design, in four rules

Four rules are all you need.

  1. Segments are numbered. A monotonically increasing integer per segment, written as a zero-padded prefix in the filename: segment-0000.log, segment-0001.log, and so on. The number is the segment's identity, reusable in the index, printable in logs.
  2. Exactly one segment is active. The one with the highest number. All writes go there. Every other segment is closed, meaning its file has been fsync-ed and its handle is open read-only (or not open at all).
  3. Rotate at a size threshold. After each put(), if the active segment is bigger than SEGMENT_SIZE, close it and open a new active segment. A typical threshold is 64 MB — big enough that rotation is rare, small enough that one segment is a workable unit of compaction.
  4. The index maps key → (segment_id, offset). Not just offset. This is the one data-structure change in the whole chapter, and it propagates into every piece of code the index touches — which turns out to be exactly two places: put() (writes the tuple) and get() (reads the tuple and uses segment_id to pick the file).

Put those together and you have BitcaskSegmented.

[Figure: on-disk layout of a segmented store. Left, the bitcask-data/ directory: segment-0000.log (64 MB, closed), segment-0001.log (64 MB, closed), segment-0002.log (23 MB, active). Right, the in-memory keydir with entries such as "name" → (seg=0, off=0), "age" → (seg=2, off=18400), "city" → (seg=1, off=41000), "score" → (seg=2, off=22100), each pointing into its segment. Below, a timeline marks the rotations that closed seg-0000 and seg-0001.]
The full picture. On disk: closed segments of roughly equal size plus one active segment accepting writes. In RAM: one entry per live key, pointing into exactly one segment at a known byte offset. Every read is one dict lookup, one file open (cached), one seek, one readline.

BitcaskSegmented — the code

Here is the full implementation, extending BitcaskLite with segmentation. About fifty lines of real logic, plus some bookkeeping.

# bitcask_segmented.py — append-only KV store with rotating segment files
import os, glob

SEGMENT_SIZE = 64 * 1024 * 1024   # 64 MB — rotate once the active segment exceeds this

class BitcaskSegmented:
    def __init__(self, data_dir):
        self.data_dir = data_dir
        os.makedirs(data_dir, exist_ok=True)
        # keydir: key (str) -> (segment_id: int, offset: int)
        self._index = {}
        # open file handles for every known segment, keyed by segment_id
        self._handles = {}
        # discover existing segments, load them, then open the active one
        self._load_existing_segments()
        self._open_active_segment()

    # ---- segment discovery and rotation ---------------------------------

    def _segment_path(self, seg_id):
        return os.path.join(self.data_dir, f"segment-{seg_id:04d}.log")

    def _load_existing_segments(self):
        # find every segment-NNNN.log on disk, in numerical order
        paths = sorted(glob.glob(os.path.join(self.data_dir, "segment-*.log")))
        for path in paths:
            seg_id = int(os.path.basename(path)[8:12])
            self._scan_segment(seg_id, path)
            self._handles[seg_id] = open(path, "rb")
        # the next active segment is one past the largest existing id, or 0
        self._active_id = (max(self._handles) + 1) if self._handles else 0

    def _scan_segment(self, seg_id, path):
        # rebuild the keydir entries from this segment's bytes
        offset = 0
        with open(path, "rb") as f:
            for line in f:
                key_bytes, eq, _ = line.partition(b"=")
                if eq and key_bytes:
                    self._index[key_bytes.decode("utf-8")] = (seg_id, offset)
                offset += len(line)

    def _open_active_segment(self):
        path = self._segment_path(self._active_id)
        # binary append for exact byte offsets
        self._active = open(path, "ab")
        # keep a read handle too, so get() can read from the active segment
        self._handles[self._active_id] = open(path, "rb")

    def _rotate_if_needed(self):
        if self._active.tell() < SEGMENT_SIZE:
            return
        # close the active segment: flush Python buffer, fsync, mark immutable
        self._active.flush()
        os.fsync(self._active.fileno())
        self._active.close()
        # open the next segment as the new active one
        self._active_id += 1
        self._open_active_segment()

    # ---- public API -----------------------------------------------------

    def put(self, key, value):
        record = f"{key}={value}\n".encode("utf-8")
        offset = self._active.tell()                 # byte offset within the active segment
        self._active.write(record)
        self._active.flush()
        self._index[key] = (self._active_id, offset) # the one-line structural change
        self._rotate_if_needed()

    def get(self, key):
        entry = self._index.get(key)
        if entry is None:
            return None
        seg_id, offset = entry
        f = self._handles[seg_id]                    # pick the right segment
        f.seek(offset)
        line = f.readline()
        _, _, value = line.rstrip(b"\n").partition(b"=")
        return value.decode("utf-8")

    def close(self):
        self._active.flush()
        os.fsync(self._active.fileno())
        for f in self._handles.values():
            f.close()
        self._active.close()

Read through it once, and notice how little is different from BitcaskLite.

SEGMENT_SIZE = 64 * 1024 * 1024. The threshold at which the active segment rotates. 64 MB is a sensible default — real Bitcask defaults to 2 GB, LevelDB to 4 MB per SSTable, RocksDB tunes per level. The right number is a three-way trade covered below.

self._index[key] = (seg_id, offset) in put(). The only structural change. An integer becomes a tuple. Everything else about the index — it lives in RAM, it has one entry per live key, it is rebuilt from disk on startup — is unchanged.

_rotate_if_needed(). Called after every put(). Cheap: one integer comparison. Once in a blue moon — roughly every 64 MB of writes — it closes the active file and opens a new one. The rotation itself is three syscalls (flush, fsync, close) plus an open; a millisecond of latency for one write out of every few million. Unnoticeable in any workload.

_handles dict. A cache of open read handles, one per segment, indexed by segment id. This is what keeps get() off the open() syscall: the file is already open when the read arrives. A real implementation would bound the cache with LRU eviction because open file descriptors are a kernel resource (the default soft limit is 1024 on most Linux systems), but for the hundreds-of-segments regime of a teaching Bitcask this is fine.

_load_existing_segments() at startup. Walks the data directory, finds every segment-NNNN.log, and scans them in order — lowest ID first. Because segments are scanned in order, a newer write for the same key overrides an older one naturally (the _index[key] = ... assignment at the end of _scan_segment happens after the earlier assignment from the previous segment, so the later wins). This is the same last-write-wins rule that made chapter 2's scanner work — applied across segments.
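The cross-segment override is easy to see in isolation. Here is a small self-contained sketch of the same scan logic run over raw bytes instead of files (rebuild_index and the sample segments are illustrative, not part of the chapter's class):

```python
def rebuild_index(segments):
    # segments: list of (seg_id, raw_bytes) in ascending seg_id order,
    # the same order _load_existing_segments visits files in
    index = {}
    for seg_id, data in segments:
        offset = 0
        for line in data.splitlines(keepends=True):
            key, eq, _ = line.partition(b"=")
            if eq and key:
                # plain dict assignment: a later segment's entry
                # silently replaces an earlier one, so last write wins
                index[key.decode("utf-8")] = (seg_id, offset)
            offset += len(line)
    return index

# "age" was written in segment 0, then updated in segment 1
segs = [(0, b"age=15\ncity=goa\n"), (1, b"age=16\n")]
rebuild_index(segs)   # -> {"age": (1, 0), "city": (0, 7)}
```

Plain dict assignment is doing all the work: because segments are visited oldest-first, the newest occurrence of a key is simply the last one assigned.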

fsync on close and rotation. When a segment closes, we fsync it. That is the durability barrier. After _rotate_if_needed() returns, the closed segment is guaranteed to be on disk; everything later belongs to the new active one. If the process crashes, recovery only needs to consider torn tails in the single active segment — because every closed segment has passed through a completed fsync.

Sixty lines, counting comments. The core change from BitcaskLite is maybe fifteen lines of new code. The rest is moving BitcaskLite's single-file assumptions to a directory of files.

What the index now looks like

A picture is worth a lot here. Suppose over the life of the store you have written seven records across three segments. Closed segments: 0000 and 0001. Active: 0002. The index has one entry per distinct key, and each entry points to the latest version.

# after a sequence of puts that rotated twice
db._index == {
    "name":  (0, 0),       # name=dipti in segment-0000 at byte 0
    "city":  (1, 41_000),  # city=chennai in segment-0001 at byte 41,000
    "age":   (2, 18_400),  # age=17 in segment-0002 at byte 18,400 — active segment
    "score": (2, 22_100),  # score=99 in segment-0002 at byte 22,100 — active segment
}

Four distinct keys, across three segments. name was written once and never updated, so its entry still points into the oldest segment. city was last updated during the middle segment's lifetime. age and score are recent. Each entry points to exactly one segment, and the segment-id tells you which file to open.

Every stale value — the age=15, age=16 records in segment-0001, an earlier city in segment-0000 — still sits in its segment file, unreferenced by the keydir, taking up bytes. That is dead data, and it will not shrink until compaction (chapter 8) arrives.
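That dead data is already measurable with nothing but the keydir. A sketch under the chapter's record format; dead_ratio is a hypothetical helper, not part of BitcaskSegmented:

```python
def dead_ratio(path, index, seg_id):
    # walk one segment's records; a record is live only if the keydir
    # still points at exactly this (seg_id, offset) pair, otherwise it
    # is a stale version waiting for compaction
    live = dead = offset = 0
    with open(path, "rb") as f:
        for line in f:
            key = line.partition(b"=")[0].decode("utf-8")
            if index.get(key) == (seg_id, offset):
                live += len(line)
            else:
                dead += len(line)
            offset += len(line)
    total = live + dead
    return dead / total if total else 0.0
```

Compaction in chapter 8 can use exactly this number to decide which closed segment is worth rewriting first.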

A rotation in slow motion

Watching the log roll over

Shrink SEGMENT_SIZE to 200 bytes so you can trigger a rotation in ten puts instead of half a million, and walk through the state after each write.

# tiny_rotation_demo.py
import bitcask_segmented as bs
bs.SEGMENT_SIZE = 200   # absurdly small — just for the demo

db = bs.BitcaskSegmented("demo-data")
db.put("name", "dipti")          # 11 bytes -> segment-0000, size 11
db.put("age", "15")              # 7 bytes  -> segment-0000, size 18
db.put("city", "coimbatore")     # 16 bytes -> segment-0000, size 34
db.put("score", "A" * 200)       # 207 bytes -> crosses the threshold, rotation fires
db.put("year", "2026")           # 10 bytes -> goes into segment-0001 at offset 0

Trace what the store looks like after each operation.

after put("name","dipti"):
  active = segment-0000 (11 B)
  index = {"name": (0,  0)}

after put("age","15"):
  active = segment-0000 (18 B)
  index = {"name": (0, 0), "age": (0, 11)}

after put("city","coimbatore"):
  active = segment-0000 (34 B)
  index = {"name": (0, 0), "age": (0, 11), "city": (0, 18)}

after put("score","AAA...A") which pushes active past 200 B:
  1. the 207-byte record is written into segment-0000 (size now 241 B)
  2. _rotate_if_needed() detects 241 > 200
  3. segment-0000 is flushed, fsynced, closed
  4. segment-0001 is opened as the new active segment
  active = segment-0001 (0 B)
  index = {"name": (0,0), "age": (0,11), "city": (0,18), "score": (0, 34)}
          ^^^^ note: score points into segment-0000, not the new active one

after put("year","2026"):
  active = segment-0001 (10 B)
  index += {"year": (1, 0)}  # first record in the new segment

get("score") -> index["score"] = (0, 34) -> open segment-0000,
                seek(34), readline -> "score=AAA...A"
get("year")  -> index["year"]  = (1, 0)  -> open segment-0001,
                seek(0),  readline -> "year=2026"

Three things to notice.

The rotation fires after the write that crossed the threshold. That is the correct order: the record belongs in the old segment, because that is where we wrote it and where the index entry points. Only the next write goes into the new segment. Flipping this order (rotating first, then writing) is a common bug — it leaves a zero-byte gap at the end of the old segment and an off-by-one in the index.

The keydir entry for score is (0, 34) — the id of the segment it was actually written to, not the id of the newly-opened active segment. The new segment is empty at that moment; its first tenant is year.

get() is segment-oblivious. It looks up the tuple, picks the right file handle, seeks, and reads. The same code path works regardless of whether the requested key lives in the newest segment or the oldest.

That walk-through is the whole behaviour. Reads do not care about rotations; writes trigger them at a predictable threshold; the keydir is the only thing that knows which segment each key is in.

Common confusions

Going deeper

If you just wanted "break the log into segments, keep one active, update the keydir to name its segment" — you have it. The rest of this section covers the engineering knobs you will meet the moment you stop reading the chapter and start tuning the store.

The three-way trade of segment size

SEGMENT_SIZE is not a free parameter. Its value pushes on three things at once. Compaction granularity: a closed segment is the unit of compaction, so bigger segments mean bigger, burstier background rewrites. File count: smaller segments mean more files on disk and more open read handles, which presses on the file-descriptor budget covered below. Recovery time: after a crash only the active segment needs a torn-tail scan, and the threshold is an upper bound on how much data that scan must read.

The right choice depends on your write rate, your expected working-set size, your filesystem's tolerance for directories with many files, and your compaction strategy. Bitcask's 2 GB default is tuned for Riak's expected node footprint (hundreds of GB per node); a small embedded KV store would pick 16 or 32 MB; Kafka's log segments are 1 GB by default. No one default is right; the knob exists so you can pick per deployment.
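Some back-of-the-envelope arithmetic makes the knob concrete. The write rate and store size below are made-up illustrative numbers, not measurements:

```python
SEGMENT_SIZE = 64 * 1024 * 1024          # the chapter's 64 MB default

write_rate = 10 * 1024 * 1024            # assume 10 MB/s of steady ingest
store_size = 1024 ** 4                   # assume a 1 TiB store on disk

seconds_per_rotation = SEGMENT_SIZE / write_rate    # time between rotations
segment_count = store_size // SEGMENT_SIZE          # files the store must track

print(seconds_per_rotation, segment_count)          # 6.4 s, 16384 segments
```

At 64 MB a 10 MB/s writer rotates every 6.4 seconds and a tebibyte store holds 16,384 segments, which is exactly why the file-descriptor budget below matters; Bitcask's 2 GB default pushes both numbers down by a factor of 32.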

File-descriptor budgets

A store with 100,000 keys spread across 1,000 closed segments needs at most 1,000 open read handles, plus one active write handle, plus a few for the process itself (stdin/stdout/stderr, a config file, logging). That is inside Linux's default 1,024 soft limit, but only barely.
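You can check the budget your own process actually has (the resource module is Unix-only; the exact numbers vary by distribution and shell configuration):

```python
import resource  # Unix-only standard-library module

# RLIMIT_NOFILE caps how many file descriptors this process may hold open
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"file-descriptor limits: soft={soft}, hard={hard}")
```

If the store needs more, raise the soft limit (up to the hard limit) with `ulimit -n` before starting the process, or bound the handle cache instead.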

Production Bitcask ships with an LRU cache of read handles. The default is 64 — a number chosen because most read workloads are Zipfian, with a hot set of segments receiving the overwhelming majority of accesses, and the tail of cold segments rarely getting reopened. When a get() arrives for a segment whose handle is not cached, you open() it, read, and either keep the handle (evicting the LRU victim) or close it immediately. The sophistication is small; the operational consequence is huge — you can run a store with a million segments on a default-configured Linux box without ever tripping EMFILE.

Our teaching implementation keeps every handle open. That is fine for hundreds of segments; do not ship it.
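What the production version does instead can be sketched in a few lines. HandleCache here is a hypothetical standalone class, not Bitcask's actual API; it takes a seg_id-to-path function and a capacity:

```python
from collections import OrderedDict

class HandleCache:
    """Bounded LRU cache of read-only segment file handles.
    A sketch of what a production store does, not the chapter's class."""

    def __init__(self, path_for, capacity=64):
        self.path_for = path_for      # callable: seg_id -> filesystem path
        self.capacity = capacity
        self._cache = OrderedDict()   # seg_id -> open file, LRU order

    def get(self, seg_id):
        if seg_id in self._cache:
            self._cache.move_to_end(seg_id)   # mark as most recently used
            return self._cache[seg_id]
        f = open(self.path_for(seg_id), "rb")
        self._cache[seg_id] = f
        if len(self._cache) > self.capacity:
            _, victim = self._cache.popitem(last=False)  # evict the LRU handle
            victim.close()
        return f

    def close_all(self):
        for f in self._cache.values():
            f.close()
        self._cache.clear()
```

Wiring it into get() means replacing the self._handles[seg_id] lookup with cache.get(seg_id); everything else stays the same.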

Crashing in the middle of a rotation

What happens if the process dies between "wrote record to segment-N, it pushed past SEGMENT_SIZE" and "fsync segment-N, close it, open segment-N+1"? Walk through the places a crash can land.

Crash after the write, before the fsync: the final record may be torn. If the "=" never reached disk, the startup scan's parse check skips the fragment and the write is simply lost. If the record was cut mid-value, this teaching version would index a truncated value; a production store rejects it with a per-record checksum. Crash after the fsync, before the close or the new open: segment-N is fully durable, and on restart _load_existing_segments scans it like any other segment and opens a fresh segment with the next id as the new active one. Crash after creating segment-N+1, before writing to it: an empty file sits on disk; scanning it yields nothing and it is harmless.

None of these cases corrupt data that was previously committed. The worst case is losing the most recent write, which is the same guarantee an un-fsync-ed append-only log gives you. Rotation does not add any new failure modes; it just adds places where a crash can land. The recovery logic (scan every segment in order, rebuild the keydir) handles all of them uniformly.
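A record torn mid-write is easy to simulate. This sketch applies _scan_segment's parsing rule to raw bytes, with one extra check (requiring the trailing newline) that the teaching code omits, standing in for the per-record checksum a production store would use:

```python
def scan_records(data, seg_id):
    # same last-write-wins parse as _scan_segment, plus a trailing-newline
    # check so a record cut off mid-value is rejected, not half-indexed
    index, offset = {}, 0
    for line in data.splitlines(keepends=True):
        key, eq, _ = line.partition(b"=")
        if eq and key and line.endswith(b"\n"):
            index[key.decode("utf-8")] = (seg_id, offset)
        offset += len(line)
    return index

# a crash cut the third record short: no newline ever reached the disk
torn = b"name=dipti\nage=15\ncity=coim"
scan_records(torn, 3)   # -> {"name": (3, 0), "age": (3, 11)}, torn tail ignored
```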

How real Bitcask does it

The Riak Bitcask implementation is about 2,000 lines of Erlang and differs from ours in three interesting ways. Its records are binary rather than text: each carries a CRC32 checksum and a timestamp, so a torn or corrupted record is detected instead of guessed at. Its keydir is a shared in-memory structure implemented in C as an Erlang NIF, so many processes on a node can read it concurrently. And its merge step writes a hint file next to each merged data file, a compact listing of keys and their locations, so startup can rebuild the keydir without re-reading every value.

The core idea is the same as ours, though: segments are numbered, exactly one is active, rotation closes the active segment, the keydir names the segment. Bitcask and our store are rearranging the same pieces.

Why segmentation is a prerequisite for nearly every later feature

Before leaving this chapter, look forward. Every subsequent feature in this Build, and every storage-engine feature you will build in the rest of your career, assumes a segmented log: compaction (chapter 8) rewrites closed segments one at a time while the writer keeps appending; fast restarts (the goal by chapter 10) work segment by segment; incremental backup and replication ship frozen segment files; and the SSTables of an LSM-tree (Build 3) are sorted, indexed cousins of these segments.

Segmentation looks like a simple bookkeeping change and it is. But it is the bookkeeping change that makes every other idea in a log-structured database implementable. One file equals no options; many files equals every option. Once your store segments its log, the rest of storage engineering opens up in front of you.

Where this leads next

By the end of chapter 10, BitcaskSegmented will be a working, compacting, fast-restarting Bitcask. Then Build 3 opens the second wall — range queries and working-sets larger than RAM — with LSM-trees.

References

  1. Justin Sheehy and David Smith, Bitcask: A Log-Structured Hash Table for Fast Key/Value Data (Basho Technologies, 2010) — the original seven-page paper. Section on segment files and merging is the blueprint this chapter follows. riak.com/assets/bitcask-intro.pdf.
  2. Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 Storage and Retrieval — the section on "hash indexes" covers segmented logs with the same vocabulary used here. dataintensive.net.
  3. Basho, bitcask/src/bitcask_fileops.erl — the production segment-rotation implementation in Erlang, about 600 lines, surprisingly readable. github.com/basho/bitcask.
  4. Jay Kreps, The Log: What every software engineer should know about real-time data's unifying abstraction (LinkedIn Engineering, 2013) — why segmenting the log is a first-class design decision, not a bookkeeping detail. engineering.linkedin.com.
  5. Apache Kafka, Log Segments in the documentation — Kafka's segmentation strategy for partition logs, with tunable segment.bytes and segment.ms. kafka.apache.org/documentation/#log.
  6. Sanjay Ghemawat and Jeff Dean, LevelDB Implementation Notes — LevelDB's SSTable design, which is a sorted-segment cousin of Bitcask's unsorted segments; a good contrast read. github.com/google/leveldb/blob/main/doc/impl.md.