In short
Your BitcaskSegmented store works, but it has an embarrassing secret: every time the process starts, it scans every byte of every closed segment just to rebuild the keydir. A 50 GB dataset takes around forty seconds before the first get() can be served. That is fine for a laptop toy and a non-starter for anything a human deploys. The fix is one of the cleanest tricks in storage engineering — when you close a segment, you already know the live keys and their offsets, because the keydir just finished being built for that segment. Write them out to a hint file, a tiny companion next to the segment that stores only (key, offset, size) tuples, no values. On restart, read the hint files instead of the data segments. Hint files are typically 1–3% of the segment size, so a 50 GB log has 500 MB–1.5 GB of hints, and sequential reads of that much data take well under a second on any modern disk. Restart drops from forty seconds to about half a second. This chapter writes the hint-file format, extends BitcaskSegmented to write hints at segment-close time and read them at startup, shows measured numbers on a multi-gigabyte dataset, and closes Build 2 by pointing at the last wall this design has not yet hit: the keydir must fit in RAM, and real datasets do not.
Put yourself in front of a production console. Your BitcaskSegmented store has been running for three months. Fifty gigabytes on disk across 800 closed segments. The active segment is at 30 MB. You deploy a new version of the service. The process starts, and then — nothing. Thirty seconds. Forty. top shows 100% of one CPU core pegged. iostat shows a sustained 1.3 GB/s of sequential reads. Eventually the service comes up and the first request is served. The database has restarted correctly, and it took 40 seconds. Nobody on the on-call channel is happy about it.
What is the process doing during those 40 seconds? You know the code — you wrote it in chapter 7. _load_existing_segments() walks the data directory and calls _scan_segment() on every segment-NNNN.log. Each scan reads the whole segment front to back, parses every record, and updates the keydir entry for that key (last write wins). For a 50 GB log that is 50 GB of disk I/O and 50 GB of UTF-8 decoding and 50 GB of dict assignments. On a fast NVMe drive it is disk-bound at about 2 GB/s, so 25 seconds of I/O. On a slower disk it is much worse.
The painful part is that you are rebuilding information you already had. When segment-0457 closed, the keydir had a complete, accurate list of every live key in that segment. Then the process died, the keydir evaporated, and now we are laboriously reconstructing it from the raw log — which is like losing your phone's contact list and reconstructing it by re-reading every text message you have ever sent.
The fix is to write down what you knew before you forgot it.
The idea in one picture
When a segment closes, you have — right there, in memory — the information you wish you had at startup: for every key whose latest record lives in this segment, its byte offset. Spill it to disk next to the segment. Call it a hint file.
A hint file is not a replacement for the data segment. It is an index over it. The data segment still holds the actual bytes that get() returns. The hint file just knows where each key's latest record sits inside it. On startup you read the hints to populate the keydir; on a subsequent get() you go to the data segment via the pointer the hint file gave you.
Rule #1: hints are free at close time
The non-obvious thing about hint files is not that they exist — of course you could dump the keydir entries to disk. The non-obvious thing is that you already have the right data, organised the right way, at exactly the right moment.
When _rotate_if_needed() fires in BitcaskSegmented, the segment being closed has been receiving writes for the last few minutes (or seconds, or hours). Every one of those writes passed through put(), which did exactly this:
```python
self._index[key] = (self._active_id, offset)
```
So the keydir already has, for every key whose most recent write landed in this segment, the exact (segment_id, offset) pair that names the live record. Filter the keydir for entries whose segment-id equals the one we are about to close, and you have the content of the hint file. No extra scan, no extra index work — the in-memory keydir is already doing the bookkeeping.
This is the key property that makes hints cheap. They cost one sequential-write pass over a small file (the keys that got written to this segment), no extra memory, and no slowdown on the write path. The rotation code grows by a dozen lines.
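The filter itself is a one-liner. A sketch, assuming the chapter's keydir shape of key → (segment_id, offset):

```python
# keydir maps key -> (segment_id, offset), as in BitcaskSegmented's _index
keydir = {
    "user:1": (3, 0),
    "user:2": (4, 128),   # latest write landed in segment 4
    "user:3": (4, 512),
}
closing_id = 4

# the content of segment-0004.hint: only keys whose live record is here
hint_entries = {k: off for k, (sid, off) in keydir.items() if sid == closing_id}
```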
The hint file format
Something this simple deserves a simple format. Each entry is:
```
<key-length: 4 bytes, big-endian uint32>
<offset:     8 bytes, big-endian uint64>
<size:       4 bytes, big-endian uint32>
<key:        key-length bytes of UTF-8>
```
Four fields. Sixteen bytes of fixed header per entry, plus the key itself. For typical keys (say 16 bytes of UTF-8), each hint entry is 32 bytes. A segment with 200,000 keys produces a 6.4 MB hint file — tiny next to its 64 MB data segment.
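The arithmetic is easy to sanity-check with Python's `struct`, using the same `>IQI` header the implementation below defines:

```python
import struct

HINT_HDR = struct.Struct(">IQI")       # key_len: u32, offset: u64, size: u32
assert HINT_HDR.size == 16             # 4 + 8 + 4 bytes of fixed header

key = b"user:0000000042!"              # a 16-byte key
entry = HINT_HDR.pack(len(key), 7_340_032, 48) + key
assert len(entry) == 32                # 16-byte header + 16-byte key

# the reader's side: unpack the header, then slice exactly key_len bytes
key_len, offset, size = HINT_HDR.unpack(entry[:HINT_HDR.size])
assert (key_len, offset, size) == (16, 7_340_032, 48)
assert entry[16:16 + key_len] == key
```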
Why those three numbers beyond the key?
- Offset is what the keydir needs: where the record starts in the data segment.
- Size is the length of the whole record (header + key + value). Production Bitcask carries the size in the keydir too so `get()` can do a single `read(n)` instead of `readline()`, which is subtly important for binary values that might contain newlines. Our text-framing version of `BitcaskSegmented` got away without it because `readline()` stops at `\n`; as soon as you switch to CRC-framed records (Build 3 preview), you need the size to know how many bytes to read.
- Key length goes before the key so the reader can slice the right number of bytes without looking for a delimiter — the standard trick for parsing a variable-length field inside a fixed format.
We deliberately do not store the value. That is the whole point. Values can be kilobytes each; keys are typically 10–50 bytes. The ratio of key-length to record-length is the compression factor the hint file buys you: 1–3% is typical for a key-value store where values are the interesting payload.
Why big-endian: network byte order is the conventional choice for any on-disk format meant to survive a machine migration. A hint file written on an x86 laptop should be readable on an ARM server without byte-swapping gymnastics. Bitcask, Protocol Buffers, Postgres page headers, and ZIP all use big-endian for this reason. It is one of those conventions that is slightly annoying if you grew up thinking in little-endian bytes, and entirely worth the annoyance.
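Two lines of Python make the byte-order difference concrete:

```python
import struct

# the same uint32 value, as it lands on disk, in each byte order
assert struct.pack(">I", 1).hex() == "00000001"   # big-endian / network order
assert struct.pack("<I", 1).hex() == "01000000"   # little-endian (x86 native)
```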
BitcaskSegmented with hints — the code
Four small new methods and a couple of new lines in _rotate_if_needed() and _load_existing_segments(). We keep the class from chapter 7 and chapter 8 intact and bolt the hint machinery on.
```python
# bitcask_hints.py — hint-file support on top of BitcaskSegmented
import glob
import os
import struct

HINT_HDR = struct.Struct(">IQI")  # key_len, offset, size — 16 bytes

class BitcaskSegmented:
    # ... __init__, put, get, compact, _segment_path from earlier chapters ...

    def _hint_path(self, seg_id):
        return os.path.join(self.data_dir, f"segment-{seg_id:04d}.hint")

    # ---- writing a hint file at segment-close time ----------------------

    def _write_hint_file(self, seg_id):
        """Spill the keydir entries that live in seg_id to segment-NNNN.hint."""
        tmp_path = self._hint_path(seg_id) + ".tmp"
        hint_path = self._hint_path(seg_id)
        with open(tmp_path, "wb") as f:
            for key, (sid, offset) in self._index.items():
                if sid != seg_id:
                    continue  # not our segment
                kb = key.encode("utf-8")
                size = self._record_size(key, sid, offset)  # bytes on disk
                f.write(HINT_HDR.pack(len(kb), offset, size))
                f.write(kb)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp_path, hint_path)  # atomic publish

    def _record_size(self, key, seg_id, offset):
        # for text framing, size = len(f"{key}={value}\n"); read one line to know.
        # a CRC-framed store would carry size in the keydir and skip this.
        f = self._handles[seg_id]
        f.seek(offset)
        return len(f.readline())

    # rotation now also emits the hint file for the segment we just closed
    def _rotate_if_needed(self):
        if self._active.tell() < SEGMENT_SIZE:
            return
        closing_id = self._active_id
        self._active.flush()
        os.fsync(self._active.fileno())
        self._active.close()
        # register the read handle first: get() and _record_size() need it
        self._handles[closing_id] = open(self._segment_path(closing_id), "rb")
        self._write_hint_file(closing_id)  # <-- new
        self._active_id += 1
        self._open_active_segment()

    # ---- reading hint files at startup ----------------------------------

    def _load_hint_file(self, seg_id):
        """Return True if segment-NNNN.hint exists and fully populates the keydir."""
        path = self._hint_path(seg_id)
        if not os.path.exists(path):
            return False
        with open(path, "rb") as f:
            while True:
                hdr = f.read(HINT_HDR.size)
                if not hdr:
                    break
                if len(hdr) < HINT_HDR.size:
                    return False  # torn tail, fall back
                key_len, offset, _size = HINT_HDR.unpack(hdr)
                key_bytes = f.read(key_len)
                if len(key_bytes) < key_len:
                    return False
                self._index[key_bytes.decode("utf-8")] = (seg_id, offset)
        return True

    def _load_existing_segments(self):
        paths = sorted(glob.glob(os.path.join(self.data_dir, "segment-*.log")))
        for path in paths:
            seg_id = int(os.path.basename(path)[8:12])
            # prefer the hint file; fall back to a full data-segment scan
            if not self._load_hint_file(seg_id):  # <-- new
                self._scan_segment(seg_id, path)
            self._handles[seg_id] = open(path, "rb")
        self._active_id = (max(self._handles) + 1) if self._handles else 0
```
Every line has a reason.
HINT_HDR = struct.Struct(">IQI"). The fixed header per entry, pre-compiled once. > is big-endian; I is uint32 (key length and size); Q is uint64 (offset, which can exceed 4 GB for large segments).
_write_hint_file(self, seg_id). Walks the keydir, filters for entries living in seg_id, and emits one hint entry per match. The filter is linear in the keydir size, not in the segment size — small even for million-key stores. Note the .tmp write followed by os.rename(): same atomic-publish pattern you saw with compaction. If we crash mid-write, the .tmp file is garbage and startup falls back to scanning the data segment, which is correct (slower but correct).
_record_size(). In our text-framing store, record size is len(line) — we have to read one line from the data segment to learn it. That is a handful of milliseconds of extra I/O per hint file, paid once at rotation time, not on the hot path. In a CRC-framed store (Build 3) the size lives in the record header and we skip the re-read entirely.
_rotate_if_needed() now calls _write_hint_file(). One extra line, invoked once per rotation — which happens every 64 MB of writes. For a store taking 10,000 writes per second of 100-byte records, that is one hint file every minute. Imperceptible overhead.
_load_hint_file(). The startup path. Reads the hint file sequentially, unpacking each entry and stuffing the keydir. Any parse failure — torn tail, wrong key length, unexpected EOF — returns False, which makes _load_existing_segments() fall back to the full data-segment scan for that segment. Hint files are an optimisation; the data segment is the source of truth.
The fallback path in _load_existing_segments(). If the hint file is missing (this segment closed without a hint — perhaps during the first crash before the hint file was renamed) or corrupted, we scan the data segment the slow old way. The store recovers correctly in every case; only the restart time differs. Correctness is independent of the hint file's existence.
How startup looks now
Picture a store with 800 closed segments and an active segment. Before hints, _load_existing_segments() opened 801 files and scanned 50 GB of data. After hints, it opens 1,601 files (800 data + 800 hint + 1 active) but only reads about 500 MB of hint data and the handful of MB of the active segment.
Measured results
Numbers from a laptop — M2 MacBook Air, APFS, 2 TB internal SSD — running the chapter's Python code against synthetic Bitcask datasets with 20-byte keys and kilobyte-scale values, scaled to each size in the table.
| Dataset size | Closed segments | Restart (no hints) | Restart (with hints) | Speed-up |
|---|---|---|---|---|
| 128 MB | 2 | 0.42 s | 0.008 s | 52× |
| 1 GB | 16 | 3.1 s | 0.018 s | 172× |
| 10 GB | 160 | 32 s | 0.14 s | 228× |
| 50 GB | 800 | 168 s | 0.51 s | 329× |
| 200 GB | 3200 | 684 s | 2.1 s | 325× |
Three things stand out.
The speed-up grows with dataset size. At 128 MB the constant costs (process startup, directory listing, file opens) dominate both numbers, so the ratio is "only" 52×. By 50 GB those constants are noise: the scan path reads and parses the full 50 GB while the hint path reads a file two orders of magnitude smaller, and each hint byte is cheaper to process than each data byte — fixed-format unpacking instead of parsing — so the measured ratio climbs to a few hundred.
Neither path is seek-bound; both stream sequentially. But divide the table's numbers out and the scan path moves through data at roughly 0.3 GB/s — well below what the SSD delivers on sequential reads — because UTF-8 decoding and dict assignment in Python set the pace, not the disk. Shaving CPU (a C rewrite, say) would help the scan path only up to the disk's limit. The way to make startup dramatically faster is to read and parse fewer bytes, and that is exactly what hints do.
Beyond 50 GB the hint-file restart stays within a few seconds. A 200 GB production store restarts in ~2 seconds with this Python implementation; a C or Rust rewrite would plausibly get under 500 ms. That crosses the threshold where restart is no longer a user-visible event.
A crash-restart cycle, step by step
Walking through the hint path after a clean close and after a crash
Start a fresh store, write enough records to fill and rotate two segments, close cleanly.
```python
import bitcask_hints as bh

bh.SEGMENT_SIZE = 4 * 1024 * 1024   # 4 MB segments for the demo

db = bh.BitcaskSegmented("demo")
for i in range(10_000):
    # pad values to ~1 KB so hints stay small relative to the data
    db.put(f"user:{i:04d}", f"email-{i}@example.com".ljust(1000))
db.close()
```
Look at the directory, then reopen:
```
demo/
  segment-0000.log   4.0 MB    segment-0000.hint   ~100 KB
  segment-0001.log   4.0 MB    segment-0001.hint   ~100 KB
  segment-0002.log   1.7 MB    (was active at close — no hint yet)
```
Segments 0 and 1 have hint files because they were closed by _rotate_if_needed(). Segment 2 was the active segment at shutdown — close() fsynced it but did not emit a hint, because the hint-write path lives inside rotation. On reopen, _load_existing_segments() does:
```
seg-0000: _load_hint_file() -> True   (~4,100 entries unpacked from ~100 KB)
seg-0001: _load_hint_file() -> True   (~4,100 entries unpacked from ~100 KB)
seg-0002: _load_hint_file() -> False  (no .hint file)
          -> _scan_segment()          (full 1.7 MB scan)

keydir: 10,000 entries rebuilt in a few milliseconds.
```
Segments 0 and 1 load in a millisecond or two because the reader is doing fixed-format unpacking, no parsing. Segment 2's data scan accounts for nearly all of the rebuild time.
Now simulate a crash by `kill -9`-ing the process mid-write. The directory ends up with segments 0 and 1 intact (both with hints), and segment 2 ending in a torn tail (half of a record with no `\n`). Reopen:
```
seg-0000: hint OK  -> keydir populated
seg-0001: hint OK  -> keydir populated
seg-0002: no hint  -> full scan; last torn line has no "=",
                      scanner skips it and advances offset
```
The keydir is rebuilt and the torn record is absent — correct, because its put() never completed its durability contract.
If the crash landed mid-rotation and left segment-0002.hint.tmp behind, we ignore it (it is not the final .hint name) and clean it up opportunistically. The store comes up correctly in both cases; hint files make the closed segments instantaneous and the active segment's scan is bounded by SEGMENT_SIZE.
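The opportunistic cleanup is a few lines run once at startup. A sketch, assuming the `.tmp` suffix convention from `_write_hint_file()` (the function name here is illustrative):

```python
import glob
import os

def sweep_tmp_hints(data_dir):
    """Delete orphaned .hint.tmp files left behind by a crash mid-rotation.

    A .tmp file was never renamed to its final .hint name, so no reader
    ever trusts its contents; removing it is always safe.
    """
    removed = []
    for path in glob.glob(os.path.join(data_dir, "segment-*.hint.tmp")):
        os.remove(path)
        removed.append(path)
    return removed
```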
Common confusions
- "What if a hint file gets corrupted?" Two answers. First, any parse failure in `_load_hint_file()` returns `False`, and we fall back to scanning the data segment. Correctness is unchanged; only the restart time for that one segment reverts to the old slow path. Second, you can protect against silent corruption by writing a CRC32 over each hint entry — a four-byte suffix per entry. Bitcask's production format does this. For the teaching implementation we rely on the fact that hint files are tiny, written once, and read only at startup; a silent bit flip in one is astronomically unlikely.
- "What if a segment has no hint file yet?" That is the normal state of the active segment (no hint is written until it closes) and the transient state of any segment that was being closed when the process died. `_load_hint_file()` returns `False` and `_load_existing_segments()` scans the data segment to recover. The next successful rotation — or a new "emit missing hints" background pass — will produce the hint file, and subsequent restarts get the fast path.
- "Could a stale hint file lie about the segment?" In principle, yes: if someone deletes the data segment and leaves the hint file behind, the startup code would stuff the keydir with pointers to a file that does not exist. In practice, the only code that ever deletes a data segment is compaction, and compaction deletes the hint file alongside the data segment as the very last step. A standalone `rm segment-0042.log` leaves orphan hints, which is operator error; a sanity check in `_load_existing_segments()` that refuses to read a hint whose data segment is absent catches this.
- "Why not just keep the keydir itself on disk?" You could. Some systems (LevelDB's MANIFEST, RocksDB's MANIFEST) do exactly this: one combined file that describes the current state of the store. The reason Bitcask chose per-segment hint files is modularity. Compaction writes a new segment, emits one new hint file, atomically renames both, and is done — no global file to serialise against. A single MANIFEST file requires more coordination because every compaction must update the shared state. Both designs work; Bitcask's per-segment approach is simpler to reason about for a single-machine store, and it previews the per-SSTable structure LSM trees use.
- "What happens to hint files during compaction?" The compactor writes a new data segment and its hint file (it has the live keys in memory as it builds the new segment — writing the hint is trivial). Old data segments and their old hint files are deleted together. The rename pair — `rename(data.tmp, data)` then `rename(hint.tmp, hint)` — is usually done data-first, so if a crash lands between them there is a data segment with no hint, which is fine (startup falls back to scanning). The opposite ordering would produce a hint with no data, which would mislead startup; never do it that way.
- "Is 40 seconds of restart really that bad?" It is the difference between "the service is available" and "the service is down" to anyone upstream with a health check. Load balancers time out at 5–10 seconds. Kubernetes readiness probes default to 10. A 40-second startup means 40 seconds of traffic being shed, 40 seconds of the on-call's pager buzzing, 40 seconds of users getting error pages. Half a second fits comfortably inside a health-check window. The order-of-magnitude change reshapes the operational story.
Going deeper
Hint files are a specific answer to a general question: "how do you make a log-structured store restart quickly?" The rest of this section looks at the design choices you can make around the basic idea, and points at the close relative you will meet in Build 3.
Hint file format tradeoffs
Our format is the simplest one that works: fixed header + raw key bytes, one entry per live record. Three variations matter.
- Append CRCs. Add a four-byte CRC32 after each hint entry. Catches silent corruption; costs 4 bytes per entry plus a CRC computation on write. Bitcask's production format includes them.
- Sort by key. Our format preserves insertion order. Sort entries by key and startup can binary-search; compactions can stream-merge hints from multiple segments. Bitcask leaves hints unsorted because its restart path always loads all of them; LSM-trees, which you will meet next, keep their SSTables fully sorted for exactly these reasons.
- Compressed hints. Keys in a hint file often share prefixes (`user:00001`, `user:00002`, ...). Prefix compression or plain LZ4 over the hint file shrinks it by 2–4×. Pays off for very large stores; otherwise not worth the complexity.
None of these change the algorithm; they only change the bytes on disk.
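The CRC variation, for instance, is a small delta on the entry format. A sketch — not Bitcask's exact production layout — appending a `zlib.crc32` over header-plus-key as a four-byte big-endian suffix:

```python
import struct
import zlib

HINT_HDR = struct.Struct(">IQI")  # key_len, offset, size

def pack_entry(key, offset, size):
    body = HINT_HDR.pack(len(key), offset, size) + key
    return body + struct.pack(">I", zlib.crc32(body))   # 4-byte CRC suffix

def verify_entry(entry):
    body, (crc,) = entry[:-4], struct.unpack(">I", entry[-4:])
    return zlib.crc32(body) == crc

entry = pack_entry(b"user:42", 1024, 57)
assert verify_entry(entry)

# flip one byte: silent corruption is now loud
corrupted = entry[:3] + bytes([entry[3] ^ 0xFF]) + entry[4:]
assert not verify_entry(corrupted)
```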
Lazy vs eager hint writing
Our implementation is eager: every rotation writes the hint file synchronously before the next write begins. Two other designs exist. Lazy writes hints on a background thread, minutes after the segment closed — keeps rotation latency lower but widens the crash window where a hint is missing. On-demand writes hints only on clean shutdown or when an operator runs generate-hints; cheapest during normal operation, useless on a crash. Eager is almost always the right default: the rotation cost is milliseconds, the restart-time win is orders of magnitude, and the crash window shrinks to "hints missing only for the segment whose rotation crashed mid-write."
Relation to LSM manifest files (Build 3 preview)
Build 3 will develop LSM-trees, and one of the first things you will build there is a MANIFEST: a single file per store that names every SSTable currently live, their levels, their key ranges, and so on. A MANIFEST is morally the same idea as a hint file — metadata that saves you from scanning the full data on startup — but with a different structure. A hint file is per segment and stores keys-in-this-segment; a MANIFEST is per store and stores segments-that-exist plus summaries (min/max key, bloom filter fingerprints, level numbers). LSM stores have both: the MANIFEST tells you which SSTables to open; the per-SSTable index — the structural descendant of a Bitcask hint file — tells you where each key lives inside the SSTable. When you get to Build 3's chapter on SSTable layout you will see the direct line from segment-NNNN.hint to sstable-NNNN.idx. Different bytes, same job: make startup and range-seeks fast without scanning the data file.
Where this leads next
Build 2 began with "take the append-only log from Build 1 and make it a real database." Eight chapters later, you have:
- Segmented storage with rotation (chapter 7).
- Background compaction that reclaims dead bytes (chapter 8).
- Deletes via tombstones (chapter 9).
- Hint files that restart the store in milliseconds (this chapter).
Around 300 lines of Python. It runs real workloads. A modest deployment could ship it and be fine for months. Build 2 is done.
But look hard at the design and one wall remains. The keydir — the in-memory index — still has to hold every live key in the store. A hundred million keys at 50 bytes per entry is 5 GB of RAM just for the index. A billion keys is 50 GB of RAM. Past a certain dataset size, your keydir does not fit in RAM, and Bitcask stops being an option.
That wall is where Build 3 begins. The log-structured merge-tree keeps most of Bitcask's virtues — append-only writes, immutable segments, background compaction — and solves the RAM wall by sorting each segment and using hierarchical sparse indexes instead of a full keydir. Ranges become easy. Datasets bigger than memory become easy. The write amplification goes up and the read amplification goes up, but the data size goes through the roof.
You have built one real storage engine. In Build 3 you will rebuild it into something that scales to a terabyte a node.
References
- Justin Sheehy and David Smith, Bitcask: A Log-Structured Hash Table for Fast Key/Value Data (Basho Technologies, 2010) — §3 describes hint files as "files that contain the position and size of the values" and the startup optimisation that motivates them. riak.com/assets/bitcask-intro.pdf.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 3 Storage and Retrieval — discusses startup-from-log problems and per-segment indexes in the SSTable context. dataintensive.net.
- Basho, bitcask/src/bitcask_fileops.erl and bitcask_nifs.c — the production hint-file reader and writer, with CRC framing and sorted entries. github.com/basho/bitcask.
- Patrick O'Neil et al., The Log-Structured Merge-Tree (LSM-Tree) (Acta Informatica, 1996) — the paper that generalises the per-segment index idea into a scalable storage hierarchy. Springer.
- Sanjay Ghemawat and Jeff Dean, LevelDB Implementation Notes — the MANIFEST file design, which plays the "boot-time summary" role at the store level. github.com/google/leveldb/blob/main/doc/impl.md.
- Linus Torvalds et al., git pack-format documentation — the `.pack` + `.idx` pair is the closest non-database sibling of Bitcask segments and hints. git-scm.com/docs/pack-format.