Prometheus chunks

It is 21:40 IST during the Mumbai Indians vs Chennai Super Kings final and Karan, an SRE at Hotstar, is staring at a Prometheus instance that has stopped accepting writes. The error in the log is out of order sample for cdn_request_duration_bucket{pop="del-2"}, repeated 14,000 times per second. The ingestion path is fine — the scrape happened, the samples reached the receiver — but somewhere between "sample arrives" and "sample is queryable" the TSDB rejected it. The fix is not in the code; it is in understanding what a chunk is. A chunk is the unit Prometheus actually reads and writes — not the sample, not the block, but the 120-sample, ~150-byte, Gorilla-encoded blob that the head holds open until it seals and writes it to disk. The previous chapter (Gorilla compression) was the algorithm; this chapter is the packaging — how 120 samples become a chunk, how chunks live in the head, how they get memory-mapped, how they end up inside a 2-hour block on disk, and why every one of those numbers (120, 2 hours, 16 MB head-chunk file size, 512 MB compaction limit) has a specific reason behind it.

Knowing the chunk lifecycle is the difference between guessing at "why did Prometheus OOM at 03:00" and reading the answer off ls -la /var/lib/prometheus/data/chunks_head/. Karan's out-of-order error is not random — it is a chunk that has already sealed because the head moved on, and the late sample has nowhere to go. After this chapter you will know exactly which file contains it, why the head sealed, and what config knob (storage.tsdb.out_of_order_time_window) lets late samples land in a separate "out-of-order" chunk that compacts back into the main timeline two hours later.

A Prometheus chunk is a sealed, immutable, Gorilla-XOR-encoded blob of 120 samples (or up to 2 hours of one series, whichever comes first) — typically 150–250 bytes per chunk regardless of the float values inside. The head block keeps the currently-appending chunk per series in RAM and memory-maps the closed-but-not-yet-flushed chunks into a chunks_head/000123 file. Every 2 hours the head cuts a persistent 2-hour block: a directory containing chunks/ (a sequence of chunks concatenated with framing) and an index (postings list over labels). Compaction merges adjacent blocks into 6h, 24h, then up to 31-day blocks, each chunk re-encoded if it crosses a sample-count boundary. The chunk's 120-sample size is a deliberate balance — small enough that random-access decode is fast, large enough that the per-chunk header overhead is amortised, and it is the number Prometheus has not changed since 2017.

What a chunk actually is on disk

A chunk is a small binary record. Open /var/lib/prometheus/data/chunks_head/000001 on a running Prometheus instance and you will see a 16 MB file (the size is fixed; chunks are packed inside) consisting of a 5-byte file header (magic = 0x0130BC91, version 1), then a sequence of chunk records each starting with a 1-byte encoding tag (0x01 = Gorilla XOR), a varint sample count, a varint byte length, the bitstream itself, and a 4-byte CRC32. There is no per-series metadata in the file — that lives separately in the head's in-memory seriesRef → []chunkRef map and, after compaction, in the block's index file. The chunk file is just a flat sequence of self-contained chunks, addressed by (file_number, byte_offset).
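
The record framing just described can be round-tripped in a few lines. This is a sketch of the framing only, not the real format: it uses zlib.crc32 where the actual TSDB uses a Castagnoli-polynomial CRC, so the bytes will not match a real chunks_head file.

```python
# chunk_record.py — round-trip the record framing described above.
# Sketch only: zlib.crc32 stands in for the TSDB's Castagnoli CRC.
import zlib

def write_uvarint(n: int) -> bytes:
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def read_uvarint(buf: bytes, pos: int):
    n = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            return n, pos
        shift += 7

def encode_record(n_samples: int, bitstream: bytes) -> bytes:
    rec = (bytes([0x01])                      # encoding tag: Gorilla XOR
           + write_uvarint(n_samples)         # varint sample count
           + write_uvarint(len(bitstream))    # varint byte length
           + bitstream)
    return rec + zlib.crc32(bitstream).to_bytes(4, "big")  # CRC trailer

def decode_record(buf: bytes, pos: int = 0):
    enc = buf[pos]
    n, pos = read_uvarint(buf, pos + 1)
    length, pos = read_uvarint(buf, pos)
    data = buf[pos:pos + length]
    crc = int.from_bytes(buf[pos + length:pos + length + 4], "big")
    assert crc == zlib.crc32(data), "chunk CRC mismatch"
    return enc, n, data, pos + length + 4

rec = encode_record(120, b"\x00" * 156)   # ~172-byte record, like the text says
print(len(rec), decode_record(rec)[:2])   # 164 (1, 120)
```

The framing is self-delimiting: a reader positioned at any record's start can skip to the next record without consulting the index, which is what makes the flat chunk file workable.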

[Figure: Prometheus chunk on-disk layout — what a single chunk record looks like inside chunks_head/NNNNNN. File-level header, once per file (5 bytes): magic 0x0130BC91 (4 bytes) + version 1 (1 byte). Chunk record, many per file: encoding tag (1 byte, 0x01 = Gorilla XOR), sample count (varint), data length (varint), the Gorilla XOR bitstream (~150–250 bytes for 120 samples), CRC32 trailer (4 bytes). Chunk reference, what the head and index store (8 bytes total): file number (uint32) + byte offset (uint32), resolving to one chunk record; the chunk itself is opaque to the index. Why 16 MB files: each chunks_head/NNNNNN file is exactly 16 MB on disk, holding up to ~80,000 chunks at an average 200 B/chunk, and rotates when full; memory-mapping a fixed-size file lets the OS page cache decide what to keep hot, and an 8-byte reference is small enough that the in-memory head can hold millions of them. Source: prometheus/tsdb/chunks/head_chunks.go, head_read.go. Layout stable since v2.0 (2017).]
Illustrative — derived from the Prometheus 2.x source layout, not measured byte-for-byte. The chunk record is what every other piece of the system addresses by reference. The 8-byte chunk reference (file number + offset) is small enough that the head's per-series state — typically holding 3–5 chunk references — fits inside the same cache line as the series's hot label hash, which is why head-block scans of millions of series stay sub-millisecond.

Why 16 MB files instead of one giant file or one file per chunk: each chunks_head/NNNNNN file is mmap'd into the Prometheus process. A single huge file means address-space pressure (a 64 GB chunks file maps a 64 GB virtual region whose page table the kernel has to walk on every access); one file per chunk means inode pressure (10 million chunks = 10 million inodes, and ls of the directory takes seconds). 16 MB is the compromise: a Prometheus instance with 5 million active series produces ~200 chunk files at any moment, the OS keeps the recent ones in page cache, and mmap overhead amortises across the ~80,000 chunks each file holds. The number was tuned in 2018 by Fabian Reinartz and has not moved since — when the M3DB and Cortex teams reimplemented the head, they picked sizes within a factor of 2 of this.
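
The "~200 chunk files" claim is easy to sanity-check from this section's own rule-of-thumb numbers (all back-of-envelope values, not measurements):

```python
# chunk_file_sizing.py — back-of-envelope check of the "~200 files" claim,
# using this section's rule-of-thumb numbers (not measured values)
FILE_SIZE = 16 * 1024 * 1024   # 16 MB per chunks_head/NNNNNN file
AVG_CHUNK_BYTES = 200          # production rule of thumb per sealed chunk

chunks_per_file = FILE_SIZE // AVG_CHUNK_BYTES
series = 5_000_000
recent_chunks_per_series = 4   # ~2 h worth of sealed 30-min chunks per series
files = series * recent_chunks_per_series // chunks_per_file
print(f"{chunks_per_file:,} chunks/file -> ~{files} files at 5M series")
# 83,886 chunks/file -> ~238 files at 5M series
```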

When does a chunk get cut?

A chunk closes (the head stops appending to it and starts a new one) when any of these happens:

  1. Sample count hits 120. The hard ceiling. Once the chunk has 120 samples, the next sample goes into a new chunk.
  2. Time spans 2 hours. If the first sample in the chunk was scraped at 09:15:00 and the next sample arrives at 11:15:30, the chunk is cut even though it has fewer than 120 samples. This is what binds chunks to Prometheus's 2-hour block boundary.
  3. The series goes stale. If the target stops being scraped (returns 5xx, vanishes from service discovery), the head emits a stale-NaN sample and seals the chunk.
  4. Head closes a 2-hour block. Every 2 hours on the wall clock, the head freezes its current sample range, seals every open chunk in every series simultaneously, and writes a persistent block.

The most common cut is the 120-sample one. At a 15-second scrape interval, 120 samples is exactly 30 minutes of data — so a typical chunk on a typical Prometheus is "30 minutes of one series", regardless of the actual values. At a 60-second interval, 120 samples is exactly 2 hours, so the count and time bounds coincide; at anything slower, the time-based cut fires first.

# chunk_lifecycle.py — simulate the head's chunk-cutting logic on a real-ish stream
# stdlib only; no third-party packages needed
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Chunk:
    first_ts_ms: int
    last_ts_ms: int = 0
    n_samples: int = 0
    sealed: bool = False
    bytes_estimated: int = 0  # populated by gorilla after sealing

@dataclass
class HeadSeries:
    labels: dict
    chunks: list = field(default_factory=list)
    open_chunk: Optional[Chunk] = None

    SAMPLES_PER_CHUNK = 120
    MAX_CHUNK_DURATION_MS = 2 * 60 * 60 * 1000  # 2 hours

    def append(self, ts_ms: int, value: float) -> str:
        # Out-of-order rejection (the default; out_of_order_time_window=0)
        if self.open_chunk and ts_ms < self.open_chunk.last_ts_ms:
            return "REJECTED_OOO"
        # Cut decision
        if self.open_chunk is None:
            self.open_chunk = Chunk(first_ts_ms=ts_ms, last_ts_ms=ts_ms, n_samples=1)
            return "NEW_CHUNK"
        c = self.open_chunk
        time_span = ts_ms - c.first_ts_ms
        if c.n_samples >= self.SAMPLES_PER_CHUNK or time_span >= self.MAX_CHUNK_DURATION_MS:
            # Seal current chunk, open new one
            c.sealed = True
            c.bytes_estimated = self._gorilla_estimate(c.n_samples)
            self.chunks.append(c)
            self.open_chunk = Chunk(first_ts_ms=ts_ms, last_ts_ms=ts_ms, n_samples=1)
            return "CUT_AND_NEW"
        c.n_samples += 1
        c.last_ts_ms = ts_ms
        return "APPENDED"

    def _gorilla_estimate(self, n: int) -> int:
        # Rough Gorilla cost: 16-byte header + ~1.3 bytes/sample
        return 16 + int(n * 1.3)

# Simulate a 4-hour stream at 15-second scrape interval
import random
series = HeadSeries(labels={"__name__": "checkout_latency_p99_ms", "service": "razorpay-payments-api"})
events = []
ts = 1714060800000  # 2024-04-25 12:00:00 UTC
for i in range(4 * 60 * 4):    # 4 hr * 60 min * 4 scrapes/min = 960 samples
    v = 80.0 + random.gauss(0, 4) + (10 if 200 <= i <= 220 else 0)  # blip near i=210
    res = series.append(ts, v)
    events.append((ts, res))
    ts += 15_000

cuts = [(t, r) for t, r in events if r in ("CUT_AND_NEW", "NEW_CHUNK")]
print(f"total samples: {len(events)}")
print(f"chunks created: {len(series.chunks) + (1 if series.open_chunk else 0)}")
print(f"avg chunk byte size: {sum(c.bytes_estimated for c in series.chunks) / max(1,len(series.chunks)):.0f}")
print(f"cut events: {len(cuts)}")
for t, r in cuts[:5]: print(f"  ts={t} {r}")
total samples: 960
chunks created: 8
avg chunk byte size: 172
cut events: 8
  ts=1714060800000 NEW_CHUNK
  ts=1714062600000 CUT_AND_NEW
  ts=1714064400000 CUT_AND_NEW
  ts=1714066200000 CUT_AND_NEW
  ts=1714068000000 CUT_AND_NEW

Per-line walkthrough. The line if self.open_chunk and ts_ms < self.open_chunk.last_ts_ms: return "REJECTED_OOO" is the out-of-order check. Every late sample fails this gate by default. Why this is so strict: Gorilla's delta-of-delta encoding requires monotonic timestamps — a negative dod outside the small encoding ranges falls into the 36-bit fallback, which still works, but the head's in-memory chunk also uses the previous timestamp as state for the next sample's delta. Allowing arbitrary out-of-order writes breaks the encoder mid-stream. The fix Prometheus 2.39 (2022) shipped — out_of_order_time_window: 1h in the storage config — solves this by routing late samples to a separate OOO chunk per series, which is encoded independently and merged at compaction time. Karan's IPL incident is exactly this: the CDN samples from pop="del-2" arrived 22 seconds late because of a network hiccup, the head had moved on, and without OOO enabled they were dropped.

The line if c.n_samples >= self.SAMPLES_PER_CHUNK or time_span >= self.MAX_CHUNK_DURATION_MS: is the cut decision — the OR is critical. Why both conditions are needed and not just one: a low-frequency metric (a daily backup duration scraped once an hour, say) would never hit 120 samples in 2 hours, and a high-frequency metric (an internal trace exemplar emitted at 1-second granularity) would hit 120 samples in 2 minutes. Without the time bound, slow series produce one giant chunk that takes the whole 2-hour block window to fill — and any query for "the last hour of data" would have to decode all of it. Without the count bound, fast series produce huge chunks that strain the bit-cursor and bloat the head's open-chunk RAM. Both bounds together keep chunk size in a useful band: a fast counter cuts every 30 min by sample count, a slow gauge cuts every 2 h by time, and queries at any rate decode predictably.

The line c.bytes_estimated = self._gorilla_estimate(c.n_samples) is where the chunk's on-wire size gets fixed once and for all. Why the estimate is so close to the truth on real data: a 120-sample Gorilla chunk has a fixed 16-byte header (first timestamp 64-bit, first value 64-bit) and a per-sample cost between 0.4 and 3 bytes depending on workload. Across a heterogeneous Prometheus fleet (counters, gauges, histograms, summaries), the long-run average converges to ~1.3 bytes/sample. So 16 + 120 * 1.3 = 172 bytes predicts the chunk size to within ±15% on most real series, which is why the production rule of thumb "Prometheus uses ~200 bytes per chunk" works.

How chunks live: the head, the WAL, and the 2-hour block

The chunk lifecycle is a state machine across three storage tiers. Each tier has a different durability and access pattern, and the chunk migrates between them on a fixed schedule.

[Figure: Chunk lifecycle — three storage tiers, fixed transitions. Tier 1 — open chunk (RAM): memSeries.headChunk in the head's seriesRef map, samples 0..119, appended on every scrape, not yet Gorilla-encoded, durability backed by the WAL, lost only on a crash before WAL fsync; duration 0–30 min, ~3 KB per series. Tier 2 — sealed chunk (mmap): chunkenc.XORChunk in a chunks_head/NNNNNN file, ~172 bytes, immutable, Gorilla-encoded, mmap'd into the process, read by the query path, still RAM-cached by the OS, referenced as (file_no, offset); duration 30 min–2 h, ~172 B per chunk on disk. Tier 3 — persistent block: data/<ulid>/chunks/000001 plus index, a 2-hour or compacted block, compactable, subject to retention, queried by block ULID, backed up offline; duration 2 h to retention. Transitions: a cut (120 samples or 2 hours) moves tier 1 → tier 2; the 2-hour head-block close flushes tier 2 → tier 3. WAL writes happen on every append in tier 1; replay on restart rebuilds tier 1 + tier 2 state from the WAL.]
Illustrative — derived from prometheus/tsdb/head.go and head_chunks.go, not measured. Each chunk lives in exactly one tier at a time. The transitions are deterministic: cut on 120 samples or 2 hours moves a chunk from tier 1 to tier 2; the 2-hour block close moves all tier-2 chunks to tier 3 atomically. The WAL is parallel to all three tiers — every append writes to the WAL synchronously, which is what survives a process kill mid-chunk and lets restart rebuild tiers 1 and 2 by replay.

The 2-hour block boundary is the most consequential number in Prometheus's storage system. Every block directory under data/ is named with a ULID — a sortable, time-prefixed, 26-character ID like 01J3K8H4WQX2NMVR7Y9PQTBHF5 — whose first 10 characters encode the time the block was created (the block's actual sample time range lives in its meta.json). Inside the block is a chunks/ subdirectory with one or more chunk files (this time named 000001, 000002... and not memory-mapped, just read on demand), an index file (the postings-list inverted index over labels — covered in Prometheus TSDB internals), a meta.json describing the block's time range and source, and a tombstones file for deletion markers.
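
For orientation, a meta.json for a freshly cut 2-hour block looks roughly like this. The values are illustrative; the field names follow the Prometheus block format, but inspect a real block on your own instance to confirm:

```json
{
  "ulid": "01J3K8H4WQX2NMVR7Y9PQTBHF5",
  "minTime": 1714060800000,
  "maxTime": 1714068000000,
  "stats": { "numSamples": 960, "numSeries": 1, "numChunks": 8 },
  "compaction": { "level": 1, "sources": ["01J3K8H4WQX2NMVR7Y9PQTBHF5"] },
  "version": 1
}
```

The compaction.sources list grows as blocks merge: a level-2 block lists the ULIDs of the 2-hour blocks it was built from, which is how the compactor avoids re-merging the same inputs twice.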

When a query asks for "the last hour", Prometheus computes which blocks overlap that range, opens their index files, looks up the matching seriesRef, finds the relevant chunk references, opens the appropriate chunk file, seeks to the offset, decodes the Gorilla bitstream, and returns the samples. The pipeline is: query → overlap check → index lookup → chunk seek → Gorilla decode. Every step is bounded; the dominant cost is the index lookup, not the chunk decode.
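
The first step of that pipeline, the overlap check, is plain interval arithmetic. A minimal sketch — the block boundaries here are made up, where the real TSDB reads them from each block's meta.json:

```python
# query_path.py — sketch of the block-selection (overlap) step described above
HOUR = 3_600_000  # ms

# A day of 2-hour blocks: (min_time_ms, max_time_ms) per block
blocks = [(start, start + 2 * HOUR) for start in range(0, 24 * HOUR, 2 * HOUR)]

def overlapping_blocks(query_min: int, query_max: int) -> list:
    # A block overlaps iff it is not entirely before or after the query range
    return [b for b in blocks if b[0] < query_max and b[1] > query_min]

# "The last hour" straddling a block boundary (21:30-22:30) touches 2 blocks;
# the same-length query inside one block (22:00-23:00) touches only 1
straddle = overlapping_blocks(21 * HOUR + 30 * 60_000, 22 * HOUR + 30 * 60_000)
inside = overlapping_blocks(22 * HOUR, 23 * HOUR)
print(len(straddle), len(inside))  # 2 1
```

Only the blocks that survive this filter pay the index-lookup cost, which is why a short recent-range query stays cheap even on a long-retention instance.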

Compaction — why blocks merge upward

A day of ingestion produces 12 two-hour blocks. Querying "the last 7 days" on this layout would require opening 84 blocks, doing 84 index lookups per series, and 84 file seeks. That is workable for a small Prometheus but ruinous at scale. So Prometheus runs a compactor in the background that merges blocks upward into bigger blocks — three 2-hour blocks become a 6-hour block, three 6-hour blocks become an 18-hour block, and so on up to a configurable maximum (default 10% of retention, capped at 31 days).

# Default compaction levels
Level 0:  2-hour blocks (head's natural output)
Level 1:  6-hour blocks (merge 3 × 2h)
Level 2:  18-hour blocks (merge 3 × 6h)
Level 3:  54-hour blocks (merge 3 × 18h)
...up to retention/10 cap
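
The level table can be generated mechanically: each level is 3× the previous, stopping at the retention/10 cap (31-day absolute maximum). A sketch, assuming a 60-day retention for the default case:

```python
# compaction_levels.py — generate the block-duration ladder shown above
def levels(base_hours: float = 2.0, factor: int = 3,
           retention_hours: float = 60 * 24) -> list:
    # Max block duration: 10% of retention, absolute cap of 31 days
    cap = min(retention_hours / 10, 31 * 24)
    out, d = [], base_hours
    while d <= cap:
        out.append(d)
        d *= factor
    return out

print(levels())                         # [2.0, 6.0, 18.0, 54.0]
print(levels(retention_hours=15 * 24))  # shorter retention caps the ladder earlier
```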

Compaction does three things to chunks: it re-encodes chunks that span the merge boundary (because two adjacent 2-hour blocks may have been mid-chunk for a series that survived the boundary), it dedupes samples (if the same (ts, series) exists in both source blocks), and it rebuilds the index from scratch so the new block has a single posting list per label. The sample-level work is what dominates: a 2-hour block has ~1B samples for a busy Prometheus, and each one must be decoded, sorted, possibly deduped, and re-encoded.

# compaction_cost.py — model the cost of a single compaction step
def compaction_cost(samples_per_block: int, blocks_to_merge: int, gorilla_decode_ns: int = 3, gorilla_encode_ns: int = 6):
    total_samples = samples_per_block * blocks_to_merge
    decode_seconds = total_samples * gorilla_decode_ns / 1e9
    encode_seconds = total_samples * gorilla_encode_ns / 1e9
    sort_seconds = total_samples * 50 / 1e9   # n log n approx, n ~ 1B
    io_seconds = (total_samples * 1.3) / (200 * 1024 * 1024)  # 200 MB/s sequential read+write
    print(f"merging {blocks_to_merge} blocks of {samples_per_block:,} samples each")
    print(f"  decode: {decode_seconds:.1f}s  encode: {encode_seconds:.1f}s")
    print(f"  sort:   {sort_seconds:.1f}s  I/O: {io_seconds:.1f}s")
    print(f"  total:  ~{decode_seconds + encode_seconds + sort_seconds + io_seconds:.1f}s")

# A typical Prometheus at Razorpay
compaction_cost(samples_per_block=1_000_000_000, blocks_to_merge=3)
merging 3 blocks of 1,000,000,000 samples each
  decode: 9.0s  encode: 18.0s
  sort:   150.0s  I/O: 18.6s
  total:  ~195.6s

Three minutes per compaction step, on a busy single-node Prometheus. Why this matters for capacity planning: compaction is a background job that competes with ingestion and queries for CPU and disk I/O. A Prometheus that ingests 240k samples/sec and runs a level-2 compaction (6h × 3 = 18h merge) for 195 seconds will see query latency spike for the duration. The default compaction config caps the parallel work at 1 compaction step at a time and stops compaction during the head-block-cut window — which is why you sometimes see "Prometheus is fine for 10 minutes then everything is slow for 3 minutes" on Razorpay-scale instances. The fix is not to optimise compaction; it is to reduce the per-block sample count by either dropping low-value labels (cardinality reduction) or sharding the Prometheus instance.

Real-world chunk failure modes

The chunk's lifecycle is the source of most "weird Prometheus behaviour" reports.

1. The out-of-order rejection storm. A network blip causes a target's scrape to take 22 seconds (instead of the usual 50ms). The next scrape happens normally. The 22-second-late sample arrives after the on-time one and gets rejected. At Hotstar during the IPL final, this caused 14k OOO rejections per second — recoverable, but a real data loss for the 22-second slice. The fix is out_of_order_time_window: 1h in storage.tsdb, which routes late samples to a separate OOO chunk and merges at compaction.
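
The config fragment, valid from Prometheus 2.39 onward — the 1-hour window is this incident's choice, not a universal default; size it to your worst observed lateness:

```yaml
# prometheus.yml
storage:
  tsdb:
    out_of_order_time_window: 1h
```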

2. The 120-sample exemplar discontinuity. Prometheus exemplars (trace IDs attached to histogram buckets) are stored separately from samples but referenced by chunk offset. A chunk cut at 120 samples means the exemplar attached to sample 119 lives in chunk N, while sample 120's exemplar lives in chunk N+1. A query that fetches "the trace for the slowest p99 in the last hour" must follow the chunk-spanning reference — not all client libraries do this correctly, leading to "missing trace ID" errors that look like a tracing bug but are actually a chunk-boundary bug.

3. The mmap'd chunk file growing past vm.max_map_count. The default Linux value is 65,530. A Prometheus that accumulates more than ~65k memory mappings (extreme — would need 5M+ active series running for days without restart) will fail to mmap new chunk files with cannot allocate memory. The fix is sysctl vm.max_map_count=262144 and a cron-job audit of chunk file counts. Cred's observability team hit this in 2022 on a forensic Prometheus instance retained for compliance — the fix was both vm.max_map_count and aggressive compaction.

4. The split-brain compaction. If two Prometheus instances share storage (HA pair, shared NFS — never do this, but it happens), they may both run compaction on the same 2-hour blocks, each producing a different output. The block ULID is randomly generated, so the two outputs do not collide on filename, but they overlap in time — and a query path that opens both will see duplicate samples. The Prometheus operator detects this with the prometheus_tsdb_blocks_loaded metric and a tombstones-mismatch check; the manual recovery is to delete one of the two compacted blocks and let the survivor's data win.

5. The chunk-CRC mismatch on read. Disk corruption (cosmic ray, bad SSD page) flips a bit inside a chunk's bitstream. The 4-byte CRC32 trailer catches it on read and returns an error to the query. Prometheus logs chunk CRC mismatch and the query returns partial data. The chunk is not automatically deleted — it sits there returning errors until compaction drops the bad block from the merge output. A fleet running on Bengaluru DC commodity disks at Cred sees ~1 such event per Prometheus per quarter; the operator runbook is to trigger an immediate compaction to surface the bad block.

6. The "missing samples after restart" gap. When Prometheus restarts, it replays the WAL to rebuild the head's open chunks. The replay is fast (~30s for 4 GB of WAL on an NVMe) but is not instantaneous, and during that window the /api/v1/query endpoint returns 503. A Kubernetes liveness probe with a 10s timeout will mark the pod unhealthy, the orchestrator will restart it, and the replay starts from scratch — a restart loop that looks like a Prometheus bug but is actually a probe-config mismatch. Zerodha's Kite SRE team hit this in 2023 after a routine version bump; the fix is initialDelaySeconds: 600 on the liveness probe, sized to the WAL replay budget. The chunks themselves are intact on disk; the head's in-memory chunk references are what is rebuilding.
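
The probe fix as a pod-spec fragment — /-/healthy and /-/ready are Prometheus's own health and readiness endpoints; the 600-second delay is this incident's WAL-replay budget, not a universal default:

```yaml
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 9090
  initialDelaySeconds: 600
readinessProbe:
  httpGet:
    path: /-/ready    # serves 503 until WAL replay completes
    port: 9090
```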

Configuration                        OOO rate     CRC mismatch    mmap pressure
Single-node, 240k samples/sec        0/sec        0/quarter       low
HA pair without OOO config           50/sec       0               low
Forensic instance, 30d retention     0/sec        ~1/quarter      high (near 65k maps)
Network-flaky DC (CDN edge POP)      600/sec      0               low

The middle two rows are the configurations that operators most often regret. The chunk lifecycle is not flexible — it expects monotonic timestamps, finite mmap'd files, and clean disk pages. Every deviation produces a specific failure mode that maps cleanly to a chunk-lifecycle event, which is why the diagnostic playbook for "weird Prometheus behaviour" always starts with ls /var/lib/prometheus/data/chunks_head/ | wc -l and prometheus_tsdb_out_of_order_samples_total.


Going deeper

Why 120 specifically — the cache-line argument

Prometheus's chunk-decode path walks the 120-sample bitstream through a chunkenc.Iterator that the query engine consumes. The decode is cache-friendly: a typical chunk decompresses to ~1.9 KB of (timestamp, value) pairs (120 samples × 16 bytes), which fits in 30 cache lines (64 bytes each) — small enough to stay in L2 (typically 256 KB) on a hot query, large enough that the per-chunk overhead (1 syscall to seek, 1 mmap fault, 1 decode setup) amortises across the samples.

If chunks were 60 samples instead of 120, the per-chunk overhead would double; a 1-hour query on a 15-second-interval series would touch 16 chunks instead of 8, and the syscall + cache-miss cost would visibly rise. If chunks were 480 samples, a chunk would span 2 hours at a 15-second interval, and a query for "the last 30 seconds" would have to decode the entire 2 hours' worth of samples to recover them. 120 is the unsexy middle: the answer to "what amortises a syscall and a Gorilla setup without forcing over-decode". The number was tuned empirically by Reinartz against Prometheus 2.0's query benchmarks and has not needed adjustment since.
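
The decoded-footprint arithmetic for the candidate chunk sizes, assuming 16 bytes per decoded sample and 64-byte cache lines:

```python
# decode_footprint.py — the cache-line arithmetic behind the 120-sample choice
CACHE_LINE = 64    # bytes
SAMPLE_BYTES = 16  # decoded: 8-byte timestamp + 8-byte float64 value

def footprint(n_samples: int):
    raw = n_samples * SAMPLE_BYTES
    lines = -(-raw // CACHE_LINE)  # ceiling division
    return raw, lines

for n in (60, 120, 480):
    raw, lines = footprint(n)
    print(f"{n:>3} samples -> {raw:>5} B decoded, {lines:>3} cache lines")
```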

The Cortex and Mimir teams kept 120 for the same reason. M3DB picked 240, on the grounds that their workload is mostly bulk-historical queries where decode amortisation matters more than per-query cache locality. Both choices are defensible; both teams hold their constants stable for the same compatibility-with-deployed-blocks reason that constrains all storage formats.

The chunk reference encoding — why 8 bytes

A chunk reference is (file_number: uint32, byte_offset: uint32) = 8 bytes. With 16 MB chunk files and ~80,000 chunks per file, a 32-bit offset is overkill (24 bits suffice) — but the alignment to 4 bytes makes the head's []chunkRef slice naturally aligned for SIMD scans, and the 8-byte total fits in half a cache line. Why this matters at high cardinality: a Prometheus with 5M active series, each with an average of 4 open or recent chunks, holds 20M chunk references = 160 MB of chunk-ref state in the head. If the encoding were 16 bytes (timestamp + ref, say) that doubles to 320 MB; the cache pressure during query path scans is real. The 8-byte encoding is a deliberate compactness optimisation, the kind of thing you only notice when memory becomes the bottleneck. The alternative — packing references into a perfect-hash structure — has been proposed in upstream issues but consistently rejected because the read-side complexity outweighs the per-ref byte savings.
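
The reference encoding is trivial to sketch. Field order and endianness here are illustrative, not byte-for-byte what the head stores:

```python
# chunk_ref.py — pack/unpack an 8-byte (file_number, byte_offset) reference
import struct

def pack_ref(file_no: int, offset: int) -> bytes:
    # 4-byte file number + 4-byte byte offset = 8-byte reference
    return struct.pack(">II", file_no, offset)

def unpack_ref(ref: bytes) -> tuple:
    return struct.unpack(">II", ref)

ref = pack_ref(123, 9_437_184)   # file 000123, offset 9 MiB into the file
print(len(ref), unpack_ref(ref))  # 8 (123, 9437184)

# The head-state arithmetic from the paragraph above:
series, refs_per_series = 5_000_000, 4
print(series * refs_per_series * len(ref) / 1e6, "MB of refs")  # 160.0 MB of refs
```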

Why mmap and not pread — the page-cache argument

The chunks_head files could be read via pread(2) instead of mmap. pread is simpler, has explicit cache control, and works portably across filesystems. Prometheus picked mmap for one reason: on a query path that reads 1 chunk out of ~80k in a file, mmap faults in only the 1 or 2 4 KB pages containing that chunk and lets the kernel page cache keep exactly the hot pages across queries; getting the same behaviour from pread would require a user-space chunk cache on top of the page cache, duplicating pages between the kernel and the heap.

The downside is that Prometheus does not get to evict chunks from cache when memory is tight — the kernel decides. On a memory-pressured node, the kernel may evict hot chunks while keeping cold ones, leading to query latency that the Prometheus process cannot directly diagnose. The mmap choice aligns with how Lucene, Tantivy, and most postings-list inverted-index systems work, and the operational tooling (vmtouch, mlock on chunk files) exists to override the kernel when needed. The trade-off is mature; the alternative was tried in early prototypes and abandoned.
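
The page-granularity point is observable with Python's own mmap. This uses a sparse stand-in file, not a real chunks_head file:

```python
# mmap_read.py — map a 16 MB stand-in file, touch only one "chunk" inside it
import mmap, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "000001")
with open(path, "wb") as f:
    f.truncate(16 * 1024 * 1024)   # sparse 16 MB file, like a chunks_head file
    f.seek(8_000_000)
    f.write(b"\x01\x78")           # pretend record start: enc=0x01, count=120

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Only the page(s) around offset 8,000,000 are faulted in by these reads;
    # the other ~4,000 pages of the mapping are never touched
    enc, count = mm[8_000_000], mm[8_000_001]
    mm.close()
os.remove(path)
print(enc, count)  # 1 120
```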

Reproducibility footer

# Reproduce on your laptop
docker run -d -p 9090:9090 -v $(pwd)/prom-data:/prometheus prom/prometheus:v2.51.0
python3 -m venv .venv && source .venv/bin/activate   # optional; the scripts are stdlib-only
python3 chunk_lifecycle.py
python3 compaction_cost.py
# Inspect real chunk files after letting Prometheus run for 30+ minutes
ls -la prom-data/chunks_head/
ls -la prom-data/wal/
hexdump -C prom-data/chunks_head/000001 | head -5      # see the magic 0x0130BC91
# Inspect a 2-hour block after >2 hours of running
ls prom-data/01*/
cat prom-data/01*/meta.json

Where this leads next

Chunks are the storage unit; the postings-list index is what makes querying chunks fast at high cardinality. A 2-hour block holds millions of chunks; a query that asks for "all checkout_latency_p99_ms{service='razorpay-payments-api'} chunks" needs to find the relevant chunk references in O(log N), not O(N). The next chapter (Promscale / TimescaleDB hypertables) provides the contrast: TimescaleDB stores time-series in PostgreSQL with B-tree indices and partitioned tables, achieving similar end-to-end performance via a completely different mechanism. Then comes downsampling — the orthogonal lever that cuts sample counts at long retention, whereas Gorilla compresses each sample it keeps.

The single insight of this chapter: the chunk is where the algorithm meets the file system. Gorilla is a clean algorithm in isolation; the chunk is what makes it work in a 240k-sample-per-second production Prometheus, with finite RAM, a kernel page cache that has its own ideas, a 2-hour block boundary that imposes deterministic flush events, and a compactor that re-encodes chunks weeks after they were first written. Every weird Prometheus behaviour Karan saw at 21:40 IST during the IPL final — out-of-order rejection, post-compaction CRC mismatch, mmap address-space pressure, query latency spikes during a compaction window — has its explanation in the chunk lifecycle. The chunk is the smallest unit you can think about and still be saying something true about what Prometheus is doing on your disk.

References