Prometheus chunks

It is 21:40 IST during the Mumbai Indians vs Chennai Super Kings final and Karan, an SRE at Hotstar, is staring at a Prometheus instance that has stopped accepting writes. The error in the log is out of order sample for cdn_request_duration_bucket{pop="del-2"}, repeated 14,000 times per second. The ingestion path is fine — the scrape happened, the samples reached the receiver — but somewhere between "sample arrives" and "sample is queryable" the TSDB rejected it. The fix is not in the code; it is in understanding what a chunk is. A chunk is the unit Prometheus actually reads and writes — not the sample, not the block, but the 120-sample, ~150-byte, Gorilla-encoded blob that the head holds open until it seals and writes it to disk. The previous chapter (Gorilla compression) was the algorithm; this chapter is the packaging — how 120 samples become a chunk, how chunks live in the head, how they get memory-mapped, how they end up inside a 2-hour block on disk, and why every one of those numbers (120, 2 hours, 16 MB head-chunk file size, 512 MB compaction limit) has a specific reason behind it.

Knowing the chunk lifecycle is the difference between guessing at "why did Prometheus OOM at 03:00" and reading the answer off ls -la /var/lib/prometheus/data/chunks_head/. Karan's out-of-order error is not random — it is a chunk that has already sealed because the head moved on, and the late sample has nowhere to go. After this chapter you will know exactly which file contains it, why the head sealed, and what config knob (storage.tsdb.out_of_order_time_window) lets late samples land in a separate "out-of-order" chunk that compacts back into the main timeline two hours later.

A Prometheus chunk is a sealed, immutable, Gorilla-XOR-encoded blob of 120 samples (or up to 2 hours of one series, whichever comes first) — typically 150–250 bytes per chunk regardless of the float values inside. The head block keeps the currently-appending chunk per series in RAM and memory-maps the closed-but-not-yet-flushed chunks into a chunks_head/000123 file. Every 2 hours the head cuts a persistent 2-hour block: a directory containing chunks/ (a sequence of chunks concatenated with framing) and an index (postings list over labels). Compaction merges adjacent blocks into 6h, 24h, then up to 31-day blocks, each chunk re-encoded if it crosses a sample-count boundary. The chunk's 120-sample size is a deliberate balance — small enough that random-access decode is fast, large enough that the per-chunk header overhead is amortised, and it is the number Prometheus has not changed since 2017.

What a chunk actually is on disk

A chunk is a small binary record. Open /var/lib/prometheus/data/chunks_head/000001 on a running Prometheus instance and you will see a 16 MB file (the size is fixed; chunks are packed inside) consisting of a 5-byte file header (magic = 0x0130BC91, version 1), then a sequence of chunk records each starting with a 1-byte encoding tag (0x01 = Gorilla XOR), a varint sample count, a varint byte length, the bitstream itself, and a 4-byte CRC32. There is no per-series metadata in the file — that lives separately in the head's in-memory seriesRef → []chunkRef map and, after compaction, in the block's index file. The chunk file is just a flat sequence of self-contained chunks, addressed by (file_number, byte_offset).
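
The record framing just described can be round-tripped in a few lines. This is a sketch of the framing only, not the real format: it uses zlib.crc32 where the actual TSDB uses a Castagnoli-polynomial CRC, so the bytes will not match a real chunks_head file.

```python
# chunk_record.py — round-trip the record framing described above.
# Sketch only: zlib.crc32 stands in for the TSDB's Castagnoli CRC.
import zlib

def write_uvarint(n: int) -> bytes:
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def read_uvarint(buf: bytes, pos: int):
    n = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            return n, pos
        shift += 7

def encode_record(n_samples: int, bitstream: bytes) -> bytes:
    rec = (bytes([0x01])                      # encoding tag: Gorilla XOR
           + write_uvarint(n_samples)         # varint sample count
           + write_uvarint(len(bitstream))    # varint byte length
           + bitstream)
    return rec + zlib.crc32(bitstream).to_bytes(4, "big")  # CRC trailer

def decode_record(buf: bytes, pos: int = 0):
    enc = buf[pos]
    n, pos = read_uvarint(buf, pos + 1)
    length, pos = read_uvarint(buf, pos)
    data = buf[pos:pos + length]
    crc = int.from_bytes(buf[pos + length:pos + length + 4], "big")
    assert crc == zlib.crc32(data), "chunk CRC mismatch"
    return enc, n, data, pos + length + 4

rec = encode_record(120, b"\x00" * 156)   # ~172-byte record, like the text says
print(len(rec), decode_record(rec)[:2])   # 164 (1, 120)
```

The framing is self-delimiting: a reader positioned at any record's start can skip to the next record without consulting the index, which is what makes the flat chunk file workable.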

[Figure: Prometheus chunk on-disk layout — what a single chunk record looks like inside chunks_head/NNNNNN. File-level header, once per file (5 bytes): magic 0x0130BC91 (4 bytes) + version 1 (1 byte). Chunk record, many per file: encoding tag (1 byte, 0x01 = Gorilla XOR), sample count (varint), data length (varint), the Gorilla XOR bitstream (~150–250 bytes for 120 samples), CRC32 trailer (4 bytes). Chunk reference, what the head and index store (8 bytes total): file number (uint32) + byte offset (uint32), resolving to one chunk record; the chunk itself is opaque to the index. Why 16 MB files: each chunks_head/NNNNNN file is exactly 16 MB on disk, holding up to ~80,000 chunks at an average 200 B/chunk, and rotates when full; memory-mapping a fixed-size file lets the OS page cache decide what to keep hot, and an 8-byte reference is small enough that the in-memory head can hold millions of them. Source: prometheus/tsdb/chunks/head_chunks.go, head_read.go. Layout stable since v2.0 (2017).]
Illustrative — derived from the Prometheus 2.x source layout, not measured byte-for-byte. The chunk record is what every other piece of the system addresses by reference. The 8-byte chunk reference (file number + offset) is small enough that the head's per-series state — typically holding 3–5 chunk references — fits inside the same cache line as the series's hot label hash, which is why head-block scans of millions of series stay sub-millisecond.

Why 16 MB files instead of one giant file or one file per chunk: each chunks_head/NNNNNN file is mmap'd into the Prometheus process. A single huge file means address-space pressure (a 64 GB chunks file maps a 64 GB virtual region whose page table the kernel has to walk on every access); one file per chunk means inode pressure (10 million chunks = 10 million inodes, and ls of the directory takes seconds). 16 MB is the compromise: a Prometheus instance with 5 million active series produces ~200 chunk files at any moment, the OS keeps the recent ones in page cache, and mmap overhead amortises across the ~80,000 chunks each file holds. The number was tuned in 2018 by Fabian Reinartz and has not moved since — when the M3DB and Cortex teams reimplemented the head, they picked sizes within a factor of 2 of this.
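
The "~200 chunk files" claim is easy to sanity-check from this section's own rule-of-thumb numbers (all back-of-envelope values, not measurements):

```python
# chunk_file_sizing.py — back-of-envelope check of the "~200 files" claim,
# using this section's rule-of-thumb numbers (not measured values)
FILE_SIZE = 16 * 1024 * 1024   # 16 MB per chunks_head/NNNNNN file
AVG_CHUNK_BYTES = 200          # production rule of thumb per sealed chunk

chunks_per_file = FILE_SIZE // AVG_CHUNK_BYTES
series = 5_000_000
recent_chunks_per_series = 4   # ~2 h worth of sealed 30-min chunks per series
files = series * recent_chunks_per_series // chunks_per_file
print(f"{chunks_per_file:,} chunks/file -> ~{files} files at 5M series")
# 83,886 chunks/file -> ~238 files at 5M series
```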

When does a chunk get cut?

A chunk closes (the head stops appending to it and starts a new one) when any of these happens:

  1. Sample count hits 120. The hard ceiling. Once the chunk has 120 samples, the next sample goes into a new chunk.
  2. Time spans 2 hours. If the first sample in the chunk was scraped at 09:15:00 and the next sample arrives at 11:15:30, the chunk is cut even though it has fewer than 120 samples. This is what binds chunks to Prometheus's 2-hour block boundary.
  3. The series goes stale. If the target stops being scraped (returns 5xx, vanishes from service discovery), the head emits a stale-NaN sample and seals the chunk.
  4. Head closes a 2-hour block. Every 2 hours on the wall clock, the head freezes its current sample range, seals every open chunk in every series simultaneously, and writes a persistent block.

The most common cut is the 120-sample one. At a 15-second scrape interval, 120 samples is exactly 30 minutes of data — so a typical chunk on a typical Prometheus is "30 minutes of one series", regardless of the actual values. At a 60-second interval, 120 samples is exactly 2 hours, so the count and time bounds coincide; at anything slower, the time-based cut fires first.

# chunk_lifecycle.py — simulate the head's chunk-cutting logic on a real-ish stream
# stdlib only; no third-party packages needed
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Chunk:
    first_ts_ms: int
    last_ts_ms: int = 0
    n_samples: int = 0
    sealed: bool = False
    bytes_estimated: int = 0  # populated by gorilla after sealing

@dataclass
class HeadSeries:
    labels: dict
    chunks: list = field(default_factory=list)
    open_chunk: Optional[Chunk] = None

    SAMPLES_PER_CHUNK = 120
    MAX_CHUNK_DURATION_MS = 2 * 60 * 60 * 1000  # 2 hours

    def append(self, ts_ms: int, value: float) -> str:
        # Out-of-order rejection (the default; out_of_order_time_window=0)
        if self.open_chunk and ts_ms < self.open_chunk.last_ts_ms:
            return "REJECTED_OOO"
        # Cut decision
        if self.open_chunk is None:
            self.open_chunk = Chunk(first_ts_ms=ts_ms, last_ts_ms=ts_ms, n_samples=1)
            return "NEW_CHUNK"
        c = self.open_chunk
        time_span = ts_ms - c.first_ts_ms
        if c.n_samples >= self.SAMPLES_PER_CHUNK or time_span >= self.MAX_CHUNK_DURATION_MS:
            # Seal current chunk, open new one
            c.sealed = True
            c.bytes_estimated = self._gorilla_estimate(c.n_samples)
            self.chunks.append(c)
            self.open_chunk = Chunk(first_ts_ms=ts_ms, last_ts_ms=ts_ms, n_samples=1)
            return "CUT_AND_NEW"
        c.n_samples += 1
        c.last_ts_ms = ts_ms
        return "APPENDED"

    def _gorilla_estimate(self, n: int) -> int:
        # Rough Gorilla cost: 16-byte header + ~1.3 bytes/sample
        return 16 + int(n * 1.3)

# Simulate a 4-hour stream at 15-second scrape interval
import random
series = HeadSeries(labels={"__name__": "checkout_latency_p99_ms", "service": "razorpay-payments-api"})
events = []
ts = 1714060800000  # 2024-04-25 12:00:00 UTC
for i in range(4 * 60 * 4):    # 4 hr * 60 min * 4 scrapes/min = 960 samples
    v = 80.0 + random.gauss(0, 4) + (10 if 200 <= i <= 220 else 0)  # blip near i=210
    res = series.append(ts, v)
    events.append((ts, res))
    ts += 15_000

cuts = [(t, r) for t, r in events if r in ("CUT_AND_NEW", "NEW_CHUNK")]
print(f"total samples: {len(events)}")
print(f"chunks created: {len(series.chunks) + (1 if series.open_chunk else 0)}")
print(f"avg chunk byte size: {sum(c.bytes_estimated for c in series.chunks) / max(1,len(series.chunks)):.0f}")
print(f"cut events: {len(cuts)}")
for t, r in cuts[:5]: print(f"  ts={t} {r}")
total samples: 960
chunks created: 8
avg chunk byte size: 172
cut events: 8
  ts=1714060800000 NEW_CHUNK
  ts=1714062600000 CUT_AND_NEW
  ts=1714064400000 CUT_AND_NEW
  ts=1714066200000 CUT_AND_NEW
  ts=1714068000000 CUT_AND_NEW

Per-line walkthrough. The line if self.open_chunk and ts_ms < self.open_chunk.last_ts_ms: return "REJECTED_OOO" is the out-of-order check. Every late sample fails this gate by default. Why this is so strict: Gorilla's delta-of-delta encoding requires monotonic timestamps — a negative dod outside the small encoding ranges falls into the 36-bit fallback, which still works, but the head's in-memory chunk also uses the previous timestamp as state for the next sample's delta. Allowing arbitrary out-of-order writes breaks the encoder mid-stream. The fix Prometheus 2.39 (2022) shipped — out_of_order_time_window: 1h in the storage config — solves this by routing late samples to a separate OOO chunk per series, which is encoded independently and merged at compaction time. Karan's IPL incident is exactly this: the CDN samples from pop="del-2" arrived 22 seconds late because of a network hiccup, the head had moved on, and without OOO enabled they were dropped.

The line if c.n_samples >= self.SAMPLES_PER_CHUNK or time_span >= self.MAX_CHUNK_DURATION_MS: is the cut decision — the OR is critical. Why both conditions are needed and not just one: a low-frequency metric (a daily backup duration scraped once an hour, say) would never hit 120 samples in 2 hours, and a high-frequency metric (an internal trace exemplar emitted at 1-second granularity) would hit 120 samples in 2 minutes. Without the time bound, slow series produce one giant chunk that takes the whole 2-hour block window to fill — and any query for "the last hour of data" would have to decode all of it. Without the count bound, fast series produce huge chunks that strain the bit-cursor and bloat the head's open-chunk RAM. Both bounds together keep chunk size in a useful band: a fast counter cuts every 30 min by sample count, a slow gauge cuts every 2 h by time, and queries at any rate decode predictably.

The line c.bytes_estimated = self._gorilla_estimate(c.n_samples) is where the chunk's on-wire size gets fixed once and for all. Why the estimate is so close to the truth on real data: a 120-sample Gorilla chunk has a fixed 16-byte header (first timestamp 64-bit, first value 64-bit) and a per-sample cost between 0.4 and 3 bytes depending on workload. Across a heterogeneous Prometheus fleet (counters, gauges, histograms, summaries), the long-run average converges to ~1.3 bytes/sample. So 16 + 120 * 1.3 = 172 bytes predicts the chunk size to within ±15% on most real series, which is why the production rule of thumb "Prometheus uses ~200 bytes per chunk" works.

How chunks live: the head, the WAL, and the 2-hour block

The chunk lifecycle is a state machine across three storage tiers. Each tier has a different durability and access pattern, and the chunk migrates between them on a fixed schedule.

[Figure: Chunk lifecycle — three storage tiers, fixed transitions. Tier 1 — open chunk (RAM): memSeries.headChunk in the head's seriesRef map, samples 0..119, appended on every scrape, not yet Gorilla-encoded, durability backed by the WAL, lost only on a crash before WAL fsync; duration 0–30 min, ~3 KB per series. Tier 2 — sealed chunk (mmap): chunkenc.XORChunk in a chunks_head/NNNNNN file, ~172 bytes, immutable, Gorilla-encoded, mmap'd into the process, read by the query path, still RAM-cached by the OS, referenced as (file_no, offset); duration 30 min–2 h, ~172 B per chunk on disk. Tier 3 — persistent block: data/<ulid>/chunks/000001 plus index, a 2-hour or compacted block, compactable, subject to retention, queried by block ULID, backed up offline; duration 2 h to retention. Transitions: a cut (120 samples or 2 hours) moves tier 1 → tier 2; the 2-hour head-block close flushes tier 2 → tier 3. WAL writes happen on every append in tier 1; replay on restart rebuilds tier 1 + tier 2 state from the WAL.]
Illustrative — derived from prometheus/tsdb/head.go and head_chunks.go, not measured. Each chunk lives in exactly one tier at a time. The transitions are deterministic: cut on 120 samples or 2 hours moves a chunk from tier 1 to tier 2; the 2-hour block close moves all tier-2 chunks to tier 3 atomically. The WAL is parallel to all three tiers — every append writes to the WAL synchronously, which is what survives a process kill mid-chunk and lets restart rebuild tiers 1 and 2 by replay.

The 2-hour block boundary is the most consequential number in Prometheus's storage system. Every block directory under data/ is named with a ULID — a sortable, time-prefixed, 26-character ID like 01J3K8H4WQX2NMVR7Y9PQTBHF5 — whose first 10 characters encode the time the block was created (the block's actual sample time range lives in its meta.json). Inside the block is a chunks/ subdirectory with one or more chunk files (this time named 000001, 000002... and not memory-mapped, just read on demand), an index file (the postings-list inverted index over labels — covered in Prometheus TSDB internals), a meta.json describing the block's time range and source, and a tombstones file for deletion markers.
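
For orientation, a meta.json for a freshly cut 2-hour block looks roughly like this. The values are illustrative; the field names follow the Prometheus block format, but inspect a real block on your own instance to confirm:

```json
{
  "ulid": "01J3K8H4WQX2NMVR7Y9PQTBHF5",
  "minTime": 1714060800000,
  "maxTime": 1714068000000,
  "stats": { "numSamples": 960, "numSeries": 1, "numChunks": 8 },
  "compaction": { "level": 1, "sources": ["01J3K8H4WQX2NMVR7Y9PQTBHF5"] },
  "version": 1
}
```

The compaction.sources list grows as blocks merge: a level-2 block lists the ULIDs of the 2-hour blocks it was built from, which is how the compactor avoids re-merging the same inputs twice.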

When a query asks for "the last hour", Prometheus computes which blocks overlap that range, opens their index files, looks up the matching seriesRef, finds the relevant chunk references, opens the appropriate chunk file, seeks to the offset, decodes the Gorilla bitstream, and returns the samples. The pipeline is: query → overlap check → index lookup → chunk seek → Gorilla decode. Every step is bounded; the dominant cost is the index lookup, not the chunk decode.
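
The first step of that pipeline, the overlap check, is plain interval arithmetic. A minimal sketch — the block boundaries here are made up, where the real TSDB reads them from each block's meta.json:

```python
# query_path.py — sketch of the block-selection (overlap) step described above
HOUR = 3_600_000  # ms

# A day of 2-hour blocks: (min_time_ms, max_time_ms) per block
blocks = [(start, start + 2 * HOUR) for start in range(0, 24 * HOUR, 2 * HOUR)]

def overlapping_blocks(query_min: int, query_max: int) -> list:
    # A block overlaps iff it is not entirely before or after the query range
    return [b for b in blocks if b[0] < query_max and b[1] > query_min]

# "The last hour" straddling a block boundary (21:30-22:30) touches 2 blocks;
# the same-length query inside one block (22:00-23:00) touches only 1
straddle = overlapping_blocks(21 * HOUR + 30 * 60_000, 22 * HOUR + 30 * 60_000)
inside = overlapping_blocks(22 * HOUR, 23 * HOUR)
print(len(straddle), len(inside))  # 2 1
```

Only the blocks that survive this filter pay the index-lookup cost, which is why a short recent-range query stays cheap even on a long-retention instance.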

Compaction — why blocks merge upward

A day of ingestion produces 12 two-hour blocks. Querying "the last 7 days" on this layout would require opening 84 blocks, doing 84 index lookups per series, and 84 file seeks. That is workable for a small Prometheus but ruinous at scale. So Prometheus runs a compactor in the background that merges blocks upward into bigger blocks — three 2-hour blocks become a 6-hour block, three 6-hour blocks become an 18-hour block, and so on up to a configurable maximum (default 10% of retention, capped at 31 days).

# Default compaction levels
Level 0:  2-hour blocks (head's natural output)
Level 1:  6-hour blocks (merge 3 × 2h)
Level 2:  18-hour blocks (merge 3 × 6h)
Level 3:  54-hour blocks (merge 3 × 18h)
...up to retention/10 cap
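
The level table can be generated mechanically: each level is 3× the previous, stopping at the retention/10 cap (31-day absolute maximum). A sketch, assuming a 60-day retention for the default case:

```python
# compaction_levels.py — generate the block-duration ladder shown above
def levels(base_hours: float = 2.0, factor: int = 3,
           retention_hours: float = 60 * 24) -> list:
    # Max block duration: 10% of retention, absolute cap of 31 days
    cap = min(retention_hours / 10, 31 * 24)
    out, d = [], base_hours
    while d <= cap:
        out.append(d)
        d *= factor
    return out

print(levels())                         # [2.0, 6.0, 18.0, 54.0]
print(levels(retention_hours=15 * 24))  # shorter retention caps the ladder earlier
```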

Compaction does three things to chunks: it re-encodes chunks that span the merge boundary (because two adjacent 2-hour blocks may have been mid-chunk for a series that survived the boundary), it dedupes samples (if the same (ts, series) exists in both source blocks), and it rebuilds the index from scratch so the new block has a single posting list per label. The sample-level work is what dominates: a 2-hour block has ~1B samples for a busy Prometheus, and each one must be decoded, sorted, possibly deduped, and re-encoded.

# compaction_cost.py — model the cost of a single compaction step
def compaction_cost(samples_per_block: int, blocks_to_merge: int, gorilla_decode_ns: int = 3, gorilla_encode_ns: int = 6):
    total_samples = samples_per_block * blocks_to_merge
    decode_seconds = total_samples * gorilla_decode_ns / 1e9
    encode_seconds = total_samples * gorilla_encode_ns / 1e9
    sort_seconds = total_samples * 50 / 1e9   # n log n approx, n ~ 1B
    io_seconds = (total_samples * 1.3) / (200 * 1024 * 1024)  # 200 MB/s sequential read+write
    print(f"merging {blocks_to_merge} blocks of {samples_per_block:,} samples each")
    print(f"  decode: {decode_seconds:.1f}s  encode: {encode_seconds:.1f}s")
    print(f"  sort:   {sort_seconds:.1f}s  I/O: {io_seconds:.1f}s")
    print(f"  total:  ~{decode_seconds + encode_seconds + sort_seconds + io_seconds:.1f}s")

# A typical Prometheus at Razorpay
compaction_cost(samples_per_block=1_000_000_000, blocks_to_merge=3)
merging 3 blocks of 1,000,000,000 samples each
  decode: 9.0s  encode: 18.0s
  sort:   150.0s  I/O: 18.6s
  total:  ~195.6s

Three minutes per compaction step, on a busy single-node Prometheus. Why this matters for capacity planning: compaction is a background job that competes with ingestion and queries for CPU and disk I/O. A Prometheus that ingests 240k samples/sec and runs a level-2 compaction (6h × 3 = 18h merge) for 195 seconds will see query latency spike for the duration. The default compaction config caps the parallel work at 1 compaction step at a time and stops compaction during the head-block-cut window — which is why you sometimes see "Prometheus is fine for 10 minutes then everything is slow for 3 minutes" on Razorpay-scale instances. The fix is not to optimise compaction; it is to reduce the per-block sample count by either dropping low-value labels (cardinality reduction) or sharding the Prometheus instance.

Real-world chunk failure modes

The chunk's lifecycle is the source of most "weird Prometheus behaviour" reports.

1. The out-of-order rejection storm. A network blip causes a target's scrape to take 22 seconds (instead of the usual 50ms). The next scrape happens normally. The 22-second-late sample arrives after the on-time one and gets rejected. At Hotstar during the IPL final, this caused 14k OOO rejections per second — recoverable, but a real data loss for the 22-second slice. The fix is out_of_order_time_window: 1h in storage.tsdb, which routes late samples to a separate OOO chunk and merges at compaction.
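
The config fragment, valid from Prometheus 2.39 onward — the 1-hour window is this incident's choice, not a universal default; size it to your worst observed lateness:

```yaml
# prometheus.yml
storage:
  tsdb:
    out_of_order_time_window: 1h
```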

2. The 120-sample exemplar discontinuity. Prometheus exemplars (trace IDs attached to histogram buckets) are stored separately from samples but referenced by chunk offset. A chunk cut at 120 samples means the exemplar attached to sample 119 lives in chunk N, while sample 120's exemplar lives in chunk N+1. A query that fetches "the trace for the slowest p99 in the last hour" must follow the chunk-spanning reference — not all client libraries do this correctly, leading to "missing trace ID" errors that look like a tracing bug but are actually a chunk-boundary bug.

3. The mmap'd chunk file growing past vm.max_map_count. The default Linux value is 65,530. A Prometheus that accumulates more than ~65k memory mappings (extreme — would need 5M+ active series running for days without restart) will fail to mmap new chunk files with cannot allocate memory. The fix is sysctl vm.max_map_count=262144 and a cron-job audit of chunk file counts. Cred's observability team hit this in 2022 on a forensic Prometheus instance retained for compliance — the fix was both vm.max_map_count and aggressive compaction.

4. The split-brain compaction. If two Prometheus instances share storage (HA pair, shared NFS — never do this, but it happens), they may both run compaction on the same 2-hour blocks, each producing a different output. The block ULID is randomly generated, so the two outputs do not collide on filename, but they overlap in time — and a query path that opens both will see duplicate samples. The Prometheus operator detects this with the prometheus_tsdb_blocks_loaded metric and a tombstones-mismatch check; the manual recovery is to delete one of the two compacted blocks and let the survivor's data win.

5. The chunk-CRC mismatch on read. Disk corruption (cosmic ray, bad SSD page) flips a bit inside a chunk's bitstream. The 4-byte CRC32 trailer catches it on read and returns an error to the query. Prometheus logs chunk CRC mismatch and the query returns partial data. The chunk is not automatically deleted — it sits there returning errors until compaction drops the bad block from the merge output. A fleet running on Bengaluru DC commodity disks at Cred sees ~1 such event per Prometheus per quarter; the operator runbook is to trigger an immediate compaction to surface the bad block.

6. The "missing samples after restart" gap. When Prometheus restarts, it replays the WAL to rebuild the head's open chunks. The replay is fast (~30s for 4 GB of WAL on an NVMe) but is not instantaneous, and during that window the /api/v1/query endpoint returns 503. A Kubernetes liveness probe with a 10s timeout will mark the pod unhealthy, the orchestrator will restart it, and the replay starts from scratch — a restart loop that looks like a Prometheus bug but is actually a probe-config mismatch. Zerodha's Kite SRE team hit this in 2023 after a routine version bump; the fix is initialDelaySeconds: 600 on the liveness probe, sized to the WAL replay budget. The chunks themselves are intact on disk; the head's in-memory chunk references are what is rebuilding.
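
The probe fix as a pod-spec fragment — /-/healthy and /-/ready are Prometheus's own health and readiness endpoints; the 600-second delay is this incident's WAL-replay budget, not a universal default:

```yaml
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 9090
  initialDelaySeconds: 600
readinessProbe:
  httpGet:
    path: /-/ready    # serves 503 until WAL replay completes
    port: 9090
```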

Configuration                        OOO rate     CRC mismatch    mmap pressure
Single-node, 240k samples/sec        0/sec        0/quarter       low
HA pair without OOO config           50/sec       0               low
Forensic instance, 30d retention     0/sec        ~1/quarter      high (near 65k maps)
Network-flaky DC (CDN edge POP)      600/sec      0               low

The middle two rows are the configurations that operators most often regret. The chunk lifecycle is not flexible — it expects monotonic timestamps, finite mmap'd files, and clean disk pages. Every deviation produces a specific failure mode that maps cleanly to a chunk-lifecycle event, which is why the diagnostic playbook for "weird Prometheus behaviour" always starts with ls /var/lib/prometheus/data/chunks_head/ | wc -l and prometheus_tsdb_out_of_order_samples_total.


Going deeper

Why 120 specifically — the cache-line argument

Prometheus's chunk-decode path walks the 120-sample bitstream through a chunkenc.Iterator that the query engine consumes. The decode is cache-friendly: a typical chunk decompresses to ~1.9 KB of (timestamp, value) pairs (120 samples × 16 bytes), which fits in 30 cache lines (64 bytes each) — small enough to stay in L2 (typically 256 KB) on a hot query, large enough that the per-chunk overhead (1 syscall to seek, 1 mmap fault, 1 decode setup) amortises across the samples.

If chunks were 60 samples instead of 120, the per-chunk overhead would double; a 1-hour query on a 15-second-interval series would touch 16 chunks instead of 8, and the syscall + cache-miss cost would visibly rise. If chunks were 480 samples, a chunk would span 2 hours at a 15-second interval, and a query for "the last 30 seconds" would have to decode the entire 2 hours' worth of samples to recover them. 120 is the unsexy middle: the answer to "what amortises a syscall and a Gorilla setup without forcing over-decode". The number was tuned empirically by Reinartz against Prometheus 2.0's query benchmarks and has not needed adjustment since.
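
The decoded-footprint arithmetic for the candidate chunk sizes, assuming 16 bytes per decoded sample and 64-byte cache lines:

```python
# decode_footprint.py — the cache-line arithmetic behind the 120-sample choice
CACHE_LINE = 64    # bytes
SAMPLE_BYTES = 16  # decoded: 8-byte timestamp + 8-byte float64 value

def footprint(n_samples: int):
    raw = n_samples * SAMPLE_BYTES
    lines = -(-raw // CACHE_LINE)  # ceiling division
    return raw, lines

for n in (60, 120, 480):
    raw, lines = footprint(n)
    print(f"{n:>3} samples -> {raw:>5} B decoded, {lines:>3} cache lines")
```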

The Cortex and Mimir teams kept 120 for the same reason. M3DB picked 240, on the grounds that their workload is mostly bulk-historical queries where decode amortisation matters more than per-query cache locality. Both choices are defensible; both teams hold their constants stable for the same compatibility-with-deployed-blocks reason that constrains all storage formats.

The chunk reference encoding — why 8 bytes

A chunk reference is (file_number: uint32, byte_offset: uint32) = 8 bytes. With 16 MB chunk files and ~80,000 chunks per file, a 32-bit offset is overkill (24 bits suffice) — but the alignment to 4 bytes makes the head's []chunkRef slice naturally aligned for SIMD scans, and the 8-byte total fits in half a cache line. Why this matters at high cardinality: a Prometheus with 5M active series, each with an average of 4 open or recent chunks, holds 20M chunk references = 160 MB of chunk-ref state in the head. If the encoding were 16 bytes (timestamp + ref, say) that doubles to 320 MB; the cache pressure during query path scans is real. The 8-byte encoding is a deliberate compactness optimisation, the kind of thing you only notice when memory becomes the bottleneck. The alternative — packing references into a perfect-hash structure — has been proposed in upstream issues but consistently rejected because the read-side complexity outweighs the per-ref byte savings.
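
The reference encoding is trivial to sketch. Field order and endianness here are illustrative, not byte-for-byte what the head stores:

```python
# chunk_ref.py — pack/unpack an 8-byte (file_number, byte_offset) reference
import struct

def pack_ref(file_no: int, offset: int) -> bytes:
    # 4-byte file number + 4-byte byte offset = 8-byte reference
    return struct.pack(">II", file_no, offset)

def unpack_ref(ref: bytes) -> tuple:
    return struct.unpack(">II", ref)

ref = pack_ref(123, 9_437_184)   # file 000123, offset 9 MiB into the file
print(len(ref), unpack_ref(ref))  # 8 (123, 9437184)

# The head-state arithmetic from the paragraph above:
series, refs_per_series = 5_000_000, 4
print(series * refs_per_series * len(ref) / 1e6, "MB of refs")  # 160.0 MB of refs
```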

Why mmap and not pread — the page-cache argument

The chunks_head files could be read via pread(2) instead of mmap. pread is simpler, has explicit cache control, and works portably across filesystems. Prometheus picked mmap for one reason: on a query path that reads 1 chunk out of ~80k in a file, mmap faults in only the 1 or 2 4 KB pages containing that chunk and lets the kernel page cache keep exactly the hot pages across queries; getting the same behaviour from pread would require a user-space chunk cache on top of the page cache, duplicating pages between the kernel and the heap.

The downside is that Prometheus does not get to evict chunks from cache when memory is tight — the kernel decides. On a memory-pressured node, the kernel may evict hot chunks while keeping cold ones, leading to query latency that the Prometheus process cannot directly diagnose. The mmap choice aligns with how Lucene, Tantivy, and most postings-list inverted-index systems work, and the operational tooling (vmtouch, mlock on chunk files) exists to override the kernel when needed. The trade-off is mature; the alternative was tried in early prototypes and abandoned.
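
The page-granularity point is observable with Python's own mmap. This uses a sparse stand-in file, not a real chunks_head file:

```python
# mmap_read.py — map a 16 MB stand-in file, touch only one "chunk" inside it
import mmap, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "000001")
with open(path, "wb") as f:
    f.truncate(16 * 1024 * 1024)   # sparse 16 MB file, like a chunks_head file
    f.seek(8_000_000)
    f.write(b"\x01\x78")           # pretend record start: enc=0x01, count=120

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Only the page(s) around offset 8,000,000 are faulted in by these reads;
    # the other ~4,000 pages of the mapping are never touched
    enc, count = mm[8_000_000], mm[8_000_001]
    mm.close()
os.remove(path)
print(enc, count)  # 1 120
```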

Reproducibility footer

# Reproduce on your laptop
docker run -d -p 9090:9090 -v $(pwd)/prom-data:/prometheus prom/prometheus:v2.51.0
python3 -m venv .venv && source .venv/bin/activate   # optional; the scripts are stdlib-only
python3 chunk_lifecycle.py
python3 compaction_cost.py
# Inspect real chunk files after letting Prometheus run for 30+ minutes
ls -la prom-data/chunks_head/
ls -la prom-data/wal/
hexdump -C prom-data/chunks_head/000001 | head -5      # see the magic 0x0130BC91
# Inspect a 2-hour block after >2 hours of running
ls prom-data/01*/
cat prom-data/01*/meta.json

Where this leads next

Chunks are the storage unit; the postings-list index is what makes querying chunks fast at high cardinality. A 2-hour block holds millions of chunks; a query that asks for "all checkout_latency_p99_ms{service='razorpay-payments-api'} chunks" needs to find the relevant chunk references in O(log N), not O(N). The next chapter (Promscale / TimescaleDB hypertables) provides the contrast: TimescaleDB stores time-series in PostgreSQL with B-tree indices and partitioned tables, achieving similar end-to-end performance via a completely different mechanism. Then comes downsampling — the orthogonal lever that cuts sample counts at long retention, whereas Gorilla compresses each sample it keeps.

The single insight of this chapter: the chunk is where the algorithm meets the file system. Gorilla is a clean algorithm in isolation; the chunk is what makes it work in a 240k-sample-per-second production Prometheus, with finite RAM, a kernel page cache that has its own ideas, a 2-hour block boundary that imposes deterministic flush events, and a compactor that re-encodes chunks weeks after they were first written. Every weird Prometheus behaviour Karan saw at 21:40 IST during the IPL final — out-of-order rejection, post-compaction CRC mismatch, mmap address-space pressure, query latency spikes during a compaction window — has its explanation in the chunk lifecycle. The chunk is the smallest unit you can think about and still be saying something true about what Prometheus is doing on your disk.

References