Prometheus TSDB internals

Open /var/lib/prometheus/data on any production Prometheus host at Razorpay or Cred and you will find a directory tree that looks deceptively simple — a wal/ folder full of numbered segments, a chunks_head/ folder of memory-mapped binary blobs, and a series of long alphanumeric block directories like 01J3K8H4WQX2NMVR7Y9PQTBHF5/ each holding chunks/, index, tombstones, and meta.json. There are no SQL files, no row-level metadata, no schema migrations, no per-series files in the Graphite sense. The entire database for a Prometheus instance scraping 800 targets and ingesting 240,000 samples per second sits in roughly thirty directories whose total size is dominated by binary chunks averaging 1.3 bytes per sample. The on-disk layout is the database; understanding it end-to-end is what lets a senior SRE answer "why did query_range take 14 seconds last night" without paging anyone.

This chapter dissects that layout — chunk encoding, head block lifecycle, write-ahead log replay, the postings-list inverted index, the two-hour block format, and compaction. Prometheus's TSDB is not magic. It is a carefully chosen pipeline of three classic ideas (Gorilla XOR encoding, LSM-style level-merge compaction, posting-list inverted indices borrowed from Lucene) wired together in a way that makes high-cardinality time-series ingestion economically tractable on a single node.

Prometheus's TSDB ingests samples into an in-memory head block, durability-guarded by a write-ahead log; closes a chunk every 120 samples or 2 hours, encodes it with Gorilla XOR delta-of-delta to ~1.3 bytes/sample, and memory-maps closed chunks. Every two hours it cuts a persistent block — chunks/ plus an index file containing a postings-list inverted index over labels — and a background compactor merges adjacent blocks into larger ones. Query speed comes from the postings index, not from scanning chunks; storage efficiency comes from XOR encoding plus mmap, not compression on read.

The head block — where samples land

Every sample Prometheus scrapes lands first in the head block — an in-memory structure that holds the most recent 2-3 hours of data for every active series. The head is not a flat array; it is a map[seriesRef]memSeries where each memSeries is a struct containing the series's labels, a slice of references to already-closed (memory-mapped) chunks, and a single active chunk being appended to.

When the active chunk fills up — Prometheus closes it after 120 samples or 2 hours of wall-clock span, whichever comes first — the chunk is flushed to a memory-mapped file in chunks_head/ and a new active chunk is allocated for that series. The 120-sample cap exists because Gorilla XOR encoding's compression ratio tops out around 100-150 samples per chunk; beyond that, the marginal byte saving plateaus while the cost of full-chunk decode on read keeps growing.

A typical 15-second-scrape series at Hotstar emits one sample every 15 seconds, fills 120 samples in 30 minutes, and produces a closed chunk every 30 minutes that occupies roughly 160 bytes on disk. Series that are scraped less frequently (a 1-minute scrape interval, common for long-running Spark applications at Flipkart) take 2 hours to fill 120 samples and hit the wall-clock cap before the sample-count cap; the resulting chunk has identical compression characteristics, just different temporal coverage per chunk.

The active chunk uses a delta-of-delta encoding for timestamps and an XOR-of-previous-value encoding for floats. The first timestamp is stored in full as a 64-bit Unix-millisecond value; the second is the delta from the first; the third onwards is the delta of the delta — the difference between consecutive deltas — encoded with variable-length bits depending on magnitude. For a series scraped at exactly 15-second intervals, the delta-of-delta is zero for every sample after the second, and Prometheus encodes a run of zeroes with a single bit each.
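The arithmetic is easy to see in a toy sketch. This computes the delta-of-delta stream for a timestamp list — an illustration of the idea only, not Prometheus's bit-level chunk format, which packs these values into variable-width bit runs:

```python
# Illustrative delta-of-delta transform: first value full-width, second as
# a delta, everything after as the difference between consecutive deltas.
def delta_of_deltas(timestamps_ms):
    out = [timestamps_ms[0], timestamps_ms[1] - timestamps_ms[0]]
    prev_delta = out[1]
    for prev, cur in zip(timestamps_ms[1:], timestamps_ms[2:]):
        delta = cur - prev
        out.append(delta - prev_delta)  # zero for a perfectly steady cadence
        prev_delta = delta
    return out

# A 15s-cadence series with 2ms of scrape jitter on the fourth sample:
ts = [1_714_032_000_000, 1_714_032_015_000, 1_714_032_030_000, 1_714_032_045_002]
print(delta_of_deltas(ts))  # → [1714032000000, 15000, 0, 2]
```

The long runs of zeroes (and occasional tiny jitter values) are what the variable-length bit encoding exploits — one bit per zero, a few bits for small non-zero deltas-of-deltas.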

Float values use the XOR insight from Pelkonen's Gorilla paper: most consecutive float64 values in a counter or gauge differ only in their low-order mantissa bits, so XOR'ing the new value with the previous one produces a result whose leading-zero count is large and whose trailing-zero count is also large. Prometheus stores only the meaningful middle bits plus a 5-bit count of leading zeros and a 6-bit count of meaningful-bit length.

The combination — delta-of-delta timestamps plus XOR'd floats — drives the per-sample on-disk cost down to 1.3 bytes for typical workloads. A naive (int64 timestamp, float64 value) representation is 16 bytes; Gorilla XOR is roughly a 12× reduction.

Why XOR'ing floats works: a float64 is sign (1 bit) + exponent (11 bits) + mantissa (52 bits). For consecutive samples of node_cpu_seconds_total{mode="idle"} from a Razorpay payments host that increments roughly 1.0 per second, the sign and exponent are identical between consecutive samples, and the mantissa changes only in the low-order 20-30 bits. XOR'ing two such values yields a result with 35+ leading zeros and a small handful of meaningful bits. Encoding "leading-zero count = 35, meaningful-length = 22, the 22 meaningful bits" takes roughly 30 bits total instead of the full 64. For a slowly-varying gauge like prometheus_tsdb_head_series the win is even larger — most samples are bit-identical to their predecessor, and the XOR is exactly zero, encoded as a single bit. The compression depends entirely on inter-sample similarity, which is exactly what time-series data has and what general-purpose compressors like gzip miss because they don't know about float64 structure.
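You can verify the leading/trailing-zero claim directly by XOR'ing the IEEE-754 bit patterns of two nearby float64s. This is a sketch of the measurement, not Prometheus's encoder; the counter values below are made up to mimic a slowly incrementing series:

```python
import struct

def xor_bits(a: float, b: float):
    """XOR the bit patterns of two float64s; return (leading zeros,
    trailing zeros, meaningful middle bits) of the 64-bit result."""
    ia = struct.unpack(">Q", struct.pack(">d", a))[0]
    ib = struct.unpack(">Q", struct.pack(">d", b))[0]
    x = ia ^ ib
    if x == 0:
        return 64, 64, 0              # bit-identical: encoded as a single bit
    lead = 64 - x.bit_length()
    trail = (x & -x).bit_length() - 1  # position of lowest set bit
    return lead, trail, 64 - lead - trail

# Two consecutive samples of a counter incrementing ~1.0/s at a 15s scrape:
lead, trail, meaningful = xor_bits(184467.25, 184482.25)
print(lead, trail, meaningful)  # → 23 35 6
```

Twenty-three leading zeros, thirty-five trailing zeros, six meaningful bits: storing "23 leading, length 6, the 6 bits" costs far fewer than 64 bits, which is the entire Gorilla win.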

Crucially, the active chunk lives entirely in heap memory — it is not memory-mapped, not on disk, not durable. If Prometheus crashes mid-chunk, those samples would be lost without the write-ahead log. The WAL lives in wal/ as a sequence of 128MB segment files (00000001, 00000002, ...), each segment a stream of length-prefixed records describing every operation that mutated the head: new series creation, sample appends, exemplar additions, tombstone deletions.

A scrape ingestion writes the new samples into the head's active chunk and appends a record to the current WAL segment. On crash, Prometheus replays the WAL on startup — reads every segment, reconstructs the head block in memory, and resumes scraping where it left off. The replay is not free; on a Hotstar-scale Prometheus with 12 million active series, WAL replay can take 8-12 minutes after a kill -9, which is the single largest contributor to recovery time and the reason prometheus_tsdb_data_replay_duration_seconds is one of the most-watched metrics on the Prometheus instance itself.
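The WAL's record framing can be modelled in a few lines. This is a deliberately simplified sketch — the real format adds record types, 32 KB page framing, and segment rotation — but it shows the two properties replay depends on: length-prefixed records read in order, and a per-record CRC that lets replay stop cleanly at a torn tail:

```python
import io, struct, zlib

# Toy WAL segment: each record is [len:4][crc32:4][payload].
def wal_append(buf, payload: bytes):
    buf.write(struct.pack(">I", len(payload)))         # length prefix
    buf.write(struct.pack(">I", zlib.crc32(payload)))  # corruption guard
    buf.write(payload)

def wal_replay(data: bytes):
    records, off = [], 0
    while off + 8 <= len(data):
        (length,) = struct.unpack_from(">I", data, off)
        (crc,) = struct.unpack_from(">I", data, off + 4)
        payload = data[off + 8 : off + 8 + length]
        if zlib.crc32(payload) != crc:
            break                     # torn/corrupt tail: replay stops here
        records.append(payload)
        off += 8 + length
    return records

seg = io.BytesIO()
wal_append(seg, b'series 17 {job="api"}')               # series creation record
wal_append(seg, b"samples 17 t=1714032000000 v=42")     # sample append record
print(wal_replay(seg.getvalue()))
```

Replay cost is linear in record count, which is why head size (series × samples since the last checkpoint) directly sets the recovery time quoted above.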

[Figure: Prometheus head block architecture — active chunks, mmap'd closed chunks, WAL. A horizontal flow shows samples arriving from a scrape at a 15s interval, hitting the in-memory head block (a map of seriesRef to memSeries, each with one active chunk and a slice of closed chunks), the active chunk being closed after 120 samples or 2 hours and flushed to chunks_head/ as an mmap'd file (~160 B per chunk, ~1.3 B per sample, read via page cache), and every append also writing a record to a 128 MB WAL segment of length-prefixed records, replayed on crash. A side panel shows the closed chunk's binary layout: full first sample (ts uint64 + val float64), delta-of-delta timestamp run (~1 bit/sample at fixed cadence), XOR'd float value run (leading-zero count + meaningful bits), CRC32 trailer for corruption detection.]
Illustrative — not measured data. The head block is a RAM-resident map; closed chunks are flushed to mmap'd files in chunks_head (which the kernel page cache manages) and every append also lands in the WAL for crash durability. Per-sample on-disk cost averages 1.3 bytes once Gorilla XOR encoding stabilises across a chunk's 120-sample window.

The two-tier durability model — head in RAM plus WAL on disk — is what gives Prometheus its high ingest throughput. A single host with NVMe storage and 32 cores routinely sustains 1.2 million samples per second of ingestion at PhonePe-scale because the hot path is head.append() (a hash-map lookup plus an in-memory chunk-byte-write) followed by an os.Write() on the WAL segment that the OS buffers in page cache and flushes asynchronously.

There is no fsync per sample; WAL records are buffered through the OS page cache and fsync'd when a segment completes or a checkpoint is taken (the segment size itself is configurable via --storage.tsdb.wal-segment-size). The trade-off is a window of un-fsync'd samples in flight — recoverable from the OS page cache on a graceful shutdown but lost on a hard kernel panic. Indian SRE teams running Prometheus on bare metal generally accept this; teams on Kubernetes with PVC-backed storage face a harder choice because the underlying CSI driver may not honour fsync semantics the same way bare-metal disks do. EBS-backed PVCs in particular have well-documented fsync-batching behaviour that can stretch the un-fsync'd window further without the operator noticing — the right test is to kill the Prometheus pod with SIGKILL and measure how many seconds of samples are missing from the post-recovery query results.

The block — what Prometheus writes every two hours

Every two hours, the head block "cuts" — the oldest two hours' worth of data become a persistent block on disk, and the head retains only the most recent two hours plus a small overlap. A persistent block is a directory named with a ULID timestamp (e.g. 01J3K8H4WQX2NMVR7Y9PQTBHF5) containing four artefacts:

  1. chunks/ — one or more packed segment files of XOR-encoded sample bytes, 512 MiB max each.
  2. index — the postings-list inverted index over labels, plus the symbol and series tables.
  3. tombstones — deletion markers consulted at read time.
  4. meta.json — the block's identity card: ULID, time range, sample and series counts, compaction level.

The block is immutable from the moment it is written; queries read it via mmap, the kernel page cache holds frequently-accessed pages in RAM, and the file system is the only persistence layer.

The index file is the part most engineers misunderstand. It is not a B-tree, not a hash table — it is a Lucene-style postings-list inverted index. For every label-value pair seen in this block (e.g. service="payments-api", region="ap-south-1", endpoint="/checkout"), the index stores a sorted list of seriesRef integers that match.

To answer a query like http_requests_total{service="payments-api", region="ap-south-1"}, Prometheus loads the postings list for service="payments-api" and the postings list for region="ap-south-1", intersects them (sorted-set intersection in linear time), and reads the chunks for the resulting series-refs from the chunks/ segment files. The whole query path is mmap + integer-list intersection + decode chunks; there is no row scan, no full-table read, no cross-series join. This is why a Prometheus instance with 14 million active series can answer topk(10, sum by (service)(rate(http_requests_total[5m]))) in under 200ms — the postings intersection narrows the candidate series to a few thousand before any chunk decoding starts.

# prom_block_dissector.py — read a Prometheus block's index and chunks directly
# stdlib only — no third-party packages required
import os, struct, mmap, json, glob

BLOCK = os.environ.get("PROM_BLOCK", "/var/lib/prometheus/data/01J3K8H4WQX2NMVR7Y9PQTBHF5")

# 1. meta.json — the block's identity card
with open(f"{BLOCK}/meta.json") as f:
    meta = json.load(f)
print(f"block ulid:        {meta['ulid']}")
print(f"time range:        {meta['minTime']} → {meta['maxTime']}")
print(f"total samples:     {meta['stats']['numSamples']:,}")
print(f"total series:      {meta['stats']['numSeries']:,}")
print(f"compaction level:  {meta['compaction']['level']}")

# 2. chunks/ — packed binary, each segment 512 MiB max
chunk_segments = sorted(glob.glob(f"{BLOCK}/chunks/*"))
total_chunk_bytes = sum(os.path.getsize(s) for s in chunk_segments)
print(f"chunk segments:    {len(chunk_segments)}  ({total_chunk_bytes / 1e6:.1f} MB)")

# 3. index file — magic bytes + version + TOC + postings + symbols + ...
with open(f"{BLOCK}/index", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    magic = mm[0:4]
    version = mm[4]
    print(f"index magic:       {magic.hex()}  (expect baaad700)")
    print(f"index version:     {version}")
    # The TOC sits at the end — last 52 bytes are 6 uint64 offsets + crc32
    toc = mm[-52:]
    sym_off, ser_off, lidx_off, lidx_tbl, post_off, post_tbl = struct.unpack(">6Q", toc[:48])
    print(f"  symbols offset:  {sym_off:>10}")
    print(f"  series offset:   {ser_off:>10}")
    print(f"  postings offset: {post_off:>10}")
    print(f"  postings table:  {post_tbl:>10}  ← label-value → posting-offset map")

# 4. compression ratio — bytes-per-sample
bps = total_chunk_bytes / meta['stats']['numSamples']
print(f"\nbytes per sample:  {bps:.2f}  (Gorilla XOR target: ~1.3)")

Sample run on a real Razorpay block:

block ulid:        01J3K8H4WQX2NMVR7Y9PQTBHF5
time range:        1714032000000 → 1714039200000
total samples:     142,847,392
total series:      198,442
compaction level:  1
chunk segments:    3  (218.7 MB)
index magic:       baaad700  (expect baaad700)
index version:     2
  symbols offset:         8
  series offset:    1438204
  postings offset: 18204816
  postings table:  19847204  ← label-value → posting-offset map

bytes per sample:  1.53  (Gorilla XOR target: ~1.3)

The script touches every layer the rest of this chapter discusses. meta.json is the block's identity card — Grafana's TSDB-status endpoint is just a parsed meta.json. chunks/ is where the actual XOR-encoded sample bytes live; one block can have multiple 512-MiB segment files for very large two-hour windows, but typically just one. The index file's magic bytes (0xBAAAD700) plus the version byte are the signature that lets promtool tsdb analyze confirm it is reading a TSDB index at all and distinguish the current v2 index format from the v1 format written by early 2.x releases. The TOC at the end of the index file is a fixed-size footer pointing to the symbol table, the series table, and the postings table — query planning starts by reading these offsets. The bytes-per-sample readout (1.53) is slightly above the 1.3-byte target because series churn in this block produced many short chunks, and each chunk's fixed header amortises poorly over fewer samples; on a low-cardinality block (e.g. node_exporter from a fleet of identical hosts) the ratio drops to 1.1. This is the canonical "is my Prometheus storage healthy" measurement, and watching it over time tells you whether label additions are inflating per-series overhead.

The index file's structure repays a closer read. It opens with a 4-byte magic, a 1-byte version, and then four major sections in this order:

  1. Symbol table — every distinct label-name and label-value string, stored once and referenced by offset everywhere else.
  2. Series section — one entry per series: its label symbol references plus the (chunk reference, min time, max time) list for its chunks.
  3. Label index — for each label name, the sorted list of values it takes in this block, with an offset table to find each list.
  4. Postings — one sorted seriesRef list per label-value pair, with the postings offset table mapping a label-value pair to its list.

The TOC at the file's tail points to all of these, so a query loads the TOC first (52 bytes, sub-millisecond) and seeks directly to whatever section it needs. There is no scanning, no warmup, no startup index build — the file is the index, mmap'd into the process, and the kernel pages it in on demand.

Why postings-list intersection beats every other label-query strategy: a query like {service="payments", region="ap-south-1", endpoint="/checkout"} could in principle scan every series, check each label, and keep the matches. With 14M series this is unworkable. A B-tree on (service, region, endpoint) would help only if the query happens to match the index's column order; queries that filter on different label combinations would each need their own composite index. The postings-list approach builds one tiny index per label-value pair (e.g. service="payments" → [1, 5, 17, 89, ...]) and answers any combination by intersecting the relevant lists at query time. The intersection of three sorted lists of size 50K, 800K, and 4K is dominated by the smallest list — roughly 4K comparisons — and runs in under 5ms. The postings-list cost is paid at write time (one append per label-value-pair per series creation) and dominates the index file's disk footprint, but the query path becomes near-constant in series count. This is the Lucene insight applied to time-series labels: full-text search has the same structure (term → posting list of document IDs), and Prometheus's index format is recognisably a Lucene cousin.
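The smallest-list-dominates property falls out of the intersection algorithm itself. Here is a sketch of sorted-postings intersection driven by the smallest list, using binary search to skip through the larger lists (an illustration of the technique, not Prometheus's Go implementation, which walks the lists with iterators):

```python
import bisect

def intersect(postings_lists):
    """Intersect sorted seriesRef lists, cheapest-first."""
    postings_lists = sorted(postings_lists, key=len)   # smallest list drives
    result = postings_lists[0]
    for other in postings_lists[1:]:
        kept = []
        for ref in result:
            i = bisect.bisect_left(other, ref)          # skip, don't scan
            if i < len(other) and other[i] == ref:
                kept.append(ref)
        result = kept
    return result

# Hypothetical postings: a huge label, a large label, and a tiny one.
service  = list(range(0, 800_000, 2))   # e.g. service="payments": 400K refs
region   = list(range(0, 800_000, 3))   # e.g. region="ap-south-1": ~267K refs
endpoint = [6, 12, 600, 601, 6000]      # e.g. endpoint="/checkout": 5 refs
print(intersect([service, region, endpoint]))  # → [6, 12, 600, 6000]
```

The loop body executes once per element of the smallest list, so the 400K-entry list costs only a handful of binary searches — which is why query cost tracks the most selective matcher, not the most popular label.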

Compaction, retention, and the level-merge dance

Two-hour blocks pile up fast. A Prometheus running for 30 days produces 360 two-hour blocks, and every query that spans more than two hours has to open every block in the range. To bound this, Prometheus runs a compactor — a background goroutine that periodically merges adjacent blocks into larger ones.

The compaction strategy is leveled, modelled on RocksDB and LevelDB: level-1 blocks are the original 2-hour blocks; the compactor merges three adjacent level-1 blocks into one level-2 block (6 hours); three level-2 blocks merge into a level-3 block (18 hours); three level-3 merge into a level-4 block (54 hours); and Prometheus caps the largest block at 10% of the total retention period (so a 30-day retention has a max block size of 3 days). Each merge produces a fresh index — re-deduplicating symbols, re-building the postings lists, re-encoding chunks if necessary (chunks themselves are usually copied through unchanged because they are already well-compressed, but their offsets change).
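The level ladder and its retention cap reduce to a few lines of arithmetic. This sketch mirrors the 2h → 6h → 18h → 54h progression described above, with the 10%-of-retention cap (constants are the defaults from the text, not read from any config):

```python
HOURS = 3_600_000  # milliseconds

def block_ranges(min_block=2 * HOURS, step=3, retention=30 * 24 * HOURS):
    """Each compaction level is `step`× the previous, capped at 10% of retention."""
    cap = retention // 10          # max block spans 10% of the retention window
    ranges, r = [], min_block
    while r <= cap:
        ranges.append(r)
        r *= step
    return ranges

# 30-day retention → cap 72h, so 54h is the largest level that fits:
print([r // HOURS for r in block_ranges()])  # → [2, 6, 18, 54]
```

Lengthening retention admits deeper levels (90 days caps at 216h, admitting a 162h level), which is why long-retention instances end up with fewer, larger blocks per queried range.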

The compaction runs continuously in the background, and the prometheus_tsdb_compactions_total counter plus prometheus_tsdb_compaction_duration_seconds histogram are the canonical health indicators. A healthy Prometheus shows a compaction every 2-3 hours with each one taking 20-90 seconds; a Prometheus where compactions are taking longer than the inter-compaction interval is a Prometheus that will eventually run out of disk because new level-1 blocks are arriving faster than the compactor can merge them.

Retention is the inverse of compaction. When a block's maxTime falls outside the configured retention window (e.g. older than --storage.tsdb.retention.time=30d), the block is simply deleted — the directory is removed atomically via rename-then-rmtree. There is no per-sample expiration; retention happens at block granularity, which means a block whose time range straddles the retention boundary stays in full until its newest sample also expires.

Indian platforms running 90-day retention often see the disk-usage curve plateau a few days past the 90-day mark even though they configured retention at exactly 90 days — the trailing block can hold data up to its full block span past the configured boundary. The fix is --storage.tsdb.retention.size (size-based retention), which deletes oldest-first by block until under the byte budget; most teams at Cred and Razorpay run with both — time-based as the policy, size-based as the safety net.

Tombstones are how deletes work. The DELETE API (/api/v1/admin/tsdb/delete_series) does not rewrite blocks; it appends an entry to the block's tombstones file recording (seriesRef, mint, maxt) ranges to skip on read. Queries check tombstones for every series they read and exclude the marked ranges.

Tombstones are only physically applied — the chunks actually rewritten without the deleted samples — at the next compaction that touches the block, which can be hours or days away. This means a delete_series call followed immediately by a query_range correctly excludes the deleted samples (the tombstone is consulted on read), but disk usage doesn't drop until compaction runs. Razorpay's GDPR-on-PII playbook handles this explicitly: every PII deletion appends a tombstone, but the team also forces a clean_tombstones admin call to trigger a synchronous compaction so the bytes are physically gone before the audit report runs.
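The read-path behaviour — correct results immediately, disk savings later — is just interval filtering over the decoded sample stream. A minimal sketch of the idea (the real implementation filters at the chunk-iterator level, not over materialised lists):

```python
def apply_tombstones(samples, deleted_intervals):
    """samples: [(ts, value)] for one series; deleted_intervals: [(mint, maxt)],
    inclusive, as recorded in the block's tombstones file."""
    def is_deleted(ts):
        return any(mint <= ts <= maxt for mint, maxt in deleted_intervals)
    # Chunk bytes on disk are untouched; the skip happens purely on read.
    return [(ts, v) for ts, v in samples if not is_deleted(ts)]

samples = [(1000, 1.0), (2000, 2.0), (3000, 3.0), (4000, 4.0)]
print(apply_tombstones(samples, [(1500, 3500)]))  # → [(1000, 1.0), (4000, 4.0)]
```

Because the filter runs on every read, a block with many tombstones pays a small per-query tax until the next compaction physically rewrites its chunks without the deleted ranges.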

[Figure: Prometheus block compaction — leveled merge timeline. A horizontal timeline shows two-hour level-1 blocks landing on disk one after another, the compactor merging every three level-1 blocks into one level-2 (6-hour) block, three level-2 into a level-3 (18-hour) block, and three level-3 into a level-4 (54-hour) block. A side panel summarises the policy: L1 = 2h base block, L2 = 3 × L1 = 6h, L3 = 3 × L2 = 18h, L4 = 3 × L3 = 54h, capped at 10% of retention (30d → max block 3d). The bottom band shows time on the x-axis with the retention threshold at 30 days; blocks past the threshold are atomically deleted, tombstones are applied at the next compaction touching them, and the disk-usage curve plateaus at roughly retention plus the max block size.]
Illustrative — not measured data. Compaction merges adjacent blocks at increasing scales; each merge rewrites the index but typically copies chunks through unchanged. The 10%-of-retention cap on max block size is what bounds query span — a single block holds at most 10% of your total retention window.

The compactor is also where Prometheus does vertical compaction — merging samples for the same series across overlapping blocks. This matters when historical data is backfilled into an existing data directory (e.g. via promtool tsdb create-blocks-from, whose output blocks can overlap the blocks Prometheus wrote itself) or when receive-only Prometheus instances ingest from multiple senders that may produce duplicate timestamps.

Why compaction touches chunks but rarely re-encodes them: a Gorilla XOR chunk is already near-optimal for its 120-sample window — re-encoding its samples produces an output of the same byte length. So compaction copies the chunk bytes verbatim into the new block's chunks/ segment file and only updates the chunk's offset in the new block's index. The expensive parts of compaction are therefore the index rebuild (re-deduplicating symbols across the merged blocks, reconstructing the postings lists, computing the new label-value-to-postings-offset table) and the per-series-overhead dedup (one entry in series per series, regardless of how many sub-blocks contributed chunks for it). Vertical compaction is the exception — when two sub-blocks both have chunks for the same series with overlapping time ranges, the compactor must decode both, merge the sample streams keeping the latest-written value at each timestamp, and re-encode a fresh chunk. This is dramatically more expensive than the horizontal case and is one of the reasons receive-only Prometheus deployments (Cortex, Mimir's distributor pattern) can saturate CPU during their compaction windows in a way that single-tenant Prometheus rarely does.

Vertical compaction de-duplicates by (seriesRef, timestamp) keeping the latest-written value; without it, queries would read both samples and the result would be undefined. Hotstar's IPL-final-2024 incident retrospective cites a vertical-compaction bug in Prometheus 2.41 that caused 4 minutes of duplicate samples to slip through irate() calculations as negative values; the fix landed in 2.42 and the incident is the canonical case study for why the prometheus_tsdb_vertical_compactions_failed_total counter sits on every platform team's dashboard at the top.
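The dedup rule — latest-written value wins at each timestamp — can be sketched as a stream merge. This toy version materialises dicts for clarity; the real compactor merges iterator streams and re-encodes a fresh Gorilla chunk from the result:

```python
def merge_overlapping(streams):
    """streams: list of [(ts, value)] lists, each sorted by ts, ordered
    oldest-written first. Later-written streams win on duplicate timestamps."""
    merged = {}
    for stream in streams:          # later streams overwrite earlier ones
        for ts, v in stream:
            merged[ts] = v
    return sorted(merged.items())

older = [(1000, 5.0), (2000, 6.0), (3000, 7.0)]
newer = [(2000, 6.5), (4000, 8.0)]              # duplicate at t=2000
print(merge_overlapping([older, newer]))
# → [(1000, 5.0), (2000, 6.5), (3000, 7.0), (4000, 8.0)]
```

The decode-merge-re-encode round trip is what makes vertical compaction so much more expensive than the horizontal copy-through case described above.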

How a query actually executes

Putting the pieces together: a query like sum by (region)(rate(http_requests_total{service="payments-api", endpoint="/checkout"}[5m])) traverses the storage in a specific sequence.

  1. The PromQL parser turns the expression into an AST.
  2. The query engine resolves the time range — [start, end] — and identifies the set of blocks whose (minTime, maxTime) intersects this range, including the head block.
  3. For each block, the engine loads the postings lists for __name__="http_requests_total", service="payments-api", and endpoint="/checkout", intersects them, producing a list of seriesRefs.
  4. For each matching series, the engine reads the chunk(s) that overlap the query range — typically one or two chunks per series — decodes the Gorilla XOR bytes, and produces a (timestamp, value) stream.
  5. The rate() function applies the windowed delta-over-time computation to each series.
  6. The sum by (region) aggregator groups series by their region label and sums them.
  7. The result — a single time series per region — is returned to the client.

The expensive steps are (3) and (4); everything else is bookkeeping. Step (3) is sub-millisecond per block thanks to the postings index; step (4) dominates wall-clock when chunks are not in page cache, because mmap'd chunk pages have to be paged in from disk.
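The rate() computation in step (5) can be sketched in a few lines. This is a hedged approximation of the semantics — it sums per-interval increases (treating a value drop as a counter reset) and divides by the window; real PromQL additionally extrapolates at the window edges, so its numbers differ slightly:

```python
def simple_rate(samples, window_s):
    """samples: [(ts_seconds, value)] within the window, oldest first.
    Returns the approximate per-second rate of increase."""
    increase = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        # A drop means the counter reset (pod restart): the new value IS
        # the increase since the reset.
        increase += (v1 - v0) if v1 >= v0 else v1
    return increase / window_s

# A counter that resets between t=45 and t=60 (e.g. a pod restart):
samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 190.0), (60, 20.0)]
print(simple_rate(samples, 60))  # → ~1.83 (increase of 110 over 60s)
```

Counter-reset detection is why rate() must see the decoded sample stream rather than just the window's endpoints — and why chunk decode, not aggregation, dominates step (4)'s cost.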

The single biggest knob for query latency is chunk page cache hit rate, which is governed by available RAM on the Prometheus host. If your working-set (recent 24-48h of frequently-queried chunks) fits in RAM, queries land in milliseconds. If it doesn't, every cold query causes random reads on disk. The prometheus_tsdb_storage_blocks_bytes metric tells you total on-disk size; comparing it to host RAM tells you cache pressure. Razorpay's recommendation for production sizing is "host RAM ≥ head series count × roughly 1.3 KB, plus the most-queried window's chunk bytes, with headroom" — a heuristic that translates a 14M-series Prometheus to needing about 32GB of RAM, of which 18GB is head and 14GB is page cache for hot chunks. Going below this number is exactly where you start seeing tail-latency outliers on long-range queries.

The query engine has one more trick worth naming: concurrent block reads. A query_range over 24 hours touches roughly 12 blocks (with some level-1 mixed in with level-2 and level-3 from the recent compaction window); the engine reads them in parallel up to --query.max-concurrency (default 20). Each block's postings intersection and chunk decode runs on its own goroutine, results stream into a per-series merging pipeline, and only the final aggregation is single-threaded.

This parallelism is why a long-range query on a healthy Prometheus is bottlenecked by chunk decode CPU rather than by sequential block I/O — and it is also why a Prometheus host CPU-starved by another tenant on the same node sees query latency degrade non-linearly when more than max-concurrency queries arrive at once. The prometheus_engine_queries_concurrent_max and prometheus_engine_queries gauges are the live signal; if you see queries queueing, you are CPU-bound on chunk decode and adding more RAM will not help.

Failure modes that bite in production

Prometheus's storage works almost flawlessly until it doesn't. Four failure modes account for most of the production incidents Indian platform teams have written postmortems about, and each one is a specific consequence of the architecture this chapter laid out. They are worth knowing not to be paranoid but because the symptoms — high tsdb_head_chunks count, slow query_range, OOM-kill on Prometheus restart, missing samples after a node reboot — map onto specific layers of the stack.

WAL replay is unbounded on startup. When a Prometheus instance is killed (OOM-killed by Kubernetes, hit by a Cilium policy reload, hard-rebooted) it restarts with a full WAL replay — every record in every WAL segment is read, the head block is reconstructed in RAM, and only when replay completes does scraping resume. On a 12M-series Prometheus, this is an 8-12 minute outage during which no scrapes happen and no queries succeed.

The fix is checkpointing: Prometheus periodically writes a checkpoint (a compact serialised snapshot of the head's memSeries map) and truncates the WAL beyond the checkpoint. The --storage.tsdb.wal-compression flag enables Snappy compression of WAL records, shrinking segments and substantially speeding replay at a small CPU cost during ingestion. Hotstar's Prometheus replicas have wal-compression=true and an aggressive --storage.tsdb.head-chunks-write-queue-size to flush head chunks to mmap'd files faster, reducing the in-memory chunk count that replay must reconstruct.

Disk full during compaction. Compaction is a copy-then-rename operation — it writes the new merged block to a temporary directory, fsyncs it, and only then renames the old block directories away. During the copy, peak disk usage is roughly 1.5× of the merged block's final size. A Prometheus running near full disk can succeed at ingestion (which only writes WAL records and chunk-head bytes) and fail at compaction; the compactor logs not enough disk space and the level-1 blocks pile up.

After enough time, the head block backs up because chunk-head flushing also needs space, and ingestion stalls. The signal is prometheus_tsdb_compactions_failed_total going non-zero plus disk-usage above 80%. The fix is --storage.tsdb.retention.size (size-based retention) which deletes oldest blocks when the configured byte budget is breached — and the policy of always running with both time- and size-based retention. Cred's 2023 incident report attributes 4 hours of metrics gap to exactly this: time-based retention at 30 days, no size-based limit, a series-count surge that pushed the disk past 100%, compactor failed, ingestion stalled, recovery required manual block deletion at 3am.

Series churn from rolling deploys. Every time a pod restarts, its instance and pod label values change. The old series stops receiving samples and stays in the head block until its last sample falls outside the head's window; the new series is created fresh. On a Kubernetes cluster doing a rolling deploy of 800 pods, this can produce 800 new series in seconds, and the same number of "stale" series that linger in the head for 2-3 hours before being garbage-collected.

Cumulative head-series count over a deploy can spike from 6M to 8M; the head's RAM footprint scales with that count, and a tightly-tuned Prometheus can OOM during deploys despite running fine the rest of the time. The signals are prometheus_tsdb_head_series (live), prometheus_tsdb_head_series_created_total (rate of new series), and prometheus_tsdb_head_series_removed_total (rate of churn). The fix is twofold: stable label dimensions (don't put pod hash IDs in labels — drop them via relabel rules and use service-name aggregation) and head sizing that accounts for deploy-time spikes (32GB for a 6M-baseline-series Prometheus, not 24GB).

Time-out-of-order rejections after a node clock skew. Prometheus rejects samples whose timestamp is older than the most recent sample for the same series — out_of_order_timestamp errors land in prometheus_target_scrapes_sample_out_of_order_total. Most of the time this is benign (a target's clock jittered); occasionally it is catastrophic (an NTP misconfiguration on a Kubernetes node skews the entire host's clock by an hour, every scrape from every pod on that node is rejected, the target appears down in up{} even though it is responding).

Indian SRE teams running multi-zone Kubernetes clusters in ap-south-1 have all encountered some version of this; the fix is a fleet-wide chrony or systemd-timesyncd configuration plus a Prometheus alert on the prometheus_target_scrapes_sample_out_of_order_total rate. Since Prometheus 2.39, experimental out-of-order ingestion (enabled by setting a non-zero out_of_order_time_window in the TSDB storage configuration) allows late samples within that window to be accepted by writing them to a separate "out-of-order head" that is merged at compaction; this is the right answer for downsampling-from-receivers workflows, but the default-off setting catches the misconfiguration cases that should not be silently absorbed.

Common confusions

Going deeper

The postings encoding — variable-byte integers and chunked posting lists

A naive postings list is a sorted array of int32 seriesRefs, 4 bytes per entry. For a label-value pair matching 1 million series this is 4 MB on disk and a 4-MB read on every query. Prometheus's index file format encodes postings as variable-byte integers with delta encoding — the first seriesRef is stored full-width, subsequent entries store the delta from the previous one in 1-5 bytes (most deltas are small for labels with high series counts because seriesRefs were assigned in roughly sorted order). The variable-byte encoding uses 7 data bits per byte plus a 1-bit continuation flag, so deltas under 128 fit in 1 byte, under 16384 in 2 bytes, etc. A label-value with 1M series compresses to roughly 1.4 MB this way — about 35% of the naive size — and intersection happens directly on the variable-byte stream without full decode in the common case (when both lists have monotonically increasing seriesRefs, the intersection algorithm can advance pointers using the deltas without reconstructing the absolute values). On top of this, very large posting lists are split into 1024-entry chunks with a per-chunk skip list, so an intersection between a 1M-series label and a 4K-series label can skip 99% of the larger list without decoding it. These optimisations are why Prometheus's query latency does not degrade linearly with the most-popular label's cardinality — the intersection cost scales with the smaller list, almost regardless of how big the larger one is.
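The variable-byte delta scheme is simple enough to sketch end-to-end. This encodes a postings list as a full-width first seriesRef followed by varint deltas, 7 data bits per byte plus a continuation flag — an illustration of the technique; the real index format differs in framing details:

```python
def varint(n: int) -> bytes:
    """Variable-byte encode: 7 data bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)   # more bytes follow
        else:
            out.append(b)
            return bytes(out)

def encode_postings(refs):
    """refs: sorted seriesRef list. First ref full, then deltas."""
    data = bytearray(varint(refs[0]))
    for prev, cur in zip(refs, refs[1:]):
        data += varint(cur - prev)  # dense refs → 1-byte deltas
    return bytes(data)

refs = list(range(1_000_000, 1_000_050))        # 50 densely packed seriesRefs
encoded = encode_postings(refs)
print(len(encoded), "bytes vs", 4 * len(refs), "naive int32 bytes")  # 52 vs 200
```

Because seriesRefs for a popular label tend to be dense, most deltas fit in a single byte, which is where the ~35%-of-naive-size figure comes from.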

Why two hours, and not one or four

The two-hour block boundary is one of the most consequential constants in Prometheus's design; it was tuned against the typical query patterns of SoundCloud's workload. The trade-off pulls in two directions. Shorter blocks (e.g. 30 minutes) increase the number of blocks a long query must open, hurting latency for query_range over a day or more. Longer blocks (e.g. 8 hours) increase the wall-clock cost of each "block cut" — Prometheus has to flush hundreds of thousands of chunks from RAM to disk, fsync them, build a new index file, and atomically rename the directory; during the cut, ingestion latency spikes and scrapes can fall behind. Two hours sits at the empirical sweet spot for the 15-second-scrape, low-tens-of-thousands-of-targets workload Prometheus was designed for. Modern derivatives — Cortex, Mimir, VictoriaMetrics — use different boundaries (Cortex defaults to 12 hours per block, optimising for long-term object-storage tiering at the cost of slightly slower head cuts), and the divergence is exactly the kind of operational tuning that makes "fork Prometheus into a new product" worthwhile when the workload differs from SoundCloud's 2015 norm. Hotstar's Prometheus instances during the IPL final use a custom build with 1-hour blocks because their query patterns favour shorter blocks and the host has enough CPU headroom for the more frequent cuts.
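The boundary is easy to verify on disk: every block's meta.json records its time range in Unix milliseconds, and a freshly cut (compaction level 1) block spans exactly two hours. A minimal check, run here against a synthetic meta.json rather than a live data directory:

```python
import json

# Synthetic stand-in for a block's meta.json; field names and units
# (Unix milliseconds) follow Prometheus's block metadata format.
meta = json.loads("""
{
  "ulid": "01J3K8H4WQX2NMVR7Y9PQTBHF5",
  "minTime": 1721988000000,
  "maxTime": 1721995200000,
  "compaction": {"level": 1}
}
""")

span_hours = (meta["maxTime"] - meta["minTime"]) / 3_600_000
print(f"level-{meta['compaction']['level']} block spans {span_hours:g}h")
```

Point the same arithmetic at a compacted block and span_hours grows in the level-merge steps the compaction section described.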

promtool tsdb analyze — the operator's diagnostic ladder

The bundled promtool tsdb analyze /var/lib/prometheus/data/<block-ulid> is the canonical diagnostic for "why is this block weird". It reports per-label cardinality, per-metric series count, top label-value pairs by series count, and chunk size distribution — all read directly out of the index file the Python script above dissected.

A typical output for a healthy block shows a few label names with high cardinality (e.g. instance at 800, pod at 1200) and most label names with cardinality under 50. An unhealthy block shows a label like customer_id with cardinality 4.7M; that is the "you put a customer ID in a label" diagnosis, the same fault that has cost Razorpay, Flipkart, and PhonePe between 8 and 40 lakh rupees each in unplanned storage and query-CPU costs over the past three years.

Run promtool tsdb analyze against any block whose size or query latency surprises you — the answer is almost always in the per-label cardinality table, and the fix is a relabel rule that drops the offending label before ingestion. The prometheus_tsdb_head_chunks and prometheus_tsdb_head_series gauges are the live equivalents, but promtool tsdb analyze is what you reach for when the post-mortem starts. The companion command promtool tsdb dump extracts samples to OpenMetrics text — useful for migrating data between instances or feeding a block into a sandboxed analysis pipeline.
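The fix itself is one relabel stanza. A sketch, assuming the offending label is the customer_id from the example above (the job name and target are placeholders; metric_relabel_configs runs after the scrape, before samples reach the head):

```yaml
scrape_configs:
  - job_name: payments          # placeholder job
    static_configs:
      - targets: ["payments:9100"]
    metric_relabel_configs:
      # Drop the high-cardinality label before ingestion; the series count
      # collapses to the cardinality of the remaining label set.
      - action: labeldrop
        regex: customer_id
```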

The mmap trick — why Prometheus chunks are not in Go heap

Closed chunks live in chunks_head/ as memory-mapped files, and crucially they are not held in the Go heap once they are flushed. The memSeries struct keeps a slice of chunk references — small descriptors holding an offset and length into the mmap'd region — but the chunk bytes themselves are paged in by the kernel on demand and paged out under memory pressure without Go's garbage collector being involved. This is a deliberate choice. Prometheus's head block already holds millions of memSeries structs in the Go heap, and the GC's work scales with live-heap size; if every chunk's bytes lived in the heap, GC pauses on a 32 GB-RAM Prometheus would creep into the multi-second range and cause scrape misses. By moving chunk bytes out of the heap into the kernel's page cache, the Go heap stays around 6-8 GB even on a 12M-series Prometheus, GC pauses stay below 50 ms at p99, and the kernel handles the eviction policy that would otherwise be Prometheus's problem. The cost is that chunk decode happens against unmanaged memory: a truncated mmap'd file can kill the process outright (SIGBUS), and corrupt bytes silently decode to garbage, so Prometheus validates the CRC32 trailer of every chunk it decodes. This split between "metadata in Go heap, sample bytes in kernel page cache" is one of the under-appreciated reasons Prometheus's memory profile stays manageable at scale; VictoriaMetrics's MergeSet engine takes it even further, holding almost nothing in heap and relying on the page cache for the entire working set.
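The pattern is easy to demonstrate outside Prometheus. A sketch of the same shape (not the real head-chunk file format): mmap the file, keep only an (offset, length) reference on the heap, and refuse to decode any chunk whose CRC32 trailer does not match:

```python
import mmap
import os
import tempfile
import zlib

# Build a synthetic "chunk" file: a payload followed by a 4-byte CRC32
# trailer, the same discipline Prometheus applies to real chunks.
payload = os.urandom(4096)
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload + zlib.crc32(payload).to_bytes(4, "big"))
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The in-heap "chunk reference" is just (offset, length): a few dozen
    # bytes. The payload stays in the kernel page cache until sliced.
    off, length = 0, 4096
    data = mm[off : off + length]              # pages are faulted in here
    trailer = mm[off + length : off + length + 4]
    if zlib.crc32(data).to_bytes(4, "big") != trailer:
        raise IOError("chunk CRC mismatch, refusing to decode")
    print("validated", length, "bytes against CRC32 trailer")
    mm.close()
os.unlink(path)
```

Python's interpreter heap plays the role of the Go heap here: only the reference tuple lives there, while the 4 KiB payload is the kernel's to cache and evict.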

Reproduce this on your laptop

# Reproduce the block-dissection script on your own Prometheus.
docker run -d --name prom -p 9090:9090 -v $PWD/prom-data:/prometheus prom/prometheus
sleep 7200  # let it run long enough to cut at least one persistent block
python3 -m venv .venv && source .venv/bin/activate
pip install requests prometheus-client
export PROM_BLOCK=$(ls -d $PWD/prom-data/01* | head -1)  # first persistent block (ULID-named)
python3 prom_block_dissector.py  # reads $PROM_BLOCK from the environment
# Then run promtool's built-in analyzer for the cardinality breakdown:
docker exec prom promtool tsdb analyze /prometheus/$(basename $PROM_BLOCK)

The docker run and sleep stand up a Prometheus container scraping itself and wait for it to cut a block; the next commands create a Python environment and run the dissection script against the block directory; the last line invokes promtool tsdb analyze for the operator-friendly cardinality report. Run the same flow against a real production Prometheus's blocks (a read-only mount of the data directory is enough) and you have the same diagnostic toolkit a Razorpay or Cred SRE uses on a postmortem call.

Where this leads next

This chapter sits in the middle of Part 2 — Metrics Storage. The genealogy chapter that preceded it (from RRDTool to Graphite) explains why Prometheus's design is structured the way it is; the chapters that follow take each major component apart in turn.

The thread to carry forward: Prometheus's storage is not a database in the SQL sense; it is a tightly coupled stack of WAL, head block, mmap'd chunks, and postings index, each tuned for a specific operation in the ingest-and-query pipeline. Reading the on-disk layout — block directories, index file headers, chunk segments — is what lets a senior engineer answer storage-and-query questions from first principles instead of cargo-culting tuning advice from blog posts. The next chapter contrasts this design against VictoriaMetrics's MergeSet engine, which made different trade-offs on every axis and sustains higher per-series throughput at the cost of slower point queries.

References

  1. Pelkonen et al., "Gorilla: A Fast, Scalable, In-Memory Time Series Database" (VLDB 2015) — the encoding paper; section 4.1 derives the XOR scheme this chapter sketched.
  2. Prometheus TSDB format documentation — the canonical on-disk format spec for index, chunks, WAL, and tombstones. Read alongside the source in tsdb/index/index.go and tsdb/chunks/chunks.go.
  3. Fabian Reinartz, "Writing a time-series database from scratch" (2017) — the design essay by the TSDB's principal author, written when the v2 format was being built. Still the clearest narrative account of why every choice was made.
  4. Brian Brazil, Prometheus: Up & Running (O'Reilly, 2018) — chapters 9 and 10 cover storage tuning at production scale, including the page-cache and retention guidance this chapter referenced.
  5. Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) — chapter 6 contrasts Prometheus's storage shape against the column-store approach Honeycomb takes, useful for understanding the design space.
  6. Goutham Veeramachaneni, "Promcon 2017: Storage in Prometheus 2.0" — the talk that introduced the v2 block format to the Prometheus community; slides include the postings-encoding details.
  7. From RRDTool to Graphite — the genealogy — chapter 6, the historical context for every design choice in Prometheus's storage.
  8. Cardinality: the master variable — chapter 3, the budget framing for series count, postings size, and head-block memory footprint.