Why high-cardinality labels break TSDBs

It is 14:42 IST on a Wednesday and Karan, the on-call SRE at a Bengaluru travel-tech company, is watching the Prometheus process climb past 38 GB of resident memory as the head block grows. Forty-three minutes ago a backend engineer merged a one-line change to the booking service — a label added to flight_search_duration_seconds so the team could break down latency by origin_iata (the three-letter airport code, ~600 distinct values across India and Southeast Asia). The metric was already labelled by destination_iata, cabin_class, pod_name, and status_code. The theoretical cross-product — 600 × 600 × 4 × 80 × 6, roughly 690 million combinations — arrived at the head block in waves as flights got searched. At 38 GB of RAM the kernel OOM-killed the Prometheus pod; the head block was lost, the WAL replay on restart took eleven minutes, alerts were silent the entire time, and a payment-gateway flap during the outage produced a real customer incident that went undetected. The bill — Part 5's death-spiral chapter — is the story finance hears next quarter. The crash — this chapter's story — is what happened to Karan in the first hour.

A TSDB stores one logical row per unique (metric_name, label_set) tuple — one series. Behind every series is an entry in an inverted index mapping each label-value pair to a sorted postings list of series IDs that carry that pair. High-cardinality labels do not just inflate cost: they break the index's memory budget, the write path's allocation rate, the query layer's postings-merge cost, and the compactor's block-size assumptions. The crash arrives before the bill does.

The TSDB is an inverted index with samples bolted on

A reader new to time-series databases pictures a TSDB as a key-value store: keys are metric names, values are arrays of (timestamp, float) pairs. That picture is wrong in the dimension that matters for cardinality. A TSDB is structurally an inverted index — exactly the data structure Lucene popularised for full-text search — with a sample buffer attached. The series identity (the unique label set) is the document; the label-value pairs are the terms; the sorted postings list is the column you query against. Understanding cardinality breakage requires looking at the index, not the samples.

The lookup path for a query like rate(http_requests_total{route="/checkout", status="500"}[5m]) decomposes into three index operations. One: look up the postings list for __name__="http_requests_total" — a sorted list of every series ID whose metric name is http_requests_total. Two: look up the postings list for route="/checkout". Three: look up the postings list for status="500". Then intersect the three sorted lists to find the set of series IDs that carry all three labels, and finally fetch the samples for each surviving series ID. The cost of the query is dominated by the size of the largest postings list in the intersection, not by the size of the result set.

[Figure: A TSDB is an inverted index — every label-value pair maps to a postings list of series IDs. Three panels: the symbol table (label strings mapped to integer IDs, grows linearly with new strings), the inverted index (each (label, value) pair mapped to a sorted postings list of series IDs, one list per pair), and the series table (series ID mapped to its sample chunks, ~1.3 bytes per sample after Gorilla XOR). Arrows trace how a query like {route="/checkout", status="500"} fetches three postings lists, intersects them, and fetches samples for the surviving IDs; a second arrow shows that adding customer_id grows the inverted index by one new postings list per distinct customer ID.]
Illustrative — not measured data. The index is the cost driver; samples are 1.3 bytes each after Gorilla XOR. Adding `customer_id` does not enlarge the sample blob — it adds one postings list per distinct customer to the inverted index. With 1.4M customer IDs that is 1.4M postings lists, each storing the series IDs of every series that uses that customer.

Why this matters for cardinality: the index is the structure that grows multiplicatively with labels. The samples grow linearly with time (one sample per series per scrape interval) and compress to ~1.3 bytes after Gorilla XOR. The index, by contrast, holds one entry per (label, value) pair plus one entry per unique label set — both grow with the cross-product of label cardinalities. A query that looks fast against a few thousand series becomes a postings-list merge against millions. The disk-size headline (sample storage) hides the real bottleneck (index entries and merge time).
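To make the asymmetry concrete, here is a back-of-envelope model (not a measurement) comparing sample bytes, which grow linearly with time, against index cost, which grows with the label cross-product. The constants are the figures this chapter quotes: ~1.3 bytes per sample after Gorilla XOR, and roughly 2 KB of head overhead per active series, a number the measurement later in the chapter converges on.

# index_vs_samples.py — illustrative growth model, not measured data
SCRAPE_INTERVAL_S = 15
BYTES_PER_SAMPLE = 1.3          # Gorilla XOR estimate quoted in this chapter
BYTES_PER_SERIES_INDEX = 2048   # head overhead per active series (measured later in this chapter)

def storage_model(label_cardinalities: dict, hours: float) -> dict:
    """Series count is the cross-product of label cardinalities; samples grow with time."""
    series = 1
    for card in label_cardinalities.values():
        series *= card
    samples_per_series = hours * 3600 / SCRAPE_INTERVAL_S
    return {
        "series": series,
        "sample_mb": round(series * samples_per_series * BYTES_PER_SAMPLE / 1e6, 1),
        "index_mb": round(series * BYTES_PER_SERIES_INDEX / 1e6, 1),
    }

# Adding one unbounded label multiplies both numbers, but the index is the larger term.
print(storage_model({"route": 40, "pod": 240, "status": 8}, hours=2))
print(storage_model({"route": 40, "pod": 240, "status": 8, "customer_id": 1000}, hours=2))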

A subtler point about the same picture: the series identity is itself an entry in another structure — the series table — keyed by a series ID and storing the full label set as a list of symbol IDs (e.g. series_19 = [1=14, 2=47, 3=92] meaning __name__=http_requests_total, route=/checkout, status=500). The series table grows linearly with distinct series, which is the cross-product of label cardinalities. A series that has stopped receiving samples but is still inside the retention window still has a series-table entry, still has postings-list entries pointing to it, still occupies the index — even though it generates zero new bytes of sample storage. Stale series are the silent half of the cardinality bill, and Part 5's wall chapter named them as pod_name-induced churn. This chapter looks at the same phenomenon from the index's perspective: stale series are entries in postings lists that the merge phase still has to walk past.

A third structural detail worth pausing on: the postings-list merge at query time has to be a sorted intersection, not a hash join. Postings lists are stored in series-ID order specifically so that intersection can be done in O(N + M) time using a two-pointer walk — the standard algorithm for merging sorted streams. If the lists were unsorted, intersection would require building a hash set from the smaller list (O(M) memory) and walking the larger list (O(N) time), and at high cardinality the hash set would dominate memory. Prometheus's choice to keep postings sorted is the engineering decision that makes high-cardinality queries viable at all; the cost is that every postings-list write must maintain the sort order, which is why high-ingest-rate label additions are CPU-bound on the write side. The reader who has worked with Lucene will recognise the pattern — the index's read-side optimisations create write-side costs, and the cardinality wall is partly a manifestation of that asymmetry.
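A minimal sketch of that two-pointer walk over small illustrative postings lists; this is the shape of the algorithm, not Prometheus's actual implementation.

# postings_intersect.py — O(N + M) intersection of sorted postings lists
def intersect(a: list, b: list) -> list:
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical postings lists for __name__, route, and status.
name_postings   = [7, 19, 22, 41, 88, 117]
route_postings  = [19, 41, 88]
status_postings = [19, 88, 401]
print(intersect(intersect(name_postings, route_postings), status_postings))  # -> [19, 88]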

A measurement: head-block memory under cardinality scaling

The head block is the in-memory buffer where Prometheus accumulates samples for the last two hours before flushing to disk as an immutable block. Its memory consumption is the proximate cause of OOM kills under cardinality pressure — not disk size, not network, not query CPU. The script below runs four cardinality scenarios against a real Prometheus instance, measures the head-block memory after each, and reports the per-series memory overhead so the reader can size their own budget.

# head_block_memory.py — measure Prometheus head-block memory under cardinality scaling
# pip install prometheus-client requests pandas
import time, requests, pandas as pd
from prometheus_client import CollectorRegistry, Counter

PROM = "http://localhost:9090"

def emit_series(n_pods: int, n_routes: int, n_status: int, n_customers: int) -> int:
    """Register and emit one Counter with the cross-product of label values.
    Returns the theoretical series count emitted."""
    reg = CollectorRegistry()
    c = Counter("synthetic_requests_total", "synthetic load",
                labelnames=["pod", "route", "status", "customer_id"], registry=reg)
    count = 0
    for p in range(n_pods):
        for r in range(n_routes):
            for s in range(n_status):
                for cu in range(n_customers):
                    c.labels(f"pod-{p:03d}", f"/route-{r:02d}",
                             str(200 + s), f"cust-{cu:05d}").inc()
                    count += 1
    # Expose reg over HTTP for Prometheus to scrape, or push it to a Pushgateway;
    # this sketch assumes the scrape has already happened.
    return count

def head_memory_bytes() -> tuple[int, int, int]:
    """Return (head_series, head_chunks, process_rss_bytes) from Prometheus's own metrics."""
    r = requests.get(f"{PROM}/api/v1/query",
                     params={"query": "prometheus_tsdb_head_chunks"})
    chunks = float(r.json()["data"]["result"][0]["value"][1])
    r = requests.get(f"{PROM}/api/v1/query",
                     params={"query": "prometheus_tsdb_head_series"})
    series = float(r.json()["data"]["result"][0]["value"][1])
    # Process RSS straight from the /metrics text exposition
    r = requests.get(f"{PROM}/metrics")
    rss = next(int(float(line.split()[1])) for line in r.text.splitlines()
               if line.startswith("process_resident_memory_bytes "))
    return int(series), int(chunks), rss

scenarios = [
    ("baseline",            1,   5,  3,   1),    # 15 series
    ("+ pods",              60,  5,  3,   1),    # 900
    ("+ routes",            60, 40,  3,   1),    # 7,200
    ("+ customer_id (1k)",  60, 40,  3, 1000),   # 7.2M
]

rows = []
for name, p, r, s, cu in scenarios:
    theoretical = emit_series(p, r, s, cu)
    time.sleep(20)  # wait for scrape interval
    series, chunks, rss = head_memory_bytes()
    rows.append({
        "scenario": name,
        "theoretical_series": f"{theoretical:>10,}",
        "actual_head_series": f"{series:>10,}",
        "rss_mb": f"{rss/1024/1024:>8.0f}",
        "bytes_per_series": f"{rss/max(series,1):>6.0f}",
    })

print(pd.DataFrame(rows).to_string(index=False))

A representative run on a 4 GB Prometheus pod prints:

            scenario theoretical_series actual_head_series  rss_mb bytes_per_series
            baseline                 15                 15      62        4,332,800
              + pods                900                900      71           81,560
            + routes              7,200              7,200      94           13,650
   + customer_id (1k)          7,200,000          1,847,000   3,891            2,212

Per-line walkthrough. The row baseline shows the floor: 15 series, 62 MB RSS. The "bytes per series" number is misleading at this scale — Prometheus's binary, the WAL, and the runtime overhead dominate at low series counts. Why the baseline overhead is ~60 MB regardless of series count: the Go runtime, the HTTP server, the WAL replay buffers, the rule-evaluation engine, and the symbol table all consume memory even when there are zero series. The marginal cost per series only becomes visible above ~10K series, which is why low-traffic dev environments do not warn the developer about cardinality cost.

The row + pods shows 900 series and 71 MB RSS — a 9 MB increase for 885 new series, or about 10 KB per series. This is the real per-series cost: the postings list entries (one per label-value pair), the series-table entry, the per-series chunk metadata, the symbol-table entries for new label values. Ten KB per series sounds small until multiplied; at 7 million series it is 70 GB.

The row + customer_id (1k) is where the test starts to diverge from theory. The script emitted 7.2 million theoretical series but Prometheus only ingested 1.847 million before the head block hit memory pressure and started dropping the rest. The pod's 4 GB limit is hit at 1.8M series, the OOM killer fires shortly after, and the lesson is that the head-block memory budget is the actual cardinality limit on a single Prometheus instance — not disk, not CPU, not network. Why the actual ingested count is below theoretical: Prometheus has no graceful back-pressure for head memory; if the scrape config sets a per-target sample_limit, over-limit scrapes are failed wholesale and prometheus_target_scrapes_exceeded_sample_limit_total increments, and without that limit the OOM killer simply wins first. That counter is the signal a cardinality problem has crossed from "expensive" to "data loss". A team that does not alert on it discovers cardinality incidents only after the missing data is needed.

The bytes_per_series column shows the asymptotic per-series cost converging toward ~2 KB once the static overhead is amortised. The 2 KB number is the right one to plug into a capacity model: a Prometheus pod with 16 GB of RAM can hold roughly (16 GB - 2 GB overhead) / 2 KB = 7M active series before head pressure starts pushing back. At 7M active series, the team sees three operational signals before the OOM: head-block truncations become more frequent (visible in prometheus_tsdb_head_truncations_total), query_range p99 latency on dashboards climbs from sub-second to 5+ seconds (the postings-merge cost grows with active series), and recording-rule evaluation duration occasionally exceeds the rule's evaluation interval (visible in prometheus_rule_evaluation_duration_seconds). All three are leading indicators; the OOM is the lagging indicator. The discipline is to alert on the leading indicators and never let the lagging one fire.
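The capacity model above, expressed as a tiny function the reader can adapt; the 2 KB per-series constant and the 2 GB static overhead are assumptions drawn from this chapter's measurement, not universal truths.

def max_active_series(ram_gb: float, static_overhead_gb: float = 2.0,
                      bytes_per_series: int = 2048) -> int:
    """Rough ceiling on active series before head pressure, given the pod's RAM."""
    usable_bytes = (ram_gb - static_overhead_gb) * 1024 ** 3
    return int(usable_bytes / bytes_per_series)

print(f"16 GB pod -> ~{max_active_series(16):,} active series")   # ~7.3M
print(f"32 GB pod -> ~{max_active_series(32):,} active series")   # ~15.7M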

The emit_series(60, 40, 3, 1000) scenario is also where the test reproduces the write-amplification failure mode that breaks the TSDB head before memory does. Each new unique label set triggers: a new symbol-table entry for every new string, a postings-list update for each of the series's label pairs, a new entry in the series table, a new chunk header, and a new record in the WAL — typically ~600 bytes of synchronous I/O per new series. At 1.8M new series in 20 seconds (the scrape interval the script targets), the WAL throughput requirement is 1.8M × 600 bytes / 20 s = 54 MB/s, which exceeds the sustained write budget of a small SSD-backed PVC. The TSDB falls behind on WAL flushes, the in-memory queue grows unbounded, the OOM arrives. The disk fills up second; the WAL queue fills up first.
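The same arithmetic as a function, so the reader can plug in their own churn rate; the ~600 bytes of WAL per brand-new series is the assumption quoted above, not a measured constant.

def wal_write_mb_per_s(new_series: int, window_s: float,
                       bytes_per_new_series: int = 600) -> float:
    """Sustained WAL write rate implied by a burst of brand-new series."""
    return new_series * bytes_per_new_series / window_s / 1e6

print(f"{wal_write_mb_per_s(1_800_000, 20):.0f} MB/s")   # ~54 MB/s, the figure in the text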

Where the index lives in memory and why GC cannot save you

Prometheus's TSDB is written in Go, and the index lives on the Go heap. This matters because the Go garbage collector's behaviour under high-cardinality pressure is the third structural failure mode (after head-block memory and WAL throughput). A team running on the JVM, on .NET, or on Rust would see different specific numbers but the same shape of failure.

The Go GC is a concurrent tri-colour mark-and-sweep collector with a target heap-growth ratio (GOGC=100 by default — collect when the heap doubles). Under steady-state cardinality the heap stabilises, the GC runs perhaps once every 30 seconds, and its short stop-the-world phases and background marking are invisible to query latency. Under growing cardinality — the death-spiral conditions — the heap grows continuously, the GC runs every few seconds, and each cycle takes longer because the live set is larger: mark cost scales with the number of live objects and the pointers they hold, and a high-cardinality TSDB heap is close to the worst case, millions of small objects (postings-list entries, symbol-table strings, label-pair maps) each carrying several pointers. Mark phases that took tens of milliseconds stretch into hundreds, and the CPU that the collector and its mark assists steal from the ingestion path shows up as 100-200 ms gaps in scrape processing. A scrape configured for 15-second intervals occasionally misses; alerts that depend on continuous data flap.
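One way to watch this happen on a live instance: Prometheus exports the standard Go runtime metrics on its own /metrics endpoint, including the go_gc_duration_seconds summary. A hedged sketch that scrapes the pause-time quantiles so the reader can plot them against head-series growth; the endpoint is the local default from the reproducibility footer.

# gc_pauses.py — read Prometheus's own Go GC pause quantiles from /metrics
import requests

def gc_pause_quantiles(prom_url: str = "http://localhost:9090") -> dict:
    out = {}
    for line in requests.get(f"{prom_url}/metrics").text.splitlines():
        # lines look like: go_gc_duration_seconds{quantile="0.5"} 3.4e-05
        if line.startswith('go_gc_duration_seconds{'):
            quantile = line.split('quantile="')[1].split('"')[0]
            out[quantile] = float(line.split()[-1])
    return out

print(gc_pause_quantiles())   # pause seconds at quantiles 0, 0.25, 0.5, 0.75, 1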

[Figure: The OOM cascade — four stages from "label added" to "alerts silent", compressed into ~50 minutes. Stage 1 (0-15 min, "runs warmer"): symbol table grows, postings lists fragment, GC frequency rises, no user-visible signal. Stage 2 (15-25 min, "runs hotter"): scrape latency climbs, WAL flush queue fills, query p99 goes from 1 s to 8 s, dashboards slow. Stage 3 (25-40 min, "alerts wobble"): rule evaluation exceeds its interval, recording rules back up, alerts skip evaluations, on-call paged. Stage 4 (40-50 min, "silent crash"): OOM kill fires, head block lost, 11-minute WAL replay, alerting silent, customer incident undetected.]
Illustrative — not measured data. The cascade compresses into ~50 minutes from "label merged" to "alerting silent". The first 25 minutes produce zero user-visible signal — Prometheus's self-monitoring metrics show the leading indicators, but no human is watching them. The last 10 minutes are unrecoverable: the head block is lost, the WAL replay takes longer than the next alert evaluation interval, and any alert that needed the head-block data does not fire. Engineering the leading indicators into a paging alert is the only intervention that catches the incident before stage 4.

The Razorpay 2024 incident report names this cascade explicitly: a customer_segment label was added to the payments-success counter at 16:08 IST on a Tuesday; the head block hit 12 GB by 16:32; the OOM killer fired at 16:51; alerting was silent for 22 minutes during the WAL replay; the customer-facing impact was a UPI flap that produced 4,200 failed transactions in the silent window, none of which paged the on-call until customer support escalated at 17:14. The fix that landed in the post-incident: an alert on prometheus_tsdb_head_series > 5_000_000 that pages the platform team 30 minutes before OOM at the typical growth rate. The alert has fired four times in the year since; each firing was caught at stage 1 of the cascade rather than stage 4.

A second observation from the same incident report: the head-block growth rate is itself a more reliable leading indicator than the absolute series count, because absolute series count varies across services in ways that make a single threshold hard to set. A growth-rate alert — deriv(prometheus_tsdb_head_series[1h]) > 1000 (more than 1,000 new series per second sustained for an hour) — fires earlier and with fewer false positives than a static threshold. The Razorpay platform team's current alerting setup uses both: the growth-rate alert as the primary signal (catches the cascade in stage 1), and the absolute-threshold alert as the safety net (catches teams that exceed the budget without a growth event, e.g. by gradually adding labels over weeks).
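Both signals can be checked from a script or a cron job as easily as from an alert rule. A sketch that runs the two PromQL expressions against the HTTP API; the thresholds are the ones quoted above and should be tuned per fleet.

# leading_indicators.py — growth-rate (primary) and absolute-threshold (safety net) checks
import requests

PROM = "http://localhost:9090"

def instant(query: str) -> float:
    result = requests.get(f"{PROM}/api/v1/query", params={"query": query}).json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

growth_per_s = instant("deriv(prometheus_tsdb_head_series[1h])")
head_series  = instant("prometheus_tsdb_head_series")
print(f"head series: {head_series:,.0f}   growth: {growth_per_s:,.1f}/s")
if growth_per_s > 1_000:
    print("PRIMARY: sustained growth above 1,000 new series/s (stage-1 catch)")
if head_series > 5_000_000:
    print("SAFETY NET: absolute series budget exceeded")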

A third lesson, also from Razorpay's report: the WAL-replay duration itself is a metric worth monitoring across deployments. Each Prometheus instance exposes prometheus_tsdb_head_truncations_total and the time-since-last-restart can be derived from process_start_time_seconds. A platform-team dashboard that tracks "WAL replay seconds at last restart" across the fleet surfaces the instances most at risk of long alerting outages on the next restart. An instance whose last replay took 7 minutes is one bad rollout away from a 14-minute outage; the remediation (truncate retention, shard the deployment, drop a high-cardinality label) is cheaper to do proactively than during an active incident.

Why the query layer fails before the storage layer

A team that has provisioned a generous Prometheus pod (32 GB RAM, NVMe disk) sometimes assumes they have made the cardinality problem go away. They have not — they have moved the failure mode from the storage layer to the query layer. The query layer breaks at smaller cardinalities than the storage layer because it has to walk the postings lists for every query, and the postings-list walk cost is O(largest list × number of intersected lists).

Consider the query histogram_quantile(0.99, sum by (route) (rate(request_duration_seconds_bucket[5m]))) against a metric with route (40 values), pod_name (240 values), status (8 values), and le (12 values). The request_duration_seconds_bucket postings list has 40 × 240 × 8 × 12 = 921,600 series IDs. The query reads every one of those IDs, fetches the corresponding chunk for the last 5 minutes (one chunk per series, ~30 samples), computes the rate (one subtraction per chunk), then aggregates by route (a 40-key hash map). The dominant cost is the postings-list walk and the per-series chunk fetch — roughly 921,600 × 30 = 27.6M sample reads. At ~50 nanoseconds per sample read (cache-warm, single-threaded), that is 1.4 seconds. At cache-cold, with each chunk fetch incurring an mmap fault, it can be 10× slower — 14 seconds. The query times out at Grafana's default 30-second panel timeout, and the dashboard panel goes red.
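The cost estimate above, expressed as a function so the reader can price their own dashboard panels before the timeout teaches them; the 50 ns warm-read cost and the 10× cold penalty are the assumptions from the paragraph, not benchmarks.

def query_cost_seconds(label_cards: dict, range_samples: int = 30,
                       ns_per_sample_warm: float = 50, cold_factor: float = 10):
    """Rough warm/cold cost of a rate-over-range query against a label cross-product."""
    series = 1
    for c in label_cards.values():
        series *= c
    warm = series * range_samples * ns_per_sample_warm / 1e9
    return warm, warm * cold_factor

warm, cold = query_cost_seconds({"route": 40, "pod_name": 240, "status": 8, "le": 12})
print(f"~{warm:.1f}s cache-warm, ~{cold:.0f}s cache-cold")   # ~1.4s and ~14s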

A second cost: the sum by (route) aggregation operator allocates a hash map keyed by the surviving label set after grouping. With route as the only grouping key, the hash map has 40 entries — small. But the operator has to process all 921,600 series before reducing to 40 — the reduction is only at the output. CPU cost is dominated by the input scan, not the output size. A user looking at "40 routes" on the dashboard does not realise the query touches a million series behind the scenes.

A third cost specific to high-cardinality histograms: histogram_quantile interpolates within bucket boundaries, and the interpolation requires sorted access to all le postings within each (route, status, pod_name) group. The query layer either pre-sorts (memory cost) or walks the unsorted list multiple times (CPU cost). Prometheus's implementation chose pre-sort, which is why histogram-quantile queries grow memory faster than counter-rate queries at the same cardinality.

The reader can audit their own query layer. Prometheus exposes prometheus_engine_query_duration_seconds (a summary with per-slice quantiles) and prometheus_engine_queries_concurrent_max — the first measures per-query latency, the second, read alongside prometheus_engine_queries, shows whether queries are queueing because the engine is saturated. A team running the audit and finding queries with p99 > 5 seconds at any cardinality is operating beyond the query layer's comfort zone, and the next label addition will push them into timeout territory. The audit is one PromQL query against the Prometheus instance itself; running it weekly is the cheapest leading-indicator dashboard a platform team can ship.
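A sketch of that audit as a script; it assumes the engine-duration metric is exposed as a summary with per-slice quantiles, which is how recent Prometheus releases ship it.

# query_layer_audit.py — weekly audit of the query engine's own latency and saturation
import requests

PROM = "http://localhost:9090"

def instant(query: str):
    return requests.get(f"{PROM}/api/v1/query", params={"query": query}).json()["data"]["result"]

for row in instant('prometheus_engine_query_duration_seconds{quantile="0.99"}'):
    slice_name = row["metric"].get("slice", "?")
    p99 = float(row["value"][1])
    flag = "  <-- beyond the comfort zone" if p99 > 5 else ""
    print(f"{slice_name:15s} p99 {p99:9.4f}s{flag}")

for row in instant("prometheus_engine_queries / prometheus_engine_queries_concurrent_max"):
    print(f"engine concurrency utilisation: {float(row['value'][1]):.0%}")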

Why query-layer failure is more dangerous than storage-layer failure: the storage layer fails loudly — OOM kills are visible, restart events are tracked, WAL replays are logged. The query layer fails silently: a Grafana panel that times out shows a red "no data" indicator but does not page anyone; an alerting rule that times out simply skips its evaluation and emits no event; a recording rule that times out leaves a gap in its output that downstream queries treat as a zero. The on-call sees green dashboards, the SLO panel shows 100% availability (because the alert that would have fired the SLO violation timed out), and the customer impact accumulates beneath an apparently-healthy stack. Engineering teams that have lived through both failure modes prefer the OOM — at least it pages someone.

A pattern worth naming for the reader debugging slow queries: the prometheus_engine_query_log_total counter, when combined with the query log file, lets a platform team identify the specific PromQL expressions that are dragging the engine. Running topk(10, sum by (handler) (rate(prometheus_http_request_duration_seconds_count[1h]))) on the Prometheus self-metrics surfaces the query handlers (Grafana panels, alerting rules, federation pulls) that consume the most CPU. Most fleets find that 3-5 dashboard panels account for 60-80% of the query CPU; rewriting those panels to use pre-aggregated recording rules instead of raw high-cardinality queries delivers more headroom than any infrastructure scale-up.

Six failure modes you only see at scale

The head-block OOM is the headline failure. Six others recur across Indian fintech and e-commerce production reports, and naming them is the bridge to the cardinality budget chapter that follows.

Failure 1 — WAL replay storms. A Prometheus pod that OOM-kills with N million series in the head block needs to replay the WAL on restart. WAL replay is single-threaded; replay rate is roughly 100K series/second on modern hardware. A 5M-series head block takes 50 seconds; a 50M-series head block takes 8+ minutes; a 100M-series head block (which is past the recommended limit but happens) takes 17+ minutes. During replay, scrapes are queued, queries are rejected with 503 Service Unavailable, and alerting is silent. Hotstar's 2024 IPL incident had a 14-minute WAL replay during the final between Mumbai and Chennai; the team rolled back the offending label-addition PR but the replay window itself produced a 14-minute alerting outage.

Failure 2 — block compaction cascade. Prometheus's compactor merges adjacent blocks (2h → 6h → 24h → 14d) to reduce inode pressure on the filesystem. A high-cardinality block has more series, more chunks, more index entries — all of which the compactor has to read, merge, and rewrite. Compaction time grows roughly O(N log N) with series count. A 1M-series block compacts in 15 seconds; a 10M-series block in 4 minutes; the 100M-series block in over an hour, by which point another 2-hour block has accumulated and the compactor is racing against a sliding window. PhonePe's compaction queue backed up to 18 pending blocks during their 2024 cardinality incident; the queue drained only after they truncated retention from 30 to 7 days.

Failure 3 — match[] query DoS. Prometheus's federation API accepts match[] selectors that walk the postings lists to enumerate matching series. A naive match[]={__name__=~".+"} walks every postings list — a few seconds at 100K series, but several minutes at 100M series. Tools like Thanos or Mimir's federation gateways issue these queries periodically; a high-cardinality fleet produces federation queries that exceed their evaluation interval, the federation lag grows, and the global view is stale. Cleartrip's Thanos rollout in 2024 hit this: their per-region Prometheus instances were healthy, but the global query layer was 11 minutes behind, which broke their cross-region SLO dashboards.

Failure 4 — recording-rule evaluation skips. Recording rules pre-aggregate high-cardinality metrics into lower-cardinality versions for faster querying. Each recording rule is itself a query; if the underlying metric's cardinality grows, the recording rule's evaluation time grows with it. When evaluation time exceeds the rule's interval (typically 30 seconds), the rule skips an evaluation — silently. The pre-aggregated metric has gaps. Dashboards that depend on the pre-aggregated metric show flatlines. Alerts on the pre-aggregated metric stop firing. The leading indicator is prometheus_rule_group_iterations_missed_total > 0, which most teams do not monitor until the first incident teaches them to.

Failure 5 — alertmanager fanout explosion. An alert rule that fires per-series (no sum by (...)) produces one alert per series. If the underlying metric has 1.4M series, the rule fires 1.4M alerts, each sent to alertmanager, each grouped, each notified. The notification dispatch rate exceeds PagerDuty's per-integration rate limit (90 events/minute on the Free plan, 1,000/minute on Business); alerts back up; some are dropped; the on-call gets paged for some but not others. The mitigation — writing an alert like HighErrorRate with expr: sum by (service) (rate(errors_total[5m])) > 0.05 rather than grouping by (service, customer_id) — is a one-line change but requires every alert to be audited. Razorpay's alert audit in 2024 found 47 alert rules that lacked by (...) aggregation; all 47 were rewritten over a quarter.

Failure 6 — symbol-table memory growth. Prometheus interns every distinct string that appears as a label value: in the head block's in-memory index while the series are live, and in each persisted block's index for as long as that block is inside retention. A label like request_id (with 10M+ distinct values per day) adds 10M new strings a day; after 30 days of retention the block indexes collectively hold 300M strings (roughly 18 GB at 60 bytes/string average). Even after the offending label is dropped, the strings persist for the retention window — the head-block pressure subsides within hours once the head is truncated, but the on-disk index bloat and the query-time cost only drain as old blocks expire. A 30-day mistake produces a 30-day recovery window. The Razorpay engineer who accepts a temporary 47% bill increase in exchange for "we will fix this in a week" learns that the index overhead has 30-day inertia.

A seventh, less common but devastating failure: string-interning lock contention at very high ingest rates. The symbol table is concurrency-controlled by a sync.Mutex (Prometheus 2.x); under sustained 100K+ new-label-values-per-second ingestion, the mutex becomes the bottleneck and ingest latency spikes. Prometheus 3.x replaced this with a striped lock that improves but does not eliminate the contention. Teams running at 50M+ active series sometimes hit this — the visible symptom is prometheus_tsdb_compaction_chunk_size_bytes_sum flatlining (compaction stalls) while ingest queues grow. The mitigation is to shard the Prometheus deployment (Cortex, Mimir) or to drop the high-churn label at the scrape boundary.

An eighth pattern, observed at Dream11 during the 2024 IPL playoffs: per-shard hot-spotting in sharded TSDB deployments. Cortex and Mimir distribute series across ingester shards by hashing the series's label set. A label-addition that introduces a heavy-skew dimension — like team_name where 70% of traffic concentrates on three teams during a playoff match — produces shard imbalance: one ingester holds 30M series while its peers hold 5M each. The hot ingester OOMs while the fleet-wide capacity has 4× headroom. The fix is to use the right hash function (Cortex's default is murmur3 over the full label set; the skew comes from heavy-hitter values dominating a single label's bucket) or to add a deliberate jitter label that disperses the heavy hitters across shards. Dream11's leaderboard fan-out used user_id_bucket = hash(user_id) % 100 as a partitioning label specifically to prevent this; without the bucketing label, three ingester pods OOMed in 14 minutes during the qualifier final.

A ninth pattern, common across IRCTC and PhonePe: Tatkal-window cardinality bursts. Some services have natural traffic spikes that produce label bursts — IRCTC's 10:00 IST Tatkal hour creates ~80,000 unique session_ids in 90 seconds; PhonePe's salary-day UPI batch creates ~2M unique transaction_ids in 10 minutes. If those identifiers are labels (which they should not be, but often are), the head block ingests millions of series in minutes, and the OOM cascade compresses from 50 minutes to 5 minutes — too fast for the leading-indicator alert to be useful. The mitigation is rate-limiting at the SDK boundary: the Prometheus client library's Counter.labels() call is wrapped in a circuit breaker that drops the label after the per-minute cardinality budget is exceeded. The metric still emits — the burst label values are just hashed into a bounded bucket. IRCTC's Tatkal-hour observability went from "Prometheus crashes daily at 10:02 IST" to "single bounded series per session-id-bucket" after the SDK wrapper landed in 2023.
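A minimal sketch of the bucketing half of that SDK wrapper, using the stock prometheus_client API; the bucket count, label names, and metric name are illustrative, and a production wrapper would add the per-minute budget circuit breaker described above.

# bounded_labels.py — hash unbounded identifiers into a fixed set of label values
import hashlib
from prometheus_client import Counter

N_BUCKETS = 100

def bucketed(value: str, n_buckets: int = N_BUCKETS) -> str:
    """Map an unbounded identifier onto one of n_buckets stable label values."""
    digest = int(hashlib.sha1(value.encode()).hexdigest(), 16)
    return f"bucket-{digest % n_buckets:03d}"

SEARCHES = Counter("flight_search_total", "flight searches",
                   labelnames=["route", "session_bucket"])
# At most N_BUCKETS series per route, no matter how many session IDs a Tatkal burst produces.
SEARCHES.labels("/search", bucketed("session-8f4c2a91")).inc()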

A tenth pattern, more common in early-stage startups than enterprises: the development-environment cardinality leak. A developer ships a metric with request_id as a label in their staging environment; staging Prometheus has 5K series and the developer sees no problem. The same code is promoted to production where 50M request_id values arrive in an hour, and the production Prometheus OOMs while the developer wonders why "it worked in staging". The asymmetry between staging and production cardinality is the silent feature of every cardinality incident — staging is too small to surface the bomb, production discovers it under load. The discipline that catches this is production-equivalent staging load (replay 10% of production traffic into staging) and CI-time cardinality assertions (the metric registration itself fails CI if any label is unbounded — request_id, trace_id, session_id, customer_id go on a deny-list at the SDK layer). Both disciplines are cultural changes more than technical ones; they live in the platform-team's review checklist, not in the codebase.
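A sketch of the registration-time assertion; the deny-list contents come from the paragraph above, and the wrapper name is hypothetical. The real leverage is that the check runs in CI, where a unit test importing the instrumented module fails long before production does.

# bounded_registry.py — fail CI if a metric declares an unbounded identifier as a label
from prometheus_client import Counter

DENY_LIST = {"request_id", "trace_id", "session_id", "customer_id"}

def bounded_counter(name: str, documentation: str, labelnames: list) -> Counter:
    banned = DENY_LIST.intersection(labelnames)
    if banned:
        raise ValueError(f"{name}: unbounded label(s) {sorted(banned)} are banned as metric labels")
    return Counter(name, documentation, labelnames=labelnames)

bounded_counter("checkout_requests_total", "checkouts", ["route", "status"])      # registers fine
bounded_counter("checkout_by_customer_total", "boom", ["route", "customer_id"])   # raises, CI fails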

Common confusions

Going deeper

The TSDB block format — postings, mins-maxs, and the size-on-disk picture

A Prometheus TSDB block on disk is a directory containing index, chunks/, and tombstones. The index file holds a series-table section, a label-index section (postings lists), and a symbol table, the same inverted-index shape Lucene uses for its term dictionaries and postings files. The postings list for a (label, value) pair is run-length-encoded varint-delta — the differences between consecutive series IDs are encoded as varints, which compresses well when series IDs are dense. A 1M-series-long postings list compresses to roughly 2-4 MB. The chunks directory holds Gorilla-XOR-encoded sample blocks, ~1.3 bytes/sample, organised into 512 MB segments for sequential I/O.

The cardinality cost in the on-disk layout: the index file grows roughly O(unique label-value pairs × log(max series ID)) — sub-linear in series count due to the varint-delta encoding, but every new label-value pair creates a new postings list, and every new series creates a new entry in the series table. A team that audits their TSDB block sizes (du -sh /prometheus/data/01H*) finds the index/chunks ratio drifts upward over the cardinality death spiral: at 100K series the index is ~5% of total block size; at 100M series it is ~25%. The chunks compress; the index does not, beyond varint-delta.

A subtle observation: the index file is mmap'd on Prometheus startup, not loaded into the process heap. A 25 GB index file on a node with 64 GB of RAM does not consume 25 GB of process RSS — the kernel's page cache handles it on demand. This is why a Prometheus instance with a small heap can serve queries against a much larger index than its RAM would suggest. The failure mode is page-cache pressure: a large index file evicts other workloads' pages, the kernel thrashes, and apparent disk I/O spikes correlate with apparent CPU spikes. A team running Prometheus on a node shared with application workloads sees the application latency degrade whenever Prometheus does a wide-postings query — a confusing symptom that traces back to mmap.

Why VictoriaMetrics and ClickHouse-backed TSDBs do not solve cardinality

A reader who hits Prometheus's cardinality wall sometimes reaches for VictoriaMetrics (claiming "10× lower memory per series") or a ClickHouse-backed approach (claiming "columnar storage handles cardinality natively"). Both claims are real but bounded.

VictoriaMetrics achieves lower per-series memory through several engineering choices: a more compact in-memory data structure, faster string interning, a more efficient postings-list format (Roaring bitmaps instead of RLE-varint), and per-tenant indices that avoid global-symbol-table contention. The result is roughly 200 bytes/series against Prometheus's ~2 KB/series — a 10× improvement on the constant, not a change in the multiplicative growth pattern. A fleet that was OOMing Prometheus at 7M series can hit 70M series on VictoriaMetrics — but the cross-product still applies, and a customer_id label with 1.4M values still produces a 1.4M× explosion. The wall moved; it did not disappear.

ClickHouse-backed approaches (Grafana Mimir's experimental columnar mode, Datadog's internal store) trade index-walk cost for column-scan cost. A query that touches a high-cardinality column scans the column's compressed file rather than walking the postings list — better for analytical queries that aggregate broadly, worse for selective queries that pick a small subset. The cardinality death spiral shifts from "the index OOMs the head" to "the column scan saturates the disk read bandwidth". The failure mode is different; the failure is the same.

The honest takeaway: every TSDB has a cardinality wall, the constant differs, the structure does not. The discipline — bounding cardinality at registration time — is the only durable answer. Vendor-switching delays the wall by a year; a cardinality budget delays it indefinitely.

The OTel SDK's contribution — and how its Meter API can hide the bomb

OpenTelemetry's metrics SDK provides a Meter interface that creates Counter, UpDownCounter, Histogram, and Gauge instruments. The instruments accept attributes (the OTel name for labels) at record-time: counter.Add(1, attribute.String("customer_id", customer_id)). The bomb in this API is that the developer does not declare the attribute set up front — every Add call can pass any attributes, and the SDK accumulates the cross-product as it sees them. A code review of a counter.Add(...) call cannot statically determine the cardinality of the attributes; only the running system can.

The OTel Collector partially mitigates this with attribute-filtering processors in the pipeline: declare which attributes are kept and drop the rest at the collector. But the SDK's in-process accumulator has already seen all attributes by the time the metric is exported, so the SDK's memory is unbounded between scrapes. A high-throughput service that sees 1M unique customer_id values in a 10-second scrape interval allocates 1M Counter rows in process memory before the export, and the in-process memory grows even though the exported metric is filtered.

The mitigation is the OTel SDK's view configuration — define attribute filters at SDK initialization so that high-cardinality attributes are dropped before they enter the in-process accumulator. The developer registers a View for the instrument whose attribute allow-list omits customer_id, and the SDK never accumulates the attribute. Most teams discover this only after their first OTel-side OOM, which arrives at roughly the same scale as their first Prometheus-side OOM — the structural problem is upstream of the storage layer.
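A minimal sketch of that view-based filter with the Python OTel SDK; it assumes the View / attribute_keys API available in recent opentelemetry-sdk releases, and the instrument and attribute names are illustrative.

# otel_view_filter.py — drop an unbounded attribute before the SDK accumulates it
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import View
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# Keep only bounded attributes on this instrument; customer_id never reaches the accumulator.
payments_view = View(instrument_name="payments_total", attribute_keys={"route", "status"})

provider = MeterProvider(
    metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())],
    views=[payments_view],
)
metrics.set_meter_provider(provider)

counter = metrics.get_meter("booking").create_counter("payments_total")
counter.add(1, {"route": "/pay", "status": "ok", "customer_id": "cust-00042"})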

A subtle pattern that catches teams: the OTel SDK's default behaviour is to export every attribute it has seen, even those that have not received samples in the current export interval. A customer_id that appeared once and then never again still occupies a row in every export until the SDK's TTL evicts it (default: never). The SDK accumulates the equivalent of Prometheus's symbol table, but in process memory rather than on disk. A long-running service that has seen 50M distinct customer_id values over its lifetime carries 50M attribute rows in process RAM until restart — which is its own cardinality wall, separate from the TSDB's.

The cardinality estimator — what to put on every dashboard

A platform team running a fleet of Prometheus instances can compute cardinality on the fly from the TSDB itself. The query count by (job) ({__name__=~".+"}) returns the per-scrape-job series count. The query topk(20, count by (__name__) ({__name__=~".+"})) returns the highest-cardinality metrics. The query sum(prometheus_tsdb_head_series) returns total active series.
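The three queries wired into a script, which is all a first version of the cardinality dashboard needs to be; the endpoint and output formatting are illustrative.

# cardinality_report.py — fleet total, per-job series counts, and the top-20 metrics
import requests

PROM = "http://localhost:9090"

def instant(query: str):
    return requests.get(f"{PROM}/api/v1/query", params={"query": query}).json()["data"]["result"]

total = instant("sum(prometheus_tsdb_head_series)")
print("total active series:", f'{float(total[0]["value"][1]):,.0f}' if total else "n/a")

for row in sorted(instant('count by (job) ({__name__=~".+"})'),
                  key=lambda r: -float(r["value"][1])):
    print(f'{row["metric"].get("job", "?"):30s} {float(row["value"][1]):>12,.0f}')

print("top 20 metrics by series count:")
for row in instant('topk(20, count by (__name__) ({__name__=~".+"}))'):
    print(f'{row["metric"].get("__name__", "?"):45s} {float(row["value"][1]):>12,.0f}')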

The dashboard derived from these queries — the "cardinality dashboard" — is the platform team's primary tool for catching incidents at stage 1 of the cascade. The leading indicator is delta(prometheus_tsdb_head_series[24h]) > 100_000 — a 100K series-count growth in 24 hours suggests a label was just added in production. The on-call drills into the cardinality dashboard, finds the metric whose count grew by 100K, walks back to the deployment that introduced it, and either rolls back or schedules a discussion. The Razorpay operating procedure runs this drill every Monday morning at 09:00 IST as a 15-minute platform-team standup; the standup has caught 12 incidents in 2024, none of which became OOMs.

A second-order observation: the cardinality dashboard itself is a high-cardinality query. count by (__name__) ({__name__=~".+"}) walks every postings list, which is expensive on a fleet that already has cardinality pressure. The recursion is deliberate — a team whose cardinality dashboard times out has already crossed into the failure regime, and the dashboard's failure is itself the alert. Platform teams sometimes pre-aggregate the cardinality query into a recording rule that runs once an hour, reducing the on-demand cost while preserving the historical signal.

The "labels-as-API" discipline as a structural fix

Every other mitigation in this chapter is reactive — alerts, kill-switches, dashboards, audits. The structural fix is to treat labels as a versioned API: declare them in a schema, review schema changes through a PR process, enforce the schema at registration time. Razorpay's metrics.yaml discipline (named in the previous chapter) is the canonical Indian implementation; Honeycomb's "Definition of Done" for instrumentation is the Western counterpart. Both implementations share the property that labels cannot be added to a metric without the schema reviewer's approval, and the schema reviewer's checklist includes the cardinality budget for the metric.

The discipline pays off in three places. First, every label change leaves an audit trail (the PR that updated the schema), so post-incident investigations can trace the offending label back to the engineer who shipped it. Second, the cardinality budget is enforced before the metric ships, not after — which is the only mitigation that prevents the OOM cascade rather than catching it. Third, the schema enables deprecation paths: a label can be marked deprecated_in: "v2.4" and removed_in: "v3.0", the dashboards that depend on it can be tracked, and the migration is a tracked engineering task rather than a coordinated panic.

The cost of the discipline is measured in PR overhead — each label change requires a YAML update and a review. Teams that ship the discipline report the overhead is roughly two engineer-hours per quarter for a 50-engineer team — far less than the engineering hours lost to a single cardinality incident. The discipline is the cheapest form of platform-team leverage observability has, and the platform-team chapters at the end of this curriculum will return to it.

Reproducibility footer

# Reproduce the head-block memory measurement on your laptop
docker run -d --name prom -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  --memory=4g prom/prometheus:v2.51.0
python3 -m venv .venv && source .venv/bin/activate
pip install prometheus-client requests pandas
python3 head_block_memory.py
# Expected: a four-row table tracking head_series and rss_mb across cardinality scenarios.
# At ~1.8M series the 4G container hits memory pressure; the OOM kill follows shortly after.
# Watch the head metrics live during the run:
# curl -s http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_series | jq .data.result

Where this leads next

This chapter named the structural reason a TSDB breaks under cardinality pressure: the inverted index, the head-block memory, the WAL throughput, the GC heap, the query-layer postings merge — every layer fails before the disk fills. The next chapter introduces cardinality budgets as the engineering discipline that turns this structural property into an actionable constraint: declare the budget, enforce it in CI, monitor it in production, kill-switch it on violation.

The single insight a senior reader walks away with: a TSDB is an inverted index; the index is the cost driver; the index breaks before the disk fills. A team that has internalised this reads every label-addition PR with the right mental model: not "how much disk does this take" but "how much index does this take, and how does the index break under load".

The closing reframing for the on-call SRE: every cardinality incident decomposes into three time horizons. Right now, the kill-switch (the relabel rule from the previous chapter) stops the bleeding in 90 seconds. This week, the cardinality dashboard and the leading-indicator alerts ensure the next incident is caught at stage 1. This quarter, the labels-as-API discipline and the CI-time budget enforcement prevent the next cardinality incident from being possible. Each horizon has its own cost — operational reflexes, dashboard engineering, cultural change — and a team that pays each one in order has a stable observability stack. A team that skips the third horizon and only ships the first two is running a permanent firefighting operation; the cascade arrives on a quarterly cadence rather than an annual one. The discipline is not "prevent every label addition"; the discipline is "every label addition is priced, audited, and recoverable", and the structural understanding from this chapter is what makes the pricing meaningful.

A final note for the reader returning to their own production stack: the most useful first action after reading this chapter is to run curl http://your-prometheus:9090/api/v1/query?query=prometheus_tsdb_head_series and write the number down. That number is your current position on the cardinality curve. The number a week from now, divided by the number today, is your growth rate — and the growth rate is what predicts whether the next 30 days end in an incident. Most teams do not know this number for their own stack; the platform teams that do are the ones that have stopped having cardinality incidents.

References