Cardinality: the master variable
At 14:08 IST on the second day of Big Billion Days 2024, a Flipkart SRE merged a one-line PR that added pincode as a label to the checkout_requests_total Prometheus counter. The PR description said "for regional latency dashboards". By 14:31 IST the production Prometheus replica's RSS had climbed from 28 GB to 92 GB, the head block was rejecting writes with "out of order sample" errors, the alerting Prometheus was scraping itself into a CPU spiral, and twelve Grafana dashboards were returning "query timed out". The root cause was four words long — "active series count exploded". One label, ~28,000 distinct Indian pincodes, multiplied across the rest of the existing label cross-product, took the active-series count from ~120k toward ~3.4 billion in twenty-three minutes. The deploy was rolled back; the SRE wrote a postmortem; the line "cardinality is the master variable" was added to the platform team's onboarding doc.
This chapter is about that variable — what it is, how it multiplies, why time-series databases are uniquely fragile to it, and how you reason about it before you ship a label rather than after Prometheus is on fire. Cardinality is the one number every observability team eventually learns to fear, and the only number that connects metric design, log retention, trace sampling, dashboard cost, and alert quality into a single budget.
Cardinality is the count of distinct label-value tuples a metric, log stream, or trace dimension can produce. Time-series databases store one inverted-index entry and one chunk header per unique series, so cardinality scales storage, memory, and query cost roughly linearly — and a single high-churn label can multiply your active-series count by 10,000× in minutes. Treat cardinality as a per-pillar budget, audit it before you ship a label, and you will never page your platform team at 02:00 to free up Prometheus heap.
What cardinality actually is, with the multiplication that bites you
A metric in Prometheus, OpenMetrics, or any OTel-compatible TSDB is identified by a metric name plus a set of labels. http_requests_total{method="GET", status="200", service="checkout-api"} is one series. Change any label value — method="POST" — and you get a different series. The TSDB stores each unique series independently: one entry in the inverted index, one chunk header in memory, one set of compressed samples on disk. Cardinality is the count of these unique series, and it is the master variable because everything else (RAM, disk, query latency, scrape budget, alert evaluation cost) scales with it.
The reason cardinality is dangerous, rather than merely large, is that label values multiply. If your http_requests_total has labels method (5 values), status (12 values), service (40 values), region (3 values), the active-series count is the cross-product: 5 × 12 × 40 × 3 = 7,200. Add customer_id (50 million values) and the worst-case is 7,200 × 50,000,000 = 360 billion. Most label combinations never get hit, so the actual active series is smaller — but "smaller than the cross-product" can still be tens of millions, and tens of millions of series is a Prometheus you cannot afford to run.
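The multiplication can be checked in a few lines — a sketch using the label value counts from the paragraph above (the label names are purely illustrative):

```python
from math import prod

# Per-label distinct-value counts from the example above.
labels = {"method": 5, "status": 12, "service": 40, "region": 3}

# Worst-case series count is the cross-product of label value counts.
worst_case = prod(labels.values())
print(worst_case)  # 7200

# Adding one high-cardinality label multiplies the whole budget, never adds to it.
with_customer_id = worst_case * 50_000_000
print(with_customer_id)  # 360000000000
```

The observed cardinality is whatever subset of that product your traffic actually exercises, but the worst case is the number you are signing up for.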
The TSDB does not care about your intentions when you added the label — it only sees the cross-product of label values that the scraper actually witnesses. Why the multiplication is exact, not approximate: Prometheus identifies a series by the byte-level fingerprint of its metric name and sorted label set. Two scrapes that produce label sets with the same values produce the same fingerprint and update the same series; two scrapes that differ in a single label byte produce two distinct fingerprints and two distinct series. There is no rounding, no bucketing, no approximation — every distinct tuple gets its own row in the index. The only relief is that not every tuple in the worst-case cross-product gets exercised; the cardinality you pay for is "tuples actually observed", not "tuples theoretically possible".
Three properties of label values turn a normal label into a cardinality bomb. High base value count: pincodes (28k), customer IDs (millions), trace IDs (∞), session tokens (∞). Churn: a label whose value space changes over time even if any single moment has only a few values — pod_name in Kubernetes (each rolling deploy gives every pod a new hash suffix), git_commit_sha on every release, request_id. Churn is the silent killer because your point-in-time series count looks fine while your historical series count grows linearly with deploys. Unboundedness: any label sourced directly from user input (URL paths with query parameters, full user-agents, raw error messages). The fix is the same in all three cases: don't put it in a metric label. Put it in a log, a trace attribute, or an event row, where the cost model is fundamentally different.
Why a TSDB pays linearly for cardinality (and a column store doesn't)
It is tempting to say "well, just store more stuff" — disks are cheap, RAM is cheap, surely a few million extra series can't matter. The reason this intuition fails is that a TSDB and a column store have fundamentally different cost models for new dimensions, and Prometheus is firmly in the TSDB camp.
A Prometheus series is laid out on disk and in memory roughly like this:
- One postings-list entry per (label, value) pair → these are the inverted indices that make {job="checkout"} queries fast. Each posting is a few bytes per series.
- One head chunk per active series in RAM → roughly 200 bytes of metadata plus 120 samples × 1.3 bytes/sample of compressed data. Call it ~400 bytes per active series at steady state.
- One block index entry per series per 2-hour block on disk → a few bytes more per series per block, plus the chunk file itself.
- Series-ID lookup tables in RAM that map fingerprints to chunk offsets → ~50 bytes per series.
The total RAM-per-active-series number floating around in production Prometheus deployments is 3–8 KB per series, depending on label-name length and label-value length. At 1 million active series you are paying 3–8 GB just for the metadata. At 10 million active series, 30–80 GB — bigger than most VM RAM allocations. At 100 million, you are not running Prometheus on one box anymore; you are running Cortex, Mimir, Thanos, or VictoriaMetrics in a sharded cluster, and your cardinality bill has become an infrastructure-design problem.
Now compare a column store. ClickHouse, BigQuery, Snowflake, DuckDB — these store one column per attribute, not one row per unique attribute combination. Adding a high-cardinality column adds one new column-file per partition and indexes into it via min/max statistics + bloom filters; it does not multiply the row count. Why this matters for the events pillar: 50 million unique customer IDs in a Prometheus label means 50 million series — fatal. 50 million unique customer IDs in a ClickHouse customer_id column is one column with 50 million distinct values — completely fine, queryable in milliseconds via the bloom-filter index. This is the architectural reason "high cardinality belongs in events, not metrics" is a rule rather than a guideline. The two stores have different mathematical complexity in the dimension you're worried about.
The same argument repeats for logs (Loki indexes labels separately from log content, so high-cardinality fields go in the body, not the labels) and traces (Tempo indexes only service.name and name by default; everything else is a span attribute, scanned at query time but not multiplied across the index). Each pillar has its own cardinality budget with its own cost curve, and the discipline is to know which budget you are spending against when you add a dimension.
Auditing your real cardinality (the script you should run today)
The single most useful one-time audit you can run on a running Prometheus is to enumerate every metric and its current series count, sorted descending. The script is short and the output is invariably surprising — the metric you assumed was tiny turns out to be in the top ten, the metric you assumed was your worst offender is comfortably mid-pack, and there is at least one metric you do not recognise eating ten percent of your cardinality budget.
```python
# cardinality_audit.py — enumerate Prometheus cardinality, find the offenders.
# pip install requests pandas
import time
from collections import Counter, defaultdict

import pandas as pd
import requests

PROM = "http://prometheus.platform.razorpay.internal:9090"

# Step 1: list every metric name this Prometheus knows about.
names = requests.get(f"{PROM}/api/v1/label/__name__/values", timeout=30).json()["data"]
print(f"distinct metric names: {len(names):,}")

# Step 2: for each metric, fetch the active series and tally by metric and by label.
per_metric = Counter()
per_label_pair = defaultdict(Counter)  # (metric, label_name) -> Counter(label_value)
start = time.time()
for i, name in enumerate(names):
    r = requests.get(
        f"{PROM}/api/v1/series",
        params={"match[]": name, "start": int(time.time()) - 600, "end": int(time.time())},
        timeout=60,
    )
    series = r.json().get("data", [])
    per_metric[name] = len(series)
    for s in series:
        for k, v in s.items():
            if k == "__name__":
                continue
            per_label_pair[(name, k)][v] += 1
    if i % 50 == 0:
        print(f"  scanned {i}/{len(names)} metrics ({time.time()-start:.1f}s elapsed)")

# Step 3: top metrics by series count.
top = pd.DataFrame(per_metric.most_common(15), columns=["metric", "series"])
print("\nTop 15 metrics by active series:")
print(top.to_string(index=False))

# Step 4: for the worst metric, which label is doing the damage?
worst = top.iloc[0]["metric"]
worst_labels = sorted(
    ((lbl, len(vals), sum(vals.values()))
     for (m, lbl), vals in per_label_pair.items() if m == worst),
    key=lambda x: -x[1],
)
print(f"\nLabel cardinality for the worst metric ({worst}):")
for lbl, n_distinct, n_total in worst_labels[:10]:
    print(f"  {lbl:30s} distinct={n_distinct:>8,} total_series={n_total:>10,}")
```
A real run on a mid-sized Prometheus (Razorpay-scale staging cluster, ~1.6 million active series across the fleet, single-replica scrape) prints output of this shape:
distinct metric names: 4,182
scanned 0/4182 metrics (0.1s elapsed)
scanned 50/4182 metrics (3.4s elapsed)
...
scanned 4150/4182 metrics (267.8s elapsed)
Top 15 metrics by active series:
metric series
envoy_cluster_upstream_rq_xx 421,118
http_request_duration_seconds_bucket 198,402
kube_pod_container_status 89,704
envoy_cluster_external_upstream 61,840
http_requests_total 54,012
grpc_server_handled_total 49,277
node_cpu_seconds_total 42,008
container_memory_working_set_bytes 31,517
prometheus_tsdb_head_series_created 18,402
go_gc_duration_seconds 14,888
jvm_memory_used_bytes 12,401
process_cpu_seconds 11,994
razorpay_payment_attempts_total 9,872
redis_commands_processed_total 8,401
http_response_size_bytes_bucket 7,109
Label cardinality for the worst metric (envoy_cluster_upstream_rq_xx):
cluster_name distinct= 4,118 total_series= 421,118
envoy_response_code_class distinct= 6 total_series= 421,118
pod distinct= 1,704 total_series= 421,118
namespace distinct= 38 total_series= 421,118
region distinct= 3 total_series= 421,118
app_version distinct= 312 total_series= 421,118
What the output is telling you, line by line. The top entry, envoy_cluster_upstream_rq_xx at 421k series, is doing the damage. Worse, it is a metric most teams never look at — Envoy ships it by default. The label-by-label breakdown for the worst metric shows the culprits: cluster_name has 4,118 distinct values (every pod-pair connection inside the mesh), pod has 1,704 distinct values, app_version has 312 distinct values (every release ever deployed in the scrape window). The cross-product gives you the 421k. The app_version label is a churn bomb — every release adds N new pods × 6 response codes × 38 namespaces ≈ 2,000 new series that will sit in the head block until they age out. Six months of weekly releases at this rate adds ~50k stale series. The fix in this case is to drop app_version from the metric labels (it belongs on a deployment annotation, not a metric) and to apply a metric_relabel_config that drops cluster_name for cluster pairs that are not user-facing. The script's job is not to fix anything; it is to show what you are paying for, so the fix becomes obvious.
Why this audit is the highest-leverage observability work you can do in an afternoon: every team I have seen run it for the first time discovers between 30% and 60% of their cardinality is from labels that no dashboard ever queries. Stale pod values from old deployments, app_version labels that should have been annotations, full URL paths instead of route templates, customer IDs that snuck into a debug-only label and never got removed. Cutting that 30–60% does not delete any signal — by definition, no dashboard was reading it. It just stops you paying for it. Razorpay's 2024 platform-team rewrite cut their Prometheus RSS by 41% with this kind of audit alone, before any architectural changes.
What the cardinality budget looks like, per pillar
The fix for "I have a high-cardinality dimension to track" is not "delete it" — the data is genuinely valuable. The fix is to put it in the pillar whose cost model can absorb it. Each pillar has a different curve.
The decision is now mechanical. A dimension with > 1,000 distinct values that you query rarely belongs in a span attribute (Tempo) or an event column (ClickHouse / Honeycomb), not a metric label. A dimension with > 100 distinct values that churns every deploy (pod_name, git_sha) belongs in a span attribute or a deploy annotation, never a metric label. A dimension with < 50 distinct values that is queried on every dashboard panel belongs in a metric label — region, service, method, status_class are textbook examples. The rough rule of thumb every Indian platform team I have seen has converged on: a Prometheus label that takes more than 100 distinct values needs a written justification before merge, and more than 1,000 needs a director's signoff. Razorpay, Flipkart, Hotstar, and Swiggy all run some version of this guard rail post-2024.
The fix at scrape time — a relabel-config simulator you can run
Once you have identified the offending labels, the fix is a metric_relabel_configs block applied at scrape time, which drops the bad labels before the sample reaches the TSDB. The discipline is to simulate the relabel rules against a snapshot of your real series before you ship them, so you know exactly how much the rule will cut. Here is a minimal Python simulator that does this:
```python
# relabel_simulator.py — preview how much a metric_relabel_configs rule will cut.
# pip install requests pyyaml
import re
from collections import Counter

import requests
import yaml

PROM = "http://prometheus.platform.razorpay.internal:9090"

# The proposed rule: drop the `pod` label and the `app_version` label from
# every Envoy metric. Pure label drop, no metric drop. NOTE: `regex_drop` is
# this simulator's own field — a real labeldrop rule matches label *names*
# with `regex` and cannot be scoped by metric name in a single rule, so the
# shipped config will look different even though the effect is the same.
PROPOSED_RULE = yaml.safe_load("""
- source_labels: [__name__]
  regex: 'envoy_.*'
  action: labeldrop
  regex_drop: '(pod|app_version)'
""")[0]

def relabel_one(series: dict, rule: dict) -> dict:
    """Apply a single labeldrop rule to one series, return the new label set."""
    name = series.get("__name__", "")
    if not re.fullmatch(rule["regex"], name):
        return series
    if rule["action"] != "labeldrop":
        return series
    drop_re = re.compile(rule["regex_drop"])
    return {k: v for k, v in series.items() if not drop_re.fullmatch(k)}

# Pull every active envoy_* series.
r = requests.get(
    f"{PROM}/api/v1/series",
    params={"match[]": '{__name__=~"envoy_.*"}'},
    timeout=120,
)
before = r.json()["data"]
before_count = len(before)

# Apply the rule, then dedupe — labels collapse, multiple series collapse to one.
after_set = set()
for s in before:
    new_labels = relabel_one(s, PROPOSED_RULE)
    after_set.add(tuple(sorted(new_labels.items())))
after_count = len(after_set)

print(f"before: {before_count:>10,} envoy series")
print(f"after:  {after_count:>10,} envoy series")
print(f"saved:  {before_count - after_count:>10,} series "
      f"({100 * (before_count - after_count) / before_count:.1f}%)")

# Show which labels are still doing damage post-rule.
post_label_card = Counter()
for labels in after_set:
    for k, _ in labels:
        post_label_card[k] += 1
print("\nLabels remaining on envoy_* (count = series that carry it):")
for lbl, n in post_label_card.most_common(8):
    print(f"  {lbl:25s} {n:>10,}")
```
Sample run against the same staging cluster from the audit above:
before: 421,118 envoy series
after: 124,066 envoy series
saved: 297,052 series (70.5%)
Labels remaining on envoy_* (count = series that carry it):
cluster_name 124,066
envoy_response_code_class 124,066
namespace 124,066
region 124,066
__name__ 124,066
What the simulator gives you. The rule fires only on envoy_.* — production Prometheus configs frequently have a dozen rules, each scoped to a different metric prefix, and the simulator runs each independently. The labeldrop action removes the labels and recomputes the series fingerprint; series that previously differed only in pod and app_version collapse to one series, which is where the 70.5% reduction comes from.
The post-rule cardinality breakdown is what you commit to next: cluster_name is now your dominant remaining label at 4,118 distinct values, and the next cardinality conversation is whether to split Envoy mesh metrics by source cluster only (instead of cluster pair). No production traffic is touched — this is pure offline analysis from the /api/v1/series endpoint, which is why every platform team should run it before merging label-changing PRs.
Why simulating the rule beats shipping it and watching: a labeldrop applied to a live scrape takes effect at the next scrape interval, but the series that used to exist with the dropped labels do not disappear from disk until the retention window expires. If the rule was wrong (you dropped a label a dashboard depends on), you cannot un-apply it without rolling back configs and waiting for the bad series to age out. Simulating against the current series set tells you the consequence in milliseconds; rolling out and rolling back tells you the consequence in days.
Edge cases — when cardinality budgets surprise you
The four cardinality patterns that bite teams who have already done the basic audit are all forms of "the budget you measured is not the budget you are spending tomorrow".
Churn cardinality. Your point-in-time series count is fine — 200k. But every time you deploy, 1,500 new series are created (new pod hashes), and the old ones stick around in the head block for 2 hours and on disk for the retention window. Six months of three-deploys-a-day weekday releases later, your historical series count is 1.5M while your active count looks healthy, and range queries over long windows (a rate over a 7-day range, say) suddenly start OOMing. The fix is metric_relabel_configs that strip ephemeral labels (pod, instance) at scrape time, exposing them only as exemplars or in span attributes when you need them.
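One way that fix looks in a scrape config — a sketch only; labeldrop matches label names, applies to every metric in the job, and collapses series that differed only in a dropped label, so verify against your dashboards (and your counters' semantics) before shipping:

```yaml
scrape_configs:
  - job_name: checkout              # illustrative job name
    static_configs:
      - targets: ['checkout:9090']  # illustrative target
    metric_relabel_configs:
      # Strip ephemeral labels before the sample reaches the TSDB.
      # `regex` here matches label *names*, not label values.
      - action: labeldrop
        regex: 'pod|instance'
```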
Recording-rule cardinality. You write a recording rule service:http_requests:rate5m to pre-compute a frequently-queried expression. The recording rule has its own series — and if the source metric had 200k series, the recording rule has 200k series too, plus any aggregation labels you added. Aggressive use of recording rules without by () clauses doubles your active-series count without anyone noticing. The fix is to always aggregate down in recording rules — sum by (service, method) — and to budget recording rules separately from raw metrics.
Histogram-bucket cardinality. A _bucket series exists for every bucket boundary you declare. If http_request_duration_seconds_bucket has 12 buckets and 7,200 base series (method × status × service × region), that is 86,400 bucket series — twelve times the apparent count. Native histograms (Prometheus 2.40+, OpenMetrics 1.2) collapse this into one series per metric with sparse buckets, but most production Prometheuses are still on classic histograms. Audit your _bucket cardinality separately from your _count and _sum.
Per-tenant cardinality fan-out. A SaaS platform that adds tenant_id to every metric, with one new tenant onboarded per day, is signing up to a budget that grows linearly with success — by the time you have 5,000 tenants you have multiplied your cardinality by 5,000×. Zerodha, Cred, and most B2B Indian SaaS companies (Razorpay's merchant-facing dashboards included) hit this wall around year three of growth and spend a quarter migrating their per-tenant metrics to a column store with the events pillar pattern. The fix is to never put tenant_id on a Prometheus label except for global infra metrics (e.g., aggregate billing); per-tenant operational metrics belong in a per-tenant ClickHouse partition or Honeycomb dataset.
Why these four are the patterns that bite "after" the basic audit: the basic audit measures point-in-time series count. Churn, recording rules, histogram buckets, and per-tenant fan-out are all ways your future series count differs from today's. A team that has already run the script above and is comfortable with their current number gets blindsided by these because none of them show up in count by (__name__)({__name__=~".+"}) until the harm is done.
Common confusions
- "Cardinality is the same as throughput." False. Throughput is samples-per-second; cardinality is distinct-series count. A metric can have low per-series throughput and disastrous cardinality: a counter incremented once every 10 seconds, but carrying 50 million distinct customer-id labels, is 50 million active series ingesting 5M samples/sec in aggregate — and the cardinality is what kills you, not the throughput. Prometheus is bottlenecked on series count, not on samples-per-second; samples-per-second is a downstream consequence of series count times scrape frequency.
- "Cardinality only matters for Prometheus." Misleading. The TSDB pillar is the most cardinality-sensitive, but Loki has stream cardinality (each unique label set is a distinct stream, and Loki bills you per stream), Tempo has indexed-attribute cardinality (the attributes you choose to index multiply the same way), and Honeycomb has dataset partition cardinality. Every pillar has a cardinality dimension; the cost curves differ. The discipline is to know which curve you are on for each dimension you add.
- "High-cardinality labels are bad." Misleading shorthand. High-cardinality labels in a metric are bad. The same dimension as a span attribute, an event column, or a log body field is fine — sometimes it is exactly the dimension you most need. The full rule is "match the dimension to the pillar whose cost curve can absorb it", not "avoid high cardinality everywhere".
- "Aggregation in PromQL fixes cardinality." False. sum by (region) (http_requests_total) is cheap to query but does not change the storage cost — the underlying series still exist on disk and in head blocks. Aggregation reduces query memory and dashboard latency, not ingest cost. Storage cost is set at scrape time by the labels the exporter emits; you cannot subtract it later.
- "Native histograms make cardinality irrelevant." Overstated. Native histograms (Prometheus 2.40+, sparse-bucket histograms in OpenTelemetry) collapse the per-bucket series count from 12 to 1, which is a meaningful 12× win — but they do nothing to reduce the cardinality of the base labels (method, status, service, etc.). A native histogram with customer_id as a label is still a cardinality bomb. Native histograms are necessary, not sufficient.
- "You can fix cardinality with a bigger Prometheus." True up to a point and then catastrophically false. A 32 GB Prometheus can hold ~5–8M series; a 128 GB Prometheus can hold ~25–40M series. Beyond that you need a sharded TSDB (Cortex, Mimir, VictoriaMetrics cluster, Thanos receivers), which is a different operational beast — and at typical Indian-team scale (Razorpay, Flipkart, Hotstar) the platform team would prefer to fix the cardinality than to operate the sharded cluster. "Buy more RAM" is the answer for one quarter; "fix the cardinality budget" is the answer for the next two years.
Going deeper
The Prometheus TSDB internals — why exactly 3–8 KB per series
Prometheus 2.x stores its head data in memory split across three structures. The series-index is a fingerprint-to-series-ref hash table — about 50 bytes per series for the entry plus the average label-string overhead, ~120 bytes per series in practice. The postings list is the inverted index from (label, value) → list of series IDs containing that pair — Prometheus encodes posting lists as delta-of-delta varints, so the per-series cost is amortised but typically 30–80 bytes. The head chunk holds the most recent ~3 hours of samples per series, with chunk metadata (~200 bytes) plus 120 samples × 1.3 bytes/sample compressed via Gorilla XOR (~160 bytes), so ~360 bytes per active series. Add label-name strings (interned, but not free), exemplars if enabled, and the per-series total typically lands at 3–6 KB at low label-string lengths and 6–8 KB at high (Kubernetes pod-name labels with 60-character hash suffixes are the worst offenders). The Pelkonen et al. Gorilla paper (VLDB 2015) is the foundational read; Part 8 of this curriculum dissects the encoding step by step.
The pod label trap — Kubernetes-specific churn
Every Kubernetes Pod gets a unique name with a 5-character hash suffix (payments-7d4bf6c4d-x9k2p). Every rolling deploy replaces every pod with a fresh hash. If you scrape the pod's /metrics and label series with pod, every deploy adds N new series (one per metric per pod) and the old ones stick around in the head block for 2 hours, on disk for the retention window. A 50-pod deployment exposing 200 metrics creates 10,000 fresh series on every full rollout; at three deploys a day on weekdays that is up to ~150,000 stale series per week landing in the head and in every on-disk block that overlaps them. Six months later, your historical series count has ballooned by an order of magnitude even though the active count looks fine. The fix is metric_relabel_configs that drop the pod label at scrape time (preserving service, namespace, node) and re-expose pod identity only as an exemplar on the histogram bucket or as a span attribute on the trace. Hotstar's IPL 2023 Prometheus stack documents this pattern in their public engineering blog as the single largest cardinality reduction they made.
The Hotstar 2023 IPL incident — when scrape interval matters
Hotstar's IPL final 2023 (CSK vs GT, 25M concurrent viewers) saw a sustained 22% increase in active series during the match because every viewer's session triggered a metric increment with a device_class label that had been recently added (40 distinct values × the existing cross-product). The metric itself was reasonable; the problem was that scrape interval had been tightened from 30s to 15s for the duration of the event ("we want fresher data during the match"), which doubled the chunk roll-over rate and pushed the head block into an unfavourable compaction pattern. RSS climbed to 110 GB, the WAL hit its size cap, and writes started rejecting. The post-incident fix was to cap scrape interval at 30s regardless of event load and to move device_class to a span attribute on the playback trace — neither change reduced the underlying cardinality, but together they kept the TSDB inside its operating envelope. The lesson is that scrape interval interacts with cardinality multiplicatively at the chunk-roll boundary; doubling scrape frequency on a high-cardinality metric is more expensive than the linear math suggests.
Reproduce this on your laptop
```shell
docker run -d --name prom -p 9090:9090 prom/prometheus
python3 -m venv .venv && source .venv/bin/activate
pip install requests pandas
# point PROM in cardinality_audit.py at http://localhost:9090, then:
python3 cardinality_audit.py   # prints top metrics + label cardinality
```
After running the audit on a fresh Prometheus that scrapes only itself, you should see a few hundred metric names and on the order of a thousand active series — already enough to demonstrate the per-label-pair fan-out for prometheus_* and go_* metrics. Add a second scrape target (docker run -d -p 9100:9100 prom/node-exporter and add it to prometheus.yml), and the active-series count will jump by a few thousand more — that delta is the linear cost of one extra exporter, in numbers you can see.
Where this leads next
Cardinality is the master variable; what you do with that variable lives in three downstream places.
- The observability-vs-monitoring distinction — chapter 4, where the dimensional-flexibility argument from this chapter becomes the operational definition of "observability". A team that has solved its cardinality budget by routing high-cardinality dimensions to events has built observability; a team that has solved it by deleting labels has built monitoring.
- Wall: metrics without a time-series store are useless — chapter 5, the closing chapter of Part 1, which makes the point that the TSDB is what makes a metric a metric, and the TSDB's cost model is what makes cardinality a budget. The cardinality numbers from this chapter are the inputs.
- Part 6 of this curriculum (chapters 30–40 in the syllabus) is dedicated entirely to cardinality at production scale — recording-rule cost models, exemplar discipline, label-relabelling pipelines, the migration from per-tenant metrics to per-tenant events, and the multi-cluster aggregation patterns Mimir and VictoriaMetrics use to absorb the budgets the platform team can't fix in time.
The discipline you are building here is "before you add a label, ask which budget pays for it". Every other cost in observability — query latency, alert evaluation time, dashboard load, retention bills — flows downstream from that one decision.
References
- Pelkonen et al., "Gorilla: A Fast, Scalable, In-Memory Time Series Database" (VLDB 2015) — the foundational paper that designed Prometheus's storage layer; the per-series cost numbers in this chapter come from a direct read of the paper plus the Prometheus 2.x implementation.
- Prometheus design documentation — the canonical reference for the TSDB internals, postings list, and head block structure.
- Observability Engineering — Charity Majors, Liz Fong-Jones, George Miranda. Chapter 6 ("The Cost of High Cardinality") is the field reference for the events-vs-metrics decision this chapter makes mechanical.
- Grafana Labs blog: avoiding cardinality explosions — the operational write-up that documents the metric_relabel_configs pattern most Indian platform teams now use.
- VictoriaMetrics: high cardinality and high churn rate — a vendor doc, but the cleanest published statement of the per-pillar cost model and the "1,000-distinct-values rule" this chapter cites.
- Honeycomb: high cardinality is the point — the column-store side of the argument, useful as the rebuttal to "high cardinality is bad".
- Why "three pillars" is a flawed framing — the previous chapter, which named the events pillar; this chapter explains the cost-model reason events get to be a pillar at all.
- Metrics, logs, traces: what each is good at — chapter 1 of this curriculum, the baseline three-pillar treatment that this chapter's per-pillar budget extends.