Wall: cardinality is the billing death spiral
It is 09:14 IST on the first Monday of the quarter and Aditi, the platform-team lead at a Bengaluru fintech, is reading the Datadog invoice email her CFO forwarded with one word in the body: "explain". The bill is up 47% from last quarter — ₹38 lakh has become ₹56 lakh — and her team shipped no new services, scraped no new endpoints, and added no new dashboards. They did, six weeks ago, ship one perfectly innocent-looking change: the payments service started attaching customer_id as a label on the request_duration_seconds histogram so on-call could break down latency by customer for the new enterprise SLA. The label has 1.4 million distinct values. The histogram has 12 buckets. The service has 8 endpoints. The fleet has 240 pods. The math — 1,400,000 × 12 × 8 × 240 — works out to roughly 32 billion active series, each costing Datadog the moral equivalent of ₹0.000018 per series-month, and Aditi's CFO is now staring at the result. The team's instinct is "switch vendors" or "negotiate". The actual fix is "delete the label" — but no one wants to lose the dashboard the sales team built last week, the dashboard the sales team would not have built if anyone had named the cost up front. This is the wall. Every observability budget eventually collides with the cost of cardinality, and sampling — the entire subject of Part 5 — does almost nothing to help.
Sampling controls the trace and log bill but barely touches the metrics bill, because metrics cost is dominated by active series count, which is dominated by label cardinality — the cross-product of label-value sets per metric. Adding one label with N distinct values multiplies your series count by N, multiplies your TSDB memory by N, multiplies your vendor invoice by N, and the multiplication compounds across every label you have already added. The death spiral is structural: each new label looks cheap in isolation, the cumulative effect is exponential, and the only honest fix is a cardinality budget enforced in CI before the metric ships. Part 5 ends here, Part 6 begins on the other side of this wall.
Why sampling does not save your metrics bill
The previous five chapters of Part 5 — head sampling, tail-based sampling on the OTel Collector, adaptive sampling, error-rate-driven dynamic sampling — solved the trace pillar's cost problem. A 100K-RPS fleet emitting 80 spans per request produces 8 million spans per second; a 1% sampler keeps 80,000 spans per second, the trace store stays sane, the bill is bounded. The same logic applies to the log pillar — head-sampled logs at 1% keep the Loki cluster from drowning. Everything in Part 5 was about the question "which events do we keep?" and the entire toolkit assumes the cost is per-event.
The metrics pillar does not work that way. A Prometheus counter http_requests_total{method="POST", route="/checkout", status="200"} is one time series, regardless of whether the underlying endpoint serves 10 requests per second or 10 million. The series stores its samples — one float every 15 seconds, compressed to ~1.3 bytes/sample after Gorilla XOR — and the storage cost is (num_series) × (samples retained per series over the retention window) × (bytes per sample). Sampling the requests does not reduce the number of series. A pod handling 10 RPS and a pod handling 10,000 RPS each produce one series per (method, route, status) combination. Halving the request rate halves nothing on the metrics bill.
What multiplies the metrics bill is adding labels. Every label adds a dimension to the cross-product of distinct values. The series count for a single metric is at most Π cardinality(label_i) — the product over its labels — and in practice equals the number of label combinations that actually occur in production. Add a label with 5 values (region: ap-south-1, ap-south-2, us-east-1, eu-west-1, us-west-2) — series count multiplies by 5. Add a label with 240 values (one per pod hostname) — multiplies by 240. Add customer_id with 1.4 million values — multiplies by 1.4 million.
Why this is not a Prometheus quirk: the cross-product is a property of how time-series databases index data. Every TSDB — Prometheus, VictoriaMetrics, M3, Cortex, Mimir, Datadog's internal store, New Relic's, even ClickHouse-backed metrics layers — stores one logical row per unique label set per metric. The cross-product is the index's primary key. There is no encoding trick that makes 11M series cost the same as 1,600 — the index has to enumerate them, the gossip protocol has to replicate them, the query layer has to merge them. The "death spiral" name is not metaphor: every additional label multiplies cost, every additional metric adds a new multiplied row, and the curve compounds.
The reader new to this discovers the death spiral the same way Aditi did — a routine label addition produces a routine cost increase produces a routine quarterly review, and three quarters later the metrics bill is larger than the compute bill it was supposed to monitor. The structural failure is not vendor pricing; the structural failure is that nobody told the team adding the label that they were committing to a million-fold series growth. Part 5 ends with this naming because Part 6 — the cardinality discipline — is the answer.
A measurement: simulate four real production cardinality additions
The fan-out diagram is a sketch. The engineering question is: how does cost actually scale when a real team makes four real label decisions over six months on a real fleet? The script below simulates a Razorpay-shaped fleet — 18 services, 40 endpoints each, ~12,000 RPS aggregate — and walks through four label additions a platform team might ship in a quarter, computing series count, monthly storage, and a vendor-equivalent invoice in rupees at each step.
# cardinality_death_spiral.py — simulate four label additions on a fleet over a quarter
# pip install pandas
import pandas as pd

# Fleet baseline: 18 services × 40 endpoints, RED-method metrics
SERVICES = 18
ENDPOINTS_PER_SVC = 40
METHODS = 5        # GET, POST, PUT, DELETE, PATCH
STATUS_CODES = 8   # 2xx (3) + 4xx (3) + 5xx (2)
HISTOGRAM_BUCKETS = 12  # le=0.005, 0.01, ..., 10, +Inf (Prometheus client defaults)

# Pricing model (illustrative — order-of-magnitude based on Datadog/Grafana Cloud public pricing)
COST_PER_SERIES_PER_MONTH_INR = 0.018  # ~₹0.018 per active series-month
SCRAPE_INTERVAL_SEC = 15
SAMPLES_PER_DAY_PER_SERIES = (24 * 3600) / SCRAPE_INTERVAL_SEC  # 5,760
BYTES_PER_SAMPLE_AFTER_GORILLA = 1.3
RETENTION_DAYS = 30

def fleet_series_count(label_factors):
    """label_factors is a dict {label_name: cardinality}; compute total series."""
    base = SERVICES * ENDPOINTS_PER_SVC * METHODS * STATUS_CODES * HISTOGRAM_BUCKETS
    multiplier = 1
    for v in label_factors.values():
        multiplier *= v
    return base * multiplier

def monthly_cost_inr(series):
    return series * COST_PER_SERIES_PER_MONTH_INR

def storage_gb(series):
    bytes_total = series * SAMPLES_PER_DAY_PER_SERIES * RETENTION_DAYS * BYTES_PER_SAMPLE_AFTER_GORILLA
    return bytes_total / (1024**3)

# The quarter unfolds — four label additions, each "small"
timeline = [
    ("week 0", "baseline (no extra labels)", {}),
    ("week 3", "+ pod_name (240 pods, for per-pod debugging)",
     {"pod_name": 240}),
    ("week 7", "+ region (5 regions, multi-region rollout)",
     {"pod_name": 240, "region": 5}),
    ("week 11", "+ tenant_tier (3 tiers: free/pro/enterprise, for tier SLAs)",
     {"pod_name": 240, "region": 5, "tenant_tier": 3}),
    ("week 13", "+ customer_id (1.4M customers, for enterprise SLA dashboard)",
     {"pod_name": 240, "region": 5, "tenant_tier": 3, "customer_id": 1_400_000}),
]

rows = []
prev_series = None
for week, change, labels in timeline:
    s = fleet_series_count(labels)
    rows.append({
        "week": week,
        "change": change,
        "active_series": f"{s:,}",
        "monthly_inr": f"₹{monthly_cost_inr(s):>16,.0f}",
        "tsdb_storage_gb": f"{storage_gb(s):>10,.1f} GB",
        "growth_×": "—" if prev_series is None else f"{s/prev_series:,.1f}×",
    })
    prev_series = s

print(pd.DataFrame(rows).to_string(index=False))
A representative run prints (storage figures rounded):
week change active_series monthly_inr tsdb_storage_gb growth_×
week 0 baseline (no extra labels) 345,600 ₹ 6,221 72.3 GB —
week 3 + pod_name (240 pods, for per-pod debugging) 82,944,000 ₹ 1,492,992 17,352.9 GB 240.0×
week 7 + region (5 regions, multi-region rollout) 414,720,000 ₹ 7,464,960 86,764.5 GB 5.0×
week 11 + tenant_tier (3 tiers: free/pro/enterprise, for tier SLAs) 1,244,160,000 ₹ 22,394,880 260,293.6 GB 3.0×
week 13 + customer_id (1.4M customers, for enterprise SLA dashboard) 1,741,824,000,000,000 ₹ 31,352,832,000,000 364,411,000,000.0 GB 1,400,000.0×
Per-line walkthrough. The line base = SERVICES * ENDPOINTS_PER_SVC * METHODS * STATUS_CODES * HISTOGRAM_BUCKETS computes the bare RED-method baseline: 18 × 40 × 5 × 8 × 12 = 345,600 series. This is what the fleet ships before any team adds anything. The bill at this baseline is ₹6,221/month, which is what the metrics line item should look like for an 18-service fleet of this shape. Why ₹6,221 and not "free": even baseline RED-method instrumentation produces hundreds of thousands of series because histograms multiply by their bucket count (HISTOGRAM_BUCKETS = 12, matching the Prometheus client default), and every (method, status) pair is its own series. Teams that look at one endpoint in isolation see "a few hundred series, no problem"; the fleet-wide aggregate is roughly three orders of magnitude larger because the multiplication happens at the fleet level, not the endpoint level.
The line {"pod_name": 240} is week 3's addition. A platform engineer adds pod_name because debugging a noisy pod requires per-pod aggregation. The series count multiplies by 240 (one per pod) and the bill jumps from ₹6,221 to ₹14,93,000 (₹14.93 lakh) per month. This is the first death-spiral signal — but on a quarterly review it looks like "we scaled up our infrastructure observability by 240×, of course it costs more". The decision was not wrong; the awareness was wrong. Nobody costed the 240× growth in advance.
The line {"customer_id": 1_400_000} is week 13's catastrophe. A product engineer adds customer_id because the new enterprise SLA dashboard needs per-customer p99 latency. The series count goes from 1.24 billion to 1.74 trillion. The bill goes to ₹31 thousand crore per month — clearly impossible, the TSDB will OOM long before the bill is rendered. Why this is the textbook case: customer_id looks innocent in the code (just another label), the dashboard renders fine in the engineer's local Grafana (because the local Prometheus has 5 customers, not 1.4M), and the cardinality only manifests when the metric flows through the production scrape — at which point the TSDB starts evicting older blocks, the alerting rules start timing out their query_range calls, and the on-call gets paged for "metrics ingestion lag" which turns out to be self-inflicted. The fix is not "make the TSDB bigger" — no TSDB on earth can hold 1.7 trillion active series — the fix is "delete the label, replace it with exemplars or with a sampled subset of high-value customers".
The growth_× column tells the death-spiral story most clearly. The first three label additions each look like a 3×-to-240× growth, which on a quarterly review feels manageable ("we knew the multi-region rollout would cost more"). The cumulative effect is 5 billion times the baseline by week 13. The exponential is visible only in retrospect; in the moment, every individual addition feels linear. This is the cognitive trap that produces every cardinality incident — humans reason about each label in isolation, but the bill is the cross-product.
A second-order observation hidden in the table: the tsdb_storage_gb column grows in lockstep with the series count, but the query latency grows faster. A query like histogram_quantile(0.99, sum by (route) (rate(request_duration_seconds_bucket[5m]))) reads every series matching the selector and merges them; the merge cost is O(n × log n) in series count, so a 240× series increase produces roughly a 480× query-time increase (assuming the query was IO-bound to start with). At baseline the query returns in 80ms; at week 3 it returns in 38 seconds, which exceeds Grafana's default panel timeout. Dashboards stop loading. Alerts based on the same query stop evaluating in their interval. The bill is the visible cost; the invisible cost is that every team that depends on the metric's queries — dashboards, alerts, downstream systems — starts seeing degraded latency on the day the cardinality grows, weeks before the invoice arrives.
The headline of the measurement is the multiplicative compounding. Sampling, by contrast, buys a fixed factor: dropping 99% of traces reduces trace volume by 100×, never by 1.4 million×. The metrics bill grows by orders of magnitude that sampling architectures simply cannot reach. This is why Part 5's tools — every sampler in this Build — are the wrong tools for this job. They were never designed to address it.
What sampling does and does not transfer to the metrics pillar
There is a tempting analogy: "if sampling reduces trace cost, surely a similar trick reduces metrics cost". The analogy fails in three specific ways, and naming each one is the bridge into Part 6.
The first failure is the per-event vs per-series distinction. A sampler operates on events — one trace per request, one log per write, one span per service-call. The metrics pillar's atomic unit is not the request; it is the series. A series exists whether anyone calls it or not — once the metric is registered with its label set, the TSDB allocates the row, the gossip protocol replicates the series identity, the scrape produces a sample even when the underlying counter is zero. You cannot "sample a series" — every scrape interval produces one sample per series, deterministically. The pillar's cost driver is allocation, not traffic.
The second failure is the aggregation layer mismatch. A sampler can drop 99% of events because the kept 1% is statistically representative — you reconstruct population statistics from the sample. Metrics already are aggregations: request_duration_seconds_bucket{le="0.5"} is a counter of how many requests took less than 500ms, summed over all requests in the scrape interval. Dropping 99% of those scrapes (i.e., scraping every 25 minutes instead of every 15 seconds) reduces the series count by exactly zero — there are still the same number of series — and the 25-minute granularity makes alerts that depend on 5-minute windows useless. Metrics sampling, naively applied, breaks the alerting model the metrics exist to support.
The third failure is the cardinality vs volume orthogonality. Sampling reduces volume — events per second. Cardinality is series count — distinct label sets. A fleet at 100K RPS with 100 series emits 100 samples per scrape; a fleet at 100 RPS with 100,000 series emits 100,000 samples per scrape — the second fleet carries 1,000× less traffic yet costs 1,000× more to store, because the index size, not the request rate, is the bottleneck. Sampling the requests does not change the series count; only changing the labels does. The two cost levers are independent.
A fourth observation worth naming: there is one trick that does transfer — exemplars. An exemplar is a high-cardinality detail (e.g., a trace_id) attached to a metric's bucket increment, but stored separately from the metric's series structure. The bucket counter is still one series per (le, method, route, status) cross-product; the exemplar is a side-table that holds, for example, the trace_id of the last request that fell into the le=0.5 bucket. The exemplar lets a debug session jump from "p99 latency spiked" (the metric) to "here is a specific trace that took 1.2 seconds" (the exemplar's trace_id). The cardinality budget pays only for the metric's series; the exemplar storage scales with samples per second, not with the cross-product. Exemplars are the metric pillar's answer to "I want per-customer detail without per-customer series". This is the bridge Part 6 builds in detail; the wall this chapter names is what makes the bridge necessary.
A fifth observation that traps teams: scrape-rate sampling is not series-count sampling. A team panicked by a metrics bill sometimes reaches for "scrape every 60s instead of every 15s" thinking it is the equivalent of trace sampling. The bill drops by 4× in some pricing models (per-sample storage), but stays unchanged in the more common per-active-series-per-month model. Why the latter dominates: vendors price on series-month because the dominant cost component for them is the index — the structure that maps (metric_name, label_set) → series_id — and the index size is a function of distinct series, not samples. A 60-second scrape interval has the same index size as a 15-second scrape interval; only the per-series sample count changes. The developer's instinct to "scrape less" addresses the wrong dimension. Worse, scraping less breaks alerting: a 5-minute burn-rate alert with 60-second scrapes has only 5 data points to evaluate; the alert flaps on noise that 15-second scrapes would have averaged out. Scrape-rate is a dimension you can tune, but only after you have already fixed cardinality — and the fix for cardinality is in Part 6's toolkit, not Part 5's.
A subtler point worth pausing on: the simulation assumes the labels are added one at a time across a quarter, but real fleets often add multiple labels in a single PR — "let me add pod_name, region, and tenant_tier together so the new monitoring dashboard works end-to-end". The cumulative growth is the same (the math is multiplicative regardless of when each label is added), but the detection is different. If a team adds three labels in one PR, the daily-diff alert fires once with a 3,600× growth signature; the on-call investigates one anomaly. If they add them one at a time across three weeks, the alert fires three times with 240×, 5×, 3× signatures and the on-call gets fatigued and stops investigating after the second one. The empirical lesson from teams that have shipped daily-diff alerts: batching label changes into single PRs is operationally cheaper than spreading them out, even though the spread-out approach feels more conservative. The conservative-feeling approach hides the cumulative cost.
A nuance that catches teams: the cross-product is only over labels that actually co-occur. If region=ap-south-1 only ever appears with tenant_tier=enterprise (because the enterprise tier is geo-restricted to India) then those two labels do not produce the full 5 × 3 = 15-fold expansion; they produce only the nine combinations that actually occur — the single enterprise cell plus free and pro across the other four regions — because most of the matrix is never populated. Real fleets sit between these extremes: most label pairs co-occur partially, and the empirical series count lands somewhere between a floor set by which combinations actually co-occur and the multiplicative ceiling. The right tool to measure your fleet's actual position is count by (label_a, label_b) ({__name__="metric_name"}) against your production Prometheus — the result is the actual number of (label_a, label_b) pairs the fleet ever emits, not the theoretical maximum (the snippet below shows one way to run the comparison). Fleets in early growth show series counts close to the multiplicative ceiling because the cross-product is fully populated; mature fleets with selective labelling show series counts at 30-60% of the ceiling. The death spiral is real either way — but knowing whether your fleet is at the ceiling or the floor changes the order in which you fix things. Fix the labels nearest the ceiling first; their cross-product is fully realised in the bill, so constraining or removing them buys the largest reduction.
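A quick way to run that comparison against a live fleet — a sketch in the style of the reproducibility footer below, assuming a Prometheus reachable at prometheus:9090 and using region and tenant_tier as the label pair; substitute your own host, metric, and labels:
# Actual number of (region, tenant_tier) pairs the fleet ever emits for one metric:
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=count(count by (region, tenant_tier) (http_requests_total))' \
  | jq -r '.data.result[0].value[1]'
# Theoretical ceiling for the same pair — distinct regions × distinct tiers:
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=count(count by (region) (http_requests_total)) * count(count by (tenant_tier) (http_requests_total))' \
  | jq -r '.data.result[0].value[1]'
# If the first number is far below the second, the pair is sparsely populated and the
# worst-case multiplication overstates what you are actually paying for.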
Five lived patterns Indian teams ship after they hit the wall
The metrics-bill incident is so common that the recovery playbook is roughly the same across teams. Five patterns recur across Razorpay, Hotstar, Zerodha, Flipkart, PhonePe.
Pattern 1 — the cardinality budget enforced in CI. Every metric in the fleet has a declared max_cardinality in a YAML file checked into the repo. A CI job runs promtool (or an in-house equivalent) against the metric definition and computes the theoretical maximum series — Π cardinality(label_i) — using the registered set of allowed label values. If the theoretical max exceeds the declared budget, the build fails. The PR cannot merge without either lowering the budget, removing a label, or constraining a label's allowed values (region: enum[ap-south-1, ap-south-2, us-east-1] rather than region: any string). Razorpay's platform team ships this as a Bazel rule; the rule blocked 14 high-cardinality additions in 2024, each of which would have been a quarterly-bill incident.
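A minimal sketch of such a CI gate, assuming the metric definitions live in a metrics.yaml whose field names (name, max_cardinality, histogram_buckets, labels, allowed_values) are illustrative rather than any team's actual format:
# check_cardinality_budget.py — fail the build when a metric's theoretical series ceiling
# exceeds its declared budget. Illustrative sketch; the metrics.yaml layout is assumed:
#   metrics:
#     - name: request_duration_seconds
#       max_cardinality: 50000
#       histogram_buckets: 12
#       labels:
#         - name: route
#           allowed_values: ["/checkout", "/pay", "/refund"]
#         - name: status
#           allowed_values: ["2xx", "4xx", "5xx"]
# pip install pyyaml
import sys
import yaml

def theoretical_ceiling(metric):
    """Π cardinality(label_i), times the bucket count for classic histograms."""
    ceiling = metric.get("histogram_buckets", 1)
    for label in metric.get("labels", []):
        values = label.get("allowed_values")
        if values is None:
            return float("inf")  # unbounded label: always over budget
        ceiling *= len(values)
    return ceiling

def main(path):
    with open(path) as f:
        doc = yaml.safe_load(f)
    failures = []
    for metric in doc.get("metrics", []):
        ceiling = theoretical_ceiling(metric)
        budget = metric["max_cardinality"]
        if ceiling > budget:
            failures.append(f"{metric['name']}: ceiling {ceiling} exceeds budget {budget}")
    for failure in failures:
        print(f"CARDINALITY BUDGET EXCEEDED: {failure}", file=sys.stderr)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "metrics.yaml"))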
Pattern 2 — the histogram bucket count audit. A classic Prometheus histogram defaults to 12 buckets (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, +Inf). Every histogram metric multiplies its underlying series count by 12. A fleet with 200 histograms and a 240-pod cardinality multiplier is paying for 200 × 12 × 240 = 576,000 series just for the bucket structure. The audit replaces classic histograms with native histograms (Prometheus 2.40+, single-series exponential histograms), where the bucket-count multiplier collapses to 1. PhonePe migrated their UPI latency histograms in Q3 2024 and reduced histogram-driven series by 87%; the migration took six weeks because every alert and dashboard panel had to be rewritten — histogram_quantile is called on the native histogram series directly instead of on a sum by (le) over the _bucket series.
Pattern 3 — the high-cardinality label moved to exemplars. The customer_id, trace_id, request_id, session_id labels are the canonical death-spiral causes. The pattern is to never put these in the metric's label set at all — instead, attach them as exemplars on the histogram bucket increment. The metric's series count stays at the (method, route, status) baseline; the exemplar storage handles the high-cardinality detail. The dashboard panel for "p99 latency by customer" is built differently: the panel queries the metric for the aggregate p99, and a separate panel pulls exemplars for the most-recent traces from the high-latency buckets. The reader gets a representative trace per customer-tier rather than a per-customer time series, which is what they actually needed for debugging anyway. Cred shipped this in 2024 after their first cardinality incident.
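A sketch of the instrumentation side in Python, assuming a recent prometheus_client (exemplars are exposed only over the OpenMetrics exposition format, and Prometheus must run with --enable-feature=exemplar-storage to retain them); metric and label names are illustrative:
# exemplars_not_labels.py — keep customer_id out of the series cross-product;
# attach it, with the trace_id, as an exemplar on the bucket increment instead.
from prometheus_client import Histogram

REQUEST_DURATION = Histogram(
    "request_duration_seconds",
    "Request latency",
    ["method", "route", "status"],  # bounded labels only — this is the series budget
)

def observe_request(method, route, status, seconds, trace_id, customer_id):
    REQUEST_DURATION.labels(method, route, status).observe(
        seconds,
        # The exemplar rides alongside the sample; it is stored per bucket, not
        # multiplied into the series count. Exemplar labels are size-limited
        # (128 UTF-8 chars across names and values), so keep them to identifiers.
        exemplar={"trace_id": trace_id, "customer_id": customer_id},
    )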
Pattern 4 — the recording-rule pre-aggregation. A high-cardinality metric kept "raw" produces an unsustainable bill, but a pre-aggregated version of it can produce useful insight at a fraction of the cost. A recording rule of the form sum by (route, status) (rate(http_requests_total[5m])) evaluates every 30 seconds, writes one new metric (http_requests_5m:sum_by_route_status) with the cross-product of (route, status) only — dropping pod_name, region, tenant_tier. The downstream dashboards query the pre-aggregated metric. The original raw metric's retention can be dropped to 4 hours (just enough for the recording rule to consume) while the pre-aggregated metric retains 30 days. Hotstar's IPL dashboards run this pattern; the raw-metric storage is 4 hours, the recording-rule output is 90 days, and the bill is 1/40th of the un-pre-aggregated equivalent.
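A sketch of the rule file for this pattern, reusing the metric and output name from the paragraph above; the group name and evaluation interval are illustrative:
# recording_rules.yaml — pre-aggregate the raw high-cardinality metric into a cheap,
# long-retention aggregate that dashboards query instead of the raw series.
groups:
  - name: pre_aggregation
    interval: 30s
    rules:
      - record: http_requests_5m:sum_by_route_status
        expr: sum by (route, status) (rate(http_requests_total[5m]))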
Pattern 5 — the cardinality dashboard with daily diff alerts. The platform team runs a Grafana dashboard whose top panel is count by (job) ({__name__=~".+"}) — the per-scrape-job series count over time. A second panel is topk(20, count by (__name__) ({__name__=~".+"})) — the highest-cardinality metrics. A third panel lists the most-multiplied labels — PromQL cannot group by label name, so this panel reads Prometheus's /api/v1/status/tsdb cardinality statistics (or the vendor's cardinality-management view). An alert fires when the active series count grows by more than 100K in 24 hours — delta(prometheus_tsdb_head_series[24h]) > 100000, or a per-job equivalent built on scrape_samples_scraped — and pages the platform team; a sketch of the rule appears below. The on-call goes to the dashboard, finds the offending metric, walks back to the PR that introduced it, and either rolls back or schedules a discussion. Zerodha implemented this after their 2023 KYC-service cardinality incident; the daily-diff alert has caught 8 incidents in the 18 months since, none of which became billing incidents because the catch was at week 1 of the death spiral, not week 13.
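A sketch of the daily-diff alert as a Prometheus rule file; the threshold, wait duration, and severity label are illustrative:
# cardinality_alerts.yaml — page when active series grow by more than 100K in 24 hours.
groups:
  - name: cardinality_watch
    rules:
      - alert: ActiveSeriesGrowth24h
        expr: delta(prometheus_tsdb_head_series[24h]) > 100000
        for: 30m
        labels:
          severity: page
        annotations:
          summary: "Active series grew by {{ $value }} in 24h"
          description: "Check the cardinality dashboard for the offending metric and the PR that introduced the new label."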
A sixth pattern worth a paragraph: service-team-owned label whitelists. Each service team declares — in their service's manifest — the labels their metrics are allowed to carry. The platform team's Prometheus relabel rules enforce the whitelist by dropping any label not on the list before the metric is ingested. This catches the case where a developer adds a label in code but forgets to update the manifest; the label is silently stripped, the developer notices in their dashboard ("why isn't customer_id showing up?"), checks the manifest, and either adds it (going through the cardinality-review process) or removes it from the code. The discipline is opt-in labels, not opt-out — by default a label does nothing; only labels explicitly registered are allowed through. Razorpay's relabel-rule generator is in their internal "obs-platform" repo; teams that want to add a new label file a PR against the manifest and the cardinality-review automation runs on the PR.
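A sketch of the enforcement side as a metric_relabel_configs fragment; the job name and whitelisted labels are illustrative, and the regex must keep the labels Prometheus itself relies on (__name__, job, instance, le, quantile) or the surviving series lose their identity:
# prometheus.yml fragment — opt-in label whitelist enforced at ingestion for one service.
scrape_configs:
  - job_name: payments
    kubernetes_sd_configs:
      - role: pod
    metric_relabel_configs:
      # Keep only the registered labels; anything else a developer adds in code is
      # silently stripped before it can become a new series dimension.
      - action: labelkeep
        regex: "__name__|job|instance|le|quantile|method|route|status|region|tenant_tier"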
A seventh pattern, less talked about but appearing in three of the five teams: per-tenant cardinality quotas. A multi-tenant SaaS metrics layer (think Mimir or Cortex with the per-tenant API) lets the platform team allocate a series budget to each downstream service team. The team that ships its budget gets a Slack alert when they cross 80%, a hard ingestion-block when they cross 100%, and a billed overage when they explicitly buy more. The ergonomic property: the budget is owned by the team, not the platform. The team can spend their budget on pod_name if they want; they can spend it on customer_id if they accept the bill; they can spend it on more metrics with fewer labels each — the choice is theirs as long as it stays inside the budget. The platform team's job becomes "set the budget", not "approve every label". Flipkart shipped this in 2024 across 110 internal service teams; the per-team cardinality is now a line item in the team's quarterly OKR review, the same way error budget is. The cultural shift is from "platform polices labels" to "teams own their cardinality" — which is what observability-as-a-discipline actually looks like at scale.
An eighth pattern, and the one Indian fintechs converge on after their first incident: the cardinality kill-switch. A service emits a metric, the cardinality grows, the platform team's daily-diff alert fires at week 1, and the on-call needs a way to stop the bleeding within minutes — not wait for a PR review and a rollout. The kill-switch is a Prometheus relabel rule injected at the scrape layer that drops the offending metric or label entirely, applied via a config reload (not a redeploy). The on-call updates a single YAML file, hits Prometheus's /-/reload endpoint (or restarts the server pods where the lifecycle API is disabled), and the runaway metric stops being ingested in under 90 seconds. The historical data is still there; the bleeding stops at "now". Cleartrip's incident playbook lists this as step 1; their on-call has used it three times in 2024, each time saving an estimated ₹4-12 lakh in monthly bill before the underlying fix landed in code review.
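A sketch of the kill-switch fragment itself; metric and label names are illustrative. One caveat worth keeping in the playbook: dropping a label that still distinguishes live series makes the remaining samples collide within a scrape, so the drop-the-whole-metric form is the safer emergency lever:
# Kill-switch: stop ingesting the runaway dimension without touching service code.
metric_relabel_configs:
  # Option 1 — drop the runaway metric entirely (safe, reversible, loses new data only)
  - action: drop
    source_labels: [__name__]
    regex: "request_duration_seconds_bucket|request_duration_seconds_sum|request_duration_seconds_count"
  # Option 2 — strip just the offending label; only safe if the remaining labels still
  # uniquely identify each series, otherwise ingestion rejects the duplicate samples
  - action: labeldrop
    regex: "customer_id"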
Three edge cases the cardinality bill hides
The death-spiral simulation captures the headline mechanism. Three edge cases produce bills that are even worse than the multiplicative formula predicts, and each is a trap a team only finds after they have already paid for it.
The first is label churn — labels whose value sets change over time, not labels whose value sets are large at a single moment. Consider a pod_name label on a fleet of 240 Kubernetes pods. At any single instant, the cardinality is 240. But Kubernetes recreates pods on every deploy; over a 30-day retention window, the fleet has cycled through perhaps 4,800 distinct pod names (one fleet replacement per business day, plus the occasional bad rollout). The TSDB stores historical series even after they stop receiving samples — the 240 currently-active pod names plus 4,560 stale ones from earlier in the retention window. The active series count is 240; the billable series count is 4,800. Most vendors price on billable series, not active. The fix is to either drop pod_name (replace with node_name or availability_zone, which churn slower) or to shorten the retention specifically for high-churn metrics. PhonePe's checkout fleet has a 7-day retention on pod_name-labelled metrics and 90-day retention on the same metrics with pod_name stripped via a recording rule — the two-tier retention is what makes the budget work.
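Two PromQL queries make the active-versus-billable gap measurable for a single metric — a sketch using the chapter's example histogram, runnable only if your Prometheus retention covers the 30-day window (the second query is expensive, so run it off-peak):
# Currently-active series for the metric:
count(request_duration_seconds_bucket)
# Series that received at least one sample in the last 30 days — the churn-inflated figure:
count(count_over_time(request_duration_seconds_bucket[30d]))
# The ratio of the second number to the first is the metric's churn multiplier.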
The second is deploy-time cardinality spikes. A canary rollout produces, briefly, two distinct version labels — v2.4.1 and v2.4.2 — each with the full cross-product of pod names, regions, and tenant tiers. During the canary window the series count doubles. If the canary fails and rolls back, the rolled-back series sit in the TSDB for the full retention window, billable, even though no pod is currently emitting them. A team that does six canary rollouts a week accumulates twelve weeks of dead version-labelled series — bills paid for series no living code emits. The fix is the labels-as-API discipline layered with TTL-awareness: the metric's schema declares how long a version-labelled series stays interesting after its last sample, and retention for that metric is set to match. Vendors that bill on recently-active series (Datadog counts a custom metric series only while it is receiving points) age these out on their own; self-hosted Prometheus has no per-label TTL — stale series fall out of the in-memory head within hours but stay on disk, and in any long-term store billed on them, until retention expires — so the practical levers are shorter retention for high-churn metrics or an explicit series-delete call through the TSDB admin API.
The third is cardinality propagation across federated TSDBs. A team running Thanos or Cortex or Mimir has multiple ingester replicas plus a query layer plus a long-term store, and each layer charges differently. A high-cardinality metric ingested at the source replica produces 1.4M series; the federation layer deduplicates across replicas (so the source replica's HA peer does not double-bill); the long-term store compacts older blocks (so the storage cost falls over time). The bill at any single layer is not the bill at the next — and a developer who measures cardinality at the source replica may underestimate the long-term-store cost by 3-5×. Hotstar runs a Thanos federation across three regions and tracks cardinality at four points: per-replica ingester, post-dedup at the query layer, post-compaction in the long-term store, and per-tenant in the per-tenant index. Each layer has its own budget and its own alerting; the team that watches only the source replica gets surprised at quarter-end when the long-term-store bill is 4× the ingester bill. The discipline is whole-stack cardinality observability — the cardinality dashboard from Pattern 5 runs not at one TSDB layer but at every layer the metric flows through.
What this wall closes and what Part 6 opens
Part 5 set out to answer "which events do we keep and which do we drop". The toolkit — head sampling, tail sampling, adaptive sampling, dynamic error-rate sampling, exemplars at the trace boundary — solves the trace and log pillars' cost problems and barely touches the metrics pillar's cost problem. The reader who has internalised every chapter in Part 5 has a working observability stack on traces and logs and an out-of-control metrics bill. The wall is real and the wall is structural.
There is one more honest thing to name about Part 5 before crossing the wall: every sampling architecture in this Build assumes the events are the cost, and assumes a successful sampler is one that drops events while preserving the signal in the kept set. The metrics pillar inverts both assumptions. The events are not the cost — the registrations are. A successful "sampler" for the metrics pillar is one that prevents registrations from happening in the first place — which is not a sampler at all. It is a budget enforced before the metric ships. The vocabulary mismatch is part of why teams reach for sampling tools and find them inadequate; the right tools have different names (relabel rules, recording rules, exemplars, label whitelists) and a different design centre (compile-time enforcement, not runtime decision-making). Part 6's chapters are organised around this re-centring.
Part 6 opens on the other side of this wall. The deliverables are: the cardinality budget as a first-class artefact (Pattern 1 above, but as a discipline, not just CI); the exemplar pattern that lets metrics carry high-cardinality detail without paying the cross-product cost; the recording-rule pre-aggregation that lets raw high-cardinality metrics live for hours and useful aggregates live for months; the labels-as-versioned-API discipline that prevents teams from shipping labels their downstream dashboards do not need. The chapters that follow — cardinality-the-master-variable, the-label-explosion, prometheus-relabel-rules-as-cardinality-firewall, recording-rules-and-when-to-write-them, exemplars-linking-metrics-to-traces — are the engineering ladder out of the death spiral.
A practical sequencing for a team confronting the wall in real time: do not panic-redesign the entire metrics pipeline, because the time pressure produces worse decisions than the bill itself. Stage the response. Week 1: install the cardinality kill-switch (the kill-switch pattern above) and the daily-diff alert; this stops the bleeding without requiring code changes. Week 2-3: audit the top-20 highest-cardinality metrics, classify each as "needed in production" / "needed in dev only" / "delete entirely", and ship the relabel rules that strip the dev-only labels at the production scrape boundary. Week 4-6: introduce the cardinality budget into CI for new metric definitions only — do not retrofit the entire fleet at once, the cost of the migration outweighs the cost of the existing labels. Month 2-3: incrementally migrate histograms to native histograms, service by service, starting with the highest-cardinality offenders. Month 4-6: the labels-as-API discipline lands as the backstop. Six months from incident-day, the team is in a stable state where new labels go through review, old high-cardinality labels are either justified or stripped, and the bill grows linearly with the fleet rather than exponentially with the label count. Razorpay went through this sequencing in 2023-2024; the post-incident report names the six months as the cost of the wall, and the engineering culture change that landed in the same period as the dividend.
The deeper reframing: observability cost is dominated by the choices that look free. A label is one parameter on a function call. A histogram bucket is a number in a config. A pod-name dimension is automatic. Every expensive component of an observability bill is a choice that, in code review, looks like a one-line addition. The discipline Part 6 builds is not "use less observability" — it is price the additions before they ship, and price them in the units that matter. A team that knows the cost of the labels it adds will ship more observability than a team that does not, because the team that knows can defend the budget against finance and the team that does not gets a 47% bill increase and a one-word email asking for an explanation.
Common confusions
- "Sampling reduces my metrics bill the same way it reduces my traces bill." No — sampling reduces per-event cost. Metrics cost is per-series. A sampled service has the same series count as an un-sampled service; what changes is the value the counter increments to, not how many counters exist.
- "Lowering the scrape interval reduces metrics cost." Partially — going from 15s to 60s reduces samples-per-day per series by 4×, which reduces storage proportionally. But the series count is unchanged, and most vendors price per-active-series-per-month rather than per-sample. Lowering the scrape interval also breaks alerts that depend on shorter windows and hides bursts.
- "Histograms are cheaper than gauges." No — histograms are 12× more expensive (one series per bucket × labels cross-product). Native histograms (Prometheus 2.40+) collapse this to one series, but most fleets have not migrated yet.
- "Cardinality only matters for Prometheus self-hosted; vendor TSDBs handle it." No — every TSDB stores one row per unique label set; the index structure is the same. Vendor pricing is
series-monthbecause that is the cost lever they cannot avoid passing through. - "Adding a label only multiplies cost if the label has many values." No — even a 5-value label multiplies all your metrics, which means it multiplies the cumulative cost of every label you have already added. The label might look small in isolation; in combination it is the multiplicative coefficient.
- "You can fix cardinality after the metric ships, or it's just a vendor lock-in trick." Partially on both counts — you can
dropthe label via relabel rules going forward, but the historical data is stored at the high cardinality until retention expires; the bill for the prior weeks is paid. And the cost is real even on self-hosted Prometheus, where it shows up as memory pressure, OOM kills, longer query times, and slower compaction. Vendor pricing exposes the cost to finance; self-hosting hides it in the SRE team's incident queue. Both teams pay.
Going deeper
Native histograms vs classic histograms — the bucket-count multiplier
A classic Prometheus histogram (Histogram(...) in prometheus-client) emits one _bucket series per le boundary. With 12 default buckets and 1,000 base series (no per-bucket multiplier), the metric occupies 12,000 series. Native histograms (Prometheus 2.40+, OTLP exponential histograms, OpenMetrics 2.0) collapse this to one series whose internal representation is a sparse exponential bucket array compressed into a single sample blob. The series count drops to 1,000 — a 12× reduction across every histogram metric. The migration cost is non-trivial: every alert rule that calls histogram_quantile(0.99, sum by (le) (rate(...[5m]))) must be rewritten to histogram_quantile(0.99, sum(rate(...[5m]))), every dashboard panel must be re-pointed, and every downstream tool (Grafana, Datadog ingest) must support the new format. Teams that ship the migration save 80%+ on histogram-driven series; teams that defer the migration pay the 12× tax indefinitely. PhonePe's UPI latency dashboards moved in Q3 2024; the migration took 6 weeks of platform-team effort.
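The query rewrite looks like this, using the chapter's example metric — a sketch that assumes the metric has been re-registered as a native histogram on the instrumentation side:
# Classic histogram: quantile computed from the per-bucket series, grouped by le
histogram_quantile(0.99, sum by (route, le) (rate(request_duration_seconds_bucket[5m])))
# Native histogram: one series per label set, no _bucket suffix, no le grouping
histogram_quantile(0.99, sum by (route) (rate(request_duration_seconds[5m])))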
A subtlety beyond the headline 12× saving: native histograms also fix the bucket-boundary drift problem. Classic histograms hard-code their bucket boundaries at metric-registration time; if the production latency distribution shifts (because a service got faster, or slower, or a new endpoint with different latency joined the metric), the bucket boundaries no longer cover the relevant percentiles well. Re-registering the histogram with new boundaries is a breaking change — historical data has the old buckets, new data has the new buckets, and histogram_quantile cannot mix them. Native histograms use exponentially-spaced buckets that auto-adapt: a service whose p99 grew from 400ms to 4s sees its native histogram automatically allocate higher-resolution buckets in the new range, with no re-registration. The cost saving is the headline; the operational ergonomics is the underrated win.
The __name__ cardinality mistake — when the metric name itself becomes the label
A common anti-pattern: dynamically generating metric names based on runtime data — metric_for_user_riya_total, metric_for_user_rahul_total, etc. Each unique metric name is its own row in the __name__ index, and the cross-product effect applies at the metric-name level too. A naive implementation that mints one metric per customer produces 1.4M distinct __name__ values; the TSDB's metric-name index alone consumes the storage budget. The fix is the discipline rule: metric names are static, defined at compile time, never templated from user data. The runtime-variable detail goes in a label (where the cardinality budget governs it) or in an exemplar (where it scales with samples, not series). Datadog's APM has a per-customer-name guard that warns when more than 1,000 distinct metric names are emitted from a single agent; teams that ship with the warning ignored discover the cost at quarter end.
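The anti-pattern and its fix, sketched in prometheus_client terms; names are illustrative:
# metric_names_are_static.py
from prometheus_client import Counter

# WRONG — one metric name per customer; every distinct __name__ is its own index row:
#   Counter(f"checkout_total_for_{customer_id}", "Per-customer checkouts")

# RIGHT — a single static name; the runtime-variable detail goes into a budgeted,
# low-cardinality label (or into an exemplar, as in the pattern above):
CHECKOUTS = Counter("checkout_total", "Checkouts processed", ["tenant_tier"])
CHECKOUTS.labels("enterprise").inc()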
Why the OTLP collector's metrics processor is your last cardinality firewall
The OTel Collector's transform and attributes processors can drop or hash high-cardinality labels after the SDK emits them but before they are exported to the backend. The pattern: SDK emits the metric with the full label set (the developer's local Prometheus sees it); the collector's attributes/drop_high_cardinality processor strips customer_id, trace_id, etc., before exporting to the production TSDB. The developer's local environment retains the high-cardinality detail for debugging; the production environment pays only the bounded-cardinality bill. This is the last firewall because it is the last layer that touches the metric before it becomes a series. Every team that has shipped a cardinality-firewall collector has caught at least one developer's accidentally-merged high-cardinality label without an incident. The configuration is roughly 30 lines of YAML; the cost of not having it is one quarterly bill.
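A sketch of the collector fragment; the processor name follows the paragraph above, the receiver and exporter are illustrative, and the attributes processor's support for metric data points should be checked against the collector version you run:
# otel-collector-config.yaml fragment — strip high-cardinality attributes before export.
processors:
  batch: {}
  attributes/drop_high_cardinality:
    actions:
      - key: customer_id
        action: delete
      - key: request_id
        action: delete
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/drop_high_cardinality, batch]
      exporters: [prometheusremotewrite]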
A common refinement: the firewall does not just drop high-cardinality labels — it hashes them into a bounded-cardinality bucket. A customer_id with 1.4M values becomes customer_bucket = hash(customer_id) % 1000, producing 1,000 buckets instead of 1.4M. The metric retains some per-customer signal (a customer that consistently lands in a slow bucket is detectable), but the cardinality is bounded. The hash-bucketing pattern is the bridge between "drop the label entirely" and "keep all 1.4M values"; it is what teams reach for when the per-customer signal genuinely matters but the per-customer cost does not. Hotstar's video-start latency metric uses this pattern with 256 user-id buckets — enough to detect "Tamil-Nadu users are seeing 3× p99" without paying for 50M individual user series.
The label-as-API discipline — versioning metrics like database schemas
Metrics are an API. The labels are the API's columns. Adding a label is a schema change; removing a label is a schema change; renaming a label is a schema change. Treating metrics as an API means every label addition goes through a schema-review process — same as adding a column to a database table. Razorpay's process: the metric definition lives in a metrics.yaml file checked into the service repo, the file declares the labels and their allowed value sets, the platform team's review includes a cardinality-budget check, and the CI fails the PR if the projected cardinality exceeds the budget. The discipline pays off six months later when the metric needs to be deprecated and the team can find every dashboard that depends on it (because the dashboards reference the schema'd label names, not arbitrary strings). Teams that treat metrics as ad-hoc emit-and-pray quickly accumulate a pile of orphaned high-cardinality metrics that no one knows how to delete.
Where the discipline most clearly pays off is the deprecation path. A metric with a published schema can be deprecated formally: the schema declares an until date, the dashboards and alerts that depend on the metric are listed in the schema's consumers field, the platform team gets a cron alert 30 days before the deprecation, and the migration to the replacement metric is a tracked engineering task. A metric without a schema cannot be deprecated — no one knows who depends on it, the on-call's runbook references it implicitly, and the only way to "deprecate" is to delete and wait for the bug reports. Most fleets accumulate dozens of un-deprecatable high-cardinality metrics over years; the schema discipline is what makes the cleanup possible. The cleanup itself is the second-largest cost saving (after native histograms) most fleets report — Cleartrip's 2024 cleanup deprecated 47 stale metrics, recovered 22% of their total series count, and reduced their monthly bill by ₹14 lakh.
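A sketch of what such a schema file can look like, including the consumers and until fields that make the deprecation path workable; every field name here illustrates the discipline rather than any team's actual format:
# metrics.yaml — the service's metrics treated as a versioned, reviewable API.
metrics:
  - name: request_duration_seconds
    owner: payments-platform
    max_cardinality: 50000
    histogram_buckets: 12
    labels:
      - name: route
        allowed_values: ["/checkout", "/pay", "/refund"]
      - name: status
        allowed_values: ["2xx", "4xx", "5xx"]
      - name: region
        allowed_values: ["ap-south-1", "ap-south-2", "us-east-1"]
    consumers:
      - dashboards/payments-latency.json
      - alerts/checkout-slo.yaml
  - name: legacy_checkout_latency_ms
    deprecated: true
    until: "2026-03-31"          # platform cron alerts the listed consumers 30 days before this
    replacement: request_duration_seconds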
The blast-radius of a cardinality incident on the alerting plane
A high-cardinality incident does not stop at the bill — it propagates into the alerting plane in three concrete ways. First, alerting rules slow down: a rule like histogram_quantile(0.99, sum by (route, le) (rate(http_request_duration_seconds_bucket[5m]))) evaluates against the underlying series; when those series multiply by 240 because of pod_name, the evaluation cost multiplies by 240 too. Rules that used to evaluate in 200ms now take 48 seconds; if the rule's evaluation interval is 30 seconds, the rule misses its window and alerting goes silent. Second, the recording-rule pipeline backs up: pre-aggregation rules that compute fleet-wide aggregates from raw high-cardinality metrics queue behind each other; the pipeline stalls; alerts that depend on the pre-aggregated metric stop firing. Third, the alertmanager's deduplication breaks: a rule firing across 240 pod_name-labelled series produces 240 distinct alert instances rather than one tier-aggregated alert; the on-call's PagerDuty queue fills with 240 redundant pages and the human signal-to-noise drops to zero. Razorpay's 2024 cardinality incident took out their alerting plane for 22 minutes — the bill was a quarterly story; the alerting outage was the immediate operational story, and it was the alerting outage that drove the post-incident action items.
The mitigating discipline runs in two layers. The first layer is alerting on alerting — the platform team's monitoring monitors itself, with alerts that fire when rule groups take longer to evaluate than their interval (prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds), when rule groups start skipping iterations (increase(prometheus_rule_group_iterations_missed_total[1h]) > 0), and when the ratio of alerts received by Alertmanager to notifications actually sent (alertmanager_alerts_received_total vs alertmanager_notifications_total) blows out, which is what a dedup breakdown looks like from the outside; a sketch of these meta-alerts follows. When any of these meta-alerts fire, the platform team knows the alerting plane itself is degraded — and that customer-facing alerts may be silently failing. The second layer is rule cardinality budgets: the same per-metric budget discipline applied to the rules that consume those metrics; a rule that would produce more than N output series fails review the same way a metric with too many labels does. Both layers are operationally necessary; the second is rarely shipped on day one but is the discipline that prevents the post-incident "we did not know our alerting plane was dying" finding.
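A sketch of the first layer as a rule file; thresholds are illustrative, and the metric names are the ones Prometheus exposes about its own rule-evaluation path:
# meta_alerts.yaml — the monitoring monitors itself.
groups:
  - name: alerting_on_alerting
    rules:
      - alert: RuleGroupSlowerThanInterval
        expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Rule group {{ $labels.rule_group }} takes longer to evaluate than its interval"
      - alert: RuleGroupIterationsMissed
        expr: increase(prometheus_rule_group_iterations_missed_total[1h]) > 0
        labels:
          severity: page
        annotations:
          summary: "Rule group {{ $labels.rule_group }} is skipping evaluations"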
Reproducibility footer
# Reproduce the cardinality death-spiral simulation on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pandas
python3 cardinality_death_spiral.py
# Expected: a five-row table tracking the fleet's series count and monthly INR cost
# across four label additions over a quarter. Week 0 baseline ~345K series at ₹6K/month;
# week 13 with customer_id added blows up to roughly 1.7 quadrillion series — physically impossible to store,
# the TSDB will OOM long before the bill. The exponential is in the multiplications.
# To audit your real fleet, hit Prometheus's /api/v1/series and group by metric name:
# curl -G http://prometheus:9090/api/v1/series --data-urlencode 'match[]={__name__=~".+"}' | jq
Where this leads next
This wall closes Part 5 by naming what sampling cannot fix — and Part 6 opens by treating cardinality as the master variable for cost, the same way Part 5 treated the sampling decision as the master variable for trace volume. The shape of Part 6 mirrors Part 5's: an opener that frames the problem (cardinality-the-master-variable), three chapters on mechanism (label cross-products, the explosion patterns, the relabel-rule firewall), two chapters on tooling (recording rules, exemplars), and a wall chapter that names what the toolkit does not solve and pivots to Part 7 (latency and the tail).
- Cardinality: the master variable — Part 6's opener; the chapter that defines what cardinality is and why it sets every other observability budget.
- Exemplars: linking metrics to traces — the bridge pattern that lets metrics carry high-cardinality detail without paying the cross-product cost.
- Why three pillars is a flawed framing — profiles, events, SLOs — the framing chapter; cardinality is a property of the metrics pillar specifically and shapes the per-pillar trade-offs.
- Dynamic sampling based on error rate — the previous chapter; the most sophisticated trace-pillar sampler, which still does nothing for the metrics bill.
- Why you can't collect everything — Part 5's opener; the original framing of the budget problem that this wall completes.
The single most useful thing the senior reader walks away with: sampling is the trace pillar's cost lever; cardinality budgeting is the metrics pillar's cost lever, and they are not interchangeable. A team that has shipped excellent sampling and treated their metrics labels as free will discover, on the next quarterly review, that the metrics bill is the larger of the two. The tools to fix it are in Part 6, not Part 5; the discipline to apply them is the engineering culture chapter at the end of the curriculum.
A team that has hit the cardinality wall and survived has earned three pieces of equipment: a cardinality budget enforced in CI, an exemplar-based dashboard for the high-cardinality questions that used to require high-cardinality labels, and a daily-diff alert that catches new explosions in week 1 instead of week 13. Part 6's deliverables are these three artefacts. Walk past the wall by building them.
The reader from a smaller team — five engineers, one staging cluster, no dedicated platform team — sometimes reads chapters like this and concludes "this does not apply to us, our scale is small". The conclusion is wrong in a specific way: the multiplicative formula does not depend on absolute scale, it depends on the ratio of cardinality to budget. A 5-engineer team running a 10-pod fleet on a free-tier observability vendor (Grafana Cloud's free tier, ~10K active series) hits the wall just as fast as a 500-engineer team on Datadog — they just hit it at smaller absolute numbers. The free tier is the budget, the labels are the multipliers, the death spiral runs the same. The smaller team often hits the wall harder because their free tier degrades silently (Grafana Cloud throttles instead of bills) and the metrics simply stop being ingested without anyone noticing. The discipline scales down to small fleets; the absolute cost is different, the structural lesson is identical.
The closing reframing: every observability decision has a unit cost and a unit budget. Sampling's unit is the event; cardinality's unit is the series; latency's unit (Part 7) is the bucket-resolution × tail-percentile. The pillars cost what they cost because of these units, and the engineering discipline is to know — for each pillar — which unit you are buying and at what price. Part 5 ends here. Part 6 begins on the other side.
A final practical discipline: the vocabulary of cost belongs in every metrics-related design document. When a developer proposes adding a label, the design document should answer four questions in numbers: what is the maximum cardinality of the new label, what is the projected series-count increase across the fleet (label cardinality × current fleet series count for the affected metrics), what is the projected monthly bill increase at vendor-list-price, and what is the proposed budget for this metric's series count over the next four quarters. A team that answers these four questions before the PR merges does not hit the wall. A team that does not answer them hits the wall on schedule. The discipline is not a process burden; it is the conversation that should have happened informally and didn't, made formal so it cannot be skipped. Every team that has crossed the wall and built the discipline back into their workflow describes the same recovery shape: six weeks of acute remediation, six months of cultural change, and a bill that bends from exponential growth back to linear by the end of the second quarter after the incident. The death spiral is reversible — but only by treating cardinality as the master variable it is, and not as an afterthought to which labels feel useful.
References
- Robust Perception, "Cardinality is Key" (Brian Brazil) — the canonical Prometheus-author post on cardinality. The reference for understanding why series-count is the master cost variable.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022), Ch. 6 — modern treatment of high-cardinality observability; the chapter that frames cardinality as a discipline rather than a constraint.
- Prometheus 2.40 native histograms RFC — the migration target for histogram cardinality reduction; reference for the 12×-collapse mentioned in the Going-deeper section.
- Datadog, "Understanding custom metrics billing" — vendor-side pricing model that exposes the per-series cost lever; the document that turns the cardinality discussion from theoretical to monetary.
- Grafana Cloud, "Active series and metrics billing" — alternative-vendor billing reference; same
series-monthcost lever, different multiplier, useful for comparing vendor positions. - VictoriaMetrics, "High-cardinality metrics" — open-source TSDB perspective on cardinality; the reference for understanding what techniques (relabeling, recording rules, hash-based dedup) work in TSDBs other than Prometheus.
- Cardinality: the master variable — Part 6's opener; the next chapter, which picks up the toolkit Part 5 cannot deliver.
- Exemplars: linking metrics to traces — the bridge pattern; how metrics carry trace-level detail without paying the cross-product cost.