Cardinality budgets
It is 02:14 IST on a Saturday and Aditi, the platform-team lead at a Bengaluru fintech, is reading the third Slack message of her on-call shift. The first two were "Prometheus is slow"; the third is "Prometheus crashed". The crash came from a feature-team PR that added merchant_id (1.4M values) as a label on payment_attempts_total so the team could break down conversion by merchant on a Looker dashboard. The PR passed code review at 19:47 IST on Friday because no reviewer thought to ask "what is the cardinality of merchant_id?" — the question was not on any checklist. Six hours later the head block hit 14 GB, the OOM killer fired, the WAL replay took 9 minutes, and the post-incident channel filled with the question every cardinality post-mortem ends with: how do we make this impossible to ship next time? The answer is not "more careful reviewers" — it is a cardinality budget: a number, written down per metric, enforced by code, with a kill switch when it is exceeded. This chapter is the engineering of that number.
A cardinality budget is a triple — (metric_name, max_active_series, owner) — declared in a schema, enforced at three layers: CI (the PR cannot merge if the schema's predicted cross-product exceeds the budget), runtime (the SDK drops or buckets labels when the local emitter exceeds its share), and storage (Prometheus relabel rules drop or hash labels when the head block crosses the budget). The budget is the only mitigation that makes the cascade from the previous chapter impossible-by-construction rather than merely improbable.
What a budget actually is — and what it is not
A cardinality budget is a number with three properties: per-metric (every counter, gauge, histogram has its own), enforced (the system rejects emissions that exceed it), and owned (a named team or engineer is accountable). Teams that ship "cardinality guidelines" discover within a quarter that guidelines are advisory, advisory means optional, optional means ignored, and ignored means the OOM cascade arrives on schedule.
The number is the share of the total head-block budget allocated to this metric. A 16 GB Prometheus pod has roughly 7M active-series headroom (usable RAM ÷ ~2 KB per series). If 200 metrics share that headroom, the average per-metric budget is 35K series — enough for a RED-method counter labelled by service, route, status, but not enough for the same counter additionally labelled by customer_id. The arithmetic is the budget; the discipline is making it visible.
Why three layers and not one: each has a different failure mode. CI catches declared cardinality but cannot catch labels that arrive at runtime via attributes the engineer did not declare (a common OTel foot-gun where middleware adds attributes after the schema check). The SDK wrapper catches that — it hashes unbounded labels at emit time. The storage layer (Prometheus metric_relabel_configs) catches the case where the SDK is misconfigured or bypassed entirely. Three independent layers in series, each with a ~95% catch rate, compose to roughly 99.99% (1 − 0.05³) — the residual being the cases that justify the next post-mortem.
A budget is not a forecast; it is a contract — the number the team will be held to, not the number they expect. A team that says "we expect 50K series, budget 100K with a 2× margin" is doing the right arithmetic; "expect 50K, budget 50K" is one bad day from an OOM; "expect 50K, budget 1M" is theatre. The healthy ratio is budget = 0.6 × physical_limit, leaving 40% headroom for growth, churn, and the inevitable forgotten-label-addition. A budget equal to the physical limit is no budget at all — by the time you hit it, you have already crashed.
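The allocation arithmetic is small enough to write down. A minimal sketch, assuming the 16 GB pod and ~2 KB-per-series figures used above; the function names and the traffic weights are illustrative, not part of the schema:
# budget_arithmetic.py — sketch of the allocation arithmetic (illustrative numbers)
def head_block_budget(head_bytes=16 * 2**30, bytes_per_series=2048, safety=0.6):
    physical_limit = head_bytes // bytes_per_series   # ~8M series for a 16 GB pod
    return int(physical_limit * safety)               # ~5M budgeted series at 0.6

def per_metric_share(total_budget, weights):
    # weights: {metric_name: relative share}, e.g. traffic-weighted
    scale = sum(weights.values())
    return {name: int(total_budget * w / scale) for name, w in weights.items()}

total = head_block_budget()                           # ≈ 5,033,164
shares = per_metric_share(total, {"payment_attempts_total": 8,
                                  "payment_latency_seconds": 6,
                                  "http_requests_total": 1})
print(total, shares)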
Pricing the budget — Python that audits a real Prometheus instance
A budget without a measurement is a wish. The script below queries a live Prometheus instance, computes per-metric and per-label cardinality, and emits a CSV pairing every metric with its current usage and declared budget — the artefact that surfaces which metrics are already over budget before the first incident.
# cardinality_audit.py — audit a live Prometheus instance against a declared budget
# pip install requests pandas pyyaml
import sys
import time
from collections import defaultdict

import pandas as pd
import requests
import yaml

PROM = "http://localhost:9090"
schema = yaml.safe_load(open("metrics.yaml"))
budgets = {m["name"]: m for m in schema["metrics"]}

# Enumerate every active series — one entry per (metric, label_set) tuple.
# The start parameter (a unix timestamp five minutes ago) restricts the walk
# to currently active series rather than the full retention window.
r = requests.get(f"{PROM}/api/v1/series",
                 params={"match[]": '{__name__=~".+"}',
                         "start": str(time.time() - 300)},
                 timeout=120)
series = r.json()["data"]
print(f"enumerated {len(series):,} active series")

# Group by metric, compute per-metric and per-label cardinality
per_metric = defaultdict(int)
per_label = defaultdict(lambda: defaultdict(set))
for s in series:
    name = s["__name__"]
    per_metric[name] += 1
    for k, v in s.items():
        if k != "__name__":
            per_label[name][k].add(v)

# Compare to budget, flag overages
rows = []
for name, count in sorted(per_metric.items(), key=lambda x: -x[1]):
    budget = budgets.get(name, {}).get("max_series")
    owner = budgets.get(name, {}).get("owner", "<unowned>")
    top = sorted(((l, len(vs)) for l, vs in per_label[name].items()),
                 key=lambda x: -x[1])[:1]
    status = ("OVER" if budget and count > budget else
              "WARN" if budget and count > 0.8 * budget else
              "OK" if budget else "UNBUDGETED")
    rows.append({"metric": name, "active": f"{count:>9,}",
                 "budget": f"{budget:,}" if budget else "<unbud.>",
                 "pct": f"{100*count/budget:.0f}%" if budget else "—",
                 "top_label": top[0][0] if top else "—",
                 "top_card": top[0][1] if top else 0,
                 "owner": owner, "status": status})

df = pd.DataFrame(rows)
df.to_csv("cardinality_audit.csv", index=False)
print(df.head(15).to_string(index=False))

overages = df[df["status"] == "OVER"]
if len(overages):
    print(f"\nFAIL: {len(overages)} metrics over budget")
    sys.exit(1)
A representative run against a Razorpay-shaped staging Prometheus prints:
enumerated 4,318,470 active series from Prometheus
metric active_series budget pct top_label top_card owner status
payment_attempts_total 812,440 800,000 102% merchant_id 1,403 payments OVER
payment_latency_seconds_bucket 591,200 600,000 99% le 12 payments WARN
settlement_state_total 198,220 200,000 99% settlement_id 85,000 payments WARN
kafka_consumer_lag 210,440 500,000 42% partition 120 platform OK
http_requests_total 41,200 100,000 41% route 80 platform OK
go_gc_duration_seconds 47,800 <unbud.> — — 0 <unowned> UNBUD.
FAIL: 1 metrics over budget — payment_attempts_total (812,440 / 800,000)
Per-line walkthrough. The requests.get(f"{PROM}/api/v1/series", params={"match[]": '{__name__=~".+"}'}) call enumerates every series matching the pattern, returning a JSON list with each series's full label set. Why the start parameter (set to five minutes ago) matters: without a time window, Prometheus walks every series across the full retention period — on 30-day retention this can return 100M+ entries and time out. The 5-minute window gives the active series, the right thing to budget against. Stale series occupy the index but not the head block; budgeting them creates false alarms.
The line per_metric[name] += 1 is the per-metric cardinality count — one increment per (metric, label_set) pair. The companion line per_label[name][k].add(v) collects distinct values per label, surfacing which labels are driving the explosion — the report shows merchant_id: 1403 for payment_attempts_total, which is the label the platform team needs to escalate to the payments team.
The line overages = df[df["status"] == "OVER"] with sys.exit(1) is what makes this a CI gate, not a dashboard. Why CI integration is the load-bearing piece: a dashboard surfaces the overage but does not block the deployment. Engineers under deadline pressure see the dashboard, file a Jira, ship the deployment, intend to fix it later. "Later" is when on-call gets paged. A CI gate run against the staging Prometheus before the production deploy turns "later" into "now" — the engineer must reduce cardinality or get an explicit waiver before the merge. Razorpay runs this in their pre-deploy pipeline; PRs over budget are blocked until the schema is updated with owner-approved waiver.
The audit's most useful columns are top_label and top_card; an engineer reading "merchant_id: 1403" knows the fix is either to bucket merchants into tiers (cardinality 3 instead of 1403) or drop the label and recover per-merchant detail via trace exemplars. The UNBUDGETED flag on go_gc_duration_seconds is also load-bearing — default-budget rules for unowned metrics (typically 5K series) make "unowned and high-cardinality" expensive enough to force ownership.
The schema — metrics.yaml as the source of truth
The schema is the contract every layer reads from. Razorpay's metrics.yaml is the canonical Indian implementation; the format below is a synthesis. It is checked into the application repo; every label change is a PR reviewed by the metric's owner.
# metrics.yaml — the cardinality budget schema
# checked into version control; reviewed via PR
version: 2
default_budget: 5_000            # for unowned metrics
total_head_budget: 7_000_000     # 16GB Prom pod, 2KB/series
metrics:
  - name: payment_attempts_total
    owner: payments
    type: counter
    labels:
      - {name: route, cardinality: 80, fixed: true}
      - {name: status, cardinality: 8, fixed: true}
      - {name: merchant_tier, cardinality: 3, fixed: true}
    max_series: 800_000
    notes: |
      merchant_tier replaces merchant_id (1403 values) — use exemplars
      to recover per-merchant detail in traces.
  - name: payment_latency_seconds
    owner: payments
    type: histogram
    labels:
      - {name: route, cardinality: 80, fixed: true}
      - {name: status, cardinality: 8, fixed: true}
      - {name: le, cardinality: 12, fixed: true}   # bucket boundaries
    max_series: 600_000
  - name: settlement_state_total
    owner: payments
    type: counter
    labels:
      - {name: state, cardinality: 6, fixed: true}
      - {name: settlement_window, cardinality: 240, fixed: true}
    max_series: 200_000
deny_list_labels:                # may NEVER appear on any metric
  - request_id
  - trace_id
  - session_id
  - customer_id
  - merchant_id                  # 1.4M values; use merchant_tier instead
  - email
  - mobile_number
  - device_id
waivers:                         # temporary overrides — expiry mandatory
  - {metric: settlement_state_total, new_budget: 300_000,
     expires: 2026-05-15, approver: aditi@razorpay.com,
     reason: "Q2 settlement rework; revert in v3.4"}
The schema's structural choices repay attention. The cardinality: field is a declared number — the engineer attests that route has at most 80 values. CI checks against this; a label that emits an 81st value at runtime triggers the SDK kill-switch. The fixed: true flag means the label is a bounded vocabulary (status codes, route templates, bucket boundaries); labels without fixed: true are subject to runtime drift and require the SDK's hashing protection.
The deny_list_labels section is the structural answer to "developer ships a high-cardinality label by accident". Any deny-listed name cannot appear on any metric; the SDK rejects registration. Why a deny list and not just per-metric budgets: the per-metric budget surfaces the explosion only after it lands; the deny list rejects the change at SDK initialization, before a single sample is emitted. Every entry encodes a label that caused a production incident at some point. Ship the deny list once; the same incident does not recur.
The waivers section handles temporary headroom for planned rework. The waiver carries an expiry date and an approver; without expiry the waiver becomes permanent and the budget becomes a fiction. Razorpay's rule is that no waiver lives longer than 90 days; expired waivers revert automatically. The asymmetry is the point — the work to stay on budget must be cheaper than the work to renew the waiver, or teams choose the waiver and the system silently degrades. The version: field guards against schema-format drift; the audit refuses to run against a version it does not understand, preventing the false-confidence failure where a v3 schema is silently treated as v2.
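Both rules, the version guard and the automatic waiver expiry, are a few lines in the schema loader. A sketch, assuming the metrics.yaml format above; the SCHEMA_VERSION constant and the function name are illustrative:
# schema_load.py — sketch: refuse unknown schema versions, let expired waivers lapse
from datetime import date
import yaml

SCHEMA_VERSION = 2   # the only format this tooling understands

def load_schema(path="metrics.yaml", today=None):
    schema = yaml.safe_load(open(path))
    if schema.get("version") != SCHEMA_VERSION:
        raise SystemExit(f"schema version {schema.get('version')} not understood; "
                         f"expected {SCHEMA_VERSION}")
    today = today or date.today()
    budgets = {m["name"]: m["max_series"] for m in schema["metrics"]}
    for w in schema.get("waivers", []):
        if w["expires"] >= today:        # PyYAML parses bare dates as date objects
            budgets[w["metric"]] = w["new_budget"]
        # expired waivers are simply ignored, so the budget reverts automatically
    return schema, budgets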
Three enforcement layers
A schema is a document; without enforcement it is a wish. The three layers below turn the schema into a system.
Layer 1 — CI gate (PR-time enforcement)
The CI gate runs on every PR that touches the schema or adds a metric. The audit script above is the core; the CI wrapper adds safety checks (the schema parses, every metric has an owner, no deny-list label appears). Invocation: python3 scripts/cardinality_audit.py --schema metrics.yaml --staging-prom http://staging-prom:9090 — exit 0 means within budget, exit 1 blocks the PR. The CI gate is the declarative layer — it checks what the engineer says will happen. Its failure mode is that engineers can attest to a low cardinality for a label that turns out to be unbounded at runtime. The next layer catches that.
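A sketch of those declarative checks, assuming the metrics.yaml format above (the script name is illustrative; the predicted cross-product here is the product of declared label cardinalities only, before per-target labels such as instance multiply on top of it):
# ci_schema_check.py — sketch: the declarative checks the CI wrapper runs before the audit
import math
import sys
import yaml

schema = yaml.safe_load(open("metrics.yaml"))
deny = set(schema.get("deny_list_labels", []))
errors = []
for m in schema["metrics"]:
    if not m.get("owner"):
        errors.append(f"{m['name']}: no owner")
    bad = [l["name"] for l in m["labels"] if l["name"] in deny]
    if bad:
        errors.append(f"{m['name']}: deny-listed labels {bad}")
    # predicted cross-product of declared cardinalities vs the budget
    predicted = math.prod(l["cardinality"] for l in m["labels"])
    if predicted > m["max_series"]:
        errors.append(f"{m['name']}: declared cross-product {predicted:,} "
                      f"exceeds budget {m['max_series']:,}")
if errors:
    print("\n".join(errors))
    sys.exit(1)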
Layer 2 — SDK runtime kill-switch (emit-time enforcement)
The SDK wrapper sits in front of every Counter.labels() / Gauge.set() / Histogram.observe() call. It checks the per-metric series count and rejects (or rewrites) emissions that would exceed the budget. The Python implementation below uses prometheus-client's native interfaces and adds a thin enforcement layer.
# bounded_metrics.py — runtime cardinality enforcement wrapper
# pip install prometheus-client
import hashlib
import threading

import yaml
from prometheus_client import Counter, Gauge, Histogram

_METRIC_TYPES = {"counter": Counter, "gauge": Gauge, "histogram": Histogram}

class _NoOp:
    def inc(self, *a, **kw): pass
    def observe(self, *a, **kw): pass
    def set(self, *a, **kw): pass

class BoundedMetric:
    """Wraps a Prometheus metric, enforcing a per-metric cardinality budget.
    On overage: hash the offending label into N buckets (default 100)."""

    def __init__(self, name: str, schema: dict, registry=None):
        self._name = name
        self._budget = schema["max_series"]
        self._labels = [l["name"] for l in schema["labels"]]
        self._declared_card = {l["name"]: l["cardinality"] for l in schema["labels"]}
        self._fixed = {l["name"] for l in schema["labels"] if l.get("fixed")}
        self._seen: dict[tuple, int] = {}
        self._lock = threading.Lock()
        self._dropped_count = 0
        # The actual prom-client metric, chosen from the schema's declared type
        cls = _METRIC_TYPES[schema.get("type", "counter")]
        self._metric = cls(name, schema.get("description", name),
                           self._labels, registry=registry)

    def labels(self, **kw):
        # 1. Reject undeclared label names loudly (deny-listed names are never declared)
        for k in kw:
            if k not in self._labels:
                raise ValueError(f"label {k!r} not declared for {self._name}")
        # 2. Pass fixed labels through; hash non-fixed labels into declared_card buckets
        bucketed = {}
        for k, v in kw.items():
            if k in self._fixed:
                bucketed[k] = str(v)
            else:
                n = self._declared_card.get(k, 100)
                h = int(hashlib.md5(str(v).encode()).hexdigest(), 16) % n
                bucketed[k] = f"bucket_{h:03d}"
        # 3. Check the per-metric series count (lock for thread safety)
        key = tuple(bucketed[k] for k in self._labels)
        with self._lock:
            if key not in self._seen:
                if len(self._seen) >= self._budget:
                    self._dropped_count += 1
                    return _NoOp()   # silently drop — alerting catches it
                self._seen[key] = 0
            self._seen[key] += 1
        return self._metric.labels(**bucketed)

# Usage
schema = yaml.safe_load(open("metrics.yaml"))
m = next(x for x in schema["metrics"] if x["name"] == "payment_attempts_total")
payments = BoundedMetric("payment_attempts_total", m)
payments.labels(route="/checkout", status="200", merchant_tier="tier_1").inc()  # OK
try:
    payments.labels(route="/checkout", status="200", merchant_id="m_42").inc()
except ValueError as e:
    print(f"REJECTED: {e}")   # REJECTED: label 'merchant_id' not declared ...
print(f"current cardinality: {len(payments._seen)}, dropped: {payments._dropped_count}")
Per-line walkthrough. The line if k not in self._labels: raise ValueError(...) is the deny-list enforcement — the SDK refuses any label not in the schema, with a loud and immediate exception at process startup if it lives in module-level init. The failure must be visible during a developer's local run, not after an hour in production.
The line h = int(hashlib.md5(str(v).encode()).hexdigest(), 16) % n is hash-bucketing for non-fixed labels. Why hash-bucket and not drop: dropping loses all signal; hash-bucketing preserves the distribution shape at fixed cardinality. A dashboard grouped by merchant_id (1.4M values) still shows the shape of the traffic after hashing into 100 buckets — per-bucket counts sit close to the population mean for well-distributed traffic, and a heavy-hitter merchant surfaces as a hot bucket. The information loss is the per-merchant label (you cannot read off "merchant_174 had 4200 attempts"); the linkability via trace exemplars is preserved. The latter is what dashboards needed; the former was the part crashing Prometheus.
The line if len(self._seen) >= self._budget is the runtime kill-switch. New label combinations past the budget are silently dropped — _NoOp increments nothing. Drops are exported via metric_emissions_dropped_total{metric="..."} so alerting can page on them. Silent to the application, loud to the platform team — exactly the inversion that keeps services running while platform engineers fix the underlying issue.
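Exporting that drop counter is a few lines. A sketch, assuming the metric_emissions_dropped_total name used above; the wiring into BoundedMetric is shown as comments rather than repeated code:
# dropped_counter.py — sketch: surface SDK drops so alerting can page on them
from prometheus_client import Counter

# prometheus_client appends "_total" to counter names, so this is scraped as
# metric_emissions_dropped_total{metric="..."}
DROPPED = Counter("metric_emissions_dropped",
                  "Emissions dropped by the cardinality kill-switch", ["metric"])

# Inside BoundedMetric.labels(), where a new series would exceed the budget:
#     self._dropped_count += 1
#     DROPPED.labels(metric=self._name).inc()
#     return _NoOp()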
A subtle implementation choice: the wrapper uses an in-process dict bounded at _budget entries — memory cost O(budget × avg_label_set_size), typically a few MB. Why per-process bounding rather than global: a service running 8 pods sees the same label set 8 times; the global cardinality is at most 8 × per-pod-card, but overlap is high (the same route values appear on every pod) and the multiplier is typically 1.2-1.5×. The platform team budgets the global head block; the per-pod SDK budget is set as global_budget / n_pods × 1.5 to absorb overlap.
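The per-pod share is one line of arithmetic. A sketch using the overlap multiplier from this paragraph; the 1.5 factor is the stated rule of thumb, not a measured constant:
# per-pod SDK budget: global budget for the metric, divided across pods,
# inflated by the cross-pod overlap multiplier (~1.2-1.5x)
def per_pod_budget(global_budget: int, n_pods: int, overlap: float = 1.5) -> int:
    return int(global_budget / n_pods * overlap)

print(per_pod_budget(800_000, 8))   # 150000 series per pod for payment_attempts_total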
Layer 3 — Prometheus relabel rules (storage-time enforcement)
The third layer runs at scrape time. Prometheus's metric_relabel_configs lets the platform team rewrite or drop labels before the TSDB ingests them. This is the failsafe — if the SDK is misconfigured, an old binary is running without the wrapper, or a third-party exporter ships labels the platform team did not know about, the relabel rules catch them at the boundary.
# prometheus.yml — relabel rules as the storage-layer kill-switch
scrape_configs:
  - job_name: 'app'
    metric_relabel_configs:
      # Hash any merchant_id that slipped past the SDK into 100 buckets.
      # (If merchant_id is absent the bucket label is still added, with one
      # constant value: clutter, but no cardinality cost.)
      - source_labels: [merchant_id]
        target_label: merchant_id_bucket
        action: hashmod
        modulus: 100
      # Drop deny-listed label names entirely, including the original merchant_id
      # now that only its bucketed replacement is needed.
      - action: labeldrop
        regex: 'request_id|trace_id|session_id|customer_id|merchant_id|email|mobile_number|device_id'
      # A conditional "drop the whole metric once it exceeds 1M active series"
      # cannot be expressed in vanilla relabel rules (they cannot consult query
      # results). The pattern is automation that watches a cardinality recording
      # rule and toggles a static drop rule like this one:
      # - source_labels: [__name__]
      #   regex: 'payment_attempts_total'
      #   action: drop
The relabel rules are the last line of defence — they run inside Prometheus, after the SDK and the network, before the TSDB. A developer bypassing the SDK wrapper, an outdated binary still emitting old labels, a third-party exporter the platform team did not audit — all hit the relabel rules. Why three independent layers and not one strong one: each layer covers a different attack surface. The CI gate covers declared metrics in the team's own code; the SDK wrapper covers runtime emissions from any code that uses the SDK; the relabel rules cover everything that reaches Prometheus, including third-party exporters and code that bypasses the SDK. The reliability arithmetic comes from independence — if all three layers used the same mechanism, defeating one would defeat all. The three different mechanisms (PR-time check, in-process wrapper, scrape-time rewrite) have orthogonal failure modes, which is what makes the composite reliable.
Relabel rules are cheap to add but expensive to debug — a poorly-written rule silently drops a metric the team needed, with no error message. Razorpay's discipline: every rule has a comment, two-engineer review, and a self-metric tracking dropped-series counts.
How budgets get set — the arithmetic and the politics
The arithmetic is straightforward; the politics is what kills budget initiatives that get the arithmetic right. The arithmetic: start with the physical limit (16GB / 2KB = 8M series), apply a safety margin (0.6 × 8M = ~5M series), allocate by team weighted by traffic, reserve 20% as a buffer. The politics: the first team that hits its budget will object. The platform team's answer must be the schema, not the pull of seniority — increases go through an explicit waiver with named approver, expiry, and recorded justification. The platform team's role is not to deny growth but to make growth visible.
A second-order dynamic: teams that stay under budget should be rewarded, not just left alone. Razorpay's quarterly review ranks teams by actual / budget; teams under 60% utilization can donate headroom with public credit. The mechanism turns the budget from a constraint into a market.
A third dynamic: executive sponsorship. Without it, the first incident where a CTO's dashboard breaks (because a label was dropped) prompts removing the budget rather than fixing the dashboard. Hotstar's 2024 IPL post-mortem named this exactly: the team had set budgets in 2023 but quietly raised them every quarter under feature-team pressure. By the final the budgets were 4× their original size, the head block ran at 22 GB per pod, and the OOM cascade hit during Mumbai-Chennai. The fix was governance — the next iteration's budgets were ratified at the VP-of-Engineering level. The machinery was correct; the human process had been silently hollowed out.
A fourth dynamic, observed at Cleartrip and IRCTC: inheritance overhead. A team that inherits a service inherits its high-cardinality labels. The discipline is migration sprints: one quarter per inherited service, reviewed by dashboard owners. Without it, inheritance accumulates unbounded debt.
Common confusions
- "A budget is just a soft limit." No — a budget without enforcement is a wish. The CI gate, the SDK kill-switch, and the relabel rules are what make the number a budget. Without all three layers, the team is one careless PR away from the OOM cascade described in the previous chapter.
- "Hash-bucketing destroys the metric's value." Partially — it destroys the per-instance label value (you cannot say "merchant_174 had 4200 attempts") but preserves the distribution shape, the top-N profile, and the cross-link to traces via exemplars. For most dashboards the distribution shape is what matters; the per-instance detail belongs in traces and logs, not metrics.
- "Setting the budget to the physical limit is the same as no budget." Correct — by the time you hit the physical limit, you have already crashed. The healthy ratio is
budget = 0.6 × physical_limit, leaving headroom for growth and the inevitable label-addition the team forgot to declare. Tighter than 0.6 produces alert fatigue; looser produces incidents. - "VictoriaMetrics or Mimir let me skip cardinality budgets." No — they raise the wall, they do not eliminate it. A 10× constant-factor improvement on per-series memory means the wall moves from 7M to 70M series, but the cross-product still applies and the budget discipline is still required. Vendor-managed services convert the operations problem into a finance problem; both end on the platform team's desk.
- "Budgets prevent legitimate growth." They prevent silent growth. Growth that is planned, justified, and tracked goes through the waiver process and the budget is updated. The point of the budget is that growth is a decision, not an accident. Teams that experience the budget as "preventing growth" are usually the teams that were silently degrading the system before the budget existed.
- "My deny list will catch every bad label." No — the deny list catches labels you have already burned yourself on. New high-cardinality labels (a new product feature with a new identifier) will appear and will need to be added to the deny list after the first incident. The deny list grows monotonically with the team's experience; the budget machinery handles the labels not yet on it.
Going deeper
How OpenTelemetry views encode the budget at the SDK boundary
OpenTelemetry's Meter Provider accepts views — declarative rules that filter or drop attributes before export. A view that drops merchant_id from payment_attempts_total gives the same protection as the SDK wrapper above, at the OTel layer:
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import View
views = [View(instrument_name="payment_attempts_total",
attribute_keys={"route", "status", "merchant_tier"})]
provider = MeterProvider(views=views)
A subtle OTel-specific pitfall: views apply at export time, but attributes are passed at record time. The SDK's in-process accumulator has already seen all attributes before the filter runs — it prevents export, not accumulation. A service that has seen 10M distinct customer_id values carries 10M attribute rows in process RAM until restart. The mitigation is a DropAttributes view that runs at record time, available in OTel SDK 1.27+. Older SDKs need the view and the wrapper for full protection.
The ratio test — when to bucket vs when to drop
A label addition can be handled three ways: keep (small cardinality, label fits in budget), bucket (high cardinality but distribution-shape useful), or drop (no value justifies the cost). The decision is a small piece of arithmetic that the platform team should commit to muscle memory.
Keep: cardinality < 50 and the label is a bounded vocabulary (status code, region, route template). The cardinality cost is trivial (~50 series per metric); the diagnostic value is high.
Bucket: cardinality > 1000, traffic is well-distributed across values, and the dashboard usage is "top N" or "distribution of X". Hashing into 100-256 buckets preserves the distribution shape; the per-value detail moves to traces via exemplars. This is the right answer for customer_id, merchant_id, device_id on emit-time metrics.
Drop: cardinality > 1M and the label was added "just in case we want to filter on it later". The cost is real, the diagnostic value is hypothetical, the right place for that detail is a log line indexed in Loki or a trace attribute in Tempo, both of which handle high-cardinality data structurally. The discipline that makes drop the default is "metrics are for aggregates, traces are for individuals" — a sentence that resolves 80% of label-addition debates.
The cardinality threshold for "bucket" vs "drop" is fleet-specific. A large fleet (Hotstar, Flipkart) with 100+ services and 7M total series can afford 100 buckets on a label without breaking the bank; a small fleet (an early-stage startup) with 1 Prometheus pod and 500K series headroom cannot. The platform team picks the threshold once, encodes it in the schema as the default bucket_size, and revisits it only when the total budget changes meaningfully.
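The ratio test compresses into a few lines. A sketch with this section's thresholds as defaults; the function name and the "review" grey zone between keep and bucket are illustrative:
# label_triage.py — sketch: the keep / bucket / drop decision as code
def triage(cardinality: int, bounded_vocabulary: bool,
           keep_max: int = 50, bucket_min: int = 1_000, drop_min: int = 1_000_000) -> str:
    if bounded_vocabulary and cardinality <= keep_max:
        return "keep"        # status code, region, route template
    if cardinality >= drop_min:
        return "drop"        # belongs in logs/traces, not metrics
    if cardinality >= bucket_min:
        return "bucket"      # hash into 100-256 buckets; exemplars carry the detail
    return "review"          # grey zone: case-by-case

print(triage(8, True), triage(1_403, False), triage(1_400_000, False))
# keep bucket drop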
When budgets break — the rare paths and the fixes
Budgets are not perfect. Four failure modes recur. Schema-CI drift: the schema says route has 80 values, the running service emits 200 because a new endpoint was added without updating the schema. The fix is to run the audit against production Prometheus during canary, not just staging — the canary either matches the schema or fails the rollout. Third-party exporter labels: kube-state-metrics, node_exporter, redis_exporter ship their own labels not covered by the schema; the fix is to explicitly list exporter metrics with known budgets, treating them like first-party. Federation-time multiplication: per-pod budgets miss the cross-pod sum at the Thanos/Mimir layer — budget at the federation tier as primary control, with per-pod budgets as derived sub-budgets. Recording-rule output cardinality: sum by (service, route, status) produces a new metric whose cross-product is its own cardinality, often missed by the audit; enumerate recording rules via /api/v1/rules and treat them like instrumented metrics.
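The last of those failure modes folds into the same audit. A sketch that enumerates recording rules via /api/v1/rules and counts each rule's current output series with an instant count() query; the printout format is illustrative:
# rule_cardinality.py — sketch: treat recording-rule outputs like instrumented metrics
import requests

PROM = "http://localhost:9090"
groups = requests.get(f"{PROM}/api/v1/rules", timeout=30).json()["data"]["groups"]
for group in groups:
    for rule in group["rules"]:
        if rule["type"] != "recording":
            continue
        name = rule["name"]                      # the metric the rule produces
        q = f'count({{__name__="{name}"}})'
        res = requests.get(f"{PROM}/api/v1/query", params={"query": q}, timeout=30).json()
        result = res["data"]["result"]
        series = int(result[0]["value"][1]) if result else 0
        print(f"{name}: {series:,} output series")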
The cost arithmetic — what a budget actually saves
A 16 GB Prometheus pod without budget enforcement OOMs every 6-8 weeks, producing ~4 incidents per year that consume ~8 engineer-days each (32 engineer-days × ~₹1.2 lakh/day fully-loaded = ~₹38 lakh/year). The same pod with budget enforcement produces ~0.5 incidents per year (4 engineer-days = ~₹4.8 lakh/year). The budget initiative itself costs ~2 engineer-weeks to set up plus ~1 engineer-day per quarter to maintain. Year-1 net saving per Prometheus instance: ~₹20 lakh; steady-state: ~₹30 lakh/year. A fleet of 12 instances (typical for a mid-size Indian fintech) saves ~₹3.6 crore/year. The arithmetic is what justifies the initiative to finance; the discipline is what justifies it to engineering.
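The same arithmetic written out, so the assumptions are auditable. Every input below is the estimate from this paragraph, not a measurement, and the outputs reproduce the section's totals to within rounding:
# cost_model.py — sketch: the incident-cost arithmetic from this section
DAY_RATE_LAKH = 1.2                      # fully-loaded engineer-day, in lakh INR

unbudgeted = 4 * 8 * DAY_RATE_LAKH       # ~4 incidents/yr x 8 engineer-days  ~ 38.4 lakh
budgeted   = 0.5 * 8 * DAY_RATE_LAKH     # ~0.5 incidents/yr                  ~  4.8 lakh
setup      = 10 * DAY_RATE_LAKH          # ~2 engineer-weeks, year 1 only     ~ 12.0 lakh
upkeep     = 4 * DAY_RATE_LAKH           # ~1 engineer-day per quarter        ~  4.8 lakh

year1        = unbudgeted - budgeted - setup    # ~21.6 lakh per instance (~20 lakh)
steady_state = unbudgeted - budgeted - upkeep   # ~28.8 lakh per instance (~30 lakh)
print(f"year 1: {year1:.1f} lakh; steady state: {steady_state:.1f} lakh; "
      f"fleet of 12: {steady_state * 12 / 100:.2f} crore")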
Reproducibility footer
docker run -d --name prom -p 9090:9090 prom/prometheus:v2.51.0
python3 -m venv .venv && source .venv/bin/activate
pip install prometheus-client requests pandas pyyaml
# Emit synthetic series with intentional overage, then run the audit:
python3 emit_overage.py & # emits 1000 unique merchant_id values
python3 cardinality_audit.py # flags payment_attempts_total OVER
Where this leads next
A cardinality budget turns the previous chapter's structural property (a TSDB breaks before the disk fills) into an actionable engineering constraint. The next chapter — HyperLogLog for approximate counting — handles the related problem from a different angle: counting distinct values (unique users, transactions, merchants) without paying the cardinality cost. HLL lets a metric say "1.4M unique merchants" without storing 1.4M series, trading exactness for a 1-3% standard error at a fixed 12 KB of memory.
- Why high-cardinality labels break TSDBs — the previous chapter; the structural failure that motivates this one.
- HyperLogLog for approximate counting — the next chapter; the algorithm that prices distinct-counting at a fixed memory cost.
- Wall: cardinality is the billing death spiral — the cost framing this chapter complements with the governance framing.
- Exemplars: linking metrics to traces — the bridge that lets a bucketed metric recover per-instance detail via the trace attached to a sample.
- Histograms: native vs sparse — the histogram-specific cardinality optimisation; the 12× saving on the le dimension.
The single insight a senior reader takes away: cardinality is not a measurement, it is a budget. Teams that treat cardinality as something they observe (after the fact) are running an unbounded blast radius. Teams that treat it as something they declare (schema, three layers) are running a bounded engineering cost. The shift from "observed" to "declared" is the cultural change every observability initiative eventually has to make.
The closing reframing: the budget is not a tool, it is a policy. The tool is the audit script, the SDK wrapper, the relabel rule. The policy is the agreement that labels are a versioned API — declared in a schema, reviewed in a PR, deprecated through a migration. Policy first, tools second: tools without policy is shelfware; policy without tools is wishful thinking; the combination is the durable answer. From Hotstar's 2024 post-mortem: budgets ratified at the VP-of-Engineering level survived; budgets ratified inside the platform team did not. The signature on the policy is the load-bearing structural element.
References
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022), Ch. 6 — the chapter on high-cardinality observability and the cultural framing of label discipline.
- Prometheus relabel_configs documentation — the canonical reference for metric_relabel_configs and labeldrop actions used in Layer 3.
- OpenTelemetry SDK views specification — the spec for the OTel view mechanism used in the Going-deeper section.
- Honeycomb, "How we do continuous deployment" (2023) — Honeycomb's discipline around schema-driven instrumentation, the Western counterpart to Razorpay's
metrics.yaml. - Robust Perception, "Cardinality is Key" (Brian Brazil) — the Prometheus-author post that motivated the discipline.
- Grafana Mimir tenant federation and limits — the operator-side controls Mimir provides for per-tenant cardinality limits, the federation-tier counterpart to per-pod budgets.
- Why high-cardinality labels break TSDBs — the previous chapter; the structural problem this chapter governs.
- Wall: cardinality is the billing death spiral — the wall chapter that names the cost dynamic budgets address.