Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Cardinality budgets revisited

Six months after Yatrika rolled out /wiki/cardinality-budgets, Kiran in finance-ops opens the Q4 invoice and finds the metrics-line still up 23% over Q3. The CI gate fires every week. Engineers grumble, file waivers, the platform team approves them under deadline pressure, the bill grows. The post-mortem on the budget itself reaches one conclusion: a budget enforced as a series cap is a constraint; a budget enforced as a rupee P&L is a discipline. This chapter takes the offender table from /wiki/the-observability-bill-where-it-goes (http_request_duration_seconds_bucket, customer_id, 14M) and turns it into a quarterly P&L line every team owns, with a write-off policy, a deletion calendar, and an operating review that runs every other Friday. The CI gate is the floor. The P&L is the ceiling.

A cardinality budget that is only a series cap drifts upward through waiver creep. A cardinality budget that is also a rupee-denominated P&L line per team, with a quarterly target, a deletion calendar, and a Fortnightly Cardinality Review meeting, drifts downward — because every approved waiver costs a named engineering manager named rupees from a budget they argued for. The mechanism is the meeting, not the YAML.

The waiver-creep problem and why a series cap alone fails

The original cardinality-budget chapter (ch.37) shipped a clean three-layer enforcement: CI rejects PRs that exceed the declared cap, the SDK hashes unbounded labels at emit time, Prometheus relabel rules drop labels at storage. Yatrika ran that for two quarters. Series counts stopped exploding overnight; the bill kept climbing. The platform-engineering retro found seven causes — six of them not technical.

The technical cause: the budget was set in series, not in rupees. A 100K-series budget on a histogram with 50 buckets is functionally a 2K-distinct-label-set budget, but engineers reading the YAML see "100K" and think "plenty of room". Multiplied across 200 metrics times 30-day retention times the per-series-month rate, that "plenty of room" becomes ₹4.8 lakh/quarter that nobody priced. Series do not show up on invoices; rupees do. The translation has to live in the schema.
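
To make the translation concrete, here is a minimal sketch that prices a series cap at the ₹1.62 per-series-month contract rate used by the audit script later in this chapter. The helper names `series_to_inr_per_quarter` and `distinct_label_sets` are hypothetical, chosen only for illustration.

```python
# Hypothetical helpers: translate a series cap into the two numbers engineers rarely see.
RATE_PER_SERIES_MONTH = 1.62   # ₹ per active series per month (contract rate from the audit script)
QUARTER_MONTHS = 3

def series_to_inr_per_quarter(max_series: int) -> float:
    """Rupee cost of keeping `max_series` series active for one quarter."""
    return max_series * RATE_PER_SERIES_MONTH * QUARTER_MONTHS

def distinct_label_sets(max_series: int, buckets: int) -> int:
    """A histogram spends one series per bucket, so the label-set cap is far smaller."""
    return max_series // buckets

cap = 100_000
print(distinct_label_sets(cap, 50))            # 2000
print(round(series_to_inr_per_quarter(cap)))   # 486000
```

Under these assumptions a single "100K" line in the YAML is roughly ₹4.86 lakh per quarter, which is the figure a schema could carry next to `max_series`.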

The non-technical causes were all variants of the same pattern: waivers. The platform team granted them, because the alternative was blocking a feature shipping for the payments team, who had a deadline tied to a quarterly OKR, who had a VP whose comp depended on the OKR. A waiver was a paragraph in a Slack thread; the budget grew by 600K series, and the rupee cost showed up six months later on Kiran's invoice. The platform team paid; the payments team kept its OKR. The waiver had no price tag, so the equilibrium was that every waiver got approved.

Why pricing the waiver is the entire fix: in a cost-attribution model where each waiver subtracts named rupees from a named team's named budget, the cost of asking for a waiver — the political and financial cost — rises until it equals the cost of redesigning the metric to fit the budget. At that point the team picks the cheaper option, which is usually the redesign. Below that point the team always picks the waiver, regardless of how fierce the platform team's objections sound in a Slack thread. The platform team is not the principal in the waiver decision — the team's own VP-of-engineering is, but only when the rupees are visibly attributed.

[Figure: Two enforcement regimes, six-month outcome (same fleet, same engineers, two budget mechanisms). Left panel, series-only budget (what shipped): series count climbs from 1.1M in Q1 to 3.4M in Q4 as three waivers are approved (+merchant_id, +customer_id, +route+method); bill +₹14L/quarter. Right panel, rupee P&L budget (counterfactual): series count stays flat at 1.1M; three waivers are priced at ₹40K, ₹1.2L and ₹2.1L and are declined, redesigned and declined respectively; 0 waivers, 3 redesigns, bill flat. In regime A the platform-team lead approves waivers under deadline pressure and the cost is invisible; in regime B the requesting team's own VP-of-engineering signs off on the named ₹ amount, and redesign becomes the cheaper option three times in three quarters.]
Illustrative — Yatrika's six-month outcome under two budget regimes. The technical mechanism is identical (CI gate, schema, kill-switch); the difference is whether the waiver is priced in series (invisible) or rupees (attributable to a named manager's budget). The rupee-denominated regime is a P&L, not a constraint.

The pattern Yatrika hit is a textbook case of what FinOps practitioners call showback drift: a cost is technically tracked but not attributed to the team causing it, so the team feels no friction when adding cost. The fix is chargeback — the cost is not just shown, it is deducted from the team's budget, and they have to explain at the quarterly review why their line grew. Until cardinality is chargeable, it grows. Once it is chargeable, it is engineered against the way every other cost is engineered against — by the team that incurs it, in the language they understand, which is rupees.

The audit script extended — series, rupees, owner, and a write-off ledger

The audit primitive from ch.37 enumerated (metric, label_set) and printed a series count. The ch.104 version extends it three ways: it prices each metric in rupees-per-quarter using the per-series-month rate from the contract, attributes each metric to a named owner (team + manager email), and writes the overage into a waivers.yaml ledger with an expiry date and a deletion-calendar entry. The ledger is the load-bearing artefact — it is what the Fortnightly Cardinality Review meeting reads from.

# cardinality_pnl.py — turn the audit into a per-team P&L with waiver tracking
# pip install requests pyyaml pandas
import yaml, requests, pandas as pd, datetime, sys
from pathlib import Path
from collections import defaultdict

PROM = "http://prom.yatrika.internal:9090"
RATE_PER_SERIES_MONTH = 1.62          # ₹1.62 per active series per month (contract)
QUARTER_MONTHS = 3
SCHEMA = yaml.safe_load(Path("metrics.yaml").read_text())
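_waivers_path = Path("waivers.yaml")
# An empty waivers.yaml parses to None, so fall back to an empty ledger.
WAIVERS = (yaml.safe_load(_waivers_path.read_text()) if _waivers_path.exists() else None) or {"waivers": []}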

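# 1. Enumerate series active in the last 5 minutes (the head-block view).
#    The series API takes RFC 3339 or Unix timestamps for start/end, not "now-5m".
now_ts = datetime.datetime.now().timestamp()
r = requests.get(f"{PROM}/api/v1/series",
                 params={"match[]": '{__name__=~".+"}',
                         "start": now_ts - 300, "end": now_ts}, timeout=120)
r.raise_for_status()   # fail loudly on a 4xx/5xx instead of a KeyError on "data"
series = r.json()["data"]
print(f"enumerated {len(series):,} active series")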

# 2. Group by metric, compute series count and top-label cardinality
per_metric = defaultdict(int)
top_label = defaultdict(lambda: defaultdict(set))
for s in series:
    name = s["__name__"]; per_metric[name] += 1
    for k, v in s.items():
        if k != "__name__": top_label[name][k].add(v)

# 3. Walk the schema, compute rupee cost per metric per quarter, attribute to owner
team_pnl: dict[str, dict] = defaultdict(lambda: {"series": 0, "inr": 0.0,
                                                  "budget_inr": 0.0, "metrics": []})
rows = []
for m in SCHEMA["metrics"]:
    name = m["name"]; team = m["owner_team"]; manager = m["owner_email"]
    budget_series = m["max_series"]
    budget_inr = budget_series * RATE_PER_SERIES_MONTH * QUARTER_MONTHS
    actual_series = per_metric.get(name, 0)
    actual_inr = actual_series * RATE_PER_SERIES_MONTH * QUARTER_MONTHS
    overage_inr = max(0, actual_inr - budget_inr)
    top = sorted(((l, len(vs)) for l, vs in top_label[name].items()),
                 key=lambda x: -x[1])[:1]
    # PyYAML parses an unquoted 2026-05-14 into a date object, so normalise via str()
    # before parsing; this accepts both quoted and unquoted expiry dates.
    waiver = next((w for w in WAIVERS["waivers"] if w["metric"] == name and
                   datetime.date.fromisoformat(str(w["expires"])) >= datetime.date.today()),
                  None)
    rows.append({
        "metric": name, "owner_team": team, "manager": manager,
        "budget_₹/q": int(budget_inr), "actual_₹/q": int(actual_inr),
        "overage_₹/q": int(overage_inr),
        "top_label": top[0][0] if top else "—",
        "top_card": top[0][1] if top else 0,
        "waiver_until": str(waiver["expires"]) if waiver else "—",
    })
    team_pnl[team]["series"] += actual_series
    team_pnl[team]["inr"] += actual_inr
    team_pnl[team]["budget_inr"] += budget_inr
    team_pnl[team]["metrics"].append(name)

# 4. Per-team P&L summary — what the VP-of-engineering reads on Friday
print("\n=== Per-team cardinality P&L (₹/quarter) ===")
for team, p in sorted(team_pnl.items(), key=lambda x: -x[1]["inr"]):
    delta = p["inr"] - p["budget_inr"]
    sign = "OVER" if delta > 0 else "under"
    print(f"  {team:18s}  budget ₹{int(p['budget_inr']):>9,}  "
          f"actual ₹{int(p['inr']):>9,}  {sign} ₹{int(abs(delta)):>9,}  "
          f"({len(p['metrics'])} metrics)")

# 5. Per-metric overage report — the deletion calendar
df = pd.DataFrame(rows)
overages = df[df["overage_₹/q"] > 0].sort_values("overage_₹/q", ascending=False)
df.to_csv("cardinality_pnl.csv", index=False)
print(f"\n=== {len(overages)} metrics over budget — deletion-calendar candidates ===")
print(overages[["metric", "owner_team", "actual_₹/q", "overage_₹/q",
                "top_label", "top_card", "waiver_until"]].head(10).to_string(index=False))

# 6. Exit non-zero if un-waivered overages exist (this is the CI gate)
unwaivered = overages[overages["waiver_until"] == "—"]
if len(unwaivered):
    print(f"\nFAIL: {len(unwaivered)} metrics over budget without an active waiver")
    sys.exit(1)

Sample run on Yatrika Q3:

enumerated 3,402,189 active series

=== Per-team cardinality P&L (₹/quarter) ===
  payments           budget ₹  9,72,000  actual ₹14,68,200  OVER  ₹4,96,200  (38 metrics)
  risk               budget ₹  6,48,000  actual ₹  6,80,400  OVER  ₹   32,400  (24 metrics)
  platform           budget ₹  3,24,000  actual ₹  3,02,400  under ₹   21,600  (52 metrics)
  data-eng           budget ₹  2,16,000  actual ₹  1,94,400  under ₹   21,600  (18 metrics)

=== 4 metrics over budget — deletion-calendar candidates ===
                              metric owner_team actual_₹/q overage_₹/q   top_label top_card waiver_until
http_request_duration_seconds_bucket   payments  11,21,040    4,73,040 customer_id 14219401            —
              payment_attempts_total   payments     97,200      18,200 merchant_id    58201            —
                     cache_hit_total   payments     62,640       4,860  key_prefix    12889   2026-05-14
        flink_watermark_skew_seconds   data-eng     21,600           0   partition     4480            —

FAIL: 3 metrics over budget without an active waiver
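
The sample output groups digits the Indian way (9,72,000), which Python's `,` format specifier does not produce on its own. A small helper along the following lines could be swapped into the print statements; `format_inr` is a hypothetical name, not something the script above defines.

```python
def format_inr(amount: int) -> str:
    """Group digits the Indian way: last three digits, then pairs (9,72,000)."""
    sign = "-" if amount < 0 else ""
    digits = str(abs(int(amount)))
    if len(digits) <= 3:
        return sign + digits
    head, tail = digits[:-3], digits[-3:]
    pairs = []
    while len(head) > 2:          # peel two digits at a time off the right of the head
        pairs.insert(0, head[-2:])
        head = head[:-2]
    pairs.insert(0, head)         # whatever is left (one or two digits) leads the number
    return sign + ",".join(pairs + [tail])

print(format_inr(972000))    # 9,72,000
print(format_inr(1468200))   # 14,68,200
print(f"{972000:,}")         # 972,000  (what the `,` specifier prints)
```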

Read the per-team summary first. payments is over budget by ₹4.96 lakh per quarter — and that is what the platform-team lead emails the payments-VP on Friday morning, with the offender table attached. The conversation is no longer "your team is using too much cardinality" (abstract, easy to defer) but "your team's metrics line is ₹4.96 lakh over the ₹9.72 lakh you budgeted; here are the four metrics driving it; redesign is on the agenda for the next Cardinality Review." The conversation has rupees in it; rupees are the lingua franca of engineering managers.

The deletion-calendar table is what the Fortnightly Cardinality Review acts on. Each row is a candidate for one of three actions: redesign (replace customer_id with a tier — whale | mid | tail — cardinality 3), archive (move the metric to a low-frequency Mimir tenant with cheaper per-series-month rate), or delete (drop the metric and recover the labels via /wiki/exemplars-linking-metrics-to-traces). One row, one action, one PR, one named engineer, one merge date. The deletion calendar is the operating mechanism; without it the offender table is just a dashboard nobody reads.
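
The redesign action can be sketched as a bucketing function applied before the label is attached. The tier names come from the text; the transaction-volume cut-offs and the synthetic customer data below are assumptions made purely for illustration.

```python
# Hypothetical cut-offs: only the tier names (whale | mid | tail) come from the article.
WHALE_MIN_MONTHLY_TXNS = 100_000
MID_MIN_MONTHLY_TXNS = 1_000

def customer_tier(monthly_txns: int) -> str:
    """Collapse an unbounded customer_id into one of three label values."""
    if monthly_txns >= WHALE_MIN_MONTHLY_TXNS:
        return "whale"
    if monthly_txns >= MID_MIN_MONTHLY_TXNS:
        return "mid"
    return "tail"

# Synthetic fleet: customer_id -> monthly transaction count.
customers = {f"cust_{i}": i * 37 for i in range(5000)}

label_values_by_id = set(customers)                                   # customer_id as the label
label_values_by_tier = {customer_tier(v) for v in customers.values()} # customer_tier as the label
print(len(label_values_by_id), len(label_values_by_tier))            # 5000 3
```

The label cardinality stays at 3 however many customers are added, which is why the redesign fits a budget that the raw `customer_id` label never could.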

Why the waiver-tracking is in YAML, not in the platform team's heads: a verbal waiver granted in a Slack thread evaporates on the next on-call rotation. The platform team rotates, the original approver leaves the company, the waiver outlives them, the cardinality stays. A YAML waiver with an explicit expires: date has two properties — it is discoverable (anyone running the audit can see what is currently waivered) and it is expiring (the audit fails on the renewal date, forcing a re-approval). The renewal forces the team to re-justify the cost every quarter, which is the only mechanism that prevents a one-time waiver from becoming a permanent line on the bill.

The Fortnightly Cardinality Review — the meeting that does the work

The audit script and the P&L are infrastructure. The meeting is the discipline. Two engineering managers (platform + the team with the largest overage), the head of FinOps, and one senior engineer per overage metric meet for 30 minutes on alternating Fridays. The agenda is fixed, the pre-read is the audit output, the outcomes are written to the waivers.yaml ledger or the deletion-calendar PR queue. The meeting has run at Yatrika for three quarters; the Q4 metrics-line is down 8% absolute, the first observability-cost-line decline in the company's history.

The agenda is exactly four items: (1) review last fortnight's deletion-calendar PRs: which merged, which slipped, and which were de-prioritised and by what; (2) walk the new overage rows: for each, the owning engineer presents the redesign or asks for a waiver with a stated rupee cost; (3) re-approve expiring waivers: every waiver expiring in the next 14 days is re-justified or allowed to lapse; (4) one structural ask: what change to the schema or instrumentation framework would make next quarter's audit cleaner. Item 4 is what makes the meeting iterative: every two weeks the SDK or the schema gets a small improvement, and over six months that compounds.
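
Agenda item 3 needs a pre-read of which waivers fall inside the 14-day window. A minimal sketch, assuming each ledger entry carries the `metric` and `expires` fields the audit script reads; `classify_waivers` and the two `example_metric_*` entries are hypothetical.

```python
import datetime

REVIEW_WINDOW_DAYS = 14  # agenda item 3: waivers expiring within the next 14 days

def classify_waivers(waivers, today):
    """Split ledger entries into lapsed, due for re-approval, and still active."""
    window_end = today + datetime.timedelta(days=REVIEW_WINDOW_DAYS)
    out = {"lapsed": [], "due": [], "active": []}
    for w in waivers:
        # str() lets this accept both ISO strings and date objects from a YAML loader.
        expires = datetime.date.fromisoformat(str(w["expires"]))
        if expires < today:
            out["lapsed"].append(w["metric"])
        elif expires <= window_end:
            out["due"].append(w["metric"])
        else:
            out["active"].append(w["metric"])
    return out

ledger = [
    {"metric": "cache_hit_total", "expires": "2026-05-14"},      # from the sample run
    {"metric": "example_metric_a", "expires": "2026-04-30"},     # hypothetical
    {"metric": "example_metric_b", "expires": datetime.date(2026, 7, 1)},  # hypothetical
]
result = classify_waivers(ledger, today=datetime.date(2026, 5, 4))
print(result)
# {'lapsed': ['example_metric_a'], 'due': ['cache_hit_total'], 'active': ['example_metric_b']}
```

Only the `due` list needs discussion in the meeting; `lapsed` entries are the ones the audit would already be failing on.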

[Figure: Fortnightly Cardinality Review, 30 minutes, fixed agenda, written outputs. Inputs (read before the meeting): cardinality_pnl.csv, waivers.yaml (entries expiring in under 14 days), the deletion-calendar PR queue, the structural-improvement backlog. Agenda: 1. last fortnight's PRs (5 min): merged / slipped / blocked-by-what; 2. new overages (12 min): redesign or priced waiver, per row; 3. expiring waivers (8 min): re-justify or let lapse; 4. one structural ask (5 min): SDK/schema improvement. Outputs (within 24h): updated waivers.yaml, N deletion-calendar PRs, 1 structural ticket, Slack summary to all engineering leads. Why this meeting works: it is short (30 min), it has named owners (not "the team"), it produces written artefacts (PRs and YAML), and it forces re-justification on a 14-day cadence. Skip any one and the discipline drifts within two months.]
Illustrative — the Fortnightly Cardinality Review meeting structure. The meeting is the operating mechanism; the YAML files and PR queue are the artefacts the meeting reads from and writes to. Without the meeting, the YAML drifts; without the YAML, the meeting becomes a venting session.

Why fortnightly and not weekly or monthly: weekly is too frequent — engineering teams cannot redesign metrics on a weekly cadence, and the meeting devolves into status updates with no decisions. Monthly is too slow — overages compound, waivers pile up, and by the time the meeting happens the team has shipped three more high-cardinality labels because nobody pushed back. Fortnightly is the cadence at which one engineering team can complete one redesign PR between meetings, which is the unit of progress this meeting tracks.

The meeting also enforces a cultural rule that no Slack thread can: the engineer who added the cardinality presents the fix. Not the platform team, not the team's tech-lead, the actual author of the PR that introduced the offending label. This has two effects. First, the engineer learns the trade-off concretely — "I added customer_id because I wanted per-customer breakdown for the merchant dashboard; the fix is to keep customer_tier and use exemplars to drill down to per-customer when investigating a specific incident; tier-based cardinality is 3 instead of 14M." Second, the team's tech-lead and the team's VP both see the engineer present, which tells them whether the engineer's mental model has updated — and prevents the next engineer in the team from making the same mistake. The meeting is a teaching mechanism for the cardinality discipline, not just a governance ritual.

Common confusions

  • "A cardinality budget is the same as a CI gate." A CI gate enforces a number; a budget is the number plus the operating mechanism that adjusts it. Yatrika's CI gate ran for two quarters with no budget reductions; the bill grew. The CI gate is necessary, not sufficient — without the Fortnightly Review and the rupee P&L, the gate becomes a waiver-rubber-stamping bureaucracy.
  • "Pricing the budget in rupees is finance theatre." Engineers sometimes object that translating series to rupees is "just a dashboard". It is the principal change that makes the team's own VP-of-engineering the decision-maker for the waiver, instead of the platform-team lead. The rupee column is what puts the ask in the same political weight class as other engineering-cost decisions (compute, headcount, tooling licences), which is the only weight class in which it actually gets pushed back on.
  • "Waivers are the engineering-friendly path." Waivers feel friendly because they unblock a deadline. They are unfriendly to the engineer six months later who debugs a dashboard whose mental model is broken because the team hashed customer_id into 100 buckets and now p99 per customer is no longer measurable. Every waiver compounds future cognitive load. Redesigns front-load the cost; waivers amortise it across every future investigator.
  • "The deletion calendar is just a backlog." A backlog is unbounded and prioritised "later"; a deletion calendar has a date per item and a named owner. The calendar entry says "delete customer_id label from http_request_duration_seconds_bucket by 2026-05-31, owner ravi@yatrika.com, blocked-by checkout-redesign." If the date slips, the item appears at the next Fortnightly Review with the slippage reason. Calendars get reviewed; backlogs get archived.
  • "Per-team P&L creates inter-team competition." It does, deliberately. The healthy form is: payments-team sees risk-team's metrics-line drop 18% in Q3 and asks how. The unhealthy form is: payments-team blames risk-team for taking "their" budget. To prevent zero-sum framing, the per-team output is normalised by team size and by RPS before teams are compared (the listing above prints absolute rupees only): ₹/RPS/quarter is the comparable, not absolute rupees.
  • "The schema is the source of truth." The schema is the declared truth; the live Prometheus is the actual truth; the gap between them is the audit's whole point. A schema that says "100K series" while Prometheus shows 940K means the kill-switch failed or was bypassed — the audit surfaces this within the next fortnightly cycle. Drift between schema and reality is the leading indicator of process breakdown, and the audit is the canary.
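
The difference between a calendar and a backlog can be sketched as a record with a date and an owner plus one query for slipped items. The first entry mirrors the example in the list above; the `CalendarEntry` type, the `slipped` helper, and the second entry are hypothetical.

```python
import datetime
from dataclasses import dataclass

@dataclass
class CalendarEntry:
    metric: str
    label: str
    due: datetime.date
    owner: str
    blocked_by: str = ""
    merged: bool = False

def slipped(entries, today):
    """Entries past their date and not merged: these reappear at the next review."""
    return [e for e in entries if not e.merged and e.due < today]

calendar = [
    CalendarEntry("http_request_duration_seconds_bucket", "customer_id",
                  datetime.date(2026, 5, 31), "ravi@yatrika.com",
                  blocked_by="checkout-redesign"),
    # Hypothetical second entry, already merged, so it never shows up as slipped.
    CalendarEntry("payment_attempts_total", "merchant_id",
                  datetime.date(2026, 5, 15), "owner@example.com", merged=True),
]

for e in slipped(calendar, today=datetime.date(2026, 6, 5)):
    print(f"{e.metric}.{e.label} due {e.due} owner {e.owner} blocked-by {e.blocked_by}")
```

A backlog has no equivalent of `slipped`, because nothing in it has a date to be late against.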

Going deeper

Setting the per-team rupee budget — the arithmetic and the politics

The rupee budget per team starts as a top-down number derived from the company's overall observability spend allocated by team headcount, and is then adjusted bottom-up against the team's RPS and SLO surface area. At Yatrika the formula is team_budget_inr = (total_obs_budget × team_headcount_share × 0.6) + (team_rps_share × total_obs_budget × 0.4). The 0.6/0.4 split favours headcount slightly because larger teams have more developers shipping code that emits metrics, and the RPS contribution accounts for the inherent cost difference between a 50-RPS internal-tools team and a 50K-RPS payments team. The first quarter the formula is run, every team's budget is wrong — some are 30% over-allocated, some 30% under. Quarters 2-4 adjust based on actual usage, with a hard rule: no team's budget can grow more than 10% per quarter (forces redesign-before-spend) and no team's budget can shrink more than 20% per quarter (prevents punitive cuts that surprise the team mid-roadmap).
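
The formula and the two quarter-on-quarter limits can be written down directly. In this sketch the function names are hypothetical and the share values are made up for illustration; the ₹21.6 lakh total is simply the sum of the four team budgets in the sample run.

```python
def formula_budget(total_obs_budget: float, headcount_share: float, rps_share: float) -> float:
    """Top-down allocation: 60% weighted by headcount share, 40% by RPS share."""
    return (total_obs_budget * headcount_share * 0.6) + (rps_share * total_obs_budget * 0.4)

def next_quarter_budget(previous: float, proposed: float) -> float:
    """Clamp the change between quarters: at most +10%, at most -20%."""
    ceiling = previous * 1.10
    floor = previous * 0.80
    return min(max(proposed, floor), ceiling)

total = 2_160_000  # ₹/quarter, sum of the team budgets in the sample run
print(round(formula_budget(total, headcount_share=0.40, rps_share=0.50)))  # 950400

# A team asking for 30% more is held to +10%; a 40% cut is softened to -20%.
print(round(next_quarter_budget(1_000_000, 1_300_000)))  # 1100000
print(round(next_quarter_budget(1_000_000, 600_000)))    # 800000
```

If the shares across all teams each sum to 1, the formula allocates exactly the total, so any platform-overhead line has to come out of `total_obs_budget` before the split.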

The waiver-pricing model — what a waiver should cost

A waiver should be priced at 2× the marginal rupee cost of the cardinality it permits, for a 90-day term with no automatic renewal. The 2× multiplier reflects the future-investigation cognitive cost noted in the common-confusions section: the team is not just paying the storage cost, it is paying for the option to defer the redesign. If the waiver is renewed (a second 90-day extension), the multiplier rises to 4×; a third renewal forces a redesign-or-retire decision at the next Architecture Review. Yatrika's first-year data: of 23 waivers granted, 17 were redesigned within the first 90 days (the 2× pricing made the redesign cheaper), 4 were renewed at 4× (legitimate "we are mid-migration" cases), 2 were retired at the third renewal (the metric was decommissioned because nobody could justify 8× the storage cost for it). The progression (2×, 4×, retire) is what prevents permanent waivers.
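
One way to read that progression is as a price per 90-day term: term 1 at 2×, term 2 at 4×, and no price at all for a third term. The sketch below encodes that reading; `waiver_price` is a hypothetical helper, and the 10,000-series example is made up.

```python
RATE_PER_SERIES_MONTH = 1.62   # ₹, contract rate used by the audit script
TERM_MONTHS = 3                # one 90-day waiver term

def waiver_price(extra_series: int, term: int = 1) -> float:
    """Price of one waiver term. term=1 is the first grant, term=2 the renewal."""
    if term < 1:
        raise ValueError("term is 1-based")
    if term >= 3:
        # Assumed interpretation: a third term is never priced, it escalates instead.
        raise ValueError("third term: redesign-or-retire at the Architecture Review")
    multiplier = 2 ** term     # 2x for term 1, 4x for term 2
    marginal = extra_series * RATE_PER_SERIES_MONTH * TERM_MONTHS
    return multiplier * marginal

print(round(waiver_price(10_000, term=1)))   # 97200
print(round(waiver_price(10_000, term=2)))   # 194400
```

Raising on the third term, rather than returning an 8× price, keeps the escalation a process decision instead of something a team can simply pay for.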

What to do with the metrics that have no owner

Every audit surfaces 30-50% of metrics with no owner — Go runtime metrics, Prometheus self-metrics, Kubernetes-emitted metrics, third-party Helm chart metrics. These cannot have an "owner team" in the company's product sense, but they cost rupees, and somebody has to own the cost. The pragmatic answer: the platform team owns all unowned metrics, with a "platform-overhead" budget line that includes a default 20% of total observability spend. This forces the platform team to either (a) prove that the unowned metric is operationally load-bearing (kept at any cost), (b) identify the team that introduced the third-party tool emitting it (transferred), or (c) drop the metric via a Prometheus relabel rule (deleted). The default-budget-for-unowned policy is what surfaces the metric python_gc_objects_collected_total consuming 47K series across 1400 pods — the platform team had not noticed because it was distributed across many services; the audit script's unowned-roll-up showed 4.7% of total cardinality bound to two Python runtime metrics, which the platform team relabelled out at the storage layer for an instant ₹2.1L/quarter savings.
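
The unowned roll-up is a set difference between what Prometheus reports and what the schema declares. A minimal sketch, assuming a `per_metric` mapping of metric name to series count like the one the audit script builds; `unowned_rollup` is a hypothetical helper, and apart from the 47K figure from the text the counts are invented.

```python
def unowned_rollup(per_metric, owned_names, top_n=5):
    """Return (share of total series with no schema owner, largest unowned metrics)."""
    unowned = {m: n for m, n in per_metric.items() if m not in owned_names}
    total = sum(per_metric.values())
    share = sum(unowned.values()) / total if total else 0.0
    top = sorted(unowned.items(), key=lambda kv: -kv[1])[:top_n]
    return share, top

per_metric = {
    "service_requests_total": 950_000,             # hypothetical owned metric
    "python_gc_objects_collected_total": 47_000,   # figure from the text
    "go_goroutines": 3_000,                        # hypothetical runtime metric
}
owned = {"service_requests_total"}                 # names declared in metrics.yaml

share, top = unowned_rollup(per_metric, owned)
print(f"{share:.1%}")   # 5.0%
print(top)              # [('python_gc_objects_collected_total', 47000), ('go_goroutines', 3000)]
```

In the audit script, `owned` would be the set of `name` values in the schema, and the roll-up total would be booked against the platform-overhead line.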

Reproduce this on your laptop

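# 1. Spin up Prometheus with a scrape job for the emitter on :8000
#    (the image's default config only scrapes Prometheus itself)
cat > prometheus.yml <<EOF
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: emitter
    static_configs:
      - targets: ['host.docker.internal:8000']
EOF
docker run -d --name prom -p 9090:9090 \
  --add-host=host.docker.internal:host-gateway \
  -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" prom/prometheus
python3 -m venv .venv && source .venv/bin/activate
pip install requests pyyaml pandas prometheus-client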

# 2. Emit a fleet shape (200 metrics, varying cardinality)
python3 -c "
from prometheus_client import Counter, start_http_server
import random, time
start_http_server(8000)
counters = [Counter(f'service_{i}_requests_total', 'reqs', ['route','status'])
            for i in range(200)]
for _ in range(5):
    for c in counters:
        for r in [f'/r{i}' for i in range(50)]:
            for s in ['200','500']:
                c.labels(r, s).inc()
print('emitted ~20K series at :8000/metrics — let prometheus scrape it for a minute')
time.sleep(90)
" &

# 3. Drop in a metrics.yaml + waivers.yaml + run the P&L audit
#    (first point PROM in cardinality_pnl.py at http://localhost:9090)
cat > metrics.yaml <<EOF
metrics:
  - name: service_0_requests_total
    owner_team: payments
    owner_email: ravi@yatrika.com
    max_series: 50        # the emitter creates 50 routes x 2 statuses = 100 series
EOF
echo 'waivers: []' > waivers.yaml
sleep 30                  # give Prometheus time for at least one scrape
python3 cardinality_pnl.py
# Expect: payments team OVER budget, deletion-calendar candidates listed,
# audit exits non-zero (no waiver), per-team P&L printed.

Where this leads next

/wiki/tiered-storage-for-metrics-logs-traces takes the next-largest line on the bill (log-ingest and metric-retention) and applies the same per-tier discipline to retention windows: 7-day hot, 30-day warm, 365-day cold, with downsampling at each tier boundary. The cardinality budget primitive in this article composes with the tiered-retention primitive in the next: a metric whose budget is 100K series × 30 days can be promoted to 500K series × 7 days (roughly the same rupee cost, more cardinality for short-window debugging) without a budget renegotiation. The unit of fungibility is rupees-per-quarter, not series-per-day: once the budget is rupees, the team can trade cardinality against retention freely, which is the next article's argument.

/wiki/exemplars-linking-metrics-to-traces is the technical alternative to high-cardinality labels — drop the customer_id label from the histogram, attach it as a span attribute on the trace exemplar instead, and recover per-customer detail at investigation time with no metric-storage cost. The exemplar machinery is what makes the redesign in the Fortnightly Review tractable — without it, the team's only option to drop customer_id would be to lose per-customer detail entirely.

/wiki/wall-cardinality-is-the-billing-death-spiral is the wall-essay that argues the cardinality discipline is the largest single lever in the entire observability cost picture — and that teams which do not develop it will not be able to scale their observability stack past mid-stage growth. The Fortnightly Review and the rupee P&L in this article are the operating mechanisms that mid-stage companies (Razorpay-scale, Zerodha-scale, Swiggy-scale) actually use; that wall-essay is where the argument crystallises.

References