Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Wall: the discipline ties this all together
Aditi closes the laptop on the migration kickoff at 11:48 IST. The architecture document is written, the Helm chart is committed, the stack is chosen — VictoriaMetrics for metrics, ClickHouse for logs, Jaeger for traces. The CFO signed the vendor-vs-self-hosted spreadsheet two weeks ago. The bill is projected to drop from ₹62 lakh/quarter to ₹19 lakh/quarter once the migration completes, a saving of ₹43 lakh that will go straight into the platform-team headcount. Aditi is exhausted, satisfied, and wrong. Nine months from now, without anything anyone would call a mistake, the bill will be back at ₹54 lakh/quarter. The new cluster will be heavier, the cardinality higher, the retention longer. No single team did anything indefensible. The savings reverted because cost-and-retention is not a project with a delivery date — it is a discipline with an on-call rotation, a budget review, and a CI gate, and Yatrika hasn't built any of those yet. This is the wall closing Part 16: the tools in chapters 102–108 cut your bill twice, and discipline is what keeps it cut.
Every cost-control mechanism in Part 16 — tiered storage, downsampling, cardinality budgets, index-free log storage, vendor or self-host choice, stack pick — produces a one-time saving and then drifts back if no human owns it. The discipline is three artefacts: a quarterly cost review (numbers + drift attribution), a CI gate on cardinality and retention (catches regressions before they ship), and an observability on-call rotation that pages on bill-shape changes the same way SREs page on latency. Without those three, every Part-16 saving is half-life ~9 months.
Why every cost win drifts back — the second-law-of-bills mechanism
Tooling moves your bill from high to low once. Engineering activity moves it back. Every week, your platform ships features. New services emit new metrics. New product surfaces add new labels (vehicle_type, coupon_code, shipment_pincode). Data scientists log richer events for funnel analysis. Each addition is a single PR; each PR is reviewed for correctness, not for cost. Over 40 weeks at 60 PRs/week, the steady drip dwarfs the one-time tooling win. The bill is not a level — it is a flow, and the inflow is your engineering velocity.
Why this is structural, not a "we just need to be more careful" problem: every cardinality, retention, and storage decision is local — owned by the engineer who writes the PR — but the cost is global, paid by the platform team. Local decisions cannot see global consequences. An engineer adding a customer_id label sees their dashboard work; they do not see the 14M-series fan-out it will cause. This is a tragedy of the commons: when the cost is externalised, individuals optimise their work without internalising the externality, and the aggregate drifts toward the worst-case bill. The fix is not "remind people more"; the fix is to internalise the cost in the engineer's workflow — make the PR fail when cardinality would explode, make the dashboard show the engineer's own service-team's bill, make the retention default short and require an explicit override.
The drift is not driven by carelessness — it is driven by invisibility. The engineer adding vehicle_type as a label to the rider-positioning service has no signal that this addition will multiply their team's metric series count by 6. The data engineer raising the freshness-gauge retention from 30 days to 90 days for a regulator audit has no signal that it triples their slice of the storage bill. The frontend team adding a new dashboard with a topk(50, ...) panel does not know the panel is a cardinality fan-out that will be evaluated every 30 seconds and inflate Mimir's query-path memory by 40%. The signal does not exist unless someone builds it. Discipline is what builds the signal.
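The missing signal has simple arithmetic behind it: a label with k distinct values splits every series it lands on into up to k series. A back-of-envelope sketch, with illustrative numbers (the function name and the 36,000-series slice are assumptions chosen so the arithmetic lands on a round figure):

```python
# label_fanout.py — illustrative back-of-envelope for label fan-out.
# Not part of any real tooling; numbers are hypothetical.

def projected_fanout(current_series: int, label_values: int,
                     coverage: float = 1.0) -> int:
    """Series count after adding a label with `label_values` distinct
    values to a `coverage` fraction (0..1) of the existing series.
    Each covered series splits into one series per label value."""
    covered = int(current_series * coverage)
    return (current_series - covered) + covered * label_values

# A 6-value vehicle_type label over a 36,000-series slice:
base = 36_000
print(projected_fanout(base, 6) - base)  # 180000 net new series
```

The point of the sketch is the shape of the signal: the PR author could run this in seconds, but only if someone has built the habit and a budget to compare the output against.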
The three discipline artefacts — what to actually build
The wall says: pick three artefacts, ship them, run them. Each has a clear cadence, a clear owner, and a measurable outcome. Skip any one and the drift returns.
Artefact 1 — quarterly cost review (cadence: 90 days, owner: platform-team lead). This is a 60-minute meeting with the engineering managers of every team that ships telemetry, plus the CFO or finance partner. The agenda is fixed: (a) what was the bill last quarter vs this quarter, (b) what changed in cardinality, retention, and log volume per team, (c) which 3 specific PRs / tickets caused the largest drift, (d) what is one commitment each team makes for next quarter. The meeting must produce a one-page document that goes to engineering leadership. The trap teams fall into: making this a "platform team rants at product teams" meeting. The discipline shape: every team brings their own cost-attribution table; the platform team just provides the methodology.
Artefact 2 — CI gate on cardinality and retention (cadence: every PR, owner: platform team). A check that runs on every PR touching service code, dashboard JSON, recording rules, alert rules, or retention configs. The check loads the current cardinality / retention budget for the affected team, computes the delta the PR introduces, and fails the build if the delta exceeds the budget. The gate must be fast (≤30s) and have a clear override path (a labelled escape: cost-override: approved-by:@aditi) — without the override, the gate becomes adversarial and gets disabled in a panic. Yatrika's gate, three months in, rejects ~11 PRs/quarter, of which 6 are bugs (someone added customer_id as a label without realising) and 5 are legitimate changes that get the override label after a 5-minute conversation with the team's tech lead.
Artefact 3 — observability on-call rotation (cadence: 24×7, owner: rotating SREs). Same model as the application on-call rotation, but the alerts are about the observability stack itself: ingestion-rate spike (more than 30% above the 7-day median), cardinality breach (any tenant exceeded budget), storage-cost-per-day exceeded threshold, retention-policy edit pushed to production. The rotation has a runbook for each alert. Most teams skip this and try to address observability cost as a "side project" of the platform team. That fails because cost incidents are time-sensitive — a Datadog histogram metric accidentally tagged with customer_id will burn ₹4 lakh/day until someone notices. Without an on-call rotation, the lag is 7–12 days. With one, it is 30 minutes.
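The "30% above the 7-day median" rule reduces to a few lines. A hedged sketch (the threshold comes from the alert list above; the function name and input shape are assumptions):

```python
# spike_check.py — the artefact-3 ingestion-rate rule as a pure function.
# Sketch only; input shape and names are hypothetical.
from statistics import median

def ingestion_spike(daily_rates: list[float], threshold: float = 0.30) -> bool:
    """daily_rates: the last 8 daily ingestion rates, newest last.
    True when the newest rate exceeds the median of the previous
    7 days by more than `threshold` (default 30%)."""
    *history, today = daily_rates
    return today > median(history[-7:]) * (1 + threshold)

print(ingestion_spike([100, 102, 98, 101, 99, 103, 100, 135]))  # True  (+35%)
print(ingestion_spike([100, 102, 98, 101, 99, 103, 100, 120]))  # False (+20%)
```

A median rather than a mean keeps one earlier spike from inflating the baseline and masking the next one.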
A runnable cost-discipline scaffold
The script below is the kernel of the CI gate — the cardinality-and-retention budget check that runs on every PR. It pulls the current series count from VictoriaMetrics, compares against a per-team budget JSON file, and fails the PR if the delta would breach. The same logic doubles as the on-call alert evaluator. Owning one function — team_budget_check() — that drives both the CI gate and the on-call alert is the discipline shape: shared definitions, shared enforcement, no drift between "what we lint for" and "what we page on".
```python
# cost_discipline.py — CI gate + on-call alert evaluator for observability cost.
# pip install requests pyyaml tabulate
# Usage (CI):    python3 cost_discipline.py --pr-diff diff.json --team payments
# Usage (alert): python3 cost_discipline.py --evaluate-now --team payments
import argparse, json, sys, requests
from dataclasses import dataclass
from tabulate import tabulate

VM = "http://victoriametrics.platform:8428"

@dataclass
class TeamBudget:
    team: str
    series_budget: int      # max active series this team owns
    log_gb_per_day: float   # max log volume per day
    retention_days: int     # max retention any of this team's metrics can have

# Loaded from a checked-in YAML; the team-leads PR-edit this file and the
# platform team approves; CFO sees the diff in the quarterly review.
BUDGETS = {
    "payments":  TeamBudget("payments", 1_200_000, 240.0, 30),
    "rider":     TeamBudget("rider",      850_000, 140.0, 14),
    "ledger":    TeamBudget("ledger",     400_000,  90.0, 90),
    "data-plat": TeamBudget("data-plat", 4_500_000, 900.0, 14),
    "ml-plat":   TeamBudget("ml-plat",   2_100_000, 320.0, 14),
}

def current_series_for_team(team: str) -> int:
    r = requests.get(f"{VM}/api/v1/series/count",
                     params={"match[]": f'{{team="{team}"}}'}, timeout=10)
    r.raise_for_status()
    return int(r.json()["data"][0])

def projected_series_after_pr(team: str, pr_added_series: int) -> int:
    return current_series_for_team(team) + pr_added_series

def team_budget_check(team: str, pr_added_series: int = 0) -> dict:
    b = BUDGETS[team]
    proj = projected_series_after_pr(team, pr_added_series)
    headroom = b.series_budget - proj
    pct = 100 * proj / b.series_budget
    return {
        "team": team,
        "current_series": proj - pr_added_series,
        "pr_delta": pr_added_series,
        "projected_series": proj,
        "budget": b.series_budget,
        "headroom": headroom,
        "utilisation_pct": round(pct, 1),
        "verdict": "PASS" if headroom > 0 else "FAIL",
    }

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--team", required=True)
    ap.add_argument("--pr-diff", help="JSON with key 'added_series'")
    ap.add_argument("--evaluate-now", action="store_true")
    args = ap.parse_args()
    pr_added = 0
    if args.pr_diff:
        with open(args.pr_diff) as f:
            pr_added = json.load(f).get("added_series", 0)
    result = team_budget_check(args.team, pr_added)
    print(tabulate([result], headers="keys", tablefmt="github"))
    if result["verdict"] == "FAIL":
        print(f"\nFAIL — projected {result['projected_series']:,} series exceeds "
              f"budget {result['budget']:,} (headroom {result['headroom']:,}).")
        print("Either: (a) reduce cardinality, (b) request budget increase via "
              "PR to budgets.yaml, (c) add `cost-override: approved-by:@<lead>` "
              "label and proceed at your own risk.")
        sys.exit(1)
    print(f"\nPASS — utilisation {result['utilisation_pct']}% of budget.")
    sys.exit(0)

if __name__ == "__main__":
    main()
```
Sample run on a PR that adds the vehicle_type label to the rider-positioning service (an estimated +180,000 net new series from a 6-value label cross-product over the service's existing series):
```text
$ python3 cost_discipline.py --team rider --pr-diff /tmp/pr_42.json
| team   | current_series   | pr_delta   | projected_series   | budget   | headroom   | utilisation_pct   | verdict   |
|--------|------------------|------------|--------------------|----------|------------|-------------------|-----------|
| rider  | 810,000          | 180,000    | 990,000            | 850,000  | -140,000   | 116.5             | FAIL      |

FAIL — projected 990,000 series exceeds budget 850,000 (headroom -140,000).
Either: (a) reduce cardinality, (b) request budget increase via PR to budgets.yaml,
(c) add `cost-override: approved-by:@<lead>` label and proceed at your own risk.
```
Walking the load-bearing lines:
- `BUDGETS = {...}` as a checked-in dictionary — this is the discipline-as-code shape. The budget is not a wiki page or a Confluence document; it is a YAML file in the platform repo. Changing it requires a PR. The CFO sees those PRs in the quarterly cost review (artefact 1). The diff is the audit trail. Why this shape outperforms a wiki page: a wiki page allows silent drift — anyone can edit it, no one notices, and the "budget" no one tracks ceases to be a budget. A checked-in YAML in the platform repo enforces three things at once: (a) every change is reviewed, (b) the change history is the conversation history, (c) the same file the CI reads is the file the on-call rotation reads, so there is one source of truth. The budget file becomes a living contract between platform and product teams that survives org changes, manager turnover, and the natural decay of "we agreed last quarter to...".
- `current_series_for_team` calling `/api/v1/series/count` — VictoriaMetrics' fast count endpoint, which returns the integer in <500ms. Mimir's equivalent is `/api/v1/cardinality/active_series`. Avoid the slower `/api/v1/series` (it returns full label sets and takes multiple seconds on a 4M-series cluster). The CI gate must finish in <30s end-to-end, and the cardinality check is the tightest call against that time budget.
- `pr_added_series` as a parameter, not a measurement — for a PR that hasn't merged yet, the gate cannot measure the actual cardinality impact (the code isn't deployed). Instead, the PR author estimates it (or the gate estimates by parsing the diff for new label additions). Estimates are imperfect; the gate accepts that and uses headroom as a buffer. The post-merge measurement happens in the next quarterly review, when the actual delta is compared to the estimate — and consistently-wrong estimators lose budget headroom.
- The override path (`cost-override: approved-by:@<lead>`) — every blocking gate needs an escape hatch, and the escape hatch must have a name on it. Anonymous overrides (a `--force` flag, a `[skip-cost-check]` commit message) become a habit; named overrides force a human-to-human conversation. This is not security theatre — it is workflow design that aligns the override cost with the override consequence. Why named overrides bend the curve: if the override is anonymous, the marginal cost of using it is zero — engineers reach for it whenever the gate fires. If the override requires tagging a specific person, the marginal cost is "that person will ask why" — which is high enough to make engineers fix the underlying issue ~70% of the time. The gate's purpose is not to block every regression (impossible, and adversarial) — it is to make regressions visible before they ship. The override path is what keeps the gate cooperative.
- `sys.exit(1)` on FAIL with structured output — the CI runner reads the exit code; the on-call dashboard reads the JSON. Same script, two consumers. This is the artefact-3 / artefact-2 unification: when the on-call rotation's runbook says "run cost_discipline.py for each team", they get the same view the CI gate uses, so when an alert fires there is no "I need to figure out the right query" delay.
- The verdict shape (`PASS`/`FAIL` plus a `headroom` number) — both pieces matter. A binary pass/fail without a number cannot guide remediation; a number without a verdict requires the reader to remember the threshold. The combined shape is what makes the script readable as both a CI step and an alert evaluation.
How the artefacts compose — the operating loop
The three artefacts are not independent — they form a loop. The CI gate enforces the budget at PR time. The on-call rotation enforces the budget at runtime (when reality drifts from the estimate). The quarterly review reconciles the two — actual cost vs predicted cost — and rebases the budgets for next quarter. Each artefact catches a different drift mode: the CI gate catches known regressions, the on-call rotation catches unknown spikes, the quarterly review catches slow drift that neither real-time mechanism flags.
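The reconciliation step of that loop, comparing each PR's estimated series delta with the measured one, can be sketched as a pure function. The field names and the 2× tolerance are assumptions, not anything the chapter specifies:

```python
# rebase.py — sketch of the quarterly-review reconciliation: flag teams
# whose actual series deltas ran far past their estimates.
# Field names and the 2x tolerance are hypothetical.

def reconcile(prs: list[dict], tolerance: float = 2.0) -> list[str]:
    """prs: [{'id': ..., 'team': ..., 'estimated': int, 'actual': int}].
    Returns teams whose aggregate actual delta exceeded the estimate by
    more than `tolerance` -- candidates for a headroom cut next quarter."""
    agg: dict[str, list[int]] = {}
    for pr in prs:
        est, act = agg.setdefault(pr["team"], [0, 0])
        agg[pr["team"]] = [est + pr["estimated"], act + pr["actual"]]
    return sorted(t for t, (est, act) in agg.items()
                  if est > 0 and act / est > tolerance)

prs = [{"id": 42, "team": "rider", "estimated": 180_000, "actual": 410_000},
       {"id": 57, "team": "payments", "estimated": 50_000, "actual": 55_000}]
print(reconcile(prs))  # ['rider'] -- 410k actual vs 180k estimated (>2x)
```

The output is the agenda item (c) of the quarterly review in machine-readable form: which changes drove the drift, and by how much more than anyone predicted.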
The non-obvious insight closing Part 16: the bill is a leading indicator of engineering culture, not a lagging indicator of telemetry choice. A team running Mimir + Loki + Tempo (the heaviest stack from the previous chapter) with the three discipline artefacts in place will out-cost-perform a team running SigNoz (the lightest stack) without them. The artefacts are not "process overhead" sitting on top of the engineering work — they are the engineering work for the cost dimension. Why this is more than an opinion: across the (hypothetical) Indian platform-team migrations the chapters in Part 16 reference (Yatrika, PaisaBridge, the unicorn marketplaces), the spread in observability spend per million active series across stacks is ~2× (₹1.2L vs ₹2.4L per million series per month). The spread across discipline-equipped vs un-equipped teams on the same stack is ~5× (₹1.2L vs ₹6L per million series per month). The discipline dimension dominates the stack dimension by 2.5× — which is why the "right stack" question, while real, is the smaller question. The bigger question is "do you have the three artefacts". Part 17 is where this thread continues: the platform-team practices, the postmortem culture, the maturity model that determines whether Yatrika lands at the 5× end or the 1× end of that spread.
Common confusions
- "Cost discipline is a finance team responsibility." Half-true and dangerous. Finance owns the budget total (₹X lakh/quarter) but not the budget allocation across teams or the enforcement mechanism. The platform team owns enforcement; the team-leads own allocation; finance audits and signs off. If finance owns the whole loop, you get a year-end bill shock and a 30% cut mandate that breaks observability for everyone. The right shape: continuous attribution and per-team budgets that finance reviews quarterly, not approves daily.
- "We can replace the quarterly review with a real-time dashboard." No. The dashboard informs the review but does not replace it. The review is a forcing function for cross-team conversation — the platform lead and the rider-team manager need to agree on next quarter's budget, and that conversation does not happen if it is not scheduled. Real-time dashboards reduce surprise but do not produce alignment. Both are necessary; neither is sufficient.
- "Cardinality CI gates create friction and slow down development." True for the first 30 days; false after that. The first month of a CI gate produces ~50% rejection rate as the team learns to estimate cardinality impact. By month 3, the rejection rate stabilises at ~5–10%, almost all of which are legitimate concerns the engineer wanted flagged. The gate becomes a thinking aid, not a friction. The teams that report high friction are usually running the gate without an override path or without a per-team budget — both fixable design errors, not inherent to gating.
- "Once the migration is done, we can disband the on-call rotation." This is the most common reversion. Yatrika did this in month 6, declared victory, dissolved the rotation. By month 14 the bill was halfway back. The on-call rotation is not a migration-period artefact — it is a permanent engineering function, the same way the application-on-call rotation is permanent even though the original service has been live for years. Cost spikes are a category of incident; without an incident response, they go unaddressed.
- "Per-team budgets create internal politics." They make the politics visible, which is a feature not a bug. Without per-team budgets, the politics still exist — they just take the form of platform-team-vs-product-team complaints in retrospectives. With explicit budgets, the conversation moves to "this team needs more budget because they ship more telemetry-rich features" which is a healthier and more decidable conversation. Avoiding the budget conversation does not eliminate the politics — it just makes the politics implicit.
- "Discipline is a culture problem, not an engineering problem." This framing leads to slogans that don't change behaviour ("we should care more about cost"). Discipline is an engineering systems problem — what gates exist, what alerts page, what reviews happen, with which artefacts. Once the artefacts exist, the culture follows. Trying to fix the culture without the artefacts is what produces the next bill shock.
Going deeper
What "per-team attribution" actually requires — the label discipline
The CI gate, the on-call rotation, and the quarterly review all assume you can answer the question "which team owns this series?" That assumption requires a team label on every metric. Adding team retroactively to a multi-tenant cluster with 4M series is a 6–10 week project: every Prometheus scrape config needs the right external_labels: { team: ... }, every OTel Collector resource processor needs to inject the team based on Kubernetes namespace, every recording rule needs to preserve team through aggregations. The teams that try to skip this step (or make team a "best-effort" label) discover that their attribution is 30% wrong, the quarterly review degenerates into "who actually owns these orphan 800K series", and the discipline loop breaks at the foundation. The investment in clean attribution is the prerequisite for all three artefacts. Yatrika's migration plan budgets 8 weeks for label-discipline work before the CI gate goes live — and considers it the cheapest part of the migration.
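One hedged way to measure whether the attribution foundation holds is to count series with an empty team label against the total. The PromQL shapes below are standard instant-query counts; the endpoint address follows the script earlier in this article, and the helper names are assumptions:

```python
# attribution_coverage.py — sketch: what share of the cluster's series
# has no team label? Helper names are hypothetical; the queries are
# ordinary PromQL instant queries against the /api/v1/query endpoint.
import json, urllib.parse, urllib.request

VM = "http://victoriametrics.platform:8428"

def instant_count(query: str) -> int:
    """Run a PromQL instant query and return its scalar-ish result."""
    url = f"{VM}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as r:
        data = json.load(r)["data"]["result"]
    return int(data[0]["value"][1]) if data else 0

def orphan_pct(total: int, orphans: int) -> float:
    """Share of series with no team label -- the '30% wrong' failure mode."""
    return round(100 * orphans / total, 1) if total else 0.0

def report() -> float:
    total = instant_count('count({__name__!=""})')
    orphans = instant_count('count({__name__!="", team=""})')  # team absent
    return orphan_pct(total, orphans)

# With 4M series of which 800K are unattributed (the text's orphan figure):
print(orphan_pct(4_000_000, 800_000))  # 20.0
```

Running something like `report()` weekly during the 8-week label-discipline phase gives the migration a completion metric: the orphan percentage trending toward zero is what "clean attribution" means operationally.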
The economics of dashboard-as-cost — why dashboards are a budget item
Dashboards are usually treated as zero-cost (they're "just queries"), but in a cluster with 4M series and 200 dashboards each refreshing every 30s, the query path consumes 25–40% of the cluster's CPU and 15% of its memory. A `topk(50, sum by (customer_id) (rate(...[5m])))` panel evaluated every 30 seconds is a continuous 14M-row scan. Dashboards are a budget item; a per-team dashboard quota is a real artefact some teams ship: each team gets N panel-equivalents per quarter, where a panel-equivalent is weighted by query cost. Adding a panel beyond the quota requires removing one or buying budget from another team. This is the discipline shape extended one layer beyond cardinality — and it tends to surface only after the cardinality discipline is in place, because cardinality dominates spend until you cap it.
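The panel-equivalent weighting can be made concrete with a small cost model: a panel's load is roughly the series it touches times how often it is evaluated, normalised against a baseline panel. Entirely illustrative; the baseline constants are assumptions:

```python
# panel_budget.py — sketch of a "panel-equivalent" weight: series touched
# times evaluations per day, normalised to a baseline panel.
# The baseline (1K series, 60s refresh) is a hypothetical choice.

def panel_equivalents(series_touched: int, refresh_seconds: int,
                      baseline_series: int = 1_000,
                      baseline_refresh: int = 60) -> float:
    """Cost of one panel in units of the baseline panel."""
    evals_per_day = 86_400 / refresh_seconds
    baseline_evals = 86_400 / baseline_refresh
    return (series_touched * evals_per_day) / (baseline_series * baseline_evals)

# The topk(50, sum by (customer_id) ...) panel from the text: 14M rows
# scanned every 30s weighs as much as ~28,000 baseline panels.
print(round(panel_equivalents(14_000_000, 30)))  # 28000
```

Whatever the exact weights, the design point is that the quota is denominated in query cost, not panel count, so a single fan-out panel cannot hide behind a small dashboard.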
Why "use vendor X for cost" almost never works as a strategy
Switching from Datadog to a self-hosted Mimir to "save money" without the discipline artefacts is a reliable way to re-arrive at the same bill in a different denomination — instead of paying Datadog ₹62 lakh/quarter, you pay AWS ₹38 lakh + 2 SREs' time ₹24 lakh = same bill, more operational risk. The vendor or stack switch (chapter 107, chapter 108) cuts cost only if the migration also installs the discipline. Most teams that migrate without the discipline either (a) regress within 12 months, or (b) save money by accident through other unrelated work and cannot tell which lever moved which dial. The migration is a forcing function — it gives the platform team the political capital to install the artefacts that they could not install before. The cost saving is the side-effect of the discipline, not the cause.
Reproduce this on your laptop
```shell
# Run a tiny VictoriaMetrics, push synthetic per-team series, run the budget check
docker run -d --name vm -p 8428:8428 victoriametrics/victoria-metrics
python3 -m venv .venv && source .venv/bin/activate
pip install requests prometheus-client pyyaml tabulate

# Emit 50K synthetic series tagged team=payments / team=rider / team=ledger
python3 emit_synthetic_series.py --target http://localhost:8428 \
    --teams payments,rider,ledger --series 50000

# Run the discipline check
python3 cost_discipline.py --team rider --evaluate-now

# Then simulate a PR that adds 180k series
echo '{"added_series": 180000}' > /tmp/pr_42.json
python3 cost_discipline.py --team rider --pr-diff /tmp/pr_42.json
```
The `emit_synthetic_series.py` helper is a 30-line `prometheus-client` script — left as a stretch exercise, but the kernel is `Counter('reqs', 'reqs', ['team','endpoint','status']).labels(...).inc()` in a loop, with 50K distinct label combinations sharded by team.
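For readers who would rather skip running a scrape target, here is a minimal hedged sketch of the same helper that builds Prometheus exposition-format lines and POSTs them to VictoriaMetrics' `/api/v1/import/prometheus` ingestion endpoint. The series shape and function names are illustrative, and CLI flag parsing is omitted:

```python
# emit_synthetic_series.py (sketch) — push synthetic per-team series to
# VictoriaMetrics via its Prometheus-exposition import endpoint.
# Series shape is hypothetical; stdlib only, no prometheus-client needed.
import urllib.request

def exposition_lines(teams: list[str], total: int) -> list[str]:
    """Build `total` distinct series in Prometheus exposition format,
    sharded evenly across teams; a synthetic `shard` label keeps each
    series unique."""
    per_team = total // len(teams)
    return [f'reqs_total{{team="{t}",endpoint="/e{i % 200}",shard="{i}"}} 1'
            for t in teams for i in range(per_team)]

def push(target: str, lines: list[str]) -> None:
    """POST the lines to VictoriaMetrics' import endpoint."""
    req = urllib.request.Request(f"{target}/api/v1/import/prometheus",
                                 data="\n".join(lines).encode(),
                                 method="POST")
    urllib.request.urlopen(req, timeout=30).read()

# push("http://localhost:8428", exposition_lines(["payments", "rider", "ledger"], 50_000))
print(exposition_lines(["payments", "rider"], 6)[0])
```

Either route (scrape target or direct import) produces the per-team series the budget check needs; the import route just makes the laptop reproduction a single POST.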
Where this leads next
This wall closes Part 16. The transition into Part 17 is direct: cost-and-retention is one of several disciplines that distinguish a mature observability platform team from a tooling-installation team. /wiki/building-the-team opens Part 17 with the team-shape question — what skills, what rotation, what charter the platform team needs to carry the disciplines this article describes. The three artefacts here are the cost-and-retention slice of a larger platform-team operating model.
/wiki/playbooks-post-mortems-and-blameless-culture is the cultural counterpart — the on-call rotation produces incidents, and incidents produce postmortems, and postmortems produce the next quarter's budgets. The loop in this article is one slice of the larger reliability loop.
/wiki/the-observability-maturity-model puts the discipline artefacts on a maturity scale — "have CI gate" and "have quarterly review" are concrete maturity-level checkpoints, alongside a dozen others. Part 17 is where the discipline pattern generalises beyond cost.
References
- Google SRE Book, Chapter 4 "Service Level Objectives" and Chapter 5 "Eliminating Toil" — the canonical articulation of operational discipline as engineering work, which is the frame this article extends to cost.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) — Chapter 13 ("Reducing the Cost of Observability") is the closest external analogue to the artefacts in this article; differs in emphasis (the book frames cost as a sampling problem; this article frames it as a culture problem).
- Will Larson, An Elegant Puzzle: Systems of Engineering Management (2019) — the chapter on "controls" articulates the principle that named-override gates outperform anonymous ones, which underpins artefact 2's design.
- Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018) — Chapter 7's discussion of operational maturity informs the on-call shape proposed for artefact 3.
- VictoriaMetrics, "Cardinality explorer and per-tenant limits" — the operational reference for the per-team cardinality endpoints that the runnable artefact uses.
- /wiki/wall-cardinality-is-the-billing-death-spiral — internal: the wall from Part 6 that establishes why cardinality is the dominant cost lever and the leading variable for any CI gate.
- /wiki/open-source-stacks-worth-running — internal: the previous chapter, which decides which stack the discipline runs on top of.
- /wiki/vendor-vs-self-hosted-economics — internal: the cost spreadsheet that the quarterly review ultimately reconciles against.