Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Wall: the discipline ties this all together

Aditi closes the laptop on the migration kickoff at 11:48 IST. The architecture document is written, the Helm chart is committed, the stack is chosen — VictoriaMetrics for metrics, ClickHouse for logs, Jaeger for traces. The CFO signed the vendor-vs-self-hosted spreadsheet two weeks ago. The bill is projected to drop from ₹62 lakh/quarter to ₹19 lakh/quarter once the migration completes, a saving of ₹43 lakh that will go straight into the platform-team headcount. Aditi is exhausted, satisfied, and wrong. Nine months from now, without anything anyone would call a mistake, the bill will be back at ₹54 lakh/quarter. The new cluster will be heavier, the cardinality higher, the retention longer. No single team did anything indefensible. The savings reverted because cost-and-retention is not a project with a delivery date — it is a discipline with an on-call rotation, a budget review, and a CI gate, and Yatrika hasn't built any of those yet. This is the wall closing Part 16: the tools in chapters 102–108 cut your bill twice, and discipline is what keeps it cut.

Every cost-control mechanism in Part 16 — tiered storage, downsampling, cardinality budgets, index-free log storage, vendor or self-host choice, stack pick — produces a one-time saving and then drifts back if no human owns it. The discipline is three artefacts: a quarterly cost review (numbers + drift attribution), a CI gate on cardinality and retention (catches regressions before they ship), and an observability on-call rotation that pages on bill-shape changes the same way SREs page on latency. Without those three, every Part-16 saving is half-life ~9 months.

Why every cost win drifts back — the second-law-of-bills mechanism

Tooling moves your bill from high to low once. Engineering activity moves it back. Every week, your platform ships features. New services emit new metrics. New product surfaces add new labels (vehicle_type, coupon_code, shipment_pincode). Data scientists log richer events for funnel analysis. Each addition is a single PR; each PR is reviewed for correctness, not for cost. Over 40 weeks at 60 PRs/week, the steady drip dwarfs the one-time tooling win. The bill is not a level — it is a flow, and the inflow is your engineering velocity.

Why this is structural, not a "we just need to be more careful" problem: every cardinality, retention, and storage decision is local — owned by the engineer who writes the PR — but the cost is global, paid by the platform team. Local decisions cannot see global consequences. An engineer adding a customer_id label sees their dashboard work; they do not see the 14M-series fan-out it will cause. The Coase-theorem framing: when the cost is externalised, individuals optimise their work without internalising the externality, and the aggregate drifts toward the worst-case bill. The fix is not "remind people more"; the fix is to internalise the cost in the engineer's workflow — make the PR fail when cardinality would explode, make the dashboard show the engineer's own service-team's bill, make the retention default short and require an explicit override.

The cost-drift cycle — tooling cuts the bill once, discipline holds it downA timeline diagram showing observability bill over 18 months. Quarter 1 baseline at 62 lakh per quarter. Quarter 2 migration completes, bill drops sharply to 19 lakh. Quarters 3 through 6 show steady upward drift back to 54 lakh due to cardinality additions, new services, longer retention overrides, and dashboard cardinality fan-out. A second timeline below shows the same migration but with the three discipline artefacts applied: quarterly cost review, CI gate on cardinality, and on-call rotation. The drift is bounded between 19 and 24 lakh. Annotations call out the drift drivers: new labels added, retention increased, log fields not pruned, dashboards over-emitted. The figure caption notes that without discipline the half-life of a tooling saving is approximately nine months. Cost drift after the migration — without and with discipline Same migration; the three discipline artefacts decide whether the saving holds Without discipline — saving drifts back in ~9 months Q1 Q2 Q3 Q4 Q5 Q6 62L 40L 19L migration done + new labels + retention bumps + logs not pruned + dashboard fan-out With discipline — quarterly review + CI gate + on-call rotation Q1 Q2 Q3 Q4 Q5 Q6 62L 40L 19L CI rejects 11 PRs review catches drift on-call paged ×3 stable at ~22L
Illustrative — drift trajectories for two hypothetical Yatrika futures starting from the same migration. The discipline-equipped path absorbs ~3 lakh of legitimate growth (new services, real customer-driven retention bumps) while bouncing back the other 30+ lakh of cost-blind PRs. The undisciplined path absorbs all of it because nothing in the workflow flags any of it.

The drift is not driven by carelessness — it is driven by invisibility. The engineer adding vehicle_type as a label to the rider-positioning service has no signal that this addition will multiply their team's metric series count by 8. The data engineer raising the freshness-gauge retention from 30 days to 90 days for a regulator audit has no signal that it triples their slice of the storage bill. The frontend team adding a new dashboard with a topk(50, ...) panel does not know the panel is a cardinality fan-out that will be evaluated every 30 seconds and inflate Mimir's query-path memory by 40%. The signal does not exist unless someone builds it. Discipline is what builds the signal.

The three discipline artefacts — what to actually build

The wall says: pick three artefacts, ship them, run them. Each has a clear cadence, a clear owner, and a measurable outcome. Skip any one and the drift returns.

Artefact 1 — quarterly cost review (cadence: 90 days, owner: platform-team lead). This is a 60-minute meeting with the engineering managers of every team that ships telemetry, plus the CFO or finance partner. The agenda is fixed: (a) what was the bill last quarter vs this quarter, (b) what changed in cardinality, retention, and log volume per team, (c) which 3 specific PRs / tickets caused the largest drift, (d) what is one commitment each team makes for next quarter. The meeting must produce a one-page document that goes to engineering leadership. The trap teams fall into: making this a "platform team rants at product teams" meeting. The discipline shape: every team brings their own cost-attribution table; the platform team just provides the methodology.

Artefact 2 — CI gate on cardinality and retention (cadence: every PR, owner: platform team). A check that runs on every PR touching service code, dashboard JSON, recording rules, alert rules, or retention configs. The check loads the current cardinality / retention budget for the affected team, computes the delta the PR introduces, and fails the build if the delta exceeds the budget. The gate must be fast (≤30s) and have a clear override path (a labelled escape: cost-override: approved-by:@aditi) — without the override, the gate becomes adversarial and gets disabled in a panic. Yatrika's gate, three months in, rejects ~11 PRs/quarter, of which 6 are bugs (someone added customer_id as a label without realising) and 5 are legitimate changes that get the override label after a 5-minute conversation with the team's tech lead.

Artefact 3 — observability on-call rotation (cadence: 24×7, owner: rotating SREs). Same model as the application on-call rotation, but the alerts are about the observability stack itself: ingestion-rate spike (more than 30% above 7-day median), cardinality breach (any tenant exceeded budget), storage-cost-per-day exceeded threshold, retention-policy edit pushed to production. The rotation has a runbook for each alert. Most teams skip this and try to address observability cost as a "side project" of the platform team. That fails because cost incidents are time-sensitive — a misconfigured Datadog @histogram tag with customer_id will burn ₹4 lakh/day until someone notices. Without an on-call rotation, the lag is 7–12 days. With one, it is 30 minutes.

A runnable cost-discipline scaffold

The script below is the kernel of the CI gate — the cardinality-and-retention budget check that runs on every PR. It pulls the current series count from VictoriaMetrics, compares against a per-team budget JSON file, and fails the PR if the delta would breach. The same logic doubles as the on-call alert evaluator. Owning one function — team_budget_check() — that drives both the CI gate and the on-call alert is the discipline shape: shared definitions, shared enforcement, no drift between "what we lint for" and "what we page on".

# cost_discipline.py — CI gate + on-call alert evaluator for observability cost.
# pip install requests pyyaml tabulate
# Usage (CI):    python3 cost_discipline.py --pr-diff diff.json --team payments
# Usage (alert): python3 cost_discipline.py --evaluate-now --team payments

import argparse, json, sys, time, requests
from dataclasses import dataclass
from tabulate import tabulate

VM = "http://victoriametrics.platform:8428"

@dataclass
class TeamBudget:
    team: str
    series_budget: int       # max active series this team owns
    log_gb_per_day: float    # max log volume per day
    retention_days: int      # max retention any of this team's metrics can have

# Loaded from a checked-in YAML; the team-leads PR-edit this file and the
# platform team approves; CFO sees the diff in the quarterly review.
BUDGETS = {
    "payments":   TeamBudget("payments",   1_200_000,  240.0, 30),
    "rider":      TeamBudget("rider",        850_000,  140.0, 14),
    "ledger":     TeamBudget("ledger",       400_000,   90.0, 90),
    "data-plat":  TeamBudget("data-plat",  4_500_000,  900.0, 14),
    "ml-plat":    TeamBudget("ml-plat",    2_100_000,  320.0, 14),
}

def current_series_for_team(team: str) -> int:
    r = requests.get(f"{VM}/api/v1/series/count",
                     params={"match[]": f'{{team="{team}"}}'} , timeout=10)
    r.raise_for_status()
    return int(r.json()["data"][0])

def projected_series_after_pr(team: str, pr_added_series: int) -> int:
    return current_series_for_team(team) + pr_added_series

def team_budget_check(team: str, pr_added_series: int = 0) -> dict:
    b = BUDGETS[team]
    proj = projected_series_after_pr(team, pr_added_series)
    headroom = b.series_budget - proj
    pct = 100 * proj / b.series_budget
    return {
        "team": team,
        "current_series": proj - pr_added_series,
        "pr_delta": pr_added_series,
        "projected_series": proj,
        "budget": b.series_budget,
        "headroom": headroom,
        "utilisation_pct": round(pct, 1),
        "verdict": "PASS" if headroom > 0 else "FAIL",
    }

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--team", required=True)
    ap.add_argument("--pr-diff", help="JSON with key 'added_series'")
    ap.add_argument("--evaluate-now", action="store_true")
    args = ap.parse_args()

    pr_added = 0
    if args.pr_diff:
        with open(args.pr_diff) as f:
            pr_added = json.load(f).get("added_series", 0)

    result = team_budget_check(args.team, pr_added)
    print(tabulate([result], headers="keys", tablefmt="github"))

    if result["verdict"] == "FAIL":
        print(f"\nFAIL — projected {result['projected_series']:,} series exceeds "
              f"budget {result['budget']:,} (headroom {result['headroom']:,}).")
        print("Either: (a) reduce cardinality, (b) request budget increase via "
              "PR to budgets.yaml, (c) add `cost-override: approved-by:@<lead>` "
              "label and proceed at your own risk.")
        sys.exit(1)
    print(f"\nPASS — utilisation {result['utilisation_pct']}% of budget.")
    sys.exit(0)

if __name__ == "__main__":
    main()

Sample run on a PR that adds the vehicle_type label to the rider-positioning service (estimated +180,000 series from a 6-value label cross-product across existing series):

$ python3 cost_discipline.py --team rider --pr-diff /tmp/pr_42.json
| team   |   current_series |   pr_delta |   projected_series |   budget |   headroom |   utilisation_pct | verdict   |
|--------|------------------|------------|--------------------|----------|------------|-------------------|-----------|
| rider  |          810,000 |    180,000 |            990,000 |  850,000 |   -140,000 |             116.5 | FAIL      |

FAIL — projected 990,000 series exceeds budget 850,000 (headroom -140,000).
Either: (a) reduce cardinality, (b) request budget increase via PR to budgets.yaml,
(c) add `cost-override: approved-by:@<lead>` label and proceed at your own risk.

Walking the load-bearing lines:

  • BUDGETS = {...} as a checked-in dictionary — this is the discipline-as-code shape. The budget is not a wiki page or a Confluence document; it is a YAML file in the platform repo. Changing it requires a PR. The CFO sees those PRs in the quarterly cost review (artefact 1). The diff is the audit trail. Why this shape outperforms a wiki page: a wiki page allows silent drift — anyone can edit it, no one notices, and the "budget" no one tracks ceases to be a budget. A checked-in YAML in the platform repo enforces three things at once: (a) every change is reviewed, (b) the change history is the conversation history, (c) the same file the CI reads is the file the on-call rotation reads, so there is one source of truth. The budget file becomes a living contract between platform and product teams that survives org changes, manager turnover, and the natural decay of "we agreed last quarter to..."
  • current_series_for_team calling /api/v1/series/count — VictoriaMetrics' fast count endpoint, returns the integer in <500ms. Mimir has the equivalent at /api/v1/cardinality/active_series. Avoid the slower /api/v1/series (returns full label sets, multi-second on a 4M-series cluster). The CI gate must finish in <30s end-to-end, so the cardinality check is the budget-tightest call.
  • pr_added_series as a parameter, not a measurement — for a PR that hasn't merged yet, the gate cannot measure the actual cardinality impact (the code isn't deployed). Instead, the PR author estimates it (or the gate estimates by parsing the diff for new label additions). Estimates are imperfect; the gate accepts that and uses headroom as a buffer. The post-merge measurement happens in the next quarterly review when the actual delta gets compared to the estimate, and consistently-wrong estimators lose budget headroom.
  • The override path (cost-override: approved-by:@<lead>) — every blocking gate needs an escape hatch, and the escape hatch must have a name on it. Anonymous overrides (a --force flag, a [skip-cost-check] commit message) become a habit; named overrides force a human-to-human conversation. This is not security theatre — it is workflow-design that aligns the override cost with the override consequence. Why named overrides bend the curve: if the override is anonymous, the marginal cost of using it is zero — engineers reach for it whenever the gate fires. If the override requires tagging a specific person, the marginal cost is "that person will ask why" — which is high enough to make engineers fix the underlying issue ~70% of the time. The gate's purpose is not to block every regression (impossible, would be adversarial) — it is to make the regressions visible before they ship. The override path is what keeps the gate cooperative.
  • sys.exit(1) on FAIL with structured output — the CI runner reads the exit code; the on-call dashboard reads the JSON. Same script, two consumers. This is the artefact-3 / artefact-2 unification: when the on-call rotation's runbook says "run cost_discipline.py for each team", they get the same view the CI gate uses, so when an alert fires there is no "I need to figure out the right query" delay.
  • The verdict shape (PASS / FAIL plus a headroom number) — both pieces matter. A binary pass/fail without a number cannot guide remediation; a number without a verdict requires the reader to remember the threshold. The combined shape is what makes the script readable as both a CI step and an alert evaluation.

How the artefacts compose — the operating loop

The three artefacts are not independent — they form a loop. The CI gate enforces the budget at PR time. The on-call rotation enforces the budget at runtime (when reality drifts from the estimate). The quarterly review reconciles the two — actual cost vs predicted cost — and rebases the budgets for next quarter. Each artefact catches a different drift mode: the CI gate catches known regressions, the on-call rotation catches unknown spikes, the quarterly review catches slow drift that neither real-time mechanism flags.

The cost-discipline operating loop — three artefacts, three drift modesA circular flow diagram with three nodes connected by arrows. Top node is CI gate, labelled "catches known regressions at PR time, rejects ~11 PRs per quarter, runs in under 30 seconds". Right node is on-call rotation, labelled "catches unknown spikes at runtime, pages within 30 minutes, runbook references same script as CI". Bottom node is quarterly review, labelled "catches slow drift, reconciles actual vs predicted, rebases budgets, output is one-page doc to leadership". Arrows go in a clockwise loop and inwards: CI gate enforces the budget the review set, the review's input is the on-call incident history plus the CI rejection log, the on-call uses the same script the CI uses. A central diamond labelled "shared budget YAML" connects all three. Annotations show that breaking any one node breaks the loop. The discipline loop — three artefacts that cover three drift modes Skip any one and the saving regresses budgets.yaml single source of truth Artefact 1 — CI gate drift mode: known regressions at PR time cadence: every PR, <30s owner: platform team output: PR pass/fail + headroom Artefact 3 — on-call rotation drift mode: unknown runtime spikes cadence: 24×7, page in 30 min owner: rotating SREs output: incident → runbook → fix Artefact 2 — quarterly review drift mode: slow legitimate drift cadence: 90 days, 60-min meeting owner: platform lead + CFO output: one-page to leadership PR rejection log feeds review on-call incidents inform next quarter's budgets review rebases CI budget thresholds
Illustrative — the three artefacts share the budget YAML and pass signals to each other. The CI gate's rejection log is an input to the quarterly review. The on-call's incident history is another input. The quarterly review's output is a new budget.yaml that the CI gate reads on the next PR. Cut the loop anywhere and the discipline degrades to a tooling-only posture.

The non-obvious insight closing Part 16: the bill is a leading indicator of engineering culture, not a lagging indicator of telemetry choice. A team running Mimir + Loki + Tempo (the heaviest stack from the previous chapter) with the three discipline artefacts in place will out-cost-perform a team running SigNoz (the lightest stack) without them. The artefacts are not "process overhead" sitting on top of the engineering work — they are the engineering work for the cost dimension. Why this is empirically true and not an opinion: across the Indian platform-team migrations the chapters in Part 16 reference (Yatrika, PaisaBridge, the unicorn marketplaces), the variance in observability spend per million-active-series across stacks is ~2× (₹1.2L vs ₹2.4L per million series per month). The variance across discipline-equipped vs un-equipped teams on the same stack is ~5× (₹1.2L vs ₹6L per million series per month). The discipline dimension dominates the stack dimension by 2.5× — which is why the "right stack" question, while real, is the smaller question. The bigger question is "do you have the three artefacts." Part 17 is where this thread continues: the platform-team practices, the postmortem culture, the maturity model that determines whether Yatrika is a 5× variance team or a 1× variance team.

Common confusions

  • "Cost discipline is a finance team responsibility." Half-true and dangerous. Finance owns the budget total (₹X lakh/quarter) but not the budget allocation across teams or the enforcement mechanism. The platform team owns enforcement; the team-leads own allocation; finance audits and signs off. If finance owns the whole loop, you get a year-end bill shock and a 30% cut mandate that breaks observability for everyone. The right shape: continuous attribution and per-team budgets that finance reviews quarterly, not approves daily.
  • "We can replace the quarterly review with a real-time dashboard." No. The dashboard informs the review but does not replace it. The review is a forcing function for cross-team conversation — the platform lead and the rider-team manager need to agree on next quarter's budget, and that conversation does not happen if it is not scheduled. Real-time dashboards reduce surprise but do not produce alignment. Both are necessary; neither is sufficient.
  • "Cardinality CI gates create friction and slow down development." True for the first 30 days; false after that. The first month of a CI gate produces ~50% rejection rate as the team learns to estimate cardinality impact. By month 3, the rejection rate stabilises at ~5–10%, almost all of which are legitimate concerns the engineer wanted flagged. The gate becomes a thinking aid, not a friction. The teams that report high friction are usually running the gate without an override path or without a per-team budget — both fixable design errors, not inherent to gating.
  • "Once the migration is done, we can disband the on-call rotation." This is the most common reversion. Yatrika did this in month 6, declared victory, dissolved the rotation. By month 14 the bill was halfway back. The on-call rotation is not a migration-period artefact — it is a permanent engineering function, the same way the application-on-call rotation is permanent even though the original service has been live for years. Cost spikes are a category of incident; without an incident response, they go unaddressed.
  • "Per-team budgets create internal politics." They make the politics visible, which is a feature not a bug. Without per-team budgets, the politics still exist — they just take the form of platform-team-vs-product-team complaints in retrospectives. With explicit budgets, the conversation moves to "this team needs more budget because they ship more telemetry-rich features" which is a healthier and more decidable conversation. Avoiding the budget conversation does not eliminate the politics — it just makes the politics implicit.
  • "Discipline is a culture problem, not an engineering problem." This framing leads to slogans that don't change behaviour ("we should care more about cost"). Discipline is an engineering systems problem — what gates exist, what alerts page, what reviews happen, with which artefacts. Once the artefacts exist, the culture follows. Trying to fix the culture without the artefacts is what produces the next bill shock.

Going deeper

What "per-team attribution" actually requires — the label discipline

The CI gate, the on-call rotation, and the quarterly review all assume you can answer the question "which team owns this series?" That assumption requires a team label on every metric. Adding team retroactively to a multi-tenant cluster with 4M series is a 6–10 week project: every Prometheus scrape config needs the right external_labels: { team: ... }, every OTel Collector resource processor needs to inject the team based on Kubernetes namespace, every recording rule needs to preserve team through aggregations. The teams that try to skip this step (or make team a "best-effort" label) discover that their attribution is 30% wrong, the quarterly review degenerates into "who actually owns these orphan 800K series", and the discipline loop breaks at the foundation. The investment in clean attribution is the prerequisite for all three artefacts. Yatrika's migration plan budgets 8 weeks for label-discipline work before the CI gate goes live — and considers it the cheapest part of the migration.

The economics of dashboard-as-cost — why dashboards are a budget item

Dashboards are usually treated as zero-cost (they're "just queries"), but in a cluster with 4M series and 200 dashboards each refreshing every 30s, the query path consumes 25–40% of the cluster's CPU and 15% of its memory. A topk(50, sum by (customer_id) (rate(...[5m]))) panel evaluated every 30 seconds is a continuous 14M-row scan. Dashboards are a budget item; a per-team dashboard quota is a real artefact some teams ship: each team gets N panel-equivalents per quarter, where a panel-equivalent is weighted by query cost. Adding a 51st panel requires removing one or buying budget from another team. This is the discipline shape extended one layer beyond cardinality — and it tends to surface only after the cardinality discipline is in place, because cardinality dominates spend until you cap it.

Why "use vendor X for cost" almost never works as a strategy

Switching from Datadog to a self-hosted Mimir to "save money" without the discipline artefacts is a reliable way to re-arrive at the same bill in a different denomination — instead of paying Datadog ₹62 lakh/quarter, you pay AWS ₹38 lakh + 2 SREs' time ₹24 lakh = same bill, more operational risk. The vendor or stack switch (chapter 107, chapter 108) cuts cost only if the migration also installs the discipline. Most teams that migrate without the discipline either (a) regress within 12 months, or (b) save money by accident through other unrelated work and cannot tell which lever moved which dial. The migration is a forcing function — it gives the platform team the political capital to install the artefacts that they could not install before. The cost saving is the side-effect of the discipline, not the cause.

Reproduce this on your laptop

# Run a tiny VictoriaMetrics, push synthetic per-team series, run the budget check
docker run -d --name vm -p 8428:8428 victoriametrics/victoria-metrics

python3 -m venv .venv && source .venv/bin/activate
pip install requests prometheus-client pyyaml tabulate

# Emit 50K synthetic series tagged team=payments / team=rider / team=ledger
python3 emit_synthetic_series.py --target http://localhost:8428 \
        --teams payments,rider,ledger --series 50000

# Run the discipline check
python3 cost_discipline.py --team rider --evaluate-now
# Then simulate a PR that adds 180k series
echo '{"added_series": 180000}' > /tmp/pr_42.json
python3 cost_discipline.py --team rider --pr-diff /tmp/pr_42.json

The emit_synthetic_series.py helper is a 30-line prometheus-client script — left as a stretch exercise but the kernel is Counter('reqs', 'reqs', ['team','endpoint','status']).labels(...).inc() in a loop, with 50K distinct label combinations sharded by team.

Where this leads next

This wall closes Part 16. The transition into Part 17 is direct: cost-and-retention is one of several disciplines that distinguish a mature observability platform team from a tooling-installation team. /wiki/building-the-team opens Part 17 with the team-shape question — what skills, what rotation, what charter the platform team needs to carry the disciplines this article describes. The three artefacts here are the cost-and-retention slice of a larger platform-team operating model.

/wiki/playbooks-post-mortems-and-blameless-culture is the cultural counterpart — the on-call rotation produces incidents, and incidents produce postmortems, and postmortems produce the next quarter's budgets. The loop in this article is one slice of the larger reliability loop.

/wiki/the-observability-maturity-model puts the discipline artefacts on a maturity scale — "have CI gate" and "have quarterly review" are concrete maturity-level checkpoints, alongside a dozen others. Part 17 is where the discipline pattern generalises beyond cost.

References