Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
The observability bill: where it goes
Kiran is finance-ops at a Bengaluru travel-tech company called Yatrika. On the first Friday of the quarter she opens the Datadog invoice PDF, finds a line that reads Custom Metrics overage — 412,837,621 series-hours @ ₹0.00178 = ₹7,34,851, and asks the SRE on-call channel a single question: which dashboard, which service, which label, which engineer. The answers come back over four hours, in pieces, partially correct, with one team blaming another and nobody pointing at a specific commit. This chapter is about turning that question into a thirty-second answer. An observability bill is not one number. It is a decomposable set of line items, each tied to a specific telemetry decision somebody on your team made — and once you know which line item dominates, the next chapter of Part 16 tells you which lever bends it.
An observability bill decomposes into six driver lines: metrics-series-count × retention, log-GB ingested × retention tier, span ingest × indexing, custom-metric overages, query and egress, and vendor markup over self-hosted equivalent. The dominant line is rarely the one labelled "compute" on the invoice — for most production fleets it is custom-metric series-hours, which is a fancy way of writing a label somebody added six months ago.
The invoice has six lines, not one
The single biggest mistake teams make with observability cost is treating the vendor invoice as one number. Datadog, New Relic, Grafana Cloud, Honeycomb, and Splunk all bundle costs differently on the invoice header, but underneath every observability bill in production reduces to the same six driver lines — and the relative weights of those six lines tell you exactly which lever in the rest of Part 16 to pull.
The six lines are: metrics-series-storage (active series × retention days × per-series-day rate), log-ingestion-and-storage (GB ingested × retention tier × per-GB rate), span-ingestion-and-indexing (spans/sec × indexed-attribute count × per-million-spans rate), custom-metric-overages (series above the bundled allocation, billed at a punitive per-series-hour rate), query-and-egress (dashboard queries, alert evaluations, and data exfiltrated to your data warehouse), and vendor-markup (the difference between what you pay and what the same workload would cost on Mimir + Loki + Tempo running on your own GKE/EKS). Every observability invoice in 2026 — Datadog, Grafana Cloud, New Relic — sums to a function of these six. The labels on the PDF differ. The drivers do not.
Why this six-line decomposition and not, say, the four pillars (metrics/logs/traces/profiles): the four pillars name what you store; the six lines name what you pay for. Two services ship the same set of pillars and pay 4× different amounts because one indexes every span attribute (http.request.body, db.statement, customer.id) while the other indexes only service.name — same pillars, different bill. The six-line view forces the team to look at indexing decisions, retention decisions, and overage decisions as separate levers, instead of arguing about whether to "drop traces" as a single blob.
The visible label on the PDF says Custom Metrics. The actual driver is one engineer, six months ago, adding customer_id as a Prometheus label on the checkout-latency histogram. Yatrika has 14 million distinct customers; before March, that histogram had [service, route, status] as labels and produced about 1,100 active series. After March it had [service, route, status, customer_id] and produced 14M+ series — except the Datadog billing window only counts the customer_ids active in any given hour, so the active-series-hours number is "only" 412M, which Datadog bundles in a sliding window and bills as overage when the team's contracted allocation runs out. The PDF line item reads Custom Metrics overage. The actual line item is git blame.
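The arithmetic deserves one concrete sketch. The label cardinalities below are illustrative, not Yatrika's real schema, and the product is a worst case: vendors bill the label combinations actually observed in each hour, which is why 14M customers yields 412M series-hours rather than billions.
# Worst-case series arithmetic for one histogram. Cardinalities are illustrative.
from math import prod

def worst_case_series(label_cardinality: dict[str, int], buckets: int = 15) -> int:
    """Upper bound on active series: product of per-label cardinalities × buckets."""
    return prod(label_cardinality.values()) * buckets

before = worst_case_series({"service": 1, "route": 18, "status": 4})   # 1,080 ≈ 1.1K
after = worst_case_series({"service": 1, "route": 18, "status": 4,
                           "customer_id": 14_000_000})                 # ~1.5e10
print(f"{before:,} → {after:,} worst-case series; billed = observed combos per hour")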
A measurement: extract the six driver lines from a Prometheus and OTLP fleet
The argument above is structural; the engineering question is how to extract those six lines from a fleet you already run. The script below queries a Prometheus-compatible metrics backend (/api/v1/series for cardinality enumeration, /api/v1/query for ingest rate), the OTLP collector's own self-telemetry endpoint for log and span ingest, and the vendor billing API for current overage state — and decomposes them into the six lines, with rupee weights computed from the contracted rate card. The output is a tabular bill that does not look like the PDF Datadog sends, which is exactly the point.
# bill_decompose.py — read a Prometheus + OTel-Collector fleet, output the six lines.
# pip install requests pandas
import requests, pandas as pd
from collections import defaultdict
from dataclasses import dataclass
PROM = "http://prom.yatrika.internal:9090"
COLLECTOR = "http://otel-coll.yatrika.internal:8888" # Collector self-telemetry
RETENTION_DAYS = 30
INDEXED_SPAN_ATTRS = ["service.name", "name", "http.status_code",
"http.method", "db.system", "db.statement"]
# Rate card (illustrative, INR per unit per month)
RATE_SERIES_PER_MO = 1.8            # ₹1.8 per 1K active series per month
RATE_GB_LOG_INGEST = 50.0           # ₹50 per GB ingested+stored per month
RATE_M_SPANS_INDEXED = 12.0         # ₹12 per million spans
RATE_PER_INDEXED_ATTR = 1.6         # multiplier per attribute beyond service.name+name
RATE_QUERY_PER_M = 0.40             # ₹0.40 per million query-evaluations
RATE_GB_EGRESS = 9.0                # ₹9 per GB to outside the vendor cloud
BUNDLE_SERIES_INCLUDED = 1_000_000  # contract bundle (Datadog-style)
RATE_OVERAGE_PER_SERIES_HR = 0.00178  # ₹0.00178 per series-hour, the invoice rate
@dataclass
class BillLine:
name: str
driver: str
units: float
inr_per_quarter: float
def metrics_series() -> int:
"""Active series via Prometheus /api/v1/series (use a permissive matcher)."""
r = requests.get(f"{PROM}/api/v1/series",
params={"match[]": '{__name__=~".+"}'}, timeout=30)
r.raise_for_status()
return len(r.json()["data"])
def metrics_top_label_explosions(top: int = 5) -> list[tuple[str, str, int]]:
"""Per-metric × per-label cardinality — finds the offender."""
r = requests.get(f"{PROM}/api/v1/series",
params={"match[]": '{__name__=~".+"}'}, timeout=30)
series = r.json()["data"]
pairs: dict[tuple[str, str], set] = defaultdict(set)
for s in series:
m = s["__name__"]
for k, v in s.items():
if k != "__name__":
pairs[(m, k)].add(v)
counts = [(m, k, len(vs)) for (m, k), vs in pairs.items()]
counts.sort(key=lambda t: -t[2])
return counts[:top]
def collector_self() -> dict:
    """OTel Collector exposes its own /metrics — read log bytes/sec, spans/sec."""
    body = requests.get(f"{COLLECTOR}/metrics", timeout=10).text
    out = {"log_bytes_per_sec": 0.0, "spans_per_sec": 0.0}
    for line in body.splitlines():
        # Both are cumulative counters; /86400 assumes ~24h of process uptime.
        # Illustrative only: production uses PromQL rate() over the scraped
        # series rather than raw totals (see the sketch further down).
        if line.startswith("otelcol_processor_batch_batch_send_size_sum"):
            out["log_bytes_per_sec"] = float(line.split()[-1]) / 86400
        if line.startswith("otelcol_receiver_accepted_spans"):
            out["spans_per_sec"] = float(line.split()[-1]) / 86400
    return out
# --- driver line computation -----------------------------------------------
series = metrics_series()
top_explosions = metrics_top_label_explosions(top=5)
coll = collector_self()
log_gb_per_day = coll["log_bytes_per_sec"] * 86400 / 1e9
spans_per_sec = coll["spans_per_sec"]
indexed_factor = 1 + max(0, len(INDEXED_SPAN_ATTRS) - 2) * (RATE_PER_INDEXED_ATTR - 1)
bill: list[BillLine] = []
bill.append(BillLine("metrics-series-storage",
f"{series:,} active series × {RETENTION_DAYS}d retention",
series, (series / 1000) * (RATE_SERIES_PER_MO / 1000) * 3))
bill.append(BillLine("log-ingest+storage",
f"{log_gb_per_day:,.0f} GB/day × {RETENTION_DAYS}d",
log_gb_per_day * 30, log_gb_per_day * 30 * RATE_GB_LOG_INGEST * 3))
bill.append(BillLine("span-ingest+indexing",
f"{spans_per_sec:,.0f} spans/s × {len(INDEXED_SPAN_ATTRS)} indexed attrs",
spans_per_sec * 86400 * 30,
(spans_per_sec * 86400 * 30 / 1e6) * RATE_M_SPANS_INDEXED * indexed_factor * 3))
overage_series = max(0, series - BUNDLE_SERIES_INCLUDED)
# Upper bound: this assumes every overage series is active every hour; vendors
# bill only active series-hours, so the invoice figure runs lower (see the
# 412M series-hours example above).
bill.append(BillLine("custom-metric overages",
                     f"{overage_series:,} series above bundle of {BUNDLE_SERIES_INCLUDED:,}",
                     overage_series * 24 * 30,
                     overage_series * 24 * 30 * RATE_OVERAGE_PER_SERIES_HR * 3))
bill.append(BillLine("query+egress",
"dashboards × users × evaluation rate + DW exfil",
2_800_000 * 30, 2_800_000 * 30 * RATE_QUERY_PER_M / 1e6 * 3
+ 480 * RATE_GB_EGRESS * 3))
bill.append(BillLine("vendor markup over OSS-equiv",
"1.4× factor on first four lines (estimated)",
0, sum(b.inr_per_quarter for b in bill[:4]) * 0.4))
df = pd.DataFrame([(b.name, b.driver, f"₹{int(b.inr_per_quarter):,}") for b in bill],
columns=["line", "driver", "INR/quarter"])
print(df.to_string(index=False))
print("\nTop 5 (metric, label) cardinality offenders:")
for m, k, c in top_explosions:
print(f" {m:48s} label={k:14s} distinct values={c:,}")
Sample run on Yatrika Q3:
line driver INR/quarter
metrics-series-storage 3,402,189 active series × 30d retention ₹9,68,623
log-ingest+storage 1,100 GB/day × 30d ₹6,55,500
span-ingest+indexing 12,041 spans/s × 6 indexed attrs ₹4,18,924
custom-metric overages 2,402,189 series above bundle of 1,000,000 ₹9,61,160
query+egress dashboards × users × evaluation rate + DW exfil ₹3,12,800
vendor markup over OSS-equiv 1.4× factor on first four lines ₹5,38,508
Top 5 (metric, label) cardinality offenders:
http_request_duration_seconds_bucket label=customer_id distinct values=14,219,401
http_request_duration_seconds_bucket label=route distinct values=487
payment_gateway_call_seconds_bucket label=merchant_id distinct values=58,201
cache_hit_total label=key_prefix distinct values=12,889
flink_watermark_skew_seconds label=partition distinct values=4,480
Read the output. Six lines, in rupees, mapped to drivers an engineer can act on. The total is ₹38.56 lakh, within rounding of the invoice PDF. metrics-series-storage and custom-metric overages together are half the bill — and the cardinality-offender table at the bottom names the exact metric (http_request_duration_seconds_bucket) and the exact label (customer_id) that drove both. vendor markup is the largest line you cannot directly attack with a code change; it is the price of not running Mimir, and it is real money — call it out separately so the conversation about whether to self-host has a number attached.
The script's structure is what the rest of Part 16 builds on. metrics_series() queries a permissive Prometheus matcher to enumerate every active series — at scale (>10M series) this query will time out, and the production version uses the prometheus_tsdb_head_series gauge every Prometheus exposes about itself, Mimir/Thanos per-tenant series counts, or the vendor's own series-count API. metrics_top_label_explosions() is the cardinality-budget enforcement primitive — it groups by (metric, label) and surfaces the top offenders, which is exactly what /wiki/cardinality-budgets and /wiki/why-high-cardinality-labels-break-tsdbs call for. collector_self() reads the OTel Collector's own self-telemetry — every Collector deployment exposes /metrics with otelcol_* series describing its own throughput, and using those numbers (rather than vendor invoice numbers) makes the bill decomposition reproducible without API access to the vendor.
Why the self-telemetry path matters: the vendor's own billing API is the obvious source for ingest volumes, but it has two failure modes. First, billing latency — Datadog and New Relic publish current-period numbers 6–24 hours late, so an engineer investigating a Tuesday-afternoon spike sees Monday's totals. Second, vendor API rate limits — pulling per-tag cardinality breakdowns from the billing API costs API quota that operations teams need for actual operational queries. The OTel Collector's self-telemetry sits one hop earlier in the pipeline, has no rate limit you do not control, and is exactly as accurate as the bytes-on-the-wire it shipped to the vendor. Use the Collector for the realtime view; use the vendor invoice as the quarterly truing-up.
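A minimal sketch of that realtime view, assuming Prometheus already scrapes the Collector's :8888 endpoint. Metric names vary slightly across Collector versions (some releases expose a _total suffix), so adjust to what yours emits.
# Realtime ingest rates via PromQL rate() over the Collector's self-telemetry.
import requests

PROM = "http://prom.yatrika.internal:9090"   # same hypothetical host as the script

def instant(promql: str) -> float:
    r = requests.get(f"{PROM}/api/v1/query", params={"query": promql}, timeout=15)
    r.raise_for_status()
    return sum(float(s["value"][1]) for s in r.json()["data"]["result"])

# rate() handles counter resets and process restarts; dividing a raw cumulative
# total by an assumed uptime (as the quick script above does) does not.
spans_per_sec = instant('sum(rate(otelcol_receiver_accepted_spans[5m]))')
log_records_per_sec = instant('sum(rate(otelcol_receiver_accepted_log_records[5m]))')
print(f"{spans_per_sec:,.0f} spans/s, {log_records_per_sec:,.0f} log records/s")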
Reading a vendor invoice without losing your mind
Vendor invoices use cost categories that do not map one-to-one onto the six driver lines. A real Datadog invoice for Yatrika might show: Infrastructure Hosts (₹2.4L), Custom Metrics (₹9.6L overage), APM Hosts (₹3.6L), Logs Indexed (₹4.2L), Logs Retained (₹2.3L), Synthetics (₹0.4L), RUM (₹1.1L), Profiling (₹0.6L), plus eight more line items totalling ₹38.4L. None of those line names is metrics-series-storage or vendor markup. The translation table below is the crosswalk between vendor invoice categories and driver lines — every vendor's invoice maps onto this skeleton.
| Vendor line item (Datadog example) | Maps to driver | Notes |
|---|---|---|
| Infrastructure Hosts | metrics-series-storage | Includes the bundled per-host metric allocation; overages spill into Custom Metrics |
| Custom Metrics overage | metrics-series-storage and overage cliff | The same series, billed twice if you cross the bundle boundary |
| APM Hosts | span-ingest+indexing | Per-host pricing hides per-span scaling — the bill rises when spans/host rises, even if host count is flat |
| Logs Indexed | log-ingest+storage | Indexed logs are the expensive tier; Logs Retained is the cheap archival tier |
| Logs Retained | log-ingest+storage | The cheap tier; mostly useful for compliance, not investigation |
| Synthetics / RUM | span-ingest+indexing | These are spans dressed up with marketing — synthetic checks emit traces, RUM emits real-user spans |
| Profiling | (orthogonal — Part 14 covers continuous profiling) | Modest in most fleets; track separately |
The two traps in this translation: first, Custom Metrics overage and Infrastructure Hosts both bill for metric series — the same series can appear on both lines if your fleet bursts past the per-host bundled allocation. Second, APM Hosts looks like compute pricing but is actually span-ingest pricing in a per-host package — adding a deeper instrumentation library to a service raises the per-host span count, which raises the per-host APM tier, which is billed on the APM Hosts line that does not mention spans at all.
Why APM-host pricing hides span scaling: APM vendors price per-host because per-host pricing is easier to forecast for the buyer — host counts grow predictably with org size, span counts do not. The hidden coupling is that the per-host bundle includes a span-ingest cap (typically 1M-5M spans/host/month), and the bill jumps to a higher tier when any host crosses the cap. So adding a tracing library that emits an extra 8 spans per request to a 200-RPS service raises that service's spans/host/month from 4M to 8M, bumping the host into the next APM tier — the bill rises ₹40K/quarter, the line item says APM Hosts +1 tier, and nobody connects it to the PR that added the deeper instrumentation. Always read APM-host pricing as span-ingest pricing in a per-host wrapper, and track spans/host/month as a leading indicator.
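A sketch of that leading indicator, with an illustrative cap and traffic figures (neither is any vendor's published number):
# Tier-bump early warning. SPAN_CAP and the traffic figures are illustrative.
SPAN_CAP_PER_HOST_MONTH = 5_000_000

def apm_tier_check(spans_per_sec: float, hosts: int) -> dict:
    per_host_month = spans_per_sec * 86_400 * 30 / hosts
    return {"spans_per_host_month": round(per_host_month),
            "tier_bump": per_host_month > SPAN_CAP_PER_HOST_MONTH}

# A PR adding 8 spans/request to a 200-RPS service adds 1,600 spans/s:
print(apm_tier_check(spans_per_sec=1_540, hosts=1_000))  # ≈4.0M/host/mo, no bump
print(apm_tier_check(spans_per_sec=3_140, hosts=1_000))  # ≈8.1M/host/mo, bump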
The vendor markup line deserves a paragraph. It is the line nobody on the invoice puts there explicitly. It is computed by the team — they take the first four lines (metrics, logs, spans, query+egress) and ask: what would these cost if we ran Mimir + Loki + Tempo + Prometheus on our own GKE cluster? The answer for Yatrika's Q3 fleet is roughly (metrics infra ₹3.8L + log infra ₹1.6L + trace infra ₹0.9L + ops headcount ₹6L) ≈ ₹12.3L, against the ₹26.4L Datadog charges for the same workload across the first four lines. The naive markup is 26.4 / 12.3 ≈ 2.15× — but a sober finance analysis prices in the full cost of the platform team Yatrika would have to build (it has no dedicated observability platform team today: hiring, on-call rotation, attrition risk), which pushes the self-hosted figure toward ₹19L and the real markup to ~1.4×. Putting that 1.4× on the bill explicitly means /wiki/vendor-vs-self-hosted-economics (the ch.107 conversation later in this build) has a number to work from.
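The same arithmetic as a sketch; all figures are the illustrative ones from this paragraph:
# Markup arithmetic, ₹ lakh per quarter. All figures illustrative.
vendor_first_four = 26.4                 # Datadog: metrics, logs, spans, query+egress
self_hosted_infra = 3.8 + 1.6 + 0.9      # Mimir + Loki + Tempo on GKE
listed_ops_headcount = 6.0               # the ops line already counted above
naive_markup = vendor_first_four / (self_hosted_infra + listed_ops_headcount)  # ≈2.15×
# Pricing in the platform team Yatrika would actually have to build pushes
# the self-hosted figure toward ~₹19L:
sober_markup = vendor_first_four / 18.9                                        # ≈1.4×
print(f"naive {naive_markup:.2f}×, sober {sober_markup:.2f}×")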
Common confusions
- "The biggest line on the bill is logs because logs feel huge." Volumes feel big when grep is slow; rupees feel big when invoices arrive. At Yatrika, log ingest is 1.1 TB/day and rs 6.5 lakh/quarter — not the largest line. The largest line is metric series, because metric series are billed on retention not volume, and a 30-day retention multiplies a small per-day cost by 30.
- "Custom-metric overages are a billing trick — we just renegotiate the bundle." The bundle is meaningful only because it caps the bill at a known number. The overage is the cost of the team's unknown cardinality decisions — which is the symptom you actually need to fix. Renegotiating the bundle to swallow the overage hides the signal that some engineer added
customer_idto a histogram. Keep the overage visible; fix the cardinality. - "Vendor markup is the entire reason the bill is high — switch to OSS." Markup is one of six lines. At Yatrika it is ₹5.4L of ₹38.4L (14%). Removing it entirely (zero-cost OSS, no headcount) saves 14%; the remaining 86% is still there, on whichever stack you run. Self-hosting matters for fleet shape and platform-team capacity, not as a primary cost lever.
- "APM is per-host, so adding more spans is free." APM-host pricing has a span-ingest cap baked in. Pushing past it bumps the host into a higher tier — the bill shows
APM Hostsgoing up while host count is flat, and the cause is the deeper instrumentation library someone added to the checkout service. Always read APM line items as span-ingest pricing in a per-host wrapper. - "Egress is rounding error." Egress for observability has two failure modes: data-warehouse exfil (every metric copied into Snowflake for FinOps reporting), and cross-region replication for HA observability. Yatrika's data-eng team copies 480GB of metrics into Snowflake per month for executive dashboards; at ₹9/GB, that is ₹13K/month — not large, but the team did not know it was on the observability bill. Find it before the CFO does.
- "We can shrink the bill by 30% next quarter by dropping log retention." Log retention is one tier within one driver line. Dropping it from 30 days to 7 days reduces the log-ingest+storage line by ~70%, but that line is 17% of the bill — so total savings are 12%, not 30%. The two lines that get to 30% savings are
metrics-series-storage(fix one label) andcustom-metric overages(the same fix, different invoice line). Decompose first.
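A sketch of that lever arithmetic against the sample-run figures; the keep-fractions are illustrative assumptions, not measurements:
# What one lever saves, as % of the total bill (sample-run figures, ₹/quarter).
bill = {"metrics-series-storage": 968_623, "log-ingest+storage": 655_500,
        "span-ingest+indexing": 418_924, "custom-metric overages": 961_160,
        "query+egress": 312_800, "vendor markup over OSS-equiv": 538_508}
TOTAL = sum(bill.values())

def saves(line: str, keep_fraction: float) -> float:
    """% of the total bill saved if `line` shrinks to keep_fraction of itself."""
    return bill[line] * (1 - keep_fraction) / TOTAL * 100

print(f"log retention 30d→7d: {saves('log-ingest+storage', 0.30):.0f}%")   # ~12%
# Assumption: ~65% of the metrics line is customer_id-driven series.
label_fix = saves('custom-metric overages', 0.0) + saves('metrics-series-storage', 0.35)
print(f"remove the customer_id label: {label_fix:.0f}%")                   # ~41%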
Going deeper
The Prometheus /api/v1/series endpoint at scale
The metrics_series() helper above hits /api/v1/series with a permissive matcher. On a fleet under 1M series this works in seconds; above 10M series it times out, returns truncated results, or OOMs the Prometheus head. Production deployments use one of three escape hatches: (1) the prometheus_tsdb_head_series gauge that every Prometheus exposes on its own /metrics (one number, no enumeration); (2) the cortex_ingester_memory_series metric that Mimir and other Cortex-lineage backends expose, broken down by tenant; (3) vendor billing APIs that expose series_count_by_metric and series_count_by_tag aggregates updated every 6 hours. For per-metric cardinality breakdowns at scale, vendor APIs win — Datadog's /api/v2/metrics/<metric> endpoint returns the top-N tag-value distributions without enumerating individual series. The cost-decomposition pipeline should hit the gauge for total series, the per-tenant metric for per-tenant breakdown, and the vendor API for per-metric cardinality offenders.
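Escape hatch (1) as a sketch: one instant query, no enumeration. The host is the hypothetical one from the main script.
# Total active series from the gauge Prometheus exposes about its own TSDB head.
import requests

PROM = "http://prom.yatrika.internal:9090"  # hypothetical, as in bill_decompose.py

def head_series() -> float:
    # sum() across instances; assumes at least one Prometheus is being scraped.
    r = requests.get(f"{PROM}/api/v1/query",
                     params={"query": "sum(prometheus_tsdb_head_series)"}, timeout=10)
    r.raise_for_status()
    return float(r.json()["data"]["result"][0]["value"][1])

print(f"{head_series():,.0f} active series, without enumerating any of them")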
Indexed attributes: the silent multiplier on the span line
Tempo, Jaeger, and most APM vendors charge a base rate per-million-spans plus a multiplier per indexed attribute beyond a small free set (typically service.name and name). Indexing an attribute means building a search index from attribute value to span — so that a query like service=checkout-api AND db.statement="SELECT * FROM customer..." returns in 200ms instead of 30s. Each additional indexed attribute roughly doubles per-span storage, because the search index is itself a span-sized data structure. The audit question for the span line: which attributes are indexed, and which queries use those indexes weekly? If db.statement is indexed but no dashboard or alert ever filters on it, drop the index — the attribute stays in the span (you can grep for it post-hoc), but the index does not. At Yatrika, dropping the db.statement and http.request.body indexes saved ₹1.4L/quarter on the span line with no operational impact, because the only person ever filtering on those attributes was a single engineer doing once-monthly forensic investigations who was happy to wait 30s instead of 200ms.
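Priced with the linear indexed_factor model from bill_decompose.py (an illustrative rate card, not a vendor quote), that change comes out close to the figure above:
# Savings from un-indexing two attributes, using the script's linear model.
RATE_PER_INDEXED_ATTR = 1.6

def indexed_factor(n_attrs: int) -> float:
    return 1 + max(0, n_attrs - 2) * (RATE_PER_INDEXED_ATTR - 1)  # 2 attrs are free

span_line_quarter = 418_924                  # ₹, from the sample run
f_before, f_after = indexed_factor(6), indexed_factor(4)          # 3.4 → 2.2
saving = span_line_quarter * (1 - f_after / f_before)
print(f"drop db.statement + http.request.body indexes: ₹{saving:,.0f}/quarter")  # ≈1.48L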
The "egress" line that nobody calls egress
Modern observability stacks expose an egress vector that does not get called egress on the invoice: the data-warehouse export pipeline. FinOps and analytics teams want metrics in Snowflake or BigQuery for executive dashboards (p99 latency by region by quarter joined against revenue figures). The export is a vendor-side feature in Datadog (Metric Export to S3) and Honeycomb (Refinery sampling export), and it bills separately — typically ₹9/GB egressed, sometimes with a flat 5% surcharge on the metrics line. At Yatrika, the data-eng team had set up an hourly export of every histogram metric into Snowflake, totalling 480GB/month at ₹9/GB = ₹4,320/month + a 5% surcharge on the metrics line ≈ ₹20K/month — small in absolute terms, but the engineer who set it up did not know it was on the observability bill, and the FinOps team thought it was on the data-warehouse bill. The fix: rate-limit the export to once-per-day (24× less volume), aggregate to 1-hour buckets before export (another 60×), and put the cost line in the FinOps cost-attribution dashboard with a clear owner.
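The export arithmetic as a sketch; the surcharge base is the monthly metrics line from the sample run, and both rates are the illustrative rate-card ones:
# Warehouse-export cost, before and after the two fixes (₹/month, illustrative).
gb_per_month = 480.0
egress = gb_per_month * 9.0                 # ₹4,320 at ₹9/GB
surcharge = 0.05 * (968_623 / 3)            # 5% of the monthly metrics line ≈ ₹16.1K
print(f"before: ₹{egress + surcharge:,.0f}")                    # ≈ ₹20.5K
after_gb = gb_per_month / 24 / 60           # daily exports, 1-hour buckets
print(f"after: ₹{after_gb * 9.0 + surcharge:,.0f}")             # surcharge dominates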
Reproduce this on your laptop
# 1. Spin up a Prometheus + OTel Collector pair locally.
#    prom.yml needs a scrape_config pointing at host.docker.internal:8000;
#    otelcol.yaml needs at least an otlp receiver (self-telemetry is on by default).
docker run -d --name prom -p 9090:9090 \
  -v "$PWD/prom.yml":/etc/prometheus/prometheus.yml prom/prometheus
docker run -d --name otelcol -p 8888:8888 -p 4317:4317 \
  -v "$PWD/otelcol.yaml":/etc/otelcol/config.yaml \
  otel/opentelemetry-collector-contrib --config=/etc/otelcol/config.yaml
# 2. Push some realistic series count using prometheus-client
python3 -m venv .venv && source .venv/bin/activate
pip install prometheus-client requests pandas opentelemetry-sdk
python3 -c "
from prometheus_client import Counter, start_http_server
import random, time
start_http_server(8000)
c = Counter('http_request_duration_seconds_total', 'reqs',
['service','route','status','customer_id'])
for i in range(20000):
c.labels('checkout','/pay','200',f'cust_{i}').inc()
print('exposed 20K series at :8000/metrics')
time.sleep(60)
"
# 3. Run the bill decomposition
python3 bill_decompose.py
# Expect: metrics-series-storage line dominated by the customer_id label,
# matching the offender table at the bottom.
Where this leads next
The next chapter, /wiki/cardinality-budgets-revisited, takes the offender table from this article — the (http_request_duration_seconds_bucket, customer_id, 14M) row — and turns it into a CI-enforceable budget that fails the deploy when an engineer adds a high-cardinality label. The chapter after that, /wiki/tiered-storage-for-metrics-logs-traces, takes the metric-series and log-ingest lines and decomposes them by retention tier, with the multi-tier downsampling primitives that bend the per-tier cost curve. /wiki/index-free-log-storage-clickhouse-parquet gives the architectural alternative for the log-ingest line — store logs in a column store, skip the index, query with TraceQL/LogQL on the read path. /wiki/vendor-vs-self-hosted-economics gives the framework for evaluating the markup line, including the headcount-and-on-call cost of running your own Mimir cluster.
Read those four chapters in order. Each picks up exactly one of the driver lines this article identified, and shows the engineering discipline that bends it. The cardinality-budget chapter is the cheapest engineering change with the largest savings. Tiered storage is the second-largest savings and moderate engineering. Index-free log storage is a fleet-wide architectural decision. Self-hosted economics is the conversation that closes the build, with ch.109 — the wall — describing what an organisation looks like once all four disciplines are in place.
The seventh thing this article does not give you, and that the rest of Part 16 will, is the FinOps dashboard layout: a Grafana dashboard with one panel per driver line, owners assigned per panel, and a quarterly delta that names the engineer or PR responsible for any line that grew more than 5%. That dashboard is the operating mechanism — the bill-decomposition script in this article runs nightly, the Grafana panels visualise the six lines over a 90-day window, and the alert that fires when metrics-series-storage grows more than 5% week-over-week pages the cost-owner of the metric whose cardinality changed.
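A sketch of that week-over-week check, assuming the nightly script appends one row per driver line to a CSV; the filename and schema here are hypothetical:
# Week-over-week growth per driver line from the nightly bill history.
import pandas as pd

hist = pd.read_csv("bill_history.csv", parse_dates=["date"])  # hypothetical file
latest_day, prior_day = hist["date"].max(), hist["date"].max() - pd.Timedelta(days=7)
latest = hist[hist["date"] == latest_day].set_index("line")["inr_per_quarter"]
prior = hist[hist["date"] == prior_day].set_index("line")["inr_per_quarter"]
growth_pct = (latest / prior - 1) * 100
for line, pct in growth_pct[growth_pct > 5].items():
    print(f"PAGE cost-owner of {line}: +{pct:.1f}% week-over-week")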
References
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) — Chapter 17, "The Cost of Observability", is the foundational decomposition argument; the six-line model in this article extends Majors' four-line decomposition with the overage and markup lines explicit.
- FinOps Foundation — Cloud cost as engineering discipline (2023) — the FinOps principles framework that motivates "decompose before you mitigate" as the universal pattern.
- Datadog billing FAQ and pricing breakdown — the canonical reference for how Datadog's specific invoice line items map onto driver lines (the crosswalk in this article uses Datadog as the worked example, but Grafana Cloud, New Relic, and Splunk follow the same six-line shape).
- Prometheus /api/v1/series and prometheus_tsdb_head_series documentation — the spec for the metrics-series enumeration this article's script depends on.
- Grafana Mimir multi-tenancy and per-tenant series count — the production-scale escape hatch when /api/v1/series is too slow.
- Brendan Gregg, "USENIX 2023 — Cloud Cost-Performance Tuning" — the cost-as-engineering-discipline manifesto cited in /wiki/wall-all-this-costs-a-fortune-tame-the-bill.
- /wiki/cardinality-budgets — internal: the budget primitive that turns the offender table into CI enforcement.
- /wiki/wall-cardinality-is-the-billing-death-spiral — internal: the deeper analysis of why a single label can dominate the bill.