Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Wall: all this costs a fortune — tame the bill
It is the second Tuesday of the quarter and Asha, the data-platform lead at a hypothetical Bengaluru lending startup called PaisaBridge, is forwarding a Datadog invoice to her CFO with the subject line "we need to talk". The bill has grown from ₹14 lakh last quarter to ₹41 lakh this quarter — a 193% jump — and Asha's team has shipped no new microservices, scraped no new endpoints, and added no new application dashboards. They did, six weeks ago, finish rolling out the Build 15 stack across the whole data platform: per-table data-quality SLOs, lineage-aware alerting, model and data drift detectors on every production model, batch-pipeline freshness gauges, and the five-signal stream-processor monitor on every Flink job. Each addition looked routine. Together they multiplied the metrics-active-series count from 4.2M to 38M, the log volume from 1.1 TB/day to 6.8 TB/day, and the trace ingest from 12K spans/sec to 47K spans/sec. This is the wall. Data and ML observability has a cost shape that is orthogonal to web request-response observability — its bill scales on table count, partition count, feature count, and watermark-cardinality, not on user QPS — and Build 15's tools are powerful precisely because they instrument every one of those dimensions.
Data and ML observability inflates the bill on axes that web observability does not touch: per-table freshness gauges, per-partition watermark skew, per-feature drift histograms, per-lineage-edge alert state. The cost is dominated by metric-series count multiplied by table count and partition count — not by QPS — and standard cost-control levers (head sampling traces, log retention tiers, cardinality limits on customer_id) miss the dominant term entirely. Build 15 ends here because Part 16 begins on the other side of this wall: the disciplines for tiered storage, downsampling, and per-pillar budgets that data-and-ML observability requires.
Why the data-pillar cost shape is orthogonal to RED
Web request-response observability bills you on three things and they all scale with user traffic. More requests per second means more counters incremented, more log lines emitted, more spans started. If you slow user growth or sample harder, the bill bends. The recurring cardinality death spiral is the wrinkle — one bad label multiplies series count by 7,000× — but its mitigation is also tractable: enforce a per-metric label budget in CI and the curve flattens.
Build 15's observability does not work that way. A freshness gauge on the orders table emits one sample every minute regardless of whether the table has 100 rows or 100 billion. A drift detector on a fraud-scoring feature emits the KS-distance statistic every 5 minutes regardless of how many predictions the model served that interval. A watermark-skew gauge on a Kafka partition emits one sample per minute whether the partition saw 10 records or 10 million. The cost is orthogonal to user traffic. It scales on the count of artefacts being observed: tables, partitions, features, lineage edges, models. Add 200 new tables to the warehouse and the freshness-gauge series count rises by 200 — even if no user ever queries those tables. Add 80 features to a recommender model and the drift-detector series count rises by 80 — even if the model is in shadow mode and serves zero production traffic.
Why this is structural, not vendor-specific: every observability backend, whether self-hosted Prometheus (in storage and memory) or Datadog, New Relic, Honeycomb, and Grafana Cloud (in invoice line items), charges on series count × samples per series × retention. A web counter is one series whether QPS is 10 or 10K. A per-table freshness gauge is one series per table — and the warehouse at PaisaBridge has 1,847 tables. The bill is not 1,847 × user-QPS; it is 1,847 × samples-per-minute × 30-day-retention, and every new table the data engineering team registers in dbt adds one more series to that product. The bill grows when the catalog grows, not when traffic grows. Most cost dashboards visualise QPS over time and miss the catalog-growth axis entirely.
A direct consequence: the cost levers Asha learned in Parts 5–7 do not apply here. Sampling traces does almost nothing — drift-detector traces are 0.1% of total trace volume, freshness probes are even less. Label-cardinality budgets do not help — the freshness-gauge series at PaisaBridge already has only 4 labels, and the cardinality is dominated by the table label whose values are the warehouse catalog itself. Log retention tiers help on log volume but not on the metrics or trace bill, which together are 71% of the Build 15 increment. The discipline that fits is the one Part 16 develops: per-pillar budgets that name the data-observability axes explicitly, tiered storage that downsamples old freshness gauges, and lineage-aware coalescing that emits one alert series per connected component of the lineage graph instead of one per edge.
A measurement: simulate Build 15 rollout on a real-shape data platform
The argument above is structural; what follows is the engineering question. How does the bill actually evolve quarter by quarter as a real data team layers Build 15 on top of an existing web observability stack? The script below simulates a PaisaBridge-shaped data platform — 1,847 tables, 320 production models, 280 Kafka partitions, 5,200 lineage edges — and walks through five rollout quarters, computing series count, samples per minute, log volume, and a vendor-equivalent invoice in rupees at each step. The simulation models the metric-series axis in detail; its log and trace increments are deliberately coarser than the full rollout described above.
# build15_cost_walk.py — simulate Build 15 rollout on a shaped data platform.
# pip install pandas
import pandas as pd
from dataclasses import dataclass

# --- platform shape (PaisaBridge-like) ---
N_TABLES = 1_847
N_PARTITIONS = 280
N_MODELS = 320
FEATURES_PER_M = 80
LINEAGE_EDGES = 5_200
N_BATCH_JOBS = 96        # informational; batch jobs surface through table freshness
N_STREAM_JOBS = 18
OPERATORS_PER_JOB = 85   # illustrative Flink topology size (drives the Q5 cliff)
PEAK_PARALLELISM = 16    # illustrative peak operator parallelism under auto-scaling
RETENTION_DAYS = 30      # informational; retention is folded into the pricing constants

# Vendor pricing (vendor-equivalent INR; magnitudes are illustrative and tuned
# so the walk lands near the invoice figures quoted in the text).
COST_PER_1K_SERIES_PER_MO = 25.0  # ₹25 per 1,000 active series per month
COST_PER_GB_LOG_INGEST = 8.0      # ₹8 per GB of log ingest per month
COST_PER_MILLION_SPANS = 4.0      # ₹4 per million ingested spans

@dataclass
class Surface:
    name: str
    series: int = 0
    samples_per_min: int = 0
    log_gb_per_day: float = 0.0
    spans_per_sec: int = 0

# --- existing web RED baseline (unchanged across the walk) ---
web = Surface("web RED", series=3_400_000, samples_per_min=3_400_000,
              log_gb_per_day=1_100, spans_per_sec=12_000)

# --- Build-15 surfaces, each shipped one quarter at a time ---
def freshness():  # one gauge per (table × partition-bucket)
    return Surface("table freshness",
                   series=N_TABLES * 4,              # 4 freshness buckets
                   samples_per_min=N_TABLES * 4,
                   log_gb_per_day=2)

def quality_slos():  # 6 SLI metrics × tables; multi-window burn rate
    return Surface("quality SLOs",
                   series=N_TABLES * 6 * 4,          # 4 burn-rate windows
                   samples_per_min=N_TABLES * 6 * 4,
                   log_gb_per_day=8)

def lineage_alerting():  # one alert state per lineage edge + suppression
    return Surface("lineage alerting",
                   series=LINEAGE_EDGES * 3,         # firing / suppressed / healthy
                   samples_per_min=LINEAGE_EDGES * 3,
                   log_gb_per_day=12)

def drift_detection():  # KS + PSI + null-rate per feature
    return Surface("drift detection",
                   series=N_MODELS * FEATURES_PER_M * 3,
                   samples_per_min=N_MODELS * FEATURES_PER_M * 3 // 5,  # 5-min cadence
                   log_gb_per_day=6,
                   spans_per_sec=400)                # drift-job traces

def stream_watermark():  # 5 signals per (job × operator × subtask × partition)
    series = N_STREAM_JOBS * OPERATORS_PER_JOB * PEAK_PARALLELISM * N_PARTITIONS * 5
    return Surface("stream watermark",
                   series=series,
                   samples_per_min=series,
                   log_gb_per_day=6,
                   spans_per_sec=2_400)

def total_bill(surfaces):
    series = sum(s.series for s in surfaces)
    log_gb_mo = sum(s.log_gb_per_day for s in surfaces) * 30
    spans_mo = sum(s.spans_per_sec for s in surfaces) * 86_400 * 30
    metrics_inr = (series / 1_000) * COST_PER_1K_SERIES_PER_MO * 3   # quarterly
    log_inr = log_gb_mo * COST_PER_GB_LOG_INGEST * 3                 # quarterly
    span_inr = (spans_mo / 1_000_000) * COST_PER_MILLION_SPANS * 3   # quarterly
    return series, log_gb_mo, spans_mo, metrics_inr + log_inr + span_inr

quarters = [
    ("Q1 baseline", [web]),
    ("Q2 + freshness + quality SLOs", [web, freshness(), quality_slos()]),
    ("Q3 + lineage-aware alerting", [web, freshness(), quality_slos(), lineage_alerting()]),
    ("Q4 + drift detection", [web, freshness(), quality_slos(), lineage_alerting(), drift_detection()]),
    ("Q5 + stream watermark monitor", [web, freshness(), quality_slos(), lineage_alerting(),
                                       drift_detection(), stream_watermark()]),
]

rows = []
for label, surfs in quarters:
    s, lg, sp, inr = total_bill(surfs)
    rows.append({"quarter": label, "series": s, "log_GB_mo": int(lg),
                 "spans_mo": sp, "bill_INR_qtr": int(inr)})
df = pd.DataFrame(rows)
print(df.to_string(index=False))
Sample run (illustrative figures):
quarter series log_GB_mo spans_mo bill_INR_qtr
Q1 baseline 3,400,000 33,000 31,104,000,000 14,02,200
Q2 + freshness + qSLOs 3,463,652 33,300 31,104,000,000 14,46,580
Q3 + lineage-aware alerting 3,479,252 33,660 31,104,000,000 14,69,400
Q4 + drift detection 3,556,052 33,840 32,140,800,000 15,11,940
Q5 + stream watermark 37,898,052 34,020 38,361,600,000 40,93,140
Read the output. Q1 → Q2 adds two surfaces (freshness, quality SLOs); the series count rises by 64K (1.9%), the bill rises by 3.2%. Cheap. Q2 → Q3 adds lineage-aware alerting; another 16K series, another 1.6% bill increase. Still cheap. Q3 → Q4 adds drift detection — KS distance, PSI, and null-rate per feature across 320 models with 80 features each — that is 320 × 80 × 3 = 76,800 new series and 400 new spans per second from the drift-evaluation jobs. Series rises 2.2%, bill rises 2.9%. Still feels manageable.
Then Q4 → Q5 ships the stream-processor monitor across 18 Flink jobs and 280 partitions: five signals per (job × partition) is 18 × 280 × 5 = 25,200 base series, and the per-partition watermark skew, lag, and liveness gauges across the partitioned topology pull in another 34M series because the watermark gauge is emitted per (job, operator, partition) triple and the operator parallelism multiplies the per-partition count again. Series count goes 3.55M → 37.9M. The bill goes from ₹15.1 lakh to ₹40.9 lakh — a 171% jump — and the only thing the team shipped that quarter was "stream observability across the platform". Asha's CFO calls. The conversation is unpleasant.
Why the stream watermark is the cliff: every other Build 15 surface scales linearly in either tables, models, or lineage edges. The stream watermark surface scales in the product streams × partitions × operators × signals, and the operator parallelism in Flink is itself elastic — auto-scaling can take it from 4 to 16 in a load spike, multiplying the active-series count 4× during the peak when the bill is calculated. The stream observability cliff is the data-pillar analogue of the cardinality death spiral: a quietly compounding cross-product where each individual factor seemed harmless. Mitigation lives in Part 16 — per-partition aggregation rules that emit only the minimum watermark across the partitions of one job, downsampled to one sample per minute, with the per-partition detail recorded only when the SLO burns.
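The cliff-versus-fix arithmetic, reusing the same illustrative operator and parallelism counts as the simulation above:
# Raw per-(job, operator, subtask, partition, signal) series vs one per-job aggregate.
raw = 18 * 85 * 16 * 280 * 5   # jobs × operators × peak parallelism × partitions × signals
fixed = 18 * 5                 # one worst-case (minimum-watermark) series per job per signal
print(f"{raw:,} raw series -> {fixed} aggregated series ({raw // fixed:,}x fewer)")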
The dataclass shape mirrors the batch monitor and stream monitor: every surface is a function from the platform shape to a Surface(series, samples, log_gb, spans) record, every quarter is a list of surfaces, and the bill is a pure function over the cumulative series and volumes. Production deployments replace the constants with live counts from the dbt manifest (dbt ls --output json | wc -l for the resource count), the ML feature registry (Feast, Tecton, or in-house), and the Flink JobManager REST API (GET /jobs for running stream jobs; GET /jobs/<id> lists each job's vertices with their parallelism). The cost arithmetic is unchanged.
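For the Flink leg of that substitution, a sketch along these lines reads the counts from the standard JobManager REST endpoints (the JobManager address is a placeholder):
# Pull live stream-job, operator, and parallelism counts from the Flink REST API.
import requests

FLINK_URL = "http://flink-jobmanager:8081"  # hypothetical JobManager address

def stream_shape():
    jobs = requests.get(f"{FLINK_URL}/jobs", timeout=10).json()["jobs"]
    running = [j["id"] for j in jobs if j["status"] == "RUNNING"]
    operators, peak_parallelism = 0, 0
    for job_id in running:
        detail = requests.get(f"{FLINK_URL}/jobs/{job_id}", timeout=10).json()
        vertices = detail["vertices"]            # one entry per operator chain
        operators += len(vertices)
        peak_parallelism = max(peak_parallelism,
                               max(v["parallelism"] for v in vertices))
    return len(running), operators, peak_parallelism

# n_jobs, n_operators, peak = stream_shape()
# Feed these into N_STREAM_JOBS, OPERATORS_PER_JOB (≈ n_operators / n_jobs),
# and PEAK_PARALLELISM in build15_cost_walk.py.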
What Part 16 actually changes — three disciplines that fit data observability
Asha's bill does not respond to the levers her web SRE peers reach for. Sampling spans from drift-evaluation jobs at 1% saves ₹4,800 per quarter against a ₹26 lakh increment — a rounding error. Lowering log retention from 30 days to 7 days saves ₹1.8 lakh — meaningful, but the metric series, which dominate, are unchanged. The discipline that fits has three legs, and Part 16 develops each one in detail; this section names them so the wall has an exit.
The first is per-artefact-class budgets, the data-pillar analogue of the cardinality budget from /wiki/cardinality-budgets. Web RED uses a "series-per-service" budget (e.g. 5,000 series/service). Data observability needs the multi-axis version: 4 series per (table × partition-bucket × pillar) for freshness, 24 series per table for quality SLOs across burn-rate windows, 3 series per lineage edge, 3 series per (model × feature) for drift, 5 series per (stream-job × operator × partition) for watermark. Each axis becomes a budget line in CI, and adding a new table or partition that pushes the budget over its limit fails the deploy the same way a label-cardinality budget would. Without this, the catalog grows silently and the bill grows with it.
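A minimal sketch of what that CI gate could look like, assuming the live counts are pulled from the catalog and registries named above; the class names and ceilings here are hypothetical:
# Hypothetical per-artefact-class budget gate for CI: live counts in, violations out.
BUDGETS = {
    # artefact class: (series emitted per artefact, max artefacts budgeted)
    "table_freshness":         (4,  2_000),
    "table_quality_slo":       (24, 2_000),
    "lineage_edge":            (3,  6_000),
    "model_feature_drift":     (3,  30_000),
    "stream_job_op_partition": (5,  8_000),
}

def check_budgets(live_counts: dict) -> list:
    """Return violations; an empty list means the deploy may proceed."""
    violations = []
    for cls, count in live_counts.items():
        per_artefact, max_artefacts = BUDGETS[cls]
        if count > max_artefacts:
            violations.append(
                f"{cls}: {count} artefacts x {per_artefact} series/artefact "
                f"exceeds the budgeted {max_artefacts} artefacts")
    return violations

# Example: the PR that registers 200 new dbt models trips both table budgets.
live = {"table_freshness": 1_847 + 200, "table_quality_slo": 1_847 + 200,
        "lineage_edge": 5_200, "model_feature_drift": 320 * 80,
        "stream_job_op_partition": 18 * 280}
for v in check_budgets(live):
    print("BUDGET VIOLATION:", v)   # a real gate would exit non-zero here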
The second is tiered downsampling indexed by pillar age and artefact class, building on downsampling for long retention. The freshness gauge for a table from 90 days ago does not need 1-minute resolution — a 1-hour rollup is sufficient, and a 1-day rollup after 1 year is sufficient. The drift-detector KS-distance histogram from 60 days ago likewise needs only daily granularity for trend-line analysis, not 5-minute granularity for live alerting. The watermark-skew samples from a Flink job that has since been redeployed twice can be dropped to one-per-hour after 7 days. The downsampling cadence is per artefact class, not global, and the savings on a 30-day → multi-tier retention plan typically run 60–80% of the metric storage cost.
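The savings claim is easy to sanity-check; the sketch below compares samples stored per series under a flat 30-day, 1-minute plan against one illustrative two-tier plan (raw for 7 days, hourly rollups thereafter):
# Samples stored per series: flat 30-day 1-minute retention vs a two-tier plan.
MIN_PER_DAY = 24 * 60

def samples(resolution_minutes, days):
    return days * MIN_PER_DAY // resolution_minutes

flat = samples(1, 30)                          # 43,200 samples per series
tiered = samples(1, 7) + samples(60, 30 - 7)   # 10,080 raw + 552 hourly rollups
print(f"flat: {flat:,}   tiered: {tiered:,}   saving: {1 - tiered / flat:.0%}")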
The third is lineage-aware metric coalescing. Today's lineage-alerting implementations emit one alert-state series per lineage edge — at PaisaBridge that is 5,200 series for a graph that has only 47 connected components. The information content is the per-component state, not the per-edge state. A coalescing rule that emits one series per connected component of the cascade-suppression graph reduces the lineage-alerting series count by ~110×, and the per-edge detail is preserved in the alert payload (one Slack message names the component and lists the affected edges) rather than in the metric store. The same logic applies to drift detection — group features by feature-family and emit the family-level KS distance, with per-feature detail attached as exemplars rather than as separate series.
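The coalescing arithmetic is a connected-components pass over the lineage graph; a toy sketch follows (the edge list here is hypothetical; the real one comes from the lineage store):
# Collapse per-edge alert-state series into per-component series.
from collections import defaultdict, deque

def connected_components(edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            if node in comp:
                continue
            comp.add(node)
            queue.extend(adj[node] - comp)
        components.append(comp)
        seen |= comp
    return components

edges = [("raw.orders", "stg.orders"), ("stg.orders", "mart.repayments"),
         ("raw.bureau", "stg.bureau")]                   # toy lineage graph
n_components = len(connected_components(edges))
print(len(edges) * 3, "per-edge series ->", n_components * 3, "per-component series")
# At PaisaBridge scale: 5,200 edges × 3 states = 15,600 series vs 47 × 3 = 141.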
Why these three and not, say, "switch to a cheaper vendor": vendor pricing varies within ~2× across the major TSDBs (Datadog, New Relic, Grafana Cloud, Honeycomb) for the same series-count and ingest volume — the negotiation lever is meaningful but bounded. The disciplines above bend the curve by ~5×, because they target the driver (how many series exist, at what resolution, for how long) rather than the price (cost per series). A 5× engineering improvement on cost-drivers compounds with a 1.5× vendor-side discount; a vendor switch alone leaves the catalog growing at the same rate it was, and the bill catches back up within two quarters. Engineering on drivers is durable; pricing wins are temporary.
The order in which Asha applies them matters. Lineage-aware coalescing is the cheapest engineering change (a Prometheus recording rule + a payload-formatter), saves the most series (110× on the alerting surface), and changes nothing the SREs reading the alerts can perceive. She should ship it first. Tiered downsampling is moderate engineering effort (Mimir or Thanos query frontend rules, or Datadog's archival tiers) and saves the bulk of the storage cost. Per-artefact-class budgets are the most political — they require negotiating ownership with data-engineering team-leads — and should ship last, once the cost curve has already bent and the conversation is "let's keep it bent" rather than "the bill is on fire".
Common confusions
- "Sampling will fix the data-observability bill the same way it fixed the trace bill." Sampling reduces per-event cost; the data-observability bill is dominated by per-artefact metric series, where there is no "event" to sample — the freshness gauge fires whether the table got 0 rows or 10 billion. Sampling drift-detection traces saves rounding-error money. The dominant lever is downsampling old metrics, not sampling new events.
- "A label-cardinality budget on
customer_idwill keep our data-observability bill in check." Customer-cardinality is a web-RED problem; data observability's cardinality is dominated bytable,partition,feature,lineage_edge— labels whose values are the catalog itself. Cappingcustomer_idto 1,000 distinct values does nothing about the 1,847-table freshness-gauge surface. Different axis, different budget. - "Lineage-aware alerting is free because it doesn't add new metrics." Each lineage edge gets its own alert-state series (firing/suppressed/healthy), which at PaisaBridge is 5,200 × 3 = 15,600 series. "Free" was the per-edge incremental alert-rule write; the runtime cost is in the per-edge state metric the suppression engine reads. Coalesce by connected component and the cost drops 110×.
- "Drift detection is cheap because the model serves only K predictions per minute." Drift cost scales on
models × features × statistic-types × samples-per-min, not on prediction volume. A model in shadow mode that serves zero traffic still emits 80 features × 3 statistics × 12 samples/hour = 2,880 samples/hour. Whether the model is hot or cold does not change the bill. - "We can drop log retention to 7 days and the bill will fall by 76%." Logs are typically 25–35% of a data-observability bill; the metric series and trace ingest are 60–70%. Dropping log retention saves 23% of total cost at PaisaBridge, not 76%. The other 77% of the bill needs the per-pillar disciplines from Part 16.
- "Stream observability is the worst offender, so cap stream metrics first." Stream observability is the worst offender at PaisaBridge specifically because they shipped the per-partition watermark gauge without aggregation rules. The fix is the recording rule (
min(watermark_skew_seconds) by (job)instead of raw per-partition series), not capping the metric. Cap before fixing the aggregation and you lose the ability to find the silent partition during an incident — exactly the failure mode /wiki/observability-for-stream-processors said you must catch.
Going deeper
The dbt manifest as the inventory of truth
Most data teams already have a system of record for what tables, models, and tests exist: the dbt manifest.json (or its Dataform / Coalesce equivalent). The manifest is the right source for per-artefact-class budgets and for the catalog-growth dashboard that visualises the cost-driver axis. A 30-line Python loop reads target/manifest.json, counts nodes by type (model, seed, snapshot, source), joins against the metrics-backend's /api/v1/series to find the freshness-gauge series count, and emits a per-quarter delta — +247 models, +6.9K new per-table series (freshness plus quality SLOs), projected +₹1.8 lakh on next quarter's bill. The dashboard panel is the FinOps conversation: every PR that registers a new dbt model has a series-count delta in its diff before the cost arrives in an invoice.
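A sketch of that loop, assuming a locally built target/manifest.json and a Prometheus-compatible series endpoint (the backend URL and the freshness metric name are hypothetical):
# Count dbt catalog artefacts and compare with the live freshness-series count.
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

PROM_URL = "http://prometheus:9090"                # hypothetical metrics backend
FRESHNESS_METRIC = "table_freshness_age_seconds"   # hypothetical metric name

with open("target/manifest.json") as f:
    manifest = json.load(f)

node_counts = Counter(n["resource_type"] for n in manifest["nodes"].values())
n_sources = len(manifest.get("sources", {}))
print("catalog:", dict(node_counts), "sources:", n_sources)

qs = urlencode({"match[]": FRESHNESS_METRIC})
with urlopen(f"{PROM_URL}/api/v1/series?{qs}") as resp:
    live_series = json.load(resp)["data"]
print("freshness series live:", len(live_series),
      "vs expected:", (node_counts["model"] + n_sources) * 4)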
Tail-sampling drift-evaluation traces specifically
Drift-detection jobs at PaisaBridge run KS-distance computations over reference vs production feature distributions every 5 minutes. The traces from these jobs are a small fraction of total trace volume (400 spans/sec out of 47K) but each is high-value when an alarm fires — they show whether the KS computation timed out, whether the reference window was empty, whether the feature schema drifted. A tail-based sampler using the OTel Collector's tail_sampling processor with three policies (keep 100% of errored traces, keep 100% of traces slower than a fixed latency threshold set near the drift job's p95, and probabilistically keep 5% of the rest) preserves the diagnostic value while dropping 95% of the routine drift-job traces. The savings are small (₹6,000/qtr) but the discipline of treating data-job traces as a tail-sampling target rather than a head-sampling target is the right pattern.
Pre-aggregation in the OTel Collector vs in Prometheus recording rules
The max(watermark_skew_seconds) by (job) reduction (one worst-case skew per job) can run in two places: as a recording rule evaluated on a 30-second cadence by the Prometheus ruler (or the Thanos/Mimir/Cortex ruler), emitting the reduced series alongside the raw ones, or as a pipeline component in the OpenTelemetry Collector before the metrics are ever ingested. The recording-rule path keeps the per-partition raw series and emits the reduction in addition; the OTel-Collector path drops the per-partition detail and only the reduction reaches the backend. The recording-rule path costs more (you store both); the OTel-Collector path is cheaper (you store only the reduction) but loses the partition detail forever. Pick by retention class: keep raw at the 7-day tier, store only the reduction at the 30-day-and-beyond tiers.
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pandas
python3 build15_cost_walk.py
# Expected: a 5-row DataFrame showing series count, log volume, and bill
# growing across quarters. Q4 → Q5 is the cliff (~3.5M series → ~38M series).
# Modify the simulation: lower N_PARTITIONS to 60 and the watermark surface
# (and the cliff) shrinks roughly in proportion; or set the stream_watermark
# series count to N_STREAM_JOBS * 5 (one worst-case value per job per signal)
# to see the per-job aggregation fix from the stream-cliff discussion applied
# directly.
Where this leads next
The tools Build 15 introduced — data-quality SLOs, lineage-aware alerting, drift detection, batch freshness, stream watermark monitoring — are the ones that catch the failure modes the rest of the curriculum has been describing. Removing them is not the answer. Naming their cost shape and engineering the disciplines that bend the curve is. That is the entirety of Part 16.
The connection back to earlier parts of the curriculum is direct. The cardinality-budget discipline from /wiki/cardinality-budgets generalises to per-pillar artefact budgets. The downsampling primitives from /wiki/downsampling-for-long-retention become the multi-tier retention discipline of Part 16. The recording-rule pattern from /wiki/why-high-cardinality-labels-break-tsdbs is the lineage-aware coalescing technique transposed to a different axis. Part 16 is not a new toolkit — it is the systematic application of techniques the curriculum has already taught, applied to the data-pillar's specific cost shape.
By the time Asha ships per-artefact budgets in dbt, tiered downsampling on the freshness and drift surfaces, and lineage-aware coalescing on the alerting graph, PaisaBridge's Q6 invoice arrives at ₹17.8 lakh — down 56% from Q5, with no loss in diagnostic capability for the seven-day window that matters most. The CFO email subject line that quarter is "thank you", which is a thing CFOs almost never write. The IPL final lands in week three of Q6 with a watermark-skew incident on the fraud-scoring stream; the per-component lineage alert pages once, the seven-day-tier raw-resolution metrics show the silent partition within 90 seconds, and Asha is back asleep before the third over of the chase.
References
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) — Chapter 17, "The Cost of Observability", frames per-pillar cost decomposition and informed the QPS-vs-catalog framing.
- Brendan Gregg, "Cloud Computing Cost-Performance Tuning" (USENIX 2023) — the cost-as-engineering-discipline argument that motivates per-artefact budgets.
- Prometheus design — staleness and downsampling — the spec for the recording-rule downsampling pattern Part 16 builds on.
- Grafana Mimir documentation — query-time downsampling — the multi-tier retention machinery production fleets use.
- OpenTelemetry — Tail Sampling Processor — the OTel Collector pipeline component for tail-sampling drift-evaluation traces.
- dbt Labs — manifest.json reference — the source-of-truth artefact catalog the per-pillar budget discipline reads from.
- /wiki/wall-cardinality-is-the-billing-death-spiral — internal: the web-RED billing wall, structurally similar but on a different axis.
- /wiki/observability-for-stream-processors — internal: the surface whose un-aggregated rollout causes the Q5 cliff.