Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Wall: observability for data and ML is different

It is 11:42 IST on a Wednesday on a hypothetical Razorpay risk-platform team, and Aditi, the on-call SRE, is staring at a dashboard that says everything is fine. The fraud-scoring service risk-score-v2 shows a p99 latency of 142ms (well under the 200ms SLO) and an error rate of 0.04%; the burn-rate gauge is a flat blue line; the OpenTelemetry trace for a sample request shows a clean six-span tree across the API gateway, the feature lookup, the model server, and the response. Every alert in PagerDuty is green. Every dashboard tile is green. The OTLP collector is shipping 240k spans/sec without a hiccup. By every signal Build 14 taught her to trust, the system is healthy. And yet, since the overnight feature pipeline ran at 02:00 IST, the model has approved nine and a half thousand card-not-present transactions it should have declined, and ₹4.2 crore of fraud has been waved through. The only reason Aditi knows is that the chargeback team has been in her DMs since 11:30. Her tracing tells her the model server returned 200 OK on every request. It does not — cannot — tell her the answers were wrong.

This is the wall that ends Build 14 and starts Build 15. Every observability primitive you just learnt — spans, traces, metrics, logs, p99s, burn rates, error budgets — was designed for synchronous request-response systems where a "wrong answer" looks identical to a "right answer" on the wire. Data pipelines, feature stores, model servers, training jobs, and embedding indexers do not break the way HTTP services break, so they cannot be observed the way HTTP services are observed. The wall is not that your existing observability stack is bad; it is that it is solving the wrong problem.

Traditional observability watches the transport — did the request arrive, did the response come back, was it fast enough, did it 5xx. Data and ML systems fail in the payload — the numbers were stale, the schema drifted, the feature distribution shifted, the model's predictions decoupled from reality. A green burn-rate gauge over a pipeline that shipped NaNs is the canonical Build-15 photograph. The next ten chapters are about the four new signals you need — freshness, completeness, distribution, and prediction quality — and the new alarms, dashboards, and on-call patterns built around them.

What "wrong" looks like for data and ML, and why your traces miss it

A web service has one failure mode that matters operationally: it returns a wrong status code, a wrong shape, or returns it too slowly. All three are visible at the boundary — the HTTP layer your auto-instrumentation already wraps. A 5xx is a 5xx. A 14-second checkout is a 14-second checkout. Your tracing sees it; your metrics count it; your alerts fire on it.

A data pipeline has at least four failure modes that the boundary cannot see, because they are properties of the payload, not of the call.

Freshness failure. The pipeline ran. It exported its OpenTelemetry spans. The Airflow DAG is green. But the source data was 6 hours stale because the upstream Kafka consumer fell behind during the 21:00 IST IPL toss spike, so the "yesterday's transactions" table the fraud model trains on is missing the last six hours of last night. The pipeline succeeded. The numbers are wrong. (See /wiki/wall-batch-metrics-arent-fresh-enough for the data-engineering side of this wall.)

Completeness failure. The pipeline ran. It exported clean spans. It wrote 84,12,406 rows to the warehouse. Yesterday it wrote 1,02,40,883 rows. Nobody declared an SLO on row count, so nothing alerted. The 18% drop is because a producer service silently switched on a feature flag that started dropping events with currency != "INR", but the producer service's own SLO was on its HTTP error rate (still 0.01%), not on its event-emit rate.

Distribution failure. The pipeline ran. It wrote the right number of rows. The schema validates. Every row's amount_paise field is a non-null integer between 0 and 10^9. But the distribution of amount_paise shifted from a median of ₹420 to a median of ₹4 because a partner gateway started reporting amounts in rupees instead of paise after a config push at 19:00 IST. The fraud model was trained on the paise distribution; the rupee inputs sit far below every threshold the model learnt; every "₹4" transaction is auto-approved. There is no NaN, no schema break, no exception.

Prediction-quality failure. The pipeline ran. The features look fine. The model is up. Inference latency is 14ms p99. The model is returning predictions. The predictions are wrong — not because the model code broke, but because the world it learnt from no longer exists. A card-issuing-bank changed its fraud signature, so the patterns the model learnt last quarter no longer correlate with fraud, and the model's accuracy collapsed from 94% to 71% over six hours without a single line of code changing. Your tracing sees the inference call. It does not see the truth label, because the truth label takes 30 days to arrive (when the chargeback fires).

Figure: Four failure modes traditional observability cannot see — the boundary lies. Top row: a traditional HTTP service (checkout-api) whose boundary signals (status 200 OK, p99 142ms, error rate 0.04%) are green and whose truth is healthy. Bottom row: the data + ML pipeline (overnight features → risk-score-v2) with the same green boundary signals but four payload failures underneath — source data 6h stale, row count down 18%, amounts drifted from paise to rupees, accuracy down from 94% to 71% — and ₹4.2 crore of fraud approved.
Illustrative — the two systems return identical boundary signals (status, latency, error rate). Underneath, one is genuinely healthy and one is shipping wrong answers at full speed. Build 14's primitives all live at the boundary; Build 15 is about the four payload signals — freshness, completeness, distribution, prediction quality — that nobody at the request layer can see.

Why these four failures are invisible to the OTLP-shaped observability you just built: spans, metrics, and logs are emitted at call sites in your code. Freshness is a property of the data the call site read — the call site has no opinion on whether transactions_yesterday.parquet was last refreshed at 02:00 or at 21:00. Completeness is a property of aggregate row counts over time — a single span sees one row, not the population. Distribution is a statistical property of millions of values — no individual span carries it. And prediction quality is a temporal join with truth labels that arrive days later — at inference time, the truth literally does not exist yet. The four signals all live in dimensions OpenTelemetry's data model does not have a primitive for, which is exactly why Build 15 needs new tools.

Why request-shaped SLOs do not translate

A working SLO definition for checkout-api looks like "99.9% of POST /checkout requests return 2xx within 200ms over a rolling 30-day window". Three things make this work: (1) every request is independently observable; (2) "success" is a function of the request alone; (3) the time-to-detect is bounded by the request rate (at 800 req/sec, you accumulate enough samples for a 1-hour burn-rate window in seconds). Try writing the same SLO for "the overnight risk feature pipeline" and every part of it falls apart.

There is one event per day, not 800/sec. A 99.9% SLO over 30 days allows 0.03 failures — fewer than one. You cannot do statistics on that. The "request" runs for 90 minutes; "p99 latency" of a once-a-day job is just "how long it took yesterday", not a quantile. "Success" is not a function of the run alone — a run that completes successfully but ships stale data is worse than a run that fails loudly, because the loud failure pages someone. And the time-to-detect is bounded by the next downstream consumer's freshness expectation — if the fraud team only opens the dashboard at 09:30, your six-hour-stale data has already done six hours of damage by the time anyone notices.

The mental shift is from request SLO to data contract. A data contract on transactions_yesterday says: "this table is refreshed by 04:00 IST every day; row count is within ±5% of a 7-day moving average; the amount_paise column has zero nulls and a median between ₹100 and ₹2000; the event_time column has a max within 30 minutes of the previous midnight". Every clause of that contract is a check you can run after the pipeline says it succeeded. The contract is the SLO; the contract violation is the alert; the contract is what your stakeholders actually care about. The Airflow DAG being green is necessary but not sufficient.
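Written down as code, the contract is just a handful of checks that run after the load completes. Here is a minimal sketch, assuming a hypothetical check runner — the table, column names, and thresholds are the illustrative ones from the contract above, and the function signature is invented for this example:

# contract_transactions_yesterday.py — the contract above expressed as runnable checks.
# Table, columns, and thresholds are the chapter's hypothetical values.
from datetime import datetime, timedelta

def check_transactions_yesterday(refreshed_at: datetime, row_count: int,
                                 row_count_7d_avg: float, null_amount_count: int,
                                 median_amount_paise: float, max_event_time: datetime,
                                 previous_midnight: datetime) -> list[str]:
    violations = []
    if (refreshed_at.hour, refreshed_at.minute) > (4, 0):              # refreshed by 04:00 IST
        violations.append("refresh: completed after 04:00 IST")
    if abs(row_count - row_count_7d_avg) / row_count_7d_avg > 0.05:    # within ±5% of 7-day average
        violations.append("row count: outside ±5% of 7-day moving average")
    if null_amount_count > 0:                                          # zero nulls in amount_paise
        violations.append("amount_paise: null values present")
    if not (10_000 <= median_amount_paise <= 2_00_000):                # median between ₹100 and ₹2000
        violations.append("amount_paise: median outside ₹100–₹2000")
    if previous_midnight - max_event_time > timedelta(minutes=30):     # fresh to the midnight boundary
        violations.append("event_time: max more than 30min before previous midnight")
    return violations

Each clause is independently checkable after the DAG reports success; the demo later in this chapter wires checks of exactly this shape into OpenTelemetry span attributes.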

The same translation applies on the ML side. A model SLO is not "p99 inference latency under 50ms" (though that matters for serving) — it is "AUC on the rolling 7-day shadow-evaluated holdout stays above 0.86", or "the feature-distribution PSI between training and serving stays below 0.2", or "the prediction-distribution KL-divergence from yesterday's predictions stays below a threshold". These are statistical SLOs over batches of data, not threshold checks over individual requests. They require a different alert evaluator (one that takes a population, not a single sample), a different burn-rate concept (you cannot burn an error budget when there is only one batch a day to fail), and a different on-call playbook (the responder has to look at distributions and lineage, not at a flame graph).

What the new signals actually measure — running code

Here is the smallest end-to-end demonstration of why Build 14's signals miss what Build 15 needs to catch. The script generates a feature pipeline run that completes "successfully" by every traditional metric, then exposes four payload-level checks that catch the silent failure. Run it and you can see the gap for yourself on your own laptop.

# data_obs_demo.py — show why request-level observability misses the four
# payload failures, and what catching them looks like in practice.
# pip install opentelemetry-sdk opentelemetry-api numpy
import time, random
from datetime import datetime, timedelta
import numpy as np
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (BatchSpanProcessor, ConsoleSpanExporter)
from opentelemetry.sdk.resources import Resource

provider = TracerProvider(resource=Resource.create({
    "service.name": "razorpay-feature-pipeline",
    "service.version": "2.4.1",
}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("razorpay.features.daily")

# A "yesterday's data" simulator. Two scenarios — healthy and silently-broken.
def fetch_yesterday_transactions(scenario: str) -> tuple[list[dict], datetime, np.ndarray]:
    rng = np.random.default_rng(42)
    n = 84_12_406 if scenario == "broken" else 1_02_40_883  # -18% in broken
    # Broken scenario: source consumer fell 6h behind, so max event_time is
    # six hours older than expected, AND a partner started reporting in INR
    # instead of paise (median amount drops 100x).
    if scenario == "broken":
        max_event_time = datetime(2026, 4, 24, 18, 0)  # 6h stale
        amounts_paise = rng.lognormal(mean=4.0, sigma=1.0, size=n).astype(int)  # ~₹0.5 median
    else:
        max_event_time = datetime(2026, 4, 24, 23, 59)  # fresh to yesterday's midnight
        amounts_paise = rng.lognormal(mean=10.0, sigma=1.5, size=n).astype(int)  # ~₹220 median
    rows = [{"amount_paise": int(a), "event_time": max_event_time} for a in amounts_paise[:1000]]
    return rows, max_event_time, amounts_paise

# Build 14 view: just trace the pipeline, count rows, return success.
def run_pipeline_build14(scenario: str) -> str:
    with tracer.start_as_current_span("features.daily.run") as span:
        rows, max_t, amounts = fetch_yesterday_transactions(scenario)
        span.set_attribute("rows.processed", len(amounts))
        span.set_attribute("pipeline.duration_s", 92)  # both runs take 92s
        span.set_attribute("pipeline.status", "ok")
        return "ok"  # the DAG is green either way

# Build 15 view: same pipeline + four payload-level data contracts.
def run_pipeline_build15(scenario: str, history: dict) -> tuple[str, list[str]]:
    violations = []
    with tracer.start_as_current_span("features.daily.run") as span:
        rows, max_t, amounts = fetch_yesterday_transactions(scenario)
        # 1. Freshness: max event_time within 30min of previous midnight.
        expected_freshness = datetime(2026, 4, 24, 23, 59)
        lag_h = (expected_freshness - max_t).total_seconds() / 3600
        span.set_attribute("data.freshness_lag_hours", round(lag_h, 2))
        if lag_h > 0.5:
            violations.append(f"freshness: lag {lag_h:.1f}h > 0.5h")
        # 2. Completeness: row count within ±5% of 7-day rolling average.
        baseline = history["row_count_7d_avg"]
        deviation = (len(amounts) - baseline) / baseline
        span.set_attribute("data.row_count", len(amounts))
        span.set_attribute("data.row_count_deviation_pct", round(deviation * 100, 2))
        if abs(deviation) > 0.05:
            violations.append(f"completeness: row count off by {deviation*100:+.1f}%")
        # 3. Distribution: median amount within range learnt from 30-day history.
        median_paise = int(np.median(amounts))
        span.set_attribute("data.median_amount_paise", median_paise)
        if not (history["median_paise_p05"] <= median_paise <= history["median_paise_p95"]):
            violations.append(
                f"distribution: median ₹{median_paise/100:.2f} outside "
                f"[₹{history['median_paise_p05']/100:.2f}, ₹{history['median_paise_p95']/100:.2f}]")
        # 4. Prediction-quality proxy: PSI between today's and yesterday's amount distribution.
        bins = np.logspace(0, 8, 11)
        today_dist = np.histogram(amounts, bins=bins)[0] / len(amounts) + 1e-9
        psi = np.sum((today_dist - history["yesterday_dist"]) *
                     np.log(today_dist / history["yesterday_dist"]))
        span.set_attribute("data.amount_psi_vs_yesterday", round(float(psi), 3))
        if psi > 0.2:
            violations.append(f"drift: amount PSI {psi:.2f} > 0.2 (training/serving skew)")
        if violations:
            span.set_attribute("data_contract.status", "violated")
            return "violated", violations
        span.set_attribute("data_contract.status", "ok")
        return "ok", []

# A 30-day history baseline for the contracts to compare against.
rng = np.random.default_rng(7)
yesterday_amounts = rng.lognormal(mean=10.0, sigma=1.5, size=1_00_00_000).astype(int)
bins = np.logspace(0, 8, 11)
history = {
    "row_count_7d_avg": 1_00_00_000,
    "median_paise_p05": 18_000, "median_paise_p95": 25_000,
    "yesterday_dist": np.histogram(yesterday_amounts, bins=bins)[0] / len(yesterday_amounts) + 1e-9,
}

for scenario in ["healthy", "broken"]:
    print(f"\n=== scenario: {scenario} ===")
    print(f"Build 14 says: {run_pipeline_build14(scenario)}")
    status, vs = run_pipeline_build15(scenario, history)
    print(f"Build 15 says: {status}")
    for v in vs: print(f"  - {v}")
provider.force_flush()
Sample run:
=== scenario: healthy ===
Build 14 says: ok
Build 15 says: ok
=== scenario: broken ===
Build 14 says: ok
Build 15 says: violated
  - freshness: lag 6.0h > 0.5h
  - completeness: row count off by -15.9%
  - distribution: median ₹0.55 outside [₹180.00, ₹250.00]
  - drift: amount PSI 4.71 > 0.2 (training/serving skew)

Read the broken-scenario output. The Build 14 pipeline says ok — its DAG ran, its spans exported, its row count made it into a metric. It would not page anyone. The Build 15 pipeline runs the same spans and the same OTLP exporter, then on top adds four payload contracts. All four fire on the broken scenario; together they pin the failure mode precisely — the source is six hours stale, 18% of rows are missing, the partner is reporting in rupees instead of paise, and the resulting distribution is so far from training that any downstream model is out of regime. Why those four checks happen after the pipeline finishes, not as part of it: the contracts measure properties of the output, not properties of the run. They need access to the data the run produced and to the historical baseline of past runs. A pipeline can absolutely run successfully and produce broken output, which is the entire reason data-contract observability is a separate concern from pipeline-execution observability. The data-engineering chapter data-quality-metrics-as-slos covers this same idea from the SLO side.

The four extra set_attribute calls — data.freshness_lag_hours, data.row_count_deviation_pct, data.median_amount_paise, data.amount_psi_vs_yesterday — are the bridge between the two worlds. They flow through your existing OTLP collector, into your existing trace backend, and become queryable in TraceQL or Honeycomb without any new infrastructure. The infrastructure shift is small; the conceptual shift — "the run finishing is not the same as the run being correct" — is enormous. Build 15 is mostly about the conceptual shift, plumbed through the OTLP pipes you already have.

Where the on-call discipline has to change

The PagerDuty handoff for a Build-14-shaped service is well-rehearsed. A HighErrorRate alert fires; the on-call opens the runbook; the runbook says "open the trace, find the slow span, look at its children, find the failing dependency". The full diagnostic ladder lives inside one trace. A Build-15 incident — the chargeback team noticed something at 11:30 — is a different shape entirely.

The responder has to answer: was the data fresh? Was it complete? Did its distribution drift? Did the model's predictions decouple from reality? Each question lives in a different system — the freshness check is in the warehouse manifest, the completeness check is in the data-quality framework (Great Expectations, Soda, Monte Carlo), the distribution check is in a feature-monitoring tool (WhyLabs, Arize, Evidently), the prediction-quality check is in a shadow-evaluation pipeline that joins predictions to truth labels with a 30-day lag. Reading them all in parallel is a different cognitive task from reading a flame graph.

The diagnostic ladder for data and ML failures, in the shape that has actually worked at Razorpay/Hotstar/Flipkart-scale teams, looks like four explicit rungs. Rung 1 — freshness: is the data the consumer is reading actually from the time window the consumer thinks it is from? Open the warehouse manifest, look at the latest partition's max(event_time) versus now() - 24h. Rung 2 — completeness: is the row count within historical bounds? Pull a 14-day series of daily row counts and look for a step. Rung 3 — distribution: are the values within historical bounds? Compute PSI on the top-5 features versus a 30-day baseline. Rung 4 — prediction quality: are the model's predictions still calibrated? Pull the most recent shadow-evaluated holdout and check AUC, calibration, and per-segment performance. A team that has internalised this ladder takes 8 minutes from page to root cause; a team that is still reading flame graphs takes 90.
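For concreteness, here is a sketch of the first two rungs as the warehouse queries a responder might paste into the console in those first minutes. The table and column names are this chapter's hypothetical ones and the SQL is deliberately generic (expect dialect tweaks); rungs 3 and 4 are the PSI check from the demo above and the shadow-holdout AUC check sketched under "Going deeper".

# ladder_rungs.py — rungs 1 and 2 as copy-pasteable queries, kept as strings so
# they can live in a runbook script. Names are hypothetical; adjust the SQL to
# your warehouse's dialect.
RUNG_1_FRESHNESS = """
    -- Is the latest partition actually from the window the consumer expects?
    SELECT max(event_time) AS latest_event
    FROM transactions_yesterday
    -- healthy: within ~30 minutes of the previous midnight
"""

RUNG_2_COMPLETENESS = """
    -- Is yesterday's row count within historical bounds? Look for a step change.
    SELECT date(event_time) AS day, count(*) AS row_count
    FROM transactions
    GROUP BY date(event_time)
    ORDER BY day DESC
    LIMIT 14
"""
# Rung 3 (distribution): PSI of the top features against a 30-day baseline — see the demo above.
# Rung 4 (prediction quality): AUC and calibration on the most recent shadow-evaluated holdout.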

The rungs map to the on-call team that owns each. Freshness incidents go to the data-platform team (the people who own the warehouse and the streaming layer). Completeness incidents go to the producer team (whose service emitted fewer events). Distribution incidents go to the team whose feature shifted (often a partner-integration team or an upstream data producer). Prediction-quality incidents go to the ML-platform team. The single biggest organisational change Build 15 forces is that the on-call team for an incident is determined by the rung where the failure lives, not by the service that "broke". The risk-scoring service did not break in Aditi's incident; the upstream gateway switched units, so the partner-integration team owns it. Without an explicit ladder, the page goes to the wrong team and bounces — every rotation. With a ladder, the responder names the rung in the first 90 seconds and routes correctly.

Figure: The Build 15 diagnostic ladder — four rungs, four owners. Rung 1 — Freshness (owner: data platform; signal: max(event_time), watermark lag, partition latency; detects in seconds). Rung 2 — Completeness (owner: producer team; signal: row count vs 7-day average, null rate, missing partitions; detects in minutes). Rung 3 — Distribution (owner: feature-source team; signal: PSI, KS-test, histogram drift vs baseline; detects in minutes to hours). Rung 4 — Prediction quality (owner: ML platform; signal: AUC, calibration, shadow-holdout drift; detects in hours to days). The responder starts at Rung 1 and climbs until a rung's check fires; the rung that fires names the failure and the owner.
Illustrative — the diagnostic ladder for a Build-15 incident. The rungs are ordered by detection latency (Rung 1 fires in seconds, Rung 4 fires in days), so a responder always checks freshness first and prediction quality last. That each rung has a different owner is the ladder's second most important property — without an explicit rung-to-team mapping, every incident bounces between three rotations.

The rungs also dictate what alert window is even possible. Rung 1 (freshness) can have a 1-minute alert window because the warehouse manifest updates every minute. Rung 2 (completeness) needs a 1-hour window because row counts are noisy at sub-hour granularity. Rung 3 (distribution) needs at least a 6-hour window — PSI computed on 1 hour of data is dominated by sampling noise. Rung 4 (prediction quality) needs at least 24 hours and often 7 days, because truth labels arrive late and the signal-to-noise on a single day's holdout is poor. Why the windows scale up by 100x as you climb the ladder: each rung is a more derived signal, computed over more data, with more sources of statistical noise. Freshness is a single timestamp comparison — one number versus another, no sampling involved. Distribution is a 200-bucket histogram comparison — sampling noise is real, and a 1-hour window has too few samples in the tail buckets to compute a stable PSI. Prediction quality has the same noise problem plus a temporal-truth-arrival problem — the truth labels for today's predictions trickle in over 30 days, so a "today's accuracy" computed at midnight tonight is missing 95% of the eventual labels. The window has to be long enough to gather labels; below that floor, you're alerting on noise. A team that tries to alert on prediction quality at 1-hour granularity will get a flapping mess of false-positive pages; a team that tries to alert on freshness at 24-hour granularity will discover the problem after the business has already taken loss. Window selection is rung-specific and getting it wrong is the second-most-common Build-15 failure mode after using Build-14 windows by default.
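One way to avoid defaulting to Build-14 windows is to put the rung-specific windows in a single config the alert evaluator reads. A minimal sketch — the windows and signal names are this chapter's illustrative values, not universal constants:

# alert_windows.py — rung-specific evaluation windows from the reasoning above.
ALERT_WINDOWS = {
    "rung_1_freshness":          {"window": "1m",      "signal": "max(event_time) lag vs expectation"},
    "rung_2_completeness":       {"window": "1h",      "signal": "row count vs 7-day moving average"},
    "rung_3_distribution":       {"window": "6h",      "signal": "PSI vs 30-day baseline"},
    "rung_4_prediction_quality": {"window": "24h-7d",  "signal": "AUC on shadow-evaluated holdout"},
}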

Common confusions

  • "Data observability is just observability with new dashboards." It is not. The signals are statistical (population-level, not request-level), the detection windows are 100x–10000x longer, the alert rules require new evaluators (PSI, KS-test, row-count-anomaly), and the on-call response is a four-rung diagnostic ladder, not a flame-graph dive. Bolting four new Grafana panels onto your existing setup is the start, not the end.
  • "If the pipeline DAG is green, the data is fine." It is not. A green DAG means the code ran without throwing an exception; it says nothing about whether the data the code produced matches business reality. The pipeline succeeded → the row count is right → the values are reasonable → the model's predictions are still useful is a chain of four independent checks, not one. Conflating the first with the last four is the central Build-14-thinking trap.
  • "OpenTelemetry can carry the new signals as custom span attributes, so it is the same system." It can carry them — and you should, because plumbing reuse is the cheap part. But the evaluators (PSI, anomaly detection on row counts, watermark-lag alerts, AUC drift) are not in the OTLP spec and are not in any collector processor. You will need a separate evaluator that reads the spans/metrics OTLP shipped, computes the contract checks, and emits new alerts. The transport is shared; the brain is not.
  • "You can SLO a once-a-day batch with the same burn-rate maths as a 200-req/sec API." You cannot. Burn-rate alerting assumes enough events to do statistics; one-event-per-day pipelines never have enough samples in a short window. The Build-15 equivalent is a per-run contract (each run either passes or violates each contract clause) plus a count of consecutive violations as the alert signal — fundamentally different maths from error-budget burn rate.
  • "Drift is just an ML problem." It is not. Distribution drift hits any system whose downstream consumer assumes a stable input distribution — fraud rules with hardcoded thresholds, surge-pricing algorithms calibrated to a fixed demand curve, GST-filing aggregations that assume a stable transaction-amount distribution. The ML community wrote the literature; the failure mode applies to every analytics-driven decision pipeline.
  • "Lineage is documentation; it doesn't help during an incident." It is the most useful Build-15 artefact in an incident. When the chargeback team pages at 11:30 IST, the responder's first question is "what upstream tables produced the features the model just consumed?" — that is a graph traversal in the lineage system. Without it, you are reading 14 dbt YAMLs by hand at 11:31 IST. The chapter /wiki/lineage-aware-alerting covers what lineage looks like when it is structured for incident response, not for compliance.

Going deeper

Why the OTLP data model needs an extension for data observability

OTLP carries spans, metrics, and logs. It does not carry datasets, partitions, or schemas as first-class entities. A row count is a metric (fine); a row count of a specific partition of a specific dataset that was produced by a specific run is awkward — you can encode it as a metric with dataset.name, partition.id, and run.id labels, but you have just blown up your cardinality budget and you still cannot ask "show me the schema of the data this metric was computed over". OpenLineage and the in-progress OpenTelemetry data-observability working group are extending the spec with Dataset, DataQualityResult, and DatasetVersion primitives that fit alongside spans. Until those land, most Build-15-shape teams encode dataset identity as span attributes and accept the cardinality hit, then re-evaluate when the spec stabilises. The chapter data-quality-metrics-as-slos covers the encoding patterns; expect a follow-on chapter when the OTel spec lands.
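A sketch of that stopgap encoding, using the same tracer name as the demo — the dataset.* attribute names are a local convention invented for this example, not part of the OTLP spec or any semantic-convention registry:

# Encoding dataset identity on an ordinary span — the stopgap described above.
from opentelemetry import trace

tracer = trace.get_tracer("razorpay.features.daily")
with tracer.start_as_current_span("features.daily.run") as span:
    span.set_attribute("dataset.name", "transactions_yesterday")
    span.set_attribute("dataset.partition", "2026-04-24")       # one value per day — cheap
    span.set_attribute("dataset.run_id", "scheduler-run-8f3a")  # one value per run — this is where
    span.set_attribute("data.row_count", 1_02_40_883)           # the cardinality bill grows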

How shadow evaluation and feedback loops actually close

Rung 4 (prediction quality) is the hardest rung because the truth label arrives late. For fraud, the truth is the chargeback that fires 30 days later. For loan defaults, the truth is the missed payment that arrives 90 days later. For recommendation, the truth is the user's click or non-click within the next 5 seconds (fast feedback) or the eventual purchase 14 days later (slow feedback). A production ML system needs two evaluation loops: a fast shadow loop that compares today's predictions against a labelled holdout that the team curates daily (catches gross failures within hours), and a slow truth loop that joins predictions to real truth labels as they arrive (catches subtle drift over weeks). Most Razorpay-shape platforms run both — the fast loop alerts on AUC drop; the slow loop drives quarterly model retraining. The chapter /wiki/model-drift-and-data-drift goes into the shapes of these loops in detail.
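A sketch of the fast loop, assuming a curated holdout with labels already attached and synthetic scores standing in for the model; the 0.86 threshold is the illustrative model SLO from earlier in this chapter:

# fast_shadow_loop.py — score a labelled holdout with today's model and page
# when AUC drops below the SLO. Everything here is a synthetic stand-in.
# pip install scikit-learn numpy
import numpy as np
from sklearn.metrics import roc_auc_score

def fast_shadow_check(labels: np.ndarray, scores: np.ndarray, auc_slo: float = 0.86):
    auc = roc_auc_score(labels, scores)
    return auc >= auc_slo, auc

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=5_000)
scores = 0.15 * labels + 0.85 * rng.random(5_000)   # weak signal → AUC well below the SLO
ok, auc = fast_shadow_check(labels, scores)
print(f"shadow AUC {auc:.3f} — {'within SLO' if ok else 'page the ML-platform rotation'}")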

Why "test in CI" does not extend to data and ML

Software CI tests on a fixed input dataset and asserts a fixed output. Data pipelines have no fixed input — yesterday's data is always different from today's, by design — so CI-style assertions ("output equals expected") do not apply. The closest analogue is statistical assertions ("median amount is within 95% CI of the historical median") and shape assertions ("schema matches; null rate is below 0.1%; row count is within 5% of 7-day rolling average"). These are runtime contracts, not compile-time tests; they live in the pipeline's post-hoc validation step (Great Expectations checkpoints, dbt tests, Soda agreements), not in your unit-test suite. A team that tries to write data-pipeline tests in pytest discovers within two sprints that they are reinventing Great Expectations badly; the early pivot to a contract-style framework is the cheap one.

The cost shape of data observability is different from request observability

A fraud-API request costs a few milliseconds of inference; instrumenting it with a 200-byte span is fine. A daily feature pipeline produces 80 crore rows; computing PSI on each of 200 features × 365 days × 30-day rolling baseline = 2.19 million PSI computations a year, each over a full day's data. The compute cost of observing the data can dwarf the compute cost of producing the data — especially for high-cardinality features where the histogram has to be recomputed per segment. The discipline that scales is sampling — don't compute PSI on the full population, compute it on a stratified 1% sample updated daily. Most production data-observability tools (Monte Carlo, WhyLabs, Evidently, Arize) sample by default; teams that build their own often skip sampling and discover their observability bill is 4x their pipeline bill. The cardinality discipline from /wiki/cardinality-the-master-variable applies just as hard on the data side as on the metric side.
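A sketch of that sampling discipline — the PSI helper mirrors the formula used in the demo above, and the segment column is a hypothetical stratification key (merchant category, say) chosen so that rare segments stay represented in the 1% sample:

# sampled_psi.py — compute PSI on a stratified 1% sample, not the full population.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: np.ndarray) -> float:
    e = np.histogram(expected, bins=bins)[0] / len(expected) + 1e-9
    a = np.histogram(actual, bins=bins)[0] / len(actual) + 1e-9
    return float(np.sum((a - e) * np.log(a / e)))

def stratified_sample(values: np.ndarray, segments: np.ndarray,
                      frac: float = 0.01, seed: int = 0) -> np.ndarray:
    """Keep frac of each segment so small-but-important segments survive the sample."""
    rng = np.random.default_rng(seed)
    kept = [rng.choice(np.flatnonzero(segments == seg),
                       size=max(1, int((segments == seg).sum() * frac)), replace=False)
            for seg in np.unique(segments)]
    return values[np.concatenate(kept)]

# Usage sketch: psi(stratified_sample(baseline_vals, baseline_segs),
#                   stratified_sample(today_vals, today_segs), bins=np.logspace(0, 8, 11))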

Reproduce this on your laptop

# Reproduce the demo and inspect the OTLP spans both pipelines emit
python3 -m venv .venv && source .venv/bin/activate
pip install opentelemetry-sdk opentelemetry-api numpy
python3 data_obs_demo.py
# Watch: both pipelines emit one span each. In the broken scenario, the Build 14 span carries
# rows.processed=84_12_406 and pipeline.status=ok. The Build 15 span carries the payload
# signals instead: data.freshness_lag_hours=5.98, data.row_count_deviation_pct=-15.88,
# data.median_amount_paise=55, data.amount_psi_vs_yesterday=4.71, data_contract.status=violated.
# Same OTLP plumbing; new payload signals; entirely different conclusion about the run.

Where this leads next

Build 15 is the next ten chapters: data-quality-metrics-as-slos, lineage-aware-alerting, model-drift-and-data-drift, observability-for-batch-pipelines, observability-for-stream-processors, and the chapters that follow on feature-store monitoring, vector-index health, training-pipeline observability, and the wall that closes the build. Each takes one rung of the ladder and writes the dashboards, alerts, and on-call playbooks for it.

The thread that ties Build 15 together is that the OpenTelemetry plumbing you finished in Build 14 is the right plumbing — spans, metrics, logs, OTLP, the collector, the sampling story — but the signals you ship through it are different, the evaluators that read them are different, the alerts they generate are different, and the team that responds is different. Build 14 taught you how to instrument an HTTP call. Build 15 teaches you how to instrument a dataset, a feature, a partition, and a model. Aditi's 11:42 IST incident does not get diagnosed by reading a flame graph; it gets diagnosed by climbing the four-rung ladder, finding that Rung 3 (distribution) fired six hours ago because a partner switched units, and routing the page to the partner-integration team — all in under ten minutes, without anyone in the chargeback team needing to DM the SRE.

The deeper lesson is that observability is fundamentally about closing the gap between what you can see and what is actually happening. For HTTP services, the wire is the truth — what the user sees is approximately what the boundary measures. For data and ML systems, the wire lies — the wire shows the call succeeded, the model returned a number, the pipeline finished — and the truth lives in the payload, the distribution, the prediction quality, the chargeback that arrives next month. Build 15 is the discipline of not trusting the wire when the wire is not the customer's experience. That is a real philosophical shift; the next ten chapters are how you operationalise it.

References