Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Data quality metrics as SLOs

It is 09:17 IST at a hypothetical Flipkart pricing-platform team and Karan, the on-call data engineer, has just been asked the question every data-platform team eventually gets: "what is your SLO?" The platform team across the corridor runs pricing-api and can answer in one sentence — "99.9% of GET /price/:sku returns 2xx within 80ms over a rolling 28-day window, error budget is 40 minutes per 4 weeks, current burn-rate is 0.3". Karan runs the pipeline that produces the prices pricing-api reads. The pipeline ran last night. It was green. The numbers were wrong by an average of ₹62 per SKU because a partner feed silently switched units. There is no sentence Karan can say back. He has a runbook, a Slack channel, a Great Expectations YAML, and a vague feeling that "we should SLO this" — and no idea how, because every burn-rate calculator on the internet was written for 800-req-per-sec APIs and his pipeline runs once a day.

This chapter is the sentence Karan does not yet have. The previous chapter (observability for data and ML is different) named four payload signals — freshness, completeness, distribution, prediction quality — that traditional observability cannot see. This chapter is the operationalisation: how you turn each signal into a measurable SLI, how you set an SLO target on it, and how you alert on burn without lying to yourself about statistical power.

A data-quality SLO is a per-batch contract evaluated after the pipeline finishes, not a percentage of requests evaluated at the boundary. Each clause — freshness lag, row-count deviation, null rate, distribution drift — is a binary check per run; the SLI is (passed runs) / (total runs) over a rolling window long enough to do statistics on. Burn-rate alerting works once you stop pretending one event a day is a request stream and start counting consecutive contract violations as the alarm signal.

Why request SLOs do not translate, in one number

A 99.9% availability SLO on pricing-api at 800 req/sec means in any 5-minute window you accumulate roughly 240,000 samples; a single 2xx rate computed over that window has a 95% confidence interval of about ±0.013% — narrow enough that a real drop from 99.9% to 99.7% is statistically detectable in five minutes. The same 99.9% target applied to a once-a-day pipeline means in any 5-minute window you accumulate zero samples. Over 30 days you accumulate 30 samples. A 99.9% SLO over 30 samples allows 0.03 failures — fewer than one — and a single failed run takes you from "above SLO" to "below SLO" with an event count that no honest statistician would call a sample. The maths that worked at 800 req/sec breaks at 1 req/day; that single ratio — 69,120,000× — is why every translation in this chapter exists.
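The arithmetic is quick to check; a minimal back-of-envelope sketch in Python, using the normal approximation for the confidence interval (the variable names are illustrative):

import math

n_api = 800 * 300                            # samples in a 5-minute window at 800 req/sec
p = 0.999
ci = 1.96 * math.sqrt(p * (1 - p) / n_api)   # 95% CI half-width on the success rate
print(f"samples per 5-min window: {n_api:,}; 95% CI: +/-{ci:.3%}")                      # ~0.013%

runs_30d = 30                                # one run per day over a 30-day window
print(f"allowed failures at 99.9% over {runs_30d} runs: {runs_30d * (1 - 0.999):.2f}")  # 0.03
print(f"event-rate ratio, 800/sec vs 1/day: {800 * 86_400:,}x")                         # 69,120,000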

Why request SLOs and data SLOs need different maths — the 69-million-fold gap (illustrative; same SLO target, different statistical regime, comparing samples-per-window between an API and a daily batch):

pricing-api — 800 req/sec, 99.9% SLO: 240,000 samples per 5-minute window; 95% CI on the success rate of ±0.013%; a real 0.2% drop is detectable in ~5 minutes.

overnight feature pipeline — 1 run/day, naive 99.9% SLO: 0 samples per 5-minute window; 30 samples per 30 days (which allows 0.03 failures at 99.9%); answering "is the SLO violated" takes days, and the next run might fix it.
Illustrative — a 99.9% target requires comparable sample counts to be a meaningful contract. Five-minute windows on a daily batch have zero events; thirty-day windows have thirty. Both ends of the request-SLO maths break, which is why the data SLO has to be built on a different counting unit (passed runs) and a different alarm primitive (consecutive violations).

Why the 5-minute multi-window burn-rate alert from chapters /wiki/multi-window-multi-burn-rate-alerts and /wiki/burn-rate-alerting does not port: those alarms compute actual_error_rate / (1 - slo_target) over a short window (1h) and a long window (6h), and require both windows to breach simultaneously. With one event per day, the 1h window is empty 23 hours out of 24, and any single failed run flips the long-window rate from 0% to 100% in one event. The denominator is too small for the maths; what works instead is a population-of-recent-runs SLI computed over weeks, plus a separate consecutive-violations counter for fast detection. Both are needed, and they answer different questions.

The four SLI primitives — what you measure on each batch

A data-quality SLI is a function of the output of a batch, not of the run's success. Four primitives cover the vast majority of pipelines; you compose them per dataset. Each emits a binary (passed, failed) per run, plus a numeric value for trending.

Freshness lag. freshness_lag = expected_max_event_time - actual_max_event_time. For an "overnight transactions" table that should be current to yesterday's 23:59 IST, the SLI passes if freshness_lag <= 30 min. Threshold comes from the downstream consumer's tolerance — if the fraud team can ride out 30 minutes of stale data before false positives spike, that is the floor; tighter than that wastes engineering effort, looser than that lets damage through. Every dataset needs an explicit number, picked by the consumer team.

Completeness. row_count_deviation = (today_rows - rolling_baseline) / rolling_baseline. The SLI passes if |deviation| <= 5%. The rolling_baseline is a 14-day median of row counts (median, not mean — robust to last week's spike from the IPL final or this week's dip from a holiday). The 5% number is not universal; for a Hotstar viewing-events table with strong daily variance, the floor is 15%. Pick per-dataset using the historical noise floor — the smallest deviation that is not explained by normal variance — and add a 50% safety margin.

Distribution. psi = sum((p_today - p_baseline) * log(p_today / p_baseline)) over histogrammed buckets of the value. The SLI passes if psi <= 0.2 (a number borrowed from the credit-risk literature where PSI was invented; values below 0.1 are "stable", 0.1–0.2 are "minor shift", above 0.2 demand investigation). PSI is computed per column, not per row; a pipeline with 12 numeric columns has 12 distribution SLIs. The compute cost is real — see the cost discussion in §6 of the previous chapter.

Null / schema integrity. null_rate = sum(col is null) / total_rows. The SLI passes if null_rate <= 0.1% for required columns and the schema (column names, types, ordering) matches the registered version exactly. Schema drift is the binary part — any new column, any type change, any reordering fails the check; null rate is the continuous part because real-world feeds occasionally drop one row's value without dropping the whole batch.

These four are the floor; specific domains add more (e.g. referential integrity for a star schema, monotonicity for time-series, range constraints for known-bounded values). The discipline is: every dataset declares its SLIs explicitly in a registered contract, every batch evaluates them, every result is logged. The remaining sections show how the contracts compose into SLOs and burn-rate alarms that actually work.

A working contract, evaluated, alerted — runnable

The smallest end-to-end demonstration: a contract definition, a batch that runs against synthetic data, an SLI evaluator, a 30-day SLO computation, and a consecutive-violations alarm. Save as dq_slo.py and run.

# dq_slo.py — data-quality SLO end-to-end on a synthetic daily batch.
# pip install numpy pandas
from dataclasses import dataclass
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

@dataclass
class Contract:
    dataset: str
    freshness_lag_threshold_min: float
    row_count_deviation_threshold: float
    psi_threshold: float
    null_rate_threshold: float
    slo_target: float          # rolling 30-day pass-rate target
    consecutive_violations_to_page: int  # fast-path alarm

# A Flipkart pricing-feed contract — every clause is a number, picked from the
# downstream consumer's tolerance, not from a vendor default.
contract = Contract(
    dataset="flipkart.partner_pricing.daily",
    freshness_lag_threshold_min=30,
    row_count_deviation_threshold=0.05,    # ±5%
    psi_threshold=0.2,                      # credit-risk literature default
    null_rate_threshold=0.001,              # 0.1%
    slo_target=0.97,                        # 30-day budget: 0.9 failed runs
    consecutive_violations_to_page=2,       # two-in-a-row pages on-call
)

def evaluate_batch(batch: pd.DataFrame, baseline: dict, ts: datetime, c: Contract) -> dict:
    """Returns the four SLI verdicts plus numeric values for the run."""
    # This feed's contract expects events current to the run time, so the
    # expected max event_time is the run timestamp ts itself.
    actual_max = batch["event_time"].max()
    lag_min = (ts - actual_max).total_seconds() / 60

    n = len(batch)
    dev = (n - baseline["row_count_median_14d"]) / baseline["row_count_median_14d"]

    bins = np.logspace(0, 8, 21)
    today_dist = np.histogram(batch["amount_paise"].values, bins=bins)[0] / n + 1e-9
    psi = float(np.sum((today_dist - baseline["amount_dist"])
                       * np.log(today_dist / baseline["amount_dist"])))

    null_rate = batch["amount_paise"].isna().mean()

    return {
        "ts": ts.isoformat(),
        "freshness_lag_min": round(lag_min, 1),
        "row_count_deviation": round(dev, 3),
        "psi": round(psi, 3),
        "null_rate": round(null_rate, 5),
        "freshness_pass": lag_min <= c.freshness_lag_threshold_min,
        "completeness_pass": abs(dev) <= c.row_count_deviation_threshold,
        "distribution_pass": psi <= c.psi_threshold,
        "schema_pass": null_rate <= c.null_rate_threshold,
    }

def run_passed(verdict: dict) -> bool:
    return all(verdict[k] for k in ("freshness_pass", "completeness_pass",
                                    "distribution_pass", "schema_pass"))

# Synthesise 30 days of runs — most healthy, three injected failures.
rng = np.random.default_rng(2026)
baseline_amounts = rng.lognormal(10.0, 1.5, 1_000_000).astype(int)
baseline = {
    "row_count_median_14d": 1_000_000,
    "amount_dist": np.histogram(baseline_amounts, bins=np.logspace(0, 8, 21))[0]
                   / 1_000_000 + 1e-9,
}

history = []
for d in range(30):
    ts = datetime(2026, 4, 1, 4, 0) + timedelta(days=d)
    if d == 18:    # injected: source consumer fell behind
        max_t = ts - timedelta(hours=6, minutes=45)
        amounts = rng.lognormal(10.0, 1.5, 990_000).astype(int)
    elif d == 19:  # injected: same incident continues, partner unit switch
        max_t = ts - timedelta(hours=4, minutes=20)
        amounts = rng.lognormal(4.0, 1.0, 820_000).astype(int)
    elif d == 27:  # injected: clean partial outage (rows missing only)
        max_t = ts - timedelta(minutes=10)
        amounts = rng.lognormal(10.0, 1.5, 920_000).astype(int)
    else:
        max_t = ts - timedelta(minutes=rng.integers(2, 18))
        amounts = rng.lognormal(10.0, 1.5, rng.integers(985_000, 1_015_000)).astype(int)
    batch = pd.DataFrame({"amount_paise": amounts,
                          "event_time": [max_t] * len(amounts)})
    history.append(evaluate_batch(batch, baseline, ts, contract))

# Compute the rolling-30-day SLI and the consecutive-violations counter.
df = pd.DataFrame(history)
df["passed"] = df.apply(run_passed, axis=1)
slo_30d = df["passed"].mean()
streak, max_streak = 0, 0
for p in df["passed"]:
    streak = 0 if p else streak + 1
    max_streak = max(max_streak, streak)

print(f"30-day SLO actual:      {slo_30d:.3f} (target {contract.slo_target})")
print(f"runs passing:           {df['passed'].sum()} / {len(df)}")
print(f"longest violation run:  {max_streak}")
print(f"would page (>= {contract.consecutive_violations_to_page}):    "
      f"{'YES' if max_streak >= contract.consecutive_violations_to_page else 'no'}")
print("\nFailures:")
print(df.loc[~df["passed"], ["ts", "freshness_lag_min", "row_count_deviation",
                             "psi", "null_rate"]].to_string(index=False))
Sample run:
30-day SLO actual:      0.900 (target 0.97)
runs passing:           27 / 30
longest violation run:  2
would page (>= 2):      YES

Failures:
                 ts  freshness_lag_min  row_count_deviation    psi  null_rate
2026-04-19T04:00:00              405.0               -0.010  0.014        0.0
2026-04-20T04:00:00              260.0               -0.180  4.712        0.0
2026-04-28T04:00:00               10.0               -0.080  0.013        0.0

Read the output. Three runs failed in the 30-day window, putting the actual SLO at 90.0% — well below the 97% target, so the budget alarm fires (the team is consuming error budget faster than the contract allows). Separately, the consecutive-violations alarm fires because the April 19 and April 20 runs failed back-to-back: a source more than six hours stale on the 19th, then the partner-unit switch on the 20th. The third failure, on April 28, is a single-run completeness drop — the 30-day SLO notices it, the consecutive-violations alarm does not.

The two alarms catch different incidents by design. The consecutive-violations counter is a fast-path detector for active incidents — back-to-back failures almost always mean a real, ongoing problem (the source is still broken, the partner has not switched back). It optimises for low time-to-detect: one missed run waits for a second confirmation, two missed runs page. The 30-day SLO catches slow degradation — three single-run failures spread across a month point at something systemic (slow producer drift, a flaky upstream) that no single failure justifies paging on. Both signals are necessary; running only one misses half the failure modes.

The contract dataclass is the load-bearing artefact. freshness_lag_threshold_min=30 is a number the consumer team picks — there is no global default, no vendor recommendation, no smart auto-derivation. The platform team writes the contract with the consumer team in a 30-minute meeting, and the meeting outputs a YAML file that lives next to the pipeline code. slo_target=0.97 sets the rolling 30-day target; at one run per day that is a budget of 0.9 failed runs per 30 runs, so a single failed run already takes the window below target. consecutive_violations_to_page=2 is the fast-path threshold; one failed run is allowed to be noise (the network blipped, the partner gateway timed out for 11 minutes), two in a row is unlikely to be coincidence. run_passed is an all() over the four clauses — any single failure fails the run, which is the conservative default; a more sophisticated contract weights clauses (e.g. a freshness failure blocks downstream, a distribution failure only warns) and the dataclass extends accordingly.

Burn-rate, redefined for batches

Burn-rate alerting on request SLOs uses two windows (1h and 6h) and the formula burn = error_rate / (1 - slo_target); if burn ≥ 14.4 over both windows you have consumed 2% of a 30-day budget in 1 hour and you page. None of those numbers carry over for a daily batch, but the concept — "you are burning faster than you can afford" — does, with a redefined unit.

A daily batch on a 97% / 30-day SLO has a budget of 0.9 failed runs per 30 days. The unit is "fraction of allowed failures consumed". A single failed run consumes 1 / 0.9 = 111% of the 30-day budget. That number sounds catastrophic and is — one failure does not just dent the budget, it blows it. This is the right intuition: data SLOs at low event rates have so little budget that the burn-rate concept compresses to a binary "you are over budget" / "you are not". Multi-window arithmetic does not buy you anything because there is only one window's worth of events.

What works instead is a trailing failure count over multiple horizons, with thresholds that escalate. Three horizons in practice: last 3 runs (fast-path, 3-day cycle, used for the consecutive-violations alarm), last 14 runs (medium-path, captures recurring weekly-cycle failures), last 30 runs (the SLO-budget alarm proper). Each horizon has its own threshold:

Horizon Threshold Triggers when
Last 3 runs ≥2 failed Active incident in progress
Last 14 runs ≥3 failed Recurring problem (weekly cron, periodic upstream outage)
Last 30 runs >budget (e.g. >0.9) SLO breach, monthly review trigger

The fast-path pages on-call. The medium-path opens a Jira ticket and notifies in Slack — not a page, because a recurring weekly cron failure does not need anyone awake at 02:00 IST. The slow-path drives the monthly post-mortem and the budget-policy decision (do you slow feature work to fix reliability, or accept the lower SLO and renegotiate with stakeholders?). Why three horizons rather than the two-horizon (1h, 6h) burn-rate template: the request-SLO template assumes events are i.i.d. and arrive at high enough rate that a single short window has enough samples to compute a rate. Daily batches are neither i.i.d. (today's failure is correlated with yesterday's failure if the same upstream is broken) nor high-rate. Three horizons mirror the three correlation timescales at which failures actually cluster — same-incident (3 runs), same-weekly-cron (14 runs), same-architectural-issue (30 runs). The thresholds are picked so each horizon covers a different decision: page, ticket, post-mortem.

The Python evaluator above already implements the fast-path; extending to the medium and slow paths is a few extra lines. The discipline is the same: the alarm is a count over a horizon, not a rate over a window, because rates require sample sizes you do not have.
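A minimal sketch of all three horizons, assuming the df with its passed column built by dq_slo.py above; the function name and return shape are illustrative:

def horizon_alarms(passed: pd.Series, budget_30d: float = 0.9) -> dict:
    """Failure counts over the last 3 / 14 / 30 runs, mapped to response paths."""
    failed = ~passed
    fast = int(failed.tail(3).sum())      # fast path: page on-call
    medium = int(failed.tail(14).sum())   # medium path: ticket + Slack
    slow = int(failed.tail(30).sum())     # slow path: SLO-budget review
    return {
        "page": fast >= 2,
        "ticket": medium >= 3,
        "slo_breach": slow > budget_30d,
        "counts": {"last_3": fast, "last_14": medium, "last_30": slow},
    }

print(horizon_alarms(df["passed"]))
# With the synthetic history above: page False (the incident has cleared by day 30),
# ticket True (three failures in the last 14 runs), slo_breach True (3 > 0.9).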

Three-horizon failure-count alarm for batch SLOs (illustrative) — three timelines, one per horizon, with the same three failed runs triggering three different responses depending on which horizon's threshold breaches:

Last 3 runs — fast path: page on-call when ≥ 2 failed; detection latency ~1 day.

Last 14 runs — medium path: ticket plus Slack when ≥ 3 failed; detection latency ~2 weeks.

Last 30 runs — slow path: SLO breach when failed runs exceed the budget (0.9); drives the monthly review.
Illustrative — three horizons, three thresholds, three response paths. The fast horizon catches active incidents (page); the medium horizon catches recurring problems (ticket); the slow horizon drives the SLO-policy review. Two-window burn-rate from the request world collapses to a single binary at this event rate; three failure-count horizons replace it.

Where contracts live, who owns them, how they evolve

A contract that lives only in the platform team's heads is not a contract; it is a tribal-knowledge SLO that breaks the first time the original engineer leaves. Production contracts live in three places at once: a YAML file in the producer service's repository (versioned with the producer's code), a registered entry in a central data-contract registry (queryable by any consumer), and a runtime-evaluable form (Great Expectations checkpoint, Soda agreement, or hand-rolled Python like the example above). All three must agree, and a CI check fails the producer's PR if they drift.

The ownership model has to mirror Conway's law or it does not stick. The producer team owns the contract clauses that depend on producer behaviour (schema, null rate, row-count stability, freshness when freshness is a producer property). The consumer team owns the clauses that depend on consumer needs (the freshness threshold, the distribution thresholds, the per-segment requirements). The platform team owns the registry, the runtime evaluator, and the SLO computation. When a clause is violated, the page goes to the team whose ownership covers that clause — freshness violations to the producer team, distribution violations to the team whose feature shifted, schema-drift violations to the producer team. This mirrors the rung-to-team mapping from the previous chapter; the contract is the artefact that makes the mapping explicit and machine-readable.

Contracts evolve. A real production contract for flipkart.partner_pricing.daily will get its 30-minute freshness threshold relaxed to 60 minutes during the Big Billion Days week (when partner feeds are predictably 45 minutes late under load); the YAML carries a temporal override during: "2026-10-08T00:00:00Z/2026-10-15T23:59:59Z"; freshness_lag_threshold_min: 60. The consumer team approves; the platform team merges; the contract registry tracks the override with start and end timestamps; alarms during the override use the relaxed threshold and post-event the team reviews whether to make it permanent. Without temporal overrides, contracts either bake in the worst case (over-permissive, hides real failures) or bake in the best case (under-permissive, fires every BBD); with overrides, the contract tracks the actual operational regime.
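Applied at evaluation time, the override is a handful of lines. A sketch, assuming the Contract dataclass and contract instance from dq_slo.py; the override field names are illustrative, and the timestamps use an explicit +00:00 offset so datetime.fromisoformat parses them on older Python versions:

from dataclasses import replace
from datetime import datetime, timezone

overrides = [{
    "during": ("2026-10-08T00:00:00+00:00", "2026-10-15T23:59:59+00:00"),   # BBD week
    "freshness_lag_threshold_min": 60,
}]

def effective_contract(c: Contract, run_ts: datetime) -> Contract:
    """Return the contract with any override active at run_ts applied."""
    for o in overrides:
        start, end = (datetime.fromisoformat(t) for t in o["during"])
        if start <= run_ts <= end:
            return replace(c, freshness_lag_threshold_min=o["freshness_lag_threshold_min"])
    return c

bbd_run = datetime(2026, 10, 10, 4, 0, tzinfo=timezone.utc)
print(effective_contract(contract, bbd_run).freshness_lag_threshold_min)    # 60 inside the window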

The hardest contract evolution is adding clauses retroactively. The day after the partner-unit-switch incident, the platform team will want to add a "median amount within paise range" SLI; doing so on a contract that has already been passing for months means the new clause has no historical baseline. The discipline that works: add the clause in log-only mode for 14 days (compute, log to the SLO history, do not page), accumulate a baseline, then enable enforcement. Skipping the log-only phase produces a flapping mess of pages on the first 14 days of historical noise. Why log-only-then-enforce is non-negotiable: every new SLI clause has a baseline that must be calibrated against the actual data distribution, not against a vendor default or a guess. The PSI threshold of 0.2 from the credit-risk literature is a reasonable starting point but is wrong for a feed where the 14-day baseline noise floor is already 0.15 — the clause would page on day one. Enabling in shadow mode, watching the noise floor for two weeks, then setting the threshold at floor + 1.5× the standard deviation is the only way to get a clause that fires on real signal.
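The calibration arithmetic itself is small. A sketch, with the 14 shadow-mode PSI values synthesised here for illustration (assumes numpy is imported as np):

rng = np.random.default_rng(7)
shadow_psi = rng.normal(0.12, 0.02, 14).clip(min=0)        # stand-in for 14 logged shadow-mode PSIs

noise_floor = float(np.median(shadow_psi))                  # typical day-to-day value on healthy data
threshold = noise_floor + 1.5 * float(np.std(shadow_psi))   # floor + 1.5x standard deviation
print(f"noise floor {noise_floor:.3f}; enforce at psi <= {threshold:.3f}")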

Common confusions

  • "A data SLO is just a percentage of successful runs, like an availability SLO." It is not. Run success is necessary but not sufficient — a run can succeed and produce wrong data (the entire premise of the previous chapter). The SLO is on per-run contract pass, where the contract has multiple clauses (freshness, completeness, distribution, schema) and a run passes only if every clause passes. Counting raw run success conflates "the code ran" with "the data is right" and misses the silent failures that motivated the entire framing.
  • "Burn-rate alerting works the same way for a daily batch as for an API." It does not. Burn-rate maths assumes enough events per window to compute a stable rate; one event per day has zero events in a 1h window and one event in a 24h window. The replacement is a three-horizon failure-count alarm (3-run, 14-run, 30-run) with explicit thresholds and explicit response actions (page, ticket, review). The intent — "alert when you are consuming budget faster than you can afford" — survives; the implementation is entirely different.
  • "Set the SLO target by picking a number like 99.9% and standardising across all datasets." That is the request-SLO instinct and it produces nonsense for data. A 99.9% / 30-day target on a daily batch allows 0.03 failed runs in 30 days, which rounds to "no failures ever". A reasonable starting point for daily batches is 96% or 97% — one allowed failure per 30 days — calibrated per dataset against historical noise. Datasets with stricter downstream consequences (real-time fraud features) deserve tighter targets and tighter freshness thresholds; datasets with looser consequences (weekly analytics aggregates) deserve looser ones. Standardising kills the discipline.
  • "You can compute SLIs on a stratified sample of the batch to save compute." You can for distribution and null-rate clauses; you cannot for freshness and completeness, which are properties of the whole output. A 1% stratified sample of a 1-crore-row batch gives a stable PSI estimate and a stable null-rate estimate at 1% the compute cost. The same sample tells you nothing useful about whether all the rows arrived (completeness) or whether the latest event is fresh (freshness, which only requires reading one row — the one with the maximum event_time). Sample where sampling is statistically valid; do not sample where the SLI is structural.
  • "Data contracts are documentation; the runtime check is the SLO." They are the same thing or you have a problem. The contract YAML, the registry entry, and the runtime evaluator must agree, with CI enforcement on the producer's PR. A contract that lives only as documentation drifts from the runtime within weeks — someone changes a threshold in Great Expectations without updating the YAML, or vice versa — and the SLO loses its meaning. Single source of truth, machine-readable, CI-enforced; nothing else holds up.

Going deeper

How freshness, completeness, distribution, and schema clauses compose

A pipeline with multiple datasets has SLIs on each, and downstream consumers care about the intersection — the fraud model needs transactions AND merchants AND accounts to all be fresh, complete, in-distribution, and schema-valid. The composed SLO is min(SLO_transactions, SLO_merchants, SLO_accounts) over the same window — the worst dataset dictates the consumer's experience. Reporting per-dataset SLOs is necessary for diagnosis (which producer team is the bottleneck) but the consumer's experience is the min, and that is the number that lives on the consumer's dashboard. Most teams report both; some only report the min, which loses the per-producer signal and makes finger-pointing harder; some only report per-dataset, which loses the consumer view and makes priority-setting harder. The both-views default is the discipline that scales.
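The composition itself is a one-liner over the per-dataset numbers; an illustrative sketch:

per_dataset_slo_30d = {"transactions": 0.967, "merchants": 1.000, "accounts": 0.933}   # illustrative
consumer_slo = min(per_dataset_slo_30d.values())
bottleneck = min(per_dataset_slo_30d, key=per_dataset_slo_30d.get)
print(f"consumer-facing SLO: {consumer_slo:.3f} (bottleneck dataset: {bottleneck})")
# Report both views: the min on the consumer's dashboard, the per-dataset map for diagnosis.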

When to use Great Expectations vs Soda vs hand-rolled Python

Great Expectations excels at schema and null-rate checks; its expect_column_values_to_be_between family covers 80% of distribution checks for structured tables. Soda is similar with a YAML-first surface area and lighter integration cost. Hand-rolled Python wins when the contract has a custom statistical clause (PSI on log-bucketed amounts, KS-test against a calibrated baseline, segment-aware checks where the threshold differs by region or partner-id) that does not fit either tool's built-in vocabulary. The 80/20 in practice is: Great Expectations or Soda for the structural clauses, hand-rolled Python for the statistical ones, both wrapped in the same evaluator and emitting the same SLI output format. Mixing tools is fine; mixing output formats is the trap to avoid — the SLO computation downstream needs a uniform input.

Contracts under regulatory regimes (RBI, GST, SEBI)

A pricing-feed contract at Flipkart is one thing; a transaction-reporting contract at Razorpay or Cred has to satisfy RBI's expectations on transaction completeness and timeliness, and SEBI for any market-data feed. Regulators do not care about your 97% SLO; they care about every transaction being reported, every time, with audit trails. The contract clause that satisfies a regulator is null_rate <= 0 (zero, not 0.1%), row_count_deviation_threshold = 0.0 against a reconciled source-of-truth count, and a consecutive_violations_to_page = 1 (any failure pages on-call). The SLO target rises to 99.99% or higher because the regulator's expectation is approximately "do not fail". These contracts are a different discipline from the engineering-SLO contracts in this chapter; they get audited, they require reconciliation against an external system, and they pay for tighter thresholds with much higher engineering cost. The chapter /wiki/observability-and-compliance-overlaps covers the full picture; the relevant takeaway here is that not every dataset is a 97% SLO — some are 99.99% with reconciliation, and the contract structure has to support both.
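The same Contract shape from dq_slo.py carries the stricter regime; the dataset name and the exact numbers below are illustrative, not regulatory text:

regulatory_contract = Contract(
    dataset="razorpay.txn_reporting.daily",   # hypothetical dataset
    freshness_lag_threshold_min=30,           # tighten to match the regulator's reporting window
    row_count_deviation_threshold=0.0,        # reconciled against an external source-of-truth count
    psi_threshold=0.2,                        # drift is still monitored, but it is not the audit clause
    null_rate_threshold=0.0,                  # zero, not 0.1%
    slo_target=0.9999,
    consecutive_violations_to_page=1,         # any failure pages on-call
)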

What "violated" emits as telemetry

The runtime evaluator's job does not end at returning passed=False; it must emit telemetry that downstream tooling can route, query, and visualise. The shape that has worked at Razorpay/Hotstar-shape teams: an OpenTelemetry span per run with attributes for each SLI value (data.freshness_lag_min, data.row_count_deviation, data.psi_amount, data.null_rate) and a binary data_contract.status attribute, plus a Prometheus metric data_contract_pass_total{dataset, clause} incremented per clause per run. The span gives you single-run diagnosability; the metric gives you the trailing-30-day pass-rate and the burn-rate computation; both flow through the same OTLP collector and the same backend (Tempo + Prometheus) the rest of the observability stack already uses. Building a parallel system for "data telemetry" is the temptation to resist — the OTLP plumbing is the right plumbing; the signals are different, but the transport is shared. This mirrors the conclusion from the previous chapter.
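A sketch of the emission path, assuming the verdict dict and run_passed from dq_slo.py; the span attributes and metric name follow the prose above but are conventions rather than a standard, and the OpenTelemetry SDK and exporter configuration is omitted:

# pip install opentelemetry-api prometheus-client
from opentelemetry import trace
from prometheus_client import Counter

tracer = trace.get_tracer("dq_slo")
contract_pass_total = Counter("data_contract_pass_total",
                              "Contract clause passes per run",
                              ["dataset", "clause"])

def emit(verdict: dict, dataset: str) -> None:
    """One span per run for diagnosability, one counter increment per passing clause."""
    with tracer.start_as_current_span("data_contract.evaluate") as span:
        span.set_attribute("data.freshness_lag_min", verdict["freshness_lag_min"])
        span.set_attribute("data.row_count_deviation", verdict["row_count_deviation"])
        span.set_attribute("data.psi_amount", verdict["psi"])
        span.set_attribute("data.null_rate", verdict["null_rate"])
        span.set_attribute("data_contract.status",
                           "pass" if run_passed(verdict) else "fail")
        for clause in ("freshness", "completeness", "distribution", "schema"):
            if verdict[f"{clause}_pass"]:
                contract_pass_total.labels(dataset=dataset, clause=clause).inc()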

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install numpy pandas
python3 dq_slo.py
# Watch: 30 runs synthesised, three injected failures (days 18, 19, 27).
# Output shows actual SLO at 0.900, longest violation streak of 2, would_page=YES.
# Tweak `consecutive_violations_to_page=3` and re-run — would_page flips to no,
# even though the SLO is still violated. The two alarms answer different questions.

Where this leads next

The contract is the structural artefact that the rest of Build 15 hangs from. The next chapter, /wiki/lineage-aware-alerting, shows how a contract violation at the producer end propagates through the lineage graph to the consumer's alarm — the responder sees not just "freshness on partner_pricing failed" but "fraud-model risk-score-v2 is downstream of three failing contracts; here are the seven other models also affected". Lineage is what turns per-dataset contracts into a system-level observability story.

After that, /wiki/model-drift-and-data-drift extends the distribution clause to the ML side: the same PSI / KS-test framework, applied to features at training time and at serving time, with the truth-arrives-30-days-late problem fully addressed. By the end of Build 15 the contract you wrote in this chapter will compose with the lineage graph, the drift detectors, and the shadow-evaluation pipeline into a single observability surface that catches the silent failures the request-shaped stack cannot see — and that, structurally, is what closes Aditi's 11:42 IST incident in 8 minutes instead of 90.

References