Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Observability for batch pipelines

09:04 IST on a Wednesday. Aditi, the platform on-call at a hypothetical Swiggy, gets a Slack message from the growth team: the daily restaurant_funnel dashboard is showing 04 April again — yesterday's date — even though it is now 09:04 on the 5th. Airflow shows the DAG ran at 02:14 with state=success. Every task is green. Every retry counter is zero. The Loki logs for the run contain no warnings. The Grafana panel for the warehouse load shows the expected 18 GB written. Nothing is wrong, except that the dashboard is wrong.

This is the canonical batch-pipeline failure: the job succeeded, end-to-end, and produced garbage. A web-service observability stack — RED metrics, distributed traces, error-rate SLOs — would have flagged none of it, because none of those signals are wrong. The 200s are real, the spans closed cleanly, error rate is 0.0%. The thing that broke is something none of those tools measure: the shape, freshness, and lineage of the data the pipeline produced. Batch observability is the discipline of measuring those instead, and it is structurally different from the request-response observability the rest of this curriculum is built on.

Batch pipelines fail by succeeding on the wrong data, so the SRE signals — error rate, latency, span counts — are useless. Observability for batch is built on four orthogonal signals: freshness (how old is the youngest partition), volume (how many rows landed), schema (did the columns match the contract), and distribution (did the values look like they did yesterday). Treat each as an SLO with a burn rate, page on freshness misses, and never trust a green DAG run.

Why request-response signals do not fit batch

A web service emits a request every few milliseconds. Its observability surface is built on the assumption that you have a high-frequency stream of identical events to aggregate — count them, time them, group them by status code. RED works because rate, errors, and duration are well-defined per second and densely sampled.

A batch pipeline emits one event per run. The Airflow DAG restaurant_funnel runs once at 02:00 IST and produces one outcome: success or failure. There is no rate. There is no per-second percentile. The "duration" is the run-time, which you might measure in tens of minutes. A 4xx-equivalent for batch — the job failed loudly — is the easy case; Airflow already alerts on that. The dangerous case is the silent one: the job succeeded, but the partition it produced has 3% of yesterday's rows, or yesterday's schema, or last week's date column.

Why traditional metrics miss this: a batch pipeline's "request" is the run itself, and "the response" is the data on disk. The pipeline can return HTTP 200 / exit code 0 / Airflow state=success while still having written wrong data. Health checks built on the run's metadata (did it finish? did it throw?) cannot distinguish a correct run from a vacuous one — the only way to tell the difference is to inspect the artefact. Request-response observability never had to do this because the request was the artefact; in batch, the artefact lives downstream of the run.

[Figure: Why request-response observability misses batch failures. Top row, "Web service": a dense stream of small requests (~10k req/s into checkout-api) aggregated by RED metrics into err_rate=0.04% and p99=240ms panels, measured per second; if it breaks, RED catches it in under a minute. Bottom row, "Batch pipeline": a single 02:14 IST Airflow run with state=success produces the partition dt=2026-04-04 on the warehouse, the artefact that dashboards and humans consume; a green DAG does not mean the artefact is correct, so measure the artefact.]
Illustrative — the unit of observation moves from the request (web) to the artefact (batch). Run-level metadata cannot tell you whether the artefact is correct; you must measure the artefact directly.

The four batch-observability signals

Treat batch observability as four orthogonal signals — every healthy partition must satisfy all four. Most platforms get one of them right (usually freshness, because Airflow gives you SLA misses for free) and ignore the rest. The interesting failures live in the other three.

Freshness: how old is the youngest partition? The Swiggy restaurant_funnel table is supposed to have a dt=2026-04-05 partition by 03:00 IST every day. At 09:04 the youngest partition is still dt=2026-04-04: the freshest data is roughly 30 hours old, and the 05 April partition is already six hours past its deadline against a 1-hour SLO. Freshness is the only batch signal that maps cleanly onto SRE-style SLOs: define a target ("p95 of partitions land within 1 hour of the 02:00 IST schedule"), measure actual landing time per partition, compute a 30-day error budget, alert on burn rate. The relevant SLI is (landing_time - schedule_time), the SLO is a P-quantile threshold on it, and a burn rate above 14.4× over 1h is a page.

Volume: how many rows landed? If yesterday's partition had 8.4M rows and today's has 240k, something is wrong even if the partition exists and is fresh. Volume drift is usually upstream truncation (a producer sent only one shard's worth of data because a Kafka consumer rebalanced mid-write), schema mismatch (a join silently dropped 95% of rows because a key column got renamed), or filter regression (someone added WHERE status='active' to the staging query and now historical inactive rows are excluded). The detector: rolling 7-day median row count per partition, alert on > 30% relative deviation. Use the median, not the mean — a single anomalous day biases the mean for a week.

Schema: did the columns match the contract? The downstream dashboard reads funnel_stage_id (int) and funnel_stage_name (string). The producer's PR last week renamed the integer to stage_id while leaving the string column unchanged. The dashboard SQL still parses (the column it reads exists), but the joins to the stages dimension table now produce nulls. Schema observability catches this by snapshotting the partition's schema (column names, types, nullability) and diffing against the previous run's snapshot. Any add / drop / rename / type change is a contract event that should fire — the on-call decides whether it is intentional, but the system never lets it pass silently.

Distribution: did the values look like they did yesterday? Even with the right number of rows and the right schema, the contents can be wrong. The country_code column suddenly has 12% nulls (yesterday: 0.04%). The event_timestamp column has a chunk of values dated 1970-01-01 (Unix epoch — a serialiser bug fed 0 instead of now()). The revenue_paise column has a long tail of 10× values because a unit got silently changed from paise to rupees. Distribution drift is the same problem as input-distribution drift in /wiki/model-drift-and-data-drift — KS test on numeric columns, chi-squared on categoricals, null-rate ratio on every column. It catches the failures the other three signals cannot.
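
The runnable monitor later in this article implements the KS test and the null-rate ratio but not the chi-squared check, so here is a minimal sketch of that one on its own, with made-up category counts and an illustrative threshold:

# chi_squared_drift.py — minimal sketch of a chi-squared drift check on a
# categorical column; data and threshold are illustrative.
import pandas as pd
from scipy import stats

def categorical_drift(prev: pd.Series, cur: pd.Series, p_min: float = 0.001) -> dict:
    # Contingency table: one row of category counts for yesterday, one for today.
    cats = sorted(set(prev.dropna()) | set(cur.dropna()))
    table = [
        [int((prev == c).sum()) for c in cats],
        [int((cur == c).sum()) for c in cats],
    ]
    chi2, pvalue, _, _ = stats.chi2_contingency(table)
    return {'signal': 'distribution', 'column': prev.name,
            'chi2': round(float(chi2), 1), 'p': float(pvalue),
            'breached': pvalue < p_min}

# Example: the 'IN' share collapsing from 91% to 60% should breach.
prev = pd.Series(['IN'] * 9100 + ['US'] * 500 + ['SG'] * 400, name='country_code')
cur  = pd.Series(['IN'] * 6000 + ['US'] * 2500 + ['SG'] * 1500, name='country_code')
print(categorical_drift(prev, cur))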

Why all four are needed: each signal catches a kind of failure the others cannot. Freshness alone fires when the partition is late but says nothing about what is in it; volume alone catches truncation but not value corruption; schema alone catches structural changes but not silent value changes; distribution alone catches value changes but not missing rows or a missing partition. Production batch failures distribute across all four categories roughly evenly, so a platform that monitors only one or two will silently lose three-quarters of its incidents to "everything looks green".

A working batch monitor — runnable

The smallest end-to-end monitor: a synthetic warehouse table with five days of partitions, freshness/volume/schema/distribution checks computed per partition, alerts emitted when any signal trips. Save as batch_monitor.py and run.

# batch_monitor.py — the four signals, computed per partition.
# pip install pandas numpy scipy
import numpy as np
import pandas as pd
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from scipy import stats

rng = np.random.default_rng(7)

@dataclass
class Partition:
    dt: str                       # e.g. '2026-04-05'
    landed_at: datetime           # actual landing time (UTC)
    rows: pd.DataFrame
    schema: dict[str, str] = field(default_factory=dict)

def make_partition(dt: str, landed_at: datetime, n: int,
                   nulls_country: float = 0.0004,
                   revenue_unit: str = 'paise') -> Partition:
    df = pd.DataFrame({
        'order_id': np.arange(n),
        'country_code': np.where(rng.random(n) < nulls_country, None,
                                  rng.choice(['IN','US','SG','AE'], n, p=[0.91,0.05,0.03,0.01])),
        'revenue_paise': (rng.lognormal(6.4, 0.9, n) *
                          (100 if revenue_unit == 'rupees' else 1)).astype(int),
        # identical values to subtracting a per-row timedelta, but vectorised so the
        # 8.4M-row partitions build in seconds rather than minutes
        'event_ts': int(landed_at.timestamp()) - rng.integers(0, 86400, n),
    })
    return Partition(dt=dt, landed_at=landed_at, rows=df,
                     schema={c: str(df[c].dtype) for c in df.columns})

# Five days of partitions: four healthy, one broken in three different ways.
midnight = lambda d: datetime(2026, 4, d, 18, 30, tzinfo=timezone.utc)  # 18:30 UTC = 00:00 IST, end of day d
hist = [make_partition(f'2026-04-0{d}', midnight(d) + timedelta(hours=2, minutes=14), 8_400_000)
        for d in (1, 2, 3, 4)]
# Today: late by 6h, 3% of usual volume, revenue silently in rupees not paise, country_code 12% null.
today = make_partition('2026-04-05', midnight(5) + timedelta(hours=8, minutes=14),
                       240_000, nulls_country=0.12, revenue_unit='rupees')

def freshness_check(p: Partition, schedule_ist_hour: int = 0, slo_hours: float = 3.0) -> dict:
    schedule = datetime.strptime(p.dt, '%Y-%m-%d').replace(
        tzinfo=timezone.utc) + timedelta(hours=schedule_ist_hour - 5.5 + 24)
    delta_h = (p.landed_at - schedule).total_seconds() / 3600
    return {'signal': 'freshness', 'delta_hours': round(delta_h, 2),
            'slo_hours': slo_hours, 'breached': delta_h > slo_hours}

def volume_check(p: Partition, history: list[Partition], tol: float = 0.30) -> dict:
    hist_rows = np.array([len(h.rows) for h in history])
    median = float(np.median(hist_rows))
    rel = abs(len(p.rows) - median) / max(median, 1)
    return {'signal': 'volume', 'rows': len(p.rows), 'hist_median': int(median),
            'rel_deviation': round(rel, 3), 'breached': rel > tol}

def schema_check(p: Partition, prev: Partition) -> dict:
    diffs = {c: (prev.schema.get(c, 'MISSING'), p.schema.get(c, 'MISSING'))
             for c in set(prev.schema) | set(p.schema)
             if prev.schema.get(c) != p.schema.get(c)}
    return {'signal': 'schema', 'diffs': diffs, 'breached': bool(diffs)}

def distribution_check(p: Partition, prev: Partition,
                        null_rate_max_ratio: float = 5.0,
                        ks_p_min: float = 0.001) -> dict:
    breaches = []
    for col in ['country_code']:
        prev_null = prev.rows[col].isna().mean()
        cur_null  = p.rows[col].isna().mean()
        if cur_null > max(prev_null * null_rate_max_ratio, 0.01):
            breaches.append((col, 'null_rate_spike',
                             round(prev_null, 4), round(cur_null, 4)))
    for col in ['revenue_paise']:
        ks = stats.ks_2samp(prev.rows[col].sample(min(5000, len(prev.rows)), random_state=1),
                             p.rows[col].sample(min(5000, len(p.rows)), random_state=1))
        if ks.pvalue < ks_p_min:
            breaches.append((col, 'distribution_shift',
                             f'KS p={ks.pvalue:.2e}'))
    return {'signal': 'distribution', 'breaches': breaches, 'breached': bool(breaches)}

results = [
    freshness_check(today),
    volume_check(today, hist),
    schema_check(today, hist[-1]),
    distribution_check(today, hist[-1]),
]
for r in results:
    flag = 'BREACH' if r['breached'] else 'OK'
    print(f"[{flag:6}] {r}")
Sample run:
[BREACH] {'signal': 'freshness', 'delta_hours': 8.23, 'slo_hours': 3.0, 'breached': True}
[BREACH] {'signal': 'volume', 'rows': 240000, 'hist_median': 8400000, 'rel_deviation': 0.971, 'breached': True}
[OK    ] {'signal': 'schema', 'diffs': {}, 'breached': False}
[BREACH] {'signal': 'distribution', 'breaches': [('country_code', 'null_rate_spike', 0.0004, 0.1199), ('revenue_paise', 'distribution_shift', 'KS p=0.00e+00')], 'breached': True}

Read the output. Three of the four signals fired on the broken partition: freshness caught that the partition landed 8.23 hours after midnight IST against a 3-hour SLO; volume caught the 97% drop from the rolling median of the healthy history (240k vs 8.4M); distribution caught both the country_code null-rate spike (0.04% → 12%) and the revenue_paise unit change (KS p ≈ 0 — the lognormal distribution shifted by 100×). Schema did not fire because the column names and types are unchanged — the bug is in the values, not the structure. This is the value of running all four: each catches what the others miss, and a "data is healthy" gate that does not check all four will silently let three-quarters of real-world batch failures through.

Why volume uses the median against a rolling window instead of the previous run: a single anomalous day shouldn't bias the baseline. If yesterday was the broken day and you compare today only against yesterday, today's alarm never fires: a still-broken today matches the broken yesterday, and yesterday's outlier hides today's drop. The 7-day median is robust to a single bad day: even if one of the seven is broken, the median still falls on a healthy day. Production systems extend this to a 28-day rolling median plus a separately tracked stable baseline that is reviewed by hand each week.
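
A quick numerical sketch of that robustness, with made-up row counts: one corrupted day inside the window drags the mean low enough that a genuine ~38% drop today slips under the 30% threshold, while the median baseline still fires.

# median_vs_mean.py — why the volume baseline uses a rolling median.
import numpy as np

window = np.array([8.4, 8.3, 8.5, 0.24, 8.4, 8.2, 8.5]) * 1e6  # one broken day in the window
today  = 5.2e6                                                  # a genuine ~38% drop vs healthy days
tol    = 0.30

for name, baseline in [('mean', window.mean()), ('median', np.median(window))]:
    rel = abs(today - baseline) / baseline
    print(f"{name:6s} baseline={baseline:12,.0f}  rel_deviation={rel:.1%}  "
          f"{'BREACH' if rel > tol else 'missed'}")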

The dataclass-light style here matches the pattern in /wiki/model-drift-and-data-drift: Partition is the unit of measurement (one run's artefact), and every check takes a Partition plus context (history, previous run, SLO target) and returns a typed dict that the alertmanager templates into the page subject. schedule_ist_hour=0 means "this DAG is scheduled for midnight IST"; since IST is UTC+05:30, that midnight falls at 18:30 UTC, which is why the code adds 18.5 hours to the partition date. A production version replaces these stubs with pyiceberg or delta-rs for the partition listing, great_expectations or soda for the per-column checks, and the Prometheus Pushgateway for alerting integration.

Designing batch SLOs that actually fire

The single hardest part of batch observability is writing SLOs that page when something is wrong and stay quiet when nothing is. Get the SLO wrong and you either page nightly on noise (the team disables the alert within a week) or never page at all (the team finds out from the marketing dashboard).

The right shape is freshness as the primary SLO, the other three as gates. Freshness has a natural rate (one partition per schedule period), a natural unit (hours late), and a natural budget (you can absorb N late partitions per quarter before the consumer's promise is broken). Volume, schema, and distribution are not naturally rate-shaped — they are per-partition pass/fail checks — and shoehorning them into a burn-rate SLO produces alerts that flap on benign weekly seasonality.

For the freshness SLO, the multi-window-multi-burn-rate formulation from /wiki/multi-window-burn-rate carries over verbatim. Define the SLO as "99% of partitions land within 1 hour of schedule over a 30-day window" — that is a 1% error budget over 30 partitions, i.e. 0.3 late partitions per month. A burn rate of 14.4 sustained for 1h consumes 14.4 / 720 ≈ 2% of the monthly budget in that hour; a 1h+5m page trigger fires when the current run is already 30+ minutes late, which is the right time to wake someone up.
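
Spelled out as a small sketch (same numbers as above; the constant and function names are illustrative):

# freshness_burn_rate.py — budget arithmetic for the freshness SLO above.
# SLO: 99% of partitions land within 1h of schedule over a 30-day window.
SLO_TARGET   = 0.99
WINDOW_HOURS = 30 * 24                                  # 720h
PARTITIONS   = 30                                       # one per day

budget_partitions = (1 - SLO_TARGET) * PARTITIONS       # 0.3 late partitions / month

def budget_consumed(burn_rate: float, hours: float) -> float:
    """Fraction of the 30-day error budget consumed at this burn rate for this long."""
    return burn_rate * hours / WINDOW_HOURS

print(f"budget            : {budget_partitions:.1f} late partitions / month")
print(f"page  (14.4x, 1h) : {budget_consumed(14.4, 1):.1%} of budget in one hour")
print(f"warn  ( 6.0x, 6h) : {budget_consumed(6.0, 6):.1%} of budget over six hours")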

Volume, schema, and distribution checks should fire as categorical alarms tied to the partition — they are not SLOs, they are gates. The alert payload says partition=2026-04-05, signal=volume, rel_deviation=0.97, expected_median=8.4M, observed=240k and the on-call decides whether to acknowledge (an expected upstream change) or escalate. Treating these as boolean per-partition checks keeps the page text precise and avoids the trap of "the volume SLO is at 87% — what does that mean?". Pass/fail per partition is a 1 or 0; SLOs are statistical aggregates over many of those.
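
As a sketch, the gate is just a fold over the check-result dicts emitted by the monitor above; the payload keys and the BLOCK/ALLOW status are illustrative, not any particular alertmanager's schema.

# partition_gate.py — turn per-partition check results into a categorical alert
# payload plus a block/allow decision for downstream consumers.
def gate(partition_dt: str, results: list[dict]) -> dict:
    failures = [r for r in results if r['breached']]
    return {
        'partition': partition_dt,
        'status': 'BLOCK' if failures else 'ALLOW',
        'failed_signals': [r['signal'] for r in failures],
        # one human-readable line per failure, for the page subject
        'summary': '; '.join(
            f"{r['signal']}: " + ', '.join(
                f"{k}={v}" for k, v in r.items() if k not in ('signal', 'breached'))
            for r in failures),
    }

# Example with two of the monitor's result dicts:
checks = [
    {'signal': 'volume', 'rows': 240_000, 'hist_median': 8_400_000,
     'rel_deviation': 0.971, 'breached': True},
    {'signal': 'schema', 'diffs': {}, 'breached': False},
]
print(gate('2026-04-05', checks))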

[Figure: Freshness SLO with multi-window burn rate, gated on volume/schema/distribution checks. Left panel, "Freshness over 30 days" (target: p99 land within 1h of schedule): a timeline of daily partition landings against the 1h SLO line, with two late landings (3.5h late, and 6.8h late which pages) and a budget bar showing 78% remaining; page = burn rate > 14.4 over 1h sustained 5m, warn = burn rate > 6 over 6h sustained 30m. Right panel, "Today's partition gates" for dt=2026-04-05: freshness FAIL (8.2h late), volume FAIL (−97%), schema PASS (no diff), distribution FAIL (null + KS); any FAIL blocks downstream consumers, and a freshness FAIL also burns SLO budget.]
Illustrative — the freshness SLO is rate-shaped and supports a multi-window burn-rate alert; the other three signals are categorical pass/fail gates per partition. Together they form the partition-quality contract that downstream consumers depend on.

Lineage and the cascading-failure problem

A batch pipeline is rarely one DAG. The Swiggy restaurant_funnel table is built from orders_fact, which is built from payments_fact and restaurants_dim, which are themselves derived from raw event streams. When restaurant_funnel looks wrong at 09:04, the bug is almost never in restaurant_funnel itself — it is upstream, and you must walk the lineage graph to find it.

This is exactly the discipline from /wiki/lineage-aware-alerting, specialised to batch. Three rules carry over and one is new:

  1. Suppress downstream alarms when an upstream gate has failed. If payments_fact failed its volume gate at 03:14, every downstream table built from it is going to fail too. Page the producer's on-call once, suppress the cascade. A platform that pages every leaf table on every upstream incident burns its on-call rotation in a quarter.

  2. Rank pages by lineage depth. When multiple incidents fire simultaneously, page on the highest upstream failure first. The downstream pages are downstream symptoms; fixing the root usually clears them all.

  3. Edge weights from co-failure history. Tables that consistently fail together (because they share an upstream) should be visually grouped on the lineage graph. The on-call learns to read the graph as a topology of failure modes, not just a static dependency.

  4. (Batch-specific) Track the partition through the lineage, not just the table. If restaurant_funnel dt=2026-04-05 failed, the question is not "which upstream table is broken" but "which upstream partition is broken or missing". Modern lineage-aware platforms (DataHub with partition-aware lineage, OpenLineage with RunFacet partition keys) thread the partition identifier through every transform; a 09:04 incident becomes "follow the dt=2026-04-05 partition backwards through the DAG until you find a stage that did not produce it or produced it wrong".

The OpenLineage project formalises this: every job emits a RunEvent with input datasets, output datasets, and partition facets, allowing the alertmanager to query "for output partition X, what is the chain of (input dataset, input partition, run, status) that produced it?". The on-call sees the chain in the alert subject line and walks it directly.
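
A minimal sketch of the suppression-and-walk logic such a query enables; the edge map and gate statuses below are hard-coded stand-ins for what a lineage backend such as Marquez would return for the dt=2026-04-05 partition.

# lineage_walk.py — page the deepest failing upstream, suppress downstream symptoms.
UPSTREAMS = {                                  # table -> the tables it is built from
    'restaurant_funnel': ['orders_fact'],
    'orders_fact':       ['payments_fact', 'restaurants_dim'],
    'payments_fact':     [],
    'restaurants_dim':   [],
}
GATE_STATUS = {                                # per-table gate result for dt=2026-04-05
    'restaurant_funnel': 'FAIL',               # symptom
    'orders_fact':       'FAIL',               # symptom
    'payments_fact':     'FAIL',               # root: volume gate tripped at 03:14
    'restaurants_dim':   'PASS',
}

def failing_roots(table: str) -> list[str]:
    """Deepest failing ancestors of `table` — page these, suppress everything else."""
    failing_parents = [u for u in UPSTREAMS[table] if GATE_STATUS[u] == 'FAIL']
    if not failing_parents:                    # nothing upstream failed: this is a root
        return [table] if GATE_STATUS[table] == 'FAIL' else []
    roots: list[str] = []
    for parent in failing_parents:
        roots.extend(failing_roots(parent))
    return sorted(set(roots))

print(failing_roots('restaurant_funnel'))      # ['payments_fact'] — page its owner once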

Common confusions

  • "Airflow state=success means the partition is good." It means the Python code raised no exception. The partition could be empty, schema-changed, value-corrupted, or just delayed past its SLO — none of which Airflow knows about. Airflow's success signal is a necessary but radically insufficient condition for partition health.
  • "Data quality and observability are different teams." They are the same discipline. The four signals in this article (freshness, volume, schema, distribution) are exactly the signals data-quality frameworks like Great Expectations and Soda emit, and exactly the signals an observability platform should ingest as SLIs. Treating them as separate is how you end up with a Soda dashboard that no one looks at and a Grafana that doesn't know about partition correctness.
  • "Schema-on-read formats like Parquet protect us from schema drift." They protect the reader from a deserialisation crash; they do not protect the consumer from a column being silently renamed or a type being silently widened. The reader gets a column it does not know about (or does not get a column it expected) — the SQL still parses, the dashboard still renders, and the numbers are wrong. Schema observability is necessary even on Parquet, Iceberg, or Delta.
  • "We can detect this from query latency." The query that reads the broken partition runs in the same time it always does. The wrongness is in the result set, not the query plan. Query observability tells you the database is healthy; partition observability tells you the data is right. Conflating them is how Hotstar's analytics team once spent four hours during an IPL final convinced their warehouse was slow when in fact it was returning yesterday's funnel because the partition pointer had not advanced.
  • "Backfills don't need observability — they're one-off." Backfills are exactly when the four signals matter most, because a backfill is rewriting partitions the rest of the pipeline already trusts. A botched backfill silently overwrites correct data with wrong data, and the absence of monitoring on the backfill run means the corruption is detected only when a downstream consumer notices the dashboard moved. Run the same gates on backfills as on scheduled runs; refuse to commit if any gate fails.
  • "Distribution drift on batch is the same as model drift." They are mechanically similar (KS / chi-squared / null-rate ratios) but operationally different. Model drift assumes you have a model to retrain; batch distribution drift only tells you that the data changed — the response could be retraining a downstream model, fixing an upstream producer, escalating to the source-system team, or accepting it as a planned change. The batch gate fires on the fact of a distribution change; the response is decided by the partition's owner.

Going deeper

OpenLineage and the partition facet

The open-source OpenLineage project (now an LF AI & Data foundation project) defines a JSON schema for emitting lineage events from arbitrary batch frameworks (Spark, Airflow, dbt, Flink). Every RunEvent carries inputs, outputs, and a set of facets — extensibility hooks for run-specific metadata. The PartitionFacet extension threads the partition identifier through the lineage graph: when restaurant_funnel.dt=2026-04-05 is produced, the run event records that it consumed orders_fact.dt=2026-04-05 and payments_fact.dt=2026-04-05. The downstream alerting layer can then issue lineage queries scoped to a specific partition, not just a table — turning "what produced today's restaurant_funnel?" from a manual graph walk into a single API call. The Marquez backend implements this fully; the dbt-OpenLineage and Airflow-OpenLineage providers emit the events.
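
For orientation, a run-complete event might look roughly like the sketch below. The top-level shape (eventType, eventTime, run, job, inputs, outputs) follows the OpenLineage spec; the facet key carrying the partition identifier, the producer URI, and the run id are illustrative here, not core facet names.

# A sketch of an OpenLineage-style run event for the restaurant_funnel build.
run_event = {
    "eventType": "COMPLETE",
    "eventTime": "2026-04-05T21:01:00Z",
    "producer":  "https://example.internal/airflow-openlineage",       # hypothetical URI
    "run":  {"runId": "f7a9c2e4-0000-4000-8000-000000000405"},         # illustrative id
    "job":  {"namespace": "warehouse", "name": "restaurant_funnel_daily"},
    "inputs": [
        {"namespace": "warehouse", "name": "orders_fact",
         "facets": {"partition": {"dt": "2026-04-05"}}},               # illustrative facet key
        {"namespace": "warehouse", "name": "payments_fact",
         "facets": {"partition": {"dt": "2026-04-05"}}},
    ],
    "outputs": [
        {"namespace": "warehouse", "name": "restaurant_funnel",
         "facets": {"partition": {"dt": "2026-04-05"}}},
    ],
}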

Why backfills break observability assumptions

Most batch monitors assume a single producer per partition: the scheduled DAG run for a given dt. Backfills violate this — a backfill DAG can rewrite hundreds of partitions in a single run, often outside the normal schedule, often with different code than the scheduled run used originally. If the monitor detects "partition dt=2026-03-14 was rewritten at 14:00 on 2026-04-05", what is the freshness SLO for that event? What is the historical baseline for the volume check (the original 2026-03-14 row count, or the row count distribution across all 2026-03 partitions)? Production teams handle this by tagging backfill runs with a run_kind=backfill facet in OpenLineage, suppressing the freshness SLO for backfill-produced partitions, and using a dedicated baseline window (the original partition's pre-backfill snapshot, stored separately) for the volume and distribution checks.
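
A sketch of that dispatch; run_kind, the snapshot store, and the numbers are stand-ins for whatever the platform actually records.

# backfill_baselines.py — pick the baseline and SLO treatment per run kind.
from typing import Optional

PRE_BACKFILL_SNAPSHOTS = {            # partition -> row count captured before the rewrite
    '2026-03-14': 8_310_000,          # illustrative
}

def volume_baseline(dt: str, run_kind: str, rolling_median: float) -> Optional[float]:
    if run_kind == 'scheduled':
        return rolling_median                     # normal case: 7/28-day rolling median
    if run_kind == 'backfill':
        return PRE_BACKFILL_SNAPSHOTS.get(dt)     # compare against what was there before
    return None

def freshness_applies(run_kind: str) -> bool:
    return run_kind == 'scheduled'                # backfills land off-schedule by design

print(volume_baseline('2026-03-14', 'backfill', rolling_median=8_400_000))
print(freshness_applies('backfill'))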

Coordinated omission for batch

Coordinated omission, the latency-measurement bug from /wiki/coordinated-omission, has a batch analogue: if your monitoring only inspects partitions that exist, you systematically miss partitions that were never produced. The Swiggy incident at 09:04 is a partial example — the monitor checked the youngest existing partition (still dt=2026-04-04), saw it was healthy by all four signals, and reported nothing wrong. The signal "the partition for dt=2026-04-05 is missing entirely" is not a check on a partition; it is a check on the absence of one. Production batch monitors must enumerate the expected set of partitions (driven by the schedule) and alarm on absence, not just on presence with anomalies. This is the batch version of the Prometheus absent() function — and it is the single most-forgotten check in early-stage data platforms.
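
A minimal sketch of the absence check for a daily schedule; the hard-coded partition listing stands in for a real catalogue or warehouse query.

# absent_partitions.py — alarm on partitions that were never produced, not just on
# anomalies in partitions that exist.
from datetime import date, timedelta

def expected_partitions(start: date, today: date) -> set[str]:
    """Every dt the daily schedule should have produced, up to and including today."""
    days = (today - start).days + 1
    return {(start + timedelta(days=i)).isoformat() for i in range(days)}

existing = {'2026-04-01', '2026-04-02', '2026-04-03', '2026-04-04'}   # what actually landed
missing = sorted(expected_partitions(date(2026, 4, 1), date(2026, 4, 5)) - existing)
if missing:
    print(f"[BREACH] missing partitions (the batch absent() check): {missing}")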

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install pandas numpy scipy
python3 batch_monitor.py
# Expected: three of four signals breach on the synthetic broken partition —
# freshness (8.23h late vs 3h SLO), volume (97% drop vs 7-day median),
# distribution (country_code null spike + revenue_paise KS p≈0).
# Schema passes because column names and types are unchanged.
# Tweak the today partition's parameters (revenue_unit='paise', n=8_400_000,
# nulls_country=0.0004) to recover green; flip one parameter at a time to
# isolate which signal each kind of corruption surfaces in.

Where this leads next

Batch observability is one half of the data-pipeline observability story; /wiki/observability-for-stream-processors is the other. Stream observability shifts the unit of measurement again — from a partition to a watermark, a topic offset, a consumer-group lag. The four signals reshape: freshness becomes lag, volume becomes throughput, schema becomes contract version, distribution becomes per-window stats. The discipline is the same; the temporal unit is sub-second instead of daily.

The connection back to the rest of Build 15 is direct: data-quality SLOs from /wiki/data-quality-metrics-as-slos are the contracts that batch gates enforce; lineage-aware alerting from /wiki/lineage-aware-alerting is how the cascade gets suppressed; model-drift detection from /wiki/model-drift-and-data-drift is what runs on top of a healthy partition once it lands. The chapter after stream — /wiki/wall-all-this-costs-a-fortune-tame-the-bill — closes Build 15 by counting the cost of running all of this. Distribution checks on every column of every partition are not free, and the bill scales with cardinality just as relentlessly as the metrics-side cardinality budget from /wiki/why-high-cardinality-labels-break-tsdbs.

By the time Aditi ships the four-signal monitor on restaurant_funnel, the 09:04 IST incident becomes a page at 02:18: the freshness alert starts burning 14 minutes into the 02:00 run window because the upstream orders_fact is already hours late (it will finish 6h past schedule, far outside the 3h SLO); the volume gate trips at 02:31 when the partition lands at 3% of expected; and the on-call walks the lineage from restaurant_funnel to orders_fact to a misconfigured Kafka connector that started misbehaving at 02:08. The dashboard is wrong by 02:35; the team knows by 02:31.

References