On-call for data: alerts that matter
Asha at Cred opens her phone at 02:38 IST. The page reads pipeline_failed: rewards_daily_v3 — task RewardsAggregator failed. She squints, opens the Airflow UI on her laptop, scrolls, and finds the failure: a transient Snowflake credit error the system already retried three times and recovered from. She acks the page, goes back to bed at 03:14, gets paged again at 04:02 for a different pipeline, and at 06:30 for a third. By Monday morning's standup she has slept four hours, three of the seven pages were real, and the team's quarterly burnout survey is two weeks away. The mistake is not Asha's pager discipline — it is the alerting design that fires on task_failed instead of on business_outcome_breached.
Good data on-call alerting fires only on outcomes a human must act on tonight — SLA breach risk, freshness violation, contract break — and stays silent on internal mechanics like task retries, transient errors, or warehouse blips. The discipline is symptom-based alerting (page on what hurts the consumer), severity bands tied to action, runbook links on every alert, and a weekly review that prunes alerts more aggressively than it adds them.
What "alerts that matter" actually means
The first instinct of a junior data team is to alert on everything that could be wrong: every task failure, every retry, every row-count anomaly, every schema mismatch. Six months in, the on-call channel has 200 messages a day, the team has installed Slack mute schedules, and the page that finally matters at 03:00 IST is buried beneath nineteen INFO-level dashboard refreshes. The fix is not better dashboards; it is fewer alerts, attached to outcomes.
The principle borrowed from SRE — symptom-based alerting — is the binding constraint. Alert on what the consumer experiences (the gst_filing_dashboard is stale, the merchant_payouts table missed its 06:00 freshness SLA, the feature_store/risk_score lookup is returning 30% nulls). Do not alert on the internal mechanism (a Snowflake warehouse auto-resumed slowly, an Airflow task retried twice, a Kafka consumer lagged for 14 seconds). The mechanism is what the alert investigation uncovers; the symptom is what triggers the page.
The second rule is severity tied to action. A "P1" page must mean: someone is awake right now, fixing this, and the consumer feels the impact. A "P2" alert means: deal with it tomorrow morning. A "P3" notice means: aggregate across the week, look at trends. If every alert is P1, none are. The split between P1 and P2 is the single biggest lever a data team has on on-call quality, and most teams get it wrong by tagging too many things as P1 because nobody had the political capital to argue for downgrading.
The third rule is every alert ships with a runbook link. If the on-call engineer at 02:30 IST has to start with "what does this alert even mean?", you have shipped an unfinished alert. The Slack message — or PagerDuty incident, or Opsgenie alert — must include: what broke, the affected SLA, the immediate first step ("check https://airflow.cred.in/dag/rewards_daily_v3 for the most recent run"), the rollback command if applicable, the escalation path. The body of this chapter is largely about how to ship that runbook link automatically.
Why outcome-based alerting beats mechanism-based alerting in practice: a single business outcome (the merchant payouts dashboard is stale) can have 30 different mechanism causes (warehouse credit limit, S3 throttling, schema drift, expired secret, network partition, upstream Kafka lag, etc.). Alerting on each mechanism gives you 30 alerts when one outcome breaks. Alerting on the outcome gives you 1 alert that fires regardless of cause — and the investigation discovers which mechanism failed. The first style scales as O(mechanisms × pipelines). The second scales as O(SLAs).
The four alert classes you actually need
Most data teams need exactly four kinds of alerts. More than that and the on-call surface area gets unmanageable; fewer than that and you start missing real incidents. The four are:
1. Freshness alerts. "Did the table update by the time the SLA promised?" These are the bread and butter. The check is straightforward: every X minutes, look at MAX(updated_at) (or the equivalent partition coverage check) and compare against the table's freshness SLA. If now - max_updated > sla, page. Most data warehouses (Snowflake, BigQuery, Databricks) have native freshness check primitives now; dbt has freshness: in sources.yml; Soda and Great Expectations both support these directly. The trap is alerting on every table with a freshness SLA — instead, alert only on tables that have consumers tracked in the lineage graph. A table nobody reads doesn't deserve a 02:30 page.
2. Volume / row-count anomaly alerts. "Did approximately the right amount of data land?" A table that normally gets 50 lakh rows/day but suddenly lands 8 lakh is broken upstream even if MAX(updated_at) looks fine. The naive implementation alerts on a fixed threshold ("less than 40 lakh = page"); the better implementation uses a baseline (rolling 7-day median ± a band) and a seasonality adjustment (Mondays are different from Sundays, festival weeks are different from regular weeks); a minimal baseline sketch follows this list. Volume alerts are a P2 by default — they catch real incidents but rarely need a 02:30 response unless the missing data feeds a freshness-critical downstream.
3. Contract / schema alerts. "Did the producer change the shape without telling us?" These fire when the source table or upstream API breaks the contract: a column was dropped, a type changed from int to string, a previously-non-null column started having nulls, an enum gained a new value. Tools: dbt's contract enforcement, Great Expectations, Soda's checks, internal contract registries. The right severity is usually P2 unless the contract break causes a freshness breach (in which case the freshness alert fires too — and that's the page).
4. Business-outcome alerts. "Is the thing the consumer cares about visibly broken?" The risk_score feature is returning >5% nulls. The payment_success_rate metric is reading 0 because of a divide-by-zero. The live_users count on the homepage is showing -1. These are downstream of the data pipeline but specific to a business outcome a human needs to fix tonight. They fire P1 because they map directly to revenue or compliance.
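The baseline-with-seasonality idea from class 2 is small enough to sketch. This is a minimal, hypothetical version: it assumes you already have a history of (date, row_count) pairs per table (from the warehouse's metadata views or an observability store) and compares today's count against the median of the same weekday over the previous few weeks.
# volume_baseline_sketch.py: per-weekday rolling median, hypothetical inputs
from datetime import date, timedelta
from statistics import median
def weekday_baseline(history, target_day, lookback_weeks=4):
    """Median of the same weekday over the last N weeks (Mondays compared with Mondays)."""
    same_weekday = [count for d, count in history
                    if d.weekday() == target_day.weekday()
                    and target_day - timedelta(weeks=lookback_weeks) <= d < target_day]
    return median(same_weekday) if same_weekday else None
def volume_anomaly(history, target_day, actual, band=0.5):
    """Return an alert payload when today's count falls below band * weekday baseline."""
    baseline = weekday_baseline(history, target_day)
    if baseline and actual < band * baseline:
        return {"class": "volume", "severity": "P2",
                "msg": f"{actual:,} rows vs weekday baseline {baseline:,.0f}"}
    return None
# usage with synthetic history: four Mondays of roughly 50 lakh rows, today lands 4.5 lakh
history = [(date(2026, 3, 30) + timedelta(weeks=w), 5_000_000 + w * 10_000) for w in range(4)]
print(volume_anomaly(history, date(2026, 4, 27), 450_000))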
Anything outside these four classes — task retries, warehouse auto-resumes, transient errors, dbt model rebuild times — goes to logs and gets reviewed weekly, not paged. Why this taxonomy survives contact with reality: every real incident in a mature data platform is a freshness breach, a volume anomaly, a contract violation, or a business outcome going wrong — and most of them surface as freshness or business-outcome alerts because those have the tightest consumer-impact coupling. The other two are early-warning systems, not pages.
A complete alerting harness, in code
The mechanism is concrete enough that the script is clearer than the prose. Below is a small alerting harness that evaluates the four alert classes against a stub data warehouse, classifies the severity, applies suppression to avoid duplicate pages, attaches a runbook URL, and routes to the right destination. In production each stub becomes a real client (Snowflake, OpenLineage, PagerDuty, Slack), but the logic stays exactly the same — and that's the point of writing it small first.
# oncall_alert_harness.py — evaluate, classify, suppress, route
from datetime import datetime, timedelta
import json, hashlib
# --- stubs (replace with real clients in production) ----------------
class WarehouseStub:
    """Returns table-state for evaluation."""
    def __init__(self):
        self.now = datetime(2026, 4, 25, 2, 30)
        self.tables = {
            "merchant_payouts": {"max_updated": self.now - timedelta(hours=4),
                                 "row_count_today": 4_50_000, "baseline": 50_00_000},
            "rewards_daily_v3": {"max_updated": self.now - timedelta(minutes=15),
                                 "row_count_today": 12_50_000, "baseline": 12_00_000},
            "risk_score_features": {"max_updated": self.now - timedelta(minutes=8),
                                    "null_pct": 0.34, "baseline_null_pct": 0.02},
        }
    def get_state(self, t): return self.tables[t]
class RouterStub:
    def __init__(self): self.sent = []
    def page(self, alert): self.sent.append(("PD", alert))
    def slack(self, alert): self.sent.append(("SLACK", alert))
    def digest(self, alert): self.sent.append(("DIGEST", alert))
# --- alert definitions (the SLA contract) ---------------------------
SLAS = {
    "merchant_payouts": {"freshness_min": 60, "consumer": "ops_dashboard", "owner": "@kiran"},
    "rewards_daily_v3": {"freshness_min": 30, "consumer": "loyalty_team", "owner": "@asha"},
    "risk_score_features": {"freshness_min": 10, "consumer": "fraud_serving", "owner": "@jishant"},
}
RUNBOOKS = "https://runbooks.cred.in/data/{slug}"
# --- evaluators (one per alert class) -------------------------------
def eval_freshness(t, state, sla, now):
    lag = (now - state["max_updated"]).total_seconds() / 60
    if lag > sla["freshness_min"]:
        return {"class": "freshness", "severity": "P1", "table": t,
                "msg": f"{t} stale by {lag:.0f} min (SLA {sla['freshness_min']})",
                "consumer": sla["consumer"], "owner": sla["owner"]}
    return None
def eval_volume(t, state):
    if "row_count_today" not in state: return None
    actual, baseline = state["row_count_today"], state["baseline"]
    if actual < 0.5 * baseline:
        return {"class": "volume", "severity": "P2", "table": t,
                "msg": f"{t} got {actual:,} rows vs baseline {baseline:,} ({100*actual/baseline:.0f}%)"}
    return None
def eval_outcome(t, state):
    if "null_pct" not in state: return None
    if state["null_pct"] > 5 * state["baseline_null_pct"]:
        return {"class": "outcome", "severity": "P1", "table": t,
                "msg": f"{t} null rate {state['null_pct']:.0%} vs baseline {state['baseline_null_pct']:.0%}"}
    return None
# --- suppression: dedupe identical alerts within 30 min -------------
SEEN = {}
def fingerprint(a): return hashlib.md5(f"{a['class']}|{a['table']}".encode()).hexdigest()[:10]
def suppress(a, now):
    fp = fingerprint(a); last = SEEN.get(fp)
    if last and (now - last).total_seconds() < 1800: return True
    SEEN[fp] = now; return False
# --- attach runbook + route -----------------------------------------
def enrich(a):
    a["runbook"] = RUNBOOKS.format(slug=f"{a['class']}-{a['table']}")
    a["fired_at"] = datetime(2026, 4, 25, 2, 30).isoformat() + "Z"
    return a
def route(a, r):
    {"P1": r.page, "P2": r.slack, "P3": r.digest}[a["severity"]](a)
# --- run ------------------------------------------------------------
wh, router = WarehouseStub(), RouterStub()
now = wh.now
for tbl in SLAS:
    state = wh.get_state(tbl); sla = SLAS[tbl]
    for evl in (eval_freshness(tbl, state, sla, now), eval_volume(tbl, state), eval_outcome(tbl, state)):
        if evl and not suppress(evl, now):
            route(enrich(evl), router)
for kind, a in router.sent:
    print(f"[{kind}] {a['severity']} {a['class']:9s} {a['table']:25s} {a['msg']}")
    print(f" runbook: {a['runbook']}")
# Output:
[PD] P1 freshness merchant_payouts merchant_payouts stale by 240 min (SLA 60)
runbook: https://runbooks.cred.in/data/freshness-merchant_payouts
[SLACK] P2 volume merchant_payouts merchant_payouts got 450,000 rows vs baseline 5,000,000 (9%)
runbook: https://runbooks.cred.in/data/volume-merchant_payouts
[PD] P1 outcome risk_score_features risk_score_features null rate 34% vs baseline 2%
runbook: https://runbooks.cred.in/data/outcome-risk_score_features
Walk through the load-bearing pieces. The SLAS dict is the SLA contract — every table that gets evaluated has a declared freshness window, a tracked downstream consumer, and an owning team. A table without an entry here is silently not evaluated, which is the design choice you want: alerting only fires on contracts you've explicitly agreed to.
eval_freshness is the simplest and most common alert class. It compares max_updated against the SLA window and returns a structured alert payload. Why the evaluator returns a dict instead of just calling pager.fire() directly: separating evaluation from routing lets you unit-test the evaluators without mocking the router, lets the suppression layer dedupe before paging, and lets the enrichment layer attach the runbook URL in one place. Mixing them is a smell that bites you in week 4. eval_volume is deliberately a simple ratio against baseline; the production version uses a rolling 7-day median ± 2σ with seasonality adjustments, but the structure is identical. eval_outcome catches null-rate spikes, the most common business-outcome alert, because they correlate strongly with feature-store breakage that causes user-visible product issues (a fraud-detection model returning low confidence because risk_score is null).
suppress is the deduplication layer. Without it, a freshness breach that lasts three hours and gets re-evaluated every 5 minutes pages 36 times. With dedup keyed on (class, table) and a 30-minute TTL, the on-call engineer gets one page, fixes it, and doesn't get re-paged unless the breach persists into a new dedup window. enrich attaches the runbook URL — every alert that lands in PagerDuty includes a clickable link to a per-class-per-table runbook page. The URL doesn't have to resolve to a unique page (most data teams have one runbook per alert class with table-specific sections), but the URL pattern must be predictable enough that on-call clicks it from muscle memory.
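The harness implements three of the four classes; the contract/schema check follows exactly the same evaluator shape. A minimal sketch, assuming a hand-maintained expected-schema registry (in production the expected schema would come from dbt contracts, a schema registry, or the catalog):
# contract_eval_sketch.py: schema drift check in the same evaluator shape
EXPECTED_SCHEMAS = {  # hypothetical registry; production reads dbt contracts or a catalog
    "merchant_payouts": {"merchant_id": "string", "amount_paise": "int", "paid_at": "timestamp"},
}
def eval_contract(table, observed_schema):
    """Return a P2 alert when columns are dropped or types drift from the contract."""
    expected = EXPECTED_SCHEMAS.get(table)
    if not expected:
        return None  # no contract declared; same opt-in rule as SLAS
    dropped = [c for c in expected if c not in observed_schema]
    retyped = [c for c in expected
               if c in observed_schema and observed_schema[c] != expected[c]]
    if dropped or retyped:
        return {"class": "contract", "severity": "P2", "table": table,
                "msg": f"{table} contract break: dropped {dropped}, retyped {retyped}"}
    return None
# usage: upstream silently retyped amount_paise to string and dropped paid_at
print(eval_contract("merchant_payouts",
                    {"merchant_id": "string", "amount_paise": "string"}))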
In production at Cred, Razorpay, or Zerodha this harness has a few extras the stub omits: persistent state for the suppression cache (Redis, not an in-memory dict, so it survives the harness restarting), backoff escalation (if P1 isn't acknowledged within 10 minutes, page the secondary; if still not within 20, page the eng manager), and integration with the lineage graph so a freshness alert on a parent table doesn't redundantly fire freshness alerts on every child table downstream — see column-level-lineage-why-its-hard-and-why-it-matters for the lineage primitive that makes this possible.
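A sketch of the lineage-aware piece, under the assumption that the lineage graph is available as a simple parents-of mapping (OpenLineage or the catalog would supply it in production): a child's freshness alert is suppressed while any upstream ancestor already has an active freshness alert.
# lineage_dedupe_sketch.py: suppress child freshness alerts when a parent already fired
PARENTS = {  # hypothetical lineage edges; production reads OpenLineage / the catalog
    "merchant_payouts": ["raw_payments"],
    "payouts_daily_mart": ["merchant_payouts"],
}
ACTIVE_FRESHNESS_ALERTS = {"raw_payments"}  # tables with an unresolved freshness page
def upstream_already_paged(table, active=ACTIVE_FRESHNESS_ALERTS, parents=PARENTS):
    """Walk ancestors; True if any ancestor has an active freshness alert."""
    stack = list(parents.get(table, []))
    seen = set()
    while stack:
        p = stack.pop()
        if p in active:
            return True
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return False
print(upstream_already_paged("payouts_daily_mart"))  # True: raw_payments already paged
print(upstream_already_paged("raw_payments"))        # False: page the root cause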
The weekly alert review — the load-bearing ritual
Code that ships alerts is half the system. The other half is the ritual that prunes them. Most data teams that succeed at on-call have a 30-minute meeting on Friday — call it the "alert review" — where the on-call engineer from the past week walks through every page that fired and the team makes one of three decisions per alert: kept as-is, downgraded (P1 → P2, or P2 → digest), or deleted. The team commits to a hard rule: net-new alerts in any month ≤ alerts removed in the same month. Without this rule the alert surface only grows, and within a year the on-call experience deteriorates back to where it started.
The agenda for the review is fixed. For each alert that fired this week: did it require human action? Was the severity correct? Was the runbook link useful? Did the on-call engineer find the right context to fix it within 5 minutes? If any answer is "no" — file a PR before leaving the meeting. The PR usually deletes the alert (most common), occasionally downgrades it, and rarely (genuinely rarely) adjusts a threshold. Adding new alerts is allowed but counter-balanced: every new alert added in the review must either replace an existing one or be paid for by deleting two others.
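If alert definitions live in a repo (dbt YAML, Soda checks, Terraform for PagerDuty), the net-new rule can be enforced mechanically rather than by memory. A sketch, where the added and removed sets are assumed to come from diffing the month's alert-definition files; the names are illustrative:
# alert_budget_check.py: fail CI when the month adds more alerts than it removes
def check_alert_budget(added, removed):
    """Hard rule from the weekly review: net-new alerts per month must not exceed removals."""
    if len(added) > len(removed):
        raise SystemExit(
            f"alert budget exceeded: {len(added)} added vs {len(removed)} removed; "
            f"delete or downgrade before merging: {sorted(added)}")
    print(f"ok: {len(added)} added, {len(removed)} removed this month")
check_alert_budget(added={"freshness-settlements_v2"},
                   removed={"volume-internal_marketing_v3", "freshness-legacy_refunds"})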
Razorpay's data platform team has run this ritual for three years; their internal write-up reports that the team's median weekly page count went from 47 (in 2023) to 6 (in 2026), without missing any P1 incidents. The lever was not better tools — it was the discipline of the review. PhonePe's team uses a similar cadence with one extra ritual: every quarter, a "page audit" where the on-call rotation reviews the previous quarter's incidents and asks whether the alert that fired (or didn't) was the right shape. Incidents that were missed entirely are the most valuable signal — they tell you which outcome you forgot to alert on.
Alert hygiene at scale — what changes past 1,000 tables
A 50-table data platform can run on dbt's built-in freshness: and a Slack webhook. A 5,000-table platform cannot. Three things break at scale, and they need fixing in this order:
Alert ownership routing. Once you cross ~200 tables, "page the on-call data engineer" stops working — different teams own different table families and a P1 on merchant_payouts should page the payments team's data on-call, not the central platform's. The fix is owner metadata on every table (in dbt, this is meta.owner; in Iceberg/Snowflake/BigQuery it's table tags or labels) and a routing layer in the alert engine that reads owner → PagerDuty schedule. Without this, every alert wakes the same five engineers regardless of relevance, and the platform team burns out first.
Multi-tier severity. Past 1,000 tables, the binary P1/P2 split isn't enough. A freshness breach on the homepage's live_user_count (visible to every customer) is louder than a freshness breach on internal_marketing_v3 (visible to two analysts). The fix is consumer-tier metadata — every consumer (dashboard, model, downstream pipeline) gets a tier (T0 customer-facing, T1 internal-critical, T2 analytical, T3 experimental) — and the alert severity inherits from the consumer's tier, not from the table's. A T0 freshness breach pages immediately; a T3 one goes to a digest.
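A sketch of tier-inherited severity, assuming consumer tiers are recorded as catalog metadata; the consumer names and the tier-to-severity mapping here are illustrative:
# tier_severity_sketch.py: severity inherits from the loudest consumer's tier
CONSUMER_TIERS = {  # hypothetical metadata; production reads the catalog
    "live_user_count_widget": "T0", "fraud_serving": "T1",
    "internal_marketing_v3_dash": "T2", "growth_experiments": "T3",
}
TIER_TO_SEVERITY = {"T0": "P1", "T1": "P1", "T2": "P2", "T3": "P3"}
def severity_for(consumers):
    """Loudest consumer wins: a T0 dashboard makes the breach a page, a T3 one a digest."""
    tiers = sorted(CONSUMER_TIERS.get(c, "T3") for c in consumers)
    return TIER_TO_SEVERITY[tiers[0]] if tiers else "P3"
print(severity_for(["internal_marketing_v3_dash"]))                    # P2
print(severity_for(["growth_experiments", "live_user_count_widget"]))  # P1: T0 wins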
Alert silencing during scheduled work. Backfills, migrations, planned freezes — these intentionally violate freshness and volume alerts for hours at a time. The harness must support silence(table, until=t) so the on-call engineer running a backfill can suppress alerts on the affected tables without disabling the alert globally and forgetting to re-enable it. The standard pattern is a maintenance-window table that the harness checks at evaluation time; tools like Alertmanager and PagerDuty support this directly. Why silencing has to expire automatically: the most common alerting failure at scale is not a missed alert — it's an alert that was silenced for a backfill three months ago and never re-enabled. The expiry is what prevents that. If your silence has no until field, you've shipped a future P1 outage for the engineer who picks up the on-call rotation in October.
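A sketch of the maintenance-window check with the expiry made mandatory; the silence store is assumed to be a small table the harness reads at evaluation time (here an in-memory list):
# silence_sketch.py: silences must carry an `until`; expired rows are ignored automatically
from datetime import datetime, timedelta
SILENCES = [  # hypothetical maintenance-window table: (table, reason, until)
    {"table": "merchant_payouts", "reason": "backfill FY25 Q4",
     "until": datetime(2026, 4, 25, 6, 0)},
]
def silence(table, until, reason, now):
    """Register a silence; refuse open-ended mutes that someone will forget to lift."""
    if until <= now:
        raise ValueError("a silence must expire in the future; no open-ended mutes")
    SILENCES.append({"table": table, "reason": reason, "until": until})
def is_silenced(table, now):
    """True only while an unexpired silence covers the table."""
    return any(s["table"] == table and s["until"] > now for s in SILENCES)
now = datetime(2026, 4, 25, 2, 30)
print(is_silenced("merchant_payouts", now))                       # True: backfill window open
print(is_silenced("merchant_payouts", now + timedelta(hours=5)))  # False: expired, alerts resume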
Common confusions
- "Alert on every task failure — better safe than sorry." This is the source of 80% of on-call burnout. Task failures are mechanism; tasks retry; most failures self-heal. Alert only when retries are exhausted and the failure threatens an SLA. The right number of P1 pages per week is 1–2, not 40.
- "P1 means high-priority." P1 means page someone right now. If the alert can wait until 09:00 IST tomorrow, it's not P1, no matter how important it feels. The discipline is: P1 = wake up; P2 = next workday; P3 = aggregated review. Diluting P1 destroys the signal.
- "Alert thresholds should be conservative — page on small anomalies." Conservative thresholds produce noise. The right thresholds page on real incidents (volume <50% baseline, freshness >2× SLA window) and let the smaller anomalies show up in dashboards or weekly digests. The correct alert is "off by a factor", not "off by 5%".
- "Lineage isn't needed for alerting." Lineage is what tells you that a freshness breach on
raw_paymentswill cause downstream breaches on 12 marts in the next 90 minutes. Without lineage, the on-call engineer learns this by getting paged 12 more times. With lineage, the harness can deduplicate at the parent level — seecolumn-level-lineage-why-its-hard-and-why-it-matters. - "Runbooks are documentation, separate from alerts." Runbooks attached to alerts are the alert; runbooks in a wiki nobody opens are not. Every PagerDuty page that fires must include a runbook URL inline. If the runbook page is empty, the alert is not yet ready to ship.
- "You should add an alert for every metric you care about." No. You should add an alert for every metric where the consumer would notice the breach within the SLA window. Caring about a metric is necessary but not sufficient — the alert must be actionable, the action must be time-sensitive, and the runbook must be runnable in under 20 minutes.
Going deeper
Symptom-based vs cause-based alerting — the SRE adaptation
The original SRE doctrine (Beyer, Jones, Petoff, Murphy — Site Reliability Engineering, 2016) argues for symptom-based alerting in service systems: page when a user-visible symptom (latency, error rate, availability) breaches an SLO, not when an internal metric (CPU, memory, queue length) crosses a threshold. The data-engineering version of this doctrine maps almost cleanly: the user-visible symptom is the table is stale, the dashboard is wrong, the feature is null; the internal metric is task retried, warehouse slow, file system full. The mapping is not perfect because data systems have a longer feedback loop than service systems — a user might not see the freshness breach for an hour, by which time you've already missed your SLA — so the data-engineering version pages slightly earlier, on predicted SLA breach rather than realised user impact. The 2024 paper "SLOs for Data Systems" (Microsoft Research) formalises this as a "freshness budget" analogous to an error budget, with the same monthly accounting and the same conversation-with-the-business when the budget runs out. Razorpay's internal data SLO framework is built on this primitive.
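A sketch of paging on predicted breach rather than realised staleness: given the last landing time and the pipeline's typical inter-arrival gap, page when the expected next landing would already fall outside the freshness SLA. The numbers are illustrative.
# predicted_breach_sketch.py: page on projected SLA breach, not on realised staleness
from datetime import datetime, timedelta
def predicted_breach(last_update, typical_gap_min, sla_min, now):
    """Page when the expected next landing falls after the freshness SLA deadline."""
    sla_deadline = last_update + timedelta(minutes=sla_min)
    # if the usual landing time has already passed without data, project from `now`
    expected_landing = max(last_update + timedelta(minutes=typical_gap_min), now)
    return expected_landing > sla_deadline
now = datetime(2026, 4, 25, 5, 0)
# last run landed 50 min ago; runs usually arrive every 45 min; SLA is 60 min
print(predicted_breach(now - timedelta(minutes=50), 45, 60, now))  # False: next run still fits
print(predicted_breach(now - timedelta(minutes=50), 90, 60, now))  # True: page before users notice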
Multi-tenant alerting — pages routed by data domain
In a multi-tenant data platform (one warehouse, many product teams), the alerting routing problem becomes a real engineering challenge. The pattern that works: every dataset is owned by a domain; every domain has a primary on-call rotation in PagerDuty/Opsgenie; the alert engine reads owner metadata and routes accordingly. Tools like Monte Carlo, Bigeye, and Acceldata have built this routing into their products; for teams running open-source stacks, the build-it-yourself version is ~150 lines of Python that maps dataset → owner → pagerduty_schedule_id. The trap is when ownership is ambiguous — two teams both touch a table — and the alert ends up paging a generic "platform" rotation that knows nothing about the data. The fix is to mandate a single owner per dataset in the data catalog, with a clear informational_consumer list for the others. Build 16's column-level-access-and-row-level-security covers the catalog primitive that makes this enforceable. Compare this with airflow-vs-dagster-vs-prefect-the-real-design-differences — the orchestrator handles the when, the catalog handles the who. Why ownership routing is harder than service-system ownership routing: a microservice has one owning team in 99% of cases; a dataset has 1 producer team and N consumer teams, all of whom care about freshness, only one of whom should be paged. Building this correctly takes a quarter at most platforms — and skipping it means the central platform team gets paged for everyone's incidents and burns out.
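The build-it-yourself routing layer is genuinely small. A sketch, assuming one owner tag per dataset exported from the catalog and a static owner-to-schedule map; every identifier here is hypothetical:
# owner_routing_sketch.py: dataset -> owner -> PagerDuty schedule, with a safe fallback
DATASET_OWNERS = {  # hypothetical catalog export: one owning domain per dataset
    "merchant_payouts": "payments-data",
    "rewards_daily_v3": "loyalty-data",
}
SCHEDULES = {  # hypothetical PagerDuty schedule IDs per owning domain
    "payments-data": "PD_SCHED_PAY01",
    "loyalty-data": "PD_SCHED_LOY01",
    "platform": "PD_SCHED_PLAT01",  # fallback; every hit here marks an ownership gap to fix
}
def schedule_for(dataset):
    """Resolve the owning domain and its on-call schedule, defaulting to the platform rotation."""
    owner = DATASET_OWNERS.get(dataset, "platform")
    return owner, SCHEDULES[owner]
print(schedule_for("merchant_payouts"))  # ('payments-data', 'PD_SCHED_PAY01')
print(schedule_for("mystery_table_v9"))  # ('platform', 'PD_SCHED_PLAT01'): ownership gap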
Alert testing — yes, you test alerts
Alert definitions are code, and code without tests rots. The discipline at production data teams: every alert has at least one test that triggers it artificially (e.g., synthetic stale-data injection) and verifies the page fires with the right severity, the right routing, and the right runbook link. The "chaos for data" pattern — running these tests weekly in staging — catches alert regressions early. Common failures: a refactor that renames a table breaks the freshness alert silently; a schema change moves the timestamp column and the freshness query starts returning NULL (which doesn't trigger any alert); a new dashboard consumer was added without registering its SLA. Without alert tests, these surface in production at 03:00 IST, which is the worst possible time to discover a missing alert. Razorpay's data platform reportedly runs ~600 synthetic alert tests every Sunday night against staging — the cost is real but the catch rate is ~3 regressions per quarter, each of which would have been a P1 incident eventually.
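A sketch of what one such test looks like against the harness's evaluators, assuming the listing above is packaged as an importable module with its demo run moved under an if __name__ == "__main__" guard; the module and test names are illustrative:
# test_alerts_sketch.py: inject synthetic stale data and assert the page shape
from datetime import datetime, timedelta
from oncall_alert_harness import eval_freshness, enrich  # assumes the listing is importable
def test_stale_table_pages_p1_with_runbook():
    now = datetime(2026, 4, 25, 2, 30)
    state = {"max_updated": now - timedelta(hours=4)}  # synthetic stale-data injection
    sla = {"freshness_min": 60, "consumer": "ops_dashboard", "owner": "@kiran"}
    alert = enrich(eval_freshness("merchant_payouts", state, sla, now))
    assert alert["severity"] == "P1"
    assert alert["runbook"].endswith("freshness-merchant_payouts")
def test_fresh_table_stays_silent():
    now = datetime(2026, 4, 25, 2, 30)
    state = {"max_updated": now - timedelta(minutes=5)}
    sla = {"freshness_min": 60, "consumer": "ops_dashboard", "owner": "@kiran"}
    assert eval_freshness("merchant_payouts", state, sla, now) is None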
The "page that didn't fire" — the hardest alert problem
The asymmetry of alerting is brutal: a noisy alert is annoying; a missed alert is a P1 outage. The hardest part of running data on-call is detecting missing alerts — incidents that the alerting harness should have caught and didn't. Three heuristics help. First, after every incident, run a 5-minute retro: was an alert configured for this? Did it fire? If not, why not? Add the alert before closing the ticket. Second, run a quarterly page audit across the previous quarter's incidents and ask whether each was caught by alert or by user complaint — the latter is a missed alert per definition. Third, instrument the alert engine itself: track alerts_evaluated_total, alerts_fired_total, alert_evaluation_failures, and alert when those metrics behave anomalously (the meta-alerting problem). The 2025 OpenLineage RFC adds an alert_coverage event type for exactly this purpose.
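A sketch of the meta-alerting check, assuming the three counters from the prose are exported once per evaluation cycle; the thresholds are illustrative:
# meta_alerting_sketch.py: alert when the alert engine itself stops behaving
def meta_check(alerts_evaluated_total, alerts_fired_total, alert_evaluation_failures,
               expected_evaluations):
    """Return meta-alerts about the engine: a silent engine is worse than a noisy one."""
    problems = []
    if alerts_evaluated_total < 0.8 * expected_evaluations:
        problems.append("engine evaluated far fewer checks than configured; coverage dropped")
    if alert_evaluation_failures > 0.05 * max(alerts_evaluated_total, 1):
        problems.append("evaluation failure rate above 5%; checks are erroring, not passing")
    if alerts_evaluated_total > 0 and alerts_fired_total == 0:
        problems.append("zero alerts fired all cycle; plausible, but worth a manual spot-check")
    return problems
# usage: 3,000 checks configured, only 1,900 ran, 200 of those errored
print(meta_check(1_900, 4, 200, expected_evaluations=3_000))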
The handoff — what makes a clean on-call rotation
The shift change between two on-call engineers is the most common point at which alert-related context drops on the floor. The discipline that works: a 5-minute synchronous handoff at the start of every rotation (Monday 09:00 IST, every week), where the outgoing on-call walks the incoming one through any active P1/P2 incidents, any silenced alerts (and when they expire), any runbook updates from the past week, and any planned maintenance windows that will affect the upcoming week. The alternative — an async Slack message — works fine 80% of the time and fails badly the 20% when something is genuinely in flight. Cred and Zerodha both run synchronous handoffs; Flipkart and PhonePe run async. The teams running synchronous handoffs report fewer "I didn't know that was still ongoing" incidents.
Where this leads next
- /wiki/runbooks-the-ones-that-actually-work-at-3am — the runbook itself: structure, naming, and the discipline of writing them in advance.
- /wiki/slas-on-data-what-you-can-actually-promise — the SLA framing this chapter assumes; without an SLA, "freshness alert" has no threshold to fire on.
- /wiki/column-level-lineage-why-its-hard-and-why-it-matters — the lineage primitive that makes parent-level alert deduplication work.
- /wiki/cost-attribution-who-pays-for-that-query — the alert routing primitive (owner metadata) is the same primitive that drives cost attribution; design them together.
The on-call discipline is what separates teams that ship reliable data from teams that ship lots of data. The alerting harness is the surface area; the weekly review is the engine; the runbook is the artifact each page lives or dies by. Get all three working and the 02:30 IST page becomes rare, real, and resolvable — get any of the three wrong and on-call eats the team.
References
- Beyer, Jones, Petoff, Murphy — Site Reliability Engineering, O'Reilly, 2016 — the foundational text on symptom-based alerting; Chapter 6 ("Monitoring Distributed Systems") is the binding reference.
- Google SRE Workbook — Alerting on SLOs — the practical adaptation: error budgets, burn rates, multi-window alerts.
- PagerDuty — Incident Severity Best Practices — vendor doc on the P1/P2/P3 split that survives team turnover.
- Monte Carlo — Data Reliability Engineering, 2023 — public posts on the four-class alert taxonomy and weekly review ritual.
- OpenLineage Specification — the lineage event format that makes parent-level alert deduplication possible at scale.
- Razorpay Engineering Blog — public write-ups on data platform on-call discipline at UPI scale.
- /wiki/slas-on-data-what-you-can-actually-promise — the SLA framing prerequisite for any alert threshold.
- /wiki/column-level-lineage-why-its-hard-and-why-it-matters — the lineage primitive that makes alert deduplication possible.