Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Lineage-aware alerting
03:14 IST. PagerDuty wakes Aditi, the on-call for the fraud team at a hypothetical Flipkart pricing platform. The alert reads risk-score-v2: prediction-distribution drift > 0.3. She opens the dashboard. The drift is real — the score distribution has shifted hard left, well outside the 14-day envelope. She spends 22 minutes ruling out the obvious: model rollback, feature-store schema change, traffic anomaly. None of them. At 03:36 she pings the platform Slack and learns that partner_pricing.daily failed its freshness contract at 02:48 IST — 26 minutes before her page fired. Her model is downstream of that feed via three feature transforms. The actual root cause was visible 26 minutes before her alert; her alert was the last domino, not the first. By 04:11 IST the responder learns that six other models share that ancestor and have been silently scoring on stale prices since 02:48. None of them paged anybody.
This chapter is the design that makes 03:14 unnecessary. The previous chapter (data-quality metrics as SLOs) showed how to write a contract on a single dataset and alert when its clauses fail. That is the per-node alarm. Lineage-aware alerting is the per-graph alarm: a contract violation at a producer must propagate through the lineage DAG to the consumer's dashboard, the page must go to the team that owns the root cause, the cascade of derived alarms must be suppressed to prevent storming, and the responder must see which dependents are now untrustworthy without opening seven Grafana tabs.
A lineage-aware alarm fires once at the root failure, names every affected downstream node, and suppresses the derivative alarms those nodes would have raised on their own. Implementation: a directed acyclic graph of datasets and models with edges produced by ETL/feature-store metadata, a contract-violation event on a node that walks the graph and emits a single composite incident, and a router that pages the team owning the root edge — not the team owning the symptom.
Why the page goes to the wrong team without lineage
A traditional alert is a function of one signal. risk-score-v2 drift > 0.3 is computed from one model's predictions; it knows nothing about why the input distribution moved. The alert routes to the model owner because the metric lives in the model owner's namespace. That routing is correct only if the model itself is the root cause — a bad rollout, a bug in feature engineering, a stale serving binary. When the root cause is a stale upstream feed, the model owner gets paged for a problem they cannot fix without first calling the data-platform team, who cannot fix it without calling the partner-feed team. Three pages, three handoffs, and the metric that was actually wrong (partner_pricing.freshness_lag_min) was sitting in a different team's dashboard the whole time.
The fix is not "add more alerts". Adding more alerts makes the cascade worse — every downstream consumer fires, on-call rotations across four teams light up at once, and Slack becomes unreadable. The fix is to know the graph and let one event cover the entire affected subgraph. Why the obvious-sounding "just route alerts based on data dependencies" requires structural lineage and not just naming conventions: dependencies are not visible from metric names. risk-score-v2.prediction_drift does not contain a string that names partner_pricing.daily as a parent. The lineage edge lives in the feature-store transformation that consumed the feed, in the airflow DAG that orchestrated the join, in the dbt model that materialised the intermediate table. None of those are queryable from a Prometheus alertmanager rule. The lineage graph has to be a separate, machine-readable artefact, populated by the producers and consumers as they ship, and queryable at alert-evaluation time.
Anatomy of a lineage edge
The graph is the foundation; if the edges are wrong, the routing is wrong. A lineage edge has four parts: a source node identifier, a target node identifier, a transformation reference (what code or query produced the target from the source), and a propagation policy (how a contract violation in the source affects the target's contract). The first two are obvious; the last two are where most lineage systems quietly cheat.
The transformation reference is the URL or git-commit-hash of the code that performed the join, the dbt model name, the airflow task identifier, the feature-store transformation spec. When a producer's contract fails and the responder asks "what exactly is downstream and why?", the answer is a list of (target node, transformation reference) tuples — clickable, not just named — so the responder can read the join logic in 30 seconds and know whether the failure is blocking (the join uses the upstream as a join key, downstream is now wrong) or informational (the upstream is one of 14 features, downstream may degrade gracefully).
The propagation policy is the harder part. Not every upstream failure breaks every downstream the same way. Three policies cover most of practice: hard — the downstream contract is meaningless without the upstream (a fraud model that joins against the failing pricing feed produces meaningless predictions until the feed recovers; the model's own contract should not be evaluated on this batch, and the model's drift alarm should be suppressed during the window). degraded — the downstream still produces output but with reduced reliability (a recommender that uses pricing as 1 of 14 features will still rank items, but the rank is less trustworthy; the model's contract should be evaluated normally, but the responder should be told this dependency is currently failing). independent — the downstream is structurally unaffected (an analytics dashboard that aggregates total revenue does not depend on per-SKU pricing accuracy; no propagation, no suppression).
Why hard / degraded / independent and not a single boolean: a binary "is affected / is not affected" forces a false choice. Either every downstream alarm is suppressed (which mutes legitimate independent failures during the same window — the recommender genuinely has its own bug, but it is hidden under the parent suppression) or nothing is suppressed (which gives the page storm). Three policies match the three real engineering decisions: stop evaluating, evaluate but warn, evaluate normally. The producer team and consumer team negotiate the policy when they sign the contract; it lives in the YAML next to the freshness threshold.
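A sketch of what that negotiated entry might look like, written here as a Python literal rather than the contract YAML it would actually live in; the field names (upstream, policy, clauses_affected, transform_ref) are illustrative, not a fixed schema.
# Hypothetical consumer-side dependency declaration for model.recommend_v3.
# In practice this sits in the consumer's contract file next to the freshness
# and drift thresholds; the producer can see it but cannot override it.
RECOMMEND_V3_DEPENDENCIES = [
    {
        "upstream": "feature.price_per_sku",
        "policy": "degraded",  # one of 14 features: output degrades, does not break
        "clauses_affected": ["prediction_drift"],
        "transform_ref": "git@rec:serving/recommend_v3.py@22ab1",
    },
    {
        "upstream": "feature.discount_ratio",
        "policy": "degraded",
        "clauses_affected": ["prediction_drift"],
        "transform_ref": "git@rec:serving/recommend_v3.py@22ab1",
    },
]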
A working lineage-aware alerter — runnable
The smallest end-to-end demonstration: a lineage graph, contracts on multiple nodes, a contract-violation event entering the graph, propagation according to per-edge policy, suppression of derivative alarms, and a single composite incident emitted. Save as lineage_alert.py and run.
# lineage_alert.py — lineage-aware alerting on a synthetic feature/model graph.
# pip install networkx
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import Literal
import networkx as nx
PropagationPolicy = Literal["hard", "degraded", "independent"]
@dataclass
class Node:
name: str
owner_team: str
contract_clauses: list[str] # which clauses this node has
on_call_channel: str
@dataclass
class Incident:
root_node: str
root_clause: str
started_at: datetime
affected_hard: list[str] = field(default_factory=list)
affected_degraded: list[str] = field(default_factory=list)
suppressed_alarms: list[str] = field(default_factory=list)
page_to: str = ""
informational_to: list[str] = field(default_factory=list)
# Build the graph: partner_pricing -> features -> models, plus an analytics arm.
g = nx.DiGraph()
nodes = [
Node("partner_pricing.daily", "data-partner-team", ["freshness", "completeness", "distribution", "schema"], "#partner-feed-oncall"),
Node("feature.price_per_sku", "feature-store-team", ["completeness", "distribution"], "#feature-store-oncall"),
Node("feature.discount_ratio", "feature-store-team", ["completeness", "distribution"], "#feature-store-oncall"),
Node("model.risk_score_v2", "fraud-team", ["prediction_drift", "p99_latency"], "#fraud-oncall"),
Node("model.recommend_v3", "rec-team", ["prediction_drift", "p99_latency"], "#rec-oncall"),
Node("dashboard.daily_revenue","analytics-team", ["completeness"], "#analytics-oncall"),
]
for n in nodes:
g.add_node(n.name, obj=n)
# Edges with propagation policies. Each edge: (parent, child, policy, transform_ref).
edges = [
("partner_pricing.daily", "feature.price_per_sku", "hard", "git@feature-store:transforms/price_per_sku.py@a3f1c"),
("partner_pricing.daily", "feature.discount_ratio", "hard", "git@feature-store:transforms/discount_ratio.py@a3f1c"),
("partner_pricing.daily", "dashboard.daily_revenue", "independent", "dbt:models/daily_revenue.sql@v412"),
("feature.price_per_sku", "model.risk_score_v2", "hard", "git@fraud:training/risk_v2.py@7d99e"),
("feature.price_per_sku", "model.recommend_v3", "degraded", "git@rec:serving/recommend_v3.py@22ab1"),
("feature.discount_ratio", "model.recommend_v3", "degraded", "git@rec:serving/recommend_v3.py@22ab1"),
]
for parent, child, policy, transform in edges:
g.add_edge(parent, child, policy=policy, transform=transform)
def propagate(graph: nx.DiGraph, root: str, clause: str, started_at: datetime) -> Incident:
"""Walk the graph from `root`, classify each descendant by worst-case policy."""
inc = Incident(root_node=root, root_clause=clause, started_at=started_at)
inc.page_to = graph.nodes[root]["obj"].on_call_channel
    # BFS; track each descendant's *worst* (strongest) policy along any path from root.
    order = {"independent": 0, "degraded": 1, "hard": 2}
    worst: dict[str, PropagationPolicy] = {}
    for child in nx.descendants(graph, root):
        worst[child] = "independent"
    queue = [(root, "hard")]  # root itself is "hard" w.r.t. itself
    while queue:
        node, incoming_policy = queue.pop(0)
        for _, succ, attrs in graph.out_edges(node, data=True):
            edge_policy: PropagationPolicy = attrs["policy"]
            # Combine along a chain: an independent edge kills the chain; a chain
            # is "hard" only if every edge on it is hard; otherwise it is "degraded".
            if incoming_policy == "independent" or edge_policy == "independent":
                effective: PropagationPolicy = "independent"
            elif incoming_policy == "hard" and edge_policy == "hard":
                effective = "hard"
            else:
                effective = "degraded"
            # Re-enqueue a descendant only when this path strengthens its
            # classification, so the stronger policy keeps propagating downstream.
            if order[effective] > order[worst[succ]]:
                worst[succ] = effective
                queue.append((succ, effective))
for desc, policy in worst.items():
if policy == "hard":
inc.affected_hard.append(desc)
# Suppress every clause the descendant could fire on for this incident.
for c in graph.nodes[desc]["obj"].contract_clauses:
inc.suppressed_alarms.append(f"{desc}:{c}")
elif policy == "degraded":
inc.affected_degraded.append(desc)
inc.informational_to.append(graph.nodes[desc]["obj"].on_call_channel)
# independent: no action, the descendant's own alarms fire normally.
return inc
# Simulate: partner_pricing.daily fails its freshness contract at 02:48 IST.
incident = propagate(g, "partner_pricing.daily", "freshness", datetime(2026, 4, 25, 2, 48))
print(json.dumps({
"page_to": incident.page_to,
"root": f"{incident.root_node}:{incident.root_clause}",
"started_at": incident.started_at.isoformat(),
"affected_hard": incident.affected_hard,
"affected_degraded": incident.affected_degraded,
"suppressed_alarms": incident.suppressed_alarms,
"informational_to": sorted(set(incident.informational_to)),
}, indent=2))
Sample run:
{
"page_to": "#partner-feed-oncall",
"root": "partner_pricing.daily:freshness",
"started_at": "2026-04-25T02:48:00",
"affected_hard": [
"feature.price_per_sku",
"feature.discount_ratio",
"model.risk_score_v2"
],
"affected_degraded": [
"model.recommend_v3"
],
"suppressed_alarms": [
"feature.price_per_sku:completeness",
"feature.price_per_sku:distribution",
"feature.discount_ratio:completeness",
"feature.discount_ratio:distribution",
"model.risk_score_v2:prediction_drift",
"model.risk_score_v2:p99_latency"
],
"informational_to": [
"#rec-oncall"
]
}
Read the output. The page goes to one place — #partner-feed-oncall, the team that owns the root failure — and only that channel is paged. Three nodes are classified hard-affected (feature.price_per_sku, feature.discount_ratio, model.risk_score_v2) because every edge in their path from the root is hard; their own contract alarms are explicitly suppressed for the duration of the parent incident. One node is degraded-affected (model.recommend_v3) because its incoming edges include a degraded policy; its own alarms still fire if it has independent issues, but the rec-team's Slack channel gets an informational notification that one of their dependencies is failing — context, not a page. The analytics dashboard is independent (the dbt model aggregates revenue and does not consume per-SKU prices), so it does not appear anywhere — its own alarms fire normally if anything goes wrong on its side.
Why the combination rule is weakest-link along a chain and worst-case across chains: a single independent edge in any path means the failure does not propagate beyond that point. The dbt revenue rollup uses a different join key and produces correct output regardless of partner-pricing freshness, so nothing downstream of that edge is affected. For chains that do propagate, the effective policy of a path is its weakest remaining edge: model.risk_score_v2 is hard-affected because every edge from the feed to the model is hard, while model.recommend_v3 is only degraded-affected because its final edges are degraded. A model that merely degrades when one feature is unreliable is not made meaningless just because that feature's own upstream is hard-broken. When several paths reach the same node, the node takes the strongest classification among them, so one fully-hard path is enough to mark a model hard-affected even if another path only degrades it.
The dataclass surface is the contract. Node.contract_clauses is the list of alarm types this node can fire; knowing this is what makes suppression precise (we suppress exactly those clauses, not "all alerts from this team"). The policy attribute on each edge is the per-edge negotiation between producer and consumer; it lives in the consumer's repository (the rec-team decides whether their model is hard or degraded on a given feature) and the producer cannot override it. Incident.suppressed_alarms is the exact list of (node, clause) pairs that should not page during this incident; the alertmanager checks this list before firing any rule whose (node, clause) matches.
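A sketch of that check, as it might look if appended to lineage_alert.py; should_fire and the open-incident list are hypothetical names, not part of any real alertmanager API.
# Sketch: gate an alarm on the suppression lists of currently open incidents.
# `open_incidents` would come from whatever incident store you run.
def should_fire(node: str, clause: str, open_incidents: list[Incident]) -> bool:
    key = f"{node}:{clause}"
    for inc in open_incidents:
        if key in inc.suppressed_alarms:
            # Record the would-have-fired event for the audit trail
            # (see Rule 3 in the next section) instead of paging.
            print(f"suppressed {key} by incident {inc.root_node}:{inc.root_clause}")
            return False
    return True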
Suppression, not silencing — and the until clause
Suppression is dangerous. A team that learns to silence downstream alerts during parent incidents will eventually silence too much — the suppressed alarms include real, independent failures that happen to fire during the same window, and those go unnoticed. Three rules keep suppression honest.
Rule 1: suppression is per-clause, not per-node. When partner_pricing.daily:freshness fails, the suppression list should be exactly the clauses on downstream nodes that depend on freshness — not every clause those nodes have. model.risk_score_v2:prediction_drift is suppressed (because drift will be wrong while inputs are stale), but model.risk_score_v2:p99_latency is NOT suppressed (latency has nothing to do with the feed; if the model is also slow, that is a real, independent failure). The dataclass above conflates the two for brevity; production lineage systems track clause-level dependencies (feature.price_per_sku:completeness → model.risk_score_v2:prediction_drift) as a separate map.
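One possible shape for that clause-level map, kept deliberately tiny; the pair-keyed dict layout and the helper name are illustrative, not a standard.
# Sketch of the clause-level dependency map Rule 1 describes. Keyed by
# (upstream node, upstream clause); valued by the downstream (node, clause)
# pairs that become untrustworthy when that clause fails.
CLAUSE_DEPENDENCIES: dict[tuple[str, str], list[tuple[str, str]]] = {
    ("partner_pricing.daily", "freshness"): [
        ("feature.price_per_sku", "completeness"),
        ("feature.discount_ratio", "completeness"),
    ],
    ("feature.price_per_sku", "completeness"): [
        ("model.risk_score_v2", "prediction_drift"),
        # model.risk_score_v2:p99_latency is deliberately absent: latency does
        # not depend on the feature's completeness and must keep paging.
    ],
}

def clause_suppression_list(node: str, clause: str) -> list[str]:
    # A transitive closure over the map would run here; one hop shown for brevity.
    return [f"{n}:{c}" for n, c in CLAUSE_DEPENDENCIES.get((node, clause), [])]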
Rule 2: suppression has an explicit expiry. Every suppression entry carries an until timestamp — typically started_at + max(2 × expected_recovery_time, 1 hour). If the parent incident is still open at expiry, the suppression must be re-confirmed by an on-call human; it does not auto-extend. This prevents the failure mode where a parent incident is forgotten in PagerDuty's "open" state for three days while every downstream alarm is silently muted. Production teams hit this within their first quarter of running lineage-aware alerting; the until clause is the cheapest defence.
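The expiry rule from Rule 2, as a small helper; expected_recovery would come from the producer's contract, and the one-hour floor is the value quoted above.
from datetime import datetime, timedelta

# Rule 2 as code: suppression expires at started_at + max(2 x expected recovery, 1 hour).
# Renewal past this point requires a human decision, not an auto-extend.
def suppression_until(started_at: datetime, expected_recovery: timedelta) -> datetime:
    return started_at + max(2 * expected_recovery, timedelta(hours=1))

# e.g. a feed that usually recovers in 40 minutes:
# suppression_until(datetime(2026, 4, 25, 2, 48), timedelta(minutes=40))
# -> 2026-04-25 04:08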
Rule 3: suppression emits its own telemetry. Every suppressed alarm is logged with attributes (suppressed_by_incident_id, suppressed_at, suppressed_until, clause, would_have_fired_at). The on-call dashboard has a panel "currently suppressed alarms" that lists them in real time. Audit week reviews the suppression history: were the suppressions accurate (the alarms would indeed have been spurious symptoms of the root cause), or did real bugs hide behind them? Without that audit, suppression rots.
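Rule 3's audit trail as a record type; the field names mirror the attributes listed above, and where the records land (a log line, a table row, a metrics label set) is up to you.
from dataclasses import dataclass
from datetime import datetime

# Every suppressed would-have-fired alarm leaves one of these behind. The
# "currently suppressed alarms" panel is a query over records whose
# suppressed_until is still in the future.
@dataclass
class SuppressionRecord:
    suppressed_by_incident_id: str
    node: str
    clause: str
    suppressed_at: datetime
    suppressed_until: datetime
    would_have_fired_at: datetime | None = None  # filled in when the muted rule trips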
Where the lineage graph comes from — and why hand-curation fails
A lineage graph that engineers maintain by hand drifts within weeks. Edges are added when the consumer ships a new feature (the rec-team writes a new join, remembers to file a PR to the lineage-registry repo, the registry maintainer reviews and merges) — and forgotten when the consumer deprecates one. Within a quarter, the graph is missing 15-30% of edges, and the missing edges are exactly the ones nobody remembers — which means the lineage-aware alarm misses exactly the failures that surprise you.
Production lineage graphs are built by extraction from the systems that already know. Three sources cover most teams.
OpenLineage events from orchestrators. Airflow, dbt, Dagster, and most modern data platforms emit OpenLineage events on every run — START, COMPLETE, with input datasets and output datasets attached. A simple consumer of OpenLineage events builds the dataset-to-dataset edges automatically; every dbt run updates the graph. The OpenLineage spec is a JSON schema with inputs[] and outputs[] arrays; you point your collector at every orchestrator and the graph populates itself. Edges are tagged with the run's git commit and timestamp, which gives the responder the transformation reference for free.
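A minimal sketch of the collector side, assuming the standard OpenLineage run-event shape (eventType, job, inputs[], outputs[]); facet handling, such as pulling a git commit out of a job facet, is omitted, and the namespace-prefixed node names are illustrative.
import networkx as nx

# Minimal OpenLineage consumer: one COMPLETE event becomes one edge per
# (input, output) pair, tagged with the job name and event time.
def apply_openlineage_event(graph: nx.DiGraph, event: dict) -> None:
    if event.get("eventType") != "COMPLETE":
        return
    job = f'{event["job"]["namespace"]}.{event["job"]["name"]}'
    for src in event.get("inputs", []):
        for dst in event.get("outputs", []):
            graph.add_edge(
                f'{src["namespace"]}.{src["name"]}',
                f'{dst["namespace"]}.{dst["name"]}',
                transform=job,
                observed_at=event["eventTime"],
            )

# Example event, trimmed to the fields this sketch uses:
example = {
    "eventType": "COMPLETE",
    "eventTime": "2026-04-25T02:05:00Z",
    "job": {"namespace": "airflow", "name": "price_per_sku_build"},
    "inputs": [{"namespace": "warehouse", "name": "partner_pricing.daily"}],
    "outputs": [{"namespace": "features", "name": "feature.price_per_sku"}],
}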
Feature-store metadata. Tecton, Feast, Vertex AI Feature Store, and homegrown systems all maintain a registry of feature definitions, and every definition specifies its source datasets. A nightly job that reads the feature-store registry and emits dataset-to-feature edges keeps the feature side of the graph fresh. The serving layer's lookup calls (which feature does this prediction request use?) provide the feature-to-model edges, also nightly.
Code analysis as a fallback. Teams that run on a polyglot mess of one-off scripts, scheduled SQL, and notebook-driven training rely on AST-level analysis: a parser that walks the dbt project (dbt parse), the airflow DAG repo, and the model-training repos, looking for read_table(), pd.read_parquet(), feature_store.get(), and similar calls, building the edges from the source-string arguments. This is the most fragile of the three — string analysis misses dynamically-constructed table names — but it covers the long tail of legacy code that does not emit OpenLineage events.
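A toy version of that fallback using Python's ast module; it only catches literal string arguments to a few well-known call names, which is exactly the fragility described above.
import ast

# Toy AST-based extractor: find calls like pd.read_parquet("warehouse/...") or
# feature_store.get("feature.price_per_sku") with literal string arguments.
# Dynamically built table names (f-strings, variables) are invisible to this pass.
READ_CALLS = {"read_table", "read_parquet", "get"}

def extract_sources(python_source: str) -> set[str]:
    sources: set[str] = set()
    for node in ast.walk(ast.parse(python_source)):
        if isinstance(node, ast.Call):
            name = node.func.attr if isinstance(node.func, ast.Attribute) else getattr(node.func, "id", "")
            if name in READ_CALLS and node.args and isinstance(node.args[0], ast.Constant):
                if isinstance(node.args[0].value, str):
                    sources.add(node.args[0].value)
    return sources

print(extract_sources('df = pd.read_parquet("warehouse/partner_pricing.daily")'))
# {'warehouse/partner_pricing.daily'}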
The composite graph from all three sources is what the alarm reads. A nightly diff against yesterday's graph surfaces additions and deletions; sudden drops in edge count usually mean a source is broken (the OpenLineage collector lost its database, the feature-store snapshot failed) and the alarm system itself becomes less reliable in proportion. Treating the lineage graph as an SLO'd asset — its own freshness contract, its own completeness check, its own consumers — is the discipline that prevents lineage-aware alerting from quietly degrading into name-based alerting.
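The nightly diff itself is a set comparison; the 10% drop threshold below is an arbitrary illustration, not a recommendation.
import networkx as nx

# Nightly sanity check on the lineage graph itself: surface added/removed edges
# and flag a sudden shrink, which usually means a collector broke rather than
# the organisation deleting a tenth of its pipelines overnight.
def diff_graphs(yesterday: nx.DiGraph, today: nx.DiGraph, max_drop: float = 0.10) -> dict:
    removed = set(yesterday.edges()) - set(today.edges())
    added = set(today.edges()) - set(yesterday.edges())
    drop = len(removed) / max(len(yesterday.edges()), 1)
    return {
        "added": sorted(added),
        "removed": sorted(removed),
        "collector_suspect": drop > max_drop,  # breach of the graph's own completeness contract
    }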
Common confusions
- "Lineage-aware alerting just means tagging alerts with their data source." It is not. Tagging tells the responder which dataset an alarm came from; it does not suppress the seven downstream alarms or route the page to the producer. A tag is metadata; lineage-aware alerting is a routing and suppression policy that consumes a queryable graph at alarm-evaluation time. Without the graph and the policy, you have a richer postmortem and the same 03:14 page storm.
- "Suppression means silencing — that is what we are trying to avoid." It does not. Silencing turns the alarm off; the responder never sees it happened. Suppression records every would-have-fired event and emits its own telemetry — a "currently suppressed" panel, an audit log, an explicit expiry. Suppression with audit is the opposite of silencing: every muted alarm leaves a trail that the team reviews. Conflating the two is what makes engineers reject lineage-aware alerting on principle; reading §3 of this chapter or the suppression literature distinguishes them.
- "The lineage graph is too expensive to query at every alarm evaluation." It is not, if you build it right. A typical production graph has 5,000–50,000 nodes and 20,000–200,000 edges; a NetworkX in-memory graph or a small Neo4j/JanusGraph instance answers
descendants(root)in single-digit milliseconds. The alertmanager queries the graph once per incident, not once per alarm evaluation — when a contract violation event arrives, you walk the graph to build the suppression list, then the alertmanager applies that list to every subsequent rule check until the incident closes. Cost is incident-rate-bound, not alarm-rate-bound. - "Independent edges mean the downstream is unaffected; nothing more to do." Independent edges still need to exist in the graph; otherwise the analyst who asks "what could be affected by this incident" will get a wrong answer (a missing edge looks the same as an
independentedge to the BFS, but means something different to a human). The discipline is: every actual edge is recorded, and the policy field distinguishesindependentfromhard/degraded. Missing edges are bugs;independentedges are explicit "we considered this and it does not propagate" decisions. - "Lineage-aware alerting solves alert fatigue." It reduces a specific class of fatigue (downstream-cascade fatigue from a single root cause) and does nothing for others (noisy alarm rules with bad thresholds, alarms that auto-resolve before anyone looks, alarms that fire 40 times for the same reason because the
for:clause is too short). It is one tool in the alert-fatigue toolkit, not the toolkit. See /wiki/alert-fatigue-as-a-production-failure for the broader discipline.
Going deeper
How OpenLineage and Marquez build the graph at runtime
OpenLineage is a JSON schema and HTTP API for emitting START and COMPLETE events from data-processing jobs. An Airflow DAG with the OpenLineage plugin emits one event per task with inputs: [{namespace, name}] and outputs: [{namespace, name}] arrays. Marquez is the canonical reference consumer — it ingests the events into a Postgres-backed graph and exposes a REST API for GET /lineage?nodeId=... that returns the descendants and ancestors. The graph in lineage_alert.py above maps cleanly onto a Marquez query: build the in-memory NetworkX graph by walking from the root via the Marquez API, then run the same propagate() logic. In production, the alertmanager either embeds the graph (refreshed nightly) or queries Marquez at incident-creation time; the embedded approach is faster, the query approach is fresher. Hybrid is most common: embed a 24-hour-old snapshot, query Marquez for any node added in the last 24 hours.
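A sketch of the query-at-incident-time path, assuming a local Marquez at the URL below; the response parsing assumes the documented Marquez lineage shape (a "graph" list of nodes, each with "id" and "outEdges" of {"origin", "destination"}) and should be verified against the Marquez version you actually run.
import requests
import networkx as nx

MARQUEZ = "http://localhost:5000"  # assumed local Marquez; adjust to your deployment

# Pull the lineage subgraph around one dataset from Marquez and load it into
# NetworkX so the propagate() logic above can run on it unchanged.
def fetch_subgraph(node_id: str) -> nx.DiGraph:
    resp = requests.get(f"{MARQUEZ}/api/v1/lineage", params={"nodeId": node_id}, timeout=10)
    resp.raise_for_status()
    g = nx.DiGraph()
    for node in resp.json().get("graph", []):
        g.add_node(node["id"])
        for edge in node.get("outEdges", []):
            g.add_edge(edge["origin"], edge["destination"])
    return g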
What changes for streaming pipelines
A streaming graph has a different temporal shape: lineage edges still exist (a Flink job consumes a Kafka topic and writes to another Kafka topic), but the contract violation is now a rate-of-bad-events rather than a failed run. The propagation logic survives — hard edges still suppress downstream contract violations on the affected windows — but the suppression until clause is no longer "until the next batch run"; it is "until N consecutive windows pass cleanly". For Hotstar's IPL streaming graph (ad-impressions → bid-evaluation → settlement, all Flink), the until is computed as time of last violation + 3 × the rolling window size. The dataclass extends with a pipeline_type: batch | streaming field and the suppression-renewal logic branches.
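The streaming variant of the until rule, as a small helper; the factor of 3 and the rolling-window framing come from the hypothetical Hotstar example above.
from datetime import datetime, timedelta

# Streaming suppression expiry: instead of "until the next batch run", keep the
# suppression alive for a few rolling windows past the last observed violation.
def streaming_suppression_until(last_violation: datetime, window: timedelta, factor: int = 3) -> datetime:
    return last_violation + factor * window

# e.g. a 5-minute rolling window: the suppression lapses 15 minutes after the
# last bad window, i.e. after 3 clean windows in a row.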
Cycles in the lineage graph
Lineage is supposed to be a DAG; cycles indicate a real problem in the data architecture (typically a feedback loop where a model's predictions are written back as features for tomorrow's model — common in recommender systems and fraud feedback loops). The graph builder must detect cycles and surface them as architecture warnings; the alertmanager treats cycles as "suppression scope = the entire connected component" because the propagation policy cannot be reliably classified inside a cycle. This is not a frequent issue (the typical Razorpay or Flipkart graph has 2–3 cycles among 30,000 nodes) but the cases that do occur are exactly the high-stakes feedback loops you most want to handle correctly. The practical defence: detect cycles on graph build (nx.simple_cycles), file a Jira ticket with the data-architecture team, and over-suppress conservatively until the cycle is broken.
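The cycle defence sketched against the same NetworkX graph; over-suppression here means treating the whole weakly connected component as the suppression scope, which is deliberately conservative. The report shape is illustrative.
import networkx as nx

# Detect feedback loops at graph-build time and compute the conservative
# suppression scope for any node inside one: the entire weakly connected
# component, because per-edge policies cannot be trusted inside a cycle.
def cycle_report(graph: nx.DiGraph) -> list[dict]:
    reports = []
    for cycle in nx.simple_cycles(graph):
        component = next(
            comp for comp in nx.weakly_connected_components(graph) if cycle[0] in comp
        )
        reports.append({"cycle": cycle, "over_suppression_scope": sorted(component)})
    return reports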
Why the page goes to the producer, not the platform team
It is tempting to centralise: the platform team owns the lineage system, so all lineage-aware pages go to the platform team, who then routes manually. This produces a single bottleneck and trains the platform team to be a low-context router for problems they cannot fix. The discipline that scales is: the page goes to the team whose contract failed first (the producer of the root failure), the platform team owns only the lineage graph itself (its freshness contract, its completeness check, its API uptime), and the platform team is paged only when the lineage graph itself is broken. This mirrors the rung-to-team mapping from the data-quality SLO chapter (/wiki/data-quality-metrics-as-slos); the lineage system is one more node in the graph, with its own contract and its own owner.
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install networkx
python3 lineage_alert.py
# Expected output: one page to #partner-feed-oncall, three hard-affected nodes,
# one degraded-affected node (informational notification only), six suppressed
# alarms across the affected nodes. Tweak the edge policy on
# (partner_pricing.daily, feature.price_per_sku) from "hard" to "degraded" and
# rerun: the suppression list shrinks to feature.discount_ratio's two clauses,
# and model.risk_score_v2 moves from hard-affected (drift alarm suppressed) to
# degraded-affected, so #fraud-oncall now gets an informational notification
# while the fraud model's own alarms stay live.
Where this leads next
The lineage graph is the connective tissue for the rest of Build 15. The next chapter, /wiki/model-drift-and-data-drift, uses the same graph to distinguish data drift (the input distribution moved, propagating through the lineage) from model drift (the model's behaviour changed independently of inputs). Without the graph, the two failure modes look identical on a prediction-distribution dashboard and responders waste 40 minutes ruling out the wrong one. With the graph, the responder asks "are any upstream contracts failing?" first; if no, the failure is genuinely model drift and a different playbook applies.
After that, /wiki/shadow-evaluation-and-canary-models extends the lineage graph one more rung — the canary model becomes a node with hard edges to the same features as the production model, and the shadow-evaluation contract joins the lineage-aware alarm system. A canary that violates its drift contract while production does not is a structural signal — same inputs, different outputs — that the lineage routing surfaces directly.
By the end of Build 15 the graph you populated in this chapter, the contracts from the previous chapter, the drift detectors from the next, and the shadow-evaluation pipeline compose into a single observability surface. Aditi's 03:14 page becomes an 02:48 page to the right team, with the seven affected dependents listed in the incident body, no cascade, and a documented audit trail of every alarm that was correctly suppressed and every one that was let through.
References
- OpenLineage specification — the JSON schema and event protocol for runtime lineage extraction; the foundation of any production lineage graph.
- Marquez project — the reference OpenLineage consumer; Postgres-backed graph plus a REST API for descendants(node).
- Astronomer & Datakin, "Lineage-aware alerting" engineering blog series — the operational pattern that inspired this chapter.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) — chapters on incident response and signal correlation.
- Andrew Jones, Driving Data Quality with Data Contracts (Packt, 2023) — the contract-as-SLO framing this chapter routes alarms through.
- Google SRE Workbook, "Implementing SLOs" — alerting on burn rate — the request-side alarm template these batch alarms replace.
- /wiki/data-quality-metrics-as-slos — internal: the per-node contract this chapter routes through the graph.
- /wiki/alert-fatigue-as-a-production-failure — internal: the broader discipline; lineage-aware alerting is one tool in it.