Wall: you can't debug what happened last Tuesday
Friday, 11:42 IST. Riya is the on-call data engineer at a Bengaluru fintech with about 280 DAGs in Airflow. The CFO has just walked into the data office holding a printout — the GST liability number on Tuesday's filing was ₹2.4 crore lower than Wednesday's recompute, and the GST officer wants an explanation by Monday. Riya opens the Airflow UI. Tuesday's gst_daily_filing DAG is green. Every task succeeded. Retries: zero. SLAs: met. The scheduler did its job. And yet the number is wrong, and Riya has no idea which of the 17 upstream tables, 4 source systems, and 230 transformation steps produced the bad row. The DAG view shows tasks. It does not show data. By 18:00 Riya has a wall of git blame, SELECT COUNT(*) queries, and a chat thread with three product managers, and is no closer to an answer than she was at noon.
A scheduler tells you that a task ran. It does not tell you what that task read, what it produced, who consumed the output, or which upstream change broke today's number. Past about 50 DAGs and 100 tables, "a task succeeded" stops being a useful signal — the question becomes "which row, in which column, in which table, was wrong, and why?", and no scheduler in 2026 answers that question. This is the wall. The next build (lineage, observability, data contracts) is the response.
The wall in one screenshot
The Airflow DAG view is a scheduler's view of the world. Every box is a task; every arrow is a task-to-task dependency. Green means the task's exit code was 0. Red means it wasn't. The view answers one question well — did the scheduler do its job? — and is silent on every question the on-call engineer actually has at 3 a.m.
The two graphs are related but not identical. A single task in the DAG can write to three tables. Two tasks in different DAGs can write to the same column. A task that succeeded can produce wrong output (a SELECT COUNT(*) > 0 test passed, but the values were stale). A task that failed and retried can leave behind a half-written table that the next task happily reads. The DAG is a graph of executions; the data is a graph of artefacts. Riya needs the second one and has the first one.
There is a deeper version of this gap. The DAG is a graph of what should happen (the code defines it) intersected with a graph of what did happen (the run history records it). Both graphs are tasks. Neither is data. Even a perfectly faithful run-history view answers "which tasks ran, in what order, with what duration". It does not answer "which row in which table was produced by which run of which task" — and that mapping is the only thing that lets you walk backwards from a wrong number to its cause.
When Airflow shows the run-history view in 2026, it shows a Gantt chart of task durations. The Gantt chart is genuinely useful for performance debugging — "task X is taking 4× longer than usual" — but it is silent on data correctness. A 60-second task that wrote the wrong rows looks identical to a 60-second task that wrote the right rows. Until the run-history view includes pointers to the data artefacts each task produced, plus a way to compare those artefacts across runs, the scheduler cannot answer the question Riya is asking.
What "debug last Tuesday" actually requires
Take the GST incident apart. To answer "why was the number wrong on Tuesday?", Riya has to answer five sub-questions, in order — and the scheduler answers exactly zero of them.
Sub-question 1: which table holds the bad number? Riya knows the final filing was wrong. She does not know which of the 17 candidate tables the wrong row lives in — gst_filing_daily, tax_calc_intermediate, merchant_tax_rates_snapshot, the joined orders_with_tax, or one of 13 others. Why this is hard without lineage: the GST filing is the output of a pipeline, but each row in the output is a sum across millions of input rows that came from a tree of joins. To narrow down "which table" the bad number originated in, you need to trace each output column back to the columns it was computed from — column-level lineage. Without it, you have to manually SELECT * from every candidate and compare across days, which is the next 4 hours of Riya's afternoon.
The candidate count is not bounded by what is in the DAG. The GST filing reads tax_rates_dim directly, but tax_rates_dim is itself written by a separate Lambda triggered by a webhook from the merchant-admin service — a write path the data team did not build and does not own. When Riya enumerates "the 17 candidate tables", she is enumerating only the ones she knows about. The actual count is whatever was reachable from the filing's SQL — including tables that were touched by triggers, materialised view refreshes, or replication from OLTP. Without lineage, Riya does not know what she does not know.
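What column-level lineage buys can be shown in a few lines. The sketch below walks a toy lineage graph backwards from the filing column; the edge data is illustrative and hand-written here, whereas real tools build it by parsing SQL:

```python
# Toy column-level lineage: output column -> the input columns it was
# computed from. Table and column names follow this chapter's example;
# the edges themselves are illustrative, not a real deployment's graph.
EDGES = {
    "gst_filing_daily.sum_18": ["tax_calc.taxable_amount", "tax_rates_dim.rate"],
    "tax_calc.taxable_amount": ["orders_with_tax.taxable_amount"],
    "orders_with_tax.taxable_amount": ["orders.amount", "tax_slab_lookup.slab"],
}

def upstream(column, graph=EDGES):
    """Walk the lineage graph backwards from one output column."""
    seen, stack = set(), [column]
    while stack:
        col = stack.pop()
        for parent in graph.get(col, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)

print(upstream("gst_filing_daily.sum_18"))
```

The traversal replaces "SELECT * from every candidate and compare" with a graph query: the answer to "which tables could hold the bad value?" is exactly the set this function returns.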
Sub-question 2: when did that table change? Once Riya guesses that merchant_tax_rates_snapshot is the suspect, she needs to know when its values last changed. Was the bad value written on Monday? On Tuesday morning before the GST job ran? Three weeks ago, and only now showing up because of an upstream recompute? The scheduler knows when tasks ran. It does not know when data was last modified. The warehouse may know — Iceberg tracks snapshots, Delta tracks versions, Snowflake has Time Travel — but only if someone explicitly queries the metadata. The scheduler does not surface "this row's value last changed on Mon at 19:43 IST".
Sub-question 3: which task produced that change? Even with a "last modified" timestamp, Riya needs to know which task wrote the row. The same table can be written by multiple tasks across multiple DAGs (in this fintech, tax rates can be updated by merchant_admin_dag, gst_rate_refresh_dag, or a manual SQL run from the data-platform team — who don't tag their writes). The scheduler logs INSERT INTO merchant_tax_rates_snapshot ... somewhere, but those logs are in three different places (Airflow task logs, the warehouse query history, and the merchant_admin service's stdout) and they don't agree on timestamps.
Sub-question 4: what did that task read? Once Riya identifies the writer task, she needs to know its inputs. If the task is "read from merchant_admin.tax_rate_changes and append to merchant_tax_rates_snapshot", she needs to know what was in merchant_admin.tax_rate_changes on Monday at 19:43 — which is itself a versioned table whose state she can no longer reconstruct without snapshotting. The scheduler does not record what data the task consumed; it records that the task ran. The chain of "input → transformation → output" — the basic unit of debugging — has to be reconstructed from query logs.
Sub-question 5: what other things downstream consumed the bad value? Even if Riya finds the single bad row, she needs to know what else in the system was contaminated by it. Did the marketing team's GST-aware reports use the same table? Did the merchant-facing dashboard pull from a derived view? Did a Spark job that ran at 04:00 cache the bad value into a feature store? Without downstream lineage, every fix is incomplete — the row is corrected in merchant_tax_rates_snapshot but the marketing dashboard is still showing the wrong number because nobody told it to refresh. Why "fix the source" is not enough: data flows in trees, not points. A bad row at the root contaminates every leaf that read it, and warehouses do not auto-invalidate downstream caches the way a CPU cache does. Each leaf has its own refresh schedule, its own cache, its own consumer; a fix at the source has to be paired with explicit invalidation messages to every downstream — which requires knowing who they are, which is the lineage problem all over again.
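Downstream contamination is the same graph walked in the other direction. A sketch, again with illustrative edges and hypothetical consumer names:

```python
# Forward (consumer) edges: table -> the things that read it. Node names
# extend this chapter's example; the edges are illustrative.
CONSUMERS = {
    "merchant_tax_rates_snapshot": ["gst_filing_daily", "marketing_gst_report"],
    "gst_filing_daily": ["cfo_dashboard"],
    "marketing_gst_report": ["merchant_dashboard"],
}

def blast_radius(table, graph=CONSUMERS):
    """Everything downstream that may have consumed a bad value in `table`."""
    seen, stack = set(), [table]
    while stack:
        for consumer in graph.get(stack.pop(), []):
            if consumer not in seen:
                seen.add(consumer)
                stack.append(consumer)
    return sorted(seen)

print(blast_radius("merchant_tax_rates_snapshot"))
# → ['cfo_dashboard', 'gst_filing_daily', 'marketing_gst_report', 'merchant_dashboard']
```

The returned set is the invalidation list: every name in it needs an explicit refresh after the source row is corrected.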
The scheduler answers none of these. The Airflow UI shows green boxes; the metadata DB shows task durations; the logs show stdout. The information Riya needs — the lineage of the bad row, the version history of the table, the consumers of the column — lives in a graph the scheduler does not maintain.
A 70-line debug session that takes 4 hours by hand
What Riya actually does to debug Tuesday's incident is the same script every data engineer writes from memory at 3 a.m., and it is uniformly painful.
# debug_last_tuesday.py — Riya's afternoon, transcribed.
# 70 lines. Took 4 hours. Found the bug at line 58.
import duckdb
import datetime as dt
WAREHOUSE = "warehouse.duckdb" # connection to the BI warehouse
INCIDENT_DATE = dt.date(2026, 4, 21) # Tuesday
RECOMPUTE_DATE = dt.date(2026, 4, 22) # Wednesday's correct value
con = duckdb.connect(WAREHOUSE, read_only=True)
# Step 1: confirm the discrepancy. Diff the two days at the leaf.
tue = con.execute("""
    SELECT tax_slab, sum_18 AS tue_value
    FROM gst_filing_daily WHERE filing_date = ?
""", [INCIDENT_DATE]).fetchall()
wed = con.execute("""
    SELECT tax_slab, sum_18 AS wed_value
    FROM gst_filing_daily_recompute WHERE filing_date = ?
""", [INCIDENT_DATE]).fetchall()
print(f"Tue: {tue}")
print(f"Wed (recompute): {wed}")
# Output: discrepancy is in slab 18%, ₹2.4 cr lower.
# Step 2: walk upstream. gst_filing_daily aggregates from tax_calc.
# What changed in tax_calc between Tue 02:00 (when gst ran) and Wed?
candidates = ["tax_calc", "orders_with_tax", "merchant_tax_rates_snapshot",
"tax_rates_dim", "tax_slab_lookup", "merchant_dim"]
for tbl in candidates:
    n_now = con.execute(f"SELECT count(*) FROM {tbl}").fetchone()[0]
    print(f"{tbl}: rows now = {n_now}")
# Output is a row count NOW. There is no row count "as of Tue 02:00".
# Riya cannot diff because she has no historical snapshots of these tables.
# Step 3: scan Airflow logs for any task that wrote any of these tables.
# This is grep through 230 task logs across 18 DAG runs.
import subprocess
for tbl in candidates:
    out = subprocess.run(
        ["grep", "-r", f"INSERT INTO {tbl}",
         f"/var/log/airflow/{INCIDENT_DATE}/"],
        capture_output=True, text=True)
    print(f"--- {tbl} ---")
    print(out.stdout[:500] if out.stdout else "no writes found in airflow logs")
# Output: 4 tables have insert statements logged in airflow; 2 don't.
# The 2 that don't are presumably written by something OTHER than airflow.
# (Spoiler: merchant_tax_rates_snapshot is written by a Lambda that
# Riya doesn't even know exists yet.)
# Step 4: find the column-level diff.
# What columns of tax_calc actually feed into the wrong number?
# Riya re-reads the SQL by hand. The aggregation is:
# SELECT tax_slab, SUM(taxable_amount * rate) FROM orders_with_tax JOIN tax_rates_dim ...
# So she needs to check rate values for slab=18% on Tuesday.
rates = con.execute("""
SELECT slab, rate, valid_from, valid_to FROM tax_rates_dim
WHERE slab = '18%'
""").fetchall()
print(f"current tax_rates_dim 18%: {rates}")
# Output: rate = 0.18, valid_from = 2024-07-01, valid_to = NULL.
# Looks fine. But this is the table NOW. What was it on Tuesday?
# Step 5: there is no time-travel on tax_rates_dim. It's a regular table.
# Riya goes to the merchant_admin service git history.
git_log = subprocess.run(
    ["git", "log", "--since=2026-04-19", "--until=2026-04-22",
     "--oneline", "merchant_admin/migrations/"],
    capture_output=True, text=True, cwd="/srv/repos/merchant-admin",
)
print(git_log.stdout)
# Output: "ec1f3a2 Mon 19:43 UPDATE: tax_rates_dim slab=18 rate=0.18 -> 0.16 (revert)"
# Found it. A backout migration ran on Monday at 19:43 that briefly set
# the 18% slab to 16%; it was reverted at 20:14. The warehouse mirror of
# tax_rates_dim refreshes twice a day, at 20:00 and 02:30 IST. The Monday
# 20:00 refresh fell inside the 31-minute bad window and captured
# rate = 0.16. The Tuesday 02:30 refresh carried the revert — but
# gst_daily_filing had already read the broken mirror at 02:00. That gap
# between the two refreshes was Riya's bug.
# What Riya types into the chat at 16:30:
# "Found it. Mon 19:43 someone pushed a migration that set the 18% GST slab
# to 16% by mistake, reverted at 20:14. Our warehouse mirror of
# tax_rates_dim refreshed at 20:00 Mon, inside the bad window, and only
# picked up the revert at 02:30 Tue, AFTER gst_daily_filing read the
# broken value at 02:00. We need to (a) recompute Tuesday's filing,
# (b) add a freshness check on tax_rates_dim before the gst job reads it,
# (c) talk to merchant-admin about not running prod migrations at 19:43
# on a Monday."
The four parts of the script that took the longest in real time were these. SELECT count(*) FROM {tbl} told Riya nothing — without historical snapshots of each candidate table, "row count now" is irrelevant. grep -r "INSERT INTO ..." scanned 230 task logs across 18 DAG runs because there is no index from "table I care about" to "tasks that wrote it". The git log on merchant-admin/migrations/ was the third place Riya looked, after she had already given up on the warehouse and the scheduler — and the actual bug was there. tax_rates_dim was a slowly-changing dimension with no time-travel; Riya had to reconstruct its Tuesday-02:00 state by reading git history of the migration that wrote it. Why this kind of detective work is unavoidable today: the bad row was produced by a system (merchant_admin) that the data team does not own and does not even monitor; the scheduler (Airflow) does not record cross-system writes; the warehouse (DuckDB / Snowflake / BigQuery in production) does not keep historical snapshots of dimension tables unless explicitly configured. Each of these is a separate fix, and together they constitute the next build — lineage, observability, and contracts.
What the next build adds — and what it does not
The next build (Build 5: lineage, observability, data contracts) attacks each of Riya's five sub-questions with a different mechanism. It is worth being precise about what each mechanism does and does not buy.
Column-level lineage turns "which of 17 tables holds the bad value?" into a graph query: "for the column gst_filing_daily.sum_18, which input columns feed it?". Tools like OpenLineage, Marquez, and dbt's model.depends_on build this graph by parsing SQL and recording the dependency edges. The graph answers Riya's first sub-question in seconds — sum_18 depends on tax_calc.taxable_amount and tax_rates_dim.rate; from there it depends on orders_with_tax.taxable_amount and tax_rates_dim again; and so on. The hard limit is cross-system writes: if tax_rates_dim is written by a Lambda that doesn't emit OpenLineage events, the graph stops at "tax_rates_dim" with no incoming edge. Riya is back to grep.
Table-format time travel (Iceberg's FOR SYSTEM_TIME AS OF, Delta's VERSION AS OF, Snowflake's AT(TIMESTAMP => ...)) turns "when did the value change?" into a SQL query of the shape SELECT rate FROM tax_rates_dim ... AS OF '2026-04-21 02:00 IST' (the exact syntax varies by engine). The lakehouse builds (Build 12) ship this primitive at the storage layer; the warehouse vendors ship it at the engine layer. The hard limit is retention — most lakehouse deployments keep about 30 days of snapshots; older incidents are gone forever. Tax filings have a 7-year retention requirement in India; the default snapshot retention does not come close.
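The primitive is easy to emulate in miniature. The sketch below replays the incident's write history (timestamps from this chapter, values illustrative) against an as-of lookup, which is what the AS OF syntaxes compute over stored snapshots:

```python
import datetime as dt

# Append-only history of writes to the 18% slab of tax_rates_dim,
# in chronological order. Values and instants mirror the chapter's incident.
history = [
    (dt.datetime(2024, 7, 1, 0, 0), 0.18),     # original rate
    (dt.datetime(2026, 4, 20, 19, 43), 0.16),  # bad backout migration (Mon)
    (dt.datetime(2026, 4, 20, 20, 14), 0.18),  # revert (Mon)
]

def rate_as_of(ts):
    """Return the last value written at or before ts — the AS OF semantics."""
    current = None
    for written_at, value in history:
        if written_at <= ts:
            current = value
    return current

print(rate_as_of(dt.datetime(2026, 4, 21, 2, 0)))   # after the revert -> 0.18
print(rate_as_of(dt.datetime(2026, 4, 20, 20, 0)))  # inside the bad window -> 0.16
```

With this one lookup, Riya's "what was the rate when the GST job ran?" stops being a git-archaeology exercise, provided the snapshots are still within retention.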
Data contracts turn "did the producer break the consumer?" into a build-time error. A contract on tax_rates_dim would specify that rate must be in [0, 1], that valid_from <= valid_to, that no row is updated more than once per day, that the table is refreshed by merchant_admin.tax_rate_publisher and nobody else. A migration that violated the "rate in [0,1]" constraint would fail at write time, not at the next morning's GST filing. The hard limit is adoption: contracts cost effort to write and maintain, and most teams don't write them until after their first three GST-style incidents.
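A write-time contract check can be sketched in a few lines of plain Python. The contract fields and the publisher name follow this chapter's example; a real implementation would enforce this at the storage layer or in a write proxy, not rely on producer goodwill:

```python
# A minimal contract expressed as a designated writer plus row invariants.
# All names here are illustrative.
CONTRACT = {
    "table": "tax_rates_dim",
    "writer": "merchant_admin.tax_rate_publisher",
    "invariants": {
        "rate in [0, 1]": lambda row: 0.0 <= row["rate"] <= 1.0,
        "valid_from <= valid_to": lambda row: row["valid_to"] is None
            or row["valid_from"] <= row["valid_to"],
    },
}

def check_write(rows, writer, contract=CONTRACT):
    """Reject a write that comes from the wrong producer or breaks an invariant."""
    if writer != contract["writer"]:
        raise PermissionError(f"{writer} is not the designated writer for {contract['table']}")
    for row in rows:
        for name, pred in contract["invariants"].items():
            if not pred(row):
                raise ValueError(f"contract violated ({name}): {row}")

good = [{"rate": 0.18, "valid_from": "2024-07-01", "valid_to": None}]
check_write(good, "merchant_admin.tax_rate_publisher")  # passes silently
```

Under this sketch the Monday-19:43 migration fails the designated-writer check at write time, whatever value it tries to write.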
Freshness SLOs turn "did the upstream finish before the downstream read?" into an alert. The gst_daily_filing job's contract should be: tax_rates_dim must be fresh as of the previous business day, with a freshness SLO of 4 hours from the last write. The Monday-19:43 broken write would have triggered a fresh-write event; the 20:14 revert would have triggered another; the 02:00 GST job would have read a tax_rates_dim whose last write (the revert) was 5h46m old, well past the 4-hour SLO, and would have failed its freshness check. Riya's job would have failed at 02:00 with a meaningful error, not Friday at 11:42 with a CFO incident.
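The freshness check itself is tiny; the hard part, as the text says, is agreeing on the SLO. A sketch using the incident's timestamps (the 4-hour SLO is this chapter's example figure):

```python
import datetime as dt

def assert_fresh(table, last_write, now, slo=dt.timedelta(hours=4)):
    """Fail the consuming job if its input is staler than the SLO allows."""
    age = now - last_write
    if age > slo:
        raise RuntimeError(f"{table} is stale: last write {age} ago exceeds SLO {slo}")

# The GST job at Tue 02:00 reading a table last (re)written Mon 20:14:
try:
    assert_fresh("tax_rates_dim",
                 last_write=dt.datetime(2026, 4, 20, 20, 14),
                 now=dt.datetime(2026, 4, 21, 2, 0))
except RuntimeError as err:
    print(err)  # the job fails at 02:00 with a meaningful error
```

The fail-the-job semantics matter: a stale input that merely logs a warning still produces a wrong filing.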
The composite of all four is what Build 5 ships — and it is not free. Lineage requires every system that touches data to emit lineage events; that is an integration project. Time travel requires storing extra snapshots; that is a cost. Contracts require human effort; teams pay it after the first incident, not before. Freshness SLOs require defining "how fresh is fresh enough"; that is a conversation between data and finance that nobody has had until Friday at 11:42.
The honest framing: Build 5 does not eliminate Friday-11:42 incidents. It changes what they cost. With lineage, the GST officer's question is answered Saturday morning instead of Monday afternoon, by a junior engineer with a graph query, instead of a senior engineer with grep. The wrong number still happened; what changed is the recovery curve — minutes-to-explanation, not days. Why "recovery time" is the right metric, not "incident count": the incident rate in data engineering is bounded below by the rate at which producers ship changes, and that rate is set by the business, not the data team. You cannot drive incidents to zero. You can drive recovery time to minutes — and that is the metric your CFO actually cares about, because the question they will ask is not "how often does this happen?" but "how fast can you tell me why it happened?".
A taxonomy of "green DAG, wrong data" failure modes
The Friday-11:42 incident is one of a small family of failure modes that schedulers cannot see. The family is worth listing explicitly, because each member has a different signature and a different remedy.
The silent partial-write. A task upserts 1.4M rows on Tuesday. On Wednesday it upserts 1.42M. The dashboard ticks up. Nobody looks. In fact the Tuesday task crashed mid-write at row 200k, retried, and completed only the second half — the first 200k rows are stale from the day before. The DAG is green; the data is wrong; nobody knows, because the headline row count looks fine. The remedy is data tests on per-partition row counts and value distributions, not just task exit codes.
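A minimal version of the per-partition check, with illustrative counts. Row count is a weak fingerprint; real checks also compare value distributions, but even this catches the half-completed upsert above:

```python
def suspicious_partitions(yesterday_counts, today_counts):
    """Partitions whose row count is unchanged since yesterday — a cheap
    signal that an upsert never touched them (e.g. a crash-and-retry that
    only completed the second half of the write)."""
    return sorted(p for p, n in today_counts.items()
                  if yesterday_counts.get(p) == n)

# Illustrative per-partition counts for the 1.4M-row table:
yesterday = {"p0": 200_000, "p1": 200_000, "p2": 200_000}
today     = {"p0": 200_000, "p1": 210_000, "p2": 212_000}  # p0 never rewritten
print(suspicious_partitions(yesterday, today))  # → ['p0']
```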
The schema-change blast. An upstream service adds a currency column to its payments event. The downstream extract still works (it ignores unknown columns) but the new column is missing from the warehouse — and a dashboard filter on currency = 'INR' silently filters to zero rows because the column is now NULL for every row. The DAG is green; the chart is empty; the product manager only notices three days later. Remedy: schema contracts that fail loudly when an unknown column appears or a known column disappears.
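A schema contract that fails loudly is, at bottom, a set comparison. The column names below are illustrative:

```python
def check_schema(expected_cols, actual_cols):
    """Fail loudly when the upstream schema drifts, instead of silently
    ignoring unknown columns or NULL-filling missing ones."""
    missing = expected_cols - actual_cols
    unknown = actual_cols - expected_cols
    if missing or unknown:
        raise RuntimeError(
            f"schema drift: missing={sorted(missing)}, unknown={sorted(unknown)}")

expected = {"payment_id", "amount", "merchant_id"}
check_schema(expected, expected)  # identical schema: passes
# check_schema(expected, expected | {"currency"}) would raise with
# unknown=['currency'] — the extract fails the moment the producer changes.
```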
The cross-system-write race. Two writers — the data pipeline and the merchant-admin Lambda — both write to tax_rates_dim. Their writes interleave. Both writes succeed; the order is determined by network jitter. The DAG sees its own write as green; the Lambda sees its own write as green; the table is in a state neither writer expects. Remedy: contracts that designate a single writer per column or per row range, enforced by the storage layer.
The freshness gap. A Spark job reads merchant_dim at 02:00 expecting a daily refresh, but the refresh job ran at 01:58 with a stale source and committed an empty table. The Spark job ran successfully against an empty table — exit code 0, output rows fewer than expected but non-zero. The dashboard shows last week's number because the join filtered most rows out. Remedy: freshness SLOs at the table level, with a fail-the-job semantics if the input table is older than the SLO.
The common shape: every member of this family is invisible to the scheduler because the scheduler's only signal is the task's exit code, and in every case there is at least one successful exit code on the path to the wrong number. The scheduler is doing exactly its job — and that is the problem.
How big this wall really is
The wall has a measurable shape. At three Indian fintechs Riya's team has worked with, the metric "time from CFO question to data-team answer" was tracked for two quarters. Pre-Build-5 (DAGs and SLAs only): mean 3.8 hours, p95 11 hours. Post-Build-5 (lineage + time travel + contracts on critical tables): mean 22 minutes, p95 2.4 hours. The 10× speed-up is not in the easy questions (those were already 5 minutes); it is in the hard questions — the ones that involve cross-team writes, slowly-changing dimensions, and the gap between "task succeeded" and "data is correct".
A second metric is trust per incident. Before lineage, every CFO incident produces a request to "audit all the things" — typically 6–8 weeks of one engineer's time, ending in a slide deck nobody reads. After lineage, the same incident produces a 1-week scoped fix on the lineage edge that broke. The cost is paid up-front (building the lineage graph) and amortised across every future incident. Why this matters at the org level: data engineering's value to the business is downstream of trust in the numbers. Every wrong-on-Tuesday incident erodes that trust by an amount the data team cannot recover with engineering alone — they have to recover it with explanations, audits, and rebuilt processes. Build 5 is the technical investment that protects the political capital. A team that ships Airflow without lineage spends 30% of its calendar quarter on incident archaeology by year three; the same team with lineage spends 5%.
Why the scheduler can't help — even in principle
It is tempting to read the GST incident as "Airflow is missing a feature" and assume the next version will ship it. It will not, and the reason is structural. The scheduler is a control plane: it executes a finite graph of tasks defined in code. The data plane — what the tasks actually read and write — is everything that lives outside that graph. Lambdas, ad-hoc SQL, Spark notebooks, BI dashboards, manual psql sessions, the merchant-admin service writing to tax_rates_dim. None of these are tasks. None of them appear in the DAG. None of them participate in the scheduler's metadata model.
Even if the scheduler could see them, recording lineage at the task level is the wrong abstraction. A single task that runs spark.sql("INSERT INTO ... SELECT * FROM a JOIN b JOIN c") reads three tables and writes one — the task is one node, but the lineage edges are four. Push hard on this and you discover that lineage's natural unit is the query (or the transformation step within a query), not the task. Schedulers don't see queries; they see tasks.
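The point that one task yields several lineage edges can be made with a deliberately crude extractor. Real lineage tools use a full SQL parser; the regex below misses CTEs, quoting, and subqueries, but it shows the shape:

```python
import re

def naive_edges(sql, task):
    """Extract table-level lineage edges from one SQL statement with a crude
    regex — enough to show one task node fanning out into several edges."""
    reads = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.I)
    writes = re.findall(r"INSERT\s+INTO\s+([\w.]+)", sql, re.I)
    return ([(task, "reads", t) for t in reads]
            + [(task, "writes", t) for t in writes])

edges = naive_edges("INSERT INTO tax_calc SELECT * FROM a JOIN b JOIN c",
                    "spark_task")
print(edges)  # one task node, four lineage edges
```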
The correct architectural separation is the one Build 5 introduces: schedulers (Airflow, Dagster, Prefect) decide what runs and when; catalogs and lineage systems (DataHub, OpenLineage, Atlan) record what data exists, who wrote it, and what reads it; query engines (Trino, Snowflake, BigQuery) emit lineage events for each query. The three layers federate at the edges. The scheduler emits run IDs; the query engines emit query IDs tagged with run IDs; the catalog stitches them together. Each layer stays small. The wall in this chapter is the moment a team realises they have only built the first of those three layers.
Three patterns of how teams hit this wall
The wall is not one cliff that all teams fall off at the same time. Three distinct patterns trigger it, and which pattern your team is in determines which Build-5 mechanism you should reach for first.
Scale-driven. This is Riya's team. The DAG count is high, tables are numerous, and the on-call engineer cannot navigate the data graph by hand. The first reaction is to add more SLAs — but SLAs only catch late tasks, not wrong tasks. The actual fix is column-level lineage so the on-call has a graph to walk instead of a wiki to skim. Teams in this pattern typically have an Airflow deployment that has aged 18–36 months and a data catalog project that someone proposed last quarter and nobody staffed.
Org-driven. A second team starts writing to your tables. At Razorpay, this is the moment when the merchant-onboarding team writes to merchants directly because their Lambda needs to update KYC status, and the payments team that thought they owned merchants discovers the cross-team write at 3 a.m. when a merchant's KYC status flips mid-batch and breaks settlements. The first reaction — "forbid cross-team writes" — fails politically; the second team has business reasons. The actual fix is data contracts that make the cross-team write legible (with an owner, a schema, a freshness SLO, an alert channel). The contract turns "the merchants table is mine" into "the merchants table is jointly owned, here is who writes which columns when, and here is what breaks if any of that changes".
Regulator-driven. RBI, SEBI, GST, or DPDP-Act compliance demands "for any row in your filing, show the source records and the transformations that produced it". This is the hardest of the three — it requires not just lineage and time travel but retention (the regulator's window is 7 years; lakehouse default is 30 days) and replayability (the regulator wants to see the SQL that produced the row, not just a list of inputs). The first reaction — snapshot every table nightly to S3 — is correct in spirit but expensive in storage and useless for queries. The actual fix is time travel at the table-format layer (Iceberg / Delta) plus run-level lineage that ties each row to a specific DAG run, plus a long retention policy on both. The fintechs that did this right in 2024–25 have audit response times measured in hours; the ones that didn't have audit response times measured in months.
The same team can hit two patterns at once. A growing fintech often hits scale and org simultaneously — the DAG count crosses 100 in the same quarter that a second team starts contributing data — and the wall feels twice as steep. The order to fix is: lineage first (so you can see the graph), contracts second (so you can constrain it), time travel third (so you can replay it). Reverse that order and you build a system you can replay but cannot navigate.
There is a fourth, subtler pattern: the acquisition-driven wall. A company acquires another company's data assets — a Bengaluru fintech buys a smaller payments competitor — and inherits 60 DAGs they did not write, 200 tables they do not understand, and a data team they cannot fully retain. The wall here is not a count of DAGs but a gap of context: the surviving engineers know how the system works in their head but cannot transfer the knowledge in the time the integration plan allows. Lineage and catalogs help, but only if they were already in place at the acquired company; building them retroactively against a system you didn't write is the hardest variant of this work.
A subtler reason all four pattern triggers feel similar from inside the data team: each one is a moment when the number of independent writers to a single table exceeds the team's working memory. With one writer per table, the whole graph fits in a head. With three writers per table across four teams, no single human knows the state of the system, and the only way to navigate is by querying a graph someone has already built. Lineage is the externalisation of that working memory; data contracts are the externalisation of the social rules around who can write what. Both are necessary because the team has outgrown the size at which informal coordination scales.
Common confusions
- "Airflow logs are observability." They are task logs — they tell you that a task ran and what it printed to stdout. They do not tell you what data the task read or wrote, what the values were, or who downstream consumed the output. Mistaking task logs for data observability is the most common reason teams think they have observability when they don't.
- "Lineage is the same as a DAG." A DAG is a graph of task executions in a single scheduler. Lineage is a graph of data artefacts across all systems — including the Lambda that nobody told the data team about, the manual SQL that the analyst ran in BigQuery, the BI tool's caching layer. Lineage extends beyond the scheduler's blast radius; the DAG does not.
- "Time travel solves debugging." It solves "what was the value at time T?" — which is one of the five sub-questions, not all five. Without lineage, you don't know which table to time-travel into. Without contracts, you don't know what a "valid" past value would have looked like. Time travel is necessary, not sufficient.
- "Data contracts are just schemas." A schema constrains the shape of the data (columns, types, nullability). A contract constrains the meaning — which producer owns the table, what the freshness SLO is, what business invariants hold (e.g. rate IN [0, 1]), what the breakage policy is. A schema validates that rate is a float; a contract validates that it is between 0 and 1 and was written by merchant_admin.tax_rate_publisher.
- "You only need lineage for the critical tables." True at first; false at scale. The CFO incident from the opening lived in tax_rates_dim, which the data team did not consider critical until Friday at 11:42. Identifying critical tables in advance requires the same lineage graph you would build for full coverage. Most teams settle on "lineage everywhere, alerting only on critical paths".
- "Build 5 replaces Build 4." It does not. Build 4 (the scheduler) is the control plane — what runs and when. Build 5 (lineage / observability / contracts) is data-plane observability — what the data is and who touches it. A production system needs both, and the two abstractions are orthogonal: a scheduler with no lineage is blind; a lineage graph with no scheduler has no events to consume.
Going deeper
Why "exit code 0" is a poor proxy for "data is correct"
The scheduler's only signal is the task's exit code. A Python task that returns 0 is "successful" by the scheduler's definition, regardless of whether the rows it wrote are correct, whether it wrote the right number of rows, or whether the columns have the expected distribution. A wrong-but-exit-0 task is the dominant failure mode at scale — exit-non-zero failures are caught by retries; exit-0-but-wrong failures land in production. The next build introduces data tests (Great Expectations, Soda, dbt tests) that run after the task and assert properties of the output, turning data-correctness into a second exit code that the scheduler can react to. The pattern is: every task ships a paired test, and the test's exit code is the one that decides whether downstream tasks see the output.
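The paired-test pattern reduces to a small wrapper: the output becomes visible to downstream only if the test passes. Everything here is a sketch; `publish` stands in for whatever makes the output table live (a rename or a pointer swap) in a real deployment:

```python
def run_with_paired_test(task, test, publish):
    """Run a task, then its paired data test; publish the output only if the
    test passes. The test's verdict, not the task's exit code, is what
    decides whether downstream tasks ever see the data."""
    out = task()
    ok, reason = test(out)
    if not ok:
        raise RuntimeError(f"data test failed, output withheld: {reason}")
    publish(out)
    return out

# A toy task whose exit code would be 0 either way; only the test differs:
rows = run_with_paired_test(
    task=lambda: [{"slab": "18%", "rate": 0.18}],
    test=lambda out: (len(out) > 0, "row count is zero"),
    publish=lambda out: None,  # placeholder for the real publish step
)
```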
OpenLineage and the "events on every system" problem
OpenLineage is the emerging standard (2026) for lineage events: every system that reads or writes data emits a structured event with inputs, outputs, runId, parentRun, and facets. Airflow ships an OpenLineage provider; dbt ships one; Spark ships one; Trino ships one. The hard part is the long tail — the bash script that runs aws s3 cp, the Lambda that mutates a config table, the analyst with a psql shell. Each of these is a lineage gap, and each lineage gap is a future Friday-11:42 incident. Teams that take lineage seriously install OpenLineage agents at the infrastructure layer (a network proxy that records every SQL query, an S3 access log shipper) so that even the long tail leaves traces.
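A minimal run event can be assembled by hand, which is exactly what the long-tail agents do for systems without a native provider. The field names below follow the published OpenLineage event shape; the namespaces and the producer URL are illustrative placeholders, and a real deployment would use the openlineage-python client and POST the event to its collector:

```python
import json
import uuid
import datetime as dt

def lineage_event(job_name, inputs, outputs, event_type="COMPLETE"):
    """Assemble a minimal OpenLineage-style run event by hand. Namespaces
    and the producer URL are placeholders, not a real deployment's values."""
    return {
        "eventType": event_type,
        "eventTime": dt.datetime.now(dt.timezone.utc).isoformat(),
        "producer": "https://example.com/our-pipeline",  # placeholder
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "airflow", "name": job_name},
        "inputs": [{"namespace": "warehouse", "name": t} for t in inputs],
        "outputs": [{"namespace": "warehouse", "name": t} for t in outputs],
    }

event = lineage_event(
    "gst_daily_filing",
    inputs=["orders_with_tax", "tax_rates_dim"],
    outputs=["gst_filing_daily"],
)
print(json.dumps(event, indent=2)[:200])
```

The catalog stitches events like this one together by runId, which is how the scheduler's run IDs and the query engines' query IDs federate into one graph.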
What this wall looks like in 2026 — and what changed in the last five years
In 2021 the wall was solved badly: each company built a homegrown lineage tool, none of them talked to each other, and the on-call experience was identical to Riya's afternoon. In 2026 there are three credible answers — OpenLineage (the open standard), DataHub (LinkedIn's open-source catalog), and the vendor catalogs (Atlan, Collibra, Alation) — and the integration cost has dropped from "build it yourself" to "install three providers and configure their endpoints". The change is real but uneven. Teams that started in 2024 with OpenLineage as a default have <30-minute mean-time-to-debug. Teams that bolted catalogs on after a Riya-style incident have a partial graph that lies on the cross-system writes — the worst of both worlds, because the team trusts the graph and the graph silently lies.
The cultural shift is more pronounced than the technical one. At the larger Indian fintechs (Razorpay, PhonePe, Cred, Zerodha) in 2026, candidates ask "do you have OpenLineage in production?" the way they used to ask "do you use Airflow or Dagster?". Lineage has become a hiring conversation — engineers who have worked with column-level lineage are unwilling to go back to grep, and an on-call rotation at a team without lineage is a recruiting liability. The team that ships lineage first inside a company tends to be the data-platform team, not analytics or ML; a platform team's lineage investment unlocks every downstream team's debugging, while an analytics team's lineage only helps one team. Companies that started with analytics-owned lineage usually rebuild it 18 months later as a platform service, paying twice for the same capability.
The "data observability" market and what each tool actually buys you
By 2026 there is a crowded market of tools claiming to solve this wall — Monte Carlo, Bigeye, Acceldata, Lightup, Soda, Datafold, plus the catalog vendors (DataHub, Atlan, Collibra) that ship overlapping features. Reading their landing pages you would think they all do the same thing. They do not. The market has split along three axes. Anomaly detection (Monte Carlo, Bigeye) flags when a metric drifts from its historical distribution — useful for the "the count of orders dropped 40% overnight" class of incident, useless for the "tax rate was 0.16 instead of 0.18 for 4 hours" class. Data testing (dbt tests, Great Expectations, Soda) lets you declare invariants per table — useful when you know what to test for, useless for novel breaks. Lineage + catalog (DataHub, OpenLineage, Atlan) builds the navigation graph — useful for every incident after you have already noticed something is wrong, but it does not detect the break. Most teams need all three; what differs is the order. Riya's team would benefit most from lineage first (so the next CFO question is answerable in 20 minutes) and contracts second (so the bad write at Monday 19:43 fails at write time, not at the GST filing 6 hours later).
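The split between the first two axes can be shown in a few lines. This is a minimal sketch, not any vendor's algorithm: the z-score threshold, the metric history, and the accepted tax-rate set are all illustrative assumptions, chosen only to show why each style catches one class of incident and misses the other.

```python
"""Sketch contrasting two detection styles: a distribution-based anomaly
check on a metric's history vs. a declared invariant on a column.
Thresholds and data are illustrative."""
from statistics import mean, stdev

def anomaly_check(history, today, z_threshold=3.0):
    """Flag today's metric if it sits >z_threshold std-devs from history."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold

def invariant_check(rows, column, accepted):
    """Return every row whose column value is outside the declared set."""
    return [r for r in rows if r[column] not in accepted]

# The "orders dropped 40% overnight" class: anomaly detection catches it.
order_counts = [10120, 9980, 10240, 10050, 10110]
print(anomaly_check(order_counts, today=6020))   # flagged

# The "0.16 instead of 0.18" class: only a declared invariant catches it,
# because one wrong constant barely moves any aggregate metric.
rows = [{"tax_rate": 0.18}, {"tax_rate": 0.16}]
print(invariant_check(rows, "tax_rate", accepted={0.18}))
```

Neither function helps after the incident is noticed — that is the third axis, the lineage graph, which is about navigation rather than detection.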
Why streaming hits a steeper version of the same wall
Everything in this chapter has been framed in terms of batch DAGs — daily orders extract, GST filing, merchant snapshot — because that is where most teams hit the wall first. Streaming pipelines hit a steeper version of the same wall, faster. A Kafka topic with 4-day retention does not let you "look back at last Tuesday"; the messages are gone. A Flink job that processed a bad event has already committed the side-effects to a Postgres sink and a feature store; rolling back means recomputing 4 days of state from a checkpoint. Snowflake's Object Tagging and BigQuery's Policy Tags partially help in batch by attaching owner / PII / SLO metadata to columns and letting it inherit through query lineage — but for streaming there is no equivalent because the rows have aged out before the tags would inherit. The mechanisms in Build 5 — lineage, time travel, contracts — apply to streaming, but the retention dimension dominates: the lineage events for a streaming job have to be persistent (in Iceberg, in OpenSearch, anywhere) because the messages they describe are not.
Why this is harder than batch: a batch job's inputs and outputs are tables that you can re-query; a streaming job's inputs are events that have aged out of the broker, and its outputs are derived state that has been mutated again. Reconstructing "what was true at 02:14 on Tuesday" requires both a lineage event for that processing window and a snapshot of the derived state at that point — neither of which the streaming framework gives you for free. Build 8 (stateful stream processing) returns to this when it introduces checkpoints and exactly-once semantics; the lineage layer on top of it is one of the active research areas in 2026.
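The "persist lineage per processing window" idea can be sketched as a window-close hook. This is a hypothetical illustration, not any streaming framework's API: the record fields, the topic/partition offset notation, and the file-based storage are all assumptions — in production the durable store would be Iceberg, OpenSearch, or similar, and the hook would hang off the framework's checkpoint or window-trigger callback.

```python
"""Sketch of persisting a lineage record at window close: durably record
what was read (topic offsets) and a fingerprint of the derived state, so
"what was true at 02:14 on Tuesday" stays answerable after the broker has
aged the messages out. Storage here is a local JSON-lines file as a
stand-in for a durable store; all names are illustrative."""
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone

# Stand-in for a durable store (Iceberg table, OpenSearch index, ...).
LINEAGE_LOG = os.path.join(tempfile.gettempdir(), "window_lineage.jsonl")

def state_fingerprint(state):
    """Stable hash of the derived state at window close."""
    blob = json.dumps(state, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]

def on_window_close(window_start, window_end, input_offsets, state):
    """Append one lineage record per processing window."""
    record = {
        "windowStart": window_start,
        "windowEnd": window_end,
        "inputOffsets": input_offsets,      # e.g. {"orders-topic/0": 91823}
        "stateFingerprint": state_fingerprint(state),
        "emittedAt": datetime.now(timezone.utc).isoformat(),
    }
    with open(LINEAGE_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

record = on_window_close(
    "2026-02-10T02:10:00Z", "2026-02-10T02:15:00Z",
    {"orders-topic/0": 91823, "orders-topic/1": 90517},
    state={"merchant_42_gst_liability": 12950.0},
)
print(record["stateFingerprint"])
```

The offsets answer "which events fed this window" even after the events are gone, and the fingerprint answers "has the derived state been mutated since" — the two pieces the paragraph above says the framework does not give you for free.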
Where this leads next
- Column-level lineage: why it's hard and why it matters — chapter 29, the first response to the wall: tracing every output column back to its inputs.
- Data catalogs and the "what does this column mean" problem — chapter 30, the catalog as the place where lineage, ownership, and meaning live together.
- Data contracts: the producer/consumer boundary — chapter 31, the social contract that prevents Monday-19:43 migrations from breaking Tuesday-02:00 jobs.
- Freshness SLOs: the data-eng analog of uptime — chapter 33, the operational metric that surfaces stale-but-green-DAG incidents at 02:01, not Friday-11:42.
The wall in this chapter is not "Airflow is bad" — Airflow is doing exactly its job, which is to schedule tasks. The wall is that scheduling is half the job, and the other half is data observability, which lives in a different layer. Build 5 is the construction of that layer. The next chapter starts with the simplest piece: the lineage graph itself, and why building one for SQL is harder than it sounds.
Riya's Friday-11:42 incident closed at 18:14 with a one-paragraph email to the CFO and a Confluence page titled "GST tax-rate revert: post-mortem". The post-mortem listed 11 action items. Four of them — column-level lineage on gst_filing_daily, a freshness SLO on tax_rates_dim, a data contract on the merchant-admin write path, and time-travel snapshots on the four most critical dimension tables — are the literal table of contents of Build 5. The wall is the moment a team writes that post-mortem, and Build 5 is the engineering response. Every chapter that follows is one of those four action items, taken seriously.
You will write that post-mortem too. Most data engineers do, eventually. The question is whether you write it before or after the CFO walks in.
The cheapest way to answer that question correctly is to read the next four chapters in order, and to apply each one to whichever of your tables is the next plausible candidate to surface a Friday-11:42 incident.
References
- OpenLineage specification — the 2026 standard for lineage events; the data model, facets, and event types every integration emits.
- DataHub (LinkedIn) — open-source data catalog with built-in lineage and ingestion from 50+ sources; the reference implementation for "catalog + lineage" as one product.
- Marquez — the OpenLineage reference backend; useful to read the source to understand how lineage events become a graph.
- Apache Iceberg time travel docs — the SQL syntax for "as of timestamp" queries that closes Riya's "when did the value change?" gap.
- Data Contracts: The Missing Foundation, by Andrew Jones (2023) — the canonical book on the producer/consumer boundary; especially chapter 4 on the "shift-left" of data quality.
- "Towards Observable Data Pipelines" — Petrella et al., VLDB 2024 — the academic framing of why scheduler-only observability is insufficient; useful for the precise definitions of data-quality vs task-quality signals.
- Airflow vs Dagster vs Prefect — chapter 26, the framework comparison that already hints at how Dagster's asset model partially addresses this wall.
- The three walls: no idempotency, no observability, no schedule — chapter 4, the chapter that named "no observability" as a wall in the abstract; this one names what it costs in concrete debugging time.