Column-level lineage: why it's hard and why it matters

At 9:14 a.m. on a Wednesday in Bengaluru, Aditi from the Razorpay finance team pings the data platform Slack: "yesterday's settlement dashboard shows ₹2.1 crore less revenue than the operations dashboard for the exact same merchants — which one is right?" The data platform has table-level lineage; it shows that both dashboards read from mart.merchant_revenue_daily, which reads from stg.txn_settled, which reads from the source CDC stream pg.razorpay_prod.transactions. Five tables, all green, all with successful runs overnight. Table-level lineage cannot answer Aditi's question. The actual bug — which the on-call engineer Karan finds two hours and three coffees later — is that one column, gmv_inr, was silently changed three days ago when someone added a COALESCE(gmv_inr, 0) to the staging view. The dashboard that aggregates over merchant_id, date got the same total; the dashboard that filters WHERE gmv_inr > 0 lost 14,000 rows. Table-level lineage said both dashboards read the same table; the real story lives one level down.

Column-level lineage tracks how each output column derives from input columns through every transformation — not just which tables flow into which. It is hard because parsing arbitrary SQL/Python/Spark transformations to extract column-level dependencies is a compiler problem, and it matters because most production data bugs change the meaning of one column, not the existence of a table.

What column-level lineage actually means

Table-level lineage answers "which tables feed into this table?" Column-level lineage answers a different, sharper question: "for each output column, which input columns and which transformations contributed to its value?" The difference is the difference between knowing the building's address and knowing which pipe in which wall is leaking.

[Figure: Table-level vs column-level lineage. Two side-by-side panels. Left, "Table-level lineage — useful but blind": three boxes (transactions, txn_settled, merchant_revenue_daily) connected by thick arrows, only table boundaries visible ("the dashboard reads a table that reads a table that reads the source"). Right, "Column-level lineage — the real picture": the same three tables with their columns listed inside (transactions: id, amount_paise, status, merchant_id, created_at; txn_settled: txn_id, gmv_inr, merchant_id, date; merchant_revenue_daily: merchant_id, date, total_gmv, txn_count) and thin arrows connecting individual columns across tables: txn_id maps to id, amount_paise and status flow into gmv_inr through a CASE expression, merchant_id flows through directly. Every line is a column-level edge; dashed lines mark edges via aggregation.]
Left: table-level lineage knows about three tables. Right: column-level lineage knows that total_gmv traces back to amount_paise through a multiplication and a SUM, and that txn_count is a COUNT over txn_id. The bug from the lead lives in the gmv_inr edge.

The mental model: each output column is a function of zero or more input columns. The function may be the identity (SELECT merchant_id), a unary expression (amount_paise / 100 AS gmv_inr), an aggregate (SUM(gmv_inr) AS total_gmv), a join key, a CASE expression, a window function, a UDF call, or a Python pandas expression. Column-level lineage records, for each transformation step, the exact set of input columns that participate in computing each output column, and the kind of dependency.

The unit of column-level lineage is one transformation step's column-mapping; the asset of column-level lineage is the transitive closure of all such mappings across the pipeline. This is the same dual structure as the asset graph in Dagster (per-asset definitions vs end-to-end materialised graph): per-step lineage is what each tool emits, end-to-end lineage is what the platform team queries. A column-lineage system is essentially a database that ingests per-step lineage events from many emitters and serves transitive-closure queries to many consumers.

Why "kind of dependency" matters: a column derived by SUM(amount_paise) WHERE status = 'SUCCESS' depends on both amount_paise (data) and status (filter). If someone changes the meaning of status upstream, the SUM changes — even though amount_paise was untouched. Column-level lineage that only records "data dependencies" misses filter dependencies and produces false negatives during incident triage.

The richer column-lineage models record at least four kinds of edges: direct (the output is a function of this input column's value), indirect/control (this column appears in a WHERE/JOIN/CASE condition), aggregation (this column is reduced by SUM/COUNT/AVG), and join-key (this column is matched in an ON clause). OpenLineage's column-level lineage spec, the de facto standard since 2024, encodes these as transformationType values; the same shape appears in dbt's lineage.json, Marquez's data model, and DataHub's column-level relationships.

A useful sanity check when evaluating any tool's claim of "column-level lineage": ask the tool to render the lineage of a column produced by a CASE expression that filters on a different column. If the tool shows only the value-source column and not the filter-source, it is doing column-name overlay on table-level lineage. If it shows both, classified separately, it is doing real column-level lineage. This single test correctly separates the four or five tools that genuinely deliver the capability from the dozen that market it.

Why parsing column-level lineage is a compiler problem

The reason column-level lineage is hard — and the reason most teams have only the table-level version — is that you cannot extract it without parsing every transformation in the pipeline back to its column-level operations. For SQL, this means writing or borrowing a SQL parser that produces an annotated AST. For Python and Spark, it means tracing dataframe operations across method calls, aliases, and dynamic column construction.

# A column-level lineage extractor for a single SELECT, using sqlglot
# (the parser dbt and Snowflake both use under the hood in 2026)
import sqlglot
from sqlglot import exp
from sqlglot.lineage import lineage

# A real Razorpay-shaped query — the staging view that aggregates txn rows
# into a per-merchant daily summary, including the COALESCE that caused the bug
sql = """
SELECT
    t.merchant_id,
    DATE(t.created_at) AS date,
    SUM(COALESCE(t.amount_paise, 0)) / 100.0 AS gmv_inr,
    COUNT(t.id) AS txn_count,
    SUM(CASE WHEN t.status = 'SUCCESS' THEN t.amount_paise ELSE 0 END) / 100.0
        AS settled_gmv_inr
FROM razorpay_prod.transactions t
WHERE t.created_at >= '2026-04-24'
GROUP BY t.merchant_id, DATE(t.created_at)
"""

ast = sqlglot.parse_one(sql, dialect="postgres")

# Extract the lineage of each output column
output_columns = [s.alias_or_name for s in ast.expressions]
for out_col in output_columns:
    node = lineage(out_col, ast, dialect="postgres")
    # Leaf nodes of the lineage tree carry a source Table expression;
    # their names are the qualified source columns (e.g. "t.created_at").
    inputs = sorted({
        n.name.split(".")[-1]
        for n in node.walk()
        if isinstance(n.expression, exp.Table)
    })
    print(f"{out_col:20s} <- {', '.join(inputs)}")
# Output:
# merchant_id          <- merchant_id
# date                 <- created_at
# gmv_inr              <- amount_paise
# txn_count            <- id
# settled_gmv_inr      <- amount_paise, status

The output makes the column-graph explicit. gmv_inr depends on amount_paise only — a SUM of one column with COALESCE handling. settled_gmv_inr depends on amount_paise AND status — the filter inside the CASE is captured. Why this distinction is non-trivial: a naive lineage extractor that walks the SELECT list and records "every column referenced anywhere in the expression" would be correct here, but would over-attribute if the CASE referenced an unrelated column for branching. Real extractors classify the dependency: amount_paise is a data dependency, status is a filter dependency. OpenLineage's columnLineage facet has a transformations[].type field for exactly this — IDENTITY, TRANSFORMATION, AGGREGATION, FILTER, JOIN.

The hard part is what happens past a single SELECT. Real pipelines are graphs of transformations: dbt models reference other dbt models, Spark jobs read DataFrames built up from earlier transformations, Python pandas pipelines mutate columns across function calls. To get end-to-end column-level lineage, the extractor has to chain the per-step lineage into a transitive closure.

# Stitching per-step lineage into end-to-end column lineage
# Each step contributes a dict of {output_col: set(input_col)}
steps = [
    # step 1: source -> stg.txn (rename + cast)
    {
        "stg.txn.txn_id":      {("razorpay_prod.transactions", "id")},
        "stg.txn.amount_inr":  {("razorpay_prod.transactions", "amount_paise")},
        "stg.txn.merchant_id": {("razorpay_prod.transactions", "merchant_id")},
        "stg.txn.status":      {("razorpay_prod.transactions", "status")},
    },
    # step 2: stg.txn -> mart.merchant_revenue_daily (the bug lives here)
    {
        "mart.merchant_revenue_daily.merchant_id":
            {("stg.txn", "merchant_id")},
        "mart.merchant_revenue_daily.total_gmv":
            {("stg.txn", "amount_inr"), ("stg.txn", "status")},  # filter dep
        "mart.merchant_revenue_daily.txn_count":
            {("stg.txn", "txn_id")},
    },
]

def transitive_closure(steps):
    """Resolve indirect dependencies through the chain of steps."""
    resolved = {}
    for step in steps:
        for out_col, deps in step.items():
            new_deps = set()
            for table, col in deps:
                key = f"{table}.{col}"
                if key in resolved:
                    new_deps.update(resolved[key])
                else:
                    new_deps.add((table, col))
            resolved[out_col] = new_deps
    return resolved

closure = transitive_closure(steps)
for out_col in sorted(closure):
    sources = sorted(closure[out_col])
    src_str = ", ".join(f"{t}.{c}" for t, c in sources)
    print(f"{out_col} <- {src_str}")
# Output:
# mart.merchant_revenue_daily.merchant_id <- razorpay_prod.transactions.merchant_id
# mart.merchant_revenue_daily.total_gmv <- razorpay_prod.transactions.amount_paise, razorpay_prod.transactions.status
# mart.merchant_revenue_daily.txn_count <- razorpay_prod.transactions.id

The closure hands Karan, from the lead-paragraph incident, in one query the answer that took him two hours to find: the bug in total_gmv was either in amount_paise or in status upstream — not in the JOIN, not in the date filter, not in merchant_id. From a 47-column transactions table, column lineage narrows the search to two columns. Why a transitive closure is necessary, rather than a graph traversal at query time: real warehouses have 5-to-15-step lineage chains (raw → bronze → silver → gold → mart → BI), and an interactive incident-response query like "what changed yesterday that affects this dashboard cell?" needs sub-second latency. Pre-computing the closure into a denormalised lineage table is what makes column-level lineage usable as a debugging tool rather than a museum piece.
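The denormalised-closure idea can be sketched as a flat row store plus a lookup index built at ingest time, so the incident-response query is a dictionary hit rather than a graph walk. The row shape (output column, source column, hop count) and the names below reuse the example pipeline; they are illustrative, not any particular tool's schema.

```python
# Closure rows as they would land in a denormalised lineage table.
closure_rows = [
    ("mart.merchant_revenue_daily.total_gmv",
     "razorpay_prod.transactions.amount_paise", 2),
    ("mart.merchant_revenue_daily.total_gmv",
     "razorpay_prod.transactions.status", 2),
    ("mart.merchant_revenue_daily.txn_count",
     "razorpay_prod.transactions.id", 2),
]

# Build the lookup index once, at ingest time.
upstream_index: dict[str, list[tuple[str, int]]] = {}
for out_col, src_col, hops in closure_rows:
    upstream_index.setdefault(out_col, []).append((src_col, hops))

def upstream_of(column: str) -> list[str]:
    """Constant-time answer to 'what feeds this column?'."""
    return sorted(src for src, _ in upstream_index.get(column, []))
```

The same pre-computed shape is what lets the query scale: refreshing the closure nightly is cheap compared with walking a multi-hop graph during an incident.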

The compiler-problem nature of this work means coverage is uneven across pipeline types. The honest 2026 coverage map for a typical Indian fintech: SQL via dbt + sqlglot — 95% of column-level edges captured automatically; Spark SQL via OpenLineage — 90%; Spark DataFrame API — 70%, with the gaps being UDFs and dynamic column construction; Trino/Presto query log scraping — 80%, missing CTAS-with-stored-procedure rewrites; PyFlink — 50%, with most non-SQL operators emitting only table-level facets; raw Python pandas — under 20% without a wrapper. The teams that report "we have full column-level lineage" almost always mean "the SQL portion of our pipeline has full column-level lineage", which is enough for finance/regulatory cells but not for ML feature pipelines.

SQL — which has a finite, well-specified grammar — is solvable; sqlglot, OpenLineage's SQL parser, and dbt's compile output all give clean column-level extraction. PySpark is harder because the same transformation can be written as df.withColumn("x", df.a + df.b) or as df.selectExpr("a + b AS x"); OpenLineage's Spark integration parses the Catalyst logical plan after Spark resolves it, which works but ties the lineage to Spark's planner. Pandas is the hardest — df["x"] = df["a"] * 2 is just Python attribute access, and there is no static analysis short of running the code with shadow tracing. Marquez and DataHub punt on raw pandas; the standard answer is "wrap your pandas in a typed framework like Pandera or use Polars's lazy plan, then we can extract lineage from the plan".

The taxonomy of transformation types you must capture

A column-lineage system that records only "this output depends on these inputs" misses half the operational value. The richer systems classify each edge by the kind of computation that produced it. The taxonomy that has stabilised across OpenLineage, DataHub, and Marquez by 2026 distinguishes five edge classes, each with different debugging implications.

Identity. SELECT merchant_id FROM t — the output column is byte-identical to the input column. Bugs propagate directly: if merchant_id was wrong upstream, every downstream identity edge is wrong. Identity edges are the cheapest to follow during incident triage because there is nothing to inspect — the value either matches or doesn't.

Transformation. SELECT amount_paise / 100 AS gmv_inr — the output column is a deterministic function of one or more input columns. Bugs may or may not propagate depending on the function; a ROUND in the chain can mask upstream noise. The edge metadata should record the expression text (amount_paise / 100) so the on-call can replay the function locally.

Aggregation. SUM(amount_paise) GROUP BY merchant_id — the output is reduced over many input rows. The blast-radius semantics flip: a single bad input row affects only its group's output, but a systematic upstream change (e.g., a duplicate-row bug) inflates every aggregate. The edge metadata records the aggregation function (SUM, COUNT, AVG, COUNT_DISTINCT) and the GROUP BY columns.

Filter / control. SELECT amount_paise FROM t WHERE status = 'SUCCESS'. The status column does not appear in the output, but it controls which rows contribute. This is the dependency type the lead-paragraph bug exposed: changing status's meaning (or its set of permitted values) silently changes every downstream aggregate that filters on it. Filter edges are easy to miss because the column does not appear in the output column list.

Join. ... FROM a JOIN b ON a.merchant_id = b.merchant_id — the join keys are dependencies of every output column, because the row-set itself is conditioned on the join condition holding. A NULL-handling change in the join key, or a key-collision bug from upstream deduplication, can silently inflate or shrink every output column.

Why this taxonomy matters more than you'd think: incident-response UIs that don't classify edges produce huge blast-radius lists ("87 columns affected") that overwhelm the on-call. Classified edges let the UI rank by likelihood — "the changed column has 3 direct edges, 12 transformation edges, 47 filter edges; investigate the 3 direct edges first". Razorpay's internal incident UI surfaces direct edges in a panel above the fold and filter/join edges below; the median time-to-triage drops by 40% with this layout vs an unclassified flat list.
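The ranking idea can be sketched in a few lines: classify each candidate edge by its kind, then sort the triage list so value-carrying edges come first. The priority order below is an assumption modelled on the paragraph above, not Razorpay's actual UI logic.

```python
# Lower number = investigate first. The ordering is an illustrative
# assumption: identity and transformation edges carry the value
# directly, filter edges only condition the row set.
EDGE_PRIORITY = {"identity": 0, "transformation": 1,
                 "aggregation": 2, "join": 3, "filter": 4}

def triage_order(edges: list[tuple[str, str]]) -> list[str]:
    """edges = [(column, edge_kind)]; return columns, likeliest culprit first."""
    return [col for col, kind in
            sorted(edges, key=lambda e: EDGE_PRIORITY[e[1]])]
```

With classified edges, a blast-radius list of 87 columns collapses into a short ranked queue for the on-call.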

Why it matters: the four jobs only column lineage can do

Column-level lineage exists because four classes of problem cannot be solved with table-level lineage alone. Each one is the kind of thing that wakes up the on-call.

[Figure: The four jobs column-level lineage does that table-level lineage cannot. A 2x2 grid. Top-left, incident triage: a dashboard cell highlighted red, traced back through column-level edges to a single source column ("where did this number come from when it was right"). Top-right, GDPR / DPDP delete: a user_email column with all its descendants (marketing.list, notif.queue, audit.events, plus 11 more) marked for review ("every place email landed"). Bottom-left, schema-change impact: a source column annotated "amount: int → decimal", 87 consumers, 14 cast issues, 3 that will silently fail ("who breaks if I change this?"). Bottom-right, regulatory audit: a chain from an RBI report cell through aggregated metrics back to the source row ("prove this number to the regulator"), required by RBI since 2025 for fintechs.]
Each quadrant fails with table-level lineage. Incident triage on a single column. GDPR/DPDP delete tracking PII through every derivation. Schema-change blast radius before a breaking change. Regulatory audit traces from report cell to source row.

Job 1: incident triage at column granularity. When a dashboard cell is wrong, the question is "which transformation, on which day, changed the meaning of this cell's column?" Table-level lineage gives a list of 5–15 tables to investigate, each with 20–80 columns. Column-level lineage gives 1–3 columns. The Razorpay team's incident-response postmortem from 2024 measured this directly: median time-to-root-cause for column-meaning bugs went from 4.2 hours to 38 minutes after column-level lineage shipped to their incident-response UI. The cost saving is not the engineering hours; it is the dashboards that were wrong for 4 hours instead of 38 minutes, the merchant-facing reports that went out with stale numbers, and the loss of confidence the finance team has in every dashboard for weeks afterward.

Job 2: GDPR / DPDP-2023 PII delete. India's DPDP Act 2023 (in force since August 2024) requires that when a user requests deletion, every derived record containing their personal data must be removed or anonymised — not just the source row. A user's email lands in dozens of derived tables: marketing lists, notification queues, audit logs, ML feature tables, recommendation embeddings. Without column-level lineage, the only safe answer is to scan every column of every table for the email value, which is operationally impossible at petabyte scale. With column-level lineage, the team enumerates the descendant columns once and writes targeted DELETE statements only against those. The 30-day SLA in DPDP for completing a delete is what makes this an engineering problem rather than a compliance checkbox: a fintech that gets 1,000 delete requests per month cannot afford to take 24 hours per request scanning the warehouse.
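The delete workflow can be sketched as: take the descendant-column list the lineage closure returns for the PII column, and generate one targeted statement per landing place. The SQL below is illustrative; a real system would use bind parameters and per-table anonymise-vs-delete policies rather than string-built DELETEs.

```python
def pii_delete_statements(descendant_columns: list[str], email: str) -> list[str]:
    """descendant_columns: fully qualified columns where the PII value landed,
    as enumerated by the lineage closure (hard-coded by the caller here)."""
    stmts = []
    for qualified_col in descendant_columns:
        table, _, column = qualified_col.rpartition(".")
        # Illustrative SQL only: real code must use bind parameters and
        # honour anonymise-instead-of-delete rules for audit tables.
        stmts.append(f"DELETE FROM {table} WHERE {column} = '{email}'")
    return stmts
```

The point is the shape of the work: an enumeration query against the lineage graph, then a bounded set of targeted writes, instead of a full-warehouse scan per request.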

Job 3: schema-change impact analysis. A platform engineer wants to change the type of a source column — amount: int to amount: decimal(18,4). Who breaks? Table-level lineage says "everyone who reads this table", which for a core source is hundreds of consumers and effectively means "we cannot change this". Column-level lineage says "the 23 consumers who read this column directly, and the 14 of them who cast it in a way that breaks under the new type". Why this is the prerequisite for fast platform evolution: schema changes in regulated/financial pipelines are blocked for months waiting for "we don't know what will break" reviews. A column-lineage system answers the question in minutes, and the consumers can be migrated incrementally with confidence.
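The refinement from "everyone who reads this table" to "the 14 who break" can be sketched as a filter over the recorded edge expressions. The edge record shape and the cast-detection heuristic below are assumptions for illustration; a production system would re-parse the expression rather than pattern-match it.

```python
import re

def breaking_consumers(edges: list[dict], column: str) -> list[str]:
    """Flag consumers whose recorded edge expression casts the column to an
    integer type, which a widening to decimal(18,4) would silently truncate."""
    pattern = re.compile(rf"CAST\(\s*{re.escape(column)}\s+AS\s+INT",
                         re.IGNORECASE)
    return [e["consumer"] for e in edges
            if e["source"] == column and pattern.search(e["expression"])]
```

This is why edge metadata should carry the expression text: the impact query is then a scan over stored strings, not a re-crawl of every consumer's source code.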

Job 4: regulatory audit and explainability. Since 2025, the RBI requires Indian fintechs to produce, on demand, a column-level derivation chain for any cell in their regulatory reports. "Where did the ₹47 crore figure in row 3, column 7 of the daily settlement report come from?" The answer is a chain: aggregate → mart → silver → bronze → source row. Without column-level lineage stored as queryable metadata, this is a manual exercise that takes a senior engineer days per report. With it, it is an API call. The audit teams at Cred and PhonePe both report that column-level lineage cut their quarterly RBI-prep effort from 80 person-hours to under 10, with the bonus that the auditors actually believe the answer because the chain is reproducible from raw events rather than reconstructed from memory and Slack threads.

The combined value across these four jobs is the reason column-level lineage moved from "nice-to-have observability tooling" in 2020 to "table-stakes regulated-industry infrastructure" in 2026. A Razorpay-tier fintech that lacks it cannot pass an audit, cannot meet DPDP delete SLAs, and cannot evolve its schema fast enough to ship product. The investment is no longer optional; the only question is which build phase to absorb the cost into.

How a Bengaluru data team actually rolls this out

The honest deployment story for column-level lineage at a 200-pipeline shop is staged, not big-bang. The pattern that works — observed across Razorpay, Swiggy, Cred, and Meesho's data platform teams — is a four-phase rollout that takes 6–9 months and avoids the "we built lineage but nobody uses it" failure mode.

Phase 1 (weeks 1–6): SQL-only coverage via dbt + sqlglot. Every dbt project already produces a manifest.json with parsed column-level lineage. Pipe that into a graph store (DataHub or Marquez), expose a CLI: lineage trace mart.merchant_revenue_daily.total_gmv. Cover the 60% of the platform that lives in dbt. This is the cheap, high-impact start — it shows the team what column-level lineage feels like before they invest in the harder integrations.

Phase 2 (weeks 7–14): warehouse query log scraping. Snowflake's query_history, BigQuery's INFORMATION_SCHEMA.JOBS, Redshift's STL_QUERY — every modern warehouse logs the SQL of every query. Parse these with sqlglot, extract column-level edges, merge into the graph. This catches the ad-hoc CTAS queries, manual fixes, and non-dbt SQL that Phase 1 missed. Why query-log scraping is a force multiplier: it is the only way to discover lineage for queries that should not exist but do — a developer's local laptop running a one-off INSERT INTO mart.x SELECT FROM stg.y that bypassed dbt. Scraping the query log catches these and surfaces them to the platform team for migration.
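The "queries that should not exist" triage step can be sketched as a pass over raw query-log rows: keep statements that write data, and flag the ones missing whatever marker the sanctioned tooling stamps on its SQL. The log row shape and the dbt-comment heuristic are assumptions for illustration; real extraction would then feed the flagged SQL into the sqlglot parser.

```python
import re

# Statements that create or mutate datasets are the ones that carry lineage.
WRITE_STMT = re.compile(r"^\s*(CREATE\s+TABLE|INSERT\s+INTO|MERGE\s+INTO)",
                        re.IGNORECASE)

def out_of_band_writes(query_log: list[dict]) -> list[str]:
    """Return query ids for writes lacking the (assumed) dbt comment marker."""
    flagged = []
    for row in query_log:
        sql = row["query_text"]
        if WRITE_STMT.match(sql) and "/* dbt */" not in sql:
            flagged.append(row["query_id"])
    return flagged
```

Surfacing these to the platform team is how the one-off laptop INSERTs get migrated into the governed pipeline.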

Phase 3 (weeks 15–24): Spark and orchestration emitter integration. Wire OpenLineage into Airflow / Dagster / Spark / PyFlink. This is the hardest phase — every framework has its own lineage-emission story, and the column-level coverage varies. The pragmatic target: column-level coverage for Spark SQL operators; table-level for everything else; manual annotation for the 5–10 critical Python pipelines that produce regulated outputs.

Phase 4 (weeks 25–36): UI integration and adoption. The lineage graph in a database is invisible. Adoption requires that the on-call sees lineage edges inside the tools they already use: a sidebar in the BI tool showing the column's upstream chain; a Slack bot that posts lineage when a data-quality check fails; a pre-merge GitHub check that shows the blast radius of a SQL change. Without these, the graph is a museum piece.

The teams that skip Phase 4 — and there are many — end up with a beautifully complete lineage graph that the on-call never opens at 3 a.m. because it lives in a separate UI behind a separate login. Coverage is necessary but not sufficient; the lineage has to land where the work happens.

A specific anti-pattern to avoid in Phase 4: building a "lineage explorer" web app as the primary UI. These apps look impressive in product demos but are used by the platform team and almost no one else. The actual users — the BI analyst writing a SQL query, the on-call SRE reading a data-quality alert, the developer opening a pull request — are inside other tools. The integrations that move the needle are: (a) a lineage slash command in Slack that takes a column reference and posts the upstream chain inline; (b) a GitHub PR comment bot that flags the blast radius of any SQL change; (c) a column-tooltip in the BI tool that shows lineage on hover; (d) a CLI for the platform team that returns plain-text lineage chains for scripting. Build these four integrations and adoption will follow; build a beautiful exploratory UI and adoption will not. Why this is harder than the technical pieces: every phase 1–3 deliverable is a backend engineering project with measurable output. Phase 4 is a UX project that requires partnering with every tool team and building integrations into systems the data platform team does not own. The teams that fail at column-level lineage almost always fail at this last mile.

What goes wrong (real-world)

The "table-level lineage with column names sprinkled in" trap. A team that buys a vendor data-catalog tool sees "column-level lineage" in the feature list and assumes the problem is solved. It is not. Most vendor implementations do table-level lineage with column-name overlay — they show the column names but the edges are still table-to-table. The give-away: the tool cannot answer "which input columns contributed to this output column?" with anything more granular than "all input columns from upstream tables". Verify by tracing one column through a CASE expression. If the tool cannot tell you that the CASE branch column is a filter dependency, it is not real column-level lineage.

The discovery story across many Indian fintechs has been similar: a data team buys a vendor tool in year one based on the demo, discovers in year two that the column-level edges are unreliable, and ends up writing their own sqlglot-based extractor for the dbt portion of their pipeline while keeping the vendor tool for the UI. The vendor tools that have caught up to this — Atlan, Collibra, Manta — did so by integrating sqlglot or equivalent open parsers under the hood. The vendor tools that have not are still recommending "annotate your transformations manually with lineage hints", which is the column-level equivalent of writing your own database from scratch.

Pandas pipelines as black boxes. A team migrating a Razorpay reporting pipeline from SQL+dbt to a Python+pandas service loses 60% of their column-level lineage coverage overnight. The source column-meaning of every metric is now hidden inside Python function bodies that the lineage extractor cannot parse. The Razorpay platform team's answer in 2025 was an internal lint rule: pandas transformations in production must use a typed wrapper (@traced_transform decorator) that emits column-level lineage events to OpenLineage at runtime — turning a static-analysis problem into a runtime-observation problem.
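The mechanism behind a decorator like that can be sketched in a few lines: the author declares the column mapping, and the wrapper records a lineage event each time the transform actually runs. The decorator name is borrowed from the text; the event shape is invented for illustration, and a plain dict of lists stands in for a pandas DataFrame.

```python
import functools

LINEAGE_EVENTS: list[dict] = []  # stand-in for an OpenLineage emitter

def traced_transform(inputs: list[str], outputs: list[str]):
    """Author-declared column mapping; emitted as an event on every run."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(df):
            LINEAGE_EVENTS.append(
                {"transform": fn.__name__, "inputs": inputs, "outputs": outputs}
            )
            return fn(df)
        return wrapper
    return decorate

@traced_transform(inputs=["amount_paise"], outputs=["gmv_inr"])
def add_gmv(df):
    # Stand-in for df["gmv_inr"] = df["amount_paise"] / 100.0 in pandas.
    df["gmv_inr"] = [p / 100.0 for p in df["amount_paise"]]
    return df

out = add_gmv({"amount_paise": [100, 250]})
# out["gmv_inr"] == [1.0, 2.5]; LINEAGE_EVENTS now holds one event
```

The trade-off the text names is visible here: the mapping is only as truthful as the author's declaration, and the lineage reflects the runs that happened, not the code paths that exist.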

Lineage drift after schema changes. A column lineage graph captured on Monday is stale by Friday if a developer renamed a column without updating the lineage extractor's snapshots. This is more common than you'd expect: dbt's lineage.json is regenerated on every dbt build, but if the build is skipped or the lineage tool reads from a stale cache, the displayed graph lies. The fix is to treat lineage as ephemeral — regenerate on every successful build, never on a cache older than the last warehouse change.

Over-attribution from JOINs and CASE. A naive extractor that treats every column referenced in a SELECT as contributing to every output column produces lineage edges from merchant_id to total_gmv (because merchant_id is in the GROUP BY). This is technically true but operationally noise — the on-call gets a candidate list of 12 columns to investigate when only 2 actually carry value. Real lineage tools classify each edge by transformationType and let the UI filter to direct/data dependencies first.

Forgetting that views and materialised views participate. Views have lineage too. A team that excludes views from their lineage extraction (because "they're not real tables") loses the ability to see that vw_settled_txn rewrites the meaning of gmv_inr. This is the actual bug from the lead paragraph — the COALESCE was added to a view, and the view was excluded from the lineage scan because it was tagged "infrastructure, not data". Treat views as first-class lineage nodes.

The "lineage but no diff" gap. A team has full column-level lineage but no way to ask "what changed between yesterday and today?" The graph is a snapshot, not a history. When Aditi pings about the ₹2 crore discrepancy, lineage tells Karan what feeds the column, not what about it changed when. Production-grade systems pair the lineage graph with a lineage version log — every regenerated lineage snapshot is diffed against the previous, and the diff is queryable. "Show me every column whose lineage changed in the last 7 days" is the actual incident-response query, and most lineage tools in 2026 are still catching up to delivering it cleanly.

Cross-system lineage gaps at handoff points. A Razorpay pipeline that goes Postgres → Debezium → Kafka → Flink → Iceberg → Trino → Looker has six handoff points where one tool's lineage emitter ends and another's begins. Each handoff is a place where column-level information can be lost — Debezium emits column-level CDC events but the Kafka topic schema may flatten them, Flink processes them but its lineage emitter only reports table-level for non-SQL operators, Iceberg's writes are captured but the connection back to the Flink output column is opaque. The fix is OpenLineage at every layer with consistent dataset naming, but the operational reality is that "we have lineage" usually means "we have lineage for SQL workflows" and the streaming half is dark.

The cross-system gap is also where most production lineage bugs live — not in any single tool's emitter, but in the joins between them. A field renamed at one layer but referenced by its old name in another, a dataset URI that drifts (s3://bucket/path vs s3a://bucket/path), a Kafka topic schema that splits a struct column into two flat columns. Each of these breaks the cross-tool join silently, leaving the lineage graph with two disconnected components that look fine in isolation but cannot answer end-to-end questions. Production lineage operations is mostly about catching and fixing these gaps before the on-call needs the answer at 3 a.m.

Common confusions

Going deeper

How sqlglot extracts column lineage from SQL

sqlglot, the Python SQL parser that became the de facto standard between 2022 and 2026, builds an AST where every column reference carries a fully-resolved table.column reference after the qualify pass. The lineage extraction (sqlglot.lineage.lineage) walks the AST from a target output column back to source columns, recording each intermediate transformation. The hard parts the parser handles: CTE resolution (a column referenced inside a CTE points to the CTE's source, which itself has lineage), aliasing (SELECT a AS b FROM t produces an edge from t.a to the output, labeled b), subqueries in FROM (each subquery is a sub-DAG), window functions (PARTITION BY and ORDER BY columns are filter-style dependencies on the output), and dialect quirks (Snowflake's IDENTIFIER, BigQuery's STRUCT field access, Postgres's lateral joins). The parts even sqlglot punts on: dynamic SQL constructed via string concatenation in stored procedures, UDF bodies written in JavaScript or Python, and any expression that depends on current_user() or other session-context functions.

OpenLineage's column lineage facet and the standardisation story

OpenLineage's column-level lineage facet, formalised in 2023, lets every emitter (Airflow, Spark, dbt, Trino, Flink) attach column-level edges to lineage events as part of the run metadata. The facet schema: each output column has a transformations array, where each transformation has a list of input fields, a type (DIRECT, INDIRECT), a subtype (TRANSFORMATION, AGGREGATION, FILTER, JOIN, GROUP_BY, WINDOW), and optionally the SQL/Python source text. This standardisation is why a Razorpay platform team in 2026 can use Airflow + Spark + dbt + Trino + Flink and get a single column-lineage graph in DataHub: every emitter speaks the same vocabulary, and DataHub merges them. Pre-OpenLineage (before 2022), every team built their own format; cross-tool lineage was a per-team integration project. After OpenLineage, it is a wiring exercise.
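The facet's shape can be sketched as a plain dict. The field names below paraphrase the published facet schema and may not match it exactly; the namespace and dataset names are carried over from the running example.

```python
# Approximate shape of an OpenLineage-style column-lineage facet for one
# output column; field names paraphrase the spec, not quote it.
facet = {
    "fields": {
        "gmv_inr": {
            "inputFields": [
                {"namespace": "postgres://razorpay_prod", "name": "transactions",
                 "field": "amount_paise",
                 "transformations": [{"type": "DIRECT", "subtype": "AGGREGATION"}]},
                {"namespace": "postgres://razorpay_prod", "name": "transactions",
                 "field": "status",
                 "transformations": [{"type": "INDIRECT", "subtype": "FILTER"}]},
            ]
        }
    }
}

def direct_inputs(facet: dict, out_field: str) -> list[str]:
    """Only the value-carrying (DIRECT) inputs, for triage-first ranking."""
    return [
        f'{inp["name"]}.{inp["field"]}'
        for inp in facet["fields"][out_field]["inputFields"]
        if any(t["type"] == "DIRECT" for t in inp["transformations"])
    ]
```

Because every emitter attaches this same structure to its run events, a consumer like DataHub can merge facets from Airflow, Spark, dbt, and Flink into one graph without per-tool translation code.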

Bytecode-level lineage for pandas and PySpark

Static analysis of pandas is a dead end — the language is too dynamic. Two production approaches in 2026: (1) Runtime tracing — wrap pandas operations in a decorator that records (input_columns, operation, output_columns) at execution time. Open-source projects: mlinspect, Vamsa, tea-pandas. The cost: 5–15% runtime overhead, plus the lineage is a function of the run, not the code (so a conditional pipeline produces different lineage on different runs). (2) Plan-based extraction — for PySpark, Catalyst's logical plan after analysis exposes every column-level operation; the OpenLineage Spark integration walks the plan and emits column-level facets. For pandas, Polars's lazy frames have a similar planner, and Polars + OpenLineage support is now the standard pattern for new Python pipelines that want column-level lineage.
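The runtime-tracing approach (1) can be sketched in a few lines. This is a deliberately coarse illustration — trace_columns and add_gmv_inr are hypothetical names, and it records only the column sets entering and leaving each call; real tools like mlinspect instrument at a finer grain, mapping individual input columns to individual output columns:

```python
import functools
import pandas as pd

# Global log of (operation, input columns, output columns), appended at run time
LINEAGE_LOG = []

def trace_columns(fn):
    """Runtime tracing: record the columns flowing through fn at execution time."""
    @functools.wraps(fn)
    def wrapper(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
        inputs = list(df.columns)
        out = fn(df, *args, **kwargs)
        LINEAGE_LOG.append({"op": fn.__name__,
                            "inputs": inputs,
                            "outputs": list(out.columns)})
        return out
    return wrapper

@trace_columns
def add_gmv_inr(df: pd.DataFrame) -> pd.DataFrame:
    # Derived column: paise -> rupees, defaulting missing values to zero
    return df.assign(gmv_inr=df["amount_paise"].fillna(0) / 100.0)

df = pd.DataFrame({"amount_paise": [100, 250], "merchant_id": ["m1", "m2"]})
add_gmv_inr(df)
print(LINEAGE_LOG)
```

Note the property the text warns about: the log describes this run. If add_gmv_inr sat behind an if branch, a different run could produce a different lineage record, which is why plan-based extraction is preferred wherever a query planner exists.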

Column-level lineage at petabyte scale: storage and indexing

A column-level lineage graph at a Flipkart-scale data platform has order 10⁷ nodes (columns) and 10⁸–10⁹ edges (transformations). Storing this naively in a relational DB makes traversal queries — "all upstream columns of mart.orders.gmv" — too slow for interactive use. Production deployments in 2026 use one of three patterns: (a) Graph DB (Neo4j, JanusGraph, Amazon Neptune) — the most flexible, supports arbitrary multi-hop queries, but ops complexity is high. (b) Postgres with recursive CTEs and adjacency-list materialised views — works to ~10⁷ edges, struggles past that. (c) Pre-computed transitive closure as a denormalised "lineage cube" — every (output_col, source_col, hop_count, transformation_chain) row stored in a columnar store like ClickHouse or BigQuery; refreshed nightly. DataHub's approach is option (a) for the operational graph and option (c) for the analytical layer; this is the pattern most large Indian data platforms (Flipkart, PhonePe, Swiggy) settled on after 2024.
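The traversal that all three storage patterns must serve — "all upstream columns of this column" — is a transitive closure over the edge map. A toy sketch with an in-memory adjacency list (edges taken from this article's example; a real deployment stores the map in one of the backends above, and option (c) precomputes exactly this walk so the interactive query becomes a single indexed lookup):

```python
from collections import deque

# Toy edge map: output column -> its direct source columns
EDGES = {
    "mart.merchant_revenue_daily.total_gmv": ["stg.txn_settled.gmv_inr"],
    "stg.txn_settled.gmv_inr": ["pg.transactions.amount_paise",
                                "pg.transactions.status"],
}

def upstream(col: str) -> set[str]:
    """Every transitive upstream column of col: BFS over the adjacency map."""
    seen, queue = set(), deque([col])
    while queue:
        for parent in EDGES.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(upstream("mart.merchant_revenue_daily.total_gmv"))
# the CTE-based Postgres pattern (b) expresses this same walk as a
# WITH RECURSIVE query over an edges table
```

At 10⁸–10⁹ edges the BFS itself is cheap per query; what kills the naive relational version is the per-hop index lookup latency compounding over deep chains, which is why the nightly-refreshed lineage cube trades storage for read speed.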

Why the Aadhaar / UIDAI data team has the hardest version of this problem

UIDAI's data platform feeds 200+ downstream consumers — GST, IT department, every public-sector bank, every fintech doing eKYC, every state's PDS system. Aadhaar's aadhaar_number column is the most heavily referenced PII column in the country: it appears in derived form in tens of thousands of downstream columns, often hashed, masked, or partially-revealed (last-4-digits). When the Supreme Court's 2018 Puttaswamy judgment ordered restrictions on which downstream uses are permitted, the UIDAI data team had to enumerate, with column-level precision, every derived column of aadhaar_number and classify each by use-case. This was a 14-month project. The lesson: column-level lineage is not a tool for normal incident response — at the scale of national-identity-or-payment infrastructure, it is the only mechanism by which legal compliance is achievable. The Indian data engineering community's heavy investment in OpenLineage from 2023 onward is partly a response to DPDP-2023 making this a baseline requirement for every fintech, not just UIDAI.

Where this leads next

Column-level lineage is the foundation for the rest of Build 5: data catalogs (chapter 30), data contracts (chapter 32), freshness SLAs at column granularity (chapter 33), and the broader "blast radius" tooling that turns lineage from a museum piece into an operational system. Once you have column-level edges, you can build incident-response UIs that highlight the suspect columns, GDPR/DPDP delete pipelines that target only the affected derived columns, and regulatory audit reports that prove a single number's provenance.

The chapters that follow in Build 5 each lean on column-level lineage as the substrate they sit on. Data catalogs become useful only when they can answer "what feeds this column"; data contracts are enforced at the column boundary; freshness SLAs are most meaningful at column granularity ("this column is at most 4 hours stale", not "this table"). Without column-level lineage, the rest of the build remains aspirational.

The deeper arc: lineage at the right granularity is the difference between a data platform that feels haunted ("something changed somewhere, we don't know what") and one that feels engineered ("a column changed, here it is, here is the blast radius, here is the fix"). A team that invests in column-level lineage early gets compounding returns; a team that defers it pays interest on every incident, every audit, every schema migration, for the life of the platform.

The strategic framing the senior engineers at PhonePe and Razorpay use when they pitch column-level lineage to leadership: it is not an observability tool, it is a change-velocity tool. Without it, every schema change in a regulated pipeline is a months-long review. With it, the same change is a one-week migration with a clear blast-radius report. The ROI is measured in how often the platform team can say "yes, we can change that" instead of "we don't know what will break". Multiply that across the 200+ schema changes a fast-moving fintech wants to make per year, and the column-level lineage investment pays back inside two quarters.

A practical rule for teams just starting Build 5: do not chase 100% column-level lineage coverage from day one. Cover the 60% of pipelines that live in dbt SQL within the first sprint, then expand outward only when an actual incident or audit forces the issue. The teams that try for full coverage as a phase-zero project usually spend a year in implementation and ship nothing usable; the teams that ship the dbt-only version in two weeks discover that 60% coverage already changes how their on-call rotation feels.

The honest measure of success is not "what fraction of pipelines have lineage" but "what fraction of incidents the on-call solves with lineage as the primary tool". The Razorpay platform team tracks this as the "lineage assist rate" — the percentage of P1 data incidents where the post-mortem credits column-level lineage as the single most useful tool in finding the root cause. Their assist rate climbed from 0% (lineage existed but nobody used it) at the start of 2024 to 71% by end of 2025, and the climb tracked exactly with the four-phase rollout above — the assist rate barely moved in Phases 1–3 and shot up only after Phase 4 landed the integrations into Slack and GitHub. The lesson the team took away: build the lineage graph for the engineers, but ship the integrations for the workflows. Both are required, in that order; either alone is half a project.

References

  1. OpenLineage column lineage facet specification — the de facto standard for column-level lineage events since 2024.
  2. sqlglot lineage module documentation — the Python SQL parser's column-lineage extraction API used by dbt and Snowflake internally.
  3. DataHub column-level lineage docs — the operational implementation patterns for storing and querying column-level lineage at scale.
  4. Marquez data model — the OpenLineage reference implementation's storage model for column-lineage edges.
  5. Manta Software, "What column-level lineage is and isn't" (2023) — a clear-eyed vendor-side write-up of the difference between true column-level and column-name overlay.
  6. DPDP Act 2023, India — Section 8 on data deletion rights — the regulatory driver for column-level PII tracking in Indian fintechs.
  7. Maxime Beauchemin, "The downfall of the data engineer" (2017) — the Airbnb-era essay that argued lineage was the missing primitive of data platforms.
  8. What is data lineage and why it's the heartbeat of debugging — chapter 28, the table-level lineage that this article builds on.