Schema drift: when the source changes under you
Kiran owns the analytics pipeline at a Pune-based logistics startup. On a Tuesday morning the BI team raises a ticket: the delivery_attempts dashboard is showing zero attempts for the last 18 hours, but the operations team can see deliveries happening on the road. Kiran checks the pipeline — green. The cursor advanced, the rows were ingested, the MERGE statement ran without errors. Every observability metric says the pipeline is healthy. The bug is two layers deeper: the upstream service team renamed attempt_count to attempts_total in a deploy on Monday evening, the source table now has both columns (the old one nullable, the new one populated), and the ETL's SELECT attempt_count returns NULL for every new row. The pipeline is technically running. The data it produces is technically wrong. No alert fired because no alert was looking at column shape.
Schema drift is the silent class of failure where the source's column set changes under a running pipeline — a new column appears, a type widens, a field is renamed, a nullable becomes non-nullable. Cursors and merge keys do not catch it because the rows still load; the values are just wrong, missing, or in the wrong column. The defence is a contract layer that checks shape on every run and fails loudly on unexpected change, paired with explicit policies for which drifts are auto-applied and which require a human.
Why schema drift is invisible to a healthy pipeline
A standard incremental load has three checks built into the orchestrator's view of the world. The cursor advanced (high-water mark moved forward). The MERGE returned a row count (writes happened). The job exit code was 0 (no SQL error). Airflow turns the task green if all three pass; Dagster does the same; cron + a custom monitoring shim does the same. The pipeline is "healthy" by every definition that the orchestrator can observe.
But all three checks pass when the schema drifts in three of the four common ways:
- Column added in source, ignored by pipeline: SELECT col_a, col_b continues to work; the new col_c is silently dropped at the SELECT boundary. The destination has no idea the column exists. Loss is silent.
- Column renamed in source: if the old name still exists (most ALTER TABLE deploys keep the old column for backward compatibility), SELECT old_name returns NULL after the new column became the source of truth. Rows still flow; values are NULL.
- Type widened in source (e.g. INT → BIGINT, VARCHAR(50) → VARCHAR(255)): the pipeline's destination column is the old narrower type. Inserts that exceed the old range either truncate (silent) or fail with a row-level error that the bulk loader ignores in CONTINUE_ON_ERROR mode.
- Column dropped in source: this one is loud. The SELECT dropped_column errors, the job fails, an alert fires. This is the only drift mode that the standard pipeline notices on its own — and even this is silent if the SELECT uses SELECT *, because the result set just has fewer columns and the downstream INSERT happily ignores the missing one.
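The rename mode is worth seeing in miniature. The sketch below is illustrative only — a handful of simulated rows in plain Python, no database — but it shows why every orchestrator-visible check passes while every loaded value is wrong:

```python
# Simulated source rows after the Monday-evening deploy: the old column
# still exists but is no longer populated; the new column holds the data.
source_rows = [
    {"id": 1, "attempt_count": None, "attempts_total": 2},
    {"id": 2, "attempt_count": None, "attempts_total": 1},
]

# The pipeline's SELECT still reads the old column name.
loaded = [{"id": r["id"], "attempt_count": r["attempt_count"]} for r in source_rows]

# Every orchestrator-visible check passes...
assert len(loaded) == len(source_rows)     # "rows were ingested"
assert max(r["id"] for r in loaded) == 2   # "the cursor advanced"

# ...but every value the dashboard depends on is NULL.
assert all(r["attempt_count"] is None for r in loaded)
```

The only check that would have fired is one over the column set itself — which is exactly what a contract layer asserts.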
Why a passing pipeline is the worst signal: monitoring "did the job succeed?" answers a different question from "is the data correct?". Schema drift breaks the second without affecting the first. The Kiran-shaped engineer reading the green dashboard genuinely believes the pipeline is healthy because every metric they have looks healthy. The only metric that would have caught the drift is "what columns did the source have on this run, and do they match what we expected?" — and that metric does not exist by default in any of cron, Airflow, dbt, or the warehouse's native ETL.
Where drift comes from — the four upstream causes
Drift is not random. It comes from four predictable kinds of upstream change:
Application deploys are the most common cause. A backend team adds a new field to an API response, runs a migration that adds a column to the OLTP table, and deploys. The data team finds out via the dashboard going wrong, not via a release note. At Razorpay-scale (200+ engineers across 30+ services) this happens roughly once a week per major source table. The deploy is not malicious — the application engineer correctly added a feature with a database column, ran their migration, ran their tests, deployed cleanly. It is just that the data pipeline is downstream of a contract that was never written down, so any change is by definition a "breaking" one for the pipeline even if the application-side test suite is green.
ORM / framework auto-migrations are the second cause. Django's makemigrations, Rails' db:migrate, Sequelize's auto-sync — all of them generate ALTER TABLE statements without a human writing the migration. The application engineer adds a field to the model class and the framework adds the column on next deploy. The data team has no review hook in this flow.
Source-system upgrades are the third cause. A vendor SaaS API (Salesforce, Stripe, Razorpay's own merchant API) versions its response schema. The pipeline that was hitting /v1/payments gets a new field in the response when the vendor rolls a minor version. Some vendors version cleanly (v1 stays stable, v2 is opt-in); many don't.
Manual hot-fixes are the fourth, most insidious cause. An on-call engineer logs into the production database to fix a customer-support escalation and runs ALTER TABLE policies ADD COLUMN refund_status TEXT. Six weeks later, finance asks a question that depends on the column, the analyst can't find it in the warehouse, and the team discovers there's a column in the OLTP that the pipeline never knew about. The fix is sometimes worse than the original incident: backfilling a 6-week-old column from a soft-deleted audit log is a multi-day project, and the resulting analytics gap is permanent because the soft-delete logic also dropped a few rows.
Of the four, application deploys are the easiest to catch (the deploy is observable in the company's release tracker; the migration is in version control). Manual hot-fixes are the hardest (no PR, no review, no log). The pipeline's defence has to assume all four can happen and not depend on the source team to tell you.
The frequency varies by company stage. A 5-engineer startup in pre-product-market-fit mode might see one drift event a month. A 200-engineer scale-up with multiple product teams shipping daily sees roughly one drift per major source table per week — at Razorpay scale, with 40+ source tables that the data team consumes, that's a drift event somewhere in the system roughly every business day. The pipeline that ignores drift will be wrong somewhere, somehow, every day; the only question is whether the data team finds it before the business does.
A second-order observation: the closer the source team is to the data team (organisational distance), the lower the drift rate. Co-located teams who eat lunch together coordinate schema changes informally; remote teams or teams in different time zones don't. The Bengaluru-vs-Bengaluru schema-drift rate at most Indian product companies is meaningfully lower than the Bengaluru-vs-Hyderabad rate, which is in turn lower than the Bengaluru-vs-Bay-Area rate. The contract layer is the engineering substitute for the lunch-table conversation that doesn't happen at scale.
Building a schema contract that runs every load
The fix is conceptually simple: every run, before reading data, the pipeline asserts that the source's columns match an expected list. If the source has unexpected columns (drift), it either auto-applies the change (allowed list of safe drifts) or fails the run with a clear message (everything else). The contract layer is 60 lines of code and lives between the cursor read and the MERGE.
# schema_contract.py
# A minimal schema contract that runs before every incremental load.
# Catches all four drift modes and emits an actionable alert.
import sys

import psycopg2

DSN = "host=oltp.internal dbname=logistics user=etl password=etl"

# Expected schema, version-controlled alongside the pipeline code.
# This is the contract — written before Monday's deploy, so it still
# lists attempt_count. Types are spelled the way information_schema
# reports them ("character varying", "timestamp without time zone").
EXPECTED = {
    "delivery_attempts": {
        "id": ("bigint", False),  # (type, nullable)
        "order_id": ("bigint", False),
        "attempt_number": ("integer", False),
        "attempt_count": ("integer", False),
        "agent_id": ("bigint", True),
        "outcome": ("character varying", False),
        "attempted_at": ("timestamp without time zone", False),
        "updated_at": ("timestamp without time zone", False),
    }
}

# Drifts the pipeline will tolerate without a human.
SAFE_DRIFTS = {"new_nullable_column", "type_widening_int_to_bigint"}


def actual_schema(cur, table):
    cur.execute("""
        SELECT column_name, data_type, is_nullable
        FROM information_schema.columns
        WHERE table_schema = 'public' AND table_name = %s
        ORDER BY ordinal_position
    """, (table,))
    return {r[0]: (r[1], r[2] == "YES") for r in cur.fetchall()}


def diff(expected, actual):
    drifts = []
    for col, (exp_type, exp_nullable) in expected.items():
        if col not in actual:
            drifts.append(("missing_column", col, exp_type))
            continue
        act_type, act_nullable = actual[col]
        if act_type != exp_type:
            if (exp_type, act_type) == ("integer", "bigint"):
                drifts.append(("type_widening_int_to_bigint", col))
            else:
                drifts.append(("type_change", col, exp_type, act_type))
        if exp_nullable is False and act_nullable is True:
            drifts.append(("nullability_loosened", col, "now nullable"))
    for col in actual:
        if col not in expected:
            act_type, act_nullable = actual[col]
            kind = "new_nullable_column" if act_nullable else "new_required_column"
            drifts.append((kind, col, act_type))
    return drifts


def assert_contract(table):
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        actual = actual_schema(cur, table)
    drifts = diff(EXPECTED[table], actual)
    if not drifts:
        return  # green path
    unsafe = [d for d in drifts if d[0] not in SAFE_DRIFTS]
    if not unsafe:
        print(f"[contract] {table}: {len(drifts)} safe drift(s) auto-applied")
        for d in drifts:
            print(f"  - {d}")
        return
    print(f"[contract] {table}: BLOCKING DRIFT — pipeline halted", file=sys.stderr)
    for d in unsafe:
        print(f"  ! {d}", file=sys.stderr)
    sys.exit(2)


if __name__ == "__main__":
    assert_contract("delivery_attempts")
    print("[contract] delivery_attempts: shape OK")

# Output (Tuesday morning, after the rename deploy — the old column is
# still present but now nullable, the new column is required):
[contract] delivery_attempts: BLOCKING DRIFT — pipeline halted
  ! ('nullability_loosened', 'attempt_count', 'now nullable')
  ! ('new_required_column', 'attempts_total', 'integer')
Three functions do the load-bearing work. actual_schema() queries information_schema.columns for the source table's actual current shape — this is the single read that any contract layer is built around, and Postgres, MySQL, SQL Server, Snowflake, BigQuery all expose an equivalent catalog table. diff() compares expected vs actual along three axes (presence, type, nullability) and classifies each drift by kind. assert_contract() then partitions the drifts into safe and unsafe, auto-applies the safe ones, and exits non-zero on the unsafe ones — which is the signal the orchestrator (cron, Airflow, Dagster) needs to halt the run before the bad data lands.
Why the contract has to run before the cursor read, not after: a contract that runs after the load has already corrupted the destination. The whole point is to fail loudly enough that the run aborts before any rows are written. Some teams put the contract check at the end of the run "as an audit"; this defeats the purpose. The audit is fine as a second layer, but the first layer has to be the pre-read assertion that costs ~50ms and prevents bad data from reaching the warehouse at all.
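That ordering can be made explicit in the run function itself. A sketch — the step names here are illustrative stand-ins, not functions from the script above — showing that a blocking drift aborts the run before the cursor is even read:

```python
calls = []

def blocking_contract(table):
    raise SystemExit(2)  # what a contract check does on unsafe drift

def run_incremental_load(table, contract, read_cursor, extract, merge):
    """Contract first: nothing is read or written until the shape check passes."""
    contract(table)                        # ~50ms; halts the run on blocking drift
    cursor = read_cursor(table)            # high-water-mark read happens only after
    merge(table, extract(table, cursor))

try:
    run_incremental_load(
        "delivery_attempts",
        contract=blocking_contract,
        read_cursor=lambda t: calls.append("cursor"),
        extract=lambda t, c: calls.append("extract"),
        merge=lambda t, rows: calls.append("merge"),
    )
except SystemExit:
    pass

assert calls == []  # the drift aborted the run before any read or write
```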
The 60-line script is the floor — production-grade contract systems (Great Expectations, Soda Core, Monte Carlo) wrap it with versioning, history, drift-rate metrics, and PR-style review for accepting drifts. But the 60 lines do most of the work. Most teams that adopt a contract layer catch their first drift within two weeks.
A common extension: the contract layer also writes its observations to a schema_history table on every run, regardless of whether drift fired. The table has columns (table_name, run_at, columns_json, contract_version, drift_kinds). Over months this table becomes the source of truth for "when did this column appear?" and "how often does this source's schema change?" — questions that are otherwise answered by archaeology through old git commits or by guessing. The Razorpay data-platform team uses this history as the input to their quarterly contract-review meeting: tables with high drift rates get extra scrutiny on their owning team's release process, and tables with zero drift in 12 months get their contract relaxed (some checks downgraded from block to continue) because the historical evidence supports it.
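A minimal version of that history write — the column names follow the schema_history description above; the serialisation choices and the contract_version default are illustrative assumptions:

```python
import json
from datetime import datetime, timezone

def history_row(table, actual, drifts, contract_version=3):
    """Build the row the contract layer appends to schema_history every run,
    drift or no drift. `actual` is the {col: (type, nullable)} map."""
    return {
        "table_name": table,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "columns_json": json.dumps({c: list(t) for c, t in sorted(actual.items())}),
        "contract_version": contract_version,
        "drift_kinds": sorted({d[0] for d in drifts}),
    }

row = history_row(
    "delivery_attempts",
    actual={"id": ("bigint", False), "attempts_total": ("integer", False)},
    drifts=[("new_required_column", "attempts_total", "integer")],
)
assert row["drift_kinds"] == ["new_required_column"]
```

An ordinary INSERT of this dict into the schema_history table on every run is all the archaeology the quarterly review needs.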
Why "auto-apply safe drifts" is a real category and not laziness: drift policy is a spectrum from "everything is a manual review" (correct but slow — every column add blocks the pipeline until a human approves) to "everything auto-applies" (fast but unsafe — type changes silently corrupt). The middle ground says: drifts that cannot break existing queries (a new nullable column does not affect the SELECT statement that reads the existing columns; an INT→BIGINT widening cannot truncate existing values) are safe to auto-apply. Drifts that can break (rename, type narrowing, required column added, column dropped) must block. The set of safe drifts is small and decidable.
What the contract catches that data tests cannot
A common reaction to this chapter from teams that already have data tests (dbt tests, Great Expectations expectations, custom assertions in the SELECT statement) is "we already check our data; do we need a separate contract layer?" The answer is yes, because data tests and schema contracts catch different failure classes.
Data tests check value-level invariants on the columns the pipeline knows about. not_null says "the values in this column should never be NULL"; unique says "no two rows have the same value here"; accepted_values says "every value should be in this list". They run after the data is loaded and they check the values you have.
Schema contracts check column-level invariants on the table the pipeline reads from. They run before the data is loaded and they check the columns you have. The difference matters because the failure modes of the two checks do not overlap.
A renamed source column produces a stream of NULL values in the pipeline's destination column. The data test not_null fires — eventually, after the bad data has landed and the test has run. By then the rows are in the destination, the dashboard has been wrong for some hours, and a downstream consumer may have already exported a CSV. The schema contract fires before the load; the data test fires after. Both signals are useful, but the contract catches the failure earlier in the chain.
A new column added to the source that the pipeline doesn't read produces no failed data test at all — the test runs over the columns that were loaded, and the new column wasn't loaded. The data team only discovers the column exists when someone goes looking for it, often months later. The schema contract fires on the first run after the column appears.
The right architecture has both layers. The contract is the load-blocking gate; the data tests are the post-load audit. Failing the contract means halt the run; failing a data test means alert and triage but don't necessarily roll back. Most production warehouses end up with this two-layer pattern after their first contract incident.
Why halting on contract failure is correct but halting on data-test failure is not: the contract failure means the data the pipeline is about to load is wrong by structure — every row will be miscolumned. The right action is to stop before any rows land, because letting bad rows in just creates a backfill problem on top of the original problem. A data-test failure means a small fraction of rows violate an invariant — the row count for a column might be 0.4% NULL when the test says 0% NULL. Halting on this is overkill; the right action is to alert, quarantine the bad rows, and let the rest of the load complete. The two layers correspond to two different kinds of "wrong" and deserve two different responses.
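The "quarantine and continue" response to a data-test failure can be sketched as a partition over the loaded batch — illustrative only; the not-null test and the zero-tolerance threshold are assumptions, not any specific tool's behaviour:

```python
def apply_not_null_test(rows, column, max_violation_rate=0.0):
    """Post-load audit: quarantine violating rows, alert, let the rest land."""
    good = [r for r in rows if r.get(column) is not None]
    quarantined = [r for r in rows if r.get(column) is None]
    rate = len(quarantined) / len(rows) if rows else 0.0
    alert = rate > max_violation_rate
    return good, quarantined, alert

rows = [{"attempts_total": 2}, {"attempts_total": None}, {"attempts_total": 1}]
good, bad, alert = apply_not_null_test(rows, "attempts_total")
assert len(good) == 2 and len(bad) == 1 and alert
```

Contrast with the contract failure, where there is no partition to make — every row is structurally wrong, so the only sensible output is an exit code.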
The drift-policy decision matrix
Once the contract layer exists, the team has to write down which drifts are safe to auto-apply and which require a human. The matrix below is what most Indian-fintech data teams converge on after their first three or four drift incidents:
- New nullable column — auto-apply (purely additive; no existing query references it, so nothing breaks)
- Type widening, INT → BIGINT — auto-apply (no value-range hazard)
- New required (NOT NULL) column — block (the column's meaning and backfill need a human)
- Column renamed (simultaneous drop + add) — block (auto-mapping risks miscolumned data)
- Type narrowed or otherwise changed — block (truncation and cast hazards)
- Column dropped — block (downstream consumers may reference it)
- Nullability loosened on a required column — block (a silent NULL stream is exactly the Kiran failure mode)
The matrix is per-team policy, not a universal answer. The Razorpay payments-pipeline team blocks on every drift, including new nullable columns — their correctness bar is too high to auto-apply anything. The Swiggy delivery-events team auto-applies everything additive and only blocks on type changes — they ship features fast, and the cost of a 4-hour ingest delay outweighs the cost of a one-day window where the new column is absent from analytics. The Cred rewards-pipeline team uses a hybrid: auto-apply for low-tier tables, manual review for everything tagged pii=true.
The decision the team has to write down: for each drift kind, what is the policy? Once written down and committed to git, the contract layer enforces it on every run. Without that document, every drift turns into an ad-hoc Slack thread at 11 a.m. on a Tuesday.
Why the matrix is per-team and not universal: the cost of a "false block" (pipeline halts on a benign change, downstream consumers are stale until a human approves) is different for different teams. A consumer-tech team where stale-by-an-hour breaks the product treats a false block as an incident; a B2B-SaaS team where stale-by-a-day is fine treats a false block as a rounding error. The same drift (new_required_column) is "alert and continue" for the first team and "alert and block" for the second. The matrix encodes the team's appetite for false blocks vs false acceptances, and that appetite is a real product decision.
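Written down and committed to git, the matrix is just a mapping from drift kind to action. A sketch of how the contract layer would consult it — the kind names follow this chapter's classification, while the action vocabulary is an assumption:

```python
# One team's written-down drift policy — per-team, not universal.
POLICY = {
    "new_nullable_column":         "auto_apply",
    "type_widening_int_to_bigint": "auto_apply",
    "new_required_column":         "block",
    "missing_column":              "block",  # covers drops and rename halves
    "type_change":                 "block",
    "nullability_loosened":        "block",
}

def decide(drifts, policy=POLICY):
    """Return 'block' if any drift's policy says so; else 'auto_apply' or 'ok'."""
    if not drifts:
        return "ok"
    actions = {policy.get(d[0], "block") for d in drifts}  # unknown kinds block
    return "block" if "block" in actions else "auto_apply"

assert decide([("new_nullable_column", "refund_status", "text")]) == "auto_apply"
assert decide([("missing_column", "attempt_count", "integer"),
               ("new_required_column", "attempts_total", "integer")]) == "block"
```

Note the default: a drift kind the policy has never heard of blocks. Failing open on unknown drift kinds would reintroduce the silent-drift problem the layer exists to solve.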
A second engineering choice the matrix hides: who gets paged when the policy fires? The default is "the pipeline owner" — but the pipeline owner can't fix a rename, only the upstream team can. Mature contract systems route the alert to the upstream team's on-call rotation by reading an owner: field from the source's catalog entry. The Cred contracts platform, the Razorpay data-platform team's internal "lakehouse-contracts" service, and the Swiggy "Sherlock" tool all do this. The pattern is: drift detected → contract layer halts the pipeline → alert routed to the upstream team's PagerDuty/OpsGenie → upstream team either rolls back the deploy or files a contract-change PR → contracts team reviews → pipeline resumes. The whole loop completes in under 2 hours when it works; the alternative (no contract, dashboard goes wrong, analyst escalates) takes 3 days median.
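The routing step is a lookup from the source's catalog entry to a pager target. A sketch with an in-memory dict standing in for whatever metadata store holds the owner: field — all names and routing keys here are illustrative:

```python
CATALOG = {  # stand-in for the real source catalog
    "delivery_attempts": {"owner": "logistics-backend", "tier": "gold"},
}

ONCALL = {  # team -> pager routing key
    "logistics-backend": "pagerduty:logistics-backend-oncall",
}

def route_drift_alert(table, drifts, default="pagerduty:data-platform-oncall"):
    """Page the upstream owner — who can actually fix a rename — not the
    pipeline owner; fall back to the data platform if no owner is declared."""
    owner = CATALOG.get(table, {}).get("owner")
    target = ONCALL.get(owner, default)
    summary = ", ".join(d[0] for d in drifts)
    return target, f"[contract] {table} blocked: {summary}"

target, msg = route_drift_alert(
    "delivery_attempts",
    [("missing_column", "attempt_count", "integer")],
)
assert target == "pagerduty:logistics-backend-oncall"
```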
Common confusions
- "Schema drift means the source's data drifted." No. Data drift (the distribution of values changes) is a different thing — that's a problem for ML monitoring and freshness checks. Schema drift specifically means the shape of the source changed: columns added/removed/renamed, types changed. Conflating them is the most common terminology error in data observability tooling.
- "If the cursor advanced and rows landed, the load was correct." The cursor only proves rows moved through the pipeline; it cannot prove the columns the rows have are the columns you expected. A renamed column produces a stream of rows where the value of the column you SELECTed is uniformly NULL — the cursor advances cleanly, the rows land, the values are useless.
- "The destination warehouse will reject incompatible writes, so I don't need a contract." Sometimes —
INSERTof a string into an integer column does fail at the warehouse. But many drift modes don't trigger a write error: a new column in the source is silently dropped at the SELECT, a NULL value loaded into a previously-required column is permitted by every warehouse, a widened source type is silently truncated when written to a narrower destination column withCONTINUE_ON_ERROR. The warehouse is not your contract layer. - "Auto-applying schema changes to the destination is too risky." It depends on the change. Adding a nullable column to the destination is purely additive — no existing query breaks because no existing query references the new column. Widening a type from INT to BIGINT has no value-range hazard. The risky drifts (rename, narrowing, required-column-added) are exactly the ones the policy should block on; the safe ones are exactly the ones it should auto-apply.
- "dbt's
on_schema_changesetting handles drift." dbt's setting handles drift on the model side (what to do when the SELECT statement's output columns change), not on the source side (what happens when the source table the model reads from changes). The source-side drift is exactly what the contract layer this chapter built catches; dbt's setting is the next layer down. Both are needed. - "We have data tests, so drift is covered." Data tests (uniqueness, not-null, accepted-values) check value-level invariants on existing columns. They cannot catch a missing column — a column that no longer exists is not in the test's input. The shape check has to come before the value check. Some teams try to detect drift after the fact by diffing rowcounts of non-null values; this works occasionally but has a hours-to-days lag and an indirect failure mode (you see "more NULLs than expected" and have to backtrack to "the column was renamed"). After-the-fact monitors are a useful second layer; they are not a substitute.
Going deeper
Schema registries and the streaming generalisation
In Kafka-based streaming systems (Build 7+) the schema problem is solved at the message-broker layer by a schema registry — a service that holds the canonical Avro / Protobuf / JSON Schema for every topic and refuses to publish messages that violate it. Confluent Schema Registry is the canonical implementation; Apicurio and AWS Glue Schema Registry are alternatives. The producer registers its schema on first publish, the consumer reads the registered schema on subscribe, and any incompatible change is rejected at the registry boundary.
The registry supports compatibility modes: BACKWARD (new schema can read old data — adding optional fields is fine, removing fields is not), FORWARD (old schema can read new data — removing fields is fine, adding required fields is not), FULL (both directions). Most production teams use BACKWARD for the consumer-driven contract pattern, where the producer can add optional fields freely but cannot break existing consumers. The batch-pipeline contract layer this chapter built is the SQL-world equivalent of BACKWARD compatibility — the pipeline tolerates additive change and blocks on subtractive change.
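The batch-side analogue of BACKWARD compatibility can be stated as a predicate over two column maps (same (type, nullable) shape this chapter uses for expected schemas). The rule set below is a deliberate simplification of what a real registry checks:

```python
def is_backward_compatible(old, new):
    """Can a reader written against `old` still read data shaped like `new`?

    Simplified rules: every old column must survive with the same type,
    and any added column must be nullable (the optional-field equivalent).
    """
    for col, (old_type, _) in old.items():
        if col not in new or new[col][0] != old_type:
            return False
    for col, (_, nullable) in new.items():
        if col not in old and not nullable:
            return False  # a new required column breaks old readers
    return True

v1 = {"id": ("bigint", False), "amount": ("integer", False)}
assert is_backward_compatible(v1, {**v1, "note": ("text", True)})         # additive, optional: fine
assert not is_backward_compatible(v1, {"id": ("bigint", False)})          # dropped column: breaks
assert not is_backward_compatible(v1, {**v1, "region": ("text", False)})  # new required: breaks
```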
The mature shape of an Indian-fintech data platform in 2026 has both: a schema registry for the streaming pipelines (Kafka topics governed by Confluent Schema Registry) and a SQL-layer contract for the batch pipelines (dbt source freshness + Soda contracts). They serve the same purpose at different layers of the stack.
Detecting renames automatically: the heuristic match problem
A column rename looks to the contract layer like two simultaneous drifts: the old column went missing and a new column appeared. Distinguishing "this is a rename of an existing column" from "this is a drop and an unrelated add" is the rename-detection problem.
Some contract systems try heuristic matching: if the dropped column and the new column have the same data type and similar distribution of values, it's probably a rename. Monte Carlo's "schema change" detector does a version of this. The heuristic is unreliable — two columns of type VARCHAR(255) with overlapping values can easily be unrelated — and the consequence of a wrong heuristic is data corruption (rows mapped to the wrong destination column).
The honest answer: don't auto-detect renames. Always block on the simultaneous-drop-and-add pattern, force a human to either confirm "yes, this is a rename — apply this mapping" or "no, this is unrelated — drop the old column and add the new one as separate operations". The Razorpay contracts team learned this after a heuristic auto-mapped tax_amount (rupees) onto a freshly-added tax_basis_points (basis points, 10000× scale) and produced a week of incorrect financial reports.
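Flagging the pattern without auto-mapping it is straightforward: pair each missing column with same-typed new columns and demand a human decision. The drift-tuple shapes below loosely follow this chapter's classifier:

```python
def rename_suspects(drifts):
    """Pair missing columns with same-typed new columns — flag only, never map."""
    missing = [(d[1], d[2]) for d in drifts if d[0] == "missing_column"]
    added = [(d[1], d[2]) for d in drifts
             if d[0] in ("new_nullable_column", "new_required_column")]
    return [
        (old, new)
        for old, old_type in missing
        for new, new_type in added
        if old_type == new_type  # a type match is a hint, not proof
    ]

drifts = [
    ("missing_column", "attempt_count", "integer"),
    ("new_required_column", "attempts_total", "integer"),
]
suspects = rename_suspects(drifts)
# → [('attempt_count', 'attempts_total')] — surfaced in the alert, but the
#   pipeline still blocks until a human confirms or rejects the mapping.
```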
Postgres pg_event_trigger as an upstream signal
Postgres exposes a system-level hook called pg_event_trigger that fires on DDL statements (CREATE TABLE, ALTER TABLE, etc.). A small extension on the OLTP database can subscribe to these events, write the DDL into an audit table, and the contract layer can poll the audit table to detect drift the moment it happens — rather than waiting for the next pipeline run.
The pattern is rare in pure-batch teams (the polling lag is not a meaningful improvement when the pipeline runs every 4 hours anyway) but useful in real-time-analytics teams where a 15-minute pipeline that ingested rows after a rename has 15 minutes of bad data to backfill. The Cred analytics team installed this in 2024 after a rename deploy at 11 p.m. caused 6 hours of incorrect rewards-attribution data; the pipeline now blocks within 2 minutes of the DDL and the on-call engineer is paged with the full ALTER TABLE statement in the alert.
The same idea exists in MySQL (audit_log plugin), in Snowflake (ACCESS_HISTORY view), and in BigQuery (Cloud Audit Logs). Cross-database, the pattern is "subscribe to the source's DDL stream and validate the contract reactively, not just on every batch run".
The "schema diff" as a first-class git artefact
Mature data teams version-control their EXPECTED schema as a YAML file alongside the pipeline code. The contract layer reads the YAML; any change to the YAML is a PR that has to be reviewed; the diff between the source's actual schema and the YAML is a first-class object that shows up in PR descriptions ("this PR accepts a new nullable column refund_status TEXT NULL introduced by the OLTP team in deploy #4128").
The dbt sources YAML and the OpenLineage facets format are the two industry conventions for this. dbt's freshness: and loaded_at_field: aren't schema contracts in the strict sense, but the surrounding columns: block is — every source column is declared, with optional tests: and meta: annotations. The contract layer this chapter built can read directly from the dbt sources YAML to know what shape to expect. The benefit of the YAML-versus-Python representation: the schema is reviewable as a diff, the change history is visible in git log, and downstream consumers can read the same file to see what guarantees the pipeline gives.
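One direction of that round-trip — rendering a Python-side contract as a reviewable sources-style YAML fragment — is a few lines of string building. The layout below follows dbt's sources convention loosely; the data_type/nullable meta keys are an assumption, not part of dbt's schema:

```python
def to_sources_yaml(table, columns):
    """Render a {col: (type, nullable)} contract as a dbt-flavoured YAML fragment."""
    lines = [f"  - name: {table}", "    columns:"]
    for col, (col_type, nullable) in columns.items():
        lines.append(f"      - name: {col}")
        lines.append(f"        meta: {{data_type: {col_type}, nullable: {str(nullable).lower()}}}")
        if not nullable:
            lines.append("        tests: [not_null]")
    return "\n".join(lines)

fragment = to_sources_yaml("delivery_attempts", {
    "id": ("bigint", False),
    "agent_id": ("bigint", True),
})
assert "tests: [not_null]" in fragment and "nullable: true" in fragment
```

Once the YAML is the source of truth, the contract layer parses it instead of a Python dict, and every schema change becomes a reviewable git diff.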
Soft-deletes and the is_active column trap
A subtle subclass of schema drift involves semantic drift in an existing column rather than the column set itself. The Flipkart catalogue team learned this the hard way in 2023: the products table had a column is_active BOOLEAN that the data team treated as "is this product available for sale?". Six months in, the application team added a soft-delete pattern using a new column deleted_at TIMESTAMP and changed the meaning of is_active from "available for sale" to "approved by the merchant team", which is a different concept. The column's type and nullability did not change. The contract layer would not have caught this because at the schema level nothing changed — only the meaning of values shifted.
The defence at the schema level is impossible (you cannot detect a semantic change from information_schema). The defence at the data level is a value-distribution check — the contract layer also samples the column's value distribution every run and fires an alert when the distribution shifts beyond a tolerance. If is_active was 92% true historically and suddenly drops to 41% true after a deploy, the alert fires even though the schema is identical. Great Expectations' "expect_column_proportion_of_unique_values_to_be_between" and Soda Core's "anomaly score" both implement this. The pattern is: schema contract for column-shape drift; value-distribution contract for semantic drift; both run every load.
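A minimal form of that distribution check — compare a column's observed true-rate against a historical baseline with a tolerance. The 92%/41% figures echo the is_active example above; the tolerance value is an assumption:

```python
def proportion_drift_alert(values, baseline, tolerance=0.10):
    """Fire when the observed proportion of truthy values strays from baseline."""
    if not values:
        return False
    observed = sum(1 for v in values if v) / len(values)
    return abs(observed - baseline) > tolerance

# Historically 92% of is_active values were true; after the deploy, 41%.
sample = [True] * 41 + [False] * 59
assert proportion_drift_alert(sample, baseline=0.92)        # fires
healthy = [True] * 92 + [False] * 8
assert not proportion_drift_alert(healthy, baseline=0.92)   # quiet
```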
The cost of a drift incident: what teams actually pay
The internal post-mortems from Indian fintech and consumer-tech companies in 2024–2025 give a rough cost distribution for drift incidents. The numbers are approximate but consistent across teams:
- Detection time without a contract layer: median 3 days (the dashboard goes wrong, an analyst notices, escalates, the data engineer investigates, identifies the rename, fixes the SELECT). Worst observed: 6 weeks (a finance reconciliation in quarterly close discovered the drift).
- Detection time with a contract layer: median 4 hours (the next pipeline run blocks, the on-call engineer triages, the upstream team is paged, the contract is updated, the pipeline resumes).
- Engineering cost of a drift incident: 1–4 person-days for a typical case, 2 person-weeks for the worst case where the bad data has to be backfilled across multiple downstream tables.
- Engineering cost of running a contract layer: 1 person-week of one-time setup, ~2 hours/week of ongoing maintenance (reviewing accepted drifts, triaging alerts).
The cost arithmetic is clear: the contract layer pays for itself after the first incident. Teams that have lived through a 6-week-undetected drift incident put the contract layer in place permanently; teams that haven't tend to view it as over-engineering. The honest engineering judgement is to put it in place before the first incident, not after.
An honest observation: most teams misclassify their own appetite at first. The team's stated policy says "block on new_required_column" but the on-call engineer who gets paged at 11 p.m. with a blocked pipeline tends to flip the policy to continue to clear the alert and go back to bed. Three months later half the matrix has drifted toward "continue" and the team has the same silent-drift problem they started with. The discipline is to keep the policy stable across pages — if the matrix says block, the right action is to wake up the upstream team's on-call, not to weaken the policy. Mature teams enforce this with a rule that drift policy can only be changed by a PR with a code-owner approval, never by an on-call ack action.
A subtle implication of the matrix is that safe drifts auto-apply DDL on the destination — the contract layer doesn't just permit a new nullable column, it issues ALTER TABLE warehouse.delivery_attempts ADD COLUMN refund_status TEXT NULL to make the destination match. This is the "self-healing" behaviour mature contract platforms describe. The risk is a destination DDL that fires on every run because the contract was misconfigured, accumulating columns the source briefly added and then dropped; the mitigation is a per-day rate limit on auto-applied DDL and a weekly summary of all auto-applied changes for the team to review. Razorpay's lakehouse-contracts team caps auto-applied DDL at 3 changes per source per day; anything beyond that escalates to manual review even if the change kind is in the safe list.
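The per-day cap is a small stateful guard in front of the DDL executor. A sketch — the cap of 3 matches the figure quoted above, but the in-memory dict stands in for whatever durable store a real system would use:

```python
from collections import defaultdict
from datetime import date

applied_today = defaultdict(int)  # (source_table, date) -> auto-applied DDL count

def allow_auto_ddl(source_table, cap=3, today=None):
    """Permit at most `cap` auto-applied DDL changes per source per day;
    anything beyond escalates to manual review even for safe drift kinds."""
    key = (source_table, today or date.today())
    if applied_today[key] >= cap:
        return False
    applied_today[key] += 1
    return True

d = date(2026, 1, 6)
results = [allow_auto_ddl("delivery_attempts", today=d) for _ in range(5)]
assert results == [True, True, True, False, False]
```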
Where this leads next
- Cursors, updated_at columns, and their lies — chapter 13, the four ways the cursor itself can betray you (drift is a fifth)
- Late-arriving data and the backfill problem — chapter 14, the time-axis sibling of this chapter's column-axis problem
- Full refresh vs incremental: when to pick which — chapter 16, the load-strategy framing that drift partially constrains
- Append-only vs mutable source tables — chapter 15, the source-shape framing
- Data contracts and lineage at scale — Build 5, where this chapter's 60-line contract grows into a contracts platform with lineage, ownership, and SLA enforcement
Build 5 is where the contract pattern becomes a first-class platform — every source has a declared schema, every consumer has a subscription, and changes propagate through a lineage graph so a backend engineer who renames a column sees the list of 14 downstream dashboards that will break before they merge the PR. The 60-line script in this chapter is the seed; the platform Build 5 describes is the tree it grows into.
References
- Confluent Schema Registry — the canonical reference for streaming-side schema enforcement, including the BACKWARD/FORWARD/FULL compatibility modes.
- Great Expectations: schema validation — open-source contract layer with first-class schema-drift expectations.
- Soda Core: data contracts — YAML-driven contracts, a popular choice in dbt-shaped warehouses.
- dbt sources YAML — the de facto industry format for declaring source schemas alongside pipeline code.
- Postgres event triggers — the upstream-side signal for reactive drift detection.
- Monte Carlo: schema change detection — commercial implementation with rename heuristics and lineage propagation.
- Append-only vs mutable source tables — the previous chapter, which established the source-shape axis that drift extends over time.
- Apache Iceberg: schema evolution — the table-format-level treatment of additive and rename evolution rules.
The honest summary: the pipeline that survives schema drift is not the one with the smartest cursor or the most elaborate retry policy — it is the one that asserts the source's column shape before reading the data and fails loudly when the shape changes. The cost of the assertion is 50ms per run and 60 lines of code; the cost of skipping it is a Tuesday morning Slack thread that starts "the dashboard is showing zero" and ends, three days later, with a backfill ticket and a post-mortem. Every Indian fintech data team that has been through that cycle once writes a contract layer; the question is whether you write yours before or after the first incident.
A practical exercise: pick the source table in your warehouse that is most coupled to a fast-moving application team's deploys. Query its information_schema.columns, paste the output into a YAML file, commit it. Wire a 30-line contract check into the pipeline that reads from it. Wait two weeks. Most teams that try this catch at least one drift in the first month — usually a new nullable column the application team added without telling anyone, but sometimes a rename or a type widening that would have silently corrupted next quarter's analytics. The exercise pays for itself the first time it fires.
The longer-term destination, once the contract is in place and the team has lived with it for a quarter, is a culture shift in the upstream team's deploy process. The application engineers stop merging schema-changing PRs without a data-contract-update companion PR, the contracts become reviewable artefacts in the same way the API spec is, and "schema drift" stops being a category of incident. The 60-line script is not the end state; it is the lever that moves the conversation from "the dashboard broke at 3 a.m." to "the data team and the application team coordinate schema changes the same way the application team coordinates API changes". That coordination is the actual win — the script is just the wedge.