Append-only vs mutable source tables
Karan inherits a 4-year-old pipeline at a Bengaluru fintech. Every 5 minutes it pulls "new rows" from the OLTP payments table using WHERE created_at > :last_seen, advances a cursor, and writes to ClickHouse. For three years it ran clean. Then the risk team rolled out a feature where a payment can be re-classified from success to chargeback up to 90 days later — they update the existing row in place. The pipeline keeps running. The cursor keeps advancing. ClickHouse keeps showing the original success for every chargeback the OLTP has since corrected. Karan's first instinct is "the cursor is broken." It is not. The source table changed shape — from effectively append-only to mutable — and the cursor that worked for one shape silently produces wrong answers in the other.
An append-only source table never updates rows after insert; a mutable source table can. That single property decides whether a created_at cursor is sufficient, whether you need an updated_at cursor with overlap, whether you need CDC, and whether you can avoid full snapshots. Pipelines break when the source quietly shifts from one shape to the other and the load logic does not.
The two shapes, and what makes one "append-only"
A source table is append-only when, after a row is inserted, it is never updated and never deleted. Every row is final. New events become new rows. Corrections become new rows that supersede earlier ones logically (in the warehouse layer, downstream). The OLTP enforces this either by convention — application code never issues UPDATE against the table — or by schema — no updated_at, no deleted_at, sometimes a CHECK constraint or a database trigger that rejects updates.
A source table is mutable when the application freely updates rows after insert. The same primary key can exist with status='pending' at 14:00 and status='success' at 14:03. Deletes (hard or soft) are allowed. The row's identity is the primary key; its content is whatever the latest write said.
The distinction is not a database type — Postgres, MySQL, MongoDB, DynamoDB can each be either, depending on how the application uses them. The distinction is a property of the access pattern, and most production tables drift over their lifetime: a table starts append-only because the V1 product only inserts, then a V2 feature adds an UPDATE, and from that day onward the table is mutable. Nothing about the schema changed; the access pattern did. The pipeline reading it has no way to know — unless the data engineer reads the application code or watches the database's WAL.
The first instinct of every data engineer who has been bitten by this is "I'll just check information_schema or the database's update logs." That works for some sources and not others — a Postgres UPDATE shows up in pg_stat_user_tables.n_tup_upd, a MongoDB update shows up in the oplog, but a MySQL update against an InnoDB table without binlog gives you nothing visible from the outside. The reliable path is the application contract: ask the team that owns the source whether their write path uses UPDATE, get a written answer, and re-ask every six months because the answer changes when features ship. The unreliable path is to assume — most of the chapter exists because that assumption is what produces the silent-correctness bugs that take three months to detect and a week to clean up.
Why the schema doesn't tell you: a column named updated_at is suggestive but not authoritative — many teams add the column "just in case" and never update it. Conversely, a table with no updated_at can still be mutated; the application just doesn't track when. The presence or absence of an UPDATE statement in the application's code path is the only reliable signal, and that signal is application-private. The schema is a hint, not a contract.
What each shape forces you to build
The cursor pattern from the previous two chapters — WHERE updated_at > :high_water_mark plus a tracked checkpoint — is the universal incremental load. But the cursor that works depends on the shape of the source. Pick the wrong cursor for the shape and the pipeline silently produces wrong answers; pick the right one and the load is straightforward.
Append-only source: monotonic-id cursor is sufficient
If the source guarantees no updates, a monotonic identifier (auto-incrementing id, snowflake ID, sequence) is a complete cursor. The pipeline reads WHERE id > :last_id ORDER BY id LIMIT N, processes the rows, advances last_id to the maximum seen. Three properties make this clean:
- No double-counting: every row is read at most once because id is monotonically increasing and the cursor is strictly greater-than.
- No missed updates: there are no updates to miss; the row is final at insert.
- Cheap restart: a crash mid-batch resumes from the last committed last_id with no overlap window. The next batch picks up exactly the unprocessed rows.
The Razorpay payment_events table is the canonical Indian-fintech example. Every state change to a payment writes a new row — payment.created, payment.authorized, payment.captured, payment.failed, payment.refunded — keyed on (payment_id, sequence). The downstream warehouse aggregates the events into a payments_current table with the latest state per payment_id. The source table is append-only by contract; the warehouse computes the mutable view. The cursor is WHERE event_id > :last_event_id, simple and correct.
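The loop that pattern implies is short. A minimal sketch, with sqlite3 standing in for the OLTP and an illustrative payment_events schema (not Razorpay's actual one):

```python
# Minimal monotonic-id cursor against an append-only table.
# sqlite3 stands in for the OLTP; the schema is illustrative.
import sqlite3

def load_new_rows(conn, last_id, batch_size=1000):
    """Read rows strictly after the checkpoint, in id order."""
    rows = conn.execute(
        "SELECT event_id, payment_id, event FROM payment_events "
        "WHERE event_id > ? ORDER BY event_id LIMIT ?",
        (last_id, batch_size),
    ).fetchall()
    # Advance the checkpoint only to ids we have actually seen.
    new_last_id = rows[-1][0] if rows else last_id
    return rows, new_last_id

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment_events "
             "(event_id INTEGER PRIMARY KEY, payment_id INT, event TEXT)")
conn.executemany("INSERT INTO payment_events VALUES (?, ?, ?)", [
    (1, 7, "payment.created"),
    (2, 7, "payment.captured"),
    (3, 9, "payment.created"),
])

rows, last_id = load_new_rows(conn, last_id=0)       # reads all three rows
rows, last_id = load_new_rows(conn, last_id=last_id)
print(rows, last_id)  # [] 3 -- nothing new, checkpoint holds
```

Because the source never updates rows, there is no overlap window and no dedup; a crash before the checkpoint commit simply re-reads the same batch, which is idempotent if the destination write is keyed on event_id.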
Mutable source with updated_at: cursor with overlap and dedup
If the source updates rows in place but maintains a reliable updated_at column on every write, you can use WHERE updated_at > :high_water_mark. Three caveats from the previous chapter ("Cursors, updated_at columns, and their lies") apply:
- Read with overlap: query updated_at >= :high_water_mark - 5min instead of strict > to catch transactions that committed late.
- Dedup on load: rows that fall inside the overlap window will be re-read; the destination must dedup on (primary_key, updated_at) or treat the load as a MERGE against the destination's current row.
- Trust requires audit: not every UPDATE statement updates updated_at. Any code path that writes to the table without setting updated_at = NOW() produces silent misses. The mitigation is a database trigger that forces updated_at to the current transaction timestamp on every write.
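A sketch of the overlap-plus-dedup read under those caveats. sqlite3 stands in for the source, the table and column names are illustrative, and the applied-set would normally be the destination's own (primary_key, updated_at) index rather than an in-memory set:

```python
# updated_at cursor with a 5-minute overlap and dedup on (id, updated_at).
# sqlite3 stands in for the source; names are illustrative.
import sqlite3
from datetime import datetime, timedelta

OVERLAP = timedelta(minutes=5)

def read_with_overlap(conn, high_water_mark):
    """Re-read the overlap window; the caller dedups before applying."""
    since = (high_water_mark - OVERLAP).isoformat(sep=" ")
    return conn.execute(
        "SELECT id, status, updated_at FROM payments WHERE updated_at >= ?",
        (since,),
    ).fetchall()

def dedup(rows, applied):
    """Drop (id, updated_at) pairs the destination has already applied."""
    return [r for r in rows if (r[0], r[2]) not in applied]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)")
conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", [
    (1, "success",    "2026-01-01 14:02:00"),   # inside the overlap window
    (2, "chargeback", "2026-01-01 14:07:00"),   # new since the checkpoint
])

hwm = datetime(2026, 1, 1, 14, 5)
rows = read_with_overlap(conn, hwm)             # both rows: updated_at >= 14:00
fresh = dedup(rows, applied={(1, "2026-01-01 14:02:00")})
print([r[0] for r in fresh])  # [2]
```

The MERGE alternative replaces dedup() entirely: apply every re-read row as an upsert keyed on the primary key, and let the destination's current row absorb the duplicates.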
Mutable source with deletes: cursor cannot detect deletions
A row that existed yesterday and is deleted today does not appear in any updated_at > :ts query — the row is gone. The pipeline that uses cursor-based incremental loading silently retains the old row in the warehouse forever. There are three escapes:
- Soft deletes: replace DELETE with UPDATE SET deleted_at = NOW(). The deleted row still appears in updated_at queries, and the load logic treats deleted_at IS NOT NULL as a tombstone.
- CDC (Change Data Capture): read from the database's WAL/binlog/oplog instead of the table. Inserts, updates, and deletes all appear as separate event types. Build 11 covers this in depth.
- Periodic full snapshot: every N days, do a full reconciliation between source and warehouse — any row still in the warehouse but gone from the source is a delete the cursor missed; mark it deleted. Expensive at scale.
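The reconciliation in the third escape reduces to two set differences over primary keys (production versions compare (id, row_hash) pairs so changed rows are caught as well as deleted ones). A toy sketch with hypothetical ids:

```python
# Snapshot reconciliation: compare the full id sets on both sides.
def reconcile(source_ids, warehouse_ids):
    missing_from_warehouse = source_ids - warehouse_ids  # rows the cursor never loaded
    deleted_in_source = warehouse_ids - source_ids       # hard deletes to tombstone
    return missing_from_warehouse, deleted_in_source

missing, deleted = reconcile(source_ids={1, 2, 3}, warehouse_ids={2, 3, 4})
print(missing, deleted)  # {1} {4}
```

The expensive part at scale is not the set difference but producing the two id scans without hammering the source, which is why this runs every N days rather than every load.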
Mutable source without updated_at: full snapshot or CDC, no cursor
If the source has no reliable update timestamp, no cursor can incrementally load it. The choices are:
- Full snapshot: scan the entire source every N hours, replace the warehouse copy. Works up to ~10⁷ rows; breaks at 10⁸ rows on a single warehouse partition.
- CDC: the WAL/binlog/oplog contains the canonical timeline of every change; subscribe to it and stream changes downstream.
- Add updated_at: politely demand the source-owning team add a maintained updated_at column. This is the cheapest fix if it's available; many teams discover the column is "free" because the ORM already maintains it for audit purposes.
Many production pipelines combine three of these — cursor for the 99% append/update path, CDC subscription for deletes, and a weekly full-snapshot reconciliation as the catch-all for everything the first two missed. The combination is more code than any single approach but it is also the only combination that survives a 3-year audit.
A worked example: detecting the shape and choosing the cursor
Below is a small Python utility that introspects a Postgres source table and recommends the cursor pattern. It checks three things: whether an updated_at column exists, whether a deleted_at column exists, and the table's lifetime insert/update/delete counts from pg_stat_user_tables. The output is the cursor recommendation.
```python
# detect_table_shape.py
# Introspects a Postgres source and recommends the cursor pattern.
import psycopg2

DSN = "host=localhost dbname=razorpay_test user=app password=app"

def has_updated_at(cur, schema, table):
    cur.execute("""
        SELECT 1 FROM information_schema.columns
        WHERE table_schema=%s AND table_name=%s AND column_name='updated_at'
    """, (schema, table))
    return cur.fetchone() is not None

def has_deleted_at(cur, schema, table):
    cur.execute("""
        SELECT 1 FROM information_schema.columns
        WHERE table_schema=%s AND table_name=%s AND column_name='deleted_at'
    """, (schema, table))
    return cur.fetchone() is not None

def update_rate(cur, schema, table):
    cur.execute("""
        SELECT n_tup_ins, n_tup_upd, n_tup_del
        FROM pg_stat_user_tables
        WHERE schemaname=%s AND relname=%s
    """, (schema, table))
    row = cur.fetchone() or (0, 0, 0)
    ins, upd, dele = row
    total = ins + upd + dele
    return (ins, upd, dele, (upd / total) if total else 0.0)

def recommend(schema, table):
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        upd_at = has_updated_at(cur, schema, table)
        del_at = has_deleted_at(cur, schema, table)
        ins, upd, dele, frac = update_rate(cur, schema, table)
        print(f"table: {schema}.{table}")
        print(f"  inserts={ins:,} updates={upd:,} deletes={dele:,}")
        print(f"  update fraction={frac:.2%} updated_at column={upd_at} deleted_at column={del_at}")
        if frac < 0.001 and not del_at and not upd_at:
            print("  -> append-only — use monotonic-id cursor")
        elif upd_at and del_at:
            # Check soft deletes before the no-deletes branch: a table with a
            # deleted_at column and n_tup_del == 0 is soft-deleting, not delete-free.
            print("  -> mutable, soft deletes — updated_at cursor + tombstone logic")
        elif upd_at and dele == 0:
            print("  -> mutable, no deletes — use updated_at cursor with 5-min overlap + MERGE")
        elif upd_at and dele > 0:
            print("  -> mutable + hard deletes — cursor + weekly full-snapshot reconciliation")
        else:
            print("  -> no reliable cursor — use CDC (Build 11) or full snapshot")

if __name__ == "__main__":
    recommend("public", "payment_events")
    recommend("public", "users")
    recommend("public", "orders")
```
```
# Output:
table: public.payment_events
  inserts=148,203,419 updates=0 deletes=0
  update fraction=0.00% updated_at column=False deleted_at column=False
  -> append-only — use monotonic-id cursor
table: public.users
  inserts=8,972,114 updates=21,480,331 deletes=0
  update fraction=70.54% updated_at column=True deleted_at column=True
  -> mutable, soft deletes — updated_at cursor + tombstone logic
table: public.orders
  inserts=48,219,007 updates=9,240,118 deletes=412,889
  update fraction=15.97% updated_at column=True deleted_at column=False
  -> mutable + hard deletes — cursor + weekly full-snapshot reconciliation
```
Three queries do the work. pg_stat_user_tables gives you the live insert/update/delete counts per table — a free signal from Postgres that most teams ignore. update fraction = upd / total is the heuristic — under 0.1% is "essentially append-only", over 1% is "definitely mutable", in between is "ask the application team". The recommend() branches map the shape combinations to cursor patterns; the table's behaviour, not its name, decides which branch fires.
Why pg_stat_user_tables is the right signal: it's maintained by Postgres on every write, costs zero to query, and gives the actual update rate over the table's lifetime — not a guess from sampling, not a self-reported number from the application team. The same data is available in MySQL via INFORMATION_SCHEMA.TABLES.UPDATE_TIME plus binlog statistics, and in MongoDB via db.collection.stats().wiredTiger.cursor.update calls. Every database exposes this; almost no pipeline checks it. The first run of this script on a new pipeline routinely surfaces "we thought it was append-only but it's 30% updates."
A second-order check most pipelines skip: even when pg_stat_user_tables shows n_tup_upd = 0, ask the source team whether the application uses INSERT ... ON CONFLICT DO UPDATE (Postgres UPSERT). The upsert path increments n_tup_upd only on the conflict branch; if the application's pattern is "always insert, conflicts are rare", the update fraction reads as near-zero even though the table is technically mutable. The fix is to read the application's INSERT statements directly, not infer from statistics. The script above is a starting heuristic, not the final word — the final word is always the application contract.
Why the 0.1% / 1% thresholds in the recommend() function are not arbitrary: at under 0.1% updates, the cost of treating the table as mutable (overlap reads, dedup, MERGE) outweighs the cost of the rare missed update — most teams choose to ignore that long tail and accept it in the weekly reconciliation. Above 1%, the cost of running a cursor without dedup is prohibitive — the missed updates accumulate fast enough to corrupt downstream aggregates within days. The 0.1–1% band is the awkward middle where the right answer depends on the cost of being wrong: payments tables go to MERGE at 0.1%, marketing-event tables tolerate 1%. The thresholds are heuristics calibrated against Indian-fintech production data; tune them for your own cost-of-wrong.
When the source quietly changes shape
The most painful production failures in this category are not "the source was always mutable and we used the wrong cursor" — those are caught in design review. The painful failures are "the source was append-only for 3 years and quietly became mutable because someone shipped a feature, and the pipeline didn't notice." Two real shapes of this drift:
The risk-classification feature. A payments OLTP table is append-only by V1 contract — every payment is a final row. V2 ships a fraud-classification system that, after manual review, can re-classify a payment from success to chargeback up to 90 days later. The team that ships this is the risk team; the data team that owns the warehouse pipeline is not consulted. The pipeline keeps using WHERE id > :last_id. Every chargeback re-classification is a silent miss. Three months later, the finance team notices that the warehouse total of chargebacks is 18% lower than the OLTP total, files a ticket, and the data team spends a week reconstructing the right answer from the WAL retention.
The address-edit feature. A users table starts append-only — sign-up creates a row, no edits. V2 ships profile edit. The application updates users.address, users.phone, users.email in place. The pipeline that loads users uses WHERE id > :last_id and never sees the edits. The marketing team's email campaign uses the warehouse copy and sends to old email addresses. The first complaint is a customer whose new email never received the verification; the investigation reveals the pipeline is loading from a years-old snapshot of the row.
The defence is structural, not procedural. Make the cursor's correctness depend on a property of the source that the source team must consciously break. Two patterns:
- A monitor that compares row counts. The pipeline runs SELECT COUNT(*), MAX(id), MAX(updated_at) FROM source once an hour and stores the result. If n_tup_upd from pg_stat_user_tables ever crosses a threshold (say, 1% of n_tup_ins), the monitor pages the data team. The cost is one query per hour; the win is that the day the source becomes mutable is the day the data team finds out, not three months later.
- A schema-contract registry. The data team owns a contract document — a YAML file in version control — that lists for every source table its declared shape (append-only, mutable-with-updated-at, mutable-with-cdc). Every source-team pull request that touches the table runs a CI check against this contract. A PR that adds an UPDATE to an append-only table fails CI; the source team has to either downgrade the shape in the contract (with the data team's review) or restructure the change. The cost is one CI hook; the win is that the contract is enforced before the change ships, not after the warehouse drifts.
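The CI half of that defence is a few lines once the contract is parsed. A sketch, with the contract shown as an already-loaded dict (in practice the YAML file in version control) and hypothetical table names:

```python
# Sketch of the CI contract check. The contract is shown as a parsed dict;
# table names and shapes are illustrative, not any real repository's.
CONTRACT = {
    "public.payment_events": "append-only",
    "public.users": "mutable-with-updated-at",
}

def check_pr(contract, tables_gaining_updates):
    """Tables this PR adds an UPDATE against; non-empty result fails the build."""
    return [t for t in tables_gaining_updates
            if contract.get(t) == "append-only"]

print(check_pr(CONTRACT, ["public.payment_events"]))  # ['public.payment_events']
print(check_pr(CONTRACT, ["public.users"]))           # []
```

The hard part in practice is not this function but detecting which tables a PR writes to, which usually means grepping the diff for UPDATE/DELETE statements or hooking the ORM's migration layer.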
Most Indian fintechs run both. The runtime monitor catches the case where a source team forgets the contract or where the contract was wrong from day one. The CI hook catches the deliberate-but-uncommunicated change. Together they reduce silent drift from "discovered three months later in an audit" to "caught in the PR or in the next hour".
Common confusions
- "An auto-incrementing primary key means the table is append-only." No. The PK only identifies the row; nothing about the PK prevents UPDATE against the row. A table with a serial id and no updated_at is the worst combination — looks append-only, may be mutable, gives you no signal which.
- "updated_at exists on the table, so I can trust it." Only if every write path sets it. The trustworthy version is a database trigger that sets updated_at = NOW() on every INSERT and UPDATE; the untrustworthy version is application code that "usually remembers". Audit the trigger before trusting the column.
- "Append-only is rare in real systems." It's common in event-sourced systems (Razorpay's payment_events, Zerodha's order_book_events, every Kafka topic) and in immutable audit logs. Append-only tables are the foundation of streaming pipelines; the mutable view is computed downstream.
- "CDC fixes everything, so why bother classifying the source?" CDC has its own costs — operational complexity (Debezium connectors), schema-evolution gotchas, and a higher per-event price than a polled cursor. For a source that is genuinely append-only at 50k rows/sec, a monotonic-id cursor is 10× cheaper than CDC and just as correct. Use CDC where you need it, not as the default.
- "A soft delete is the same as a hard delete." A soft delete (UPDATE SET deleted_at = NOW()) is detectable by an updated_at cursor; a hard delete (DELETE FROM table) is invisible to any cursor. Most teams don't care about the difference until a compliance audit demands "show me when this row was deleted" and the hard-deleted rows have no answer.
- "The shape of the source is the source team's problem, not mine." The shape is the contract between the two teams. A data team that does not own the contract is a data team that ships wrong dashboards; a source team that does not honour the contract is a source team that produces incidents. Both teams own the contract jointly, and both pay when it breaks.
Going deeper
Append-only as a design discipline (event sourcing)
Event sourcing is the explicit choice to make every source table append-only. Instead of storing users.address, you store an append-only user_events log with rows like {user_id: 7, event: 'address_changed', new_address: 'Koramangala', ts: 14:00}. The current address is computed as the latest event of that type for that user. The benefits compound: every cursor is a monotonic-id cursor, every backfill reads from the same log, the audit trail is the data, and "what was Riya's address at 14:30 last Tuesday" is a query the system answers natively.
The cost is everywhere. Every read becomes a fold over the event log, which is expensive enough that production systems precompute "current-state" projections from the events into denormalised tables — and now you have two tables to keep in sync, with the projection as the cache. The discipline pays off for high-audit, regulated domains (payments, trading, healthcare); it's overkill for a CRUD admin panel.
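The fold in question, in miniature; the event shape here is illustrative, not any particular system's:

```python
# Current state as a fold over an append-only event log.
# Event shape is a hypothetical example of the pattern.
def current_address(events, user_id):
    """Replay address_changed events in order; the last one wins."""
    addr = None
    for e in sorted(events, key=lambda e: e["ts"]):
        if e["user_id"] == user_id and e["event"] == "address_changed":
            addr = e["new_address"]
    return addr

events = [
    {"user_id": 7, "event": "address_changed", "new_address": "Indiranagar", "ts": 1},
    {"user_id": 7, "event": "address_changed", "new_address": "Koramangala", "ts": 2},
]
print(current_address(events, 7))  # Koramangala
```

A projection table is exactly this function precomputed and kept current as events arrive, which is why the projection can drift from the log and needs its own reconciliation.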
Razorpay's payment_events table is the canonical Indian example — every state transition is a new row, the warehouse computes the current view, and a 3-year-old payment is queryable as "what state was it in on 2023-08-12?" without a backfill. Zerodha's order_book_events is the same shape at higher scale (1.5 crore events/day during peak market hours). Both companies converged on event sourcing for the audit and replay properties; both also operate the projection/cache layer that the read path needs.
Slowly Changing Dimensions (SCD) — the warehouse's response
The data warehouse community formalised the mutable-source problem decades ago as Slowly Changing Dimensions (SCD). The Kimball typology runs from Type 0 through Type 7; in modern practice three matter:
- SCD Type 1 (overwrite): the warehouse stores only the current value. Old values are lost. Cheapest, least useful for audit.
- SCD Type 2 (versioning): every change creates a new row in the warehouse with valid_from and valid_to timestamps. The current row has valid_to = NULL. The full history is queryable. Most production warehouses use this.
- SCD Type 4 (history table): the current state is in one table, the full history is in a separate table. Half-way between Types 1 and 2.
SCD Type 2 is the bridge between mutable sources and append-only warehouses: the source team can update rows freely, the data team's load logic computes the diff against the warehouse's current row, and writes a new versioned row when the diff is non-empty. The cursor is WHERE updated_at > :hwm, the load is "merge into Type-2 destination", and the warehouse's table is internally append-only even though the source's table is mutable. The whole pattern is the data warehouse community's answer to "the source is mutable; the warehouse should keep the history anyway".
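That merge step can be sketched as a small function over the version history. The single tracked field (address) and the in-memory list are stand-ins for the warehouse table:

```python
# SCD Type 2 merge: diff the incoming source row against the current version;
# on change, close the current row and append a new one. Names are illustrative.
from datetime import datetime

def scd2_merge(history, incoming, now):
    current = next((r for r in history if r["valid_to"] is None), None)
    if current and current["address"] == incoming["address"]:
        return history                      # no diff: write nothing
    if current:
        current["valid_to"] = now           # close the superseded version
    history.append({"user_id": incoming["user_id"],
                    "address": incoming["address"],
                    "valid_from": now, "valid_to": None})
    return history

history = []
scd2_merge(history, {"user_id": 7, "address": "Indiranagar"}, datetime(2026, 1, 1))
scd2_merge(history, {"user_id": 7, "address": "Koramangala"}, datetime(2026, 2, 1))
current = [r for r in history if r["valid_to"] is None]
print(len(history), current[0]["address"])  # 2 Koramangala
```

Each call either writes nothing (no diff) or closes the current version and appends a new one, so the warehouse table only ever gains rows apart from the valid_to stamp on the superseded version.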
Why CDC is the universal answer when you can afford it
CDC (Change Data Capture) reads from the database's WAL/binlog/oplog instead of polling the table. Every insert, update, and delete becomes an event in a stream. The cursor is the WAL position (LSN in Postgres, GTID in MySQL, oplog timestamp in MongoDB). The pipeline subscribes to the stream and processes events in commit order.
The wins:
- Inserts, updates, and deletes are all visible. No more invisible deletes.
- No updated_at dependency. The WAL timestamps every commit; the application can't forget to set them.
- No table contention. The pipeline reads from the WAL; the source table sees no extra load.
- Exact ordering. Events arrive in commit order, which is the natural time-order of the source. No more "did this update happen before or after that one" ambiguity.
The costs:
- Operational complexity. A Debezium connector, a Kafka cluster, schema-registry integration, and reconnect/checkpoint discipline. Most pipelines that adopt CDC underestimate the operational cost by 3-5×.
- Schema-evolution gotchas. A column added to the source mid-stream may or may not appear in the CDC events depending on the connector's schema-handling mode.
- Replication slot drift. A Postgres replication slot that the consumer falls behind on holds WAL until the consumer catches up; a stuck consumer fills the source's disk. Production deployments alert on slot lag aggressively.
Build 11 covers the mechanism in detail. The rule for now: CDC is correct, expensive to operate, and the right answer when the source is genuinely mutable and the cost of getting deletes is significant. For append-only sources, polling is cheaper and just as correct.
The Flipkart catalog: a real shape-detection story
In 2022 Flipkart's catalog team migrated their product database from a legacy MySQL store to a sharded Vitess deployment. The data engineering team's load pipeline was a 3-year-old Airflow DAG using WHERE updated_at > :hwm against MySQL. The migration cutover happened over a weekend; the pipeline kept running against the new Vitess endpoint with no code change.
For three weeks, the warehouse copy of the catalog was 12% smaller than the source. Investigation revealed that Vitess's updated_at column was application-maintained, and the migration code had not consistently set it on the rewrite path — the column was "best effort" in the new world. The legacy MySQL pipeline trusted updated_at; the Vitess pipeline could not.
The fix landed in three layers. Short-term: a daily reconciliation job that computed the diff between source and warehouse (id, hash_of_row) and emitted upserts for the missing rows. Medium-term: a Debezium CDC stream on the Vitess shards, replacing the cursor entirely. Long-term: a CI hook on the catalog repo that fails any PR which writes to the catalog tables without going through a maintained updated_at setter. The combined fix took 4 weeks of engineering time and stopped the drift permanently.
The post-mortem's published lesson: the shape of a source is not preserved across migrations unless someone explicitly preserves it. The pipeline that worked against MySQL was correct for MySQL's contract; the same code was wrong against Vitess's. The classification — "is this source append-only, mutable-with-updated-at, or mutable-without?" — must be re-asked at every migration, every major version bump, and every architectural change to the source.
The bitemporal generalisation
The previous chapter introduced the bitemporal model — every row has valid_time (when it was true in the world) and transaction_time (when the system saw it). Append-only and mutable are the two poles of how a source represents the second dimension. An append-only source emits a new row for every change; the row itself is final, and transaction_time is the row's created_at. A mutable source overwrites the row; only the latest transaction_time is preserved.
A bitemporal source — rare but powerful — preserves both. Every row has (valid_from, valid_to, transaction_time), and corrections produce new rows with later transaction_time rather than overwrites. The cursor is straightforward (WHERE transaction_time > :hwm), the audit trail is native, and SCD Type 2 happens at the source rather than the warehouse. Iceberg and Delta tables can be configured to behave bitemporally; some Postgres extensions (PeriodSQL, temporal_tables) approximate it. The cost is double the storage and meaningfully complex queries; the win is that the warehouse-team-vs-source-team contract is unambiguous.
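A toy slice of the idea, tracking only transaction_time (the valid_from/valid_to pair is omitted for brevity): corrections are new rows, and "what did the system believe at time T" is a max over versions. All values are hypothetical:

```python
# Transaction-time versioning in miniature: corrections append, never overwrite.
rows = [
    {"id": 7, "status": "success",    "tx_time": "2026-01-01"},  # original classification
    {"id": 7, "status": "chargeback", "tx_time": "2026-03-15"},  # later correction, new row
]

def state_as_of(rows, row_id, tx_time):
    """The row as the system knew it at tx_time: latest version not after it."""
    versions = [r for r in rows if r["id"] == row_id and r["tx_time"] <= tx_time]
    return max(versions, key=lambda r: r["tx_time"]) if versions else None

print(state_as_of(rows, 7, "2026-02-01")["status"])  # success
print(state_as_of(rows, 7, "2026-04-01")["status"])  # chargeback
```

The cursor falls out for free: WHERE tx_time > :hwm sees the correction as a new row, which is exactly what the append-only load path wants.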
For most production pipelines in 2026, bitemporal sources remain the exception. The right baseline is to classify your sources, instrument the drift monitors, and pick the cursor pattern that matches the shape — and to revisit the classification at every major source change.
Where this leads next
- Schema drift across incremental loads — chapter 16, what happens when the source's columns change while the cursor still works
- Cursors, updated_at columns, and their lies — chapter 13, the four ways updated_at can betray you
- High-water marks: tracking what you've seen — chapter 12, the cursor primitive
- Late-arriving data and the backfill problem — chapter 14, the streaming generalisation of the same shape problem
- Hash-based deduplication on load — chapter 8, the dedup primitive that the mutable-source MERGE relies on
Build 11 will return to this conversation with CDC as the universal answer for mutable sources — Postgres logical decoding, MySQL binlog, MongoDB oplog, and Debezium as the integration layer. The current chapter sets up why CDC is sometimes worth the operational cost: when the source is genuinely mutable, polling either misses deletes or pays for full snapshots, and CDC is the structural answer. Build 5 will return with data contracts as the formal version of the cross-team agreement that this chapter has been describing informally — a YAML schema in a registry, validated in CI, that turns "the source is append-only" from a Slack message into an enforced invariant.
References
- Martin Kleppmann, DDIA, Chapter 11: Stream Processing — the canonical treatment of append-only event logs and the mutable-view-derived-from-immutable-events pattern.
- Pat Helland, Immutability Changes Everything — the essay that crystallised "immutable data + computed views" as a design principle.
- Ralph Kimball, The Data Warehouse Toolkit — the classic reference for SCD Types 1, 2, and 4.
- Debezium documentation: Change Data Capture for Postgres — the practitioner's reference for the CDC alternative to cursor-based loading.
- Razorpay engineering blog: payment_events at scale — the team's published account of an event-sourced payments table and the warehouse projections derived from it.
- Cursors, updated_at columns, and their lies — the previous chapter, which dissects the four failure modes of an updated_at cursor.
- Vitess documentation: schema management — context for the Flipkart migration story; Vitess's update-tracking model differs from vanilla MySQL.
- PostgreSQL: pg_stat_user_tables — the system catalog that the shape-detection script in §"A worked example" relies on.
The honest summary: the first question to ask of any source table is "are you append-only or mutable?" because the answer decides every downstream choice — the cursor, the dedup, the backfill, the CDC investment, the SCD strategy in the warehouse. The second question is "how do you know the answer is still correct?" because sources drift over their lifetime and a pipeline that does not detect the drift is a pipeline that produces silent wrong answers for years. The mature data team owns the contract on every source it reads from, instruments the drift monitors that fire when the contract slips, and treats the migration of any source as an event that re-asks the classification rather than assuming it carried over. Skip the classification and you'll discover all four cursor patterns the hard way — one production failure at a time.
A practical exercise: list every source table your team's pipeline reads from. For each, write down the declared shape (append-only / mutable-with-updated-at / mutable-without / CDC). Run the pg_stat_user_tables check on each. The first run almost always finds at least one source whose declared shape does not match its actual behaviour — and that source is your highest-priority cleanup item this quarter.
The follow-up exercise, once the classification is in version control: write a one-page contract per source table, co-signed by the data team and the source-owning team, that states the declared shape, the consequences of breaking it, and the review process for changing it. The contract sits alongside the API documentation that the source-owning team already maintains for their service consumers — because the data pipeline is a consumer of that service's data, on the same footing as any HTTP client. Teams that elevate "we read from this table" to "we have a contract with the team that writes to this table" stop discovering source-shape changes through finance reconciliations; they discover them in the PR review where the change is proposed.