Data contracts: the producer/consumer boundary
On a Friday afternoon at Razorpay, the payments team renames txn_amt_paise to amount_paise in the payments.transaction table — a clean refactor that passes their unit tests, ships through CI, and rolls to production. By 9 p.m. that evening, the analytics team's revenue dashboards are showing zeros across every merchant tier; by 11 p.m., the reconciliation pipeline against the GST filings has stalled because the column it joins on no longer exists; by midnight, the on-call data engineer is reading the payments team's PR description for the first time and discovering that "minor cleanup" rippled across three other teams. Nobody did anything wrong. The producer changed a column they owned. The consumer read a column they did not own. The break sat in the gap between them, where no contract said either party owed the other anything in particular. Multiply this gap across a 200-engineer org and you get a data platform that gets harder to change every week — until somebody decides that the producer's right to refactor and the consumer's right to a stable schema have to be written down, signed, and machine-enforced.
A data contract is a producer's promise about what their data will look like — fields, types, semantics, freshness, allowed evolution — written down in a versioned schema and checked in CI before the producer's change ships. The consumer reads against the contract, not against the producer's current implementation. The contract makes refactoring safe and breakage loud, instead of leaving the boundary to Slack archaeology after midnight.
Why the boundary needs a contract at all
The architectural fact that makes data engineering different from application engineering is that the producer of data and the consumer of data live in different teams, ship on different cadences, and are reviewed by different code-owners. The payments team owns payments.transaction because they write to it; the analytics team owns the revenue mart because they read from it; the ML team owns the fraud-features table because they materialise from both. Three teams, three on-calls, three product roadmaps — and one shared table whose schema is the thinnest piece of glass between them.
In an application-engineering setting, the equivalent boundary is the API contract — POST /payments/v1/charge returns a JSON body whose schema is documented, versioned, and tested in CI. Nobody at Flipkart would dare rename a JSON field on a public API without a deprecation window, a v2 endpoint, and a migration plan. The data layer used to skip all of this because the producer and consumer happened to share a database, and the schema was assumed to be a private implementation detail of the producer's service. Why this assumption broke: the moment a CDC connector or a warehouse loader starts reading a producer's table, the table's schema becomes a public interface — but unlike an HTTP API, nothing in the system communicates that fact back to the producer's CI pipeline. The producer keeps treating it as private; the consumer treats it as a stable contract; the gap between those two beliefs is where every "the dashboard is broken" Friday-night incident lives.
A data contract closes the gap by making the implicit interface explicit. The producer's CI now runs a contract check before any schema change ships: does the proposed change break the registered contract? If yes, the change is either rejected, downgraded to a backwards-compatible variant (add a column instead of rename, deprecate-then-remove instead of drop), or escalated through a contract-evolution review where the consumer signs off. The consumer's pipeline now reads against the contract — declares which fields it consumes, what types it expects, what it does if a field is null — and the platform refuses to deploy a consumer pipeline that references fields the contract does not promise. Both sides become more disciplined; both sides become safer; the boundary stops being a guess.
The reason this didn't exist as a category until 2022 is that the warehouse-vs-application split was newer than people realised. Through the 2010s, most "analytics" was downstream of nightly batch dumps, and the latency was so high that schema changes propagated through Slack faster than through pipelines. By 2022, with CDC stitching OLTP changes to streaming pipelines in seconds, the latency collapsed — and so did the time available for human coordination. Chad Sanderson's "Data Contracts 101" essays in 2022, alongside the Convoy and PayPal teams' production deployments, reframed this as the missing primitive of modern data platforms — and by 2026, most Indian fintechs at Razorpay-scale had at least a partial contract layer in production.
Anatomy of a contract
A contract is not just a schema. A schema says what the bytes look like; a contract says what the bytes mean, who promises them, what guarantees come with them, and how they can change. The minimum viable contract has six sections, all version-controlled together as one artifact.
```yaml
# contracts/payments/transaction.v3.yaml
# A data contract for the payments.transaction table at Razorpay.
# Owned by the payments service team; consumed by analytics, ML, and finance.
contract:
  name: payments.transaction
  version: 3.2
  owner_team: payments-platform
  on_call_slack: "#payments-oncall"
  ownership_email: payments-platform@razorpay.com
  created: 2024-08-12
  updated: 2026-04-15

  schema:
    fields:
      - name: id
        type: UUID
        nullable: false
        description: "Globally unique transaction identifier; idempotent across retries."
      - name: merchant_id
        type: UUID
        nullable: false
        foreign_key: merchants.merchant.id
        description: "The merchant receiving this payment."
      - name: amount_paise
        type: INT64
        nullable: false
        description: "Amount in paise. INR currency assumed unless 'currency' is set."
        constraints: ["amount_paise > 0", "amount_paise < 10000000000"]
      - name: currency
        type: ENUM
        values: [INR, USD, EUR, GBP, AED, SGD]
        nullable: false
        default: INR
      - name: status
        type: ENUM
        values: [INITIATED, AUTHORIZED, CAPTURED, REFUNDED, FAILED]
        nullable: false
      - name: created_at
        type: TIMESTAMP
        nullable: false
        description: "UTC. The timestamp at which the merchant called /charge."

  guarantees:
    freshness:
      p50: 30s
      p95: 5min
      p99: 15min
    completeness:
      p99_per_hour: 99.95
    uniqueness: id
    ordering: not_guaranteed  # CDC reorders within a partition

  semantics:
    amount_paise: |
      The pre-tax, pre-fee amount the merchant requested.
      Refunds emit a separate row with status=REFUNDED, NOT an in-place update.
      For currency != INR, amount_paise is the smallest unit of that currency.
    status: |
      Terminal statuses are CAPTURED, REFUNDED, FAILED.
      A row may transition INITIATED -> AUTHORIZED -> CAPTURED (or FAILED) over
      minutes; consumers should join on (id) and pick the latest by created_at.

  evolution:
    policy: backwards_compatible_only
    allowed:
      - add_optional_field
      - widen_enum_values
      - widen_numeric_type
    disallowed:
      - remove_field
      - rename_field
      - narrow_enum_values
      - change_field_type
      - tighten_nullability
    deprecation_window_days: 90

  governance:
    pii_class: B
    retention_days: 2555
    dpdp_purpose: payment_processing
    rbac:
      read: ["analytics", "ml-fraud", "finance-eng"]
      write: ["payments-platform"]
```
The six sections do orthogonal jobs. schema lists the fields and their types — what most teams already had. guarantees is the new piece that turns a static type list into an operational promise: freshness percentiles, completeness floor, uniqueness key, ordering semantics. Why guarantees belong in the contract and not somewhere else: the consumer needs to know whether id is actually unique before they write a JOIN that assumes it is. If uniqueness lives in a separate freshness-monitoring tool that the consumer doesn't read, they'll write the JOIN, hit duplicates in production, and discover the truth at 2 a.m. The contract is the one place where every fact a consumer needs to write correct downstream code lives together. semantics captures the "what does this mean?" answers that schema cannot encode — what is the difference between INITIATED and AUTHORIZED, when does a row appear, what does refund look like. evolution explicitly enumerates which changes are allowed without a major version bump and which require coordination. governance carries the PII class, retention, and access-control facts that downstream pipelines must respect.
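The section layout above can be checked mechanically before a contract is registered. Below is a minimal, hypothetical sketch of such a pre-registration validator, assuming the YAML has already been parsed into a dict (e.g. via yaml.safe_load); the function and its rules are illustrative, not a real tool:

```python
# Hypothetical pre-registration check: a contract missing any of its
# sections, or a field missing an explicit type/nullability, is rejected
# before it reaches the registry. Names follow the example contract above.
REQUIRED_HEADER = {"name", "version", "owner_team"}
REQUIRED_SECTIONS = {"schema", "guarantees", "semantics", "evolution", "governance"}

def validate_contract(contract: dict) -> list:
    """Return a list of problems; an empty list means registrable."""
    problems = []
    for key in sorted(REQUIRED_HEADER - contract.keys()):
        problems.append(f"missing header field: {key}")
    for section in sorted(REQUIRED_SECTIONS - contract.keys()):
        problems.append(f"missing section: {section}")
    # A field without an explicit type and nullability is a field no
    # consumer can safely depend on.
    for f in contract.get("schema", {}).get("fields", []):
        if "type" not in f or "nullable" not in f:
            problems.append(f"underspecified field: {f.get('name', '?')}")
    return problems
```

The check is deliberately shallow: it gates on completeness of the promise, not on the promise's content, which is what the evolution gate later in this chapter handles.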
The contract is a single document that the producer, the consumer, and the platform all read. It is checked into git alongside the producer's code, version-tagged on every change, and rendered into the catalog (chapter 30) so that consumers can discover it. When the producer wants to ship a change, the contract diff is part of the PR — and if the diff violates the evolution policy, CI rejects the PR before the schema change reaches the database.
Enforcing a contract in CI
The contract on its own is just a YAML file. What makes it operationally real is the CI gate: every PR that touches the producer's schema runs a contract-check that compares the proposed change against the registered contract and rejects violations.
```python
# scripts/check_contract.py
# Run in CI on every PR that modifies a database migration or contract YAML.
# Exits 0 if the proposed schema is compatible with the registered contract.
import sys, yaml, subprocess, json
from pathlib import Path

ALLOWED_BACKWARDS = {"add_optional_field", "widen_enum_values", "widen_numeric_type"}

def load_contract(path):
    return yaml.safe_load(Path(path).read_text())["contract"]

def proposed_schema_from_migration(migration_sql):
    """Extract the fields the migration would produce."""
    # In production: parse with sqlglot or a pg_parse wrapper.
    # Here we shell out to a helper that returns a normalised JSON schema.
    out = subprocess.check_output(["./bin/dump_schema_after", migration_sql])
    return json.loads(out)

def diff_schemas(old, new):
    old_fields = {f["name"]: f for f in old["fields"]}
    new_fields = {f["name"]: f for f in new["fields"]}
    diffs = []
    for name in old_fields:
        if name not in new_fields:
            diffs.append(("remove_field", name))
        elif old_fields[name]["type"] != new_fields[name]["type"]:
            diffs.append(("change_field_type", name))
        elif old_fields[name]["nullable"] and not new_fields[name]["nullable"]:
            diffs.append(("tighten_nullability", name))
    for name in new_fields:
        if name not in old_fields:
            kind = "add_optional_field" if new_fields[name]["nullable"] else "add_required_field"
            diffs.append((kind, name))
    return diffs

def check(contract_path, migration_sql):
    contract = load_contract(contract_path)
    new_schema = proposed_schema_from_migration(migration_sql)
    diffs = diff_schemas({"fields": contract["schema"]["fields"]}, new_schema)
    policy = contract.get("evolution", {}).get("policy", "backwards_compatible_only")
    failures = []
    for kind, name in diffs:
        if policy == "backwards_compatible_only" and kind not in ALLOWED_BACKWARDS:
            failures.append(f"  - {kind} on '{name}' violates {policy}")
    if failures:
        print(f"CONTRACT VIOLATION on {contract['name']}:")
        for f in failures:
            print(f)
        print("\nIf this is intentional, bump the major version and open a")
        print("contract-evolution PR with the consuming-team approvers tagged.")
        sys.exit(1)
    print(f"OK: {len(diffs)} compatible change(s) on {contract['name']}")

if __name__ == "__main__":
    check(sys.argv[1], sys.argv[2])
```
A sample run on a PR that renames txn_amt_paise to amount_paise:

```shell
$ python scripts/check_contract.py contracts/payments/transaction.v3.yaml \
    db/migrations/20260415_rename_amt.sql
CONTRACT VIOLATION on payments.transaction:
  - remove_field on 'txn_amt_paise' violates backwards_compatible_only
  - add_required_field on 'amount_paise' violates backwards_compatible_only

If this is intentional, bump the major version and open a
contract-evolution PR with the consuming-team approvers tagged.
```
The walkthrough hits four mechanisms. load_contract() reads the YAML and gives the registered, currently-promised shape — the source of truth for what consumers depend on right now. proposed_schema_from_migration() runs the migration against an ephemeral schema (Postgres in a Docker container, or a pg_dump --schema-only after applying the migration to a clone) and returns the would-be schema as JSON. Why this matters: simulating the migration in CI catches violations before the migration touches a real database, so the producer team learns about the contract break in their PR review instead of in production at 9 p.m. The cost is a few seconds of CI time; the value is an entire class of incident that never happens.
diff_schemas() classifies every change as one of a small enumerated set: remove_field, change_field_type, tighten_nullability, add_optional_field, add_required_field. The classification is what the policy engine reasons about — it doesn't try to be clever about column-name similarity (rename detection is a separate problem for a separate tool); it just enumerates the diff in primitive terms.
The policy gate maps the enumerated diffs against evolution.policy. The default policy backwards_compatible_only permits only the safe set. Producers who want to ship a breaking change explicitly bump the major version, which switches the contract to a new file with a new version number, which requires consumer-team approvers in the PR — turning a Friday-afternoon refactor into a coordinated migration. The escape hatch exists; it just isn't accidental.
The CI gate is what gives the contract teeth. A contract that lives in the catalog but doesn't gate merges is a wishlist; a contract that fails the producer's PR is enforcement. The Razorpay payments team adopted contracts in 2024 and reported that the gate caught 47 silent-breaker PRs across 12 producer teams in the first quarter. Violation rates dropped 80% in quarter two — not because producers stopped trying, but because they internalised the rules and started shipping backwards-compatible changes by default. The gate trains the team while enforcing.
Where contracts touch the runtime
A contract is defined statically and gated in CI, but its operational value lands in the runtime — at the moment a consumer reads or a producer writes. There are three surfaces where the runtime intersects the contract.
At write time on the producer side, schema-aware producers (Kafka with a Schema Registry, Iceberg writers with their schema property, Postgres check constraints) reject writes that don't conform. The cost is the schema-validation tax — about 30µs per event for Avro on Kafka. On a UPI-scale firehose at PhonePe (~3,000 tx/sec) it's invisible; on a high-fanout clickstream at Flipkart (>500k events/sec), the team caches schemas locally and validates only on version change, dropping the tax to under 1µs per event.
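The caching optimisation described above fits in a few lines. This is a hypothetical illustration (real producers do this inside the serializer, e.g. a Schema Registry client cache): full structural validation runs only the first time a schema version is seen, so repeat events on the hot path cost one set lookup.

```python
# Hypothetical cached write-time validator. validate_fn is the expensive
# structural check; events on an already-validated schema version skip it.
class CachedSchemaValidator:
    def __init__(self, validate_fn):
        self.validate_fn = validate_fn
        self.validated_versions = set()

    def check(self, event: dict) -> bool:
        version = event["schema_version"]
        if version in self.validated_versions:
            return True                    # hot path: ~set-lookup cost
        ok = self.validate_fn(event)       # cold path: full validation
        if ok:
            self.validated_versions.add(version)
        return ok
```

On a steady stream where the schema version changes rarely, almost every event takes the hot path, which is the mechanism behind the sub-microsecond figure cited above.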
At read time on the consumer side, schema-aware consumers compare the incoming payload's schema against the version they were compiled against. If the producer is now on v3.5 and the consumer expects v3.2, it either (a) auto-upgrades because v3.5 is backwards-compatible, (b) fails fast on incompatible skew, or (c) opts into permissive mode where unknown fields are ignored. A critical reconciliation pipeline picks fail-fast; a logging pipeline picks permissive.
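The skew decision reduces to comparing major/minor versions under a backwards-compatible evolution policy. A hedged sketch follows; the version format and the mode names are assumptions for illustration, not any particular registry's API:

```python
# Illustrative consumer-side skew check. Same major + newer producer
# minor means backwards-compatible evolution (only optional fields were
# added), so the consumer can auto-upgrade; anything else is a break
# unless the consumer has opted into permissive mode.
def check_version_skew(producer_version: str, consumer_version: str,
                       mode: str = "fail_fast") -> str:
    p_major, p_minor = (int(x) for x in producer_version.split("."))
    c_major, c_minor = (int(x) for x in consumer_version.split("."))
    if (p_major, p_minor) == (c_major, c_minor):
        return "ok"
    if p_major == c_major and p_minor > c_minor:
        return "auto_upgrade"          # unknown fields are optional by policy
    if mode == "permissive":
        return "ignore_unknown_fields" # logging-pipeline behaviour
    raise RuntimeError(
        f"incompatible skew: producer {producer_version}, "
        f"consumer {consumer_version}")
```

For example, check_version_skew("3.5", "3.2") returns "auto_upgrade", while a producer on 4.0 raises for a fail-fast consumer and degrades gracefully for a permissive one.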
At runtime in the pipeline orchestrator, contract checks become part of the DAG itself. A dbt model declares its dependency on contract version 3.x; when dbt parses the model, it errors if the producer is now on 4.x and the model hasn't been updated. dbt's meta: block in schema.yml is where this declaration lives in 2026, and dbt-cloud's Explorer pulls the contract metadata into its lineage view so a producer can see which downstream models would break under a proposed change.
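What the dbt-side declaration might look like, as a sketch only: dbt's meta: block carries arbitrary key/value metadata, so the contract_dependency shape below is a team convention rather than a dbt built-in, enforced by a parse-time hook or macro the team writes themselves.

```yaml
# models/marts/revenue/schema.yml (illustrative convention, not dbt syntax)
models:
  - name: merchant_revenue_daily
    meta:
      contract_dependency:
        contract_name: payments.transaction
        contract_version: "3.x"   # parse-time check errors if producer is on 4.x
        consumed_fields: [id, merchant_id, amount_paise, status, created_at]
```

Declaring consumed_fields explicitly is what lets the platform refuse to deploy a model that references fields the contract does not promise.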
Drift detection is the fourth surface. The contract says currency is one of [INR, USD, EUR, GBP, AED, SGD]. On April 14 2026, the producer started emitting JPY because a new merchant onboarded — the CI gate didn't catch it (no schema change), but a sampling job that checks recent rows against the contract's enum constraints flagged it within minutes, opened a Jira against the payments team, and quarantined JPY rows until the contract is updated.
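The detection logic behind that JPY catch can be sketched as a set check over a recent row sample. Fetching the sample from the warehouse and quarantining offending rows are elided here, and the function names are illustrative:

```python
# Illustrative drift sampler: compare recent rows against the contract's
# enum constraints and report any values outside the promised set.
def find_enum_drift(rows, contract_fields):
    """Return {field_name: set of values outside the contract's enum}."""
    enums = {f["name"]: set(f["values"])
             for f in contract_fields if f.get("type") == "ENUM"}
    drift = {}
    for row in rows:
        for field, allowed in enums.items():
            value = row.get(field)
            if value is not None and value not in allowed:
                drift.setdefault(field, set()).add(value)
    return drift

fields = [{"name": "currency", "type": "ENUM",
           "values": ["INR", "USD", "EUR", "GBP", "AED", "SGD"]}]
sample = [{"currency": "INR"}, {"currency": "JPY"}, {"currency": "USD"}]
print(find_enum_drift(sample, fields))  # {'currency': {'JPY'}}
```

A non-empty result is what opens the ticket and triggers quarantine; an empty dict means the sampled window conforms to the contract.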
The four layers compose. The CI gate prevents most breakage at PR time. Write-time validation is the safety net for "the producer's app code drifted from the schema". Read-time validation catches "the consumer was deployed before the producer evolved". Drift detection is the asynchronous sampler that finds the bugs the other three missed. Why all four are needed: each catches a different class of failure with different latency. CI catches deliberate schema changes immediately; write-time catches accidental ones at the next event; read-time catches version mismatches at consumer-deploy time; drift catches semantic changes that look syntactically valid. Skipping any one of these leaves a category of incident open — and the categories are independent, so layering them is what makes the contract actually load-bearing.
How contracts get adopted (and where they fail)
A contract layer is an organisational change as much as a technical one. The technical pieces — YAML schema, CI gate, runtime validators, drift detector — are shippable in a quarter. The organisational pieces — getting producer teams to write contracts, getting consumer teams to declare contract dependencies, getting the platform team to operate the registry — take a year.
The adoption pattern that works, observed across Razorpay, Cred, and Swiggy, is: pick a single high-value boundary first (usually the OLTP-to-warehouse seam where every analytics pipeline lands), write contracts for the 10–20 most-consumed tables, ship CI gates with a 90-day grace period during which violations warn but don't block, then flip the gate to blocking after the grace period. The grace period is critical — it gives producer teams time to fix existing latent violations without the gate becoming the team that broke production. After the OLTP boundary is contracted, expand to event streams (Kafka topics), then to ML feature stores, then to BI dataset definitions. Each expansion borrows the substrate of the previous one.
The first failure mode: the producer doesn't know they're a producer. A microservice writes to its own database; CDC is wired up by the data-platform team; the producing team learns about it when the contract gate fires on their PR. They push back hard, because the contract is imposed rather than authored. The fix is to make the contract a producer-side artifact owned by the producing team — the producer writes the contract, the platform team reviews it, the gate fires in the producer's CI. Contracts that live in the data-platform team's repo are never adopted; contracts in the producer's repo become part of how that team ships.
Second: contracts written but not consumed. A team writes a contract, ships the gate, assumes their work is done. Six months later, the contract drifts from the actual table because nobody updated it on minor changes. The discipline: make contract drift itself a violation — a daily job compares registered contract to actual schema and fails loudly if they diverge. If they don't match, one of them is wrong, and the gate forces the producer to choose which.
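That daily divergence job reduces to a set comparison between the registered field list and the live table's columns (which would come from information_schema in production). A minimal sketch with illustrative names:

```python
# Minimal sketch of the daily contract-vs-schema drift check.
def contract_schema_drift(contract_fields, live_columns):
    """Return (declared_but_missing, live_but_undeclared) name sets."""
    declared = {f["name"] for f in contract_fields}
    live = set(live_columns)
    return declared - live, live - declared

declared = [{"name": "id"}, {"name": "amount_paise"}, {"name": "status"}]
live = ["id", "amount_paise", "status", "settlement_batch_id"]
missing, undeclared = contract_schema_drift(declared, live)
# missing is empty; 'settlement_batch_id' is live but undeclared, so the
# job fails loudly and the producer must update one side or the other.
```

Either non-empty set is a violation: a declared-but-missing field means the contract over-promises; an undeclared live column means consumers are about to depend on something the contract never promised.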
Third: governance theatre. Teams adopt contracts because a regulator's checklist asked for them, populate minimally for the audit, never integrate at runtime. The next audit fails because the auditor asks "show me a contract violation caught in CI last quarter". The fix is identical to the catalog fix: integrate the contract into operational workflows so it cannot go stale without breaking real work.
Fourth: contracts that try to specify everything. A contract listing every business invariant becomes a 600-line YAML that nobody reads. The discipline that scales is to keep the contract focused on schema, types, semantics, and non-negotiable business rules — push richer rules into layer-specific tools (Soda/dbt-test for row quality, the metric layer for semantic invariants, Unity Catalog/Ranger for governance). The contract is the producer's promise about shape; it isn't every analytical assumption a consumer might make.
Common confusions
- "A schema is the same as a contract." A schema describes the bytes; a contract describes the bytes plus the freshness guarantee, the semantics, the evolution policy, the ownership, and the governance class. A schema lives in the database; a contract lives in version control alongside the producer's code, with a CI gate enforcing it.
- "A contract is the same as a data quality test." Soda or Great Expectations tests check whether a specific row passes a rule (amount > 0); a contract is the structural promise that the column amount exists, has type INT64, and won't be renamed without a major version bump. Tests catch row-level bad data; contracts catch interface-level breakage.
- "A contract is the same as the catalog entry." The catalog (chapter 30) is the searchable index; the contract is the specification. The catalog displays the contract — usually rendered from the same YAML — but it isn't the source of truth. The contract lives in git; the catalog ingests from it.
- "Contracts are just for streaming pipelines." Originally yes; the Confluent Schema Registry pattern from 2014 was Kafka-specific. By 2026, contracts cover OLTP→warehouse seams, batch dbt models, ML feature stores, and BI dataset definitions equally. Anywhere a producer/consumer boundary exists across teams, a contract is the reusable primitive.
- "A contract eliminates the need for testing." A contract enforces the structural and policy-level guarantees; row-level data quality (nulls, ranges, distributions) still needs runtime tests. Contracts and tests compose, they don't substitute. A team running contracts without dbt-test or Soda still ships bad data; a team running tests without contracts still ships breaking schema changes.
- "Once we adopt contracts, we don't need the catalog." The catalog is the discovery layer (find the dataset), the contract is the specification layer (what does the dataset promise), the lineage is the topology layer (where did it come from). Each answers a different question; production data platforms run all three together.
Going deeper
The Confluent Schema Registry and why streaming pipelines got contracts first
The Confluent Schema Registry, shipped in 2014, was the first widely-deployed contract system in data engineering — even though it wasn't called that at the time. Every Kafka producer registered its Avro/Protobuf schema; every consumer fetched it by ID embedded in the message; compatibility checks (BACKWARD, FORWARD, FULL, NONE) prevented producers from registering breaking schemas. The pattern worked because Kafka's binary serialisation forced schema-awareness at the byte level — you literally couldn't read an Avro message without the schema. The OLTP and warehouse worlds avoided contracts for another decade because SQL rows are self-describing, so the forcing function was absent. By 2024, the same mechanics ported to OLTP via Debezium's contract-aware mode: pick a compatibility mode at registration, gate on registration, evolve with version bumps, fail loudly on consumer-side skew. The streaming world learned early because Avro punished the alternative; the rest learned late because SQL forgave it.
How contracts compose with column-level lineage
Column-level lineage (chapter 29) and contracts solve adjacent problems. Lineage tells you which downstream columns depend on a producer's column; contracts tell you what the producer promises about that column. The two compose powerfully: when a producer proposes a contract change, the platform can compute the lineage closure to enumerate every consumer that would be affected, surface the contract diff to each of those consumers, and require sign-off from each consuming team's on-call before the change merges. This is the "blast radius gate" pattern — rejected by the contract gate alone if no consumer has signed off, accepted only with explicit downstream approval. Razorpay shipped this pattern in late 2025 and reported that the average breaking-change PR went from 3 days of Slack negotiation (in the pre-contract world) to a structured 2-day approval cycle with all the right approvers tagged automatically. Lineage finds the consumers; the contract gives them something concrete to approve or reject.
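The lineage-closure step of the blast-radius gate is a plain graph traversal. A sketch under stated assumptions: the edge-list shape and the column names are illustrative, and a real deployment would read both from the lineage store rather than a literal dict.

```python
# Hedged sketch of the blast-radius computation. Lineage is modelled as
# {downstream_column: [upstream_columns]} edges (chapter 29's graph);
# we invert it and walk downstream from the changed column.
from collections import deque

def blast_radius(changed_column, depends_on):
    """Return every downstream column transitively reading changed_column."""
    readers = {}
    for down, ups in depends_on.items():
        for up in ups:
            readers.setdefault(up, []).append(down)
    seen, queue = set(), deque([changed_column])
    while queue:
        col = queue.popleft()
        for down in readers.get(col, []):
            if down not in seen:
                seen.add(down)
                queue.append(down)
    return seen

lineage = {
    "revenue_mart.amount": ["payments.transaction.amount_paise"],
    "finance.gst_recon.amount": ["revenue_mart.amount"],
}
affected = blast_radius("payments.transaction.amount_paise", lineage)
# 'affected' holds both downstream columns; the gate then maps each to
# its owning team and requires one approval per team before merge.
```

The traversal is cheap even on large graphs, which is why the gate can run on every contract-changing PR rather than on demand.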
Contracts in the Iceberg / Delta lakehouse world
Iceberg and Delta both store schema-evolution metadata as first-class table metadata, separate from the data files. An Iceberg table's schema history is queryable as a system table (SELECT * FROM my_table.history), and the supported evolution operations (ADD COLUMN, RENAME COLUMN, WIDEN TYPE) are explicitly enumerated in the spec. This makes Iceberg/Delta tables a natural substrate for contracts — the table itself enforces a subset of the contract's evolution policy. The remaining work is to write a contract YAML that points at the table, gate the producer's PR against the contract's evolution policy plus the Iceberg-allowed operations, and surface the contract metadata in the catalog. PhonePe's 2026 lakehouse migration used this pattern: every Iceberg table had a contract, the contract was rendered into Unity Catalog's data-product view, and the producer's CI ran against both the YAML and the Iceberg evolution rules. The double-gate caught a few cases where the Iceberg spec permitted a change the team's contract policy didn't (widening a non-nullable column's type was Iceberg-legal but contract-forbidden in their stricter policy).
What the data-mesh discourse got right and wrong about contracts
Zhamak Dehghani's 2019–2020 data-mesh framing popularised "data products" — datasets owned by domain teams, with contracts and SLAs at the product boundary. The framing got the ownership model right: producers own their data, contracts are the product spec, consumers depend on the contract not on the implementation. What the discourse got wrong was implying this requires a federated org structure with full domain autonomy. The contract pattern is valuable in any organisation — federated, centralised, hybrid — because the producer/consumer gap exists wherever teams ship at different cadences. The mesh contribution was naming the problem; the contract pattern is the implementation, and it works in any org shape.
Contracts as the foundation for DPDP compliance
By 2026, DPDP-2023 enforcement has reached the stage where regulators ask "show me how a deletion request propagated to every consumer of that data". Contracts answer this: every PII-tagged column has a known set of declared consumers, and a deletion request fans out as a graph traversal along the contract dependency edges. The pattern emerging at Indian fintechs is to make DPDP's purpose-binding requirement a first-class contract field — dpdp_purpose: payment_processing — and have the platform refuse to let a consumer query the data for any other purpose unless they declare and justify the new use. The contract becomes the substrate for purpose-aware governance.
Where this leads next
The next chapter (32) covers the data-quality test layer that contracts compose with — Great Expectations, Soda, dbt-test — for the row-level checks contracts don't replace. Chapter 33 covers freshness SLAs, the operational guarantee that lives inside the guarantees block of every contract and is monitored continuously. Chapter 34 covers the alerting layer that surfaces contract violations and SLA misses to the right on-call.
- Data catalogs and the "what does this column mean" problem — chapter 30, the discovery layer that renders contracts to consumers.
- Column-level lineage: why it's hard and why it matters — chapter 29, the dependency graph that contracts compose with for blast-radius gates.
- Schema drift: when the source changes under you — chapter 17, the failure mode contracts directly prevent.
The framing senior platform engineers at Razorpay use for contracts: they are not bureaucracy; they are the cheapest possible coordination cost. Without a contract, every schema change is a Slack thread, a meeting, and three weeks of "did we tell everyone?". With a contract, the same change is editing a YAML and getting a one-line approval from each downstream team — a process that scales to 50 producers and 200 consumers without becoming anyone's full-time job. The contract is the difference between a platform that gets harder to change every week and one that gets easier to change every week.
A practical bar: pick a random producer team and ask what happens if they propose a column rename today. If they describe a CI gate failing on their PR with a path to a contract-evolution PR, contracts are working. If they describe a Slack thread or "we'd just merge it and see what breaks" — contracts haven't crossed from infrastructure into culture.
References
- Chad Sanderson, "Data Contracts 101" (2022) — the essay series that named the pattern and reframed it as the missing primitive.
- Confluent Schema Registry documentation — the streaming-side ancestor of every contract system; the compatibility modes and gate mechanics translated to the rest of data engineering a decade later.
- Convoy data-contracts CLI — open-source CLI for schema-based contract enforcement in CI; one of the first reference implementations outside the streaming world.
- PayPal "Data Contracts at PayPal" (2023) — production deployment write-up that informed many subsequent enterprise rollouts.
- DPDP Act 2023, India — the regulatory driver for purpose-binding fields in contracts.
- Apache Iceberg schema evolution spec — the lakehouse-native evolution rules that contracts compose with.
- Zhamak Dehghani, "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh" (2019) — the data-product framing that popularised contract thinking.