Quality tests: row-level, table-level, referential
It is 2:14 a.m. at PhonePe's data platform office in Bengaluru. The freshness SLO is green. The DAGs are green. The dashboard is green. And yet the morning fraud-ops team will discover, eight hours from now, that 1.4 lakh transactions were silently scored as merchant_category = 'unknown' because a producer release the previous evening renamed mcc_code to merchant_mcc and the silver-layer transform now resolves the column to NULL on every row. The pipeline ran. The data landed. The freshness lag is 41 seconds. The data is fresh — and wrong. The on-call engineer that night had no quality tests on the silver table — only freshness — and the cost of that gap is a regulatory anomaly report filed to NPCI by 11 a.m. that morning.
Quality tests are runnable assertions on the data itself, organised into three families: row-level (does this row obey its column-shape rules?), table-level (does this whole table obey its aggregate rules?), and referential (do rows in this table line up with rows in another?). Each family catches a different bug, runs on a different cadence, and fails differently — and a pipeline without all three accepts garbage as long as the garbage is fresh and well-shaped.
The three families and what each one catches
Freshness tells you the data is recent. Quality tests tell you the data is right. Those are different problems with different mechanisms, and lumping them under "data quality" — a single phrase used to mean five things — is how teams ship a freshness monitor and call the work done. The three families below are not arbitrary categories; they correspond to the three places where bugs actually enter a warehouse.
Row-level tests check column-shape rules on each row independently. amount > 0, currency IN ('INR','USD','SGD'), email MATCHES /.+@.+/, event_time <= NOW(), mcc_code NOT NULL. They run during ingest, before the row is committed to the silver table — or immediately after, in the post-load test pass. They catch the class of bug where a single row arrives malformed: a Razorpay producer ships a release that puts the rupee amount in paise but the contract says rupees, and 100x-inflated amounts start arriving. Every single row is wrong by the same factor. Why row-level checks must run on every load: the cost of letting a row through is unbounded — that row joins to a fact table, gets aggregated into a daily total, and propagates through the warehouse. The cost of rejecting a row to a quarantine table is a few KB and one alert. The asymmetry says check on entry, not at the end.
Table-level tests check aggregate properties of the whole table — COUNT(*), COUNT(DISTINCT txn_id) = COUNT(*), SUM(amount), null_rate(merchant_id) < 0.001. They run per partition, after a partition is closed. They catch a fundamentally different bug class: every individual row passed its row-level checks, but the table as a whole is wrong. The classic example is the 4 a.m. Zerodha trade-tick load that drops half the partitions because the source S3 prefix changed and the loader is reading from the old prefix. Every individual row is well-formed; the table just has 50% of its rows missing. Row-level tests cannot detect this because they only see one row at a time.
Referential tests check relationships across tables — every txn.merchant_id exists in dim_merchant.merchant_id, no refund references a non-existent payment, the snapshot of dim_user at time T joins cleanly to fact_orders at time T. They run after a merge or transformation that brings two tables into contact. They catch the bug class where each individual table is internally fine but they have drifted with respect to each other — the dimension table was loaded yesterday, the fact table was loaded today, a new merchant was added in between, and now 3% of today's transactions reference a merchant the dimension table has never heard of. Row-level and table-level tests on either table individually pass; the joined view is broken.
The categorisation matters because the three families have different failure modes and different blast radii. A row-level failure is a row-shaped problem — quarantine the row, alert if the rate exceeds a threshold, keep loading. A table-level failure is a partition-shaped problem — the whole partition is suspect, halt downstream consumption, page someone. A referential failure is a join-shaped problem — the data exists, but the lookup or aggregation that uses both tables will produce nonsense, so block any consumer query that joins them until the dimension catches up. Treating all three as "the data quality alert" and routing them to the same Slack channel is how teams either page on noise or under-respond to a regulatory-grade incident.
Implementing the three families with a runnable test runner
The shape that has stabilised across teams — dbt tests, Great Expectations, Soda — is a declarative spec (YAML or JSON) plus a runner that translates each test into a SQL query, executes it, and records pass/fail. The minimal version below fits in 80 lines and is enough to ship to a staging warehouse.
# quality_runner.py — read a test spec, run each test as SQL, record pass/fail.
# Usage: python quality_runner.py tests/payments_settled.yaml
import sys, yaml, psycopg2, json, datetime, os

PG_DSN = os.environ["WAREHOUSE_DSN"]

def row_level_sql(table, test):
    """Each row-level test compiles to: SELECT COUNT(*) of failing rows."""
    col, kind = test["column"], test["kind"]
    if kind == "not_null":
        return f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
    if kind == "positive":
        return f"SELECT COUNT(*) FROM {table} WHERE {col} <= 0"
    if kind == "in_set":
        vals = ",".join(f"'{v}'" for v in test["values"])
        return f"SELECT COUNT(*) FROM {table} WHERE {col} NOT IN ({vals})"
    if kind == "regex":
        return f"SELECT COUNT(*) FROM {table} WHERE {col} !~ '{test['pattern']}'"
    if kind == "in_past":
        return f"SELECT COUNT(*) FROM {table} WHERE {col} > NOW()"
    raise ValueError(f"unknown row-level test: {kind}")

def table_level_sql(table, test):
    """Table-level tests compile to a single scalar plus a pass predicate."""
    kind = test["kind"]
    if kind == "row_count_between":
        return (f"SELECT COUNT(*) FROM {table}",
                lambda c: test["min"] <= c <= test["max"])
    if kind == "unique":
        col = test["column"]
        return (f"SELECT COUNT(*) - COUNT(DISTINCT {col}) FROM {table}",
                lambda c: c == 0)
    if kind == "null_rate_below":
        col, thr = test["column"], test["threshold"]
        return (f"SELECT 1.0 * SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END)"
                f" / COUNT(*) FROM {table}",
                lambda r: r < thr)
    raise ValueError(f"unknown table-level test: {kind}")

def referential_sql(parent, child, fk_child, pk_parent):
    """Referential test: every child.fk has a matching parent.pk."""
    return (f"SELECT COUNT(*) FROM {child} c "
            f"LEFT JOIN {parent} p ON c.{fk_child} = p.{pk_parent} "
            f"WHERE p.{pk_parent} IS NULL")

def run_tests(spec_path):
    with open(spec_path) as f:
        spec = yaml.safe_load(f)
    table = spec["table"]
    conn = psycopg2.connect(PG_DSN)
    cur = conn.cursor()
    results = []
    for t in spec.get("row_level", []):
        cur.execute(row_level_sql(table, t))
        failing = cur.fetchone()[0]
        results.append({"family": "row", "name": t["name"], "failing_rows": failing,
                        "passed": failing == 0, "severity": t.get("severity", "error")})
    for t in spec.get("table_level", []):
        sql, predicate = table_level_sql(table, t)
        cur.execute(sql)
        val = cur.fetchone()[0]
        results.append({"family": "table", "name": t["name"], "value": float(val),
                        "passed": predicate(val), "severity": t.get("severity", "error")})
    for t in spec.get("referential", []):
        cur.execute(referential_sql(t["parent"], table, t["fk"], t["pk"]))
        orphans = cur.fetchone()[0]
        results.append({"family": "ref", "name": t["name"], "orphans": orphans,
                        "passed": orphans == 0, "severity": t.get("severity", "error")})
    conn.close()
    return results

if __name__ == "__main__":
    out = run_tests(sys.argv[1])
    failed = [r for r in out if not r["passed"]]
    print(json.dumps({"timestamp": datetime.datetime.utcnow().isoformat() + "Z",
                      "total": len(out), "failed": len(failed), "results": out}, indent=2))
    sys.exit(1 if any(r["severity"] == "error" and not r["passed"] for r in out) else 0)
The companion spec for payments_settled is a tight YAML file:
# tests/payments_settled.yaml
table: warehouse.fact.payments_settled
row_level:
- {name: amount_not_null, column: amount, kind: not_null, severity: error}
- {name: amount_positive, column: amount, kind: positive, severity: error}
- {name: currency_in_set, column: currency, kind: in_set, values: [INR, USD, SGD], severity: error}
- {name: phone_well_formed, column: payer_phone, kind: regex, pattern: '^\+91[0-9]{10}$', severity: warn}
- {name: event_time_in_past, column: event_time, kind: in_past, severity: error}
table_level:
- {name: hourly_volume_in_band, kind: row_count_between, min: 80000, max: 600000, severity: error}
- {name: txn_id_unique, column: txn_id, kind: unique, severity: error}
- {name: mcc_null_rate, column: mcc_code, kind: null_rate_below, threshold: 0.001, severity: error}
referential:
- {name: merchant_id_exists, parent: warehouse.dim.merchants, fk: merchant_id, pk: merchant_id, severity: error}
- {name: payment_method_exists, parent: warehouse.dim.payment_methods, fk: method_id, pk: method_id, severity: warn}
# Sample run — one referential failure on a fresh load:
$ python quality_runner.py tests/payments_settled.yaml
{
"timestamp": "2026-04-25T08:14:22Z",
"total": 11,
"failed": 1,
"results": [
{"family": "row", "name": "amount_not_null", "failing_rows": 0, "passed": true, "severity": "error"},
{"family": "row", "name": "amount_positive", "failing_rows": 0, "passed": true, "severity": "error"},
{"family": "row", "name": "currency_in_set", "failing_rows": 0, "passed": true, "severity": "error"},
{"family": "row", "name": "phone_well_formed", "failing_rows": 38, "passed": false, "severity": "warn"},
{"family": "row", "name": "event_time_in_past", "failing_rows": 0, "passed": true, "severity": "error"},
{"family": "table", "name": "hourly_volume_in_band", "value": 142884.0, "passed": true, "severity": "error"},
{"family": "table", "name": "txn_id_unique", "value": 0.0, "passed": true, "severity": "error"},
{"family": "table", "name": "mcc_null_rate", "value": 0.0004, "passed": true, "severity": "error"},
{"family": "ref", "name": "merchant_id_exists", "orphans": 217, "passed": false, "severity": "error"},
{"family": "ref", "name": "payment_method_exists", "orphans": 0, "passed": true, "severity": "warn"}
]
}
$ echo $?
1
The piece worth dwelling on is the single-row failing-count pattern at the bottom of row_level_sql. Each row-level test compiles to "count rows that fail the predicate" and the test passes iff the count is 0. This pattern is what dbt and Great Expectations both adopt because it has two production-friendly properties: it produces a number (so the alert can say "217 rows failed" rather than just "test failed"), and it can be inverted into a quarantine query (replace SELECT COUNT(*) with SELECT *) to actually inspect the offending rows. Why "count failing rows" beats "all rows pass": when 217 of 142,884 rows fail, the right response is rarely "halt the pipeline". The right response is "quarantine 0.15% of rows, alert if the rate climbs above 0.5%, keep loading". A boolean pass/fail throws away the rate, and the rate is what tells you whether to page or to ticket.
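The inversion is mechanical. A minimal sketch, assuming the runner's `SELECT COUNT(*) FROM {table} WHERE {predicate}` shape (function names here are illustrative, not from the runner above):

```python
def quarantine_sql(count_sql: str, quarantine_table: str) -> str:
    """Rewrite a failing-count query into an INSERT that copies the
    failing rows to a side table for inspection."""
    body = count_sql.replace("SELECT COUNT(*)", "SELECT *", 1)
    return f"INSERT INTO {quarantine_table} {body}"

def failure_rate(failing: int, total: int) -> float:
    """The rate, not the boolean, decides page vs ticket."""
    return failing / total if total else 0.0
```

The same predicate thus serves three purposes — count for the test, copy for the quarantine, rate for the alerting policy — which is the practical payoff of compiling every row-level test down to one WHERE clause.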
run_tests is the orchestrator. It opens one connection (one query per test, all serialised) and produces a structured result object that downstream tooling — alerting, dashboards, audit logs — consumes. Production setups parallelise the queries (warehouses are happy with concurrent reads on different tables) and cache results so re-running the suite after a fix doesn't re-execute the passing tests. The severity field lets the runner exit non-zero only on error-severity failures; warn failures show up in the report but don't block the pipeline. The phone-regex failure above is warn because Razorpay payments come in from international payers occasionally and the regex is intentionally India-strict — a warning rate of 0.03% is normal, a rate of 5% would mean the upstream stopped formatting Indian numbers correctly.
The exit code semantics are the integration point with the orchestrator. Airflow / Dagster / Prefect run quality_runner.py as a task downstream of the load task. Exit 0 = continue. Exit 1 = the load is suspect, halt the downstream transforms, alert. The exit code is what makes the runner a circuit breaker: a quality failure stops the bad data from propagating to the silver layer, the marts, and the dashboards. Why the circuit-breaker pattern matters here specifically: in a warehouse, propagation is one-way and irreversible — once the silver table has the bad rows, the marts have copies, the dashboards have cached the marts, and rolling back means the painful UNLOAD-and-reload sequence. Halting at the bronze/silver boundary with a quality-test failure costs one delayed dashboard refresh; letting the bad data through costs a multi-hour incident.
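The circuit-breaker wiring can be sketched orchestrator-agnostically — the command lists below are illustrative placeholders, not real project commands; any of Airflow, Dagster, or Prefect implements the same pattern with task dependencies instead of a function:

```python
import subprocess, sys

def run_gated(load_cmd, test_cmd, transform_cmd):
    """Load, then run the quality suite; a non-zero exit halts the transform."""
    subprocess.run(load_cmd, check=True)       # bronze load
    suite = subprocess.run(test_cmd)           # e.g. quality_runner.py <spec>
    if suite.returncode != 0:
        print("quality suite failed; halting downstream transforms", file=sys.stderr)
        return False                           # orchestrator marks the run failed
    subprocess.run(transform_cmd, check=True)  # bronze -> silver transform
    return True
```

The key property is that the transform command is simply never invoked on a bad load — the halt is structural, not advisory.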
Where the three families fail in production
The clean three-family model collides with real warehouses in five recurring ways. The first is row-level test cardinality blowup. A check like "amount > 0" runs in 200 ms on a 1 lakh row table; on a 10 crore row table after six months of unsampled history, the same check runs in 90 seconds, and the suite of 30 row-level tests now takes 45 minutes to run. The fix is to scope row-level tests to the partition that just landed (WHERE load_date = CURRENT_DATE) rather than the whole table. The cost is that you can't catch historical drift with the same suite — but historical drift is what reconciliation jobs catch, on a different cadence.
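The partition-scoping fix is a one-line rewrite of the generated SQL. A sketch, assuming the runner's query shapes and a `load_date` partition column (the column name is an assumption about the table layout):

```python
def scope_to_partition(sql: str, partition_col: str, partition_value: str) -> str:
    """Restrict a generated test query to the partition that just landed."""
    clause = f"{partition_col} = '{partition_value}'"
    # Row-level queries already carry a WHERE clause, so AND the predicate on;
    # bare aggregate queries get a fresh WHERE.
    return f"{sql} AND {clause}" if " WHERE " in sql else f"{sql} WHERE {clause}"
```

Applied uniformly at the runner level, this turns a full-table scan suite into a per-partition suite without touching any individual test definition.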
The second mess is the table-level threshold trap. "Row count between 80,000 and 600,000 per hour" sounds reasonable until Diwali, when payment volume goes 4× and the upper bound trips for six hours straight. The team raises the upper bound to 2,000,000 and now the threshold is so loose that a missing partition with 50% of normal volume slides through unnoticed. The real fix is to compute the band dynamically — between (avg_last_28d * 0.5) and (avg_last_28d * 2.5) — so the threshold tracks the seasonal pattern. The static-threshold version of this test is structurally unable to catch volume drops on high-traffic days, and most teams discover this the morning after their first festival outage.
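The dynamic band reduces to a few lines once the trailing counts are available; the 0.5×/2.5× multipliers below are the ones from the text and should be tuned per table:

```python
def dynamic_band(trailing_counts, lower_mult=0.5, upper_mult=2.5):
    """Band derived from the trailing average, so it tracks seasonality."""
    avg = sum(trailing_counts) / len(trailing_counts)
    return (avg * lower_mult, avg * upper_mult)

def volume_in_band(todays_count, trailing_counts):
    lo, hi = dynamic_band(trailing_counts)
    return lo <= todays_count <= hi
```

A Diwali 4× peak raises the trailing average within days, so the band follows; a 50%-missing partition sits below 0.5× of any recent average and trips the lower bound regardless of season.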
The third mess is referential tests across SCD-2 tables. A slowly-changing-dimension table for merchants stores multiple historical rows per merchant_id, each valid for a window. The naive test "every fact.merchant_id exists in dim_merchant.merchant_id" passes on cardinality alone, but the right test is "every fact row joins to the merchant snapshot valid at the fact's event_time" — a temporal-join test that's harder to write and slower to run. Skipping the temporal version produces silent SCD drift: a merchant's mcc_code changed in February, the fact table from January joins to the new SCD row, and the analytics on January data are wrong by 4%. The fix is to run referential tests as LEFT JOIN dim_merchant d ON f.merchant_id = d.merchant_id AND f.event_time BETWEEN d.valid_from AND d.valid_to, which is the single most subtle and most-skipped test in the data warehouse.
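Compiled into the runner's style, the temporal version looks like this — the `valid_from`/`valid_to` column names are assumptions about the SCD-2 layout:

```python
def scd2_referential_sql(fact, dim, fk, pk, event_col="event_time",
                         valid_from="valid_from", valid_to="valid_to"):
    """Count fact rows with no dimension version valid at their event_time."""
    return (
        f"SELECT COUNT(*) FROM {fact} f "
        f"LEFT JOIN {dim} d ON f.{fk} = d.{pk} "
        f"AND f.{event_col} BETWEEN d.{valid_from} AND d.{valid_to} "
        f"WHERE d.{pk} IS NULL"
    )
```

Note that the temporal predicate lives in the join condition, not the WHERE clause — moving it to WHERE would silently turn the LEFT JOIN back into an inner join and hide the orphans the test exists to count.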
The fourth mess is quarantine versus halt. A team writes 30 row-level tests, ships them as severity: error, and the next merchant that registers with a UAE phone number breaks the regex test, fails the load, and the on-call gets paged at 3 a.m. for what is in fact a benign new feature. The lesson is that severity is a real configuration knob and it has to be set deliberately: row-level tests on optional fields go warn, row-level tests on financial fields go error, and quarantine — sending failing rows to a side table rather than rejecting the whole load — is the production-grade default for error rows on financial pipelines. The naive halt-on-failure suite ships fast but burns the on-call quickly; the quarantine pattern is the version that survives in production.
The fifth mess is the test-suite-as-config-creep. Six months in, the spec for one table has 90 tests in it, the runner takes 12 minutes, and nobody knows which tests have ever caught a real bug. The discipline is to track per-test failure history — which tests have failed in the last 90 days? — and to retire tests that have never fired in a year. A test that has never failed is not free; it is paying ongoing cost (runtime, cognitive load, false-positive risk) for a defence against a bug that may not exist. PhonePe's data platform does a quarterly review of its test suite and retires roughly 8–12% of tests each quarter, replacing them with new tests informed by recent incidents. The suite stays roughly the same size; the content changes to track the actual bug surface.
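The retirement review is easy to mechanise once per-test failure history exists. A sketch — the input shape (test name mapped to a list of failure dates) is an assumption about how the history is stored:

```python
import datetime

def retirement_candidates(failure_history, today, idle_days=365):
    """Return tests that have never fired, or not fired within idle_days."""
    cutoff = today - datetime.timedelta(days=idle_days)
    stale = []
    for name, failures in failure_history.items():
        if not failures or max(failures) < cutoff:
            stale.append(name)
    return sorted(stale)
```

The output is a review list, not a delete list — a test that has never fired on a regulatory field may be worth keeping anyway, which is why the quarterly review is human-in-the-loop.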
Common confusions
- "Row-level tests are the same as schema validation." Adjacent, not equivalent. Schema validation checks that the column exists and has the right type — that the row has an `amount` column of type `numeric`. Row-level tests check that the value is sane — that `amount > 0`. A row that passes schema validation can have `amount = -1` and still load; the row-level test is what rejects it. Schema is the structural contract; row-level tests are the semantic contract.
- "Table-level tests are slow, so run them weekly." They have to run per partition, on the partition that just landed, in the same DAG run as the load. Running them weekly turns a quality system into a forensic system — you find out the Tuesday partition was bad on Friday, after three days of dashboards have shown bad numbers. The runtime concern is a scoping problem (`WHERE partition = '2026-04-25'`), not a cadence problem.
- "Referential integrity is the database's job." It is in OLTP; it is not in the warehouse. Snowflake, BigQuery, Redshift, ClickHouse — none of them enforce foreign keys, even when you declare them. The warehouse declarations are hints to the query optimiser, not constraints. If your data model needs referential integrity, you write a referential test for it.
- "If freshness is green, the data is good." Freshness only measures the lag between event time and ingest time. The data can be fresh and 100% wrong — every value scaled by 100x because of the paise/rupee bug, every `merchant_id` set to NULL because of a producer rename. Freshness is one dimension of quality; it is not the dimension that catches semantic bugs.
- "More tests = better quality." A 200-test suite with 30 false-positive tests teaches the on-call to ignore the alert channel, which is worse than a 50-test suite with 0 false positives. The metric to optimise is bug catches per false alarm — track which tests have caught real bugs in the last quarter, retire the tests that haven't.
- "Quality tests replace data contracts." They enforce them. The contract (chapter 31) says "amount is non-negative INR rupees, ≤ 10 crore, with paise rounded to 2 decimal places". The quality tests are the runnable assertions of those clauses. The contract is the negotiated agreement; the tests are the daily verification that the agreement holds.
Going deeper
How dbt, Great Expectations, and Soda differ in their model
dbt tests are SQL-first and table-bound — every test is a SELECT that returns failing rows; if the count is non-zero, the test fails. The model is minimal and integrates cleanly with dbt's existing model graph (tests run after the model that produced the table). Great Expectations is Python-first and richer in test types — it has 50+ built-in expectations (expect_column_values_to_match_strftime_format, expect_column_kl_divergence_to_be_less_than) and a "data docs" UI that renders test results as HTML. Soda is YAML-first and operationally focused — its check spec is the simplest of the three, and its runner has built-in alerting integrations. The choice between them is mostly about the team's surrounding stack: dbt teams use dbt tests; Python-first teams use GE; teams that already have a YAML-driven pipeline framework gravitate to Soda. None of the three are wrong; the wrong move is to mix two of them in the same warehouse and have alerts arriving from both.
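For concreteness, part of the `payments_settled` spec above translates roughly into dbt's schema-file syntax like this — a sketch using dbt's four built-in generic tests, with model and dimension names carried over from the YAML spec:

```yaml
# models/schema.yml — roughly the dbt equivalent of part of the spec above
version: 2
models:
  - name: payments_settled
    columns:
      - name: txn_id
        tests: [not_null, unique]
      - name: currency
        tests:
          - accepted_values:
              values: ['INR', 'USD', 'SGD']
      - name: merchant_id
        tests:
          - relationships:
              to: ref('dim_merchants')
              field: merchant_id
```

The `relationships` test is dbt's spelling of the referential family; note that dbt has no built-in temporal variant, so the SCD-2 version from earlier still has to be written as a custom test.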
The reconciliation pattern: row counts across systems
A class of test that doesn't fit neatly into the three families is cross-system reconciliation: the count of rows in the OLTP system at time T should equal the count of rows in the warehouse for the same time window, ±X%. This is what catches the entire-batch-dropped class of bug — the warehouse load skipped a partition, but the row-level and table-level tests on the loaded partition all pass because they run on what's there. The reconciliation test runs on what's missing. The implementation is a query against the source system (SELECT COUNT(*) FROM payments WHERE created_at::date = CURRENT_DATE) and a query against the warehouse with the same predicate, with a tolerance band for in-flight rows. Razorpay runs reconciliation every 6 hours against its core OLTP cluster; PhonePe runs it hourly against its CDC stream's high-water mark. The reconciliation tests have caught more "silent partition drops" than any other test class, by a wide margin.
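The comparison itself is a two-count check with a tolerance band; a minimal sketch (the default 0.5% tolerance is illustrative, tuned per pipeline for expected in-flight volume):

```python
def reconciled(source_count, warehouse_count, tolerance_pct=0.5):
    """True when the warehouse count is within tolerance of the source count."""
    if source_count == 0:
        return warehouse_count == 0
    drift_pct = abs(source_count - warehouse_count) / source_count * 100
    return drift_pct <= tolerance_pct
```

The tolerance exists because rows committed in the OLTP system in the last few minutes legitimately haven't landed yet; a dropped partition produces drift far beyond any sane tolerance.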
Distribution drift as a quality signal
A table-level test that goes beyond simple aggregates is distribution drift detection: compare today's distribution of a column (mean, p50, p95, p99, distinct-count) against the trailing 28-day distribution, and alert when today's value sits more than 3 standard deviations from the trailing mean. This catches the class of bug where the data is well-shaped but the content has shifted — a producer release that changed the default value of a field, a merchant cohort that stopped contributing, a fraud-attack pattern that's pushing the amount distribution rightward. It is more expensive to compute than a fixed-threshold test (you need historical comparison data) and it has a higher false-positive rate (genuine business changes look like drift), but it is the only test class that catches "the data looks fine, the numbers feel wrong" complaints from analysts. Most teams add distribution-drift checks for the top 10% of business-critical columns and skip them for the rest.
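The core of the check is a z-score against the trailing window. A sketch, assuming the daily statistics are already materialised as a list of trailing values:

```python
import statistics

def drifted(todays_value, trailing_values, n_sigmas=3.0):
    """Flag a daily statistic more than n_sigmas from its trailing mean."""
    mean = statistics.mean(trailing_values)
    sd = statistics.stdev(trailing_values)
    if sd == 0:
        return todays_value != mean  # flat history: any change is drift
    return abs(todays_value - mean) > n_sigmas * sd
```

The same function works for any of the tracked statistics (mean, p95, distinct-count) — what differs is the query that produces today's value and the trailing series.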
Quality tests in the streaming world
Streaming pipelines (Build 8) have their own version of all three families, but the cadence collapses: row-level tests run as a Filter operator on the stream, table-level tests run on tumbling windows, referential tests run as stream-table joins. The model is conceptually identical to batch — assertions on data — but the implementation lives inside the stream-processing graph rather than as a separate runner. Apache Flink's BroadcastState pattern is commonly used to maintain a side stream of test results, and downstream consumers can subscribe to it the same way they'd subscribe to an alerting topic. The trade-off is that streaming quality tests have to be cheap (you can't run a 90-second SQL query per micro-batch); the cheap tests are usually a strict subset of the batch tests, and the catch-rate is correspondingly lower.
When to gate vs when to quarantine
The decision tree for a failing row-level test has three options: gate (halt the whole load and alert), quarantine (move the row to a side table and continue loading the rest), or flag and pass (load the row but mark it). Gate is right for tests where any single failure means the whole batch is wrong — schema mismatch, version drift, a currency value of XYZ (because that means the producer is broken). Quarantine is right for tests where some-percentage failure is normal — phone-regex failures from international users, optional-field nulls. Flag-and-pass is right for tests that are diagnostic only — email_format_uncommon for analytics, no business decision depends on it. Mapping each test to one of these three explicitly, in the spec, is what turns an alert-storm into a maintainable on-call rotation.
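Making the mapping explicit in the spec means the runner can dispatch on it instead of hard-coding policy. A sketch — the `on_fail` field and action names are assumptions extending the earlier spec format:

```python
def dispatch(test_spec, failing_rows):
    """Map a test result to its declared action: gate, quarantine, or flag."""
    action = test_spec.get("on_fail", "gate")  # safest default: halt
    if failing_rows == 0:
        return "pass"
    if action == "gate":
        return "halt_load"        # any failure means the batch is wrong
    if action == "quarantine":
        return "side_table"       # move failing rows, keep loading the rest
    if action == "flag":
        return "load_with_flag"   # diagnostic only; load and mark
    raise ValueError(f"unknown on_fail action: {action}")
```

Defaulting to gate is the conservative choice: a test someone forgot to classify halts the load and forces the classification conversation, rather than silently passing bad rows.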
Where this leads next
The next chapter (35) covers anomaly detection — the ML-flavoured cousin of distribution-drift testing, useful for the patterns that hard thresholds miss. Chapter 36 closes Build 5 with incident response: when a quality test breaches a tier-1 SLO at 3 a.m., what does the runbook say?
- Freshness SLOs: the data-eng analog of uptime — chapter 33, the lag-side counterpart to quality tests.
- Data contracts: the producer/consumer boundary — chapter 31, the negotiated agreement that quality tests enforce daily.
- Schema registries and the evolution problem — chapter 32, the structural layer that row-level tests depend on for stable column shapes.
- Column-level lineage: why it's hard and why it matters — when a quality test fails, lineage is what tells you which downstream consumers are now suspect.
Quality tests are the smallest unit of executable trust in a warehouse. Without them, the warehouse is a place where data lands and is assumed to be correct because the pipeline turned green. With them, the warehouse is a place where every dataset has a runnable contract, and every load is a verified transaction. The cultural shift — from "the load succeeded" to "the load passed its tests" — is what makes a data team a credible peer to the SRE team, the security team, and the product team. The bar a team should hold itself to: pick a random tier-1 table and ask "what tests run on every load, and when did they last fail?" If the answer is a list of names with timestamps in the last 30 days, the discipline is in place. If the answer is "we'll add tests once we have time" — that's the answer the team that filed the NPCI anomaly report gave six months ago, too.
References
- dbt tests documentation — the canonical reference for SQL-first test specs in a dbt project, including the four built-in tests (`unique`, `not_null`, `accepted_values`, `relationships`) and custom singular/generic test patterns.
- Great Expectations documentation — Python-first expectation library with 50+ built-in expectations and a data-docs HTML renderer; richer than dbt tests but heavier to operate.
- Soda Core documentation — YAML-first quality framework with a runner, alerting, and check-as-code patterns.
- Niels Claeys, "The Six Pillars of Data Observability" (Datafold, 2022) — practical breakdown of how quality tests fit alongside freshness, lineage, and schema monitoring.
- Andy Petrella, "Fundamentals of Data Observability" (O'Reilly, 2023) — book-length treatment of the test families with chapter-length examples for SCD referential checks.
- Apache Griffin documentation — open-source data-quality framework with batch + stream-mode test execution.
- Data contracts: the producer/consumer boundary — chapter 31, the document where the assertions that quality tests enforce are written down and version-controlled.
- Freshness SLOs: the data-eng analog of uptime — chapter 33, the lag-side counterpart that quality tests must run alongside.