Quality tests: row-level, table-level, referential

It is 2:14 a.m. at PhonePe's data platform office in Bengaluru. The freshness SLO is green. The DAGs are green. The dashboard is green. And yet the morning fraud-ops team will discover, eight hours from now, that 1.4 lakh transactions were silently scored as merchant_category = 'unknown' because a producer release the previous evening renamed mcc_code to merchant_mcc and the silver-layer transform now coalesces every row to NULL. The pipeline ran. The data landed. The freshness lag is 41 seconds. The data is perfectly fresh and perfectly wrong. The on-call engineer that night had no quality tests on the silver table — only freshness — and the cost of that gap is a regulatory anomaly report filed with NPCI by 11 a.m.

Quality tests are runnable assertions on the data itself, organised into three families: row-level (does this row obey its column-shape rules?), table-level (does this whole table obey its aggregate rules?), and referential (do rows in this table line up with rows in another?). Each family catches a different bug, runs on a different cadence, and fails differently — and a pipeline without all three accepts garbage as long as the garbage is fresh and well-shaped.

The three families and what each one catches

Freshness tells you the data is recent. Quality tests tell you the data is right. Those are different problems with different mechanisms, and lumping them under "data quality" — a single phrase used to mean five things — is how teams ship a freshness monitor and call the work done. The three families below are not arbitrary categories; they correspond to the three places where bugs actually enter a warehouse.

Three families of quality tests (diagram): three vertical stacks. Row-level — one row at a time: not_null(amount), amount > 0, currency in {INR, USD}, phone matches /^\d{10}$/, event_time <= now(); catches producer bugs, type drift, encoding errors; cadence: every load. Table-level — the whole table at once: row count between 1L–5L, unique(txn_id), sum(amount) within ±5%, null_rate(mcc) < 0.1%, distinct(merchant) ≥ 8000; catches dropped batches, duplicate loads, distribution drift; cadence: per partition. Referential — across tables: every txn.merchant_id exists in dim_merchant, no orphan refunds, snapshot at t aligns with the dimension snapshot at t; catches join breakage, SCD drift, late loads; cadence: post-merge. A quality regime ships all three; skipping one accepts a whole bug class.
Row-level tests look at one row in isolation; table-level tests look at the whole table; referential tests look at rows across two or more tables. Each catches bugs the others structurally cannot.

Row-level tests check column-shape rules on each row independently. amount > 0, currency IN ('INR','USD','SGD'), email MATCHES /.+@.+/, event_time <= NOW(), mcc_code NOT NULL. They run during ingest, before the row is committed to the silver table — or immediately after, in the post-load test pass. They catch the class of bug where a single row arrives malformed: a Razorpay producer ships a release that puts the rupee amount in paise but the contract says rupees, and 100x-inflated amounts start arriving. Every single row is wrong by the same factor. Why row-level checks must run on every load: the cost of letting a row through is unbounded — that row joins to a fact table, gets aggregated into a daily total, and propagates through the warehouse. The cost of rejecting a row to a quarantine table is a few KB and one alert. The asymmetry says check on entry, not at the end.

Table-level tests check aggregate properties of the whole table — COUNT(*), COUNT(DISTINCT txn_id) = COUNT(*), SUM(amount), null_rate(merchant_id) < 0.001. They run per partition, after a partition is closed. They catch a fundamentally different bug class: every individual row passed its row-level checks, but the table as a whole is wrong. The classic example is the 4 a.m. Zerodha trade-tick load that drops half the partitions because the source S3 prefix changed and the loader is reading from the old prefix. Every individual row is well-formed; the table just has 50% of its rows missing. Row-level tests cannot detect this because they only see one row at a time.

Referential tests check relationships across tables — every txn.merchant_id exists in dim_merchant.merchant_id, no refund references a non-existent payment, the snapshot of dim_user at time T joins cleanly to fact_orders at time T. They run after a merge or transformation that brings two tables into contact. They catch the bug class where each individual table is internally fine but they have drifted with respect to each other — the dimension table was loaded yesterday, the fact table was loaded today, a new merchant was added in between, and now 3% of today's transactions reference a merchant the dimension table has never heard of. Row-level and table-level tests on either table individually pass; the joined view is broken.

The categorisation matters because the three families have different failure modes and different blast radii. A row-level failure is a row-shaped problem — quarantine the row, alert if the rate exceeds a threshold, keep loading. A table-level failure is a partition-shaped problem — the whole partition is suspect, halt downstream consumption, page someone. A referential failure is a join-shaped problem — the data exists, but the lookup or aggregation that uses both tables will produce nonsense, so block any consumer query that joins them until the dimension catches up. Treating all three as "the data quality alert" and routing them to the same Slack channel is how teams either page on noise or under-respond to a regulatory-grade incident.

Implementing the three families with a runnable test runner

The shape that has stabilised across teams — dbt tests, Great Expectations, Soda — is a declarative spec (YAML or JSON) plus a runner that translates each test into a SQL query, executes it, and records pass/fail. The minimal version below fits in 80 lines and is enough to ship to a staging warehouse.

# quality_runner.py — read a test spec, run each test as SQL, record pass/fail.
# Usage: python quality_runner.py tests/payments_settled.yaml
import sys, yaml, psycopg2, json, datetime, os

PG_DSN = os.environ["WAREHOUSE_DSN"]

def row_level_sql(table, test):
    """Each row-level test compiles to: SELECT COUNT(*) of failing rows."""
    col, kind = test["column"], test["kind"]
    if kind == "not_null":   return f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
    if kind == "positive":   return f"SELECT COUNT(*) FROM {table} WHERE {col} <= 0"
    if kind == "in_set":
        vals = ",".join(f"'{v}'" for v in test["values"])
        return f"SELECT COUNT(*) FROM {table} WHERE {col} NOT IN ({vals})"
    if kind == "regex":      return f"SELECT COUNT(*) FROM {table} WHERE {col} !~ '{test['pattern']}'"
    if kind == "in_past":    return f"SELECT COUNT(*) FROM {table} WHERE {col} > NOW()"
    raise ValueError(f"unknown row-level test: {kind}")

def table_level_sql(table, test):
    """Table-level tests compile to a single scalar comparison."""
    kind = test["kind"]
    if kind == "row_count_between":
        return (f"SELECT COUNT(*) FROM {table}", lambda c: test["min"] <= c <= test["max"])
    if kind == "unique":
        col = test["column"]
        return (f"SELECT COUNT(*) - COUNT(DISTINCT {col}) FROM {table}", lambda c: c == 0)
    if kind == "null_rate_below":
        col, thr = test["column"], test["threshold"]
        return (f"SELECT 1.0 * SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) / COUNT(*) FROM {table}",
                lambda r: r < thr)
    raise ValueError(f"unknown table-level test: {kind}")

def referential_sql(parent, child, fk_child, pk_parent):
    """Referential test: every child.fk has a matching parent.pk."""
    return f"SELECT COUNT(*) FROM {child} c LEFT JOIN {parent} p ON c.{fk_child} = p.{pk_parent} WHERE p.{pk_parent} IS NULL"

def run_tests(spec_path):
    spec = yaml.safe_load(open(spec_path))
    table = spec["table"]
    conn = psycopg2.connect(PG_DSN); cur = conn.cursor()
    results = []
    for t in spec.get("row_level", []):
        cur.execute(row_level_sql(table, t)); failing = cur.fetchone()[0]
        results.append({"family": "row", "name": t["name"], "failing_rows": failing,
                        "passed": failing == 0, "severity": t.get("severity", "error")})
    for t in spec.get("table_level", []):
        sql, predicate = table_level_sql(table, t)
        cur.execute(sql); val = cur.fetchone()[0]
        results.append({"family": "table", "name": t["name"], "value": float(val),
                        "passed": predicate(val), "severity": t.get("severity", "error")})
    for t in spec.get("referential", []):
        cur.execute(referential_sql(t["parent"], table, t["fk"], t["pk"])); orphans = cur.fetchone()[0]
        results.append({"family": "ref", "name": t["name"], "orphans": orphans,
                        "passed": orphans == 0, "severity": t.get("severity", "error")})
    return results

if __name__ == "__main__":
    out = run_tests(sys.argv[1])
    failed = [r for r in out if not r["passed"]]
    print(json.dumps({"timestamp": datetime.datetime.utcnow().isoformat() + "Z",
                      "total": len(out), "failed": len(failed), "results": out}, indent=2))
    sys.exit(1 if any(r["severity"] == "error" and not r["passed"] for r in out) else 0)

The companion spec for payments_settled is a tight YAML file:

# tests/payments_settled.yaml
table: warehouse.fact.payments_settled
row_level:
  - {name: amount_not_null, column: amount, kind: not_null, severity: error}
  - {name: amount_positive, column: amount, kind: positive, severity: error}
  - {name: currency_in_set, column: currency, kind: in_set, values: [INR, USD, SGD], severity: error}
  - {name: phone_well_formed, column: payer_phone, kind: regex, pattern: '^\+91[0-9]{10}$', severity: warn}
  - {name: event_time_in_past, column: event_time, kind: in_past, severity: error}
table_level:
  - {name: hourly_volume_in_band, kind: row_count_between, min: 80000, max: 600000, severity: error}
  - {name: txn_id_unique, column: txn_id, kind: unique, severity: error}
  - {name: mcc_null_rate, column: mcc_code, kind: null_rate_below, threshold: 0.001, severity: error}
referential:
  - {name: merchant_id_exists, parent: warehouse.dim.merchants, fk: merchant_id, pk: merchant_id, severity: error}
  - {name: payment_method_exists, parent: warehouse.dim.payment_methods, fk: method_id, pk: method_id, severity: warn}

A sample run against a freshly loaded partition — one referential failure:

$ python quality_runner.py tests/payments_settled.yaml
{
  "timestamp": "2026-04-25T08:14:22Z",
  "total": 10,
  "failed": 2,
  "results": [
    {"family": "row", "name": "amount_not_null", "failing_rows": 0, "passed": true, "severity": "error"},
    {"family": "row", "name": "amount_positive", "failing_rows": 0, "passed": true, "severity": "error"},
    {"family": "row", "name": "currency_in_set", "failing_rows": 0, "passed": true, "severity": "error"},
    {"family": "row", "name": "phone_well_formed", "failing_rows": 38, "passed": false, "severity": "warn"},
    {"family": "row", "name": "event_time_in_past", "failing_rows": 0, "passed": true, "severity": "error"},
    {"family": "table", "name": "hourly_volume_in_band", "value": 142884.0, "passed": true, "severity": "error"},
    {"family": "table", "name": "txn_id_unique", "value": 0.0, "passed": true, "severity": "error"},
    {"family": "table", "name": "mcc_null_rate", "value": 0.0004, "passed": true, "severity": "error"},
    {"family": "ref", "name": "merchant_id_exists", "orphans": 217, "passed": false, "severity": "error"},
    {"family": "ref", "name": "payment_method_exists", "orphans": 0, "passed": true, "severity": "warn"}
  ]
}
$ echo $?
1

The piece worth dwelling on is the single-row failing-count pattern at the bottom of row_level_sql. Each row-level test compiles to "count rows that fail the predicate" and the test passes iff the count is 0. This pattern is what dbt and Great Expectations both adopt because it has two production-friendly properties: it produces a number (so the alert can say "217 rows failed" rather than just "test failed"), and it can be inverted into a quarantine query (replace SELECT COUNT(*) with SELECT *) to actually inspect the offending rows. Why "count failing rows" beats "all rows pass": when 217 of 142,884 rows fail, the right response is rarely "halt the pipeline". The right response is "quarantine 0.15% of rows, alert if the rate climbs above 0.5%, keep loading". A boolean pass/fail throws away the rate, and the rate is what tells you whether to page or to ticket.
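The inversion can be sketched directly against the runner above — the helper name and the quarantine-table convention are illustrative, not part of the runner:

```python
def quarantine_sql(table, test, quarantine_table):
    """Invert a failing-count row-level test into a pair of statements that
    move the offending rows to a side table. Predicates mirror row_level_sql:
    the WHERE clause is identical; only the verb in front of it changes."""
    col, kind = test["column"], test["kind"]
    predicates = {
        "not_null": f"{col} IS NULL",
        "positive": f"{col} <= 0",
        "in_past":  f"{col} > NOW()",
    }
    pred = predicates[kind]
    return (f"INSERT INTO {quarantine_table} SELECT * FROM {table} WHERE {pred}",
            f"DELETE FROM {table} WHERE {pred}")

ins, dele = quarantine_sql("warehouse.fact.payments_settled",
                           {"column": "amount", "kind": "positive"},
                           "warehouse.quarantine.payments_settled")
```

Because the count query and the quarantine query share one predicate, the alert that says "217 rows failed" can link straight to the side table holding those 217 rows.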

run_tests is the orchestrator. It opens one connection (one query per test, all serialised) and produces a structured result object that downstream tooling — alerting, dashboards, audit logs — consumes. Production setups parallelise the queries (warehouses are happy with concurrent reads on different tables) and cache results so re-running the suite after a fix doesn't re-execute the passing tests. The severity field lets the runner exit non-zero only on error-severity failures; warn failures show up in the report but don't block the pipeline. The phone-regex failure above is warn because Razorpay payments come in from international payers occasionally and the regex is intentionally India-strict — a warning rate of 0.03% is normal, a rate of 5% would mean the upstream stopped formatting Indian numbers correctly.
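The parallel version is a small change — a sketch using concurrent.futures, assuming each worker opens its own connection (psycopg2 connections should not be shared across threads):

```python
from concurrent.futures import ThreadPoolExecutor

def run_suite_parallel(tests, run_one, max_workers=8):
    """Run independent quality tests concurrently. `run_one` executes a single
    test (compiling and running its SQL on a per-thread connection) and returns
    a result dict; map() preserves input order, so the report stays stable."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, tests))

# Stand-in runner for illustration; a real one would execute the compiled SQL:
results = run_suite_parallel(
    [{"name": "amount_not_null"}, {"name": "txn_id_unique"}],
    lambda t: {"name": t["name"], "passed": True},
)
```

The warehouse sees the same queries either way; the suite's wall-clock time drops to roughly the slowest single test.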

The exit code semantics are the integration point with the orchestrator. Airflow / Dagster / Prefect run quality_runner.py as a task downstream of the load task. Exit 0 = continue. Exit 1 = the load is suspect, halt the downstream transforms, alert. The exit code is what makes the runner a circuit breaker: a quality failure stops the bad data from propagating to the silver layer, the marts, and the dashboards. Why the circuit-breaker pattern matters here specifically: in a warehouse, propagation is one-way and irreversible — once the silver table has the bad rows, the marts have copies, the dashboards have cached the marts, and rolling back means the painful UNLOAD-and-reload sequence. Halting at the bronze/silver boundary with a quality-test failure costs one delayed dashboard refresh; letting the bad data through costs a multi-hour incident.
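The contract can be sketched independently of any one orchestrator — a thin wrapper that propagates the runner's exit code, which is exactly what an Airflow BashOperator or a cron job does with a failing command (the command paths are illustrative):

```python
import subprocess, sys

def quality_gate(cmd):
    """Run the quality suite as a subprocess and propagate its exit code.
    A non-zero code raises, which an orchestrator maps to 'task failed,
    skip downstream' -- the circuit breaker at the bronze/silver boundary."""
    proc = subprocess.run(cmd)
    if proc.returncode != 0:
        raise SystemExit(proc.returncode)
    return 0

# Passing suite: downstream transforms proceed.
ok = quality_gate([sys.executable, "-c", "raise SystemExit(0)"])

# Failing suite: the gate trips and downstream is never scheduled.
try:
    quality_gate([sys.executable, "-c", "raise SystemExit(1)"])
    tripped = False
except SystemExit as e:
    tripped = (e.code == 1)
```

In a DAG this sits as its own task between the load and the silver transforms, so a trip halts exactly the consumers of the suspect partition.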

Where the three families fail in production

The clean three-family model collides with real warehouses in five recurring ways. The first is row-level test cardinality blowup. A check like "amount > 0" runs in 200 ms on a 1 lakh row table; on a 10 crore row table after six months of unsampled history, the same check runs in 90 seconds, and the suite of 30 row-level tests now takes 45 minutes to run. The fix is to scope row-level tests to the partition that just landed (WHERE load_date = CURRENT_DATE) rather than the whole table. The cost is that you can't catch historical drift with the same suite — but historical drift is what reconciliation jobs catch, on a different cadence.
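The scoping fix is a one-line rewrite of the generated SQL — a sketch assuming the table carries a load_date partition column (both names are illustrative):

```python
def scope_to_partition(sql, partition_col="load_date", partition="CURRENT_DATE"):
    """Append a partition predicate so a row-level test scans only the
    partition that just landed, not six months of history."""
    clause = f"{partition_col} = {partition}"
    # Tests compiled with an existing predicate get AND; bare counts get WHERE.
    return f"{sql} AND {clause}" if " WHERE " in sql else f"{sql} WHERE {clause}"

scoped = scope_to_partition("SELECT COUNT(*) FROM fact.payments_settled WHERE amount <= 0")
bare = scope_to_partition("SELECT COUNT(*) FROM fact.payments_settled")
```

On a partitioned warehouse table the scoped query prunes to one partition, so the 90-second full scan drops back to the 200 ms class.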

The second mess is the table-level threshold trap. "Row count between 80,000 and 600,000 per hour" sounds reasonable until Diwali, when payment volume goes 4× and the upper bound trips for six hours straight. The team raises the upper bound to 2,000,000 and now the threshold is so loose that a missing partition with 50% of normal volume slides through unnoticed. The real fix is to compute the band dynamically — between (avg_last_28d * 0.5) and (avg_last_28d * 2.5) — so the threshold tracks the seasonal pattern. The static-threshold version of this test is structurally unable to catch volume drops on high-traffic days, and most teams discover this the morning after their first festival outage.
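The dynamic band is just arithmetic over the trailing window — a sketch with the illustrative 0.5× / 2.5× factors from above:

```python
def dynamic_band(trailing_counts, low_factor=0.5, high_factor=2.5):
    """Derive the row-count band from a trailing window instead of a static
    threshold; the factors are tuning knobs, not universal constants."""
    avg = sum(trailing_counts) / len(trailing_counts)
    return (avg * low_factor, avg * high_factor)

# 28 days of hourly volume that doubles toward a festival peak:
lo, hi = dynamic_band([100_000] * 21 + [200_000] * 7)
```

Because the band tracks the trailing average, a 4× festival surge widens it automatically, while a 50% drop on a high-traffic day still falls below the lower bound.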

The third mess is referential tests across SCD-2 tables. A slowly-changing-dimension table for merchants stores multiple historical rows per merchant_id, each valid for a window. The naive test "every fact.merchant_id exists in dim_merchant.merchant_id" passes on cardinality alone, but the right test is "every fact row joins to the merchant snapshot valid at the fact's event_time" — a temporal-join test that's harder to write and slower to run. Skipping the temporal version produces silent SCD drift: a merchant's mcc_code changed in February, the fact table from January joins to the new SCD row, and the analytics on January data are wrong by 4%. The fix is to run referential tests as LEFT JOIN dim_merchant d ON f.merchant_id = d.merchant_id AND f.event_time BETWEEN d.valid_from AND d.valid_to, which is the single most subtle and most-skipped test in the data warehouse.

Referential tests against SCD-2 tables need temporal joins (diagram): a timeline shows three SCD-2 versions of merchant M-901 — v1 (mcc 5411) valid Jan 1–Feb 14, v2 (mcc 5499) valid Feb 15–Mar 31, v3 (mcc 5732) valid Apr 1 onwards. Below it sit three fact rows: a Jan 20 transaction should join v1, a Feb 28 transaction v2, an Apr 10 transaction v3. A naive join on merchant_id alone routes all three facts to whichever SCD row is current — silent 4% drift on Jan/Feb analytics; a temporal join on event_time BETWEEN valid_from AND valid_to routes each fact to the version valid at its event_time, and is the test that matters.
The most-skipped quality test in the warehouse: that a fact row joins to the correct historical version of its dimension. Naive id-only joins quietly accept the latest dimension values for old facts.
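The temporal variant of the runner's referential_sql can be sketched as follows — the valid_from/valid_to column names are assumptions about the SCD-2 table's layout:

```python
def temporal_referential_sql(fact, dim, fk, pk, ts,
                             valid_from="valid_from", valid_to="valid_to"):
    """Count fact rows that fail to join the dimension version valid at their
    event time. A zero count means every fact found its historical snapshot."""
    return (
        f"SELECT COUNT(*) FROM {fact} f "
        f"LEFT JOIN {dim} d "
        f"ON f.{fk} = d.{pk} AND f.{ts} BETWEEN d.{valid_from} AND d.{valid_to} "
        f"WHERE d.{pk} IS NULL"
    )

sql = temporal_referential_sql("fact.payments_settled", "dim.merchants",
                               "merchant_id", "merchant_id", "event_time")
```

Note that the validity predicate lives in the ON clause, not the WHERE clause: moving it to WHERE would silently turn the LEFT JOIN back into an inner join and hide the orphans the test exists to count.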

The fourth mess is quarantine versus halt. A team writes 30 row-level tests, ships them as severity: error, and the next merchant that registers with a UAE phone number breaks the regex test, fails the load, and the on-call gets paged at 3 a.m. for what is in fact a benign new feature. The lesson is that severity is a real configuration knob and it has to be set deliberately: row-level tests on optional fields go warn, row-level tests on financial fields go error, and quarantine — sending failing rows to a side table rather than rejecting the whole load — is the production-grade default for error rows on financial pipelines. The naive halt-on-failure suite ships fast but burns the on-call quickly; the quarantine pattern is the version that survives in production.

The fifth mess is the test-suite-as-config-creep. Six months in, the spec for one table has 90 tests in it, the runner takes 12 minutes, and nobody knows which tests have ever caught a real bug. The discipline is to track per-test failure history — which tests have failed in the last 90 days? — and to retire tests that have never fired in a year. A test that has never failed is not free; it is paying ongoing cost (runtime, cognitive load, false-positive risk) for a defence against a bug that may not exist. PhonePe's data platform does a quarterly review of its test suite and retires roughly 8–12% of tests each quarter, replacing them with new tests informed by recent incidents. The suite stays roughly the same size; the content changes to track the actual bug surface.

Going deeper

How dbt, Great Expectations, and Soda differ in their model

dbt tests are SQL-first and table-bound — every test is a SELECT that returns failing rows; if the count is non-zero, the test fails. The model is minimal and integrates cleanly with dbt's existing model graph (tests run after the model that produced the table). Great Expectations is Python-first and richer in test types — it has 50+ built-in expectations (expect_column_values_to_match_strftime_format, expect_column_kl_divergence_to_be_less_than) and a "data docs" UI that renders test results as HTML. Soda is YAML-first and operationally focused — its check spec is the simplest of the three, and its runner has built-in alerting integrations. The choice between them is mostly about the team's surrounding stack: dbt teams use dbt tests; Python-first teams use GE; teams that already have a YAML-driven pipeline framework gravitate to Soda. None of the three are wrong; the wrong move is to mix two of them in the same warehouse and have alerts arriving from both.
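For comparison, the row-level and referential slices of the spec above map onto dbt's four built-in generic tests — a sketch of the schema YAML, with model and ref names assumed to mirror this chapter's tables:

```yaml
# models/marts/schema.yml — dbt's declarative equivalent of the spec above
version: 2
models:
  - name: payments_settled
    columns:
      - name: txn_id
        tests: [unique, not_null]
      - name: amount
        tests: [not_null]
      - name: currency
        tests:
          - accepted_values:
              values: ['INR', 'USD', 'SGD']
      - name: merchant_id
        tests:
          - relationships:
              to: ref('dim_merchants')
              field: merchant_id
```

The relationships test is dbt's built-in referential check; anything beyond these four (ranges, regexes, null-rate thresholds) becomes a custom generic test written as a SELECT that returns failing rows.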

The reconciliation pattern: row counts across systems

A class of test that doesn't fit neatly into the three families is cross-system reconciliation: the count of rows in the OLTP system at time T should equal the count of rows in the warehouse for the same time window, ±X%. This is what catches the entire-batch-dropped class of bug — the warehouse load skipped a partition, but the row-level and table-level tests on the loaded partition all pass because they run on what's there. The reconciliation test runs on what's missing. The implementation is a query against the source system (SELECT COUNT(*) FROM payments WHERE created_at::date = CURRENT_DATE) and a query against the warehouse with the same predicate, with a tolerance band for in-flight rows. Razorpay runs reconciliation every 6 hours against its core OLTP cluster; PhonePe runs it hourly against its CDC stream's high-water mark. The reconciliation tests have caught more "silent partition drops" than any other test class, by a wide margin.
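The tolerance check itself is trivial once the two counts are in hand — a sketch, with the tolerance acting as the in-flight-rows knob:

```python
def reconcile(source_count, warehouse_count, tolerance=0.01):
    """Cross-system reconciliation: pass iff the warehouse count is within
    `tolerance` of the source count. The band absorbs in-flight rows that
    have committed in the OLTP system but not yet landed in the warehouse."""
    if source_count == 0:
        return warehouse_count == 0
    drift = abs(source_count - warehouse_count) / source_count
    return drift <= tolerance

# In-flight rows stay inside the band; a dropped partition does not:
ok = reconcile(source_count=1_000_000, warehouse_count=998_500)
bad = reconcile(source_count=1_000_000, warehouse_count=500_000)
```

The hard part in practice is not this arithmetic but agreeing on the time predicate — both counts must scope to the same event-time window, or the test flaps at every window boundary.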

Distribution drift as a quality signal

A table-level test that goes beyond simple aggregates is distribution drift detection: compare today's distribution of a column (mean, p50, p95, p99, distinct-count) against the trailing 28-day distribution, and alert when today's value sits more than 3 standard deviations from the trailing mean. This catches the class of bug where the data is well-shaped but the content has shifted — a producer release that changed the default value of a field, a merchant cohort that stopped contributing, a fraud-attack pattern that's pushing the amount distribution rightward. It is more expensive to compute than a fixed-threshold test (you need historical comparison data) and it has a higher false-positive rate (genuine business changes look like drift), but it is the only test class that catches "the data looks fine, the numbers feel wrong" complaints from analysts. Most teams add distribution-drift checks for the top 10% of business-critical columns and skip them for the rest.
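The z-score flavour of the check is a few lines — a sketch with the 3-sigma default; the trailing values are made up for illustration:

```python
import statistics

def drift_alert(today, trailing, sigmas=3.0):
    """Flag today's statistic if it sits more than `sigmas` standard
    deviations from the trailing-window mean. Window length and sigma
    are tuning knobs; both trade false positives against sensitivity."""
    mean = statistics.fmean(trailing)
    sd = statistics.stdev(trailing)
    if sd == 0:
        return today != mean
    return abs(today - mean) / sd > sigmas

# p95 amount drifting right under a fraud pattern vs. a normal day:
trailing = [510, 495, 505, 500, 498, 502, 490, 508, 497, 503]
alert = drift_alert(640, trailing)
```

The same function runs against mean, p50, p95, or distinct-count once each statistic is materialised per day, which is why the expensive part is the historical table, not the comparison.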

Quality tests in the streaming world

Streaming pipelines (Build 8) have their own version of all three families, but the cadence collapses: row-level tests run as a filter operator on the stream, table-level tests run on tumbling windows, referential tests run as stream-table joins. The model is conceptually identical to batch — assertions on data — but the implementation lives inside the stream-processing graph rather than as a separate runner. In Apache Flink, a common pattern is to emit test results on a side output as their own stream (with broadcast state distributing rule updates to the checking operators), and downstream consumers can subscribe to that stream the same way they'd subscribe to an alerting topic. The trade-off is that streaming quality tests have to be cheap (you can't run a 90-second SQL query per micro-batch); the cheap tests are usually a strict subset of the batch tests, and the catch rate is correspondingly lower.

When to gate vs when to quarantine

The decision tree for a failing row-level test has three options: gate (halt the whole load and alert), quarantine (move the row to a side table and continue loading the rest), or flag and pass (load the row but mark it). Gate is right for tests where any single failure means the whole batch is wrong — schema mismatch, version drift, a currency value of XYZ (because that means the producer is broken). Quarantine is right for tests where some-percentage failure is normal — phone-regex failures from international users, optional-field nulls. Flag-and-pass is right for tests that are diagnostic only — email_format_uncommon for analytics, no business decision depends on it. Mapping each test to one of these three explicitly, in the spec, is what turns an alert-storm into a maintainable on-call rotation.
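Made explicit in the spec, the mapping could look like this — on_fail and max_rate are hypothetical fields that the runner shown earlier would need to learn, not something it currently reads:

```yaml
# Hypothetical extension of tests/payments_settled.yaml: one explicit
# failure policy per test, replacing the coarse severity knob.
row_level:
  - {name: currency_in_set, column: currency, kind: in_set, values: [INR, USD, SGD], on_fail: gate}
  - {name: phone_well_formed, column: payer_phone, kind: regex, pattern: '^\+91[0-9]{10}$', on_fail: quarantine, max_rate: 0.005}
  - {name: email_format_uncommon, column: payer_email, kind: regex, pattern: '.+@.+', on_fail: flag}
```

gate halts the load on any failure, quarantine side-tables failing rows and alerts only above max_rate, and flag loads the row with a marker column — the three responses from the decision tree, written down where the on-call can see them.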

Where this leads next

The next chapter (35) covers anomaly detection — the ML-flavoured cousin of distribution-drift testing, useful for the patterns that hard thresholds miss. Chapter 36 closes Build 5 with incident response: when a quality test breaches a tier-1 SLO at 3 a.m., what does the runbook say?

Quality tests are the smallest unit of executable trust in a warehouse. Without them, the warehouse is a place where data lands and is assumed to be correct because the pipeline turned green. With them, the warehouse is a place where every dataset has a runnable contract, and every load is a verified transaction. The cultural shift — from "the load succeeded" to "the load passed its tests" — is what makes a data team a credible peer to the SRE team, the security team, and the product team. The bar a team should hold itself to: pick a random tier-1 table and ask "what tests run on every load, and when did they last fail?" If the answer is a list of names with timestamps in the last 30 days, the discipline is in place. If the answer is "we'll add tests once we have time" — that's the answer the team that filed the NPCI anomaly report gave six months ago, too.

References

  1. dbt tests documentation — the canonical reference for SQL-first test specs in a dbt project, including the four built-in tests (unique, not_null, accepted_values, relationships) and custom singular/generic test patterns.
  2. Great Expectations documentation — Python-first expectation library with 50+ built-in expectations and a data-docs HTML renderer; richer than dbt tests but heavier to operate.
  3. Soda Core documentation — YAML-first quality framework with a runner, alerting, and check-as-code patterns.
  4. Niels Claeys, "The Six Pillars of Data Observability" (Datafold, 2022) — practical breakdown of how quality tests fit alongside freshness, lineage, and schema monitoring.
  5. Andy Petrella, "Fundamentals of Data Observability" (O'Reilly, 2023) — book-length treatment of the test families with chapter-length examples for SCD referential checks.
  6. Apache Griffin documentation — open-source data-quality framework with batch + stream-mode test execution.
  7. Data contracts: the producer/consumer boundary — chapter 31, the document where the assertions that quality tests enforce are written down and version-controlled.
  8. Freshness SLOs: the data-eng analog of uptime — chapter 33, the lag-side counterpart that quality tests must run alongside.