dbt tests, Great Expectations, Soda — the landscape

It is Tuesday afternoon at a fintech in Bengaluru. The analytics team — two analysts and one data engineer named Aditi — has decided that the silent partition-drop incident from last quarter cannot happen again. Aditi opens a browser tab, types "data quality framework", and finds three names competing for the same job: dbt tests, Great Expectations, Soda. The blog posts all sound similar. The marketing pages all sound similar. By Friday she has to ship a recommendation, and by next Friday the chosen tool has to be running on the four tier-1 tables. The choice she's about to make will shape how the team writes tests, debugs failures, and onboards the next hire for the next two years — and most of the trade-offs that matter aren't on any of the comparison pages.

The three frameworks made fundamentally different bets. dbt tests are SQL-first and bound to the dbt model graph — cheap to start, hard to use without dbt. Great Expectations is Python-first with 50+ rich expectations and an HTML data-docs UI — flexible, heavier to operate. Soda is YAML-first with a tight check spec and built-in alerting — simplest to read, narrowest in expressiveness. Pick by who writes the tests, not by which one has the longer feature list.

The three bets and what each one optimises

A quality framework has to answer four questions: who writes the tests, where do the tests live, what does a test compile to, and what happens when one fails. dbt, Great Expectations, and Soda answered those four questions differently, and the resulting frameworks feel different to use even when they're checking the same column for the same property.

[Figure: Three frameworks, three bets. For each framework: who writes the tests, where they live, what a test compiles to, how it runs, and where failures surface. dbt tests: analytics engineer; schema.yml beside the model; a SELECT returning failing rows; dbt test; CLI output and manifest.json; 4 built-ins plus macros (cheap if you already use dbt, useless if not). Great Expectations: data engineer or data scientist; expectation suite (Python or JSON); one of 50+ expectation classes evaluated by the engine; checkpoint; HTML data docs and Slack via a configured action (richest expectations, heaviest to operate). Soda: analyst or data engineer; checks.yml per dataset; SQL generated by the runner; soda scan; Soda Cloud dashboard with built-in alerting; ~25 check types plus raw SQL (simplest spec to read, narrowest in expressiveness).]
The four design questions each framework answers — writer, location, primitive, failure surface — and how those answers shape your day-to-day experience.

dbt tests bet that the analytics engineer is the writer, that tests live next to the model definition, and that a test is just a SELECT statement that returns failing rows. The four built-ins (unique, not_null, accepted_values, relationships) cover roughly 70% of what a typical warehouse needs, and custom tests are written as either a generic test (a SQL macro) or a singular test (a one-off .sql file). When a test fails, dbt prints the count of failing rows and, with store_failures enabled, dumps the rows themselves into a test-failures table for inspection. Why dbt's bet works for dbt-shaped teams: the analytics engineer is already opening schema.yml to add column documentation; adding tests next to the column is a one-line change with zero new tools to learn. The test runs in the same dbt build invocation as the model itself, and failures surface in the same CLI output the engineer was already watching. The marginal cost of adding one more test is roughly zero, which is the right cost curve for getting from 0 to 200 tests in three months. The cost of dbt's bet is that it is structurally tied to dbt — if your warehouse doesn't use dbt, you can't use dbt tests in any honest sense. You could in theory run a dbt project purely for its tests, but in practice teams adopt dbt tests because they're already adopting dbt, and the moment you exit the dbt graph you exit the test graph too.

Great Expectations bet that the writer is closer to a data scientist or a data engineer who's comfortable in Python, that tests live as a versioned "expectation suite" decoupled from the table that produced them, and that a test is one of 50+ named expectation classes (each an "expect ... to ..." assertion) evaluated by an engine. The richness is the headline feature: expect_column_kl_divergence_to_be_less_than, expect_column_values_to_match_strftime_format, expect_table_columns_to_match_ordered_list, expect_column_pair_values_a_to_be_greater_than_b. The HTML "data docs" output renders test results as a navigable site that you can serve to stakeholders — analysts and product managers can see the green/red bars without opening a CLI. Why Great Expectations' bet is heavy: the framework has its own configuration concepts — datasource, batch request, expectation suite, checkpoint, data context — and the learning curve is real. A team that just wants not_null and unique tests will spend two weeks reading the GE docs to figure out how to declare a checkpoint, when dbt would have shipped them in two hours. The richness is genuinely useful for teams whose tests need to express things like "this column's distribution should be within 0.1 KL-divergence of the trailing-month distribution" — but most warehouses don't actually need that, and paying for the full expressiveness when you only use 10% of it is a common GE failure mode.

Soda bet on the operations engineer and the analyst as joint writers, on a small declarative YAML spec as the unit of expression, and on built-in alerting as a first-class feature. A Soda check looks like - missing_count(merchant_id) = 0 or - duplicate_count(txn_id) = 0 — terse, readable, no Python, no SQL syntax to learn. The Soda runner parses the YAML, generates SQL, executes it, and either prints to the CLI (Soda Core, open-source) or pushes results to Soda Cloud (the hosted product, with its own alerting and dashboards). The check vocabulary is narrower than Great Expectations' — about 25 check types, plus an escape hatch to write raw SQL — but it covers the common cases cleanly and the spec is short enough to fit on screen. The cost of Soda's bet is that for the unusual checks ("compare today's distribution to the trailing-month distribution") you fall off the YAML cliff into either raw SQL or a separate framework, and the YAML has fewer composability primitives than dbt's macros or GE's Python.

The choice between the three is rarely about features in isolation. It is about which writer your team has, what graph the framework attaches to, and what failure surface the on-call already watches. A team that already runs dbt is going to use dbt tests for the first 70% and reach for one of the other two only when dbt's expressiveness runs out. A team where the data scientists own quality is going to reach for Great Expectations because the Python ergonomics let them embed quality checks inside their existing notebook workflow. A team where the analytics group writes tests but doesn't own the warehouse code reaches for Soda because the YAML is the abstraction that lets a non-engineer ship a test without going through a code review.

A side-by-side: the same five tests in all three frameworks

The clearest way to feel the difference is to write the same five tests — three row-level, one table-level, one referential — in each framework. The table they target is warehouse.fact.payments_settled, the same fact table the previous chapter used. The five tests:

  1. amount is not null on every row.
  2. currency is one of INR, USD, SGD.
  3. payer_phone matches ^\+91[0-9]{10}$.
  4. txn_id is unique across the whole table.
  5. Every merchant_id exists in dim_merchant.
# all_three_frameworks.py — for illustration; in practice you pick one and stick with it.

# ============================================================
# 1. dbt tests — schema.yml beside models/marts/payments_settled.sql
# ============================================================
DBT_SCHEMA_YML = """
version: 2
models:
  - name: payments_settled
    columns:
      - name: amount
        tests:
          - not_null
      - name: currency
        tests:
          - accepted_values:
              values: ['INR', 'USD', 'SGD']
      - name: payer_phone
        tests:
          - dbt_utils.expression_is_true:
              expression: "regexp_like(payer_phone, '^\\\\+91[0-9]{10}$')"
      - name: txn_id
        tests:
          - unique
      - name: merchant_id
        tests:
          - relationships:
              to: ref('dim_merchant')
              field: merchant_id
"""
# Run with:  dbt test --select payments_settled
# Failures appear in:  target/run_results.json + dbt test_failures table

# ============================================================
# 2. Great Expectations — expectation suite as Python
# ============================================================
GE_SUITE_PYTHON = """
# NOTE: GE's API has changed across major versions; this uses the older,
# checkpoint-era style and is illustrative rather than version-exact.
import great_expectations as gx

context = gx.get_context()
suite = context.add_or_update_expectation_suite('payments_settled.warning')

batch = context.get_batch({'datasource_name': 'warehouse', 'table': 'fact.payments_settled'})

batch.expect_column_values_to_not_be_null('amount')
batch.expect_column_values_to_be_in_set('currency', ['INR', 'USD', 'SGD'])
batch.expect_column_values_to_match_regex('payer_phone', r'^\\+91[0-9]{10}$')
batch.expect_column_values_to_be_unique('txn_id')

# Referential integrity is awkward in pure GE — the value_set workaround below
# only scales to small dimension tables; production setups write a custom expectation:
batch.expect_column_values_to_be_in_set(
    'merchant_id',
    value_set=context.get_batch({'table': 'dim.merchants'}).get_column_values('merchant_id')
)

context.save_expectation_suite(suite)
checkpoint = context.add_checkpoint(name='payments_settled_check', validations=[{'batch': batch}])
checkpoint.run()  # writes data_docs HTML + sends Slack via configured action
"""

# ============================================================
# 3. Soda — soda-checks/payments_settled.yml (one checks file per dataset)
# ============================================================
SODA_CHECKS_YML = """
checks for fact_payments_settled:
  - missing_count(amount) = 0
  - invalid_count(currency) = 0:
      valid values: ['INR', 'USD', 'SGD']
  - invalid_count(payer_phone) = 0:
      valid regex: '^\\+91[0-9]{10}$'
  - duplicate_count(txn_id) = 0
  - failed rows:
      name: orphan_merchant_ids
      fail query: |
        SELECT f.merchant_id FROM fact_payments_settled f
        LEFT JOIN dim_merchants d ON f.merchant_id = d.merchant_id
        WHERE d.merchant_id IS NULL
"""
# Run with:  soda scan -d warehouse soda-checks/payments_settled.yml

print("dbt: ", DBT_SCHEMA_YML.count('tests:'), "test definitions in", len(DBT_SCHEMA_YML), "chars")
print("GE:  ", GE_SUITE_PYTHON.count('expect_'), "expectations in", len(GE_SUITE_PYTHON), "chars")
print("Soda:", SODA_CHECKS_YML.count('\n  - '), "checks in", len(SODA_CHECKS_YML), "chars")
# Output: 5 definitions in each framework; the character totals depend on exact
# whitespace, but land near 600 for dbt, 1050 for GE, and 420 for Soda.

The character counts hint at the trade-off but don't show the full picture. The dbt version is the shortest path to working tests: five blocks, each two or three lines, all in YAML next to the column definition. The relationships test is a built-in; the regex test required pulling in the dbt_utils package. The five tests run in the same dbt test invocation, and the output integrates with whatever monitoring you already have on dbt runs. Why dbt's relationships test is the killer feature here: it's the only one of the three frameworks where referential integrity is a one-line built-in. In GE you have to either materialise the parent column as a Python list (memory blowup on large dim tables) or write a custom expectation; in Soda you fall through to the raw-SQL failed rows escape hatch. dbt got this right because dbt's whole worldview is the model graph, and FK-like relationships between models are the most natural thing to express in that worldview.

The Great Expectations version is the longest and the most code-like. The same five tests need a context, a batch, an expectation suite, and a checkpoint — four nouns that only exist in GE's vocabulary. The referential test is genuinely awkward: GE doesn't have a clean cross-table primitive, and the workaround of "fetch the parent column as a Python list and pass it as a value_set" works for a 10,000-row dimension table but breaks at scale. Production GE setups for referential checks usually write a custom expectation in Python that runs the SQL directly, which means you've now coupled GE to a SQL engine in addition to its native batch engine. The win is when you reach for expect_column_kl_divergence_to_be_less_than on a continuous column — none of the others have that built in, and writing it from scratch in dbt or Soda is half a day of work.

The Soda version is the most readable spec — somebody who's never seen Soda before can guess what missing_count(amount) = 0 does. The first four tests fit on one screen. The referential test escapes into raw SQL via the fail query block, which is a deliberate Soda design — when the YAML doesn't cover it, you write SQL, and Soda runs it the same way. The spec is roughly a third smaller than the dbt version and well under half the size of the GE version for the same five tests. The cost of the brevity is that when you need a test Soda doesn't have a clean check for — say, the KL-divergence example — you write the SQL and lose the "Soda is concise" property anyway.

The decisive practical detail isn't visible in the code: it's where the failure goes. A dbt test failure shows up in dbt test CLI output and in manifest.json, and your CI/CD pipeline parses the manifest to gate the deployment. A GE failure shows up in the data-docs HTML site and, if you've configured the Slack action, in a Slack channel — but you have to configure the action; it's not on by default. A Soda failure shows up in the CLI and, if you're using Soda Cloud, in a hosted dashboard with built-in alerting and on-call routing. The "out-of-the-box failure surface" matters more than most teams realise: you get the surface the framework gave you, and you build all your incident response on top of that surface for years.
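The CI gate on dbt's artefacts is small enough to sketch. The script below parses target/run_results.json after a dbt test run and returns a non-zero exit code on any failure; it assumes the documented artefact shape (a results list whose entries carry unique_id and status), and a real pipeline would add its own policy for warn.

```python
import json
import sys


def failed_tests(run_results: dict) -> list[str]:
    """Return unique_ids of tests whose status is 'fail' or 'error'.

    Assumes the dbt artefact shape: {"results": [{"unique_id": ...,
    "status": "pass" | "fail" | "error" | "warn" | "skipped"}, ...]}.
    """
    return [
        r["unique_id"]
        for r in run_results.get("results", [])
        if r.get("status") in ("fail", "error")
    ]


def gate(path: str) -> int:
    """CI exit code: 0 when nothing failed (warns allowed through), 1 otherwise."""
    with open(path) as f:
        run_results = json.load(f)
    failures = failed_tests(run_results)
    for uid in failures:
        print(f"BLOCKING: {uid}", file=sys.stderr)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(gate("target/run_results.json"))
```

A PR pipeline typically runs dbt test, then this gate, and fails the job on exit code 1.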

When to pick which — the team-shape map

Given the three bets, the reasonable mapping falls out of team shape. The decision tree is not "which one has more features" — it's "which one matches the writer". If you pick the tool that doesn't match the writer, the writer either won't write tests or will write tests that nobody else can read.

[Figure: A decision tree rooted at the question "who writes the tests?". Analytics engineers already in the dbt graph (the most common case): dbt tests. Data scientists or ML engineers who need rich statistical checks such as distribution drift and KL divergence: Great Expectations. Analysts or non-engineers who want hosted alerting with YAML-only authoring: Soda. A footnote warns that running two frameworks side by side is the common anti-pattern; pick one and live with the gaps.]
The decision compresses to one question: who in your team will own the test spec? The framework that matches that writer wins, regardless of which has the longer feature list.

How the three fail in production

Every framework has a class of bug it handles badly. Knowing the failure modes upfront is the difference between a tool you adopt and a tool you regret in six months.

dbt tests fail badly when the test count grows past ~150 per project and the dbt test runtime balloons. The default behaviour runs every test serially against the warehouse, and a 200-test suite on a busy Snowflake warehouse can take 18 minutes — long enough that engineers stop running tests locally, which means broken tests don't get caught until CI. The fixes are real but require investment: --threads 16 for parallelism, scoping tests to changed models with --select state:modified+, and breaking the test suite into "fast" (run on every PR) and "slow" (run nightly) tiers. The other dbt failure mode is the test-as-side-effect problem: a not_null test that fires every model run and a relationships test on a 10-crore-row fact table both look identical in schema.yml, but the second one costs 10× the warehouse credits and most teams don't notice until the bill arrives. Adding severity: warn and where: clauses to scope expensive tests to recent partitions is a discipline that has to be learned through pain.
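The scoping discipline looks like this in schema.yml. This is a sketch: severity: warn, where:, and tags are real dbt test configs, while the column name and the fast/slow tag tiers are illustrative; the tiers pair with selectors such as dbt test --select tag:fast on PRs and tag:slow nightly.

```yaml
- name: merchant_id
  tests:
    - relationships:
        to: ref('dim_merchant')
        field: merchant_id
        config:
          severity: warn                      # report, don't block the build
          where: "load_date = current_date"   # partition scan, not a full scan
          tags: ['slow']                      # nightly tier, excluded from PRs
```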

Great Expectations fails badly on operational concerns. The framework has been through three major API rewrites — the original "validation operators", the "checkpoints" model, and the "fluent API" — and a project that adopted GE in 2020 has been through two upgrades that broke its expectation suites. Migration is real work: an expectation suite written against GE 0.13 doesn't run on 0.16 without rewriting. Beyond the API churn, the data-docs HTML site is a static-site asset that someone has to host, and the Slack action requires you to write a small Python class and register it in the data context — operational chores that go on a checklist nobody owns. Teams that deploy GE successfully usually have a dedicated platform engineer who babysits the framework; teams that try to deploy GE casually as "just another quality library" tend to abandon it within a year.

Soda fails badly when you grow past the YAML's expressiveness. The check vocabulary is small on purpose, and the moment you need a test that doesn't fit — a temporal-join referential check, a KL-divergence drift check, a multi-column conditional check ("amount must be positive, but only when status = 'settled'") — you fall through to the raw-SQL escape hatch. Once a project has more than 30% of its checks as raw SQL, Soda has stopped being the abstraction and is just a SQL runner with extra YAML. The other Soda failure mode is the cloud-vs-core split: Soda Core is open-source and runs locally; Soda Cloud is the paid hosted product where most of the alerting and dashboarding lives. Teams that adopt Soda Core for cost reasons end up reimplementing the alerting layer themselves, and at that point the "built-in alerting was the differentiator" argument goes away. The pricing on Soda Cloud is per dataset per month, which scales unkindly past ~100 datasets in the catalogue.

The mixed-framework anti-pattern is the most common production failure across all three. Team A uses dbt tests for the transforms; team B uses GE on the feature pipeline; the data-platform team uses Soda for cross-cutting governance checks. Three test frameworks, three alert channels, three on-call rotations, three places to look when "something is wrong with the data". After 18 months the team consolidates onto one — usually dbt tests if the warehouse is the centre of gravity, usually GE if the ML pipeline is — and the migration costs more than picking right would have on day one. The lesson Aditi at the fintech needed to learn: pick one, accept the gaps it leaves, and patch the gaps with custom code rather than with a second framework.

Going deeper

How dbt's relationships test scales — and where it breaks

The dbt relationships test compiles to an anti-join: conceptually, SELECT child_col FROM child LEFT JOIN parent ON child_col = parent_col WHERE child_col IS NOT NULL AND parent_col IS NULL — the orphaned child rows. On Snowflake or BigQuery this plans as an anti-semi-join and runs in seconds even on 10-crore-row child tables, because the warehouse can use bloom filters or hash-anti-join algorithms. The breakdown comes when (a) the parent table is itself huge — a relationships test from fact_clicks (50 crore rows) to fact_impressions (200 crore rows) is genuinely expensive — or (b) the relationship is temporal, where you need the SCD-2 join pattern from chapter 34. dbt has no built-in for the temporal version; you write a custom singular test (tests/singular/payments_join_correct_merchant_snapshot.sql) that does the temporal join. This is fine — singular tests are first-class — but it means the relationships built-in only covers the simple case, and the moment you need history, you're writing custom SQL.

Great Expectations checkpoints: what they actually solve

A "checkpoint" in GE is a saved bundle of (expectation suite, batch request, validation actions). The point is reproducibility and orchestration: you can declare "every time the daily payments load finishes, run the payments_settled checkpoint and route failures to Slack and PagerDuty" as a single named artefact. Without checkpoints, a GE run requires assembling the suite, the batch, and the post-validation actions in code on every invocation, which doesn't compose well with Airflow-style orchestrators that want to call a single function. The checkpoint pattern is GE's answer to "how do I run this from a DAG?" — and it works, but it's the conceptual boundary that gives most newcomers trouble. Once a team groks checkpoints, the rest of GE clicks; until they do, GE feels like five concepts wearing a trench coat.
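Stripped of GE's machinery, a checkpoint is a named bundle an orchestrator can trigger with one call. The sketch below models the concept in plain Python with no GE dependency; every class and field name here is ours, not GE's.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Checkpoint:
    """Conceptual model of a GE checkpoint: an expectation suite, a batch
    request, and post-validation actions, bundled under a single name so a
    DAG task can trigger the whole thing with one call."""
    name: str
    suite: list                                   # expectations: batch -> bool
    batch_request: dict                           # which slice of which table
    actions: list = field(default_factory=list)   # e.g. Slack, PagerDuty, docs

    def run(self, fetch_batch: Callable[[dict], dict]) -> bool:
        batch = fetch_batch(self.batch_request)
        passed = all(expectation(batch) for expectation in self.suite)
        for action in self.actions:
            action(self.name, passed)
        return passed
```

An Airflow task body then collapses to checkpoint.run(fetch_batch=warehouse.load_batch): the orchestrator knows one name, not five GE nouns.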

Soda's failed rows escape hatch as the canary

Soda's spec language covers about 25 check types, and the moment your test doesn't fit, you write failed rows: { fail query: <SQL> } and Soda runs the SQL — failing if any rows return. This escape hatch is functionally important (you can never get stuck on Soda's expressiveness) and culturally important (it gives Soda a clean answer to "what about my edge case?"). But the escape hatch is also a canary: if more than ~30% of a project's checks are failed rows: { fail query }, the YAML abstraction has stopped paying its rent and you should consider whether dbt tests (with macros) or raw SQL in scheduled queries would be a cleaner home. The 30% threshold isn't documented anywhere — it's a heuristic from teams that have lived through Soda adoption and migration — but it's a useful self-check at the 6-month review.

Distribution drift: the test class only Great Expectations does well

Chapter 34 touched on distribution-drift testing — comparing today's column distribution against the trailing 28-day distribution and alerting on a 3σ deviation. dbt has no built-in for this; you write it as a custom generic test using a window function (stddev_samp over trailing days), and the SQL is gnarly. Soda has anomaly score as a check type, but it's part of Soda Cloud and the algorithm is opaque. Great Expectations has expect_column_kl_divergence_to_be_less_than, expect_column_distinct_values_to_match_set, and expect_column_value_lengths_to_be_between, plus a "profiler" that auto-generates baseline statistics for new datasets. For statistical-quality teams — fraud platforms, recommender systems, ML feature pipelines — this is GE's killer feature, and the reason GE survives in places where its operational cost would otherwise drive adoption to dbt or Soda. PhonePe's fraud-ML team in Bengaluru runs GE specifically because the trade-off is worth it: they need the distribution checks; the column-shape checks would be cheaper in dbt, but their tier-1 problem is drift, not shape.
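For intuition about what expect_column_kl_divergence_to_be_less_than actually computes, here is the discrete KL divergence from scratch. This is a sketch: GE's real implementation handles binning, partition objects, and zero-probability smoothing far more carefully than the eps hack below.

```python
import math


def kl_divergence(p: list[float], q: list[float], eps: float = 1e-9) -> float:
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i) over discrete bins.

    eps keeps the log finite when q has empty bins; real implementations
    choose binning and smoothing much more carefully.
    """
    assert len(p) == len(q)
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def drifted(today: list[float], baseline: list[float], limit: float = 0.1) -> bool:
    """The shape of the GE check: fail the test when divergence exceeds limit."""
    return kl_divergence(today, baseline) > limit
```

A currency column whose INR share jumps from 70% to 95% overnight clears the 0.1 limit easily, while day-to-day noise does not; that asymmetry is the entire value of the check.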

The hidden cost: what each framework does with WHERE

A subtle but important detail is how each framework lets you scope a test to a partition. dbt supports where: load_date = current_date directly in the test config — a one-liner that turns a full-table scan into a partition scan. Great Expectations does this via batch requests, where the batch definition itself includes the partition predicate; the expectation suite stays the same and the batch tells GE which slice to evaluate. Soda has filter: on the dataset block. All three support partition scoping, but the syntax and ergonomics differ: dbt's where is closest to where the test author thinks about it (column level); GE's batch separation is cleanest conceptually but requires more setup; Soda's filter is mid-range. Test scoping is the difference between a 90-second suite and a 12-minute suite once the warehouse has a year of data, so getting the scoping right early — picking a framework that makes scoping cheap — is one of the most leveraged choices in the entire framework decision.

Where this leads next

The next chapter (36) closes Build 5 with anomaly detection — what to do when hard-threshold quality tests miss the bug. After that, Build 6 turns to the producer's side of columnar storage: Parquet anatomy, Iceberg, Delta, and the small-files problem that makes those formats interesting in the first place.

The framework you pick will shape the rest of your data-quality life for years. The right way to think about it is not "which one is best" but "which one would I still be glad I picked when I have 200 tests, three on-call engineers, and a CFO asking why the warehouse bill went up". By that lens, the answer for most teams is dbt tests for column shape, with one of GE or Soda as a focused supplement for the tests dbt can't express cleanly. Avoid the mixed-framework slow death; pick one primary framework and budget the operational cost honestly. Aditi at the fintech ended up picking dbt tests because the team was already migrating transforms to dbt — and the second-order benefit she didn't see coming was that the test catalogue and the model catalogue became the same catalogue, which is exactly the kind of compounding benefit framework choice was supposed to deliver.

References

  1. dbt tests documentation — the canonical reference for SQL-first test specs in a dbt project, covering generic tests, singular tests, and the four built-ins.
  2. dbt-utils package — the macros that extend dbt's built-in test vocabulary; expression_is_true, equal_rowcount, recency, and the often-used unique_combination_of_columns.
  3. Great Expectations documentation — Python-first expectation library with 50+ built-in expectations, checkpoints, and a data-docs HTML renderer.
  4. Soda Core documentation — the open-source YAML-first quality framework, with check syntax, runner CLI, and integration patterns.
  5. Tristan Handy, "What Is dbt? The Modern Data Stack" (Medium, 2017) — origin essay that explains why dbt tests live next to model definitions, not in a separate test framework.
  6. Abe Gong & James Campbell, "Down with Pipeline Debt" (Great Expectations blog, 2018) — the founding manifesto for GE, articulating the "expectation" abstraction.
  7. Quality tests: row-level, table-level, referential — the previous chapter, which categorises the test classes these frameworks implement.
  8. Data contracts: the producer/consumer boundary — chapter 31, where the rules these frameworks enforce are negotiated.