Full refresh vs incremental: when to pick which

Aditi runs a five-person data team at a Bengaluru insurance startup. Their warehouse has 47 source tables. 32 of them are loaded with TRUNCATE; INSERT SELECT * FROM source every night because that is what the original engineer wrote in 2022 when the largest table had 80 thousand rows. It is now 2026, the largest table has 4.1 crore rows, the nightly job runs from 02:00 to 06:30, and a corrupted CSV at 04:15 leaves the warehouse half-empty until the next night's run finishes. Aditi's instinct is "convert everything to incremental." Half the tables shouldn't be — they're 5 thousand-row reference tables where incremental adds bugs without saving meaningful time. The other half desperately should be, but each one needs a different cursor strategy depending on whether the source is append-only, mutable, or has hard deletes. The decision is not a global one; it is a per-table classification.

Full refresh wipes the destination and re-loads every row each run; incremental writes only the rows that changed since the last run. Full refresh is correct by construction and stupidly simple to operate; incremental is fast and cheap but adds five new failure modes (cursor drift, missed deletes, schema skew, partial-batch state, late data). The right boundary is a function of volume, source shape, correctness budget, and recovery cost — not a global preference for one over the other.

What each pattern actually does

A full refresh load deletes every row in the destination and re-inserts the source's current state on every run. The implementation is three SQL statements:

BEGIN;
TRUNCATE warehouse.policies;
INSERT INTO warehouse.policies SELECT * FROM source.policies;
COMMIT;

That is the entire load logic. Every run produces a destination that is byte-identical to the source's snapshot at the moment the INSERT SELECT ran. There is no cursor to maintain, no high-water mark to checkpoint, no dedup to perform, no concept of "what changed". The destination's history is exactly what the source looks like right now; yesterday's view is gone.

An incremental load reads only the rows that changed since the last run, and merges them into the destination. The cursor — typically WHERE updated_at > :high_water_mark — bounds the read. The merge — typically a MERGE statement keyed on the primary key — handles the write. The destination accumulates state across runs; restarting from a clean state requires a backfill.
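The whole loop — read the checkpoint, bound the read, merge the delta, advance the checkpoint — fits in a few lines. Below is a runnable toy version against sqlite3; the table names, the etl_state checkpoint table, and the upsert standing in for a warehouse MERGE are all illustrative assumptions, not the production schema:

```python
# Minimal incremental load: cursor read + upsert + checkpoint advance.
# Illustrative sketch against an in-memory sqlite3 "warehouse";
# table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source_policies (id INTEGER PRIMARY KEY, premium INT, updated_at INT);
    CREATE TABLE wh_policies     (id INTEGER PRIMARY KEY, premium INT, updated_at INT);
    CREATE TABLE etl_state (pipeline TEXT PRIMARY KEY, high_water_mark INT);
    INSERT INTO etl_state VALUES ('policies', 0);
""")

def incremental_load(con):
    (hwm,) = con.execute(
        "SELECT high_water_mark FROM etl_state WHERE pipeline='policies'").fetchone()
    # The cursor bounds the read: only rows changed since the last checkpoint.
    delta = con.execute(
        "SELECT id, premium, updated_at FROM source_policies WHERE updated_at > ?",
        (hwm,)).fetchall()
    # Upsert keyed on the primary key stands in for a warehouse MERGE.
    con.executemany("""
        INSERT INTO wh_policies (id, premium, updated_at) VALUES (?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET premium=excluded.premium,
                                      updated_at=excluded.updated_at
    """, delta)
    if delta:  # advance the checkpoint only after the merge succeeded
        new_hwm = max(r[2] for r in delta)
        con.execute("UPDATE etl_state SET high_water_mark=? WHERE pipeline='policies'",
                    (new_hwm,))
    con.commit()
    return len(delta)

con.executemany("INSERT INTO source_policies VALUES (?, ?, ?)",
                [(1, 500, 10), (2, 900, 11)])
print(incremental_load(con))   # → 2: initial run merges both rows
con.execute("UPDATE source_policies SET premium=950, updated_at=12 WHERE id=2")
print(incremental_load(con))   # → 1: second run merges only the changed row
```

The ordering matters: the checkpoint advances only after the merge succeeds, so a crashed run re-reads its delta on retry instead of silently losing it.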

[Figure: Full refresh vs incremental over four runs — two side-by-side timelines. Full refresh: every run reads the entire 4.1-crore-row source and rewrites the destination, cost O(N) per run. Incremental: run 1 reads the full source as the initial backfill, then each run merges only the delta (+8k, +5k, +12k rows), cost O(N) once + O(Δ) per run.]
Same source, two strategies, four runs. Full refresh pays the full read+write cost on every run; incremental pays it once at backfill, then only the delta. The savings compound at scale.

Both patterns produce the correct destination in the steady state, assuming the cursor is correct and no row is missed. That assumption is what costs an incremental pipeline its simplicity — and why so many production warehouses default to full refresh for the first 18 months.

Why full refresh isn't always wasteful: when the source is small, full refresh runs in seconds and uses zero state. The dollar cost of "scan a 50-thousand-row reference table once a night" is roughly ₹0.0008 on Snowflake X-Small. The engineering cost of converting it to incremental is several hours plus the perpetual operational cost of one more cursor to monitor. For small tables, full refresh is not a placeholder for "we'll fix it later" — it's the right answer.

The four axes that decide

There is no global preference between full refresh and incremental — the right choice is a function of four properties of the specific table being loaded. Walk through them in order; the first one that disqualifies full refresh forces you to incremental, and the first one that disqualifies incremental forces you back to full refresh.

Volume — the dollar threshold

The simplest axis. If full refresh fits in your nightly batch window at acceptable cost, it is correct by default. The crossover point depends on the warehouse's per-run cost model, but a useful Indian-fintech rule of thumb (the same thresholds the classification script later in this chapter encodes):

- Under ~10 lakh (1 million) rows: full refresh, always — the run takes seconds, and a cursor's upkeep never pays for itself.
- 10 lakh to ~10 crore (100 million) rows: a judgment call — weigh the batch window, the scan cost, and the source shape.
- Above ~10 crore rows: incremental (or CDC), unless the source shape rules out a clean cursor.

The thresholds shift with the warehouse — Snowflake's micro-partition pruning makes full refresh of a partitioned table cheaper than the row count suggests; BigQuery's on-demand pricing makes it more expensive at the same scale; ClickHouse's columnar speed pushes the threshold higher; Postgres-as-warehouse pushes it lower. Calibrate against your warehouse's actual cost-per-TB-scanned, not the row count.
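To make "calibrate against your warehouse" concrete, here is a back-of-envelope crossover calculator. Every number in it — row width, cost per TB scanned, engineer-hours per cursor, delta fraction — is an illustrative assumption to be replaced with your own:

```python
# Back-of-envelope crossover: at what row count does incremental beat
# full refresh once the fixed operational cost of a cursor is included?
# All prices and sizes below are illustrative assumptions, not quotes.

ROW_BYTES       = 200          # assumed average row width
COST_PER_TB_INR = 420.0        # assumed warehouse cost per TB scanned
RUNS_PER_YEAR   = 365
CURSOR_COST_INR = 6 * 2000     # ~6 engineer-hours/year/cursor at ₹2000/hr

def full_refresh_cost_per_year(n_rows):
    # Full refresh scans the whole table on every run.
    tb_scanned = n_rows * ROW_BYTES / 1e12
    return tb_scanned * COST_PER_TB_INR * RUNS_PER_YEAR

def incremental_cost_per_year(n_rows, delta_frac=0.02):
    # Incremental pays the full scan once (backfill), then only the
    # delta per run, plus the fixed yearly cost of maintaining a cursor.
    tb_full  = n_rows * ROW_BYTES / 1e12
    tb_delta = tb_full * delta_frac
    return (tb_full + tb_delta * RUNS_PER_YEAR) * COST_PER_TB_INR + CURSOR_COST_INR

for n in (50_000, 1_000_000, 50_000_000, 500_000_000):
    fr, inc = full_refresh_cost_per_year(n), incremental_cost_per_year(n)
    print(f"{n:>11,} rows  full-refresh ₹{fr:>10,.0f}/yr  incremental ₹{inc:>10,.0f}/yr"
          f"  -> {'incremental' if inc < fr else 'full refresh'}")
```

With these particular numbers, pure scan cost keeps full refresh competitive for surprisingly long; in practice it is usually the batch window, not the rupee cost, that forces the migration first — which is exactly why the threshold must be calibrated per warehouse.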

Source shape — what cursors are even possible

The previous chapter ("Append-only vs mutable source tables") established that not every source supports a clean cursor. The table's shape filters which incremental strategies are possible:

- Append-only: every change is a new row, so a monotonic id or a created-at timestamp is a clean cursor — the easiest incremental case.
- Mutable with a reliable updated_at (and soft deletes recorded as tombstones): cursor on updated_at plus a MERGE keyed on the primary key.
- Mutable with hard deletes, or with no trustworthy change column: no clean cursor exists, and the realistic options are full refresh or a CDC pipeline that reads the database's change log.

If the source shape rules out a clean cursor, full refresh is sometimes the only correct choice short of building a CDC pipeline. The Razorpay payment_events table is the append-only case; the Flipkart catalog post-Vitess-migration was the hard-delete case before the team built the Debezium pipeline.

Correctness budget — what does a missed row cost?

This is the axis that engineers under-weight in design discussions and over-weight in post-mortems. Different domains have different tolerances:

- High (ledger-adjacent, audit): a missed row is a finance-reconciliation incident; the load needs a reconciliation pass regardless of strategy.
- Medium (mutable but reconcilable — users, audit trails): a missed row is wrong until the next reconciliation catches it; annoying, not existential.
- Low (high-volume telemetry): a missed row moves a dashboard metric by a rounding error; optimise for cost, not completeness.
- Ref (tiny lookup tables): rows almost never change, and a nightly full refresh makes correctness free.

Map every table to its correctness budget before choosing the load pattern. The mistake is treating every table the same — converting the marketing-events table to the same incremental discipline as the ledger gives you no extra correctness for material extra engineering cost.

Recovery cost — what happens when it breaks

The fourth axis is the one that changes the answer for senior engineers vs. junior ones. What does it cost to restart from a clean state when something goes wrong?

For a full refresh: nothing. The next run produces a correct destination from scratch. There is no concept of "behind"; every run is the canonical state.

For an incremental: meaningful effort. A corrupted high-water mark, a botched dedup, a partially-applied MERGE, or a schema-skew event leaves the destination in an indeterminate state that requires a backfill — which is itself a multi-hour operation that has to be designed (which time range to re-read, which destination rows to remove, how to avoid double-counting). A 5-year-old incremental pipeline accumulates these recovery procedures as institutional knowledge that lives in one engineer's head.

Senior engineers weight this axis heavily. Junior engineers under-weight it because they have not yet had to backfill production at 3 a.m. The rule of thumb: if the table is small enough that full refresh is cheap, the recovery-cost axis dominates and full refresh wins by default. The pipeline that takes 30 seconds to fully refresh from scratch every night has zero failure modes that require human intervention; the pipeline that takes 30 seconds to incrementally process the delta has five.

A worked example: classifying every table in your warehouse

Below is a Python utility that walks a list of source tables, gathers the four signals (row count, update fraction, presence of cursor columns, declared correctness tier), and emits a recommendation per table. The output is what an Aditi-shaped engineer would print and stick on the team's wiki to settle the "should this be full refresh or incremental?" debate per-table rather than per-team.

# classify_load_strategy.py
# For every source table, recommend full-refresh, incremental, or hybrid.

import psycopg2

DSN = "host=oltp.internal dbname=insurance user=etl password=etl"

# Correctness budget per table — owned by the data team, not inferred.
CORRECTNESS_TIER = {
    "policies":          "high",   # ledger-adjacent, audit
    "claims":            "high",
    "users":             "medium", # mutates, but reconcilable
    "agent_events":      "low",    # high-volume telemetry
    "policy_types":      "ref",    # tiny lookup
    "states":            "ref",
    "agent_login_audit": "medium",
}

def signals(cur, schema, table):
    # schema/table come from the trusted CORRECTNESS_TIER dict below, not
    # user input, so interpolating identifiers into the query is safe here.
    cur.execute(f"SELECT COUNT(*) FROM {schema}.{table}")
    n_rows = cur.fetchone()[0]
    cur.execute("""
        SELECT n_tup_ins, n_tup_upd, n_tup_del
        FROM pg_stat_user_tables
        WHERE schemaname=%s AND relname=%s
    """, (schema, table))
    ins, upd, dele = cur.fetchone() or (0, 0, 0)
    total = ins + upd + dele
    upd_frac = (upd / total) if total else 0.0
    cur.execute("""
        SELECT column_name FROM information_schema.columns
        WHERE table_schema=%s AND table_name=%s
          AND column_name IN ('updated_at','deleted_at')
    """, (schema, table))
    cols = {r[0] for r in cur.fetchall()}
    return n_rows, upd_frac, dele > 0, cols

def recommend(schema, table):
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        n_rows, upd_frac, hard_del, cols = signals(cur, schema, table)
        tier = CORRECTNESS_TIER.get(table, "medium")
        small = n_rows < 1_000_000
        large = n_rows > 100_000_000
        has_cursor = "updated_at" in cols
        soft_del   = "deleted_at" in cols

        if small:
            strategy = "full refresh nightly"
        elif tier == "high" and (hard_del or not has_cursor):
            strategy = "full refresh + nightly reconciliation"
        elif large and has_cursor and not hard_del:
            strategy = "incremental (updated_at + 5min overlap + MERGE)"
        elif has_cursor and soft_del:
            strategy = "incremental + tombstone logic"
        elif has_cursor:
            strategy = "incremental + weekly full reconciliation"
        else:
            strategy = "CDC (Build 11) or full refresh"
        print(f"{schema}.{table:<22} rows={n_rows:>10,}  upd%={upd_frac:5.1%}  "
              f"tier={tier:<6} -> {strategy}")

if __name__ == "__main__":
    for t in CORRECTNESS_TIER:
        recommend("public", t)
# Output:
public.policies               rows=41,289,003  upd%=28.4%  tier=high   -> full refresh + nightly reconciliation
public.claims                 rows= 8,711,420  upd%=14.8%  tier=high   -> incremental + weekly full reconciliation
public.users                  rows= 5,240,118  upd%=41.2%  tier=medium -> incremental + tombstone logic
public.agent_events           rows=184,702,991  upd%= 0.0%  tier=low    -> incremental (updated_at + 5min overlap + MERGE)
public.policy_types           rows=       148  upd%= 2.1%  tier=ref    -> full refresh nightly
public.states                 rows=        36  upd%= 0.0%  tier=ref    -> full refresh nightly
public.agent_login_audit      rows=   872,114  upd%= 0.0%  tier=medium -> full refresh nightly

Three pieces do the work. signals() gathers the mechanical signals from pg_stat_user_tables and information_schema — no application-team meeting required. CORRECTNESS_TIER is the only signal the script cannot infer; it has to be owned by the data team and reviewed quarterly. recommend() is a decision tree that maps the four axes to one of six strategies — and notice that policy_types (148 rows, ref tier) gets full refresh nightly even though the script could technically build an incremental for it. The rule "small + ref → full refresh" exists because the recovery-cost axis dominates for tiny tables.

Why the script doesn't pick "incremental" for the 148-row reference table even though it could: the operational cost of one more cursor to checkpoint, monitor, and reconcile is fixed regardless of the table's size. For a 148-row table that fully refreshes in 4 milliseconds, that fixed cost is pure overhead with zero offsetting gain. The threshold of "under 1 million rows = always full refresh" is calibrated against the engineering cost per cursor (~6 hours/year/cursor for monitoring) divided by the cost-per-run savings — under 1 million rows the savings never pay off.

A second use of the same script: re-run it quarterly. Tables grow over their lifetime; a 5-lakh-row reference table that stays at 5 lakh rows is fine on full refresh forever, but a 5-lakh-row events table that has grown to 5 crore rows in 18 months has crossed the threshold and the team didn't notice. The drift monitor for "this table just crossed 1 crore rows on full-refresh strategy" is one of the cheapest pieces of observability a data team can run, and it surfaces the next migration before it becomes the next 3 a.m. incident.
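That drift monitor is a dozen lines. A sketch — the table names, the snapshot format, and the counts are hypothetical; in production the row counts come from the warehouse catalog:

```python
# Quarterly drift monitor: flag tables still on full refresh whose row
# count has crossed the threshold since the last classification.
# Table names, snapshot format, and counts below are hypothetical.

FULL_REFRESH_THRESHOLD = 10_000_000   # 1 crore rows

# (table, strategy at last classification, rows then, rows now)
SNAPSHOT = [
    ("policy_types", "full refresh", 148,        152),
    ("agent_events", "incremental",  90_000_000, 184_000_000),
    ("quotes",       "full refresh", 500_000,    14_200_000),  # grew past threshold
]

def drift_alerts(snapshot, threshold=FULL_REFRESH_THRESHOLD):
    # A table alerts only if it is still full-refreshed AND has outgrown
    # the threshold; incremental tables growing is expected.
    return [table for table, strategy, _, now in snapshot
            if strategy == "full refresh" and now > threshold]

print(drift_alerts(SNAPSHOT))  # → ['quotes']
```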

Why the recommendations attach a reconciliation pass even to incremental strategies: the cursor is correct in the steady state, but every cursor accumulates a long tail of edge cases over months — a clock skew, a botched migration, a manual UPDATE that bypassed the application. The reconciliation is not a substitute for the incremental; it is the audit that catches whatever the cursor missed. For ledger-adjacent tables, the reconciliation cost (~₹40 of warehouse credits per night) buys correctness insurance that no amount of cursor tuning replaces.

The hybrid pattern: incremental during the day, full refresh at night

Most mature production pipelines do not pick one strategy globally — they layer two. The default operating mode is incremental, and a periodic full refresh runs as the reconciler.

The pattern looks like this: every 15 minutes, an incremental load reads WHERE updated_at > :hwm and merges the delta into the destination. Once a night, between 02:00 and 04:00, a full refresh job runs that reads the entire source and computes the diff against the destination. Any destination row that differs from the source — extra rows, missing rows, stale rows — is corrected. Any source row that the incremental missed gets added. The full refresh is not the primary load path; it is the audit that catches what the incremental drifts past.

Why this pattern works: the incremental is fast and cheap during the day, so dashboards stay fresh. The nightly full refresh is the safety net, so any drift is corrected within 24 hours. The two together produce a destination whose worst-case staleness is 15 minutes and whose worst-case drift is one night — better than either pattern alone.
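The reconciliation step itself reduces to two set differences: destination rows the source no longer agrees with, and source rows the destination is missing. A toy version with sqlite3 — table names and values are illustrative:

```python
# Nightly reconciliation: correct an incrementally-loaded destination
# against a full read of the source. Illustrative sqlite3 sketch;
# table names and values are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src (id INTEGER PRIMARY KEY, v INT);
    CREATE TABLE dst (id INTEGER PRIMARY KEY, v INT);
    INSERT INTO src VALUES (1, 10), (2, 25), (4, 40);  -- current truth
    INSERT INTO dst VALUES (1, 10), (2, 20), (3, 30);  -- drifted copy
""")

def reconcile(con):
    # Destination rows the source disagrees with: stale values + missed deletes.
    stale = con.execute(
        "SELECT id, v FROM dst EXCEPT SELECT id, v FROM src").fetchall()
    # Source rows the incremental missed or loaded with old values.
    missing = con.execute(
        "SELECT id, v FROM src EXCEPT SELECT id, v FROM dst").fetchall()
    # Delete first, then insert, so an updated row is replaced cleanly.
    con.executemany("DELETE FROM dst WHERE id=?", [(i,) for i, _ in stale])
    con.executemany("INSERT INTO dst VALUES (?, ?)", missing)
    con.commit()
    return len(stale), len(missing)

print(reconcile(con))  # → (2, 2): drops stale row 2 and deleted row 3,
                       #   re-adds corrected row 2 and missed row 4
print(con.execute("SELECT id, v FROM dst ORDER BY id").fetchall())
```

After the pass, the destination matches the source exactly; running it again reports zero drift, which is the healthy steady state the alert threshold watches for.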

The cost is two pipelines instead of one, two failure modes to monitor, and two kinds of run-time alerts. The win is a data warehouse that data scientists trust without a daily "did the pipeline run cleanly?" check. Most Indian-fintech data teams converge on this pattern within 24 months of operating a pure-incremental pipeline; the breaking incident is usually a silent drift that finance discovers in a quarterly close.

[Figure: Hybrid load pattern across one day — a 24-hour timeline. Small incremental ticks every 15 minutes keep the freshness window at 15 minutes of staleness max; one large full-refresh reconciliation runs at 02:00–04:00.]
The hybrid pattern. Incremental loads every 15 minutes keep dashboards fresh during the day. The full-refresh reconciliation between 02:00 and 04:00 catches drift, schema skew, and missed deletes. Worst-case staleness is 15 minutes; worst-case drift is one night.

Common confusions

Going deeper

The CDAS (CREATE OR REPLACE TABLE AS SELECT) idiom on cloud warehouses

Snowflake, BigQuery, and Databricks all support a single-statement full-refresh idiom: CREATE OR REPLACE TABLE warehouse.policies AS SELECT * FROM source.policies. The semantics are atomic — the destination is the new query result, with zero observable downtime, because the underlying storage system swaps the table's pointer in one transaction. This is structurally better than TRUNCATE; INSERT because there is no window in which the destination is empty; readers either see the old version or the new version, never a half-loaded one.

The cost is that the old version's storage is retained for the warehouse's time-travel window (24 hours on Snowflake by default, configurable to 90 days on enterprise tiers). Storage is cheap; the audit and rollback capability is meaningful. Most production teams converge on CDAS for full-refresh tables and accept the storage overhead — Razorpay's nightly catalog refresh is a CDAS, Zerodha's daily symbol-master refresh is a CDAS, and the rollback story when a bad load lands is "switch back to the previous time-travel version" rather than "reload from source again".

The dbt incremental materialisation: the per-table choice as code

dbt formalises the per-table strategy as a model configuration. Every model has a materialized field that takes one of view, table, incremental, or ephemeral. The first two are the dbt names for "no warehouse storage" and "full refresh"; incremental is the dbt name for the cursor pattern this chapter has been describing.

A dbt incremental model declares the unique key, the cursor column, and the merge strategy in YAML:

{{ config(
    materialized='incremental',
    unique_key='claim_id',
    incremental_strategy='merge',
    on_schema_change='append_new_columns'
) }}

SELECT * FROM {{ source('oltp', 'claims') }}
{% if is_incremental() %}
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

The is_incremental() block is dbt's representation of "this is the incremental run, not the initial backfill". On the initial run (no destination yet) the WHERE clause is skipped and the full source is loaded. On subsequent runs the cursor bounds the read. The on_schema_change setting determines what happens when the source's columns change — append_new_columns adds new columns to the destination, fail fails the run, ignore silently drops them. dbt has codified the four-axis decision into per-model YAML, which is why it has become the dominant tool for warehouse pipelines.

Why streaming systems flip the framing entirely

In a streaming system (Build 7+), the full-refresh-vs-incremental question dissolves. Every event is its own delta, the cursor is the offset, and the destination is updated continuously. There is no batch boundary at which to "fully refresh". The streaming generalisation of full refresh is "replay the entire log from offset 0", which Kafka supports natively because the log is durable for the retention window.

This is why the data engineering community drifts toward streaming for large mutable sources: the incremental discipline that batch pipelines have to build (cursor checkpoint, dedup hash, MERGE on updated_at) is replaced by the structural property that the source is already a log. The trade-off is operational complexity (Kafka cluster, schema registry, consumer offsets) — but for a 10-crore-row source with high update churn, the streaming pipeline is structurally simpler than the equivalent batch pipeline with all its incremental machinery.

Build 8 will revisit this with watermarks and exactly-once. The current chapter establishes the batch-world framing; the streaming-world framing comes later. For now: if you find yourself building a complex incremental pipeline against a high-churn source, the right answer is often "stop, use CDC, treat the source as a log".

The Swiggy delivery-events table: a real classification story

Swiggy's data team operates a delivery-events table that captures every state transition of every order — created, picked_up, out_for_delivery, delivered, cancelled. The table is append-only by design (every state change is a new row), volume is around 15 crore rows per day at peak, and the warehouse copy is the source of truth for delivery-time SLA dashboards.

The team's first design used full refresh nightly. At 2 crore rows/day, the nightly full refresh ran in 18 minutes on Redshift dc2.large — fine. At 15 crore rows/day, the same job ran for 4 hours, the warehouse cluster had to be sized up, and the cost grew to ₹85,000 per night. The team migrated to incremental loading using the monotonic event_id as cursor, which dropped the per-run cost to ₹240 (incremental delta of around 1.2 crore rows) and the run time to 8 minutes.

The cost of the migration was 3 weeks of engineering: design review (1 week), implementation and dedup logic (1 week), and parallel running with reconciliation against the old full-refresh output (1 week). The team kept the full refresh job running weekly as the reconciler — Saturday night runs the full scan, compares against the incremental-loaded destination, and emits an alert if drift exceeds 0.01% of rows. In 18 months of operation the reconciler has fired three times, all caused by Redshift consistency edge-cases and not pipeline bugs. The hybrid pattern (incremental nightly + full refresh weekly) is the load strategy this table will run forever.

The ledger-table double-entry problem

Financial ledger tables — Razorpay's payments_ledger, BookMyShow's booking_ledger, Zerodha's trade_book — have a constraint that no other table type has: the sum of credits must equal the sum of debits, every time. Any load strategy that produces a partial state during its run violates that invariant for readers who query mid-load.

The naive TRUNCATE; INSERT full refresh fails badly here — for the duration of the load (potentially several minutes), the table is empty and SUM(amount) returns zero. A naive incremental fails differently — partial application of the delta leaves the table in a state where SUM(credits) ≠ SUM(debits) for the seconds between the two halves of a paired insert.

The pattern that solves it: load into a staging table, validate the invariants, then atomically swap the staging table for the destination. The CDAS idiom on Snowflake/BigQuery handles this natively. On Postgres-as-warehouse, the pattern is BEGIN; CREATE TABLE staging AS SELECT ...; ALTER TABLE policies RENAME TO policies_old; ALTER TABLE staging RENAME TO policies; DROP TABLE policies_old; COMMIT;. Either way, the readers never observe a half-loaded table — they observe the previous version or the new version, nothing in between.
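A runnable miniature of the stage-validate-swap pattern, using sqlite3 as a stand-in warehouse — the table names and the double-entry check are illustrative assumptions:

```python
# Ledger full refresh via staging table + invariant check + atomic rename
# swap. sqlite3 stand-in for the Postgres pattern; names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_ledger (id INTEGER PRIMARY KEY, side TEXT, amount INT);
    CREATE TABLE ledger     (id INTEGER PRIMARY KEY, side TEXT, amount INT);
    INSERT INTO src_ledger VALUES (1,'credit',100), (2,'debit',100),
                                  (3,'credit',250), (4,'debit',250);
""")

def refresh_ledger(con):
    con.execute("DROP TABLE IF EXISTS ledger_staging")
    con.execute("CREATE TABLE ledger_staging AS SELECT * FROM src_ledger")
    # Validate the double-entry invariant BEFORE the swap, so a bad load
    # never becomes visible to readers.
    (imbalance,) = con.execute("""
        SELECT COALESCE(SUM(CASE WHEN side='credit' THEN amount
                                 ELSE -amount END), 0)
        FROM ledger_staging""").fetchone()
    if imbalance != 0:
        raise ValueError(f"credits != debits (off by {imbalance}); keeping old table")
    # Atomic swap: readers see the old version or the new one, never neither.
    con.executescript("""
        BEGIN;
        ALTER TABLE ledger RENAME TO ledger_old;
        ALTER TABLE ledger_staging RENAME TO ledger;
        DROP TABLE ledger_old;
        COMMIT;
    """)

refresh_ledger(con)
print(con.execute("SELECT COUNT(*) FROM ledger").fetchone())  # → (4,)
```

If the staged load fails validation, the function raises and the previous ledger stays in place untouched — the failure mode is "stale data" rather than "empty or unbalanced table", which is exactly the trade the ledger constraint demands.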

For ledger tables specifically, this constraint dominates the load-strategy choice. Even if incremental would be faster, the atomic-swap requirement of full refresh is structurally easier to guarantee than partial-application correctness of incremental — and many ledger pipelines stay on nightly full refresh for that reason alone, accepting the staleness in exchange for the trivial atomicity story.

Where this leads next

Build 4 will introduce the scheduler and put both load strategies inside DAG tasks with retries, dependencies, and SLA monitoring; the per-table choice this chapter described becomes a per-task configuration. Build 5 will add lineage and contracts on top — the correctness-tier classification this chapter sketched as a Python dictionary becomes a YAML file in version control, validated in CI, and propagated through the lineage graph so a downstream consumer can see "this dashboard reads from a table loaded incrementally with weekly reconciliation, last reconciled 3 days ago". The mature production warehouse is one where every table's load strategy is visible to its readers, not buried in the pipeline code.

References

  1. dbt: incremental models documentation — the canonical reference for the four-axis classification expressed as configuration.
  2. Snowflake: time travel and CDAS — the atomic-swap full-refresh primitive on cloud warehouses.
  3. Ralph Kimball, The Data Warehouse Toolkit — the classical treatment of load strategies in batch warehouses.
  4. Maxime Beauchemin: The Rise of the Data Engineer — the essay that shaped the modern framing of "pipelines are software, load strategy is design".
  5. Postgres pg_stat_user_tables — the system catalog the classification script reads from.
  6. Append-only vs mutable source tables — the previous chapter, which established the source-shape axis.
  7. Swiggy engineering blog: scaling delivery-events — the team's published account of the full-refresh-to-incremental migration.
  8. Apache Iceberg: row-level operations — the table-format-level support for both load strategies, with snapshot isolation.

The honest summary: the question "should this table be full refresh or incremental?" has four answers depending on volume, source shape, correctness budget, and recovery cost — and the right answer is almost always per-table, never per-team. The team that defaults to "everything full refresh" wastes warehouse credits but ships zero correctness incidents. The team that defaults to "everything incremental" saves warehouse credits but accumulates a long tail of cursor-drift incidents that surface in finance reconciliations 3 months later. The mature team classifies every table on the four axes, picks the right strategy per-table, layers a reconciliation pass on top of the incremental ones, and re-runs the classification every quarter as tables grow over their lifetime.

A practical exercise: for each table in your warehouse, write down the four-axis answer (volume, shape, correctness tier, recovery cost) and the current load strategy. Count how many tables would change strategy if you classified them today using the rules in this chapter. The first run almost always finds at least three: a 50-thousand-row reference table that's incrementally loaded for no reason, a 5-crore-row events table that's full-refreshed at unnecessary cost, and a high-tier ledger table that's incrementally loaded without a reconciliation. Fixing those three is usually a quarter's worth of cleanup work — and it pays back in compute savings, fewer 3 a.m. pages, and a warehouse the data scientists actually trust.