Full refresh vs incremental: when to pick which
Aditi runs a five-person data team at a Bengaluru insurance startup. Their warehouse has 47 source tables. 32 of them are loaded with TRUNCATE; INSERT SELECT * FROM source every night because that is what the original engineer wrote in 2022 when the largest table had 80 thousand rows. It is now 2026, the largest table has 4.1 crore rows, the nightly job runs from 02:00 to 06:30, and a corrupted CSV at 04:15 leaves the warehouse half-empty until the next night's run finishes. Aditi's instinct is "convert everything to incremental." Half the tables shouldn't be — they're 5 thousand-row reference tables where incremental adds bugs without saving meaningful time. The other half desperately should be, but each one needs a different cursor strategy depending on whether the source is append-only, mutable, or has hard deletes. The decision is not a global one; it is a per-table classification.
Full refresh wipes the destination and re-loads every row each run; incremental writes only the rows that changed since last run. Full refresh is correct by construction and stupidly simple to operate; incremental is fast and cheap but adds five new failure modes (cursor drift, missed deletes, schema skew, partial-batch state, late data). The right boundary is volume × correctness-budget × source-shape — not a global preference for one over the other.
What each pattern actually does
A full refresh load deletes every row in the destination and re-inserts the source's current state on every run. The implementation is three SQL statements:
BEGIN;
TRUNCATE warehouse.policies;
INSERT INTO warehouse.policies SELECT * FROM source.policies;
COMMIT;
That is the entire load logic. Every run produces a destination that is byte-identical to the source's snapshot at the moment the INSERT SELECT ran. There is no cursor to maintain, no high-water mark to checkpoint, no dedup to perform, no concept of "what changed". The destination's history is exactly what the source looks like right now; yesterday's view is gone.
An incremental load reads only the rows that changed since the last run, and merges them into the destination. The cursor — typically WHERE updated_at > :high_water_mark — bounds the read. The merge — typically a MERGE statement keyed on the primary key — handles the write. The destination accumulates state across runs; restarting from a clean state requires a backfill.
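In code, one incremental run looks like the sketch below. This is illustrative, not the chapter's reference implementation: sqlite3 stands in for the warehouse, an INSERT ... ON CONFLICT upsert stands in for MERGE, and the table and column names (source_policies, wh_policies, updated_at) are made up for the example.

```python
# A minimal sketch of one incremental run. sqlite3 is a stand-in warehouse;
# a real pipeline would run MERGE on Snowflake/BigQuery/Postgres instead.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_policies (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT);
    CREATE TABLE wh_policies     (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT);
    INSERT INTO source_policies VALUES
        (1, 'active', '2026-01-01T00:00:00'),
        (2, 'lapsed', '2026-01-02T09:30:00'),
        (3, 'active', '2026-01-02T11:00:00');
    -- the destination already holds row 1 from a previous run
    INSERT INTO wh_policies VALUES (1, 'active', '2026-01-01T00:00:00');
""")

def incremental_run(conn, high_water_mark):
    """Read the delta since the checkpoint, upsert it, return the new HWM."""
    delta = conn.execute(
        "SELECT id, status, updated_at FROM source_policies WHERE updated_at > ?",
        (high_water_mark,),
    ).fetchall()
    conn.executemany(
        """INSERT INTO wh_policies (id, status, updated_at) VALUES (?, ?, ?)
           ON CONFLICT(id) DO UPDATE SET
               status = excluded.status, updated_at = excluded.updated_at""",
        delta,
    )
    # new checkpoint = max cursor value actually read (unchanged if delta empty)
    return max([row[2] for row in delta], default=high_water_mark)

hwm = incremental_run(conn, "2026-01-01T00:00:00")
print(hwm)  # 2026-01-02T11:00:00
print(conn.execute("SELECT COUNT(*) FROM wh_policies").fetchone()[0])  # 3
```

The point of the sketch is the shape: a bounded read, a keyed merge, and a checkpoint that only advances to a cursor value that was actually read — a second run with the same high-water mark reads an empty delta and changes nothing.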
Both patterns produce the correct destination in the steady state, assuming the cursor is correct and no row is missed. That assumption is what costs an incremental pipeline its simplicity — and why so many production warehouses default to full refresh for the first 18 months.
Why full refresh isn't always wasteful: when the source is small, full refresh runs in seconds and uses zero state. The dollar cost of "scan a 50-thousand-row reference table once a night" is roughly ₹0.0008 on Snowflake X-Small. The engineering cost of converting it to incremental is several hours plus the perpetual operational cost of one more cursor to monitor. For small tables, full refresh is not a placeholder for "we'll fix it later" — it's the right answer.
The four axes that decide
There is no global preference between full refresh and incremental — the right choice is a function of four properties of the specific table being loaded. Walk through them in order; the first one that disqualifies full refresh forces you to incremental, and the first one that disqualifies incremental forces you back to full refresh.
Volume — the dollar threshold
The simplest axis. If full refresh fits in your nightly batch window at acceptable cost, it is correct by default. The crossover point depends on the warehouse's per-run cost model, but a useful Indian-fintech rule of thumb:
- Under 10 lakh rows: full refresh is always correct. The whole-table scan finishes in under a minute on Snowflake X-Small or BigQuery on-demand. Cost is negligible.
- 10 lakh to 1 crore rows: full refresh works if the schedule allows it. Incremental saves real money but is not yet structurally required.
- Over 1 crore rows: full refresh starts to bite. Either the run takes hours or the warehouse cluster size has to grow to compensate. Incremental is structurally preferred.
- Over 10 crore rows: full refresh is operationally unviable. Incremental is the only sensible answer.
The thresholds shift with the warehouse — Snowflake's micro-partition pruning makes full refresh of a partitioned table cheaper than the row count suggests; BigQuery's on-demand pricing makes it more expensive at the same scale; ClickHouse's columnar speed pushes the threshold higher; Postgres-as-warehouse pushes it lower. Calibrate against your warehouse's actual cost-per-TB-scanned, not the row count.
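The calibration above can be sketched as back-of-envelope arithmetic. Everything below is an assumption to replace with your warehouse's real numbers: the price is roughly BigQuery-on-demand-shaped (~₹520 per TB scanned), the row width is a guessed 200 bytes, and the function names are made up for this example.

```python
# Back-of-envelope calibration for the volume axis. Both constants are
# ASSUMPTIONS — substitute your warehouse's measured numbers.
BYTES_PER_ROW = 200          # assumed average row width
RUPEES_PER_TB_SCANNED = 520  # assumed cost, roughly BigQuery on-demand

def full_refresh_cost_per_year(n_rows, runs_per_day=1):
    tb_per_run = n_rows * BYTES_PER_ROW / 1e12
    return tb_per_run * RUPEES_PER_TB_SCANNED * runs_per_day * 365

def incremental_cost_per_year(n_rows, delta_fraction, runs_per_day=1):
    # incremental scans only the delta each run
    return full_refresh_cost_per_year(n_rows, runs_per_day) * delta_fraction

# 50-thousand-row reference table: the yearly saving is pocket change,
# so the fixed engineering cost of one more cursor can never pay off.
saving_small = (full_refresh_cost_per_year(50_000)
                - incremental_cost_per_year(50_000, delta_fraction=0.02))
# 10-crore-row events table: the saving is real money, before even counting
# the batch-window hours the smaller scan gives back.
saving_large = (full_refresh_cost_per_year(100_000_000)
                - incremental_cost_per_year(100_000_000, delta_fraction=0.02))
print(f"small table saves ₹{saving_small:,.2f}/yr; large saves ₹{saving_large:,.0f}/yr")
```

The output direction matters more than the exact rupees: under the assumed prices the small table saves single-digit rupees per year, which is why "under 10 lakh rows: full refresh" holds regardless of how cheap your cursors are to build.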
Source shape — what cursors are even possible
The previous chapter ("Append-only vs mutable source tables") established that not every source supports a clean cursor. The table's shape filters which incremental strategies are possible:
- Append-only source: monotonic-id cursor. Incremental is straightforward.
- Mutable source with reliable updated_at: updated_at cursor with overlap and MERGE. Incremental works.
- Mutable source with hard deletes: cursor cannot detect deletions. Incremental needs CDC or periodic full reconciliation.
- Mutable source without updated_at: no cursor possible. Choose between full refresh and CDC.
If the source shape rules out a clean cursor, full refresh is sometimes the only correct choice short of building a CDC pipeline. The Razorpay payment_events table fits the first case; the Flipkart catalog post-Vitess-migration fit the third case before the team built the Debezium pipeline.
Correctness budget — what does a missed row cost?
This is the axis that engineers under-weight in design discussions and over-weight in post-mortems. Different domains have different tolerances:
- Payments and ledger data: zero tolerance. A missed transaction is a financial discrepancy. Full refresh nightly + CDC continuously is the standard.
- User-facing analytics dashboards: low tolerance. Stale-by-an-hour is acceptable; missing-2%-of-rows is not. Incremental with a daily reconciliation full-refresh is typical.
- Marketing attribution: medium tolerance. A 1% drift over a quarter is invisible to most decisions. Pure incremental with weekly reconciliation works.
- Internal logging / observability: high tolerance. Missing rows are inconvenient, not catastrophic. Pure incremental with no reconciliation is fine.
Map every table to its correctness budget before choosing the load pattern. The mistake is treating every table the same — converting the marketing-events table to the same incremental discipline as the ledger gives you no extra correctness for material extra engineering cost.
Recovery cost — what happens when it breaks
The fourth axis is the one that changes the answer for senior engineers vs. junior ones. What does it cost to restart from a clean state when something goes wrong?
For a full refresh: nothing. The next run produces a correct destination from scratch. There is no concept of "behind"; every run is the canonical state.
For an incremental: meaningful effort. A corrupted high-water mark, a botched dedup, a partially-applied MERGE, or a schema-skew event leaves the destination in an indeterminate state that requires a backfill — which is itself a multi-hour operation that has to be designed (which time range to re-read, which destination rows to remove, how to avoid double-counting). A 5-year-old incremental pipeline accumulates these recovery procedures as institutional knowledge that lives in one engineer's head.
Senior engineers weight this axis heavily. Junior engineers under-weight it because they have not yet had to backfill production at 3 a.m. The rule of thumb: if the table is small enough that full refresh is cheap, the recovery-cost axis dominates and full refresh wins by default. The pipeline that takes 30 seconds to fully refresh from scratch every night has zero failure modes that require human intervention; the pipeline that takes 30 seconds to incrementally process the delta has five.
A worked example: classifying every table in your warehouse
Below is a Python utility that walks a list of source tables, gathers the four signals (row count, update fraction, presence of cursor columns, declared correctness tier), and emits a recommendation per table. The output is what an Aditi-shaped engineer would print and stick on the team's wiki to settle the "should this be full refresh or incremental?" debate per-table rather than per-team.
# classify_load_strategy.py
# For every source table, recommend full-refresh, incremental, or hybrid.
import psycopg2

DSN = "host=oltp.internal dbname=insurance user=etl password=etl"

# Correctness budget per table — owned by the data team, not inferred.
CORRECTNESS_TIER = {
    "policies": "high",            # ledger-adjacent, audit
    "claims": "high",
    "users": "medium",             # mutates, but reconcilable
    "agent_events": "low",         # high-volume telemetry
    "policy_types": "ref",         # tiny lookup
    "states": "ref",
    "agent_login_audit": "medium",
}

def signals(cur, schema, table):
    cur.execute(f"SELECT COUNT(*) FROM {schema}.{table}")
    n_rows = cur.fetchone()[0]
    cur.execute("""
        SELECT n_tup_ins, n_tup_upd, n_tup_del
        FROM pg_stat_user_tables
        WHERE schemaname=%s AND relname=%s
    """, (schema, table))
    ins, upd, dele = cur.fetchone() or (0, 0, 0)
    total = ins + upd + dele
    upd_frac = (upd / total) if total else 0.0
    cur.execute("""
        SELECT column_name FROM information_schema.columns
        WHERE table_schema=%s AND table_name=%s
          AND column_name IN ('updated_at', 'deleted_at')
    """, (schema, table))
    cols = {r[0] for r in cur.fetchall()}
    return n_rows, upd_frac, dele > 0, cols

def recommend(schema, table):
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        n_rows, upd_frac, hard_del, cols = signals(cur, schema, table)
        tier = CORRECTNESS_TIER.get(table, "medium")
        small = n_rows < 1_000_000
        large = n_rows > 100_000_000
        has_cursor = "updated_at" in cols
        soft_del = "deleted_at" in cols
        if small:
            strategy = "full refresh nightly"
        elif tier == "high" and (hard_del or not has_cursor):
            strategy = "full refresh + nightly reconciliation"
        elif large and has_cursor and not hard_del:
            strategy = "incremental (updated_at + 5min overlap + MERGE)"
        elif has_cursor and soft_del:
            strategy = "incremental + tombstone logic"
        elif has_cursor:
            strategy = "incremental + weekly full reconciliation"
        else:
            strategy = "CDC (Build 11) or full refresh"
        print(f"{schema}.{table:<22} rows={n_rows:>10,} upd%={upd_frac:5.1%} "
              f"tier={tier:<6} -> {strategy}")

if __name__ == "__main__":
    for t in CORRECTNESS_TIER:
        recommend("public", t)
# Output:
public.policies               rows=41,289,003 upd%=28.4% tier=high   -> full refresh + nightly reconciliation
public.claims                 rows= 8,711,420 upd%=14.8% tier=high   -> incremental + weekly full reconciliation
public.users                  rows= 5,240,118 upd%=41.2% tier=medium -> incremental + tombstone logic
public.agent_events           rows=184,702,991 upd%= 0.0% tier=low    -> incremental (updated_at + 5min overlap + MERGE)
public.policy_types           rows=       148 upd%= 2.1% tier=ref    -> full refresh nightly
public.states                 rows=        36 upd%= 0.0% tier=ref    -> full refresh nightly
public.agent_login_audit      rows=   872,114 upd%= 0.0% tier=medium -> full refresh nightly
Three pieces do the work. signals() reads the volume and source-shape signals from pg_stat_user_tables and information_schema — no application-team meeting required. CORRECTNESS_TIER is the only signal the script cannot infer; it has to be owned by the data team and reviewed quarterly. recommend() is a decision tree that maps the four axes to one of six strategies — and notice that policy_types (148 rows, ref tier) gets full refresh nightly even though the script could technically build an incremental for it. The rule "small + ref → full refresh" exists because the recovery-cost axis dominates for tiny tables.
Why the script doesn't pick "incremental" for the 148-row reference table even though it could: the operational cost of one more cursor to checkpoint, monitor, and reconcile is fixed regardless of the table's size. For a 148-row table that fully refreshes in 4 milliseconds, that fixed cost is pure overhead with zero offsetting gain. The threshold of "under 1 million rows = always full refresh" is calibrated against the engineering cost per cursor (~6 hours/year/cursor for monitoring) divided by the cost-per-run savings — under 1 million rows the savings never pay off.
A second use of the same script: re-run it quarterly. Tables grow over their lifetime; a 5-lakh-row reference table that stays at 5 lakh rows is fine on full refresh forever, but a 5-lakh-row events table that has grown to 5 crore rows in 18 months has crossed the threshold and the team didn't notice. The drift monitor for "this table just crossed 1 crore rows on full-refresh strategy" is one of the cheapest pieces of observability a data team can run, and it surfaces the next migration before it becomes the next 3 a.m. incident.
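A minimal version of that drift monitor follows. The registry structure is hypothetical — in practice it would come from the classifier's stored output plus a fresh COUNT(*) per table — but the check itself is this small.

```python
# Sketch of the quarterly drift monitor: flag any table still on a
# full-refresh strategy that has crossed the 1-crore-row threshold.
# The registry shape {table: {"strategy": ..., "rows": ...}} is an assumption.
THRESHOLD = 10_000_000  # 1 crore rows

def drift_alerts(registry):
    """Return the tables whose load strategy no longer matches their size."""
    return sorted(
        table for table, meta in registry.items()
        if meta["strategy"].startswith("full refresh") and meta["rows"] > THRESHOLD
    )

registry = {
    "states":       {"strategy": "full refresh nightly", "rows": 36},
    "policy_types": {"strategy": "full refresh nightly", "rows": 148},
    # an events table that quietly grew past the threshold since last quarter
    "agent_events": {"strategy": "full refresh nightly", "rows": 184_702_991},
    "claims":       {"strategy": "incremental + weekly full reconciliation",
                     "rows": 8_711_420},
}
print(drift_alerts(registry))  # ['agent_events']
```

Wired to a scheduler and an alert channel, this is the "cheapest piece of observability" the paragraph above describes: it turns the next 3 a.m. incident into a planned migration ticket.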
Why the recommendations attach a reconciliation pass to every high-tier table, even when the load itself is incremental: the cursor is correct in the steady state, but every cursor accumulates a long tail of edge cases over months — a clock skew, a botched migration, a manual UPDATE that bypassed the application. The reconciliation is not a substitute for the incremental; it is the audit that catches whatever the cursor missed. For ledger-adjacent tables, the reconciliation cost (~₹40 of warehouse credits per night) buys correctness insurance that no amount of cursor tuning replaces.
The hybrid pattern: incremental during the day, full refresh at night
Most mature production pipelines do not pick one strategy globally — they layer two. The default operating mode is incremental, and a periodic full refresh runs as the reconciler.
The pattern looks like this: every 15 minutes, an incremental load reads WHERE updated_at > :hwm and merges the delta into the destination. Once a night, between 02:00 and 04:00, a full refresh job runs that reads the entire source and computes the diff against the destination. Any row in the destination that differs from the source — too many rows, missing rows, stale rows — is corrected. Any row in the source that the incremental missed gets added. The full refresh is not the primary load path; it is the audit that catches what the incremental drifts past.
Why this pattern works: the incremental is fast and cheap during the day, so dashboards stay fresh. The nightly full refresh is the safety net, so any drift is corrected within 24 hours. The two together produce a destination whose worst-case staleness is 15 minutes and whose worst-case drift is one night — better than either pattern alone.
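The nightly diff step can be sketched in a few lines. Rows are modelled here as pk → tuple dicts purely for illustration; a real reconciler would push the comparison into the warehouse (e.g. a full outer join on the primary key) rather than through Python memory.

```python
# Sketch of the nightly reconciler: diff a full source snapshot against the
# destination and return the corrections the incremental run drifted past.
def reconcile(source, destination):
    """Return (rows_to_upsert, pks_to_delete) to make destination == source."""
    to_upsert = {pk: row for pk, row in source.items()
                 if destination.get(pk) != row}   # rows missing or stale
    to_delete = set(destination) - set(source)    # rows hard-deleted upstream
    return to_upsert, to_delete

source      = {1: ("active",), 2: ("lapsed",), 4: ("active",)}
destination = {1: ("active",), 2: ("active",), 3: ("active",)}
# the incremental missed the new row 4, missed the update to row 2,
# and cannot see that row 3 was hard-deleted in the source
to_upsert, to_delete = reconcile(source, destination)
print(to_upsert)   # {2: ('lapsed',), 4: ('active',)}
print(to_delete)   # {3}
```

Note that the delete set is exactly the case the cursor can never catch on its own — which is why the hybrid pattern exists at all.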
The cost is two pipelines instead of one, two failure modes to monitor, and two kinds of run-time alerts. The win is a data warehouse that data scientists trust without a daily "did the pipeline run cleanly?" check. Most Indian-fintech data teams converge on this pattern within 24 months of operating a pure-incremental pipeline; the breaking incident is usually a silent drift that finance discovers in a quarterly close.
Common confusions
- "Full refresh is for amateurs; real engineers always use incremental." No. Full refresh is the right answer for any table small enough that the full scan is cheap. Treating "incremental" as a status symbol is how teams ship 47 cursors when 32 of them shouldn't exist. The discipline is to prove the savings, not assume them.
- "Incremental is always faster than full refresh." Only after the first run. The initial backfill of an incremental pipeline reads the entire source — same cost as one full-refresh run. The savings come on every subsequent run where the delta is small. If the table is mostly-rewritten every day (high update fraction), incremental can actually be slower because the
MERGEis more expensive than aTRUNCATE; INSERT. - "A nightly full refresh job is redundant if the incremental is correct." If the incremental were guaranteed correct, yes. In practice the incremental drifts over months — a botched cursor checkpoint, a missed delete, a clock skew between source and warehouse — and the nightly full refresh is the audit that catches it. Drop the reconciliation only if you have CDC + a separate audit pipeline doing the same job.
- "
TRUNCATE; INSERTis the same asDELETE; INSERT." No.TRUNCATEis metadata-level and instant;DELETEwrites one WAL entry per row and can take minutes on a large table. The full-refresh idiom isTRUNCATE; INSERTinside a transaction, notDELETE; INSERT. On Snowflake, the equivalent isCREATE OR REPLACE TABLE AS SELECT .... - "Incremental loads have no downtime; full refresh does." Both can have zero downtime if you write the destination through a swap pattern — load into a staging table, then atomically rename. The naive
TRUNCATE; INSERTof full refresh leaves the destination empty for the duration of the load, which is the actual problem; the swap pattern fixes it. - "Full refresh is wasteful because it re-processes unchanged rows." Wasteful in compute, yes; but compute is the cheapest thing in a modern warehouse. The dollar cost of scanning an extra 80 lakh rows on Snowflake is around ₹4. The engineering cost of one more cursor to monitor is several hours per year. For tables under the volume threshold, the trade-off favours full refresh — even though it "wastes" compute by doing what looks like redundant work.
Going deeper
The CDAS (CREATE OR REPLACE TABLE AS SELECT) idiom on cloud warehouses
Snowflake, BigQuery, and Databricks all support a single-statement full-refresh idiom: CREATE OR REPLACE TABLE warehouse.policies AS SELECT * FROM source.policies. The semantics are atomic — the destination is the new query result, with zero observable downtime, because the underlying storage system swaps the table's pointer in one transaction. This is structurally better than TRUNCATE; INSERT because there is no window in which the destination is empty; readers either see the old version or the new version, never a half-loaded one.
The cost is that the old version's storage is retained for the warehouse's time-travel window (24 hours on Snowflake by default, configurable to 90 days on enterprise tiers). Storage is cheap; the audit and rollback capability is meaningful. Most production teams converge on CDAS for full-refresh tables and accept the storage overhead — Razorpay's nightly catalog refresh is a CDAS, Zerodha's daily symbol-master refresh is a CDAS, and the rollback story when a bad load lands is "switch back to the previous time-travel version" rather than "reload from source again".
The dbt incremental materialisation: the per-table choice as code
dbt formalises the per-table strategy as a model configuration. Every model has a materialized field that takes one of view, table, incremental, or ephemeral. The first two are the dbt names for "no warehouse storage" and "full refresh"; incremental is the dbt name for the cursor pattern this chapter has been describing.
A dbt incremental model declares the unique key, the cursor column, and the merge strategy in YAML:
{{ config(
materialized='incremental',
unique_key='claim_id',
incremental_strategy='merge',
on_schema_change='append_new_columns'
) }}
SELECT * FROM {{ source('oltp', 'claims') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
The is_incremental() block is dbt's representation of "this is the incremental run, not the initial backfill". On the initial run (no destination yet) the WHERE clause is skipped and the full source is loaded. On subsequent runs the cursor bounds the read. The on_schema_change setting determines what happens when the source's columns change — append_new_columns adds new columns to the destination, fail fails the run, ignore silently drops them. dbt has codified the four-axis decision into per-model YAML, which is why it has become the dominant tool for warehouse pipelines.
Why streaming systems flip the framing entirely
In a streaming system (Build 7+), the full-refresh-vs-incremental question dissolves. Every event is its own delta, the cursor is the offset, and the destination is updated continuously. There is no batch boundary at which to "fully refresh". The streaming generalisation of full refresh is "replay the entire log from offset 0", which Kafka supports natively because the log is durable for the retention window.
This is why the data engineering community drifts toward streaming for large mutable sources: the incremental discipline that batch pipelines have to build (cursor checkpoint, dedup hash, MERGE on updated_at) is replaced by the structural property that the source is already a log. The trade-off is operational complexity (Kafka cluster, schema registry, consumer offsets) — but for a 10-crore-row source with high update churn, the streaming pipeline is structurally simpler than the equivalent batch pipeline with all its incremental machinery.
Build 8 will revisit this with watermarks and exactly-once. The current chapter establishes the batch-world framing; the streaming-world framing comes later. For now: if you find yourself building a complex incremental pipeline against a high-churn source, the right answer is often "stop, use CDC, treat the source as a log".
The Swiggy delivery-events table: a real classification story
Swiggy's data team operates a delivery-events table that captures every state transition of every order — created, picked_up, out_for_delivery, delivered, cancelled. The table is append-only by design (every state change is a new row), volume is around 15 crore rows per day at peak, and the warehouse copy is the source of truth for delivery-time SLA dashboards.
The team's first design used full refresh nightly. At 2 crore rows/day, the nightly full refresh ran in 18 minutes on Redshift dc2.large — fine. At 15 crore rows/day, the same job ran for 4 hours, the warehouse cluster had to be sized up, and the cost grew to ₹85,000 per night. The team migrated to incremental loading using the monotonic event_id as cursor, which dropped the per-run cost to ₹240 (incremental delta of around 1.2 crore rows) and the run time to 8 minutes.
The cost of the migration was 3 weeks of engineering: design review (1 week), implementation and dedup logic (1 week), and parallel running with reconciliation against the old full-refresh output (1 week). The team kept the full refresh job running weekly as the reconciler — Saturday night runs the full scan, compares against the incremental-loaded destination, and emits an alert if drift exceeds 0.01% of rows. In 18 months of operation the reconciler has fired three times, all caused by Redshift consistency edge-cases and not pipeline bugs. The hybrid pattern (incremental nightly + full refresh weekly) is the load strategy this table will run forever.
The ledger-table double-entry problem
Financial ledger tables — Razorpay's payments_ledger, BookMyShow's booking_ledger, Zerodha's trade_book — have a constraint that no other table type has: the sum of credits must equal the sum of debits, every time. Any load strategy that produces a partial state during its run violates that invariant for readers who query mid-load.
The naive TRUNCATE; INSERT full refresh fails badly here — for the duration of the load (potentially several minutes), the table is empty and SUM(amount) returns zero. A naive incremental fails differently — partial application of the delta leaves the table in a state where SUM(credits) ≠ SUM(debits) for the seconds between the two halves of a paired insert.
The pattern that solves it: load into a staging table, validate the invariants, then atomically swap the staging table for the destination. The CDAS idiom on Snowflake/BigQuery handles this natively. On Postgres-as-warehouse, the pattern is BEGIN; CREATE TABLE staging AS SELECT ...; ALTER TABLE policies RENAME TO policies_old; ALTER TABLE staging RENAME TO policies; DROP TABLE policies_old; COMMIT;. Either way, the readers never observe a half-loaded table — they observe the previous version or the new version, nothing in between.
For ledger tables specifically, this constraint dominates the load-strategy choice. Even if incremental would be faster, the atomic-swap requirement of full refresh is structurally easier to guarantee than partial-application correctness of incremental — and many ledger pipelines stay on nightly full refresh for that reason alone, accepting the staleness in exchange for the trivial atomicity story.
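The load-validate-swap sequence can be sketched end to end. sqlite3 stands in for Postgres-as-warehouse here (its ALTER TABLE ... RENAME behaves the same way for this purpose), and swap_load is a hypothetical helper name invented for the example.

```python
# Sketch of load -> validate invariants -> atomic swap for a ledger table.
# sqlite3 is a stand-in; on Postgres this runs inside one transaction, and on
# Snowflake/BigQuery a single CREATE OR REPLACE TABLE AS SELECT replaces it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ledger (entry_id INTEGER PRIMARY KEY, side TEXT, amount INTEGER);
    INSERT INTO ledger VALUES (1, 'credit', 100), (2, 'debit', 100);
""")

def swap_load(conn, rows):
    """Load rows into staging, check double-entry balance, then swap tables."""
    conn.executescript("""
        DROP TABLE IF EXISTS ledger_staging;
        CREATE TABLE ledger_staging (entry_id INTEGER PRIMARY KEY, side TEXT, amount INTEGER);
    """)
    conn.executemany("INSERT INTO ledger_staging VALUES (?, ?, ?)", rows)
    credits, debits = conn.execute("""
        SELECT COALESCE(SUM(CASE WHEN side = 'credit' THEN amount END), 0),
               COALESCE(SUM(CASE WHEN side = 'debit'  THEN amount END), 0)
        FROM ledger_staging
    """).fetchone()
    if credits != debits:
        # refuse the swap: readers keep seeing the previous, balanced version
        raise ValueError(f"invariant violated: credits={credits} debits={debits}")
    conn.executescript("""
        BEGIN;
        ALTER TABLE ledger RENAME TO ledger_old;
        ALTER TABLE ledger_staging RENAME TO ledger;
        DROP TABLE ledger_old;
        COMMIT;
    """)

swap_load(conn, [(1, 'credit', 100), (2, 'debit', 100),
                 (3, 'credit', 250), (4, 'debit', 250)])
print(conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0])  # 4
```

The key property is that a failed validation aborts before the rename, so a bad load never becomes visible — the destination is either the previous balanced state or the new balanced state, never anything in between.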
Where this leads next
- Append-only vs mutable source tables — chapter 15, the source-shape axis this chapter assumed
- Schema drift across incremental loads — chapter 17, what happens when the source's columns change while you're using a cursor
- Cursors, updated_at columns, and their lies — chapter 13, the four ways the cursor itself can betray you
- High-water marks: tracking what you've seen — chapter 12, the cursor primitive
- Late-arriving data and the backfill problem — chapter 14, the streaming generalisation of the same shape problem
Build 4 will introduce the scheduler and put both load strategies inside DAG tasks with retries, dependencies, and SLA monitoring; the per-table choice this chapter described becomes a per-task configuration. Build 5 will add lineage and contracts on top — the correctness-tier classification this chapter sketched as a Python dictionary becomes a YAML file in version control, validated in CI, and propagated through the lineage graph so a downstream consumer can see "this dashboard reads from a table loaded incrementally with weekly reconciliation, last reconciled 3 days ago". The mature production warehouse is one where every table's load strategy is visible to its readers, not buried in the pipeline code.
References
- dbt: incremental models documentation — the canonical reference for the four-axis classification expressed as configuration.
- Snowflake: time travel and CDAS — the atomic-swap full-refresh primitive on cloud warehouses.
- Ralph Kimball, The Data Warehouse Toolkit — the classical treatment of load strategies in batch warehouses.
- Maxime Beauchemin: The Rise of the Data Engineer — the essay that shaped the modern framing of "pipelines are software, load strategy is design".
- Postgres pg_stat_user_tables — the system catalog the classification script reads from.
- Append-only vs mutable source tables — the previous chapter, which established the source-shape axis.
- Swiggy engineering blog: scaling delivery-events — the team's published account of the full-refresh-to-incremental migration.
- Apache Iceberg: row-level operations — the table-format-level support for both load strategies, with snapshot isolation.
The honest summary: the question "should this table be full refresh or incremental?" has four answers depending on volume, source shape, correctness budget, and recovery cost — and the right answer is almost always per-table, never per-team. The team that defaults to "everything full refresh" wastes warehouse credits but ships zero correctness incidents. The team that defaults to "everything incremental" saves warehouse credits but accumulates a long tail of cursor-drift incidents that surface in finance reconciliations 3 months later. The mature team classifies every table on the four axes, picks the right strategy per-table, layers a reconciliation pass on top of the incremental ones, and re-runs the classification every quarter as tables grow over their lifetime.
A practical exercise: for each table in your warehouse, write down the four-axis answer (volume, shape, correctness tier, recovery cost) and the current load strategy. Count how many tables would change strategy if you classified them today using the rules in this chapter. The first run almost always finds at least three: a 50-thousand-row reference table that's incrementally loaded for no reason, a 5-crore-row events table that's full-refreshed at unnecessary cost, and a high-tier ledger table that's incrementally loaded without a reconciliation. Fixing those three is usually a quarter's worth of cleanup work — and it pays back in compute savings, fewer 3 a.m. pages, and a warehouse the data scientists actually trust.