Backfills: re-running history correctly
A bug shipped on Monday is discovered on Friday — and the four days of broken numbers are sitting in the warehouse, feeding dashboards that the finance team has been looking at all week. The fix is one line of SQL. The backfill is the entire weekend. You have to re-run Monday's job with Monday's date, then Tuesday's with Tuesday's, then Wednesday's, then Thursday's, with the right inputs for each, without overwriting Friday's run that is already correct, without saturating the warehouse so that today's pipeline misses its 2 a.m. SLA, and without anybody noticing that the dashboard numbers shifted by 0.4 % between when they checked at 9 a.m. and when they checked at 10 a.m. A backfill is the production form of "could you re-run that for last quarter?" — and it is the test that decides whether a pipeline is actually idempotent or merely "works on Tuesdays".
A backfill replays a pipeline against a historical run-date by overriding the date the task sees as "now". Done correctly, every backfilled run produces bit-identical output to the original — which requires the task to read date as a parameter, write to a partition keyed on that date, and use the same source-data version that existed at original run time. Done incorrectly, you get drift, double-writes, exhausted warehouses, and dashboards that change while users are watching them.
What a backfill actually replays
A scheduled task has three implicit inputs: the code, the source data, and the date. The scheduler usually only varies the date — every other input is "whatever is in the repo" and "whatever is in the source warehouse right now". A backfill says: hold the code constant at the version you want, hold the date constant at the historical date, and hope the source data behaves itself.
The scheduler's contract is: pass logical_date (Airflow's name; Dagster calls it partition_key, Prefect calls it scheduled_start_time) as an explicit parameter. The task body must use it for every date-aware computation — file paths, partition keys, query date filters, joins to dim tables, the lot. A task that contains a single datetime.now() is broken under backfill, and the backfill is how you find out.
There is a fourth implicit input that bites teams who upgrade their stack mid-history: the runtime libraries. A pipeline written against pandas==1.5 rounds floats slightly differently from pandas==2.1; a Spark 3.3 job groups by a NULL key differently from Spark 3.5. A backfill that runs today's container against last year's date produces last year's data with this year's library behaviour. For most pipelines the deltas are within rounding tolerance and nobody notices; for pipelines that compute monetary aggregates published to regulators, the deltas are visible enough to provoke a phone call. The defensive posture is to pin the container image alongside the git SHA — Dagster's "code locations" and Airflow's KubernetesPodOperator both let you do this; raw scripts need a Dockerfile versioned alongside the code.
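The pinning described above can be as small as one JSON file written next to each partition. A minimal sketch, assuming the caller supplies the git SHA and the container image digest (the function name and file layout are illustrative, not from any orchestrator's API):

```python
import json
from datetime import datetime, timezone

def record_run_manifest(path: str, logical_date: str,
                        git_sha: str, image_digest: str) -> dict:
    """Pin the exact code version and runtime image used for one run, so a
    later backfill can check out the same SHA and pull the same image.
    git_sha: e.g. the output of `git rev-parse HEAD`.
    image_digest: e.g. a sha256:... digest from the container registry."""
    manifest = {
        "logical_date": logical_date,
        "git_sha": git_sha,
        "image_digest": image_digest,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

A backfill tool then reads the manifest for the historical date and launches that image at that SHA, rather than whatever is currently deployed.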
The shape of a backfill command
The smallest correct backfill picks a date range, generates one task instance per date, runs them in some order, and writes each one to a partition keyed on its own date. The order matters less than people think — most backfills can run dates in parallel — but the partitioning matters absolutely. If two parallel backfill runs both write to a single un-partitioned target table, the last writer wins and earlier dates get lost.
from datetime import date, timedelta
from concurrent.futures import ThreadPoolExecutor, as_completed
import subprocess, sys

def daterange(start: date, end: date):
    """Inclusive on both ends. Backfills usually want [start, end]."""
    cur = start
    while cur <= end:
        yield cur
        cur += timedelta(days=1)

def run_one(logical_date: date) -> tuple[date, int, str]:
    """Re-run the pipeline for one historical date.
    Returns (date, returncode, last_log_line)."""
    cmd = [
        sys.executable, "pipelines/orders_daily.py",
        "--logical-date", logical_date.isoformat(),
        "--write-mode", "overwrite",  # idempotent partition write
        "--source-snapshot-at", f"{logical_date.isoformat()}T18:00:00Z",
    ]
    p = subprocess.run(cmd, capture_output=True, text=True)
    last = (p.stdout.strip().splitlines() or [""])[-1]
    return logical_date, p.returncode, last

def backfill(start: date, end: date, max_parallel: int = 4):
    dates = list(daterange(start, end))
    print(f"backfilling {len(dates)} dates with concurrency={max_parallel}")
    ok, fail = [], []
    with ThreadPoolExecutor(max_parallel) as pool:
        futs = {pool.submit(run_one, d): d for d in dates}
        for fut in as_completed(futs):
            d, rc, last = fut.result()
            (ok if rc == 0 else fail).append(d)
            print(f"[{d}] rc={rc} {last}")
    print(f"done. ok={len(ok)} fail={len(fail)}")
    return ok, fail

if __name__ == "__main__":
    backfill(date(2026, 4, 20), date(2026, 4, 23), max_parallel=4)
A typical run on a 4-date backfill against the orders pipeline at Razorpay (the same one whose 2023 thundering-herd you met in chapter 23) looks like:
$ python backfill.py
backfilling 4 dates with concurrency=4
[2026-04-22] rc=0 wrote 1,247,032 rows to ds=2026-04-22
[2026-04-20] rc=0 wrote 1,189,448 rows to ds=2026-04-20
[2026-04-23] rc=0 wrote 1,308,917 rows to ds=2026-04-23
[2026-04-21] rc=0 wrote 1,221,776 rows to ds=2026-04-21
done. ok=4 fail=0
The walkthrough is short and load-bearing. logical_date.isoformat() is the date the task body must use for everything — WHERE event_ts >= '2026-04-20' AND event_ts < '2026-04-21', file path s3://lake/orders/ds=2026-04-20/, dim-table snapshot key 2026-04-20. --write-mode overwrite is the idempotency switch — the task uses INSERT OVERWRITE PARTITION (ds='2026-04-20') (Spark/Hive) or DELETE FROM ... WHERE ds = '2026-04-20'; INSERT INTO ... (warehouse SQL); a re-run of the same date replaces the partition rather than appending to it. --source-snapshot-at uses a time-travel reference (Iceberg AS OF, Delta VERSION AS OF, Snowflake AT (TIMESTAMP =>)) so the backfill reads the source as it was at original run time, not as it is now. max_parallel = 4 is the throttle that protects today's pipeline — if the warehouse has 16 slots and today's job needs 8, a backfill that grabs 4 leaves 4 for everyone else.
Why parallel-by-default and not sequential: most backfills are CPU-bound on the warehouse, not on the orchestrator, and the warehouse already serialises queries that contend for the same data. Running 4 dates in parallel finishes a 90-date backfill in 1/4 the wall-clock time and consumes the same total compute. The exception is when one date depends on the previous (running totals, accumulators) — that case needs sequential, and is the "data dependency" pattern in chapter 22.
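The sequential exception can be sketched in a few lines. A minimal sequential driver, assuming a caller-supplied `run_day(d, carry)` hook (hypothetical, not part of the script above) that runs one date, commits it, and returns the carry-over state the next date needs:

```python
from datetime import date, timedelta

def backfill_sequential(start: date, end: date, run_day) -> dict:
    """Sequential backfill for pipelines with cross-day state: day N+1 may
    only start after day N has committed. run_day(d, carry) runs one date
    and returns the carry (e.g. a running total) that the next date reads.
    A failure raises and stops the walk, so later days never run against
    a stale carry."""
    carry = None
    results = {}
    cur = start
    while cur <= end:
        carry = run_day(cur, carry)  # day `cur` is committed inside run_day
        results[cur] = carry
        cur += timedelta(days=1)
    return results
```

With a toy running total, `backfill_sequential(date(2026, 4, 20), date(2026, 4, 22), lambda d, carry: (carry or 0) + d.day)` accumulates day by day, which is exactly the shape that forbids parallelism.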
Idempotent partition writes — the foundation
The single behaviour that makes everything else possible is: writing the same output for a given logical_date is idempotent. Run it once or run it ten times, the final state of the warehouse is the same.
The mechanism that makes this concrete is the partition. Every fact table the pipeline writes is partitioned on the date dimension that matches logical_date — usually ds (date string) or event_date. The write path is partition-replace, not row-merge:
The exact SQL varies by engine. On Spark/Iceberg the write is INSERT OVERWRITE INTO orders_fact PARTITION (ds = '2026-04-20') SELECT ...; on Snowflake, you use MERGE keyed on (ds, order_id) with the same effect; on plain Postgres without time-travel, you do BEGIN; DELETE FROM orders_fact WHERE ds = '2026-04-20'; INSERT INTO orders_fact ...; COMMIT; inside one transaction. The shared property is: at no point is the warehouse in a state where the partition is partially written. The transaction sees the old version or the new version, never a mix.
def write_partition_overwrite(conn, ds: str, rows: list[dict]):
    """Idempotent partition write on warehouses that lack INSERT OVERWRITE.
    Every re-run produces the same final state."""
    with conn:  # transaction
        conn.execute(
            "DELETE FROM orders_fact WHERE ds = %s", (ds,))
        conn.executemany(
            "INSERT INTO orders_fact (ds, order_id, amount_inr, merchant) "
            "VALUES (%(ds)s, %(order_id)s, %(amount_inr)s, %(merchant)s)",
            rows,
        )
Two re-runs of this code with the same ds and the same input rows produce the same final state. A re-run with different input rows (because the source has been corrected since) produces the corrected state. That is the expected behaviour for a backfill — the corrected source is precisely why you are backfilling.
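The run-it-twice property is easy to demonstrate end-to-end. The same DELETE-then-INSERT pattern on sqlite (chosen here purely so the example is self-contained; the table name and columns mirror the snippet above):

```python
import sqlite3

def write_partition_overwrite_sqlite(conn, ds: str, rows: list[dict]):
    """DELETE-then-INSERT partition replace, sqlite flavour.
    Re-running with the same (ds, rows) leaves the table unchanged."""
    with conn:  # one transaction: readers see old rows or new rows, never both
        conn.execute("DELETE FROM orders_fact WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO orders_fact (ds, order_id, amount_inr) VALUES (?, ?, ?)",
            [(ds, r["order_id"], r["amount_inr"]) for r in rows],
        )
```

Running it twice for the same `ds` yields the same row count as running it once, which is the whole point: the backfill driver can retry a date without fear.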
A subtle property of the DELETE; INSERT pattern in Postgres deserves its own paragraph: the transaction is atomic from the reader's perspective, but the rows are physically deleted and re-inserted, which churns indexes and bloats the table on every backfill. A 365-day backfill on a 100-million-row partitioned table can leave the indexes 30 % larger and the table carrying a year's worth of dead tuples that VACUUM has to reclaim afterwards. Production teams either run VACUUM FULL orders_fact after large backfills or use partitioned tables with DETACH PARTITION / ATTACH PARTITION, which is metadata-only and avoids the index churn. A table set to REPLICA IDENTITY FULL for logical replication makes the trade-off worse — every deleted row is logged in full rather than by primary key — so a CDC-feeding warehouse pays even more during backfills.
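The metadata-only alternative is worth seeing concretely. A sketch that builds the Postgres statement sequence for replacing one daily partition with a pre-built staging table; the child-table naming convention (`orders_fact_p20260420`) is an assumption of this example, and the statements assume the parent is RANGE-partitioned on ds:

```python
from datetime import date, timedelta

def partition_swap_sql(table: str, ds: str, staging: str) -> list[str]:
    """Statements for metadata-only partition replacement in Postgres:
    detach and drop the old child, rename the staging table into its place,
    re-attach. No row-level DELETE, so no index churn or dead tuples."""
    d = date.fromisoformat(ds)
    lo, hi = d.isoformat(), (d + timedelta(days=1)).isoformat()
    child = f"{table}_p{d.strftime('%Y%m%d')}"  # assumed naming convention
    return [
        "BEGIN;",
        f"ALTER TABLE {table} DETACH PARTITION {child};",
        f"DROP TABLE {child};",
        f"ALTER TABLE {staging} RENAME TO {child};",
        f"ALTER TABLE {table} ATTACH PARTITION {child} "
        f"FOR VALUES FROM ('{lo}') TO ('{hi}');",
        "COMMIT;",
    ]
```

The sequence touches only catalogue metadata plus one ATTACH-time validation scan of the staging table, which is why large backfills prefer it to DELETE; INSERT.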
Source-data drift: the bug everyone hits
A subtler failure mode: you re-run Monday's pipeline today, but the source table the pipeline reads from has been mutated since Monday. Maybe a customer-support agent corrected an order amount, or a fraud team marked 47 orders as cancelled, or the upstream upserted a new column. The backfill produces different numbers than the original run produced — even though the code and the date are pinned.
For some backfills this is correct (you want the corrections). For others it is catastrophic (the original numbers were already published; rewriting them changes a regulatory report that has already been submitted to the RBI).
The control is on the source side, not the pipeline side. Three options:
- Time-travel queries. Iceberg and Delta tables both support SELECT * FROM orders AS OF TIMESTAMP '2026-04-20T23:59:00Z' — the query reads the snapshot of the source as it existed at that timestamp. The backfill passes the original run-time as --source-snapshot-at. Snowflake's AT (TIMESTAMP => '...') and BigQuery's FOR SYSTEM_TIME AS OF do the same.
- Append-only sources with event-time filtering. If the source is append-only (logs, transaction events, IoT telemetry), you don't need time-travel — you filter by event-time on the original date and ignore rows whose ingestion-time is later than the original run. The pipeline's WHERE clause becomes WHERE event_date = '2026-04-20' AND ingest_ts <= '2026-04-21T02:00:00Z'. This pattern is cheaper than time-travel because the source already has the data you need; it just needs the right filter.
- Frozen snapshots. A daily snapshot of the source is taken right before the pipeline runs and stored as orders_snapshot_2026_04_20. A backfill reads the snapshot, not the live source. This works on warehouses that don't support time-travel and where the source isn't append-only. The cost is storage — a 5-year history of daily snapshots of a 100 GB table is 180 TB.
def query_with_snapshot(conn, table: str, ds: str,
                        snapshot_at: str, mode: str = "iceberg"):
    """Read the source table as of a historical timestamp."""
    if mode == "iceberg":
        sql = f"""
            SELECT * FROM {table}
            FOR SYSTEM_VERSION AS OF
              (SELECT snapshot_id FROM {table}.snapshots
               WHERE committed_at <= TIMESTAMP '{snapshot_at}'
               ORDER BY committed_at DESC LIMIT 1)
            WHERE event_date = '{ds}'
        """
    elif mode == "append_only":
        sql = f"""
            SELECT * FROM {table}
            WHERE event_date = '{ds}'
              AND ingest_ts < TIMESTAMP '{snapshot_at}'
        """
    elif mode == "frozen":
        sql = f"SELECT * FROM {table}_snapshot_{ds.replace('-', '_')}"
    else:
        raise ValueError(f"unknown mode: {mode}")
    return list(conn.execute(sql))
Why three modes and not one: the right mode depends on what the source actually is. Iceberg/Delta give you time-travel for free; using it for an append-only logs table is overkill and slow. Frozen snapshots cost 180 TB but work on every warehouse and are auditable (a regulator can verify the exact bytes used). Append-only event-time filtering is the cheapest and fastest, but it requires the source to never delete or correct rows — Razorpay's settlements pipeline switched from frozen snapshots to event-time filtering in 2024 and cut backfill cost by 92 %, but only because the upstream payments-event-stream is genuinely append-only.
Ordering, parallelism, and the warehouse
A 90-date backfill running fully parallel will hammer the warehouse. A 90-date backfill running fully sequential will take 90 days. The right answer is in between — and depends on the dependency shape of the pipeline.
Pipelines with no inter-day dependencies (each day's data is independent: orders, events, page views) parallelise freely. The throttle is warehouse capacity, and the typical setting is 1/4 to 1/2 of available slots. A pipeline with inter-day dependencies (running totals, deduplication-against-yesterday, slowly-changing dimensions) cannot — day N+1 needs day N's output committed first. These run sequentially; a 90-date backfill takes 90 × per-day-time.
Mixed pipelines exist too — a fact table with no dependencies parallel-fills, then a downstream summary table with a 7-day window runs sequentially over the same range. The orchestrator's job is to know the difference; the executor handles each level appropriately.
The numbers from a real Flipkart catalogue-refresh backfill in 2024 show the trade-off concretely. A 60-date backfill of product_inventory_fact (no inter-day dependencies) ran with max_parallel = 8 against a 24-slot Snowflake warehouse and finished in 47 minutes. The downstream weekly_inventory_trend table (7-day windowed; sequential) took another 3 hours 12 minutes for the same 60 dates. Total wall-clock was 4 hours, total warehouse-credits consumed was identical to a 60-day sequential run of both tables — parallelism here bought wall-clock time, not cost. The opposite extreme, the same 60-date backfill at max_parallel = 24 (saturating the warehouse), finished the fact table in 9 minutes but pushed the company-wide daily refresh 90 minutes late, which an SRE killed mid-run. The lesson is that the right max_parallel is bounded above by "how much capacity can today's pipelines spare", not "how much capacity does the warehouse have".
def backfill_topo(start: date, end: date, deps: dict[str, list[str]],
                  max_parallel: int = 4):
    """Backfill a small DAG of tables for a date range.
    deps: {downstream_table: [upstream_tables]}
    Each table is independent across dates EXCEPT the deps within one date."""
    from concurrent.futures import ThreadPoolExecutor, as_completed
    dates = list(daterange(start, end))
    levels = topo_levels(deps)  # [['orders'], ['daily_summary']]
    for level in levels:
        with ThreadPoolExecutor(max_parallel) as pool:
            futs = []
            for d in dates:
                for table in level:
                    futs.append(pool.submit(run_one_table, table, d))
            for fut in as_completed(futs):
                table, d, rc = fut.result()
                if rc != 0:
                    print(f"FAIL {table} {d} — aborting backfill")
                    return False
    return True
The structure is: outer loop over dependency levels, inner parallelism across dates within a level. Nothing in level 2 starts until everything in level 1 has succeeded — you cannot summarise a day whose orders haven't been written yet. Why this nested structure rather than a flat DAG of (table, date) pairs: at production scale the flat DAG explodes — 90 dates × 40 tables = 3600 nodes, which is too many for a typical orchestrator's UI and far more than necessary because most level-2 tasks only need their own date's level-1 to be ready. The two-level approach gives the right correctness without the explosion. Airflow's BackfillJob and Dagster's partitioned-asset backfill both use this pattern.
A finer-grained alternative — used by Dagster's "asset reconciliation" — is to materialise the (table, date) graph as a sparse DAG, where each node depends only on its actual upstream (table, date) pairs rather than all upstream tables for the same date range. This catches cross-day dependencies precisely (a 7-day rolling sum at ds = 2026-04-23 depends on level-1 outputs for 2026-04-17 through 2026-04-23) and lets the executor start day 8 of the rolling sum the moment days 1–7 are done, even if days 8–14 of the underlying fact table aren't ready yet. The cost is the executor complexity; the benefit is wall-clock time on backfills where the dependency graph is non-trivial. A simple two-level loop is the right starting point; the sparse-DAG version is what production-scale backfills migrate to once they hit the limits of "wait for the entire level".
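The sparse-DAG idea reduces to a per-node dependency function. A minimal sketch for the 7-day rolling-sum example above (table names are illustrative):

```python
from datetime import date, timedelta

def rolling_sum_deps(ds: date, window: int = 7) -> list[tuple[str, date]]:
    """Exact upstream (table, date) pairs for one node of a sparse backfill
    DAG: the rolling summary at `ds` depends on the fact table for the
    trailing `window` days, and nothing else. An executor that walks these
    edges can start the summary for day 8 the moment fact days 2-8 are
    committed, without waiting for the whole fact-table level."""
    return [("orders_fact", ds - timedelta(days=k)) for k in range(window)]
```

For `ds = 2026-04-23` this yields the seven fact-table partitions 2026-04-17 through 2026-04-23, matching the dependency described in the text.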
What goes wrong
Backfill runs to today's partition. The single most common bug is a task that calls datetime.now() instead of using the logical_date parameter. Backfilling Monday writes to today's partition, which silently overwrites today's already-correct data. The fix is to ban datetime.now() from the task body and pass logical_date everywhere — production teams enforce this with a linter or a code-review checklist. Airflow's macros ({{ ds }}, {{ data_interval_start }}) make this hard to forget; raw Python code makes it easy. Swiggy's analytics platform team caught this in 2023 with a CI rule that fails any PR introducing datetime.now() outside a small whitelist of system-utility files; the rule rejects two PRs a week, every week.
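A CI rule of that shape is a few lines of regex. A sketch (this is an illustration of the idea, not Swiggy's actual implementation; the pattern list is deliberately small and would grow in practice):

```python
import re

# Wall-clock calls that break backfills; extend the pattern as needed.
WALLCLOCK = re.compile(r"\bdatetime\.now\(|\bdate\.today\(|\btime\.time\(")

def find_wallclock_calls(source: str) -> list[int]:
    """Return 1-based line numbers in `source` that read the wall clock.
    A CI step runs this over every changed pipeline file and fails the PR
    on any hit outside an allowlist of system-utility files."""
    return [i for i, line in enumerate(source.splitlines(), 1)
            if WALLCLOCK.search(line)]
```

The allowlist and the "fail the PR" wiring live in CI config; the detector itself is this small, which is why the rule is cheap to adopt and cheap to keep.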
Source mutation between original run and backfill. Re-running an old date against the live source produces different numbers than the original run, because the source has been mutated since (corrections, deletes, late arrivals). For internally-published numbers this is fine; for externally-reported numbers (regulatory filings, public earnings, GST returns) it is not. The fix is time-travel queries or frozen snapshots; the audit is to compare backfill output against the original output and assert the deltas are within tolerance. A 0.05 % tolerance on amount is reasonable for engineering-driven backfills; a regulatory backfill needs zero tolerance and an explicit reissue notice to the reporting body if any row changed.
Warehouse exhaustion during backfill. A 90-date parallel backfill consuming 16 slots starves today's pipeline, which misses its 2 a.m. SLA, which pages the on-call, who kills the backfill, which is now half-done. The fix is to throttle backfills to 1/4 of warehouse capacity and run them outside business hours; the production hardening is a separate "backfill pool" with capped slots that today's pipelines can't be evicted from.
Partial backfill leaves the warehouse in a mixed state. A 90-date backfill that succeeds on 87 dates and fails on 3 leaves the warehouse with 3 dates' partitions still showing the old broken values. Dashboards that compute moving averages now show a sawtooth. The fix is the orchestrator-level "all-or-nothing" guarantee — write each date's output to a staging location, then atomically rename them all into place, then commit. Iceberg/Delta tables give this for free; raw Postgres gives you atomicity within each partition's transaction but no atomic cut-over across partitions.
Backfilling code that has changed since the original run. A bug fix that rounded amounts to 2 decimals fixed Monday's output but a backfill of October last year produces different numbers because the bug fix didn't exist then. Sometimes you want this (you want today's correct logic everywhere). Sometimes you don't (you want the historical numbers as they were originally published). The control is to check out the historical git SHA before running the backfill — Airflow's git_sync and Dagster's code-versioning give you this; raw scripts need a manual git checkout.
Race between today's run and the backfill on the same partition. A 365-date backfill running on yesterday's partition collides with today's pipeline that is still finishing yesterday's late-arriving rows. The backfill writes its version, today's pipeline writes another version 4 minutes later, and one of them wins depending on commit order. The fix is a partition-level lock — Iceberg's optimistic concurrency control will retry one of the writers; warehouses without it need an explicit pg_advisory_xact_lock(hashtext('orders_fact_2026_04_24')) or a row in a partition_locks table. Dream11's match-day pipelines learned this the hard way during the 2024 IPL final when a backfill of the previous match's stats overlapped with the live pipeline ingesting the current match's events.
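The advisory-lock fix fits in a small wrapper. A sketch assuming a psycopg3-style connection whose `execute` takes a parameterised query (the `write_fn` hook is this example's invention, standing in for the actual DELETE+INSERT):

```python
def write_with_partition_lock(conn, table: str, ds: str, write_fn):
    """Serialise writers to one partition with a transaction-scoped Postgres
    advisory lock keyed on (table, ds). Whichever writer arrives second
    blocks until the first commits, so a backfill and today's pipeline
    cannot interleave writes to the same partition."""
    with conn:  # lock is released automatically at COMMIT/ROLLBACK
        conn.execute(
            "SELECT pg_advisory_xact_lock(hashtext(%s))", (f"{table}_{ds}",))
        write_fn(conn)  # the actual partition write, e.g. DELETE + INSERT
```

Because `pg_advisory_xact_lock` is transaction-scoped, there is no unlock path to forget: a crashed writer releases the lock when its transaction rolls back.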
Common confusions
- "A backfill is the same as a re-run." A re-run runs the same date that just failed; a backfill runs a different date that succeeded long ago. The mechanics are similar (idempotent partition write), but the operational shape is different — re-runs are reactive (one date, fast), backfills are proactive (many dates, scheduled, throttled).
- "If the pipeline is idempotent, the backfill is automatically correct." Idempotency means re-running with the same inputs produces the same output. A backfill against a mutated source has different inputs even though the date is the same — and the backfill output diverges from the original output despite the pipeline being technically idempotent. Source-data versioning is the additional ingredient.
- "Backfills should always run sequentially to be safe." Sequential is safe but glacial — a 365-day backfill at 10 minutes per day is 60 hours. Most pipelines have no inter-day dependencies and parallelise correctly with the right throttle. Sequential is the right default only for pipelines with cross-day state (running totals, SCD-2 dimensions, sessionised events).
- "
logical_dateis the date the backfill runs at." No —logical_dateis the date the backfill pretends it is running at. The wall-clock at backfill time is irrelevant to the data; onlylogical_datecontrols which partition gets written and which source-snapshot gets read. Airflow used to call thisexecution_date, which was so confusing that they renamed it; new code should uselogical_dateordata_interval_start. - "A failed backfill leaves no trace." A failed backfill leaves partial writes — some dates committed, some not. Without an explicit "backfill state" table, you cannot tell which dates are now correct and which are still broken. Production backfill tooling (Dagster's
BackfillJob, dbt's--backfill-id) records every attempted date with its status; raw scripts need a CSV output. - "Time-travel queries are expensive and slow." Iceberg and Delta time-travel reads are within 5 % of the cost of a current read — they read a different metadata file but the data files are unchanged. The expensive part is retention (Iceberg's default 7-day snapshot retention is too short for a 90-day backfill; configure
history.expire.max-snapshot-age-msto the longest backfill window you expect).
Going deeper
How Airflow's BackfillJob differs from clear-and-rerun
Airflow gives you two ways to re-run history. airflow dags backfill -s 2026-04-20 -e 2026-04-23 my_dag schedules the runs through the regular scheduler, respecting max_active_runs and pool slots — this is the production-correct path. airflow tasks clear -d -s 2026-04-20 -e 2026-04-23 my_dag marks tasks as not-yet-run and lets the scheduler pick them up on its next loop — this is faster to start but interleaves with regular runs and can starve today's pipeline. The difference matters at scale: a 30-day backfill at Razorpay using clear once consumed all 24 scheduler slots and pushed the daily settlement run 4 hours late, costing ₹14 lakh in delayed cleared funds. They switched to backfill -s -e with --max-active-runs 4. Dagster's partitioned-asset model is cleaner here — the backfill UI explicitly throttles via the same pool the daily run uses, so cross-pollution is configured rather than accidental.
The "freshness vs replayability" trade-off in mutable sources
Some source tables are mutable by design — a CRM where customer-support agents fix mistakes, an inventory system where a SKU's current_price updates throughout the day, an order table that reaches a terminal state of delivered only after the warehouse picks it up. A pipeline that reads such a source has a choice: query at the latest state (fresh but unreplayable) or query at a snapshot (replayable but stale). Most operational dashboards want fresh; most regulatory reports want replayable. The pattern that solves both is bitemporal — store every state change with effective_from and effective_to, then the pipeline picks which view it wants. This costs storage (every change is a row) but gives both properties. Postgres + temporal extensions, Snowflake's AS OF, and Iceberg's snapshots all expose bitemporal patterns, with different performance trade-offs. The Aadhaar enrolment system at UIDAI is bitemporal end-to-end because they need to answer "what did we know about this person on the day we issued the card" forever.
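The bitemporal read reduces to one predicate. An engine-agnostic sketch over in-memory rows, where each state change is a row carrying effective_from/effective_to (the field names follow the text; `None` means still current):

```python
def rows_as_of(rows: list[dict], as_of_ts: str) -> list[dict]:
    """Pick the row-versions that were effective at `as_of_ts` from a
    bitemporal history. ISO-8601 strings compare correctly as text, so
    plain string comparison suffices here."""
    return [r for r in rows
            if r["effective_from"] <= as_of_ts
            and (r["effective_to"] is None or as_of_ts < r["effective_to"])]
```

A fresh dashboard query passes today's timestamp; a replayable regulatory query passes the original run-time. Same table, both properties.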
Why "write to staging, atomically swap" beats "write directly with cleanup"
A backfill that DELETEs the old partition and INSERTs the new one inside a transaction is correct on a warehouse that supports multi-statement transactions across DDL (Postgres). Most analytical engines don't — Hive, BigQuery, Athena, ClickHouse — DDL inside a transaction is either unsupported or has weak isolation. The pattern that works everywhere is staging-then-swap: the backfill writes to orders_fact_staging_2026_04_20, validates row counts and basic checks, then atomically renames orders_fact_staging_2026_04_20 → orders_fact partition. Iceberg's INSERT OVERWRITE does this internally; Delta's replaceWhere does too; Hive's INSERT OVERWRITE does it via a hidden .staging directory and a final mv. The lesson is to treat partition replacement as a transactional primitive even on systems that don't call it that, and to never let the partition be in a partially-written state visible to readers — which is precisely the pattern Flipkart's catalogue refresh adopted in 2022 after a midday backfill briefly showed half-empty product pages to live shoppers.
Tracking backfill state — the CSV-and-checksum pattern
A 90-date backfill produces 90 rows of metadata that you need long after the run is done: which date, which git SHA, which source-snapshot timestamp, which output row count, and a SHA-256 checksum of the partition's data files. Production teams write this to a backfill_runs table at commit time and never delete it. Two months later, when the finance team asks "did we backfill the GST report for January?", the answer is one query away rather than a forensic dig through scheduler logs that have already rotated. The schema is small — (backfill_id, table_name, ds, git_sha, source_snapshot_at, row_count, data_checksum, started_at, ended_at, status) — and the discipline of writing to it is what separates a backfill that an auditor can verify from one that nobody can ever fully reconstruct. Razorpay's settlement-replay tool emits this row inside the same transaction that commits the partition, so the metadata cannot drift from the data.
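The schema above is small enough to show whole. A sketch using sqlite3 purely so it runs anywhere; in production the INSERT goes into the same transaction that commits the partition, on the warehouse itself:

```python
import sqlite3

DDL = """CREATE TABLE IF NOT EXISTS backfill_runs (
    backfill_id TEXT, table_name TEXT, ds TEXT, git_sha TEXT,
    source_snapshot_at TEXT, row_count INTEGER, data_checksum TEXT,
    started_at TEXT, ended_at TEXT, status TEXT,
    PRIMARY KEY (backfill_id, table_name, ds)
)"""

def record_backfill_run(conn, run: dict) -> None:
    """Audit record for one backfilled date. INSERT OR REPLACE so a retried
    date overwrites its own row rather than duplicating it."""
    conn.execute(DDL)
    conn.execute(
        "INSERT OR REPLACE INTO backfill_runs VALUES "
        "(:backfill_id, :table_name, :ds, :git_sha, :source_snapshot_at, "
        ":row_count, :data_checksum, :started_at, :ended_at, :status)", run)
    conn.commit()
```

"Did we backfill the GST report for January?" is then `SELECT status FROM backfill_runs WHERE table_name = ... AND ds LIKE '2026-01-%'`, two months or two years later.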
The "blast radius" review every backfill should pass
Before a backfill runs in production, three questions deserve explicit answers, written down in the runbook for that backfill rather than reasoned about live at 11 p.m. on a Sunday.
The first question is: which downstream tables and dashboards will see the change? If the backfill rewrites orders_fact for 90 days, every dashboard that aggregates from orders_fact will shift, every materialised view will need a refresh, every ML training set sampled from those 90 days needs to know its labels may have moved. Lineage tooling (chapter 31, when Build 5 lands) is what answers this question programmatically; in its absence, the runbook is a hand-curated list of consumers, and the on-call's first job is to notify each consumer's owner.
The second is: what is the rollback path? If the backfill produces wrong numbers, can you revert to the previous state? Iceberg and Delta give you this for free — every commit is a snapshot, and ROLLBACK TO SNAPSHOT n undoes the backfill. Warehouses without time-travel need an explicit pre-backfill snapshot (CREATE TABLE orders_fact_pre_backfill_2026_04_25 AS SELECT * FROM orders_fact WHERE ds BETWEEN '2026-04-20' AND '2026-04-23') that the runbook commits to keeping until the backfill is verified-good.
The third is: who is the human in the loop? Every backfill has a sponsor (the team that asked for it), an executor (the engineer who runs it), and a verifier (someone who checks the output is right). For a small backfill, all three can be the same person. For a regulatory backfill, they must not be — the executor and the verifier are different people, and the sponsor signs off in writing. The discipline is annoying right up until the moment a backfill goes wrong, at which point the chain of accountability becomes the entire artefact that explains what happened.
Where this leads next
- SLAs and the meaning of "late" — chapter 25, where backfill duration meets the SLA contract.
- Airflow vs Dagster vs Prefect: the real design differences — chapter 26, including how each scheduler exposes backfills.
- Late-arriving data and the backfill problem — chapter 19, the streaming-side cousin where rows arrive after the partition is sealed.
The backfill primitive closes Build 4's executor story. Build 5 begins with lineage and observability — which exist precisely because, at 3 a.m. on the day after a backfill goes wrong, the on-call engineer needs to know which tables and dashboards were touched. The backfill is the operation; lineage is the audit trail; and the SLA chapter that follows turns "this backfill must finish by 2 a.m." from a hope into a contract that the scheduler enforces.
References
- Airflow backfill documentation — the CLI, parameters, and operational guidance.
- Dagster partitioned asset backfills — the asset-graph approach to replays.
- Iceberg time travel — AS OF query syntax and snapshot retention.
- Delta Lake time travel — VERSION AS OF and TIMESTAMP AS OF.
- Maxime Beauchemin: "The rise of the data engineer" — the original Airbnb essay that defined backfill ergonomics for the modern era.
- Snowflake Time Travel — the warehouse-side primitive most production backfills depend on.
- Retries, timeouts, and poisoned tasks — chapter 23, the failure-mode primitives that backfills inherit.
- What idempotent actually means for data and why it's hard — chapter 12, the foundation that every backfill assumes.