Backfills without breaking downstream
It is 14:00 IST on a Wednesday and Aditi at Razorpay starts a 90-day backfill of the merchant_payouts_daily table. The fix is a one-line correction to the GST cess column. By 14:08 the backfill is rewriting January's partitions. By 14:11 a finance dashboard has refreshed and the CFO is looking at numbers that are three minutes old, eight days different from the numbers she just exported, and seven seconds away from being wrong again as the next partition lands. Aditi did nothing wrong by the rules of backfills-re-running-history-correctly. She broke production by the rules of running a backfill while ten downstream consumers are reading the same table.
A correct backfill rewrites historical partitions; a safe backfill rewrites them without making downstream readers see partial state. The discipline is dual-table writes, partition-level atomic swaps, and explicit consumer notification with a freeze window — not "just run it overnight". On a busy table the difference between the two is whether the on-call gets paged at 02:30.
Why a backfill is dangerous even when it is correct
A backfill is, mechanically, a re-run of the same pipeline against historical dates. The chapter on backfills-re-running-history-correctly covers how to make each individual run idempotent: pin the code, pin the source data via time travel, pass the date as a parameter. That gets you a backfill that produces the same numbers every time. It does not get you a backfill that is safe to run against a live table — those are two different problems, and the second one has bitten more teams than the first.
The reason is consumers. A backfill rewrites N partitions in some order; a downstream consumer (a BI dashboard, a feature store, a reverse-ETL job pushing to Salesforce) reads partitions in some other order. The two orders intersect. While the backfill is half-done, the consumer reads a mix of old and new partitions and computes a number that exists in neither version of history — a chimera that disappears the moment the backfill completes, but that the CFO has already pasted into a board deck.
Why "run it overnight" is not the answer: there is no time when a real production warehouse has zero readers. Cron-driven downstream pipelines run at 02:00, 02:15, 02:30 IST. APAC analytics teams run reports from 06:00 IST. Reverse-ETL jobs run every 15 minutes. A 90-day backfill on a 50-GB table takes 40 minutes; in those 40 minutes you will overlap with at least three downstream readers no matter what hour you start. The right fix is structural — make the backfill atomic from the consumer's perspective — not temporal.
Three patterns that solve the chimera
There are three production patterns for safe backfills. Pick the one that fits your table's storage layer and consumer profile.
Pattern 1: dual-table swap. Write the backfilled output to a new table (merchant_payouts_daily__v2), validate it end-to-end, then atomically rename or repoint a view. Downstream consumers either read the old table or the new one — never a mix. This is the strongest guarantee and the most disk-expensive (you double your storage during the backfill window).
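A minimal sketch of Pattern 1, assuming consumers read through a view rather than the physical table and the warehouse speaks standard SQL through a DB-API connection; the __v2 suffix, the _current view, and the rebuild source are illustrative names, not anyone's production objects:
# pattern1_dual_table.py: dual-table swap behind an atomic view repoint (sketch)
def dual_table_swap(conn, base="merchant_payouts_daily", budget_inr=10_00_000):
    cur = conn.cursor()
    # 1. Build the corrected copy next to production; readers never see it.
    cur.execute(f"CREATE TABLE {base}__v2 AS SELECT * FROM rebuild_payouts_with_cess_fix")
    # 2. Validate end-to-end before anything is exposed.
    cur.execute(f"SELECT (SELECT SUM(sum_inr) FROM {base}__v2) - (SELECT SUM(sum_inr) FROM {base})")
    delta = abs(cur.fetchone()[0])
    if delta > budget_inr:
        cur.execute(f"DROP TABLE {base}__v2")   # abort; production untouched
        return False
    # 3. Atomic cutover: repointing the view is a single metadata change, so any
    #    consumer query sees either the old table or the new one, never a mix.
    cur.execute(f"CREATE OR REPLACE VIEW {base}_current AS SELECT * FROM {base}__v2")
    return True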
Pattern 2: partition-level atomic swap (Iceberg / Delta / Hudi). Modern table formats let you stage new partitions and commit them in a single metadata transaction. The reader's snapshot isolation guarantees they see either every backfilled partition or none of them. Cheaper than dual-table because you don't duplicate unaffected data, but requires a transactional table format.
Pattern 3: shadow-write with manual cutover. Write the new version to a parallel partition tag (payouts_daily_2026-01-15__rebuilt=true) without touching production. Run all validation queries against the rebuilt partitions. Once green, atomically switch a current_version flag the readers consume. Useful when even Iceberg snapshot isolation is not fast enough — for example, BI tools that cache results across queries.
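Pattern 3's cutover can be as small as a pointer row the readers consult before resolving partitions. A sketch, with a hypothetical table_version_pointer table standing in for whatever flag store the BI layer already reads:
# pattern3_cutover.py: shadow-write, validate, then flip a version pointer (sketch)
def flip_current_version(cur, table_name, start, end):
    # Rebuilt partitions were written under a parallel tag (rebuilt=true) and
    # validated there; production readers still resolve the old partitions.
    # Flipping the pointer is the only write that touches what readers see.
    cur.execute(
        "UPDATE table_version_pointer "
        "   SET current_version = 'rebuilt' "
        " WHERE table_name = %s AND partition_date BETWEEN %s AND %s",
        (table_name, start, end),
    )

def resolve_version(cur, table_name, partition_date):
    # Every reader (or the semantic layer in front of the BI tool) asks the
    # pointer first, so a half-finished rebuild stays invisible until cutover.
    cur.execute(
        "SELECT current_version FROM table_version_pointer "
        " WHERE table_name = %s AND partition_date = %s",
        (table_name, partition_date),
    )
    return cur.fetchone()[0]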
Of the three, the production answer for most Indian fintech warehouses today is Pattern 2, because Iceberg/Delta has become the storage default.
Why Pattern 2 wins on lakehouse tables: Iceberg's commit protocol stages the new partition files in object storage, then performs a single atomic swap of the manifest list pointer in the catalog. Until that pointer flips, every reader sees the old snapshot; the moment it flips, every reader sees the new one. There is no intermediate state visible to a consumer query — no chimera. The cost is one metadata transaction per backfill batch, not per partition.
A real backfill, end to end
The mechanism is easier to see in code than in prose. Here is the orchestration script Aditi should have run on Wednesday: a Pattern-2 backfill against an Iceberg table, with a freeze window, a validation gate, an atomic commit, and a rollback path. It runs against a stub catalog so it executes anywhere; in production you swap the stub for pyiceberg.catalog.load_catalog("razorpay").
# safe_backfill.py: Pattern 2, partition-level atomic swap
from datetime import date, timedelta
import random
import time

# --- stub Iceberg catalog (replace with pyiceberg in production) -----
class StubTable:
    def __init__(self, name):
        self.name = name
        self.partitions = {}   # date -> {version, rowcount, sum_inr}
        self.snapshot_id = 0
        self.staged = None     # transaction in flight

    def begin_txn(self):
        if self.staged is not None:
            raise RuntimeError("txn in flight")
        self.staged = {}

    def write_partition(self, d, payload):
        self.staged[d] = payload   # not visible to readers yet

    def commit(self):
        self.partitions.update(self.staged)
        self.snapshot_id += 1
        self.staged = None

    def abort(self):
        self.staged = None

    def read(self, d):
        return self.partitions.get(d)

table = StubTable("merchant_payouts_daily")

# Pre-populate 90 days of "broken" data (the chimera precursor):
# roughly ₹8–12 Cr of payouts per day, ~₹900 Cr for the quarter.
for i in range(90):
    d = (date(2026, 1, 1) + timedelta(days=i)).isoformat()
    table.partitions[d] = {"version": 1, "rowcount": 50_000,
                           "sum_inr": random.randint(80, 120) * 10_00_000}

# --- backfill orchestration ------------------------------------------
def freeze_consumers(table_name, reason, until_unix):
    print(f"  [freeze] {table_name}: {reason} until t+{until_unix - int(time.time())}s")

def unfreeze_consumers(table_name):
    print(f"  [unfreeze] {table_name}: backfill complete, readers may resume")

def notify(channel, msg):
    print(f"  [notify {channel}] {msg}")

def backfill(table, start, end, fix_fn, breach_budget_inr=10_00_000):
    notify("#data-eng", f"BACKFILL start: {table.name} {start}..{end}")
    freeze_consumers(table.name, "backfill in flight",
                     int(time.time()) + 3600)
    table.begin_txn()   # stage new version
    days = (date.fromisoformat(end) - date.fromisoformat(start)).days + 1
    pre_total = sum(p["sum_inr"] for d, p in table.partitions.items()
                    if start <= d <= end)
    post_total = 0
    for i in range(days):
        d = (date.fromisoformat(start) + timedelta(days=i)).isoformat()
        new_payload = fix_fn(table.read(d))   # apply correction
        table.write_partition(d, new_payload)
        post_total += new_payload["sum_inr"]
    delta = abs(post_total - pre_total)
    print(f"  pre-total={pre_total/1e7:.2f} Cr post-total={post_total/1e7:.2f} Cr delta=₹{delta/1e7:.2f} Cr")
    if delta > breach_budget_inr:   # validation gate
        table.abort()
        notify("#data-eng", f"ABORTED: delta ₹{delta/1e7:.2f} Cr exceeds budget")
        unfreeze_consumers(table.name)
        return False
    table.commit()   # atomic swap
    notify("#data-eng", f"BACKFILL ok: snapshot={table.snapshot_id}")
    unfreeze_consumers(table.name)
    return True

def fix_gst_cess(old):
    return {"version": 2, "rowcount": old["rowcount"],
            "sum_inr": int(old["sum_inr"] * 0.998)}   # 0.2% correction

ok = backfill(table, "2026-01-01", "2026-03-31", fix_gst_cess)
print(f"\nBackfill committed: {ok}, snapshot now {table.snapshot_id}")
if ok:
    print("Readers see version 2 for all 90 partitions; they never saw a chimera.")
else:
    print("Readers still see version 1 everywhere; the staged rewrite was discarded.")
# Output (representative run; the stub data is random, so exact totals vary):
[notify #data-eng] BACKFILL start: merchant_payouts_daily 2026-01-01..2026-03-31
[freeze] merchant_payouts_daily: backfill in flight until t+3600s
pre-total=918.40 Cr post-total=916.56 Cr delta=₹1.84 Cr
[notify #data-eng] ABORTED: delta ₹1.84 Cr exceeds budget
[unfreeze] merchant_payouts_daily: backfill complete, readers may resume
Backfill committed: False, snapshot now 0
Readers still see version 1 everywhere; the staged rewrite was discarded.
Walk through the moving parts. The StubTable class is the storage stub: begin_txn() opens a staged transaction, write_partition() accumulates changes that read() cannot see, and commit() flips them all in at once while abort() throws them away. Each method stands in for a step of a real Iceberg transaction (open, stage appends or overwrites, commit the new snapshot atomically); replace the stub with a pyiceberg table and the rest of the script keeps working. The crucial property is that staged is invisible to read() until commit() succeeds. freeze_consumers() and unfreeze_consumers() handle the downstream coordination — in production this is a feature flag in the BI tool's connection layer, a Slack notification to teams that read the table, and an OpenLineage event the orchestrator can react to. Without the freeze window, a Tableau dashboard can refresh halfway through and cache an inconsistent snapshot. The pre-/post-aggregate comparison inside backfill() is the validation gate: compare the pre-backfill aggregate against the post-backfill aggregate, and if the delta exceeds the budget you negotiated with the stakeholder, abort — do not commit. Why a budget rather than an exact match: a backfill that fixes a real bug should change the numbers. The question is whether it changes them by the amount the stakeholder agreed to before the backfill ran. ₹1.84 Cr against a budget of ₹10 lakh is a sign the fix did more than expected; let a human look before the new snapshot becomes history. table.commit() is the atomic swap: a single metadata commit flips all 90 partitions to version 2 simultaneously, so readers either saw version 1 or now see version 2, never both. The final call unfreezes the consumers. Notice that the abort path also unfreezes — leaving consumers frozen after a failed backfill is the second-most-common production bug after the chimera itself.
The orchestration above is roughly what Razorpay, Flipkart, and PhonePe run today inside their Airflow / Dagster pipelines, with two extras the stub omits: distributed locking (so two engineers cannot run concurrent backfills against the same table) and lineage-driven impact analysis (so the freeze list is computed automatically from the data catalog rather than maintained by hand).
The freeze window — and how long it should be
A freeze window is the period during which downstream consumers are notified and either pause their reads or accept stale data. It begins before the backfill starts and ends after the validation gate clears. Get it wrong in either direction and you have a problem: too short, and consumers race the backfill; too long, and downstream pipelines breach their own SLAs because the source has been frozen for hours.
The right length covers three numbers: the backfill's expected runtime, the longest BI cache TTL across consuming dashboards, and the cycle time of any reverse-ETL job pulling from the table. For a 90-day backfill on merchant_payouts_daily the math typically runs: 40 minutes of pipeline runtime + 30 minutes of Tableau extract refresh window + 15 minutes of reverse-ETL cycle, rounded up to a 90-minute freeze window communicated 24 hours in advance. The 24 hours is for human consumers — the finance team needs warning that the morning reports will be late, not the morning of.
The freeze itself can be enforced at four layers: a feature flag on the BI tool, a query rewrite at the SQL gateway (block reads against the table), a row-level filter that excludes the in-flight partition range, or pure social convention via a Slack notification. The first two are reliable; the last two are not. Why social conventions fail under load: at 02:30 IST a fresh Datadog alert page wakes the on-call engineer for an unrelated issue, and they run an ad-hoc query against the merchant-payouts table to triage. They have not read the freeze announcement from 22:00 IST. They see the chimera. The next morning everyone agrees this won't happen again, and three weeks later it does. Engineering controls — feature flags, query rewrites — survive 02:30 IST; pinned Slack messages don't.
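What an engineering control looks like in practice: a freeze registry the SQL gateway consults before executing anything. A minimal sketch; the registry, helper names, and gateway hook are assumptions, not any specific product's API:
# freeze_gate.py: enforce the freeze where queries run, not where people read Slack
import time

FREEZES = {}   # table_name -> unix timestamp when the freeze expires

def freeze(table_name, minutes):
    FREEZES[table_name] = time.time() + minutes * 60

def check_query(sql):
    # Called by the gateway before execution; a 02:30 ad-hoc query hits this
    # control even if nobody read the 22:00 freeze announcement.
    now = time.time()
    for table_name, until in FREEZES.items():
        if until > now and table_name in sql:
            raise PermissionError(
                f"{table_name} is frozen for a backfill "
                f"(another {int(until - now)}s); retry after the unfreeze")

freeze("merchant_payouts_daily", minutes=90)                      # the window computed above
check_query("SELECT SUM(sum_inr) FROM merchant_payouts_daily")    # raises PermissionError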
When to NOT run a backfill
The implicit assumption in this whole chapter is that a backfill is the right answer. Often it is not. The cost of a 90-day backfill at a busy Indian fintech is several hours of warehouse compute (₹40k–₹1.5 lakh in BigQuery / Snowflake bill), an hour of senior engineering time supervising the gate, and a day of mild disruption to downstream consumers — easily ₹3 lakh per incident in fully-loaded cost. If the bug being fixed is a 0.02% rounding drift on a column nobody has used in a quarterly report, the right answer is "fix forward — let the bug heal naturally over the next 30 days as new partitions land correctly, leave history as written, and document the discrepancy in the data catalog so any future analyst sees the warning." This is the call senior data engineers make routinely; junior engineers reflexively run the backfill. The discipline is asking who is reading the historical partitions today, and if the answer is "nobody for the affected column", the backfill is theatre.
The corollary applies in the other direction too. If history is being read by a regulator (a GST audit covers the last seven quarters; a SEBI inspection demands an exact reconciliation), there is no fix-forward option — the historical numbers must be correct because someone external is going to look at them and tally against external records. Audit-bound tables justify the full Pattern-2 ceremony for every bug fix, however small the column. Non-audit-bound tables are a judgement call.
Common confusions
- "A backfill and a re-run are the same thing." A re-run is one task instance; a backfill is N task instances coordinated across a date range, plus a consumer-coordination protocol. The mechanics overlap; the operational discipline is what makes a backfill different.
- "Iceberg snapshot isolation makes backfills automatically safe." It makes individual partition writes safe. A multi-partition backfill committed as separate transactions still produces a chimera between commits. You must commit all backfilled partitions in one transaction, or batch them into a small number of all-or-nothing groups.
- "Running the backfill at 03:00 IST avoids consumer impact." There is no time when zero consumers read a production warehouse. APAC analytics, scheduled cron jobs, reverse-ETL, and on-call ad-hoc queries collectively cover every hour. Engineering controls (freeze flags, atomic swaps), not timing, prevent the chimera.
- "You should always backfill historical bugs." The fix-forward calculus matters: small-impact column drift on non-audit-bound tables often doesn't justify the cost. Backfill when history is read; document and fix-forward when it isn't.
- "A successful backfill means you're done." The lineage graph downstream of the table needs to be re-run too — feature stores, derived marts, BI extracts. A backfill of
merchant_payouts_dailywithout a corresponding re-run offinance_dashboards_quarterlyleaves the dashboards stale against the new source. The orchestrator should chain dependent backfills automatically. - "Bigger freeze windows are always safer." Past the longest cache TTL plus the pipeline runtime, additional freeze time does not reduce risk — it just breaches the freeze window's own SLA with downstream teams. A 6-hour freeze on a daily-SLA table breaches the daily SLA.
Going deeper
How Iceberg's commit protocol gives you free atomicity
The Iceberg commit is a compare-and-swap on the catalog's pointer to the current table metadata file. The new metadata file references all newly-staged data files (one per partition); the old metadata file references the old ones. The CAS is on the metadata pointer — once it succeeds, the entire batch of partition rewrites is atomically visible. If two backfills race, only one CAS wins; the loser sees a CommitFailedException and retries against the winner's snapshot. This is the same mechanic Delta Lake uses (_delta_log JSON files) and Hudi uses (.hoodie timeline). The reader's snapshot isolation comes for free: a reader resolves the metadata pointer once at query start and uses that snapshot for the entire query, ignoring later commits. See the Apache Iceberg table spec and the Delta Lake paper at VLDB 2020. Why this matters for backfills specifically: you can stage all 90 partitions of a quarter as a single transaction, then commit once. Readers either see all 90 corrected partitions or none. The metadata commit is O(milliseconds); the chimera window shrinks from 40 minutes to under a second.
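The mechanic is small enough to sketch. This is a conceptual model of the catalog's pointer, with a single in-process lock standing in for whatever the real catalog uses (Hive metastore lock, REST catalog, Nessie); it is not the Iceberg client API:
# cas_commit.py: conceptual model of the catalog commit as a compare-and-swap
import threading

class CatalogPointer:
    """One pointer per table: the path of the current metadata file."""
    def __init__(self, initial_metadata_file):
        self._current = initial_metadata_file
        self._lock = threading.Lock()

    def read(self):
        return self._current        # readers resolve this once, at query start

    def compare_and_swap(self, expected, new):
        with self._lock:            # atomic: only one concurrent commit can win
            if self._current != expected:
                return False        # someone committed since we read `expected`
            self._current = new
            return True

def commit_backfill(pointer, staged_metadata_file):
    base = pointer.read()           # the snapshot this backfill was built against
    # ...90 rewritten partition files and a new metadata file are already staged
    # in object storage; nothing below rewrites data, only the pointer...
    if not pointer.compare_and_swap(base, staged_metadata_file):
        raise RuntimeError("commit failed: retry against the winner's snapshot")
    # The instant the CAS succeeds, new readers resolve the new metadata file and
    # see all corrected partitions; in-flight queries keep their old snapshot.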
The pre-/post-aggregate gate, in depth
The validation gate is a watered-down form of the constraint-checking discipline familiar from data-quality-tests-the-shape-of-the-table. Compute three aggregates over the date range — row count, sum of a primary monetary column, sum-of-squares for variance — both before and after the backfill stages. Compare deltas against three thresholds: a row-count delta tolerance (typically ±0.5% — late-arriving rows shift this), a value delta tolerance (the budget the stakeholder agreed to, typically ±0.1% to ±2% depending on the column), and a variance delta tolerance (catches changes in distribution shape, not just total). A backfill that passes all three and aborts on any one is the right operational posture. Razorpay's internal data-platform write-up (2024) describes a similar three-aggregate gate; PhonePe's published architecture notes mention an additional "row-level diff sample" that pulls 10k random rows from before/after for an analyst to spot-check before commit.
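A sketch of the three-aggregate gate, using the illustrative tolerances from the paragraph above (the column name and sample rows are made up):
# three_aggregate_gate.py: row count, sum, and sum-of-squares, pre vs post
def aggregates(rows, col="amount_inr"):
    vals = [r[col] for r in rows]
    return {"rows": len(vals), "sum": sum(vals), "sumsq": sum(v * v for v in vals)}

def gate(pre, post, row_tol=0.005, value_tol=0.002, var_tol=0.05):
    def rel(a, b):
        return abs(a - b) / max(abs(a), 1)
    checks = {
        "rowcount": rel(pre["rows"], post["rows"]) <= row_tol,     # ±0.5%
        "value":    rel(pre["sum"], post["sum"]) <= value_tol,     # the stakeholder budget
        "variance": rel(pre["sumsq"], post["sumsq"]) <= var_tol,   # distribution shape
    }
    return all(checks.values()), checks   # pass all three or abort

old_rows = [{"amount_inr": 1_000}, {"amount_inr": 2_000}, {"amount_inr": 3_000}]
new_rows = [{"amount_inr": 1_000}, {"amount_inr": 1_998}, {"amount_inr": 2_994}]
ok, detail = gate(aggregates(old_rows), aggregates(new_rows))
print(ok, detail)   # True: a 0.13% value shift, inside the 0.2% budget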
Coordinating backfills across a lineage graph
A single-table backfill is the easy case. The hard case is a backfill of merchant_payouts_daily that triggers cascading backfills of finance_dashboards_quarterly, gst_filing_extract_v2, and 14 reverse-ETL destinations downstream. The right shape: the orchestrator (Airflow / Dagster) reads the lineage graph from the data catalog (DataHub / Atlan / OpenLineage), computes the topological order of dependent tables, and runs them as a single super-DAG with one freeze window for the whole tree and a validation gate at each node. dbt's --full-refresh combined with the model_name+ selector (the model and everything downstream of it) approximates this for dbt-managed pipelines; non-dbt pipelines need lineage-aware orchestration the team builds in-house. Flipkart's data-platform talk at AWS re:Invent 2024 described a graph-aware backfill scheduler that processes ~200 cascading backfill requests per quarter against ~12k tables; the gating and freeze coordination is what keeps that workable.
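The coordination itself reduces to a topological sort over the lineage edges the catalog exposes. A sketch, assuming the orchestrator can fetch a parent-to-children mapping; the lineage dict and table names are illustrative:
# cascade_order.py: compute the backfill order for a lineage subtree
from collections import deque

# parent -> direct downstream tables, as read from the data catalog
LINEAGE = {
    "merchant_payouts_daily":       ["finance_dashboards_quarterly", "gst_filing_extract_v2"],
    "finance_dashboards_quarterly": ["cfo_board_extract"],
    "gst_filing_extract_v2":        [],
    "cfo_board_extract":            [],
}

def cascade_order(root):
    # Restrict to the subtree reachable from the backfilled table.
    reachable, stack = {root}, [root]
    while stack:
        for child in LINEAGE.get(stack.pop(), []):
            if child not in reachable:
                reachable.add(child)
                stack.append(child)
    # Kahn's algorithm: emit a table only after all its in-subtree parents,
    # so each dependent backfill reads already-corrected inputs. One freeze
    # window covers the whole list; each node gets its own validation gate.
    indeg = {t: 0 for t in reachable}
    for t in reachable:
        for child in LINEAGE.get(t, []):
            if child in reachable:
                indeg[child] += 1
    queue = deque(t for t in reachable if indeg[t] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in LINEAGE.get(node, []):
            if child in reachable:
                indeg[child] -= 1
                if indeg[child] == 0:
                    queue.append(child)
    return order

print(cascade_order("merchant_payouts_daily"))
# ['merchant_payouts_daily', 'finance_dashboards_quarterly', 'gst_filing_extract_v2', 'cfo_board_extract']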
What changes when the schema also changes
A backfill that also changes a column's schema — adding a column, splitting one column into two, changing a type from FLOAT to DECIMAL(18,6) — cannot use Pattern 2 alone. Iceberg supports schema evolution within a transaction, but downstream consumers reading via SQL gateways may have hardcoded SELECT lists that break the moment the new schema commits. The right pattern is dual-table swap (Pattern 1) plus a deprecation period: build merchant_payouts_daily__v2 with the new schema, run both old and new schemas in parallel for a deprecation window (typically 30 days), notify consumers, repoint dashboards one by one, then drop the old table. The cost is double storage for the deprecation window, which on a 5-TB table at S3 IA pricing is roughly ₹4k/month — far less than a forced cutover that breaks 14 downstream tools the same morning.
The 02:30 page that this chapter prevents
The mode-of-failure that motivates everything above is: backfill starts at 14:00, finance dashboard refreshes against partial state at 14:08, CFO exports numbers to a board deck at 14:11, backfill completes at 14:42, board meeting reads from the deck two days later, the numbers don't match the now-correct warehouse, the on-call gets paged at 02:30 the night before the next quarterly review to reconcile two versions of the same quarter and produce a written explanation by 09:00 IST. Every Indian fintech with a finance-facing data team has lived a version of this story. The discipline above — Pattern 2 atomic swap, freeze window, validation gate, lineage-aware coordination — is not over-engineering; it is the engineering the page response taught the team to wish they had done.
Where this leads next
- /wiki/disaster-recovery-your-warehouse-just-got-deleted — when a backfill itself goes catastrophically wrong, restoring from snapshot is the next layer of defence; this chapter is the prevention, that one is the recovery.
- /wiki/migrating-warehouses-without-downtime — a warehouse migration is a backfill of every table at once; the same discipline scales up.
- /wiki/cost-on-the-cloud-the-s3-egress-compute-trinity — the ₹ side of the backfill conversation: dual-table swaps are cheap on S3, expensive on warehouse-native storage.
- /wiki/slas-on-data-what-you-can-actually-promise — the freeze window has to fit inside the consumer-facing SLA budget; the two chapters compose into one operational protocol.
Past those, Build 17 closes with on-call discipline and the 30-year arc of the field. By the time you finish the build, a backfill should feel routine — same protocol every time, no heroics, no late-night reconciliations.
References
- Apache Iceberg — Spec — the manifest list and snapshot isolation mechanics that make Pattern 2 possible at the metadata layer.
- Delta Lake: High-Performance ACID Table Storage Over Cloud Object Stores (VLDB 2020) — the foundational paper for atomic multi-file commits on object storage; reads as the blueprint Pattern 2 implements.
- Apache Hudi — Concepts — the third lakehouse format; copy-on-write vs merge-on-read trade-offs that change how Pattern 2 plays out.
- OpenLineage Specification — the lineage event model that lets the orchestrator compute the freeze list automatically rather than maintain it by hand.
- DataHub — Data Catalog — the catalog Razorpay-class platforms use to drive lineage-aware backfill cascades.
- Razorpay Engineering Blog — public write-ups on backfill ceremony for merchant-payment data.
- /wiki/backfills-re-running-history-correctly — the chapter on per-task backfill correctness; this chapter assumes you've internalised that one and asks the next question.
- /wiki/slas-on-data-what-you-can-actually-promise — the SLA framing that decides how long a freeze window can be without breaching downstream commitments.