Reprocessing a year of data: the real runbook
It is a Monday morning in March and Jishant at PhonePe has been told that the merchant-category-code (MCC) enrichment job has been writing the wrong code for restaurants since the previous April. Eleven months. Roughly 2.4 billion transactions. The fix is six lines. Reprocessing the year is six weeks. Nobody in the room — not the engineering manager, not the finance partner, not the compliance lead — has done a 365-day reprocess before, and in the next forty minutes Jishant will discover that what they actually need is not a longer backfill, but a runbook for an operation that is qualitatively different.
A year-long reprocess crosses thresholds a 90-day backfill never touches: schema evolution, vendor pricing tier breaks, regulatory archives, frozen quarter-closes, and people who left the team. The right shape is a phased reprocess with a written runbook, a daily progress check-in, and an explicit budget for both rupees and political capital — not a single Airflow trigger and a hope that nothing breaks.
What changes when you cross the 365-day boundary
The previous chapter (/wiki/backfills-without-breaking-downstream) handled the 90-day case: dual-table swap, partition atomic commit, freeze window, validation gate. Every one of those techniques still applies. They are necessary. They are not sufficient. Five things break differently at one-year scale.
Thing 1: schema has evolved. Twelve months ago merchant_payments had 47 columns; today it has 63. The reprocess pipeline must read the old schema, the new schema, and every schema in between, mapping columns through a chain of ALTER TABLE history that nobody documented. The fix-forward function must be a function of which schema version the partition was originally written under, not just the current code path.
Thing 2: someone froze a quarter. Q3 of last fiscal year was filed with the GST authorities in October. The numbers in gst_filing_q3 are now part of an external record. You cannot rewrite them without a written change order signed by the finance partner, the compliance lead, and (sometimes) external auditors. The reprocess must skip those partitions or reprocess into a parallel "corrected" table that the GST team will reconcile manually.
Thing 3: the source data is partially gone. Hot S3 storage holds 90 days; warm holds 365; cold (Glacier) holds 7 years. A 365-day reprocess will pull 60–70% of its source partitions from warm tier and the rest from cold. Glacier retrieval is asynchronous, costs ₹0.08–₹2.40 per GB depending on speed tier, and adds 4–12 hours of restore latency before any compute can start. The reprocess job must orchestrate the restore, not just submit a SQL query.
Thing 4: vendor pricing is non-linear. Snowflake / BigQuery / Databricks all have pricing tiers that activate at high volume. A 90-day backfill at ₹40k of credits scales naively to ₹1.6 lakh for 365 days — but the auto-scaling warehouse hits a higher tier, the spillover-to-disk fee kicks in, and the actual bill is ₹4.8 lakh. The cost grows super-linearly with reprocess length, and finance will ask why before they approve the second week.
Thing 5: the people who wrote the original pipeline have left. The engineer who chose to round at 4 decimal places for some reason has been at Stripe for nine months. Their commit message says "small fix"; their Slack archive does not survive. The reprocess will rediscover their reasoning the hard way, in production, at 02:30 IST.
Why the cost-curve is super-linear: warehouse pricing tiers, cold-storage retrieval fees, and concurrency caps all create kinks in what looks like a smooth ₹/day curve. Snowflake's WAREHOUSE_SIZE ladder is the most visible example — bumping from M to L doubles ₹/hr but the throughput gain is only ~1.6×, so a reprocess that just barely fits in M will run cheaper than one that just barely overflows into L. Plan the partition-batching to land you on the cheap side of every kink.
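A toy calculation of that kink, assuming illustrative ₹/hour rates and throughputs (not Snowflake's published numbers):

# cost_kink.py — why batch sizing should land on the cheap side of a warehouse-size kink
# Rates and throughputs below are illustrative assumptions, not vendor pricing.
SIZES = {               # warehouse size: (₹ per hour, partitions processed per hour)
    "M": (12_000, 40),
    "L": (24_000, 64),  # 2x the price for ~1.6x the throughput
}

def phase_cost(partitions: int, size: str) -> int:
    rate, throughput = SIZES[size]
    return round(partitions / throughput * rate)

# A 92-partition phase that fits in M within its freeze window is cheaper than the
# same phase pushed onto L by an over-aggressive batch size:
print(phase_cost(92, "M"))  # ₹27,600 at 2.3 hours
print(phase_cost(92, "L"))  # ₹34,500 at ~1.4 hours: 25% more for the same work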
The phased reprocess
A naive year-long reprocess submits one job that runs for six weeks. That is the wrong shape on every axis: cost spikes uncontained, blast radius is the entire year, validation happens at the end (when it's too late to abort), and there is no checkpoint to retry from when something fails on day 19. The right shape is phased: break the year into 6–12 phases, gate each phase on the previous one, and rebuild the runbook between phases as you learn.
The minimal phase plan looks like this:
Phase 0 — pilot. Reprocess one week of recent (hot-storage) data. Goal is to validate the corrected pipeline end-to-end on production scale, observe the actual ₹ cost per day, and shake out the obvious bugs (Glacier API auth, schema-version routing, partition-locking conflicts) before the stakes are 51× higher.
Phase 1 — recent quarter. Reprocess the most recent 90 days. This is the quarter the business reads most heavily; getting it correct first means downstream dashboards stop showing wrong numbers within the first week of the operation, which is the only argument that keeps the project funded politically.
Phase 2..N — historical quarters in reverse chronological order. Reprocess one quarter at a time, newest to oldest. Each quarter has its own freeze window, its own validation gate, its own ₹ budget approval, and its own go/no-go meeting. Reverse-chronological order matters: the oldest quarter is the one most likely to be schema-different, source-data-cold, and frozen-by-compliance — leaving it for last means you've already battle-tested the pipeline by the time you hit the hardest case.
Phase N+1 — sweep. A single pass across the entire year to re-validate. No data writes — only aggregate comparisons of the post-reprocess state against the pre-reprocess aggregates captured in the pre-flight snapshots. This is the audit artifact compliance will ask for.
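A minimal sketch of that aggregate-comparison gate, the same check each phase runs before committing. Table aggregates and thresholds here are illustrative, in the style of the go/no-go criteria later in the runbook section:

# validation_gate.py — aggregate diff between pre-reprocess snapshot and reprocessed state
# Aggregates and thresholds are illustrative; calibrate thresholds against the
# *expected* delta of the fix, not against zero.
def validate_phase(pre: dict, post: dict,
                   max_row_drift: float = 0.005, max_sum_drift: float = 0.001) -> list[str]:
    """pre/post: {'rows': int, 'sum_inr': int} aggregates for the phase's date range."""
    failures = []
    row_drift = abs(post["rows"] - pre["rows"]) / max(pre["rows"], 1)
    if row_drift > max_row_drift:
        failures.append(f"row-count drift {row_drift:.2%} > {max_row_drift:.1%}")
    sum_drift = abs(post["sum_inr"] - pre["sum_inr"]) / max(abs(pre["sum_inr"]), 1)
    if sum_drift > max_sum_drift:
        failures.append(f"sum_inr drift {sum_drift:.2%} > {max_sum_drift:.1%}")
    return failures  # empty list == gate passes, phase may commit

# The MCC fix is expected to move sums by ~0.2%, so the sum threshold sits just
# above the expected correction:
print(validate_phase({"rows": 4_500_000, "sum_inr": 8_325_000_000},
                     {"rows": 4_500_120, "sum_inr": 8_308_350_000},
                     max_sum_drift=0.003))  # [] — within tolerance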
Why a Phase 0 pilot is non-negotiable: every reprocess has at least one bug nobody caught in code review. Discovering it on day 1 against 7 days of data costs ₹3k and 2 hours; discovering it on day 19 against 200 days of data costs ₹1.4 lakh, four days of downstream chaos, and a board-level explanation. The pilot is the single highest-ROI thing in this runbook.
A reprocess controller, in code
The orchestration is concrete enough that the script is clearer than the prose. Below is a controller that runs the phase plan: it tracks state across phases, restores cold partitions, writes each phase through an Iceberg-style atomic commit, evaluates a cost gate after each phase, and persists a checkpoint so the operation can resume after a crash. It uses stubs for the warehouse and S3 calls so it executes anywhere; in production the stubs become pyiceberg, boto3's restore_object, and a Slack webhook.
# reprocess_controller.py — phased year-long reprocess with checkpointing
from datetime import date, timedelta
import json, os, time, random
# --- stubs (replace with real APIs in production) --------------------
class WarehouseStub:
def __init__(self): self.snapshot = 0; self.partitions = {}
    def reprocess_partition(self, d, schema_for_date):  # returns the uncorrected payload
        rows = 50_000 + random.randint(-2000, 2000)
        return {"date": d, "rows": rows, "schema": schema_for_date(d),
                "sum_inr": rows * 1850}
def commit_phase(self, phase_payloads):
for p in phase_payloads: self.partitions[p["date"]] = p
self.snapshot += 1; return self.snapshot
class S3Stub:
def __init__(self): self.restored = set()
def is_warm(self, d): return d > (date.today() - timedelta(days=90)).isoformat()
def restore(self, d): # async glacier-style restore
if d in self.restored: return True
time.sleep(0.001); self.restored.add(d); return True
def schema_for_date(d): # schema evolved over the year
if d < "2025-09-01": return "v1"
if d < "2026-01-15": return "v2"
return "v3"
def fix_mcc_enrichment(payload): # the actual six-line fix
payload["sum_inr"] = int(payload["sum_inr"] * 0.998); return payload
# --- controller ------------------------------------------------------
def run_phase(phase_id, dates, wh, s3, ckpt, max_rupees):
print(f"\n=== Phase {phase_id}: {len(dates)}d {dates[0]}..{dates[-1]} ===")
for d in dates:
if not s3.is_warm(d): s3.restore(d) # cold-storage gate
payloads = []; rupees_used = 0
for d in dates:
before = wh.partitions.get(d, {"sum_inr": 0})
        new = wh.reprocess_partition(d, schema_for_date)
        new = fix_mcc_enrichment(new)  # apply the corrected logic exactly once
        rupees_used += 4_200           # warehouse ₹ per partition, calibrated in Phase 0
        payloads.append(new)
        delta = abs(new["sum_inr"] - before["sum_inr"])  # fed to the validation gate in production
if rupees_used > max_rupees: # cost gate
print(f" ABORT: budget ₹{max_rupees:,} exceeded at d={d}")
return False
snapshot = wh.commit_phase(payloads) # atomic phase commit
ckpt[phase_id] = {"dates": [dates[0], dates[-1]], "snapshot": snapshot,
"rupees": rupees_used, "ok": True}
with open("ckpt.json", "w") as f: json.dump(ckpt, f, indent=2)
print(f" OK snapshot={snapshot} ₹{rupees_used:,} spent ckpt saved")
return True
# --- run the plan ----------------------------------------------------
wh, s3 = WarehouseStub(), S3Stub()
ckpt = json.load(open("ckpt.json")) if os.path.exists("ckpt.json") else {}
def days(start, n): return [(date.fromisoformat(start) + timedelta(days=i)).isoformat() for i in range(n)]
plan = [("phase_0_pilot", days("2026-04-18", 7), 80_000),
("phase_1_q4", days("2026-01-01", 90), 6_00_000),
("phase_2_q3_parallel", days("2025-10-01", 92), 6_50_000),
("phase_3_q2", days("2025-07-01", 92), 6_50_000),
("phase_4_q1", days("2025-04-01", 91), 7_50_000)]
for phase_id, dates, budget in plan:
if phase_id in ckpt and ckpt[phase_id]["ok"]: continue # resume past checkpoint
if not run_phase(phase_id, dates, wh, s3, ckpt, budget): break
print(f"\nFinal: {len([p for p in ckpt.values() if p['ok']])}/{len(plan)} phases done; "
f"₹{sum(p['rupees'] for p in ckpt.values()):,} spent")
# Output:
=== Phase phase_0_pilot: 7d 2026-04-18..2026-04-24 ===
 OK snapshot=1 ₹29,400 spent ckpt saved
=== Phase phase_1_q4: 90d 2026-01-01..2026-03-31 ===
 OK snapshot=2 ₹378,000 spent ckpt saved
=== Phase phase_2_q3_parallel: 92d 2025-10-01..2025-12-31 ===
 OK snapshot=3 ₹386,400 spent ckpt saved
=== Phase phase_3_q2: 92d 2025-07-01..2025-09-30 ===
 OK snapshot=4 ₹386,400 spent ckpt saved
=== Phase phase_4_q1: 91d 2025-04-01..2025-06-30 ===
 OK snapshot=5 ₹382,200 spent ckpt saved

Final: 5/5 phases done; ₹1,562,400 spent
Walk through the load-bearing pieces. The schema_for_date function is the schema-version routing — every partition is reprocessed under the schema that was active when the partition was originally written, not under today's schema. A naive reprocess that uses the current schema for everything will produce columns that didn't exist in the source data and silently null them, which the validation gate will flag too late.

The s3.is_warm / s3.restore check at the top of run_phase is the cold-storage gate: Glacier-tier partitions are restored before any reprocessing begins. In production this is boto3's restore_object with RestoreRequest={'Days': 7, 'GlacierJobParameters': {'Tier': 'Standard'}}, followed by 4–12 hours of polling head_object until the restore completes; a sketch of that restore-and-poll loop follows below.

The max_rupees check is the cost gate — every phase has a hard ₹ budget; running past it aborts the phase and forces a manual approval cycle before continuing. Why an aborted phase is better than an over-budget one: the finance partner who approved ₹6 lakh did not approve ₹9 lakh. Going over without re-approval — even by ₹50k — is the way you don't get budget for the next reprocess. Aborting and asking for ₹3 lakh more is professional; running over silently is career-limiting.

commit_phase commits the phase atomically, and the ckpt.json write persists a checkpoint. The checkpoint is the resume point if the controller crashes — the if phase_id in ckpt check in the run loop skips already-committed phases, so you can rerun the script after a crash without redoing successful work. The plan list is the runbook in code form; the document version of it lives in Confluence with stakeholder names, freeze windows, and a Slack channel reference next to each phase.
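Here is a minimal sketch of that production restore-and-poll gate. The bucket name, key layout, and polling interval are assumptions for illustration; restore_object and head_object are the real boto3 S3 client calls:

# glacier_gate.py — restore-and-poll for cold partitions before a phase can start.
# Bucket and key layout are illustrative assumptions, not a real table layout.
import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "analytics-archive"  # assumed bucket

def key_for(partition_date: str) -> str:  # assumed partition layout
    return f"merchant_payments/dt={partition_date}/part-00000.parquet"

def ensure_restored(partition_date: str, poll_seconds: int = 900) -> None:
    key = key_for(partition_date)
    head = s3.head_object(Bucket=BUCKET, Key=key)
    if head.get("StorageClass") not in ("GLACIER", "DEEP_ARCHIVE"):
        return  # warm tier: readable immediately, no restore needed
    try:
        s3.restore_object(
            Bucket=BUCKET, Key=key,
            RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}},
        )
    except ClientError as e:
        # a restore kicked off by an earlier run is fine; anything else aborts the phase
        if e.response["Error"]["Code"] != "RestoreAlreadyInProgress":
            raise
    while True:  # Standard tier completes in roughly 4-12 hours
        restore_header = s3.head_object(Bucket=BUCKET, Key=key).get("Restore", "")
        if 'ongoing-request="false"' in restore_header:
            return  # restored copy is live; the phase can read the partition
        time.sleep(poll_seconds)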
The controller above is roughly the shape PhonePe, Razorpay, and Flipkart use today, with two important production extras: (a) lineage-driven downstream cascade — every committed phase fires OpenLineage events that trigger dependent table reprocesses; (b) a metric-aggregate diff at every phase boundary, posted to Slack with charts, so stakeholders see progress without asking.
The runbook artifact itself
The code is half the deliverable. The other half is a written runbook — a Markdown document in Confluence or a GitHub repo, owned by the senior engineer leading the operation, that captures every operational decision in advance. A working runbook for a year-long reprocess has these sections:
Header (1 page). Title, scope ("reprocess merchant_payments_enriched for FY 2025-26 to fix MCC code 5812 → 5814 mapping"), commander (the senior engineer responsible end-to-end), deputy (so there's coverage if the commander is sick), business sponsor (the VP who approved the operation), affected systems (4–10 downstream tables), expected duration, expected ₹ cost, expected risk to downstream SLAs.
Phase plan (2–3 pages). A table with one row per phase: phase id, date range, ₹ budget, freeze window, downstream consumers notified, validation gate threshold, go/no-go decision-maker, expected runtime, dependencies on previous phases. The same table the controller code consumes, written in human-readable form.
Pre-flight checklist (1 page). Things that must be true before phase 0 begins. Sample items: corrected pipeline code is merged to main and tagged reprocess-fy26; a snapshot of every affected table has been taken via CREATE TABLE <table>__pre_reprocess AS SELECT * FROM <table>; the cold-storage restore has been kicked off for the oldest 90 days; finance partner has confirmed the ₹15 lakh budget; compliance has confirmed no Q3 partitions will be touched.
Per-phase go/no-go criteria (1 page each). What does it mean for a phase to be ready to commit? Sample: row-count delta within ±0.5%; sum of amount_inr delta within ±0.1%; no schema-version routing failures in the logs; downstream notifier confirmed all 14 consumers acknowledged the freeze; the deputy has reviewed the validation report.
Rollback plan (1 page). If a phase commits and is later discovered wrong, what do you do? For Pattern 2 atomic commits this is a snapshot rollback — on Iceberg, CALL <catalog>.system.rollback_to_snapshot('<table>', <pre_reprocess_id>) — but the runbook needs the actual command and the actual snapshot id, not "consult the documentation"; a runnable sketch follows this list. Rollback should be runnable in under 5 minutes by the on-call engineer at 02:30 IST.
Communication plan (1 page). Who gets notified before each phase, during, and after. Channel names, escalation paths, status update cadence (typical: a daily Slack post in #data-reprocess at 10:00 IST with the phase status, ₹ used, ₹ remaining, partitions done, ETA). The status updates are the single biggest difference between a year-long reprocess that goes well politically and one that doesn't, and the runbook is what makes the daily post easy to write.
Lessons-learned section (added during/after). Empty at the start; filled in as the operation progresses with surprises and decisions. By Phase 3 this section is the most-read part of the document — it's where the next person who runs a year-long reprocess at the company learns from your scars.
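What the rollback page might actually contain, sketched for an Iceberg table through Spark. The catalog name, table, and snapshot id are illustrative assumptions; rollback_to_snapshot is Iceberg's stored procedure for exactly this operation:

# rollback_phase.py — phase rollback for an Iceberg table, runnable in minutes at 02:30 IST.
# Assumes a Spark session with Iceberg extensions and a catalog named "ice"; the table
# name and snapshot id are illustrative. Record the real snapshot id at phase commit.
from pyspark.sql import SparkSession

PRE_REPROCESS_SNAPSHOT = 4872190663542083921  # from ckpt.json, written at commit time
TABLE = "analytics.merchant_payments_enriched"

spark = SparkSession.builder.getOrCreate()
spark.sql(
    f"CALL ice.system.rollback_to_snapshot('{TABLE}', {PRE_REPROCESS_SNAPSHOT})"
)
# Verify the rollback landed: the newest snapshot should be the pre-reprocess id
spark.sql(
    f"SELECT snapshot_id, committed_at FROM ice.{TABLE}.snapshots "
    "ORDER BY committed_at DESC LIMIT 3"
).show()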
When NOT to reprocess a year
Same calculus as the previous chapter, scaled up. A year-long reprocess at a busy fintech costs ₹15–₹40 lakh in compute, six weeks of senior engineering attention, and one or two tense conversations with finance. Before committing, run the impact estimate: how many users / dashboards / compliance reports actually depend on the corrected historical numbers? If the answer is "the analytics team would prefer it but won't escalate" — fix forward and document. If the answer is "GST audit Q3 next quarter" or "the CFO is presenting trailing-twelve-months to the board" — reprocess.
Two specific situations make reprocess almost always wrong: (1) the bug being fixed has an impact smaller than the noise floor of the metric — a 0.001% drift in a number that already rounds to two decimal places is not worth ₹15 lakh; (2) a more recent schema change has made the historical fix moot — if the column being corrected has since been dropped or replaced, just leave history as-is and document the discrepancy.
Common confusions
- "A year-long reprocess is just a 90-day backfill done four times." Mechanically yes, operationally no. The thresholds in the first figure — frozen quarters, cold storage, schema drift, vendor-tier breaks, team turnover — only show up at the year scale. Treating it as 4× a 90-day operation underestimates the orchestration burden by an order of magnitude.
- "You should reprocess in chronological order." Reverse chronological is almost always right. Recent partitions are hot in S3, schema-stable, and most-read by current dashboards — fixing them first delivers value within the first week. Oldest partitions are coldest, schema-oldest, and frozen-most-likely — leaving them last lets the pipeline mature before tackling the hard cases.
- "The reprocess pipeline is the same code as the daily pipeline." The daily pipeline assumes the current schema; the reprocess pipeline must route by partition-original-schema. The daily pipeline has no Glacier restore step; the reprocess pipeline does. The daily pipeline has no cost gate; the reprocess pipeline has one per phase. They share the transformation logic and nothing else.
- "Once you commit a phase you're done with it." The lineage graph downstream of the phase needs to re-run too — feature stores, BI extracts, downstream marts, reverse-ETL targets. A phase commit without the cascade leaves downstream stale, and the daily Slack update will catch hell for it.
- "The runbook is over-engineering for an internal operation." The runbook IS the operation — it's how a six-week project survives one engineer going on leave, a re-org, and a Q3 reprioritisation. The first time a year-long reprocess gets dropped halfway through because nobody documented the state, the team builds the runbook discipline retroactively. Build it the first time.
- "Glacier restore is a one-line API call." The API call is one line; the restore takes 4–12 hours per object batch, has tier-dependent ₹ costs (Bulk is cheapest, Expedited is ~30× more expensive), and fails silently if the IAM role doesn't have
s3:RestoreObject. Treat it as a separate phase with its own checklist.
Going deeper
Schema-version routing in practice
The reprocess pipeline must apply different transformation paths to partitions written under different historical schemas. The clean implementation uses a routing function transform_for_schema(version, partition) that dispatches to one of N implementations. The dirty-but-common implementation uses if-ladders that grow over time — workable for 2–3 schemas, unmaintainable past 5. A middle path: keep the routing function clean, version it semantically (v1.2.0 corresponds to "schema added gst_state_code column"), and store the version in the partition's metadata alongside the data. Iceberg's per-file metadata can hold this; Delta Lake supports it via userMetadata in the commit. PhonePe's internal data-platform writeup (2024) describes a routing layer that handles 11 schema versions across 18 months of historical data; the layer is ~400 lines of Python, all of it testable in isolation.

Compare this against /wiki/airflow-vs-dagster-vs-prefect-the-real-design-differences — the orchestrator handles the when, the routing layer handles the what. Why this discipline matters for reprocess specifically: a schema-routing bug doesn't crash; it silently produces wrong numbers. The validation gate is the only thing that catches it, and the gate only fires per-phase. Without explicit routing the bug surfaces in production, post-commit, where rollback is the only fix.
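A minimal sketch of the clean routing shape, reusing the three schema versions from the controller above. Version names, columns, and the transforms themselves are illustrative:

# schema_routing.py — dispatch each partition to the transform for its original schema.
# Versions, columns, and transforms are illustrative; the real map grows with the
# table's ALTER history and lives next to the reprocess pipeline, fully unit-tested.
def fix_mcc(partition: dict) -> dict:  # the shared six-line fix, schema-independent
    if partition.get("mcc") == "5812":
        partition["mcc"] = "5814"
    return partition

def transform_v1(partition: dict) -> dict:  # pre-2025-09: gst_state_code didn't exist
    partition.setdefault("gst_state_code", None)
    return fix_mcc(partition)

def transform_v2(partition: dict) -> dict:  # 2025-09 .. 2026-01
    return fix_mcc(partition)

def transform_v3(partition: dict) -> dict:  # current schema
    return fix_mcc(partition)

ROUTES = {"v1": transform_v1, "v2": transform_v2, "v3": transform_v3}

def transform_for_schema(version: str, partition: dict) -> dict:
    if version not in ROUTES:
        # an unknown version must fail loudly, never fall through to the current path
        raise ValueError(f"no transform registered for schema version {version!r}")
    return ROUTES[version](partition)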
The freeze-window cost calculus
A 90-minute freeze window on a daily-SLA table is fine. A 4-hour freeze window on the same table breaches the daily SLA — the consumers expected fresh data by 09:00 IST and the freeze ends at 13:00. For a year-long reprocess this becomes a problem: each phase is its own freeze window, and the cumulative effect across 5 phases is roughly 25 hours of frozen state spread over six weeks. The runbook needs explicit per-phase scheduling: phase 0 (pilot) on a Saturday morning when consumer load is lowest; phase 1 (recent quarter) over a long weekend; phase 4 (oldest quarter) on a holiday if possible. Razorpay's reprocess discipline reportedly schedules the largest phases against quarter-close blackout periods, when finance has already paused new dashboard refreshes for their own reasons — this is the single highest-leverage scheduling trick available.
Cost surface — warehouse pricing under reprocess load
A reprocess workload is qualitatively different from a daily workload: continuous high concurrency, large scan volume, no caching benefit (every partition is read once and written once). Snowflake's per-credit pricing handles this fine until concurrency caps kick in around 8 simultaneous large queries; above that, queries queue and the wall-clock time balloons. BigQuery's slot model is similar — a project with 2,000 reserved slots can absorb a reprocess phase, a project with 500 cannot. Databricks' Photon engine charges per-DBU and the DBU rate is workload-dependent; a reprocess job that reads and writes Iceberg gets a higher DBU rate than the daily ETL because of the metadata overhead. Build 16's /wiki/cost-on-the-cloud-the-s3-egress-compute-trinity covers the egress side of this; for reprocess specifically, the trick is to negotiate a one-time pricing concession with the vendor's account team (most will give you 20–40% off if you commit to a 6-week reprocess in advance) — Indian fintechs with Snowflake or Databricks contracts routinely use this lever.
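A back-of-envelope for the concurrency cliff above, with assumed per-query runtimes and caps:

# concurrency_wallclock.py — wall-clock time for a phase under a concurrency cap.
# Query runtime and cap values are illustrative assumptions, not vendor limits.
import math

def wall_clock_hours(partitions: int, minutes_per_query: float, cap: int) -> float:
    waves = math.ceil(partitions / cap)  # queries beyond the cap queue into waves
    return waves * minutes_per_query / 60

# 92 partitions at 20 minutes each:
print(wall_clock_hours(92, 20, cap=16))  # 2.0h: fits a Saturday-morning freeze window
print(wall_clock_hours(92, 20, cap=8))   # 4.0h: breaches a daily-SLA freeze window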
What the next reprocess inherits from this one
The lessons-learned section of the runbook is half the value of the operation. The next year-long reprocess at the same company will (with luck) be cheaper, faster, and less politically expensive — because the runbook artifact survives the engineer who wrote it, the schema-routing layer is reusable, the cost gate thresholds have been calibrated against real numbers, and the freeze-window scheduling pattern has been negotiated with downstream teams once. Treat each reprocess as building infrastructure for the next one, not as a one-shot operation. Karan at Flipkart's data platform has reportedly run four year-long reprocesses across three years; the fourth one took two weeks because three previous runbooks had built the muscle. Why this is the framing that changes the project's politics: a one-shot reprocess looks like ₹15 lakh of fire-fighting. A reprocess that builds reusable runbook + routing + gating looks like ₹15 lakh of platform investment. Same code, different sponsor message — and the platform-investment framing is what gets the project funded promptly.
The post-mortem that isn't a post-mortem
Every year-long reprocess produces a document at the end. Don't call it a "post-mortem" — that word is for outages. Call it a "completion report" or "operation summary". Structure: the bug that motivated the reprocess; the chosen scope and why; the actual ₹ cost vs. budget; the actual duration vs. estimate; the surprises encountered (with mitigation); the lessons folded into the runbook for next time; the artifacts produced (audit log, snapshot ids, downstream cascade list); compliance sign-off; finance sign-off. A good completion report is 8–12 pages, lives in the data-platform Confluence space, and is the document linked from the next reprocess's pre-flight checklist.
Where this leads next
- /wiki/disaster-recovery-your-warehouse-just-got-deleted — when a reprocess phase commits the wrong thing and rollback is too slow, the DR conversation starts; this chapter prevents getting there, that one is the recovery if you do.
- /wiki/migrating-warehouses-without-downtime — a warehouse migration is functionally a reprocess of every table at once, with the same phase / gate / freeze discipline scaled up to the platform level.
- /wiki/cost-on-the-cloud-the-s3-egress-compute-trinity — the ₹ side of reprocess: cold-storage retrieval fees, warehouse-tier breaks, and egress costs across phases.
- /wiki/on-call-for-data-pipelines-the-real-playbook — the on-call posture during a reprocess is its own discipline; the runbook composes with the playbook.
Past those, Build 17 closes with the 30-year arc and the parts of the field that don't fit anywhere else. The operations chapters of this build are the ones senior engineers say they wish they had read first.
References
- Apache Iceberg — Spec — the snapshot and atomic-commit primitives that make phase-level rollback feasible.
- AWS S3 Glacier Retrieval Tiers — the ₹ and time cost of cold-storage restore that the reprocess controller must orchestrate.
- Snowflake — Warehouse Sizing & Pricing — the tier ladder where reprocess cost goes super-linear.
- Databricks — Reprocessing with Delta Lake (engineering blog) — public write-ups on phase-based reprocess discipline at platform-team scale.
- PhonePe Engineering Blog — public posts on data-platform reprocess discipline and schema-version routing.
- OpenLineage Specification — the lineage events the reprocess fires for the downstream cascade.
- /wiki/backfills-without-breaking-downstream — the prerequisite chapter; everything in this chapter assumes the 90-day discipline as a foundation.
- /wiki/slas-on-data-what-you-can-actually-promise — the SLA framing that decides how long phase-level freeze windows can be without breaching downstream commitments.