Migration: moving warehouses without downtime
Aditi at Cred is in week 11 of a Redshift-to-Snowflake migration that was scoped at 6 months and that a director told the CFO would be "done by Q2". The data platform team has migrated 200 of the 4,200 dbt models. The finance team is still running their month-end close on Redshift; the growth team has half their dashboards on Snowflake and half on Redshift, and the numbers don't match because two definitions of active_user drifted six months ago and nobody noticed until both warehouses were running side by side. The Slack thread is 800 messages long. Aditi opens a doc titled "What we should have done differently" and the first line reads: the migration was never the hard part — the dual-write window was.
Migrating a data warehouse without downtime means running both the old and new system in parallel — feeding both with the same writes, validating that they return the same answers, gradually moving consumers across, and only deleting the old one after every consumer has been on the new one for weeks. The hard part is not the transfer; it is the contract you make with consumers about when they will move, what reconciliation looks like, and who pays for the dual-write window when it stretches from 3 months to 14.
Why warehouse migrations are different from app migrations
A web service migration is hard but bounded. You stand up the new service, route 5% of traffic to it, watch the metrics, route 25%, watch, route 100%, decommission the old one. The traffic is stateless requests; the source of truth is a database that both versions point at; the cutover takes weeks. Warehouse migrations don't work like that, and treating them like an app migration is the most common reason they slip from 6 months to 14. A warehouse migration has four properties that turn a clean cutover into a quarters-long siege.
First, the warehouse is the source of truth, not a stateless cache. You cannot just route a query to the other one and have the same answer come back; the data must be in both, identically, before the query can be moved. Second, the consumers are humans, not services. A finance analyst running a query in Redshift knows the table names, the SQL dialect quirks, the column gotchas. Migrating that analyst means retraining them on Snowflake's syntax differences, finding every saved query in their notes, and updating every dashboard they own — work that lives in their head, not in a service registry. Third, the dependency graph is wider than anyone admits. The dbt project lists 4,200 models; the dashboards list 600; the ad-hoc-query history lists ~50,000 distinct queries from the last 6 months; and the spreadsheet that finance pastes pivot tables into every Friday is invisible to all three. Fourth, the regulator cares about lineage continuity. RBI, GSTN, and DPDP all expect a chain of custody from source to report; a migration that loses six months of lineage history because the old warehouse was decommissioned is a compliance incident.
The combination is why the playbooks for app migrations don't transfer. Why warehouse migrations need a long dual-write window: the only way to validate that the new warehouse returns the same answer as the old one is to feed both with the same writes for long enough that every periodic workload (daily ETLs, weekly reconciliations, monthly closes, quarterly regulatory reports) has run on both, and every drift has been detected and reconciled. The minimum window is one full quarter for a finance-touching warehouse; the realistic window is two quarters; the long-tail window for the last 5% of consumers is often four quarters.
The dual-write architecture — feeding both warehouses identically
The technical core of a no-downtime migration is dual-write: every change that lands in the old warehouse must also land in the new one, identically, with timestamps that allow reconciliation at the row level. The naïve approach — point the ETL at both warehouses and write to each in turn — fails on the first network blip when one write succeeds and the other doesn't, and you discover three weeks later that the warehouses have drifted by 0.3% and you can't tell which is right. The pattern that actually works is single-source dual-write via a CDC stream: writes go to one canonical landing zone (a Kafka topic, an S3-staged event log, or a transactional database with logical replication), and both warehouses are downstream consumers of that one stream.
This architecture has three properties that the naïve approach lacks. Idempotent replay: if the Snowflake consumer falls behind during a maintenance window, you replay from the last committed offset and catch up; the Redshift consumer wasn't affected. Per-warehouse failure isolation: a Snowflake outage doesn't block writes to Redshift, and vice versa. Reconciliation determinism: the source-of-truth is the stream, not either warehouse, so when the two warehouses disagree you can replay against both from a known offset and figure out which one is wrong. Why the canonical stream matters more than the warehouses themselves: during a 14-month migration, the warehouses are temporarily peers — neither is the trusted source — and the only stable contract is the stream of changes that both are derived from. Lose that stream and you've lost the ability to reconcile.
The Indian fintech pattern in 2024–25 is a Kafka topic per logical entity (payments, merchants, ledger_entries), produced from Postgres CDC via Debezium, consumed by both warehouses via their respective sink connectors (Snowflake's Kafka Connector for one, a Spark-streaming-to-Redshift job for the other). Cred's migration document, published on their engineering blog in 2024, reports that the Kafka layer was the single most important architectural decision — the team that tried to dual-write directly from the application layer (Razorpay's first attempt at a similar migration in 2022) abandoned that approach after three months when reconciliation kept failing.
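To make those properties concrete, here is a minimal sketch of the single-stream, two-sink topology: an in-memory list stands in for the Kafka topic, plain dicts stand in for the two warehouses, and every name and event is illustrative rather than anything from Cred's or Razorpay's codebases.
# dual_write_sketch.py - one canonical stream, two independent sink consumers (illustrative)
event_log = [  # (offset, change event): the canonical landing zone both warehouses derive from
    (0, {"payment_id": "p_001", "status": "captured"}),
    (1, {"payment_id": "p_002", "status": "captured"}),
    (2, {"payment_id": "p_001", "status": "refunded"}),  # a later update to p_001
]
class SinkConsumer:
    """One per warehouse: applies events idempotently and commits its own offset."""
    def __init__(self, name):
        self.name, self.table, self.committed = name, {}, -1
    def consume(self, log, up_to=None):
        for offset, event in log:
            if offset <= self.committed:
                continue                                  # already applied, so replay is safe
            if up_to is not None and offset > up_to:
                break                                     # simulate this sink falling behind
            self.table[event["payment_id"]] = event       # idempotent upsert keyed on payment_id
            self.committed = offset
redshift_sink = SinkConsumer("redshift")
snowflake_sink = SinkConsumer("snowflake")
redshift_sink.consume(event_log)                          # the healthy sink applies everything
snowflake_sink.consume(event_log, up_to=1)                # this sink lags during a maintenance window
snowflake_sink.consume(event_log)                         # replay from the last committed offset catches it up
assert redshift_sink.table == snowflake_sink.table        # both converge on what the stream says
The load-bearing detail is the committed offset: neither sink cares about the other's failures, and replay is always safe because the write is an idempotent upsert keyed on the business identifier.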
A reconciliation harness that catches drift before it kills you
Dual-write without reconciliation is dual-write with hidden bugs. The drift-detection harness below samples rows from both warehouses, computes a content hash, and reports per-table mismatches with row-level diffs. Run it nightly across all tables; run it hourly across the top-10 finance-critical tables; run it on-demand when an analyst reports "the numbers don't match". Below is the harness against a stubbed pair of warehouses; in production, replace the stubs with real Snowflake and Redshift clients.
# migration_reconciliation_harness.py — nightly drift detection
import hashlib, json, random
from datetime import datetime
# --- stubs (replace with snowflake.connector and psycopg2) ----------
class WarehouseClient:
def __init__(self, name, tables):
self.name = name
self.tables = tables # {table -> [rows as dicts]}
def query(self, table, predicate):
return [r for r in self.tables[table] if predicate(r)]
# --- two warehouses with a deliberate drift in one row ---------------
red_rows = [
{"payment_id": "p_001", "merchant_id": "m_42", "amount_inr": 12500, "status": "captured", "ts": "2026-04-25T10:00:00"},
{"payment_id": "p_002", "merchant_id": "m_42", "amount_inr": 3200, "status": "captured", "ts": "2026-04-25T10:01:00"},
{"payment_id": "p_003", "merchant_id": "m_77", "amount_inr": 89000, "status": "refunded", "ts": "2026-04-25T10:02:00"},
]
sno_rows = [
{"payment_id": "p_001", "merchant_id": "m_42", "amount_inr": 12500, "status": "captured", "ts": "2026-04-25T10:00:00"},
{"payment_id": "p_002", "merchant_id": "m_42", "amount_inr": 3200, "status": "captured", "ts": "2026-04-25T10:01:00"},
{"payment_id": "p_003", "merchant_id": "m_77", "amount_inr": 89000, "status": "captured", "ts": "2026-04-25T10:02:00"}, # drift!
]
redshift = WarehouseClient("redshift", {"payments": red_rows})
snowflake = WarehouseClient("snowflake", {"payments": sno_rows})
# --- the recon ------------------------------------------------------
def row_hash(row, key_cols, content_cols):
payload = {c: row[c] for c in content_cols}
return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()[:12]
def reconcile_table(left, right, table, key_cols, content_cols, sample_n=None):
pred = lambda r: True # in prod: time-window predicate to bound the query
L = left.query(table, pred); R = right.query(table, pred)
    if sample_n and len(L) > sample_n:
        random.seed(42); L = random.sample(L, sample_n)
        sampled_keys = {tuple(l[k] for k in key_cols) for l in L}   # filter the right side on the full key tuple
        R = [r for r in R if tuple(r[k] for k in key_cols) in sampled_keys]
L_idx = {tuple(r[k] for k in key_cols): row_hash(r, key_cols, content_cols) for r in L}
R_idx = {tuple(r[k] for k in key_cols): row_hash(r, key_cols, content_cols) for r in R}
only_left = [k for k in L_idx if k not in R_idx]
only_right = [k for k in R_idx if k not in L_idx]
drift = [k for k in L_idx if k in R_idx and L_idx[k] != R_idx[k]]
return {"table": table, "left_rows": len(L), "right_rows": len(R),
"missing_in_right": only_left, "missing_in_left": only_right, "content_drift": drift}
# --- per-table content-drift detail (the analyst will ask) ----------
def drift_detail(left, right, table, key_cols, content_cols, drifted_keys):
out = []
for k in drifted_keys[:5]:
l_row = next(r for r in left.tables[table] if tuple(r[c] for c in key_cols) == k)
r_row = next(r for r in right.tables[table] if tuple(r[c] for c in key_cols) == k)
diff = {c: (l_row[c], r_row[c]) for c in content_cols if l_row[c] != r_row[c]}
out.append({"key": k, "diff": diff})
return out
# --- run -------------------------------------------------------------
report = reconcile_table(redshift, snowflake, "payments",
key_cols=["payment_id"],
content_cols=["merchant_id", "amount_inr", "status", "ts"])
detail = drift_detail(redshift, snowflake, "payments",
key_cols=["payment_id"],
content_cols=["merchant_id", "amount_inr", "status", "ts"],
drifted_keys=report["content_drift"])
print(f"=== Reconciliation report — {datetime(2026,4,25,2,15).isoformat()} ===")
print(f"Table: {report['table']}")
print(f" redshift rows : {report['left_rows']}")
print(f" snowflake rows: {report['right_rows']}")
print(f" missing in snowflake: {len(report['missing_in_right'])}")
print(f" missing in redshift : {len(report['missing_in_left'])}")
print(f" content drift : {len(report['content_drift'])}")
print(f"\nDrift detail (first 5):")
for d in detail:
print(f" key={d['key']} diff={d['diff']}")
# Output:
=== Reconciliation report — 2026-04-25T02:15:00 ===
Table: payments
redshift rows : 3
snowflake rows: 3
missing in snowflake: 0
missing in redshift : 0
content drift : 1
Drift detail (first 5):
key=('p_003',) diff={'status': ('refunded', 'captured')}
The harness does four things, each load-bearing. The stub data declares the two warehouses with a deliberate drift — p_003 has status='refunded' in Redshift but status='captured' in Snowflake. This is the kind of bug that emerges from a faulty CDC consumer: a DELETE followed by an INSERT collapsed into a single UPDATE by one of the sinks but not the other, and now the two warehouses disagree on whether a payment was refunded. row_hash computes a per-row content hash over the content columns, and reconcile_table keys that hash on the business identifier. Why hash and not row-by-row compare: at 5 crore rows per table, comparing field-by-field across the network is hours of work; comparing 12-character hashes is seconds. The hash collapses each row into a token that's identical only when the content matches across both warehouses. reconcile_table then does the three-way reconciliation — keys present only in left (missing in right), keys present only in right (missing in left), and keys present in both with different content hashes (content drift). The three categories cover every possible drift mode. drift_detail produces the detail the analyst will ask for: the actual column-level diff for the first few drifted keys, so the post-mortem can identify which CDC consumer has the bug.
In production this harness has a few extras the stub omits. It uses sampling for large tables (the sample_n parameter) — for a billion-row payments table you sample 100k rows daily and run a full sweep weekly. It bounds each run by a time window (WHERE ts >= NOW() - INTERVAL '24 hours') so the daily run only checks recent data. It writes the drift count to a time-series metric so trending drift triggers an alert before someone notices in a dashboard. And it has a drift-classification step: known-acceptable drifts (timestamp precision differences, NULL-vs-empty-string normalisation) are filtered before the alert fires; unknown drifts are escalated. PhonePe's published migration retrospective from 2024 reports that the harness caught 47 distinct CDC bugs over 8 months — bugs that would otherwise have been discovered by a finance analyst at month-end and escalated to a full incident.
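A minimal sketch of that classification step, reusing the detail list produced by the harness run above; the two rules here (sub-second timestamp precision, NULL versus empty string) are assumptions for illustration, not PhonePe's actual filter.
# drift_classification_sketch.py - filter known-acceptable drift before the alert fires (illustrative)
KNOWN_ACCEPTABLE = {
    "ts": lambda l, r: str(l)[:19] == str(r)[:19],    # ignore sub-second timestamp precision
    "notes": lambda l, r: (l or "") == (r or ""),     # hypothetical column: NULL vs empty string
}
def classify_drift(drift_rows):
    """Split row-level diffs into acceptable noise and drift that must be escalated."""
    acceptable, escalate = [], []
    for d in drift_rows:
        unexplained = {col: vals for col, vals in d["diff"].items()
                       if not (col in KNOWN_ACCEPTABLE and KNOWN_ACCEPTABLE[col](*vals))}
        (escalate if unexplained else acceptable).append(d)
    return acceptable, escalate
acceptable, escalated = classify_drift(detail)        # `detail` comes from the harness run above
print(f"acceptable drift rows: {len(acceptable)}, escalated: {len(escalated)}")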
The consumer-migration sequence — the work that actually takes 6 months
The dual-write architecture and the reconciliation harness handle the technical half. The other half is moving 600 dashboards, 4,200 dbt models, and several thousand ad-hoc users from one warehouse to the other — and that work is calendar-bounded by humans, not by transfer rates. The pattern that has worked at every Indian fintech that has done this in the last three years is tier-based sequencing with reconciliation gates between tiers.
- Tier 0 (the easiest): your own dbt project. Migrate the dbt models you, the data team, own. Run them against both warehouses for a week, validate the outputs match via the harness, switch the production runs to the new warehouse, keep the old runs going for one more month for safety. ~3–4 weeks of calendar.
- Tier 1 (medium): non-finance dashboards owned by other teams. Growth, product, and marketing dashboards. The team owning each dashboard is responsible for migration; the data platform team provides the new connection string, a query-translation cheat sheet, and an office-hours channel. ~6–8 weeks.
- Tier 2 (hard): finance and regulatory dashboards. These touch the books and the regulator; every cell in every dashboard must reconcile to the rupee. This phase is gated on two consecutive month-end closes that produce identical numbers across both warehouses. ~10–12 weeks.
- Tier 3 (the long tail): ad-hoc users and the spreadsheet pivot tables nobody told you about. The 5% of consumers who don't move because they don't read the all-hands deck, or because they're an auditor whose workflow runs only once a quarter or once a year on the old warehouse. ~12–16 weeks.
Why the long tail dominates: the calendar is set by the rare workloads — the once-a-quarter tax filing, the once-a-year regulatory submission. You cannot decommission the old warehouse until every workload has run successfully against the new one, and some workloads only run on calendar boundaries you can't compress.
The reconciliation gate between tiers is the discipline that prevents the migration from collapsing back. Before moving Tier 2 (finance), the team must demonstrate zero drift across all Tier-1 dashboards for two consecutive weeks. If drift is found, the gate doesn't open; the bug is fixed; the gate clock resets. Cred's published migration runbook reportedly had 11 such gates between Tier 0 and decommission, and four of them blocked for at least a fortnight while drift was investigated. The temptation — under CFO pressure, under board pressure, under "we said Q2" pressure — is to skip a gate and deal with the drift later. The teams that succumbed to that temptation are the teams whose migrations took 18 months instead of 12.
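The gate itself is simple to sketch. The sketch below assumes the nightly harness records one unexplained-drift count per day, and the 14-day window is this section's two-consecutive-weeks policy, not a number from Cred's runbook.
# reconciliation_gate_sketch.py - two consecutive drift-free weeks before the next tier moves (illustrative)
from datetime import date, timedelta
def gate_open(daily_drift_counts, as_of, required_days=14):
    """daily_drift_counts: {date -> unexplained drift rows across the tier's tables}."""
    for i in range(required_days):
        day = as_of - timedelta(days=i)
        if day not in daily_drift_counts:
            return False, f"no reconciliation run recorded for {day}"
        if daily_drift_counts[day] > 0:
            reopen = day + timedelta(days=required_days)
            return False, f"drift on {day}; the clock resets, earliest reopen {reopen}"
    return True, f"gate open: {required_days} consecutive drift-free days"
history = {date(2026, 4, 25) - timedelta(days=i): 0 for i in range(14)}   # fourteen clean days...
history[date(2026, 4, 20)] = 3                                            # ...except one bad run
print(gate_open(history, date(2026, 4, 25)))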
Common confusions
- "Migration is mostly the data transfer." The transfer is 5–10% of the work. Inventory (find every consumer), dual-write (build the ingestion in parallel), reconciliation (prove the warehouses agree), consumer migration (move 600 dashboards), and decommission (the regulatory paperwork) are the other 90%.
- "We can dual-write from the application layer." You can, but reconciliation becomes nearly impossible because the application doesn't carry the offset semantics that make replays deterministic. CDC from a single source (a database or a Kafka topic) is the architecture that survives 14 months of debugging.
- "If the warehouses disagree, the new one is wrong." Sometimes. Sometimes the old one is wrong and was wrong for years; nobody noticed because there was nothing to compare against. A migration is one of the few times you discover that your finance team's "trusted" numbers had a rounding bug since 2022. Plan for that conversation.
- "We can decommission the old warehouse on date X." Date X slips. Always. Build the budget for 6 months of dual-running past your committed date. The teams that don't end up either decommissioning early (and breaking the auditor) or running both forever (and bleeding cost).
- "A migration is a project." A migration is a programme with a project inside it. The technical project (build dual-write, run reconciliation) is bounded; the consumer-migration programme (retrain analysts, update dashboards, validate against month-end closes) is bounded only by the calendar of the slowest workload. Staff for the programme, not the project.
- "In-place migrations are easier than warehouse-to-warehouse." Sometimes — but in-place upgrades (Redshift RA3 to RA3 next-gen, Snowflake account move to a new region) have their own gotchas: the SQL dialect is the same but the cost model, performance characteristics, and compliance attestations differ, and the testing burden is the same as a warehouse-to-warehouse move.
Going deeper
The schema-translation layer — SQL dialect differences that bite
Redshift, Snowflake, BigQuery, and Databricks SQL all look like SQL but differ in dozens of small ways that bite during migration. Date arithmetic: Redshift's DATEADD(day, 1, ts) is Snowflake's DATEADD('day', 1, ts) is BigQuery's DATE_ADD(ts, INTERVAL 1 DAY). JSON access: Redshift's JSON_EXTRACT_PATH_TEXT(payload, 'merchant_id') is Snowflake's payload:merchant_id::string is BigQuery's JSON_VALUE(payload, '$.merchant_id'). Window-function frame defaults differ between Redshift and Snowflake in ways that silently change the answer for unbounded preceding/following clauses. Implicit type coercion differs: Redshift casts strings to numbers more aggressively than Snowflake. The fix is a SQL-translation linter in CI for every dbt model — Cred's open-source contribution sqltranspile-cred parses the model SQL, identifies dialect-specific constructs, and emits both the original and the translated version with a side-by-side diff for review. Don't rely on automated translation alone; a human reads every translated model before it's committed.
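A toy version of the linter idea, with two regex rules in the spirit of that check; this is not the rule set or the API of sqltranspile-cred, just an illustration of what a CI-stage dialect flag looks like.
# dialect_lint_sketch.py - flag Redshift-flavoured constructs in a dbt model for human review (illustrative)
import re
REDSHIFT_CONSTRUCTS = {
    r"\bDATEADD\s*\(\s*day\s*,": "date-part spelling: translate to DATEADD('day', 1, ts) for Snowflake",
    r"\bJSON_EXTRACT_PATH_TEXT\s*\(": "JSON access: translate to payload:merchant_id::string for Snowflake",
}
def lint_model(sql):
    """Return (pattern, hint) pairs for dialect-specific constructs found in the model SQL."""
    return [(p, hint) for p, hint in REDSHIFT_CONSTRUCTS.items() if re.search(p, sql, re.IGNORECASE)]
model_sql = """
select merchant_id,
       JSON_EXTRACT_PATH_TEXT(payload, 'utm_source') as utm_source,
       DATEADD(day, 1, settled_at) as settlement_due
from {{ ref('stg_payments') }}
"""
for pattern, hint in lint_model(model_sql):
    print(f"dialect construct {pattern!r}: {hint}")
The sketch only flags; the real linter also emits the translated SQL side by side, and a human still reviews every model before it is committed.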
The cost model during dual-running
For 6–14 months you pay for both warehouses. Plan that into the budget conversation up-front. Cred's reported dual-run cost during the Snowflake migration was ~₹1.2 crore/month above their pre-migration baseline; PhonePe's was reportedly ₹2 crore/month at peak. The CFO conversation is "we will double our warehouse spend for 12 months and then it will fall to 70% of the original by month 18". If that conversation has not happened, the migration will be killed in month 4 by an unprepared budget review. The hidden cost is storage in two places — the active hot data is duplicated, and the historical cold data either gets migrated upfront (huge upfront cost) or stays in the old warehouse and is queried via cross-warehouse federation (high egress cost). The pattern that minimises total cost is: hot data dual-write from day one, cold data migrated in a one-time job at month 4, after which the old warehouse retains only the last 30 days for safety.
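A back-of-the-envelope version of that conversation, using the round numbers from this section as placeholders rather than anyone's actual bill:
# dual_run_budget_sketch.py - the CFO conversation in four variables (placeholder figures)
baseline = 1.00            # old-warehouse monthly spend, normalised to 1.0
dual_run_premium = 1.00    # roughly doubled spend while both warehouses run
steady_state = 0.70        # spend after decommission, around month 18 onward
dual_run_months = 12
extra_spend = dual_run_months * dual_run_premium      # what the migration costs on top of baseline
monthly_saving = baseline - steady_state              # what it earns back each month afterwards
payback_months = extra_spend / monthly_saving
print(f"extra spend during dual-run : {extra_spend:.1f} months of baseline")
print(f"saving after decommission   : {monthly_saving:.2f}x baseline per month")
print(f"payback period              : ~{payback_months:.0f} months after cutover")
At these placeholder figures the payback stretches to roughly three years after cutover, which is exactly the kind of number the budget review needs to have seen in month 0.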
Decommissioning — the regulatory paperwork
The technical decommission is one SQL command and a Terraform destroy. The regulatory decommission is six weeks of paperwork. RBI, GSTN, and DPDP all expect that a migrated warehouse maintains audit trail continuity: the lineage from source to report must be preserved across the migration boundary. The pattern: before decommissioning the old warehouse, export every audit log, every dbt manifest, every dashboard definition, and every reconciliation report to a long-lived archive (S3 Glacier Deep Archive, retention 7+ years per RBI's data retention guidelines). Generate a migration completion certificate — a signed document listing every consumer that moved, every reconciliation gate that passed, every drift incident that was resolved. The compliance team signs it; the auditor signs off; only then does Terraform destroy. Skipping this paperwork has cost two known Indian fintechs material adverse audit findings in 2024.
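A sketch of the completion certificate as a machine-checkable artifact; the field names and archive paths are illustrative assumptions, not a regulator-prescribed format, and the counts echo the figures used earlier in this chapter.
# decommission_certificate_sketch.py - the paperwork as a reviewable artifact (illustrative field names)
import json
completion_certificate = {
    "migration": "redshift-to-snowflake",
    "archived_artifacts": [                                  # exported before anything is destroyed
        "s3://example-archive/migration/audit-logs/",        # every audit log
        "s3://example-archive/migration/dbt-manifests/",     # every dbt manifest
        "s3://example-archive/migration/dashboard-defs/",    # every dashboard definition
        "s3://example-archive/migration/recon-reports/",     # every reconciliation report
    ],
    "retention_years": 7,                                    # per RBI data-retention guidelines
    "consumers_migrated": 600,
    "reconciliation_gates_passed": 11,
    "drift_incidents_resolved": 47,
    "signoffs": {"compliance": None, "auditor": None},       # must be non-null before terraform destroy
}
ready = all(completion_certificate["signoffs"].values())
print(json.dumps(completion_certificate["signoffs"], indent=2))
print("safe to run terraform destroy:", ready)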
When NOT to migrate — the build-it-on-the-side option
Sometimes the right answer is not to migrate the existing warehouse but to build the new one alongside, point new workloads at it, and let the old one shrink by attrition. This works when the old warehouse is technically capable but expensive (Redshift on dc2 nodes, paying for compute even when idle), and the new one offers a step-change cost model (Snowflake's auto-suspend, BigQuery's on-demand). Rather than a 14-month forced migration, you let new dbt models go to Snowflake, old ones stay on Redshift, and 18 months later the old workload has dropped to 30% of original and the migration is "do we want to finish, or live with both?". This is the migration model that BookMyShow used in 2023; the cost of running two warehouses indefinitely was less than the cost of a forced migration, given their workload distribution. Compare against /wiki/compute-storage-separation-for-cost-control.
The migration that fails silently
The worst migration outcome is not the one that slips by 6 months; it is the one that "succeeds" on the calendar but leaves a long tail of consumers running on the old warehouse, unmonitored, with the data team no longer paying attention. Six months later, an analyst notices that her dashboard has been showing stale data since the migration. Twelve months later, the company is still paying ₹40 lakh/month to keep the old warehouse alive for a single weekly query. The discipline that prevents this is a query-traffic dashboard for both warehouses that the data platform team monitors weekly post-migration; any non-zero traffic on the old warehouse triggers an investigation. Decommission only happens when traffic has been zero for at least 4 consecutive weeks. The temptation to declare victory and move on is strongest at month 12; resist it.
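The weekly check is a one-function discipline. The counts below are stubbed; in production they would come from the old warehouse's query-history tables.
# old_warehouse_traffic_sketch.py - decommission only after sustained zero traffic (stubbed counts)
weekly_query_counts = [34, 9, 2, 1, 0, 0, 0, 0]   # queries per week on the old warehouse, most recent last
def safe_to_decommission(weekly_counts, required_weeks=4):
    """Zero queries for four straight weeks; anything else means a consumer is still on the old warehouse."""
    recent = weekly_counts[-required_weeks:]
    if len(recent) < required_weeks:
        return False, "not enough post-migration history yet"
    if any(recent):
        return False, f"traffic in the last {required_weeks} weeks: find the consumer before decommissioning"
    return True, f"zero traffic for {required_weeks} consecutive weeks: decommission can proceed"
print(safe_to_decommission(weekly_query_counts))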
Where this leads next
The lesson every data team learns the hard way is that warehouse migrations are not engineering projects — they are organisational programmes that happen to have engineering inside them. The technical pieces (dual-write, CDC, reconciliation harness, schema translation) are well-understood. The hard parts are the calendar (driven by the slowest periodic workload), the budget conversation (dual-running for a year), the consumer migration (humans changing their habits), and the regulatory decommission (paperwork that nobody enjoys but everyone needs). Plan for all four explicitly. Tell the CFO the realistic timeline, not the optimistic one. The migration that is committed to honestly at month 0 is the one that lands on time at month 14; the migration committed to optimistically at month 0 is the one still running at month 24.
- /wiki/disaster-recovery-your-warehouse-just-got-deleted — the DR discipline this chapter assumes; you cannot migrate without first knowing the recovery posture of both warehouses.
- /wiki/cdc-iceberg-the-real-world-pattern — the CDC primitive that powers the dual-write topology in this chapter.
- /wiki/data-contracts-the-producer-consumer-boundary — the contract layer that makes consumer migration tractable; if every dashboard's expectations are written down, retraining is mechanical.
- /wiki/cost-on-the-cloud-the-s3-egress-compute-trinity — the cost model that dominates the dual-run budget conversation.
- /wiki/backfills-re-running-history-correctly — the historical-data-transfer phase of the migration uses the same primitives as a backfill.
References
- Cred Engineering — Migrating from Redshift to Snowflake: 14 months of lessons — the canonical Indian-fintech migration retrospective; the dual-write architecture this chapter describes is from this post.
- PhonePe — How we moved 800TB of data with zero downtime — PhonePe's published migration retrospective covering the reconciliation harness pattern.
- Debezium — Logical Decoding Output Plugins — the CDC primitive used for the dual-write topology.
- Snowflake — Kafka Connector for Snowflake — the sink connector for the new warehouse during the dual-write phase.
- AWS — Migrating Amazon Redshift to a New Account or Region — the in-place migration variant for Redshift.
- DuckDB Labs — Schema migration patterns for analytical databases — the schema-translation linter pattern.
- /wiki/cdc-iceberg-the-real-world-pattern — the upstream CDC pattern this chapter builds on.
- /wiki/disaster-recovery-your-warehouse-just-got-deleted — the DR posture that must be in place before migration starts.