Resharding Without Downtime

Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

In short

You picked a shard count on day one. Today the cluster is hot, the largest shard is past its disk budget, or you finally admitted that the shard key was wrong. You have to move a chunk of data to a new shard. The product team says you cannot take the application offline.

Online resharding is the protocol that solves this. The shape is always four phases. Phase 1 — Dual-write: a router sends every write for the affected keys to both the old shard and the new shard. Reads still go to the old. Phase 2 — Backfill: a background job bulk-copies historical rows from old to new while dual-write keeps the two copies in sync for new traffic. Phase 3 — Cutover: reads switch from old to new. Writes still dual-write, so a regression is a one-flag rollback. Phase 4 — Cleanup: writes stop on the old shard; the old rows are redundant and eventually dropped.

Each phase is reversible — flip the flag back and you are where you were. Each phase is non-skippable — drop dual-write and you lose every write between the backfill snapshot and cutover; drop backfill and the new shard has no history.

Vitess's VReplication automates the whole thing: MoveTables, SwitchReads, SwitchWrites are three CLI commands, and the topology service plus the vtgate proxy handle the transition. MongoDB does it with moveChunk; Citus with citus_move_shard_placement. Production cost is 10–20% extra CPU; elapsed time runs from hours to weeks. Downtime, when the protocol is implemented correctly, is zero.

You launched on ten MySQL shards. Two years later you are at 80% disk on the busiest five. The capacity plan says you need 15 shards by Diwali — a third of the data has to move to hardware that does not exist yet.

The clean way would be a maintenance window. Six hours of pricing pages returning 503, six hours of UPI callbacks failing. Nobody will sign off.

This chapter is how the rest of the industry — Slack, YouTube, ShopForge, Slipway — moves data between shards without ever taking the application down. The protocol has four phases. It looks simple on a slide. The interesting parts are the corners where it can go wrong.

The 4-phase migration protocol

The procedure is a state machine. The state lives in the proxy tier — vtgate, the Citus coordinator, your own routing layer — and it advances one transition at a time. Every phase has the property that you can flip back to the previous phase without losing data.

Phase 1 — Dual-write. The proxy starts sending every write for the affected keys to both the old shard and the new shard. Reads still go to the old shard. The new shard is empty at this moment except for the writes that have arrived since dual-write started.

Phase 2 — Backfill. A background job streams the existing rows from the old shard to the new shard, in chunks. Dual-write keeps running throughout — every new INSERT, UPDATE, DELETE flows to both. By the end of the backfill, the new shard contains both the historical rows (from the bulk copy) and the live rows (from dual-write).

Phase 3 — Cutover. Reads switch from old to new. Writes are still dual-writing. The old shard is now redundant from the application's perspective — it serves nothing — but its writes are kept current as an insurance policy. Run this state for a day or a week, watching error rates and tail latency. If anything looks wrong, flip the read flag back to the old shard and you are in Phase 2 again.

Phase 4 — Cleanup. Stop writing to the old shard. The old shard's rows are now stale and will diverge from the new shard's; they are no longer the source of truth for anything. Eventually you drop them — drop the table, deprovision the host, reclaim the storage.

Why four phases and not three: dual-write and backfill have to overlap. If you backfill first and then turn on dual-write, every write that arrived during the backfill is missing from the new shard — silent data loss. If you turn on dual-write first and never backfill, the new shard has only the live writes and is missing the entire history. The two phases are not stages; they are layers, and the protocol works because they coexist.

The protocol is symmetric to a transaction commit. Phase 1–2 is the prepare. Phase 3 is the commit point. Phase 4 is the cleanup after commit. Like transactions, the dangerous moment is the commit — Phase 3 is when reader behaviour changes — and like transactions, the protocol survives a crash at any phase by re-reading the state and resuming.

Phase 1 — Dual-write

The proxy starts sending every affected write to both shards. The application sees no change; its driver still issues UPDATE users SET cart = ... WHERE user_id = 42. The proxy intercepts and writes to both.

The hard requirement is atomicity. If the write succeeds on the old shard and fails on the new, the two diverge. The cleanest rule: if either dual-write fails, the application's write fails. The client retries. The proxy guarantees both-or-neither.

You can implement both-or-neither with 2PC, or serialise: write old first, then new, fail if either fails. Most production systems take the serial path — the worst case is that the old has a write the new is missing, and the next retry replays the same UPDATE on both, idempotent if the application is well-designed.

Why dual-write is non-negotiable: backfill copies a snapshot. From the instant the snapshot starts, every subsequent write is a delta the snapshot does not contain. Dual-write captures those deltas on the destination in real time. Skip dual-write and your only options are (a) freeze writes during backfill — downtime — or (b) reconstruct deltas from the old shard's binlog later, which is what Vitess's VReplication does internally. The binlog reader is the dual-write, just implemented inside the database tier rather than the proxy.

End-to-end p99 for affected keys rises 1.5–2× during dual-write. Dual-write is also where you discover the "writes" that were actually reads in disguise — SELECT ... FOR UPDATE row locks. Those need to dual-lock or be reasoned about explicitly. Production migrations find a handful in the first week.

Phase 2 — Backfill

While dual-write keeps the new shard in sync with live traffic, the backfill copies the historical rows — a bulk job that walks the old shard's affected rows in chunks and inserts them into the new shard. Three properties matter: resumable, rate-limited, and convergent.

Resumable. A 50 TB backfill takes days; connections die and hosts reboot. Track the last successfully copied primary key (last_copied_id) in a metadata table. Each chunk is SELECT * FROM users WHERE id > last_copied_id ORDER BY id LIMIT 1000. A crash loses at most one chunk; INSERTs use ON DUPLICATE KEY UPDATE because dual-write may have already placed those rows on the new shard.

Rate-limited. Bulk copy saturates network, disk, and source replication lag. Production backfills cap throughput ("1000 rows/s" or "10 MB/s per stream") and pause when source replication lag crosses a threshold. A backfill that knocks the source over is a self-inflicted outage.

Convergent. When the backfill completes, the old and new shards must hold the same data for the affected keys. Verification is a sub-protocol — a checksum sweep that compares row hashes. Vitess calls this VDiff. If the diff finds divergent rows, pause and investigate.

def backfill(old, new, table, chunk_size=1000):
    last_id = read_progress(table)
    while True:
        rows = old.fetch(
            f"SELECT * FROM {table} WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, chunk_size),
        )
        if not rows:
            return  # done
        for row in rows:
            new.execute(
                f"INSERT INTO {table} VALUES (...) ON DUPLICATE KEY UPDATE ...",
                row,
            )
        last_id = rows[-1]["id"]
        save_progress(table, last_id)
        throttle()  # respect rate limit, check source lag

The ON DUPLICATE KEY UPDATE makes the backfill safe under concurrent dual-writes. By the time backfill reaches row 42, dual-write may already have inserted row 42 from a live application write. The upsert is idempotent if the values match; if dual-write produced a newer version, monotone version columns let the newer version win.

Why backfill cannot replace dual-write: a snapshot at T0 captures only rows that exist at T0. Writes between T0 and the backfill's completion are missing unless something else captures them. Dual-write is that something. Backfill closes the history window; dual-write closes the live-traffic window. Closing both is the entire algorithmic content of online resharding.

Phase 3 — Cutover

Cutover is the moment the proxy starts reading from the new shard. It is the only phase where reader-visible behaviour changes; everything before this was background machinery. The transition is a single proxy-config flip — vtctldclient SwitchReads in Vitess, a metadata-table update in Citus, a phase enum advance in your own router.

Preconditions before cutover:

Backfill complete. last_copied_id has reached the maximum primary key in the affected range.
VDiff clean. A checksum sweep returns zero divergent rows.
Dual-write healthy. No backlog of failed dual-writes; the new shard keeps up with write throughput.
New shard load-tested. Days of dual-write traffic confirm it can take the read load.

The reason writes still dual-write during cutover is rollback. If reads from the new shard reveal a missing index or a regression the diff did not catch, the operator flips reads back to the old shard — which is still current because dual-write has been running. The cutover is undone in seconds. Stop dual-write at cutover (a tempting "simplification") and rollback ceases to be instantaneous; you would have to backfill the cutover window from new to old first. The cost of keeping dual-write running through cutover is one extra write per request; the value is a one-flag rollback. Always pay it.

Production teams hold cutover for 24 hours to a week, watching read latency, error rates, write latency on both shards, and application correctness signals. If nothing regresses, advance to Phase 4.

Phase 4 — Cleanup

Cleanup stops writes to the old shard. From this moment the old data is stale and there is no rollback — going back would discard every write since cleanup started. Phase 3 is the last chance to roll back without losing user data.

After a confidence period, the operator drops the old rows — DROP TABLE for a whole shard, DELETE FROM ... WHERE shard_key IN (...) for a range. The proxy's dual-write logic is removed; it reverts to the single-write fast path, and the migration's runtime overhead disappears.

Python sketch of the dual-write proxy

The state machine fits in a small class. This is not Vitess — there is no binlog reader, no VDiff, no topology service — but every proxy that does online resharding has these branches in its write and read paths.

class ReshardingProxy:
    def __init__(self, old_shard, new_shard, phase):
        # phase is one of: 'dual', 'backfill', 'cutover', 'cleanup'.
        self.old = old_shard
        self.new = new_shard
        self.phase = phase

    def write(self, key, value):
        if self.phase in ('dual', 'backfill', 'cutover'):
            self.old.write(key, value)
            self.new.write(key, value)
        elif self.phase == 'cleanup':
            self.new.write(key, value)
        else:
            raise ValueError(f"unknown phase {self.phase}")

    def read(self, key):
        if self.phase in ('dual', 'backfill'):
            return self.old.read(key)
        elif self.phase in ('cutover', 'cleanup'):
            return self.new.read(key)
        else:
            raise ValueError(f"unknown phase {self.phase}")

    def advance(self, target_phase):
        # The advance logic is what an operator triggers after verifying
        # the preconditions for the next phase — backfill complete, diff
        # clean, soak time elapsed. In production this is a CLI command
        # that updates the topology service; every proxy picks up the change.
        self.phase = target_phase

Thirty-odd lines. The whole protocol is a phase enum and two branch tables. The substance — the backfill job, the diff sweep, the topology distribution, the binlog tracking — sits in other modules; the routing decision itself is this small. Production systems run this logic inside vtgate, the Citus coordinator, or your own router; the substance is identical.

Why the read path flips earlier than the write path: writes go to both during cutover so that rollback (read flip back) restores the system to a known-good state. Reads go to one or the other — there is no "read both and merge" mode in this protocol, because reads are not idempotent in the way writes are (a read returning two different values from two shards is a divergence, not a recovery). The asymmetry is essential; the read path is where commit happens.

Vitess VReplication — the gold standard

Every Vitess team runs the four-phase protocol, except the team does not write any of it. VReplication does dual-write via MySQL binlog streams, backfill via parallel chunked copy, the diff via VDiff, and cutover via topology updates. Three operator commands:

vtctldclient MoveTables --workflow=move_users \
    --source-keyspace=commerce --target-keyspace=customer create \
    --tables="users,carts"
# ... wait, monitor, run VDiff ...
vtctldclient MoveTables --workflow=move_users SwitchReads
# ... soak ...
vtctldclient MoveTables --workflow=move_users SwitchWrites

MoveTables create provisions VReplication streams from source to destination shards — they read the current binlog position, do a parallel chunked copy, and continuously apply new writes from the binlog (Vitess's dual-write, implemented inside the database tier rather than the proxy). SwitchReads flips routing so vtgate reads from the destination. SwitchWrites flips writes and tears down the reverse stream that VReplication keeps available for rollback.

The same machinery powers Vitess's Reshard (splitting or merging shards), keyspace migrations, and OnlineDDL schema changes via shadow tables. YouTube adds MySQL shards weekly this way; Slack, MarkHub, Square, GitHub all run VReplication-driven reshards as routine operations. For Postgres, Citus offers citus_move_shard_placement and citus_split_shard_by_split_points. For MongoDB, the balancer's moveChunk. Vendors differ in surface area; the protocol underneath is the four phases.

Challenges — things that break

The protocol is correct; the failures come from the corners.

Write conflicts. If any write path bypasses the proxy — an admin tool connecting directly to a host, a cron job with hardcoded credentials — those writes do not dual-write and the shards diverge. Audit every write path before starting.

Backfill latency. A 50 TB backfill at 100 MB/s takes six days. Backfill reads compete with application reads; source p99 climbs. Production runbooks throttle backfill if source p99 exceeds 50 ms.

Schema changes mid-migration. Do not do them. Backfill reads rows in one schema; alter midway and INSERTs stop matching. Vitess refuses; a homemade pipeline silently breaks.

Application bugs that assume single-shard. A query without the shard key was always a fan-out; you may not have noticed. After resharding it waits for the slowest shard. Cutover exposes timing assumptions that previously held by accident.

Hot keys during dual-write. A celebrity user sees doubled load — the new shard takes full traffic from the moment dual-write starts. Pre-warm the new shard before going to production traffic.

Replication lag. VReplication is asynchronous; the destination lags milliseconds usually, minutes under load. Cut over during a lag spike and destination reads see stale data. The cutover precondition is "lag below threshold for N consecutive minutes".

Rebalancing — a specific case of resharding

When one shard grows too big you split it into two. The four-phase protocol applies unchanged. In Vitess this is the Reshard workflow — declare new shard ranges (split 0x80-0xc0 into 0x80-0xa0 and 0xa0-0xc0), VReplication backfills each new range from the parent, dual-write keeps them in sync, and cutover flips reads then writes.

Consistent hashing minimises the rebalance. If the shard map is a consistent-hash ring with virtual nodes (chapter 76), adding a shard touches only keys in the new shard's arc — about 1/N of total keys. Adding a shard to a 10-shard cluster moves 9% of data. In hash-modulo-N sharding, going from N=10 to N=11 moves 91% of keys; the four-phase protocol still works but the migration takes ten times as long. This is one of the strongest practical arguments for consistent hashing — it makes routine rebalancing tractable.

Slack — 10 to 20 MySQL shards

Slack's messages cluster ran on Vitess with 10 MySQL shards, partitioned by team_id. Growth pushed the busiest shards past 70% disk; the on-call team was paged for replication lag every other week. The plan: split each shard in two, doubling shard count to 20 and halving per-shard load. Each old shard i becomes new shards 2i and 2i+1, with the Vitess Reshard workflow running ten parallel migrations.

Phase 1 — Dual-write (24 hours). VReplication streams open from each old shard to its two children. Binlog events are filtered by the new key range and applied to the appropriate child. vtgate keeps reading from the old shards while writes flow to both. The team watches replication lag and per-stream error rates.

Phase 2 — Backfill (4 days). Parallel copy streams the existing rows from each old shard to its children, throttled to keep source p99 read latency under 30 ms. VDiff at the end reports zero divergent rows across all 20 destinations.

Phase 3 — Cutover (1 hour, then 1 week soak). Reshard SwitchReads flips reads. Writes still dual-write. Rollout is staged — internal workspaces first (5%), then 25%, then 100%. Read p50 and p99 stay under pre-migration baselines because each new shard now serves half its parent's traffic.

Phase 4 — Cleanup. SwitchWrites flips writes to the new shards only and tears down the reverse VReplication streams. Old shards are decommissioned over the following weeks — read-only, then dropped, then hardware reclaimed.

Total: ~2 weeks, zero downtime, no application code changes. The Rails code issuing Message.where(team_id: id) saw nothing different.

The four-phase resharding timeline as Slack ran it. Dual-write begins on day 1; backfill runs across days 2–5 alongside dual-write; cutover (SwitchReads) is a one-hour staged rollout on day 5 and the soak lasts a week; SwitchWrites finalises the migration on day 12, with cleanup running through day 14 and beyond. The read path flips at SwitchReads; the write path flips at SwitchWrites. The window between the two flips is the rollback window — flipping reads back is a one-flag operation because writes have been dual-writing the whole time.

Common confusions

"Resharding without downtime is impossible." It was, in 2010. Today it is routine — Vitess, Citus, MongoDB, ScyllaDB, DynamoDB all ship the tooling. The technical impossibility is gone; the political impossibility ("we will reshard next quarter") is the actual blocker.
"You can skip phases if the cluster is small." No. Skip dual-write and you lose every write made during the backfill — silent data corruption, discovered weeks later. Skip backfill and the new shard is empty. The four phases are a minimal cover.
"Backfill catches everything." Backfill catches historical rows. Dual-write catches live rows. Backfill alone leaves a write-shaped hole; dual-write alone leaves a history-shaped hole.
"Vitess is overkill — we can write our own." At production scale, the homemade approach is Vitess minus the testing. Teams that try it end up writing VReplication's edge cases (binlog catchup, schema-change interlock, VDiff, rollback streams) one outage at a time. You are not saving engineering; you are deferring it.
"Cutover is the dangerous moment." Cutover is the visible moment. The dangerous moment is in Phase 2 — backfill has been running for a day, dual-write is producing writes the diff has not seen, and someone deploys a schema change without realising the migration is in flight. By cutover the danger has been mitigated.

Going deeper

Vitess VReplication internals

VReplication runs on top of MySQL's binlog. A stream connects a source vttablet to a destination; the destination opens a binlog reader, filters events by the workflow's row-filter expression, and applies them. The filter is what lets Reshard split by hash range without replicating the whole shard. Streams checkpoint into the destination's _vt.vreplication table; crashes resume from the checkpoint with idempotent replay. Built on standard MySQL row-based replication — debug it with the same tools you would use for normal MySQL replication.

MongoDB's moveChunk and the strangler pattern

MongoDB's chunk migration is the same four phases under different names — the balancer marks the chunk in-migration (writes tracked), the destination pulls documents in batches (backfill), catches up on tracked writes (dual-write delta), the config server flips ownership (cutover), and the source deletes its copy (cleanup). Outside databases, the same shape is the strangler fig pattern — new system alongside old, dual-write at the application layer, traffic gradually migrates, old system decommissioned. gh-ost and pt-online-schema-change apply the protocol to schema changes via shadow tables; Vitess's OnlineDDL is gh-ost integrated with VReplication. The four phases are everywhere once you know to look.

Where this leads next

Resharding is stage four of the sharded-database life cycle. Chapter 95 — Global Secondary Indexes covers querying a sharded table by a non-shard-key column without fanning out everywhere. Chapter 96 — Cross-shard queries and 2PC covers joins, aggregations, and distributed transactions across shards. Once you can move data without downtime, the shard count becomes a knob you can turn, not a foundation cast in concrete.

References

CNCF Vitess, VReplication and MoveTables — the canonical reference for Vitess's online-resharding workflow, covering binlog streams, MoveTables, Reshard, OnlineDDL, and the VDiff checksum tool.
MongoDB Inc., Chunk Migration Procedure — the balancer's continuous chunk-migration protocol; the state machine matches the dual-write / backfill / cutover / cleanup phases of this chapter.
Vistron / Citus Data, Shard Rebalancer and citus_move_shard_placement — Citus's online shard movement primitive, built on Postgres logical replication for the dual-write phase.
Cowling, Atlas: Dropbox's File Storage Migration, Dropbox Engineering 2020 — a multi-petabyte storage migration using the four-phase pattern at a higher abstraction level.
Kleppmann, Designing Data-Intensive Applications, Chapter 6 — Partitioning, O'Reilly 2017 — the textbook reference; covers rebalancing partitions and why fixed-shard-count strategies make resharding tractable.
Squarespace Engineering, Vitess at Squarespace — Sharding Stories — production notes on VDiff, soak windows, and running resharding alongside a release schedule.