Snapshot + CDC: the bootstrapping problem

A Flipkart engineer flips on Debezium against the orders table at 02:14 IST on a Sunday. The connector wakes up, opens a replication slot, reads its first WAL record, and emits one INSERT event for an order placed two minutes ago in Bengaluru. The Iceberg sink picks it up, writes a tiny parquet file, and the lakehouse table now holds exactly one row. The orders table in Postgres holds 4.1 billion rows, going back to 2018. The stream is technically working. The data is technically wrong by 4.1 billion rows. This is the bootstrapping problem, and every CDC pipeline that ever shipped has had to solve it.

A CDC stream only carries changes that happen after the connector starts. To populate a sink with the rows that existed before that moment, you have to take a consistent snapshot and stitch it onto the stream — without double-counting rows that were edited during the snapshot, and without locking the source for hours. Production CDC connectors solve this with a precisely timed handoff between snapshot reader and stream reader, anchored on a database log position.

The two halves of a CDC pipeline

Think of CDC as two pipelines that have to merge into one. The stream half is straightforward: open a replication slot, read change events, emit them. The snapshot half is the boring-looking but bug-prone part: read every existing row in the table and emit it as a synthetic INSERT event.

[Figure: Snapshot and stream merging into one CDC pipeline. A Postgres source (orders, 4.1B rows, heap + WAL) feeds two parallel paths: a snapshot reader (SELECT * FROM orders ORDER BY id, reading as of LSN 0/3F2A1B8) and a stream reader (START_REPLICATION SLOT flipkart_orders LOGICAL from 0/3F2A1B9). The paths converge at a vertical dashed LSN cutover where snapshot stops and stream takes over, stitching into the Iceberg sink analytics.orders: snapshot rows plus stream tail, no gaps, no dups.]
Two paths, one sink. The snapshot reader emits everything that existed at LSN 0/3F2A1B8. The stream reader emits everything that happened from LSN 0/3F2A1B9 onward. The pipeline's job is to keep them from overlapping or leaving a gap.

The hardest sentence in that figure is "no gaps, no dups". You can pick any two of {fast, lock-free, exact}, but if you want all three, you have to do the handoff carefully. Why this is non-trivial: the source is a live OLTP database. Rows are being inserted, updated, deleted while you scan. A naive snapshot reads row R1 (with status=PENDING), the stream then captures R1 being updated to status=PAID, and the snapshot's own scan later writes the stale PENDING value to the sink — last-writer-wins gives you the wrong final state. The whole bootstrapping problem reduces to: how do you prevent the snapshot's stale read from clobbering the stream's fresh write?
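The clobbering interleaving takes only a few lines to reproduce. Here is a toy model — the sink is a hypothetical last-writer-wins dict, all names invented for illustration — of the stale snapshot read overwriting the stream's fresh value:

```python
# Hypothetical sink: a dict keyed by primary key, last writer wins.
sink = {}

def apply_event(evt):
    sink[evt["id"]] = evt["status"]

# The interleaving from the text: the snapshot reads R1 as PENDING,
# the stream delivers the PAID update, and the snapshot's scan output
# reaches the sink last, clobbering the fresher value.
apply_event({"id": 1, "status": "PAID"})     # stream event (fresh)
apply_event({"id": 1, "status": "PENDING"})  # snapshot row (stale, arrives late)

assert sink[1] == "PENDING"  # wrong final state: the stale read won
```

This is exactly the bug the watermark handoff exists to prevent.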

The four classical strategies

Every CDC system since GoldenGate in the late 1990s has picked one of four strategies. The details of each have been refined over twenty-five years; the taxonomy is stable.

| Strategy | Mechanism | Lock cost | Correctness | Used by |
| --- | --- | --- | --- | --- |
| Stop-the-world | Lock the table, snapshot, start stream, unlock | Lock held for the entire snapshot | Trivially correct | Tiny tables only |
| Lock + capture LSN + snapshot | Brief lock to read the LSN, then a lock-free snapshot at that LSN, stream from LSN+1 | Lock held for milliseconds | Correct if the engine supports point-in-time reads | Postgres (Debezium default), MySQL with FTWRL |
| Watermark-based incremental | No lock; insert "low" and "high" watermarks into the change stream and use them to split snapshot from stream events row by row | None | Correct via the DBLog algorithm | Netflix DBLog, Debezium 2.x incremental snapshot |
| Backfill from a backup | Snapshot from a daily/hourly logical backup, then stream from the backup's LSN | None on prod | Correct, but the snapshot is hours stale before the stream catches up | Stripe, parts of Pinterest |

The dominant production strategy in 2026 is the third — incremental, lock-free, watermark-based — because tables that need CDC are usually huge enough that "stop-the-world" or "long-lived snapshot transaction" are unviable. A 4 TB Postgres orders table can take 6 hours to scan; holding a transaction open for 6 hours blocks VACUUM and bloats the WAL. So the industry moved to incremental.
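The WAL arithmetic behind that claim is one line (rates taken from the text's example, assumed rather than measured):

```python
wal_rate_mb_s = 200      # sustained WAL write rate on the busy primary (assumed)
snapshot_hours = 6       # lifetime of the long-lived snapshot transaction
retained_tb = wal_rate_mb_s * snapshot_hours * 3600 / 1e6
print(f"{retained_tb:.2f} TB of WAL pinned for the duration")  # 4.32 TB
```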

The DBLog algorithm in detail

Netflix's DBLog paper (2019) is the cleanest description of the watermark approach. Debezium's incremental.snapshot mode is its direct descendant. Here is what happens, mechanically.

The reader divides the table into chunks (say 1024 rows each by primary key). For each chunk:

  1. Insert a "low watermark" into a control table. This is a no-op row whose only purpose is to flow through the WAL and into the change stream as a marker.
  2. Read the chunk with a plain SELECT WHERE id BETWEEN ? AND ?. No transaction, no lock, no MVCC tricks beyond what Postgres already gives a single statement.
  3. Insert a "high watermark" into the same control table.
  4. In the stream-side processor, buffer all events that arrive between low and high watermarks. When the high watermark is seen, emit the chunk's rows but skip any row whose primary key was modified by a stream event between the watermarks — those events came from real writes that happened during the chunk read, and the stream's version is fresher.

The genius of DBLog is that the snapshot and the stream both flow through the same WAL, so the LSN ordering itself decides which version of each row is canonical. No second transaction. No serialization lock. Just bookmarks in the change log.

[Figure: DBLog watermark algorithm timeline — one chunk between two watermarks. Change events flow left to right in LSN order: low watermark L, then stream events UPD id=42, INS id=9001, UPD id=87, DEL id=42, UPD id=12 arriving while the snapshot runs SELECT * FROM orders WHERE id BETWEEN 1 AND 1024, then high watermark H. The scan returns 1024 rows including stale versions of id=42, id=87, id=12; the output is the chunk minus those PKs — stream events win for them.]
Inside one chunk window. The snapshot scan returned 1024 rows, but three of those primary keys (42, 87, 12) were touched by the stream during the scan — four events in total, since id=42 was updated and then deleted. Those three rows are dropped from the snapshot output; the stream's fresher events stand. The remaining 1021 rows pass through unchanged.

Why this is correct: every row in the table is emitted exactly once across the merged stream. If a row is untouched during its chunk window, the snapshot emits it. If a row is touched during its chunk window, the stream emits the new version and the snapshot's version is dropped. After the snapshot completes, every future change is purely the stream's responsibility. The pipeline is at no point allowed to emit two versions of the same primary key for the same logical state.
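The exactly-once property is small enough to check with a toy model. The sketch below (hypothetical, ignoring deletes) simulates one chunk window: the scan sees arbitrarily stale versions, the stream buffers the latest event per touched PK, and the merge still converges on the table's final state:

```python
import random

def run_chunk(table, writes_during_scan):
    """Simulate one DBLog chunk window over a toy table (dict pk -> value)."""
    snapshot = dict(table)           # worst case: every scanned row is stale
    buffered = {}                    # stream events seen between the watermarks
    for pk, value in writes_during_scan:
        table[pk] = value            # the live write lands in the source
        buffered[pk] = value         # ...and the stream buffers the latest per PK
    merged = {pk: v for pk, v in snapshot.items() if pk not in buffered}
    merged.update(buffered)          # stream wins for every touched PK
    return merged

random.seed(7)
table = {pk: f"v{pk}" for pk in range(1, 101)}
writes = [(random.randint(1, 100), "fresh") for _ in range(20)]
merged = run_chunk(table, writes)

# Every PK emitted exactly once, carrying the table's final value.
assert set(merged) == set(table)
assert all(merged[pk] == table[pk] for pk in merged)
```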

Build it: a watermark-based snapshot in 70 lines of Python

Below is a working sketch of the DBLog algorithm against a real Postgres database. It uses psycopg2 for the snapshot reader; the stream side — psycopg2's LogicalReplicationConnection in a real deployment — is reduced to a callback here. In production you'd let Debezium do this, but reading the algorithm spelled out clarifies what Debezium is actually doing under the hood.

# dblog_snapshot.py — illustrative DBLog-style incremental snapshot for one chunk.
# Reads orders[1..1024] while a stream of WAL events is also flowing, dedupes
# by primary key, and emits a clean merged stream to stdout.

import psycopg2, json, time
from collections import OrderedDict

CONN = "host=db.flipkart.internal dbname=catalog user=cdc password=***"
CHUNK_LO, CHUNK_HI = 1, 1024

# Stream-side state: events seen between watermarks, keyed by primary key.
# We only keep the LATEST event per PK because the stream is monotonic in LSN.
stream_buffer: "OrderedDict[int, dict]" = OrderedDict()
state = "BEFORE_LOW"   # BEFORE_LOW -> READING_CHUNK -> AFTER_HIGH

def insert_watermark(conn, kind: str, chunk: int):
    # Watermark inserts produce a WAL entry the stream reader will see.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO _cdc_watermarks (kind, chunk) VALUES (%s, %s)",
            (kind, chunk),
        )
        conn.commit()

def on_stream_event(evt: dict):
    """Called for every event the logical replication slot decodes."""
    global state
    table = evt["table"]
    if table == "_cdc_watermarks":
        if evt["after"]["kind"] == "low" and evt["after"]["chunk"] == 1:
            state = "READING_CHUNK"
        elif evt["after"]["kind"] == "high" and evt["after"]["chunk"] == 1:
            state = "AFTER_HIGH"
        return
    if state == "READING_CHUNK" and table == "orders":
        # Buffer the live events; their PK takes precedence over the snapshot.
        # Deletes carry the row image in "before"; everything else in "after".
        pk = (evt["after"] or evt["before"])["id"]
        stream_buffer[pk] = evt

def read_chunk(conn) -> list[dict]:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, customer_id, amount, status, updated_at FROM orders "
            "WHERE id BETWEEN %s AND %s ORDER BY id",
            (CHUNK_LO, CHUNK_HI),
        )
        cols = [d.name for d in cur.description]
        return [dict(zip(cols, row)) for row in cur.fetchall()]

def merge(snapshot_rows: list[dict], buffered: dict) -> list[dict]:
    """Drop any snapshot row whose PK was touched in the stream buffer."""
    return [r for r in snapshot_rows if r["id"] not in buffered]

# --- run the chunk ---
conn = psycopg2.connect(CONN)
insert_watermark(conn, "low", 1)
# (in real Debezium, the stream reader is already running on its own thread;
#  here imagine on_stream_event being called concurrently from another fiber)
snapshot_rows = read_chunk(conn)
insert_watermark(conn, "high", 1)
# Wait for the stream reader to confirm it has seen the high watermark.
while state != "AFTER_HIGH":
    time.sleep(0.05)

clean = merge(snapshot_rows, stream_buffer)
for row in clean:
    print(json.dumps({"op": "snapshot", **row}))
for pk, evt in stream_buffer.items():
    # Deletes have no "after" image; emit a tombstone keyed by PK.
    payload = evt["after"] or {"id": pk}
    print(json.dumps({"op": evt["op"], **payload}))

A representative run against a tiny test instance:

{"op": "snapshot", "id": 1, "customer_id": 8843, "amount": 3499, "status": "PAID", "updated_at": "2024-08-12T10:11:02"}
{"op": "snapshot", "id": 2, "customer_id": 1102, "amount": 19999, "status": "PAID", "updated_at": "2024-08-12T10:14:51"}
{"op": "snapshot", "id": 3, "customer_id": 4427, "amount": 245, "status": "REFUNDED", "updated_at": "2024-08-12T10:22:08"}
... 1019 more snapshot rows ...
{"op": "u", "id": 42, "customer_id": 7711, "amount": 1499, "status": "PAID", "updated_at": "2026-04-25T02:16:14"}
{"op": "d", "id": 256}

The real Debezium implementation is this core algorithm plus adaptive chunk sizing, signal-based pause/resume, and integration with Kafka Connect's offset commits.

Where snapshots break in production

The classroom version assumes the snapshot fits and the stream is reliable. Real Indian-scale CDC pipelines (PhonePe at 100M transactions/day, Flipkart under a 14× Big Billion Days spike) hit four predictable failure modes.

The slot fills up. While the snapshot is running, the replication slot must stay open so the stream can resume — and Postgres cannot recycle WAL files past the slot's restart LSN. A 6-hour snapshot on a busy Postgres writing WAL at 200 MB/s accumulates about 4.3 TB of un-recyclable WAL on disk. If the disk fills, Postgres PANICs and shuts down. Mitigation: parallelise the snapshot across many concurrent chunk readers to shorten the window, cap slot retention with max_slot_wal_keep_size (accepting that the slot is invalidated if the cap is hit), or pre-provision a much larger WAL volume.
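Monitoring this means watching the gap between the current WAL position and the slot's restart LSN. The LSN arithmetic is simple — in live monitoring you would ask Postgres itself via pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) against pg_replication_slots, but the LSN strings below are assumed values to show the math:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Parse a Postgres LSN like '0/3F2A1B8' into an absolute byte offset.
    The part before the slash is the high 32 bits, the part after the low 32."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

current = lsn_to_bytes("1/A0000000")      # hypothetical pg_current_wal_lsn()
slot_restart = lsn_to_bytes("0/3F2A1B8")  # hypothetical slot restart_lsn
retained = current - slot_restart         # bytes Postgres cannot recycle
print(f"{retained / 1e9:.2f} GB of WAL pinned by the slot")
```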

The chunk size is wrong. Too small (say 100 rows) and the watermark overhead dominates. Too large (1M rows) and a single chunk read takes minutes; the stream buffer for that chunk grows; you might OOM the connector. Debezium defaults to 1024 rows per chunk because that's empirically the sweet spot for tables under 10 GB; larger tables benefit from 8K–32K chunks with parallelism.
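A rough cost model makes the tradeoff concrete. All numbers below are assumptions for illustration, not Debezium measurements:

```python
def chunk_costs(total_rows, chunk_rows, scan_rows_s, write_rate_rows_s, row_bytes=200):
    """Toy model: each chunk costs two watermark writes, and the stream
    buffer must hold every write that lands during one chunk read."""
    chunks = total_rows // chunk_rows
    watermark_writes = 2 * chunks
    chunk_seconds = chunk_rows / scan_rows_s
    buffer_bytes = write_rate_rows_s * chunk_seconds * row_bytes
    return watermark_writes, buffer_bytes

# A 100M-row table scanned at 50k rows/s while 5k writes/s land on it (assumed).
for chunk in (100, 1024, 1_000_000):
    wm, buf = chunk_costs(100_000_000, chunk, 50_000, 5_000)
    print(f"chunk={chunk}: {wm} watermark writes, {buf/1e6:.1f} MB buffered per chunk")
```

Tiny chunks multiply watermark writes; giant chunks balloon the per-chunk stream buffer.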

Primary key changes during snapshot. If your "primary key" is actually a composite of two columns and one of them changes, the merge step's dedup-by-PK logic silently breaks. UPDATE orders SET customer_id = 9999 WHERE id = 42 is fine if the PK is id; if the PK is (id, customer_id), the row's identity changed mid-chunk and the merge lets both versions through. Solution: dedup on an immutable surrogate key, not a mutable natural or composite key.
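A three-line toy shows the failure (schema and values hypothetical): the stream saw the row under its new identity, so the stale snapshot version sails through the dedup:

```python
# PK is the composite (id, customer_id); an UPDATE changed customer_id mid-chunk.
snapshot_rows = [{"id": 42, "customer_id": 7711, "status": "PENDING"}]
stream_buffer = {(42, 9999)}  # the stream buffered the row under its NEW identity

merged = [r for r in snapshot_rows
          if (r["id"], r["customer_id"]) not in stream_buffer]

# The stale version slips through: its old identity (42, 7711) never
# appears in the buffer, so both versions of the row reach the sink.
assert merged == snapshot_rows
```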

The connector restarts mid-chunk. Debezium persists chunk progress to the offsets topic, so a restart resumes from the next un-acknowledged chunk. The price is that the entire mid-flight chunk is re-read, which is fine for correctness (the stream buffer dedups again) but wastes IO. Why this is the right tradeoff: the alternative — fine-grained progress tracking inside a chunk — would require persisting state for every primary key range, which becomes its own scaling problem. Coarse-grained "chunk at a time" progress trades a bounded amount of redundant work for vastly simpler bookkeeping.

Common confusions

Snapshot events are not real INSERTs. They are synthetic replays of existing rows — Debezium marks them op "r" (read), not "c" (create) — and the sink must apply them as upserts, not appends.

Watermarks are not locks. They are ordinary rows inserted into a control table; their only job is to flow through the WAL so the stream reader can bracket the chunk.

The stream must be running before the first low watermark, not after. If the slot opens later, events inside the chunk window are lost and the merge has nothing to dedup against.

Going deeper

Parallel snapshots: chunk by primary key range, not by table

For a 4 TB orders table, a single-threaded chunk reader is too slow regardless of chunk size. Debezium's parallel snapshot mode partitions the primary-key space into N ranges (id BETWEEN 1 AND 1B, id BETWEEN 1B+1 AND 2B, etc.) and runs N reader threads. Each thread emits its own watermark pair per chunk. The stream-side state tracker has to hold N independent state machines simultaneously. PhonePe's CDC pipeline reportedly runs 32 parallel chunk readers against their transactions table to bootstrap a sink in under 4 hours; serial would take 5 days.
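The range partitioning itself is simple arithmetic. A sketch, assuming a dense integer PK (real connectors also probe the actual min/max and cope with sparse keys):

```python
def pk_ranges(min_pk: int, max_pk: int, n: int):
    """Split [min_pk, max_pk] into n contiguous ranges for parallel readers."""
    span = max_pk - min_pk + 1
    step = -(-span // n)  # ceiling division so the last range isn't overloaded
    return [(lo, min(lo + step - 1, max_pk))
            for lo in range(min_pk, max_pk + 1, step)]

ranges = pk_ranges(1, 4_100_000_000, 32)
assert ranges[0][0] == 1 and ranges[-1][1] == 4_100_000_000
assert all(a[1] + 1 == b[0] for a, b in zip(ranges, ranges[1:]))  # no gaps
```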

What MySQL does differently

Postgres uses LSN as a monotonic anchor; MySQL uses the binlog filename + offset, which serves the same role but is not numerically comparable across primary failover. MySQL's GTID (Global Transaction ID) gives a stable identifier across failover, and Debezium-MySQL relies on GTID for the watermark handoff. The DBLog algorithm itself is identical; only the LSN-equivalent changes. MongoDB's resume tokens (see /wiki/mongodb-change-streams) play the same role for change streams.
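A small sketch of why file+offset positions break across failover (filenames assumed): they order fine on one primary, but a new primary starts its own file sequence, so the comparison is meaningless across the failover boundary:

```python
def binlog_key(pos: tuple[str, int]) -> tuple[int, int]:
    """Order binlog positions on a single server: (file sequence, byte offset)."""
    filename, offset = pos
    return (int(filename.rsplit(".", 1)[1]), offset)

# Monotonic on one primary...
assert binlog_key(("binlog.000123", 4096)) < binlog_key(("binlog.000124", 17))
# ...but after failover the new primary's "binlog.000001" would sort BEFORE
# every position from the old primary, which is why GTIDs are the anchor.
```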

The Razorpay incremental-snapshot war story

In 2023 Razorpay tried to bootstrap a new fraud-detection sink off the transactions table — 2.8 billion rows, 1.4 TB. They ran Debezium 1.x's monolithic initial snapshot inside a single REPEATABLE READ transaction. Three hours into the run, pg_stat_activity showed the snapshot transaction still open; pg_wal had ballooned from 80 GB to 720 GB; autovacuum was effectively useless because it couldn't reclaim tuples behind the long-running transaction. The DBA killed the snapshot at hour 4, Postgres crashed trying to recover the WAL, and the team spent six hours doing a forced WAL replay. They then upgraded to Debezium 2.x, switched to incremental snapshot with 16K chunks and 8 parallel readers, and bootstrapped successfully in 2 hours 20 minutes with zero impact on the live OLTP. Cost of the incident: ~₹40 lakh in delayed launches plus an emergency Sunday morning. This is the reason every Indian payments-scale shop now defaults to incremental snapshot.

Snapshot resumption and chunk progress tracking

Debezium stores chunk progress as (table, last_chunk_high_pk) in the offsets topic. On restart, the connector reads its last persisted chunk-end PK and resumes from the next chunk. This is also how an "ad-hoc snapshot" works mid-stream — the connector temporarily switches to chunk-reading mode for a specified table, signals start and stop watermarks, then returns to pure stream mode. The signal table mechanism (debezium_signal) is how you trigger this from outside without restarting the connector.
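The resume logic reduces to a few lines. A sketch with assumed field names (the real offsets payload is Debezium-internal):

```python
offsets = {}  # stand-in for the Kafka Connect offsets topic

def commit_chunk(table: str, high_pk: int):
    """Persist only the high PK of the last fully acknowledged chunk."""
    offsets[table] = {"last_chunk_high_pk": high_pk}

def resume_point(table: str, chunk_size: int):
    """On restart, resume from the chunk after the last committed one."""
    saved = offsets.get(table)
    start = saved["last_chunk_high_pk"] + 1 if saved else 1
    return (start, start + chunk_size - 1)

commit_chunk("orders", 2048)   # chunks [1..1024] and [1025..2048] acknowledged
# A crash here loses only the in-flight chunk, which is simply re-read.
assert resume_point("orders", 1024) == (2049, 3072)
```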

Where this leads next

Once snapshot+stream is wired up and correct, the next concern is the sink side: how do you apply a stream of CDC events to a lakehouse table that is being read by analytical queries at the same time? That's the lakehouse story in Build 12. Start with /wiki/iceberg-internals-snapshot-isolation-on-s3 for how Iceberg makes the writer-vs-reader concurrency safe, then /wiki/copy-on-write-vs-merge-on-read-the-real-tradeoffs for the hot-key trade-off CDC pipelines hit constantly.

For the parallel snapshot story specifically, /wiki/parallelism-in-cdc-bootstrapping-the-throughput-question takes the chunk-parallelism design seriously and walks through the race conditions that show up only when N readers run simultaneously.

The other adjacent topic is observability: a 4-hour bootstrap is invisible to a typical monitoring stack until it fails. /wiki/freshness-slos-the-data-eng-analog-of-uptime covers what to alert on while a snapshot is in flight.
