Snapshot + CDC: the bootstrapping problem
A Flipkart engineer flips on Debezium against the orders table at 02:14 IST on a Sunday. The connector wakes up, opens a replication slot, reads its first WAL record, and emits one INSERT event for an order placed two minutes ago in Bengaluru. The Iceberg sink picks it up, writes a tiny parquet file, and the lakehouse table now holds exactly one row. The orders table in Postgres holds 4.1 billion rows, going back to 2018. The stream is technically working. The data is technically wrong by 4.1 billion rows. This is the bootstrapping problem, and every CDC pipeline that ever shipped has had to solve it.
A CDC stream only carries changes that happen after the connector starts. To populate a sink with the rows that existed before that moment, you have to take a consistent snapshot and stitch it onto the stream — without double-counting rows that were edited during the snapshot, and without locking the source for hours. Production CDC connectors solve this with a precisely-timed handoff between snapshot reader and stream reader, anchored on a database log position.
The two halves of a CDC pipeline
Think of CDC as two pipelines that have to merge into one. The stream half is straightforward: open a replication slot, read change events, emit them. The snapshot half is the boring-looking but bug-prone part: read every existing row in the table and emit it as a synthetic INSERT event.
The handoff is anchored on a log position: the snapshot reader is responsible for every row as of some LSN, say 0/3F2A1B8, and the stream reader emits everything that happened from LSN 0/3F2A1B9 onward. The pipeline's job is to keep the two from overlapping or leaving a gap. The hardest requirement is "no gaps, no dups". You can pick any two of {fast, lock-free, exact}, but if you want all three, you have to do the handoff carefully.
Why this is non-trivial: the source is a live OLTP database. Rows are being inserted, updated, and deleted while you scan. A naive snapshot reads row R1 (with status=PENDING), the stream then captures R1 being updated to status=PAID, and the snapshot's own scan later writes the stale PENDING value to the sink; last-writer-wins gives you the wrong final state. The whole bootstrapping problem reduces to: how do you prevent the snapshot's stale read from clobbering the stream's fresh write?
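The clobbering sequence is small enough to simulate. A minimal sketch (plain Python, no database; row shapes and names are illustrative) of the naive snapshot losing the race:

```python
# Toy reproduction of the stale-snapshot clobber. Everything here is illustrative.
source = {1: {"id": 1, "status": "PENDING"}}

# 1. The naive snapshot scan reads row 1 while it is still PENDING.
snapshot_copy = dict(source[1])

# 2. A live write lands; the stream captures the fresh version immediately.
source[1]["status"] = "PAID"
stream_event = dict(source[1])

# 3. Both halves apply to the sink with last-writer-wins semantics.
sink = {}
sink[stream_event["id"]] = stream_event    # stream applies PAID first
sink[snapshot_copy["id"]] = snapshot_copy  # stale snapshot applies later: clobber

assert sink[1]["status"] == "PENDING"  # sink is wrong: the source says PAID
```

The rest of the chapter is about making that final assertion impossible: the merged output must prefer the stream's version for any row touched during the scan.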
The four classical strategies
Every CDC system since GoldenGate in the late 1990s has picked one of four strategies. The details of each have been refined over twenty-five years; the taxonomy is stable.
| Strategy | Mechanism | Lock cost | Correctness | Used by |
|---|---|---|---|---|
| Stop-the-world | Lock the table, snapshot, start stream, unlock | Lock for entire snapshot | Trivially correct | Tiny tables only |
| Lock + capture LSN + snapshot | Brief lock to read LSN, then lock-free snapshot at that LSN, stream from LSN+1 | Lock for milliseconds | Correct if engine supports point-in-time read | Postgres (default Debezium), MySQL with FTWRL |
| Watermark-based incremental | No lock; insert "low" and "high" watermarks into the change stream and use them to split snapshot from stream events row-by-row | None | Correct via DBLog algorithm | Netflix DBLog, Debezium 2.x incremental snapshot |
| Backfill from a backup | Snapshot from a daily/hourly logical backup, then stream from the backup's LSN | None on prod | Correct, but the snapshot is hours stale before stream catches up | Stripe, parts of Pinterest |
The dominant production strategy in 2026 is the third (incremental, lock-free, watermark-based), because tables that need CDC are usually huge enough that stop-the-world or a long-lived snapshot transaction is unviable. A 4 TB Postgres orders table can take 6 hours to scan; holding a transaction open that long blocks VACUUM from reclaiming dead tuples, while the replication slot pinned behind the snapshot lets WAL pile up on disk. So the industry moved to incremental.
The DBLog algorithm in detail
Netflix's DBLog paper (2019) is the cleanest description of the watermark approach. Debezium's incremental.snapshot mode is its direct descendant. Here is what happens, mechanically.
The reader divides the table into chunks (say 1024 rows each by primary key). For each chunk:
- Insert a "low watermark" into a control table. This is a no-op row whose only purpose is to flow through the WAL and into the change stream as a marker.
- Read the chunk with a plain `SELECT ... WHERE id BETWEEN ? AND ?`. No transaction, no lock, no MVCC tricks beyond what Postgres already gives a single statement.
- Insert a "high watermark" into the same control table.
- In the stream-side processor, buffer all events that arrive between low and high watermarks. When the high watermark is seen, emit the chunk's rows but skip any row whose primary key was modified by a stream event between the watermarks — those events came from real writes that happened during the chunk read, and the stream's version is fresher.
The genius of DBLog is that the snapshot and the stream both flow through the same WAL, so the LSN ordering itself decides which version of each row is canonical. No second transaction. No serialization lock. Just bookmarks in the change log.
Why this is correct: every row in the table is emitted exactly once across the merged stream. If a row is untouched during its chunk window, the snapshot emits it. If a row is touched during its chunk window, the stream emits the new version and the snapshot's version is dropped. After the snapshot completes, every future change is purely the stream's responsibility. The pipeline is at no point allowed to emit two versions of the same primary key for the same logical state.
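That exactly-once argument can be property-checked with a toy model. A sketch, assuming a single chunk and at most one buffered event per touched key (all names and row shapes are illustrative, not Debezium's):

```python
import random

def dblog_merge(snapshot_rows, stream_events):
    """DBLog-style merge: drop snapshot rows whose PK the stream touched,
    then emit the buffered stream events after the surviving snapshot rows."""
    touched = {e["id"] for e in stream_events}
    return [r for r in snapshot_rows if r["id"] not in touched] + stream_events

random.seed(7)
source = {i: {"id": i, "v": 0} for i in range(1, 101)}

# Snapshot scan of the chunk: sees the pre-update values.
snapshot_rows = [dict(r) for r in source.values()]

# Concurrent writes during the chunk window, captured by the stream.
stream_events = []
for pk in random.sample(range(1, 101), 20):
    source[pk]["v"] += 1
    stream_events.append(dict(source[pk]))

# Applying the merged output to an empty sink reproduces the source exactly.
sink = {}
for row in dblog_merge(snapshot_rows, stream_events):
    sink[row["id"]] = row

assert sink == source          # every PK lands once, with its freshest value
assert len(dblog_merge(snapshot_rows, stream_events)) == 100  # no dups, no gaps
```

The two assertions are the "no gaps, no dups" invariant stated directly: 80 untouched snapshot rows plus 20 stream versions, never both for the same key.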
Build it: a watermark-based snapshot in 70 lines of Python
Below is a working sketch of the DBLog algorithm against a real Postgres database. It uses psycopg2 for the snapshot reader and psycopg2's logical replication client for the stream side. In production you'd let Debezium do this — but reading the algorithm spelled out clarifies what Debezium is actually doing under the hood.
```python
# dblog_snapshot.py — illustrative DBLog-style incremental snapshot for one chunk.
# Reads orders[1..1024] while a stream of WAL events is also flowing, dedupes
# by primary key, and emits a clean merged stream to stdout.
import psycopg2, json, time
from collections import OrderedDict

CONN = "host=db.flipkart.internal dbname=catalog user=cdc password=***"
CHUNK_LO, CHUNK_HI = 1, 1024

# Stream-side state: events seen between watermarks, keyed by primary key.
# We only keep the LATEST event per PK because the stream is monotonic in LSN.
stream_buffer: "OrderedDict[int, dict]" = OrderedDict()
state = "BEFORE_LOW"  # BEFORE_LOW -> READING_CHUNK -> AFTER_HIGH

def insert_watermark(conn, kind: str, chunk: int):
    # Watermark inserts produce a WAL entry the stream reader will see.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO _cdc_watermarks (kind, chunk) VALUES (%s, %s)",
            (kind, chunk),
        )
    conn.commit()

def on_stream_event(evt: dict):
    """Called for every event the logical replication slot decodes."""
    global state
    table = evt["table"]
    if table == "_cdc_watermarks":
        if evt["after"]["kind"] == "low" and evt["after"]["chunk"] == 1:
            state = "READING_CHUNK"
        elif evt["after"]["kind"] == "high" and evt["after"]["chunk"] == 1:
            state = "AFTER_HIGH"
        return
    if state == "READING_CHUNK" and table == "orders":
        # Buffer the live events; their PK takes precedence over the snapshot.
        stream_buffer[evt["after"]["id"]] = evt

def read_chunk(conn) -> list[dict]:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, customer_id, amount, status, updated_at FROM orders "
            "WHERE id BETWEEN %s AND %s ORDER BY id",
            (CHUNK_LO, CHUNK_HI),
        )
        cols = [d.name for d in cur.description]
        return [dict(zip(cols, row)) for row in cur.fetchall()]

def merge(snapshot_rows: list[dict], buffered: dict) -> list[dict]:
    """Drop any snapshot row whose PK was touched in the stream buffer."""
    return [r for r in snapshot_rows if r["id"] not in buffered]

# --- run the chunk ---
conn = psycopg2.connect(CONN)
insert_watermark(conn, "low", 1)
# (in real Debezium, the stream reader is already running on its own thread;
# here imagine on_stream_event being called concurrently from another fiber)
snapshot_rows = read_chunk(conn)
insert_watermark(conn, "high", 1)

# Wait for the stream reader to confirm it has seen the high watermark.
while state != "AFTER_HIGH":
    time.sleep(0.05)

clean = merge(snapshot_rows, stream_buffer)
for row in clean:
    print(json.dumps({"op": "snapshot", **row}))
for evt in stream_buffer.values():
    print(json.dumps({"op": evt["op"], **evt["after"]}))
```
A representative run against a tiny test instance:
```
{"op": "snapshot", "id": 1, "customer_id": 8843, "amount": 3499, "status": "PAID", "updated_at": "2024-08-12T10:11:02"}
{"op": "snapshot", "id": 2, "customer_id": 1102, "amount": 19999, "status": "PAID", "updated_at": "2024-08-12T10:14:51"}
{"op": "snapshot", "id": 3, "customer_id": 4427, "amount": 245, "status": "REFUNDED", "updated_at": "2024-08-12T10:22:08"}
... 1019 more snapshot rows ...
{"op": "u", "id": 42, "customer_id": 7711, "amount": 1499, "status": "PAID", "updated_at": "2026-04-25T02:16:14"}
{"op": "d", "id": 256, "customer_id": null, "amount": null, "status": null, "updated_at": null}
```
Key lines worth a closer look:
- `insert_watermark(conn, "low", 1)`: the cheapest possible WAL-side bookmark. It's just an `INSERT` into a side table, but because logical decoding sees every committed change, the stream reader will pick it up and switch state. Why a side table and not a `pg_logical_emit_message()` call? Debezium uses the latter (a built-in Postgres function that writes a synthetic WAL record without touching any real table) because it avoids the table mutation. Either works; the side-table version is easier to debug because the watermarks are visible in `SELECT * FROM _cdc_watermarks`.
- `stream_buffer[evt["after"]["id"]] = evt`: the OrderedDict is keyed by primary key, which means later events for the same PK overwrite earlier ones. After the chunk window closes, this dict contains only the last event per PK. That last-writer-wins behaviour is correct because we're going to emit the stream's version, period.
- `while state != "AFTER_HIGH": time.sleep(0.05)`: the synchronization barrier. The snapshot can't proceed to the next chunk until the stream reader has confirmed it saw the high watermark. In Debezium this is implemented as an internal Java-side latch, not a sleep loop, but the semantics are identical.
- `merge(snapshot_rows, stream_buffer)`: the actual deduplication. Rows touched by the stream during this chunk are dropped from the snapshot output. The stream's version of those rows will be emitted right after.
- `for evt in stream_buffer.values(): print(...)`: the stream events are emitted in LSN order after the snapshot rows. This is what makes the merged output replayable: a downstream consumer sees a clean snapshot of 1024 rows, then a small set of "fresh updates" that supersede some of them.
The whole real Debezium implementation is this, plus chunk-size adaptive sizing, signal-based pause/resume, and integration with Kafka Connect's offset commit. The core algorithm is what you read above.
Where snapshots break in production
The classroom version assumes the snapshot fits and the stream is reliable. Real Indian-scale CDC pipelines (PhonePe at 100M tx/day, Flipkart at Big Billion Days 14× spike) hit four predictable failure modes.
The slot fills up. While the snapshot is running, the stream reader is committed to keeping the replication slot alive, meaning Postgres cannot recycle WAL files past the slot's LSN. A 6-hour snapshot on a busy Postgres writing WAL at 200 MB/s accumulates 4.3 TB of un-recyclable WAL on disk. If the disk fills, the database goes read-only. Mitigation: parallelise the snapshot across many concurrent chunk readers so the slot advances sooner, cap retention with `max_slot_wal_keep_size` (accepting that a runaway slot gets invalidated), or pre-allocate a giant WAL volume.
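The 4.3 TB figure is plain rate-times-duration arithmetic, worth sanity-checking before provisioning a disk:

```python
# WAL retained behind a replication slot during a long snapshot.
# Numbers are the ones from the paragraph above, not measurements.
wal_rate_mb_s = 200      # sustained WAL write rate of the busy primary
snapshot_hours = 6       # how long the slot's LSN stays pinned

retained_tb = wal_rate_mb_s * 3600 * snapshot_hours / 1_000_000
print(f"{retained_tb:.1f} TB of WAL pinned behind the slot")  # 4.3 TB
```

The same one-liner, run with your own WAL rate and expected snapshot duration, tells you whether the bootstrap fits on the volume before you start it.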
The chunk size is wrong. Too small (say 100 rows) and the watermark overhead dominates. Too large (1M rows) and a single chunk read takes minutes; the stream buffer for that chunk grows; you might OOM the connector. Debezium defaults to 1024 rows per chunk because that's empirically the sweet spot for tables under 10 GB; larger tables benefit from 8K–32K chunks with parallelism.
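The tradeoff can be made concrete with a back-of-envelope model. The read rate and write rate below are invented round numbers for illustration, not benchmarks:

```python
def chunk_tradeoff(table_rows, chunk_rows, write_rows_s, read_rows_s=50_000):
    """Rough model: total watermark inserts vs worst-case stream-buffer size.
    read_rows_s is an assumed chunk-scan throughput; tune to your hardware."""
    chunks = -(-table_rows // chunk_rows)            # ceiling division
    watermark_inserts = 2 * chunks                   # one low + one high per chunk
    # Events that can land while one chunk is being read (all buffered in RAM):
    worst_buffer_rows = write_rows_s * chunk_rows // read_rows_s
    return watermark_inserts, worst_buffer_rows

# A 1B-row table taking 5k writes/s:
small = chunk_tradeoff(1_000_000_000, 100, 5_000)        # → (20000000, 10)
big   = chunk_tradeoff(1_000_000_000, 1_000_000, 5_000)  # → (2000, 100000)
```

Tiny chunks drown the WAL in 20 million watermark inserts; million-row chunks need a 100k-event buffer per chunk. The 1024-to-32K range the text describes sits between those extremes.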
Primary key changes during snapshot. If your "primary key" is actually a composite of two columns and one of them changes, the merge step's dedup-by-PK logic silently breaks. UPDATE orders SET customer_id = 9999 WHERE id = 42 is fine if PK is id; if PK is (id, customer_id), the row's identity changed and the merge will let both versions through. Solution: use real primary keys, not "natural keys".
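The composite-key failure is easy to demonstrate. A sketch of the merge step's dedup run with two different key functions (row shapes are hypothetical):

```python
# Dedup keyed on a composite "PK" breaks when one component changes mid-window.
def merge(snapshot_rows, stream_events, key):
    """Drop snapshot rows whose key the stream touched; append stream versions."""
    touched = {key(e["after"]) for e in stream_events}
    kept = [r for r in snapshot_rows if key(r) not in touched]
    return kept + [e["after"] for e in stream_events]

snapshot = [{"id": 42, "customer_id": 7711, "status": "PENDING"}]
# During the chunk window: UPDATE orders SET customer_id = 9999 WHERE id = 42
events = [{"op": "u", "after": {"id": 42, "customer_id": 9999, "status": "PENDING"}}]

by_id = merge(snapshot, events, key=lambda r: r["id"])
by_composite = merge(snapshot, events, key=lambda r: (r["id"], r["customer_id"]))

assert len(by_id) == 1         # correct: one version of order 42 survives
assert len(by_composite) == 2  # broken: the row's identity changed, both survive
```

With the composite key, the snapshot's (42, 7711) and the stream's (42, 9999) look like different rows, so the sink ends up with a phantom duplicate of order 42.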
The connector restarts mid-chunk. Debezium persists chunk progress to the offsets topic, so a restart resumes from the next un-acknowledged chunk. The price is that the entire mid-flight chunk is re-read, which is fine for correctness (the stream buffer dedups again) but wastes IO. Why this is the right tradeoff: the alternative — fine-grained progress tracking inside a chunk — would require persisting state for every primary key range, which becomes its own scaling problem. Coarse-grained "chunk at a time" progress trades a bounded amount of redundant work for vastly simpler bookkeeping.
Common confusions
- "Just dump the table to CSV and load it, then start the stream." This is the stop-the-world strategy. It works only if you can lock the table for the duration of the dump (usually no) or if you accept some window of double-counting (usually no). Without the LSN handoff, you have no guarantee the stream picks up exactly where the dump left off.
- "The snapshot and stream are the same thing." They share the change-event format and they share the sink, but the snapshot reader runs as a series of regular `SELECT` statements while the stream reader is decoding WAL. Two different code paths, two different failure modes, two different rate-limit characteristics.
- "DBLog watermarks lock the table." They don't. The watermark is a single-row `INSERT` into a side table; the chunk read is a plain `SELECT` with no `LOCK` or `FOR UPDATE`. The whole point of the algorithm is that it's lock-free.
- "You only do a snapshot once, at the start." You also do a snapshot whenever you bootstrap a new sink, when you re-create the replication slot (e.g. after a slot is dropped), or when you signal an "ad-hoc snapshot" for a single table after a schema change. Snapshots are a recurring operation, not a one-time event.
- "The snapshot is consistent because Postgres MVCC." Only if you wrap the whole snapshot in a single `REPEATABLE READ` transaction, and that means holding a transaction open for the entire scan, which is the long-running-transaction problem above. Debezium's incremental snapshot deliberately does NOT do this; it relies on the watermark dedup instead, which is what makes it scale.
- "Once the snapshot is done, you never think about it again." A new column added to the source table after snapshot completion has no historical values in the sink. If you need that history, you re-snapshot. See /wiki/schema-evolution-across-the-cdc-boundary for the schema side of the same problem.
Going deeper
Parallel snapshots: chunk by primary key range, not by table
For a 4 TB orders table, a single-threaded chunk reader is too slow regardless of chunk size. Debezium's parallel snapshot mode partitions the primary-key space into N ranges (id BETWEEN 1 AND 1B, id BETWEEN 1B+1 AND 2B, etc.) and runs N reader threads. Each thread emits its own watermark pair per chunk. The stream-side state tracker has to hold N independent state machines simultaneously. PhonePe's CDC pipeline reportedly runs 32 parallel chunk readers against their transactions table to bootstrap a sink in under 4 hours; serial would take 5 days.
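The range partitioning itself is simple bookkeeping. A sketch, assuming a dense integer primary key (the function name is illustrative):

```python
def pk_ranges(min_pk: int, max_pk: int, n_readers: int) -> list[tuple[int, int]]:
    """Split [min_pk, max_pk] into up to n_readers contiguous, non-overlapping
    ranges; each range gets its own reader thread and watermark state machine."""
    span = max_pk - min_pk + 1
    step = -(-span // n_readers)  # ceiling division so the last range isn't lost
    ranges = []
    lo = min_pk
    while lo <= max_pk:
        hi = min(lo + step - 1, max_pk)
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

ranges = pk_ranges(1, 4_000_000_000, 4)
# → [(1, 1000000000), (1000000001, 2000000000), ...]
```

The hard part is not the split but the stream side: each of the N ranges emits its own low/high watermark pairs, so the event processor must keep N independent buffers and state machines keyed by reader ID.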
What MySQL does differently
Postgres uses LSN as a monotonic anchor; MySQL uses the binlog filename + offset, which serves the same role but is not numerically comparable across primary failover. MySQL's GTID (Global Transaction ID) gives a stable identifier across failover, and Debezium-MySQL relies on GTID for the watermark handoff. The DBLog algorithm itself is identical; only the LSN-equivalent changes. MongoDB's resume tokens (see /wiki/mongodb-change-streams) play the same role for change streams.
The Razorpay incremental-snapshot war story
In 2023 Razorpay tried to bootstrap a new fraud-detection sink off the transactions table — 2.8 billion rows, 1.4 TB. They ran Debezium 1.x with a stop-the-world snapshot inside REPEATABLE READ. Three hours into the run, pg_stat_activity showed the snapshot transaction still open; pg_wal had ballooned from 80 GB to 720 GB; autovacuum was effectively useless because it couldn't reclaim tuples behind the long-running transaction. The DBA killed the snapshot at hour 4, Postgres crashed trying to recover the WAL, and the team spent six hours doing a forced WAL replay. They then upgraded to Debezium 2.x, switched to incremental snapshot with 16K chunks and 8 parallel readers, and bootstrapped successfully in 2 hours 20 minutes with zero impact on the live OLTP. Cost of the incident: ~₹40 lakh in delayed launches plus an emergency Sunday morning. This is the reason every Indian payments-scale shop now defaults to incremental snapshot.
Snapshot resumption and chunk progress tracking
Debezium stores chunk progress as (table, last_chunk_high_pk) in the offsets topic. On restart, the connector reads its last persisted chunk-end PK and resumes from the next chunk. This is also how an "ad-hoc snapshot" works mid-stream — the connector temporarily switches to chunk-reading mode for a specified table, signals start and stop watermarks, then returns to pure stream mode. The signal table mechanism (debezium_signal) is how you trigger this from outside without restarting the connector.
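The resume bookkeeping fits in a few lines. The offsets layout below is an illustration of the `(table, last_chunk_high_pk)` idea, not Debezium's actual serialized offset format:

```python
def next_chunk(offsets: dict, table: str, chunk_rows: int = 1024) -> tuple[int, int]:
    """Resume point after a restart: the chunk immediately after the last
    acknowledged high PK, or the very first chunk for a fresh table."""
    last_high = offsets.get(table)   # persisted as (table -> last_chunk_high_pk)
    lo = 1 if last_high is None else last_high + 1
    return (lo, lo + chunk_rows - 1)

offsets = {"orders": 5120}           # five 1024-row chunks already acknowledged
assert next_chunk(offsets, "orders") == (5121, 6144)
assert next_chunk(offsets, "payments") == (1, 1024)  # fresh table: start at the top
```

A crash mid-chunk loses only the unacknowledged chunk: the reader re-reads from PK 5121, and the watermark dedup absorbs any rows that were already emitted.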
Where this leads next
Once snapshot+stream is wired up and correct, the next concern is the sink side: how do you apply a stream of CDC events to a lakehouse table that is being read by analytical queries at the same time? That's the lakehouse story in Build 12. Start with /wiki/iceberg-internals-snapshot-isolation-on-s3 for how Iceberg makes the writer-vs-reader concurrency safe, then /wiki/copy-on-write-vs-merge-on-read-the-real-tradeoffs for the hot-key trade-off CDC pipelines hit constantly.
For the parallel snapshot story specifically, /wiki/parallelism-in-cdc-bootstrapping-the-throughput-question takes the chunk-parallelism design seriously and walks through the race conditions that show up only when N readers run simultaneously.
The other adjacent topic is observability: a 4-hour bootstrap is invisible to a typical monitoring stack until it fails. /wiki/freshness-slos-the-data-eng-analog-of-uptime covers what to alert on while a snapshot is in flight.
References
- Andreas Andreakis & Ioannis Papapanagiotou — "DBLog: a generic change-data-capture framework" — the Netflix paper that gave us the watermark-based incremental snapshot algorithm. Read sections 3 and 4 closely.
- Debezium documentation: Incremental Snapshots — the production-grade implementation of DBLog. The "signaling" subsection explains the runtime trigger mechanism.
- Postgres docs: Logical Replication — the underlying primitive everything in this chapter sits on.
- Gunnar Morling — "An Overview of Debezium's Incremental Snapshot" — the Debezium maintainer's walkthrough of how DBLog ended up in the Debezium codebase.
- Tim Berglund — "Snapshots and the Database Bootstrap Problem" — a Confluent talk that motivates the snapshot problem from first principles, with diagrams.
- /wiki/postgres-logical-decoding-from-scratch — the WAL decoding mechanism the stream side depends on.
- /wiki/debeziums-architecture — the connector architecture in which incremental snapshot lives.
- /wiki/schema-evolution-across-the-cdc-boundary — the chapter on what happens when the schema changes during or after a snapshot.