Debezium's architecture

The previous three chapters showed three completely different CDC sources: Postgres slots and pgoutput, the MySQL binlog with its row events, MongoDB change streams over the oplog. A team running all three would need three custom consumers, three resumption schemes, three flavours of schema inference, three failure modes. Debezium's reason to exist is to collapse those three into one uniform model: one Kafka topic per table, one offset model, one snapshot protocol, one schema-registry contract. This chapter is the architecture tour: what spawns when you start a Debezium connector, where the offsets live, how snapshot-then-stream is glued together, and which seams crack first under Indian-scale load.

Debezium is a set of source connectors that run inside a Kafka Connect cluster, one connector per upstream database, each emitting one Kafka topic per table. Underneath the abstraction it does three jobs: it owns the upstream replication primitive (slot, binlog position, resume token), it runs a snapshot-then-stream protocol so consumers never see a partial history, and it serialises every change as an envelope with before, after, source, and op fields whose schema is tracked by a registry. The hard production problems are not Debezium itself — they are connector restart, snapshot duration on large tables, and schema evolution.

What "Debezium" actually is

Debezium is not a daemon you install. It is a JAR — a set of Kafka Connect source connectors — that you load into a Kafka Connect worker. Kafka Connect is the host. Debezium plugs in. The split matters because every operational concern that feels like a Debezium problem (worker memory, connector restart, dead-letter queues) is actually a Kafka Connect problem, and the documentation lives in two different places.

A running Debezium deployment has these moving parts:

[Figure: Debezium runtime topology — what's running where. A source database (Postgres / MySQL / Mongo, exposing its replication slot / binlog / oplog) on the left; a Kafka Connect worker cluster in the middle (JVM, 1–N nodes, group-coordinated) hosting the Debezium connector task (poll() loop, snapshot+stream), the Connect framework (REST, offset commit, retries), and Single Message Transforms (unwrap, route, mask PII, add headers); a Kafka cluster on the right with the internal connect-configs / connect-offsets / connect-status topics, the schema-history topic, and one change-event topic per table.]
The runtime is three boxes: a source database speaking its native replication protocol, a Kafka Connect worker hosting the Debezium connector task, and a Kafka cluster receiving per-table change-event topics plus the internal connector-state topics.

The connector is the only Debezium-authored code in the picture. Everything else — the worker process, the REST API, the offset commit machinery, the dead-letter queue support, the rebalance protocol — is Kafka Connect, a separate Apache project. Why this distinction is more than nitpicking: when a Debezium task fails to start with IllegalStateException from WorkerSourceTask you are debugging Kafka Connect, not Debezium. The relevant logs and tunables (offset.flush.interval.ms, producer.override.compression.type, errors.tolerance) are documented under Kafka Connect. Treating "Debezium broke" as one undifferentiated problem space is the single most common time sink for engineers new to the stack.
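When a task does fail, the first diagnostic stop is the Connect REST API, not Debezium's logs. A sketch of the usual checks, assuming a worker at connect:8083 and the connector name used later in this chapter:

```shell
# Ask the Connect worker for connector and per-task state; a FAILED task
# includes the Java stack trace in the "trace" field of the response.
curl -s http://connect:8083/connectors/razorpay-payments-source/status

# Restart just the failed task, without touching the connector's offsets
# (no resnapshot is triggered by a task restart):
curl -s -X POST \
  http://connect:8083/connectors/razorpay-payments-source/tasks/0/restart
```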

The change-event envelope

Every Debezium change event is a structured record with the same envelope, regardless of source. This uniformity is the single most useful property of the whole framework — a Flink job that consumes Debezium events from a Postgres table and from a MySQL table can use the same parser. The envelope (Avro-encoded by default, JSON-encoded if you choose JsonConverter) looks like:

{
  "before": null,
  "after": {
    "id": 9281,
    "merchant_id": "razorpay-test",
    "amount": 49900,
    "status": "captured",
    "created_at": 1714032501412
  },
  "source": {
    "version": "2.5.4.Final",
    "connector": "postgresql",
    "name": "razorpay",
    "ts_ms": 1714032501421,
    "db": "razorpay",
    "schema": "public",
    "table": "payments",
    "txId": 8814522,
    "lsn": 24138842616,
    "snapshot": "false"
  },
  "op": "c",
  "ts_ms": 1714032501500
}

The five top-level fields: before is the row state before the change (null for inserts); after is the row state after it (null for deletes); source is provenance (connector version, database, schema, table, transaction id, LSN, and whether the event came from a snapshot); op is the operation (c create, u update, d delete, r snapshot read); and the outer ts_ms is when the connector processed the event, as opposed to source.ts_ms, when the database committed it.

A typical sink does not want the envelope — it wants the row. Debezium ships a Single Message Transform called ExtractNewRecordState (often called "unwrap") that flattens after to the top level for inserts/updates and emits a tombstone for deletes. Why this transform is so commonly applied that many teams forget it exists: most downstream sinks (a Postgres mirror, a Snowflake stream, a Materialize view) want one row per event. The full envelope is only useful when the consumer is doing event-sourcing-shaped logic — auditing, rebuilds, time-travel queries — where the before and source.lsn actually carry signal.
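After the unwrap transform, the insert event shown earlier collapses to just the row; the envelope fields are gone, and a delete becomes a Kafka tombstone for the same key:

```json
{
  "id": 9281,
  "merchant_id": "razorpay-test",
  "amount": 49900,
  "status": "captured",
  "created_at": 1714032501412
}
```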

Snapshot then stream — the bootstrapping protocol

A new connector arriving at a database that already has a billion rows of history cannot just start at the current LSN. The downstream consumer would only see writes from "now"; it would never have a copy of any row that was inserted last year. Debezium's solution — and the solution every CDC tool eventually converges on — is snapshot then stream:

  1. Open a transaction (or the equivalent on each source) at a known position.
  2. Read every row of every captured table, emitting them as op: r events.
  3. Switch to streaming from the position recorded at step 1.
  4. Continue forever.

The protocol's correctness condition is that the position recorded at step 1 must be a position where the snapshot's view of the world is consistent with the streaming events that follow. On Postgres, this is enforced by creating a logical replication slot and immediately exporting a snapshot from inside a REPEATABLE READ transaction at the slot's consistent_point. On MySQL, the connector takes a FLUSH TABLES WITH READ LOCK (or, since 1.6's incremental snapshots, a lock-free chunked read), records the binlog position, then releases the lock. On MongoDB, the connector resolves the cluster's current resume token, reads every document in the captured collections, then opens the change stream from the recorded token.
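On Postgres the two-step dance is visible in plain SQL. A sketch of what the connector does under the hood (the snapshot name and table are illustrative; on PostgreSQL 15+ the export option is spelled (SNAPSHOT 'export') rather than EXPORT_SNAPSHOT):

```sql
-- Replication connection: create the slot and export its snapshot.
CREATE_REPLICATION_SLOT dbz_razorpay_slot LOGICAL pgoutput EXPORT_SNAPSHOT;
-- returns: slot_name, consistent_point (an LSN), snapshot_name

-- Ordinary SQL connection: read the tables as of exactly that point.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';  -- snapshot_name from above
SELECT * FROM public.payments;   -- each row emitted as an op: "r" event
COMMIT;

-- Streaming then starts at consistent_point: everything the snapshot saw
-- committed before it, everything it missed comes through the slot after it.
```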

[Figure: Snapshot then stream — the position-locking moment. A timeline of source writes; at LSN 24138842616 Debezium records the position and starts Phase 1, the snapshot (SELECT * FROM payments → r-events); Phase 2 decodes the WAL from that same LSN → c/u/d events. Correctness condition: the recorded position must be replayable later — Postgres: REPEATABLE READ + the slot's consistent_point; MySQL: FLUSH TABLES + binlog position; Mongo: resume token. Failure mode: the snapshot takes 4h, the source rotates WAL/binlog/oplog past the position, and the connector resnapshots from zero.]
The two phases overlap on the timeline at the recorded position. Snapshot reads everything that existed at that LSN/binlog-position/oplog-timestamp; streaming picks up everything written since. Together they form a complete, ordered history.

There are three things every team eventually has to learn about this protocol:

Snapshots take time proportional to the data, not to the change rate. A 200 GB Postgres table on a 10k IOPS volume snapshots at roughly 50–100 MB/sec, so 30 minutes to an hour. While the snapshot runs, the WAL keeps growing — every write that happens during the snapshot is buffered in the slot. Why this is the production failure mode that pages on-call: if the snapshot takes longer than max_slot_wal_keep_size allows, Postgres drops the slot to protect free disk and the connector restarts the entire snapshot. Razorpay's first Debezium-on-payments rollout in 2022 hit exactly this — a 400 GB ledger table and a misconfigured WAL retention. They fixed it with parallel chunked snapshots (Debezium 2.0+) and max_slot_wal_keep_size = 200GB.

Incremental snapshots replaced blocking snapshots in 1.6. The old behaviour was "lock everything, snapshot, unlock". The 1.6 release (2021) introduced incremental snapshots inspired by Netflix's DBLog paper — the connector slices the table into chunks by primary key, snapshots each chunk via SELECT, and interleaves chunk reads with streaming WAL events. The streaming side de-duplicates: if a row appears in both the snapshot and the streaming, the streaming version wins. The snapshot can pause and resume across connector restarts, making 100-billion-row tables feasible.
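A toy version of the chunk/stream merge rule, with hypothetical names — the real logic lives in Debezium's incremental-snapshot code and works over watermark signal events, not in-memory lists:

```python
def merge_chunk(chunk_rows, window_events):
    """DBLog-style dedup rule: drop a snapshot row if its key saw a
    streamed change inside the chunk's low/high watermark window,
    because the streamed (newer) version wins."""
    touched = {event["key"] for event in window_events}
    return [row for row in chunk_rows if row["key"] not in touched]

chunk = [{"key": 1, "amount": 100}, {"key": 2, "amount": 200}]
stream = [{"key": 2, "op": "u", "amount": 250}]
merge_chunk(chunk, stream)
# key 2's snapshot read is discarded; its newer state arrives via the stream
```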

Read-only snapshots in 2.5+. Debezium 2.5 added a signal-only-incremental-snapshot mode that doesn't require write privileges on the source — useful when DBAs refuse to give Debezium a writable user. The connector signals via a Kafka topic instead of a database signalling table.

Build it: configuring a Postgres connector

Here is the smallest realistic Debezium-on-Postgres deployment for a payments service. The connector is configured as a JSON object posted to the Connect REST API:

# Assume a running Kafka Connect worker on port 8083, plus a Postgres
# replica with `wal_level = logical` and a user `debezium` with REPLICATION.
# Curl the Connect REST API to create a connector for the `payments` table
# of a Razorpay-style payments service.

curl -X POST http://connect:8083/connectors -H 'Content-Type: application/json' -d '{
  "name": "razorpay-payments-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "pg-replica.razorpay.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${file:/etc/dbz/secrets.properties:pg-pass}",
    "database.dbname": "razorpay",
    "topic.prefix": "razorpay",
    "schema.include.list": "public",
    "table.include.list": "public.payments,public.refunds,public.merchants",
    "plugin.name": "pgoutput",
    "publication.name": "dbz_razorpay_pub",
    "slot.name": "dbz_razorpay_slot",
    "snapshot.mode": "initial",
    "incremental.snapshot.chunk.size": "10240",
    "heartbeat.interval.ms": "30000",
    "heartbeat.action.query": "INSERT INTO dbz_heartbeat (id, ts) VALUES (1, NOW()) ON CONFLICT (id) DO UPDATE SET ts = NOW()",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}'

A representative response and the topics it creates a few seconds later:

HTTP/1.1 201 Created
{"name":"razorpay-payments-source","config":{...},"tasks":[{"connector":"razorpay-payments-source","task":0}],"type":"source"}

# Topics that materialise:
$ kafka-topics --list | grep razorpay
razorpay.public.payments
razorpay.public.refunds
razorpay.public.merchants
__debezium-heartbeat.razorpay  # heartbeat topic

A walkthrough of the lines that earn their place: database.password is pulled from a file via Connect's config provider, so the secret never appears in the REST payload or the connect-configs topic; plugin.name = pgoutput uses the server's built-in logical decoding plugin instead of a compiled extension; slot.name and publication.name pin the server-side state the connector owns, and must be unique per connector or two connectors will fight over one slot; snapshot.mode = initial snapshots on first start only, then streams; the heartbeat pair keeps the slot's confirmed_flush_lsn advancing even when the captured tables are idle; the unwrap transform flattens the envelope for row-shaped sinks; and the Avro converters push every schema into the registry rather than repeating it in each message.

The whole connector is ~30 lines of config. The Postgres-specific guarantees behind it — slot management, snapshot consistency, WAL parsing — are tens of thousands of lines of Java in the Debezium repo.

How offsets and restarts actually work

A connector dies. The pod gets OOM-killed. The worker rebalances. Debezium has to come back to exactly where it left off, with no duplicates the sink can't dedup, no gaps. The Kafka Connect framework provides the mechanism; Debezium provides the source-specific position structure.

Three pieces of state survive a restart: the source position (LSN, binlog file+position or GTID set, resume token) committed to the connect-offsets topic; the schema history topic, for connectors like MySQL that must replay DDL to reconstruct table definitions at any past binlog position; and the source-side durable state itself, namely the replication slot, the retained binlog, and the oplog window.

On restart, the connector reads the offset, asks the source to resume from that position (slot from LSN, binlog reader from file+pos, change stream from token), replays the schema history topic to position, and resumes emitting events. Why this works without external coordination: every position type is monotonic and durable on the source side. The slot's confirmed_flush_lsn is durable in Postgres' control file. The binlog file rotates but the GTID set is durable in mysql.gtid_executed. The Mongo resume token is durable in the oplog. Connect's offset commit just records what the source has already committed; nothing is two-phase.

A subtle bug class: at-least-once delivery, not exactly-once. Debezium emits an event, Connect commits the offset asynchronously (default every 60 seconds), and if the worker crashes between event emit and offset commit the event re-emits on restart. Consumers must dedup. This is fine — every CDC sink in the modern stack (Snowflake streams, Iceberg merge-on-read, Materialize) is idempotent on (source.lsn, before/after) or on natural keys. See /wiki/idempotent-producers-and-sequence-numbers.
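A sink-side dedup sketch under those assumptions — an in-memory dict and set stand in for real sink state, and (source.lsn, primary key) is the dedup key; deletes (after = null) would need their own branch:

```python
def apply_once(sink, seen, event):
    """Apply a Debezium insert/update exactly once per (lsn, id),
    tolerating the at-least-once replay after a crashed offset commit."""
    key = (event["source"]["lsn"], event["after"]["id"])
    if key in seen:
        return False  # replayed event: already applied, skip it
    seen.add(key)
    sink[event["after"]["id"]] = event["after"]  # upsert the row
    return True

sink, seen = {}, set()
event = {"source": {"lsn": 24138842616},
         "after": {"id": 9281, "status": "captured"}}
apply_once(sink, seen, event)   # applied
apply_once(sink, seen, event)   # duplicate delivery: ignored
```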

Common confusions

Going deeper

Single Message Transforms (SMTs) — the cheap way to reshape

SMTs are tiny per-message functions configured in the connector. They run inside the Connect worker, between Debezium's emit and Kafka's produce. The shipped library covers most common reshapes: ExtractNewRecordState (unwrap), RegexRouter (rename topics), MaskField (PII redaction), Filter (drop events matching a predicate). For a Razorpay-style PII regime where pan_number and aadhaar_last4 should never leave the source, MaskField keyed on those columns is a one-line addition. SMTs are not a full streaming framework — anything that needs joins or aggregations belongs in Flink or Spark Streaming downstream — but for "shape the row before Kafka" they save building a separate processor.
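A sketch of that MaskField addition, chained after unwrap so the fields are already top-level. It uses Kafka Connect's shipped org.apache.kafka.connect.transforms.MaskField; the column names are the hypothetical Razorpay ones, and the replacement option needs a reasonably recent Connect:

```json
"transforms": "unwrap,mask",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
"transforms.mask.fields": "pan_number,aadhaar_last4",
"transforms.mask.replacement": "****"
```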

The signal table and ad-hoc snapshots

Beyond the initial snapshot, a Debezium connector can be told at runtime to re-snapshot a specific table — useful when a sink lost data and needs a re-bootstrap of one table without re-bootstrapping the whole connector. The mechanism: a small debezium_signal table on the source. The operator inserts a row like {"type":"execute-snapshot","data":{"data-collections":["public.refunds"]}}. Debezium picks up the insert through its own change stream, recognises the signal, and runs an incremental snapshot of the named tables interleaved with normal streaming. From 2.5+ the same can be triggered via a Kafka topic, removing the source-side write requirement.
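The operator-side insert looks roughly like this, assuming the conventional three-column signal table (id, type, data) registered with the connector via signal.data.collection:

```sql
-- Ad-hoc incremental snapshot of one table, triggered from the source side.
INSERT INTO public.debezium_signal (id, type, data)
VALUES ('adhoc-refunds-1',              -- arbitrary unique id for the signal
        'execute-snapshot',
        '{"data-collections": ["public.refunds"], "type": "incremental"}');
```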

Connector restart, lost slots, and the WAL-fills-the-disk failure

The single biggest production incident pattern: the Debezium connector fails (OOM, network partition, deployment bug), nobody notices for 6 hours, the slot does not advance, WAL grows by 200 GB, and Postgres goes into emergency archiving mode or runs out of disk. Mitigations: alert on pg_replication_slots.confirmed_flush_lsn lag in bytes (page if > 5 GB); set max_slot_wal_keep_size to a finite ceiling so the slot is auto-dropped before the disk fills; auto-restart the Connect worker on failure (Kubernetes liveness probe on the REST API). Most Bengaluru shops that run Debezium in production discovered every one of these the hard way.
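The alerting query is one line of catalog SQL; the slot name matches the build section above, and the threshold is a judgment call:

```sql
-- Bytes of WAL the slot is holding back; page when this passes ~5 GB.
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
                                      confirmed_flush_lsn)) AS lag_pretty
FROM pg_replication_slots
WHERE slot_name = 'dbz_razorpay_slot';
```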

Why WePay and Shopify wrote about this — the open-source CDC playbook

Debezium became the default open-source CDC because two companies wrote about how they used it: WePay (acquired by JPMorgan, 2017 blog post on running Debezium with MySQL at scale) and Shopify (multiple posts, 2019–2022, on Debezium Postgres for their data lake). The posts covered the operational reality nobody else was writing about — slot management, snapshot duration on TB-scale tables, schema evolution friction. Indian-scale equivalents have followed: Razorpay's engineering blog on payments CDC (2023), Flipkart's data platform posts on bootstrapping a billion-row catalogue. The pattern repeats: every team that moves to Debezium learns the same five lessons six months in.

Where this leads next

The next chapter, /wiki/schema-evolution-across-the-cdc-boundary, takes the schema problem head-on: what happens when a developer adds a column to payments on a Tuesday and the Iceberg sink three hops downstream has the old schema cached.

After that, /wiki/snapshot-cdc-the-bootstrapping-problem revisits the snapshot phase from a different angle — not "how does Debezium do it" but "what does a sink need to receive to safely apply both snapshot rows and streaming rows without double-counting".

Two adjacent crosslinks worth following:

By the end of Build 11 you will have a complete CDC story: three sources (Postgres, MySQL, Mongo), one production wrapper (Debezium), one schema-evolution discipline, one bootstrapping protocol. The same shape repeats for every database that joins the platform — Oracle (LogMiner), SQL Server (CDC tables), Cassandra (CDC log) — only the source-specific decoder changes.

References