Debezium's architecture

The previous three chapters showed three completely different CDC sources: Postgres slots and pgoutput, the MySQL binlog with its row events, MongoDB change streams over the oplog. A team running all three would need three custom consumers, three resumption schemes, three flavours of schema inference, three failure modes. Debezium's reason to exist is to collapse those three into one uniform model: one Kafka topic per table, one offset model, one snapshot protocol, one schema-registry contract. This chapter is the architecture tour: what spawns when you start a Debezium connector, where the offsets live, how snapshot-then-stream is glued together, and which seams crack first under Indian-scale load.

Debezium is a set of source connectors that run inside a Kafka Connect cluster, one connector per upstream database, each emitting one Kafka topic per table. Underneath the abstraction it does three jobs: it owns the upstream replication primitive (slot, binlog position, resume token), it runs a snapshot-then-stream protocol so consumers never see a partial history, and it serialises every change as an envelope with before, after, source, and op fields whose schema is tracked by a registry. The hard production problems are not Debezium itself — they are connector restart, snapshot duration on large tables, and schema evolution.

What "Debezium" actually is

Debezium is not a daemon you install. It is a JAR — a set of Kafka Connect source connectors — that you load into a Kafka Connect worker. Kafka Connect is the host. Debezium plugs in. The split matters because every operational concern that feels like a Debezium problem (worker memory, connector restart, dead-letter queues) is actually a Kafka Connect problem, and the documentation lives in two different places.

A running Debezium deployment has these moving parts:

[Figure: Debezium runtime topology — what's running where. A source database (Postgres / MySQL / Mongo, exposing its replication slot / binlog / oplog) on the left; a Kafka Connect worker cluster in the middle (JVM, 1–N nodes, group-coordinated) hosting the Debezium connector task (poll() loop, snapshot+stream), the Connect framework (REST, offset commit, retries), and Single Message Transforms (unwrap, route, mask PII, add headers); a Kafka cluster on the right with the internal connect-configs / connect-offsets / connect-status topics, the schema-history topic, and one change-event topic per table.]
The runtime is three boxes: a source database speaking its native replication protocol, a Kafka Connect worker hosting the Debezium connector task, and a Kafka cluster receiving per-table change-event topics plus the internal connector-state topics.

The connector is the only Debezium-authored code in the picture. Everything else — the worker process, the REST API, the offset commit machinery, the dead-letter queue support, the rebalance protocol — is Kafka Connect, a separate Apache project. Why this distinction is more than nitpicking: when a Debezium task fails to start with IllegalStateException from WorkerSourceTask you are debugging Kafka Connect, not Debezium. The relevant logs and tunables (offset.flush.interval.ms, producer.override.compression.type, errors.tolerance) are documented under Kafka Connect. Treating "Debezium broke" as one undifferentiated problem space is the single most common time sink for engineers new to the stack.
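When a task does fail, the first diagnostic stop is the Connect REST API, not Debezium's logs. A sketch of the usual checks, assuming a worker at connect:8083 and the connector name used later in this chapter:

```shell
# Ask the Connect worker for connector and per-task state; a FAILED task
# includes the Java stack trace in the "trace" field of the response.
curl -s http://connect:8083/connectors/razorpay-payments-source/status

# Restart just the failed task, without touching the connector's offsets
# (no resnapshot is triggered by a task restart):
curl -s -X POST \
  http://connect:8083/connectors/razorpay-payments-source/tasks/0/restart
```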

The change-event envelope

Every Debezium change event is a structured record with the same envelope, regardless of source. This uniformity is the single most useful property of the whole framework — a Flink job that consumes Debezium events from a Postgres table and from a MySQL table can use the same parser. The envelope (Avro-encoded by default, JSON-encoded if you choose JsonConverter) looks like:

{
  "before": null,
  "after": {
    "id": 9281,
    "merchant_id": "razorpay-test",
    "amount": 49900,
    "status": "captured",
    "created_at": 1714032501412
  },
  "source": {
    "version": "2.5.4.Final",
    "connector": "postgresql",
    "name": "razorpay",
    "ts_ms": 1714032501421,
    "db": "razorpay",
    "schema": "public",
    "table": "payments",
    "txId": 8814522,
    "lsn": 24138842616,
    "snapshot": "false"
  },
  "op": "c",
  "ts_ms": 1714032501500
}

The five top-level fields: before is the row state before the change (null for inserts); after is the row state after it (null for deletes); source is provenance (connector version, database, schema, table, transaction id, LSN, and whether the event came from a snapshot); op is the operation (c create, u update, d delete, r snapshot read); and the outer ts_ms is when the connector processed the event, as opposed to source.ts_ms, when the database committed it.

A typical sink does not want the envelope — it wants the row. Debezium ships a Single Message Transform called ExtractNewRecordState (often called "unwrap") that flattens after to the top level for inserts/updates and emits a tombstone for deletes. Why this transform is so commonly applied that many teams forget it exists: most downstream sinks (a Postgres mirror, a Snowflake stream, a Materialize view) want one row per event. The full envelope is only useful when the consumer is doing event-sourcing-shaped logic — auditing, rebuilds, time-travel queries — where the before and source.lsn actually carry signal.
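After the unwrap transform, the insert event shown earlier collapses to just the row; the envelope fields are gone, and a delete becomes a Kafka tombstone for the same key:

```json
{
  "id": 9281,
  "merchant_id": "razorpay-test",
  "amount": 49900,
  "status": "captured",
  "created_at": 1714032501412
}
```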

Snapshot then stream — the bootstrapping protocol

A new connector arriving at a database that already has a billion rows of history cannot just start at the current LSN. The downstream consumer would only see writes from "now"; it would never have a copy of any row that was inserted last year. Debezium's solution — and the solution every CDC tool eventually converges on — is snapshot then stream:

  1. Open a transaction (or the equivalent on each source) at a known position.
  2. Read every row of every captured table, emitting them as op: r events.
  3. Switch to streaming from the position recorded at step 1.
  4. Continue forever.

The protocol's correctness condition is that the position recorded at step 1 must be a position where the snapshot's view of the world is consistent with the streaming events that follow. On Postgres, this is enforced by creating a logical replication slot and immediately exporting a snapshot from inside a REPEATABLE READ transaction at the slot's consistent_point. On MySQL, the connector takes a FLUSH TABLES WITH READ LOCK (or, since 1.6's incremental snapshots, a lock-free chunked read), records the binlog position, then releases the lock. On MongoDB, the connector resolves the cluster's current resume token, reads every document in the captured collections, then opens the change stream from the recorded token.
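On Postgres the two-step dance is visible in plain SQL. A sketch of what the connector does under the hood (the snapshot name and table are illustrative; on PostgreSQL 15+ the export option is spelled (SNAPSHOT 'export') rather than EXPORT_SNAPSHOT):

```sql
-- Replication connection: create the slot and export its snapshot.
CREATE_REPLICATION_SLOT dbz_razorpay_slot LOGICAL pgoutput EXPORT_SNAPSHOT;
-- returns: slot_name, consistent_point (an LSN), snapshot_name

-- Ordinary SQL connection: read the tables as of exactly that point.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';  -- snapshot_name from above
SELECT * FROM public.payments;   -- each row emitted as an op: "r" event
COMMIT;

-- Streaming then starts at consistent_point: everything the snapshot saw
-- committed before it, everything it missed comes through the slot after it.
```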

[Figure: Snapshot then stream — the position-locking moment. A timeline of source writes; at LSN 24138842616 Debezium records the position and starts Phase 1, the snapshot (SELECT * FROM payments → r-events); Phase 2 decodes the WAL from that same LSN → c/u/d events. Correctness condition: the recorded position must be replayable later — Postgres: REPEATABLE READ + the slot's consistent_point; MySQL: FLUSH TABLES + binlog position; Mongo: resume token. Failure mode: the snapshot takes 4h, the source rotates WAL/binlog/oplog past the position, and the connector resnapshots from zero.]
The two phases overlap on the timeline at the recorded position. Snapshot reads everything that existed at that LSN/binlog-position/oplog-timestamp; streaming picks up everything written since. Together they form a complete, ordered history.

There are three things every team eventually has to learn about this protocol:

Snapshots take time proportional to the data, not to the change rate. A 200 GB Postgres table on a 10k IOPS volume snapshots at roughly 50–100 MB/sec, so 30 minutes to an hour. While the snapshot runs, the WAL keeps growing — every write that happens during the snapshot is buffered in the slot. Why this is the production failure mode that pages on-call: if the snapshot takes longer than max_slot_wal_keep_size allows, Postgres drops the slot to protect free disk and the connector restarts the entire snapshot. Razorpay's first Debezium-on-payments rollout in 2022 hit exactly this — a 400 GB ledger table and a misconfigured WAL retention. They fixed it with parallel chunked snapshots (Debezium 2.0+) and max_slot_wal_keep_size = 200GB.

Incremental snapshots replaced blocking snapshots in 1.6. The old behaviour was "lock everything, snapshot, unlock". The 1.6 release (2021) introduced incremental snapshots inspired by Netflix's DBLog paper — the connector slices the table into chunks by primary key, snapshots each chunk via SELECT, and interleaves chunk reads with streaming WAL events. The streaming side de-duplicates: if a row appears in both the snapshot and the streaming, the streaming version wins. The snapshot can pause and resume across connector restarts, making 100-billion-row tables feasible.
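A toy version of the chunk/stream merge rule, with hypothetical names — the real logic lives in Debezium's incremental-snapshot code and works over watermark signal events, not in-memory lists:

```python
def merge_chunk(chunk_rows, window_events):
    """DBLog-style dedup rule: drop a snapshot row if its key saw a
    streamed change inside the chunk's low/high watermark window,
    because the streamed (newer) version wins."""
    touched = {event["key"] for event in window_events}
    return [row for row in chunk_rows if row["key"] not in touched]

chunk = [{"key": 1, "amount": 100}, {"key": 2, "amount": 200}]
stream = [{"key": 2, "op": "u", "amount": 250}]
merge_chunk(chunk, stream)
# key 2's snapshot read is discarded; its newer state arrives via the stream
```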

Read-only snapshots in 2.5+. Debezium 2.5 added a signal-only-incremental-snapshot mode that doesn't require write privileges on the source — useful when DBAs refuse to give Debezium a writable user. The connector signals via a Kafka topic instead of a database signalling table.

Build it: configuring a Postgres connector

Here is the smallest realistic Debezium-on-Postgres deployment for a payments service. The connector is configured as a JSON object posted to the Connect REST API:

# Assume a running Kafka Connect worker on port 8083, plus a Postgres
# replica with `wal_level = logical` and a user `debezium` with REPLICATION.
# Curl the Connect REST API to create a connector for the `payments` table
# of a Razorpay-style payments service.

curl -X POST http://connect:8083/connectors -H 'Content-Type: application/json' -d '{
  "name": "razorpay-payments-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "pg-replica.razorpay.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${file:/etc/dbz/secrets.properties:pg-pass}",
    "database.dbname": "razorpay",
    "topic.prefix": "razorpay",
    "schema.include.list": "public",
    "table.include.list": "public.payments,public.refunds,public.merchants",
    "plugin.name": "pgoutput",
    "publication.name": "dbz_razorpay_pub",
    "slot.name": "dbz_razorpay_slot",
    "snapshot.mode": "initial",
    "incremental.snapshot.chunk.size": "10240",
    "heartbeat.interval.ms": "30000",
    "heartbeat.action.query": "INSERT INTO dbz_heartbeat (id, ts) VALUES (1, NOW()) ON CONFLICT (id) DO UPDATE SET ts = NOW()",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}'

A representative response and the topics it creates a few seconds later:

HTTP/1.1 201 Created
{"name":"razorpay-payments-source","config":{...},"tasks":[{"connector":"razorpay-payments-source","task":0}],"type":"source"}

# Topics that materialise:
$ kafka-topics --list | grep razorpay
razorpay.public.payments
razorpay.public.refunds
razorpay.public.merchants
__debezium-heartbeat.razorpay  # heartbeat topic

A walkthrough of the lines that earn their place: database.password is pulled from a file via Connect's config provider, so the secret never appears in the REST payload or the connect-configs topic; plugin.name = pgoutput uses the server's built-in logical decoding plugin instead of a compiled extension; slot.name and publication.name pin the server-side state the connector owns, and must be unique per connector or two connectors will fight over one slot; snapshot.mode = initial snapshots on first start only, then streams; the heartbeat pair keeps the slot's confirmed_flush_lsn advancing even when the captured tables are idle; the unwrap transform flattens the envelope for row-shaped sinks; and the Avro converters push every schema into the registry rather than repeating it in each message.

The whole connector is ~30 lines of config. The Postgres-specific guarantees behind it — slot management, snapshot consistency, WAL parsing — are tens of thousands of lines of Java in the Debezium repo.

How offsets and restarts actually work

A connector dies. The pod gets OOM-killed. The worker rebalances. Debezium has to come back to exactly where it left off, with no duplicates the sink can't dedup, no gaps. The Kafka Connect framework provides the mechanism; Debezium provides the source-specific position structure.

Three pieces of state survive a restart: the source position (LSN, binlog file+position or GTID set, resume token) committed to the connect-offsets topic; the schema history topic, for connectors like MySQL that must replay DDL to reconstruct table definitions at any past binlog position; and the source-side durable state itself, namely the replication slot, the retained binlog, and the oplog window.

On restart, the connector reads the offset, asks the source to resume from that position (slot from LSN, binlog reader from file+pos, change stream from token), replays the schema history topic to position, and resumes emitting events. Why this works without external coordination: every position type is monotonic and durable on the source side. The slot's confirmed_flush_lsn is durable in Postgres' control file. The binlog file rotates but the GTID set is durable in mysql.gtid_executed. The Mongo resume token is durable in the oplog. Connect's offset commit just records what the source has already committed; nothing is two-phase.

A subtle bug class: at-least-once delivery, not exactly-once. Debezium emits an event, Connect commits the offset asynchronously (default every 60 seconds), and if the worker crashes between event emit and offset commit the event re-emits on restart. Consumers must dedup. This is fine — every CDC sink in the modern stack (Snowflake streams, Iceberg merge-on-read, Materialize) is idempotent on (source.lsn, before/after) or on natural keys. See /wiki/idempotent-producers-and-sequence-numbers.
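A sink-side dedup sketch under those assumptions — an in-memory dict and set stand in for real sink state, and (source.lsn, primary key) is the dedup key; deletes (after = null) would need their own branch:

```python
def apply_once(sink, seen, event):
    """Apply a Debezium insert/update exactly once per (lsn, id),
    tolerating the at-least-once replay after a crashed offset commit."""
    key = (event["source"]["lsn"], event["after"]["id"])
    if key in seen:
        return False  # replayed event: already applied, skip it
    seen.add(key)
    sink[event["after"]["id"]] = event["after"]  # upsert the row
    return True

sink, seen = {}, set()
event = {"source": {"lsn": 24138842616},
         "after": {"id": 9281, "status": "captured"}}
apply_once(sink, seen, event)   # applied
apply_once(sink, seen, event)   # duplicate delivery: ignored
```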

Common confusions

Going deeper

Single Message Transforms (SMTs) — the cheap way to reshape

SMTs are tiny per-message functions configured in the connector. They run inside the Connect worker, between Debezium's emit and Kafka's produce. The shipped library covers most common reshapes: ExtractNewRecordState (unwrap), RegexRouter (rename topics), MaskField (PII redaction), Filter (drop events matching a predicate). For a Razorpay-style PII regime where pan_number and aadhaar_last4 should never leave the source, MaskField keyed on those columns is a one-line addition. SMTs are not a full streaming framework — anything that needs joins or aggregations belongs in Flink or Spark Streaming downstream — but for "shape the row before Kafka" they save building a separate processor.
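A sketch of that MaskField addition, chained after unwrap so the fields are already top-level. It uses Kafka Connect's shipped org.apache.kafka.connect.transforms.MaskField; the column names are the hypothetical Razorpay ones, and the replacement option needs a reasonably recent Connect:

```json
"transforms": "unwrap,mask",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
"transforms.mask.fields": "pan_number,aadhaar_last4",
"transforms.mask.replacement": "****"
```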

The signal table and ad-hoc snapshots

Beyond the initial snapshot, a Debezium connector can be told at runtime to re-snapshot a specific table — useful when a sink lost data and needs a re-bootstrap of one table without re-bootstrapping the whole connector. The mechanism: a small debezium_signal table on the source. The operator inserts a row like {"type":"execute-snapshot","data":{"data-collections":["public.refunds"]}}. Debezium picks up the insert through its own change stream, recognises the signal, and runs an incremental snapshot of the named tables interleaved with normal streaming. From 2.5+ the same can be triggered via a Kafka topic, removing the source-side write requirement.
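The operator-side insert looks roughly like this, assuming the conventional three-column signal table (id, type, data) registered with the connector via signal.data.collection:

```sql
-- Ad-hoc incremental snapshot of one table, triggered from the source side.
INSERT INTO public.debezium_signal (id, type, data)
VALUES ('adhoc-refunds-1',              -- arbitrary unique id for the signal
        'execute-snapshot',
        '{"data-collections": ["public.refunds"], "type": "incremental"}');
```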

Connector restart, lost slots, and the WAL-fills-the-disk failure

The single biggest production incident pattern: the Debezium connector fails (OOM, network partition, deployment bug), nobody notices for 6 hours, the slot does not advance, WAL grows by 200 GB, and Postgres goes into emergency archiving mode or runs out of disk. Mitigations: alert on pg_replication_slots.confirmed_flush_lsn lag in bytes (page if > 5 GB); set max_slot_wal_keep_size to a finite ceiling so the slot is auto-dropped before the disk fills; auto-restart the Connect worker on failure (Kubernetes liveness probe on the REST API). Most Bengaluru shops that run Debezium in production discovered every one of these the hard way.
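The alerting query is one line of catalog SQL; the slot name matches the build section above, and the threshold is a judgment call:

```sql
-- Bytes of WAL the slot is holding back; page when this passes ~5 GB.
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
                                      confirmed_flush_lsn)) AS lag_pretty
FROM pg_replication_slots
WHERE slot_name = 'dbz_razorpay_slot';
```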

Why WePay and Shopify wrote about this — the open-source CDC playbook

Debezium became the default open-source CDC because two companies wrote about how they used it: WePay (acquired by JPMorgan, 2017 blog post on running Debezium with MySQL at scale) and Shopify (multiple posts, 2019–2022, on Debezium Postgres for their data lake). The posts covered the operational reality nobody else was writing about — slot management, snapshot duration on TB-scale tables, schema evolution friction. Indian-scale equivalents have followed: Razorpay's engineering blog on payments CDC (2023), Flipkart's data platform posts on bootstrapping a billion-row catalogue. The pattern repeats: every team that moves to Debezium learns the same five lessons six months in.

Where this leads next

The next chapter, /wiki/schema-evolution-across-the-cdc-boundary, takes the schema problem head-on: what happens when a developer adds a column to payments on a Tuesday and the Iceberg sink three hops downstream has the old schema cached.

After that, /wiki/snapshot-cdc-the-bootstrapping-problem revisits the snapshot phase from a different angle — not "how does Debezium do it" but "what does a sink need to receive to safely apply both snapshot rows and streaming rows without double-counting".

Two adjacent crosslinks worth following:

By the end of Build 11 you will have a complete CDC story: three sources (Postgres, MySQL, Mongo), one production wrapper (Debezium), one schema-evolution discipline, one bootstrapping protocol. The same shape repeats for every database that joins the platform — Oracle (LogMiner), SQL Server (CDC tables), Cassandra (CDC log) — only the source-specific decoder changes.

References