Change Data Capture: Debezium and Logical Decoding
A Razorpay payment row flips from pending to captured at 11:42:03.117. By 11:42:03.118 the search index, the fraud model, the merchant dashboard cache, and the BigQuery warehouse all need to know — without anyone polling, without dual writes, and without a trigger that can take down the primary. Change Data Capture is the bridge that makes this possible, and it is built on a single audacious idea: the database already wrote a durable, ordered log of the change. Just go read it.
Change Data Capture turns an OLTP database (Postgres, MySQL) into a streaming source by tailing its replication log — pgoutput logical decoding for Postgres, row-format binlog for MySQL. Debezium is the connector that does this and writes each change as a Kafka record. The hard parts are not the parsing: they are snapshot-plus-stream initial loads, schema evolution, and replication slot operations that do not retain WAL forever and crash the primary.
What CDC is actually for
You have a Postgres database holding the truth about orders. You also have:
- An OpenSearch cluster that needs every order indexed for the merchant search box.
- A Redis cache holding the latest order status keyed by order_id.
- A BigQuery warehouse running daily revenue dashboards for finance.
- A Flink job computing real-time fraud features from the last five minutes of activity.
- A webhook fan-out that fires on every status change to merchant integrations.
Five downstream systems, one source of truth. The naive ways to keep them in sync all fail in production:
Polling every minute with SELECT * FROM orders WHERE updated_at > :last_seen misses deletes (a deleted row matches no predicate), collapses two updates between polls into one, and puts constant scan load on the primary that grows with your data and your freshness target. Why polling rots: latency floor is your poll interval, deletes are invisible because the row is gone, and the load on the writer scales with how fresh consumers want to be — exactly backwards from what you want.
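The delete-blindness and update-collapsing are easy to demonstrate with a toy simulation (plain dicts standing in for the table and the poller, not any real client library):

```python
def poll_changes(table, last_seen):
    """Everything SELECT * FROM orders WHERE updated_at > :last_seen can return."""
    return [row for row in table.values() if row["updated_at"] > last_seen]

table = {
    1: {"order_id": 1, "status": "pending", "updated_at": 100},
    2: {"order_id": 2, "status": "captured", "updated_at": 101},
}
last_seen = 101  # the poller is fully caught up

# Between two polls: order 1 goes pending -> authorized -> captured,
# and order 2 is deleted outright.
table[1] = {"order_id": 1, "status": "authorized", "updated_at": 105}
table[1] = {"order_id": 1, "status": "captured", "updated_at": 110}
del table[2]

seen = poll_changes(table, last_seen)
# Two updates collapsed into one row; the delete is invisible entirely.
assert seen == [{"order_id": 1, "status": "captured", "updated_at": 110}]
```

The poller can never learn that order 2 existed and was deleted — the row that would match the predicate is gone.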
Dual writes — the application writes to Postgres, then to Kafka — are worse. A network blip between the two leaves the systems permanently inconsistent. There is no atomic boundary across two systems without a distributed transaction protocol you do not want to operate at scale. Riya from the merchant dashboard team finds, three months in, that 0.04% of "captured" payments never made it to Kafka.
Triggers are tighter but rigid. Postgres AFTER INSERT OR UPDATE OR DELETE triggers can call a function that writes to an outbox table, but every change pays the cost synchronously, and a slow downstream slows every commit. They also do not survive logical replication or major-version upgrades cleanly.
The good way is log-based CDC. Postgres already writes a durable, ordered Write-Ahead Log of every change — that is how replication works. If you can tail that log, decode it, and publish the decoded events to Kafka, you get every change, in commit order, with no extra load on the primary (the log was being written anyway), and bounded latency. The database is now a streaming source. Debezium is the canonical implementation of this idea.
How Postgres logical decoding works
Postgres has two log levels. The physical WAL is byte-level — "page 4242 in file 16384/16385/2604 had bytes 1024–1196 changed to these values." That is what streaming replication ships, and it requires the replica to have an identical on-disk layout. Useless for downstream systems.
Logical decoding is a layer on top. You ask Postgres "give me the WAL, but as logical row-level events: this insert into orders, that update on payments, that delete on cart_items." Postgres reads the physical WAL, runs it through an output plugin that knows how to translate page changes into row events, and streams them to whoever is subscribed.
To use logical decoding you need three things:
- wal_level = logical in postgresql.conf. The default is replica. Changing it requires a restart and writes slightly more WAL (roughly 5–15% more, because logical decoding needs the old row image for updates and deletes).
- A replication slot. A named cursor that says "I am consumer X, the last LSN I confirmed was Y, do not garbage-collect WAL newer than Y." Slots are persistent on disk.
- A publication. A list of tables you want changes for, optionally filtered by operation (INSERT, UPDATE, DELETE, TRUNCATE). This is what CREATE PUBLICATION builds.
The default output plugin since Postgres 10 is pgoutput — the same plugin used by built-in logical replication between Postgres instances. Debezium uses pgoutput by default in modern versions; older deployments used wal2json or decoderbufs.
The LSN — Log Sequence Number — is a 64-bit byte offset into the WAL stream, written like 0/16B3748: two hexadecimal numbers, the high and low 32 bits of the position. Every event has one. The slot remembers the last LSN the consumer has confirmed. WAL older than the oldest active slot's confirmed LSN can be recycled. WAL newer than it cannot. Why this matters: a Debezium connector that crashes for two days while writes continue causes Postgres to retain two days of WAL on disk. If pg_wal/ fills the disk, the primary stops accepting writes. The slot is a debt the primary holds against the consumer.
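Because the LSN is just a 64-bit integer, slot lag is plain subtraction. A minimal sketch mirroring what pg_wal_lsn_diff computes (the helper names here are mine, not a library API):

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN like '0/16B3748' into its 64-bit byte offset.
    The part before the slash is the high 32 bits, the part after the low 32."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def slot_lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL the primary must retain for this slot — the number
    pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) reports."""
    return parse_lsn(current_wal_lsn) - parse_lsn(confirmed_flush_lsn)

assert parse_lsn("0/16B3748") == 0x16B3748
# A slot whose confirmed LSN is one full high-word increment (4 GiB) behind:
assert slot_lag_bytes("1/0", "0/0") == 2**32
assert slot_lag_bytes("0/16B3748", "0/16B0000") == 0x3748
```

Alarming on this number, fed from pg_replication_slots, is the single cheapest way to catch the disk-fill failure mode before it stops the primary.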
A working Debezium pipeline in 50 lines of config
Concrete is better than abstract. Here is an end-to-end pipeline: a Postgres orders table on a Hyderabad host, Debezium running as a Kafka Connect plugin, Kafka local to the connector, and a tiny Python consumer printing every change.
-- run in psql as a superuser
ALTER SYSTEM SET wal_level = 'logical';
-- restart Postgres after this; wal_level changes need a restart.
CREATE TABLE orders (
order_id BIGSERIAL PRIMARY KEY,
merchant TEXT NOT NULL,
amount_p BIGINT NOT NULL, -- amount in paise; ₹1 = 100 paise
status TEXT NOT NULL,
updated_at TIMESTAMPTZ DEFAULT now()
);
CREATE PUBLICATION dbz_pub FOR TABLE orders;
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');
Now a Debezium Postgres connector configuration, posted to the Kafka Connect REST API:
{
"name": "orders-cdc",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "pg-hyderabad-1.internal",
"database.port": "5432",
"database.user": "debezium",
"database.password": "***",
"database.dbname": "razorpay_prod",
"topic.prefix": "rp1",
"plugin.name": "pgoutput",
"publication.name": "dbz_pub",
"slot.name": "debezium_slot",
"table.include.list": "public.orders",
"snapshot.mode": "initial",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter"
}
}
Run INSERT INTO orders (merchant, amount_p, status) VALUES ('cafe-coimbatore', 25000, 'pending'); and a record lands on the Kafka topic rp1.public.orders. Here is the value, formatted (key is {"order_id": 1}):
{
"before": null,
"after": {
"order_id": 1,
"merchant": "cafe-coimbatore",
"amount_p": 25000,
"status": "pending",
"updated_at": "2026-04-25T11:42:03.117Z"
},
"source": {
"version": "2.6.1.Final",
"connector": "postgresql",
"name": "rp1",
"ts_ms": 1745581323117,
"snapshot": "false",
"db": "razorpay_prod",
"schema": "public",
"table": "orders",
"txId": 4827193,
"lsn": 23847362
},
"op": "c",
"ts_ms": 1745581323120
}
The schema is the Debezium envelope: before, after, source, op, ts_ms. op is c for create, u for update, d for delete, r for read (used during snapshots), t for truncate. For an update you get before and after populated; for a delete only before. The source.lsn is the LSN of the change in the source WAL. The source.snapshot flag tells you whether this row came from the initial snapshot or the streaming phase.
A consumer is now trivial. Here is one that materialises a fresh OpenSearch document on every change:
# orders_indexer.py — consume Debezium CDC into OpenSearch
from kafka import KafkaConsumer
import json, opensearchpy
es = opensearchpy.OpenSearch("https://opensearch-mum.internal:9200")
c = KafkaConsumer(
    "rp1.public.orders",
    bootstrap_servers="kafka-1.internal:9092",
    group_id="orders-indexer",
    enable_auto_commit=False,
    value_deserializer=lambda b: json.loads(b) if b is not None else None,
    key_deserializer=lambda b: json.loads(b),
)
for msg in c:
    if msg.value is None:
        continue  # tombstone after a delete (null value); nothing to apply
    op = msg.value["op"]
    if op in ("c", "u", "r"):
        doc = msg.value["after"]
        es.index(index="orders", id=doc["order_id"], body=doc)
    elif op == "d":
        es.delete(index="orders", id=msg.value["before"]["order_id"], ignore=[404])
    c.commit()  # only after the side effect succeeded
That is the whole shape of CDC in production. The key is idempotency by primary key: replaying the same record twice produces the same final state, because we always write the full after row to OpenSearch keyed by order_id. Combined with at-least-once delivery from Kafka, this gives effectively-once outcomes without exactly-once delivery semantics.
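The idempotency claim can be checked directly: applying the same envelope twice — say, after a crash between the side effect and the offset commit — converges to the same materialised state. A dict stands in for OpenSearch in this sketch:

```python
def apply(view: dict, event: dict) -> None:
    """Apply one Debezium-style envelope to a materialised view keyed by PK."""
    if event["op"] in ("c", "u", "r"):
        row = event["after"]
        view[row["order_id"]] = row          # full-row upsert by primary key
    elif event["op"] == "d":
        view.pop(event["before"]["order_id"], None)

events = [
    {"op": "c", "after": {"order_id": 1, "status": "pending"}},
    {"op": "u", "before": {"order_id": 1, "status": "pending"},
                "after": {"order_id": 1, "status": "captured"}},
]

once, replayed = {}, {}
for e in events:
    apply(once, e)
for e in events + events[-1:]:               # last event delivered twice
    apply(replayed, e)

assert once == replayed == {1: {"order_id": 1, "status": "captured"}}
```

The property holds because each event carries the full after row, not a diff — replaying it overwrites with identical data rather than double-applying a change.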
MySQL binlog row format and the wider connector family
MySQL has its own log: the binary log (binlog). It exists for replication, point-in-time recovery, and auditing. There are three formats:
- Statement-based (SBR). The binlog stores the SQL statement that ran (UPDATE orders SET status = 'captured' WHERE order_id < 1000). Replays may diverge — NOW(), RAND(), or non-deterministic triggers can produce different rows on the replica. Useless for CDC.
- Row-based (RBR). The binlog stores the actual row images: which row, before image, after image. Deterministic, replay-safe, exactly what CDC wants.
- Mixed. Statement when safe, row when not. Avoid for CDC — predictability matters more than binlog size.
Debezium's MySQL connector requires binlog_format = ROW and binlog_row_image = FULL (so updates carry the full before image, not just the changed columns). It connects to the MySQL server using the replication protocol — same protocol a replica would use — and reads binlog events: WRITE_ROWS_EVENT, UPDATE_ROWS_EVENT, DELETE_ROWS_EVENT. Each event is decoded into the same Debezium envelope you saw for Postgres. The position in the log is a (binlog_filename, binlog_position) pair, or a GTID set if GTIDs are enabled.
Beyond Postgres and MySQL, the same pattern shows up everywhere a database has a durable log:
- MongoDB: the oplog. Debezium has a MongoDB connector, but MongoDB also exposes change streams natively — you usually do not need Debezium.
- SQL Server: CDC tables and Change Tracking, fed by Debezium's SQL Server connector.
- Oracle: LogMiner, or XStream Out for licensed deployments. Painful, but it works.
- Cassandra: the commit log via Debezium's Cassandra connector — though most Cassandra users emit events from the application layer instead.
Every connector boils down to: find the database's log, get permission to tail it, decode it, hand it to Kafka. Different log shape, same shape of solution.
Snapshot plus stream: the initial-load problem
When you point Debezium at an existing table with five years of data and ten million rows, what happens? You cannot just start tailing the WAL — you would only see future changes, never the rows already in the table. You also cannot do a separate SELECT * and then start tailing — between the snapshot and the stream there is a gap, or worse, an overlap with double delivery.
The trick: freeze a single LSN as the boundary, snapshot at that LSN, then stream from the same LSN.
In Postgres, Debezium does this:
- Create the replication slot over a replication-protocol connection: CREATE_REPLICATION_SLOT debezium_slot LOGICAL pgoutput EXPORT_SNAPSHOT. This pins the slot to a consistent LSN and exports a snapshot ID like 00000003-0000001F-1.
- Open a regular transaction with ISOLATION LEVEL REPEATABLE READ, READ ONLY, and run SET TRANSACTION SNAPSHOT '00000003-0000001F-1'. The transaction now sees the database exactly as of the slot's LSN.
- In that transaction, SELECT * FROM orders and stream the rows to Kafka with op = 'r'.
- Commit the transaction. Now start consuming the slot, which begins exactly where the snapshot ended.
The transactional snapshot is the magic. Postgres guarantees that SELECT * inside the transaction sees the database state as of the snapshot LSN, and the slot starts streaming WAL records strictly after that LSN. No gap, no overlap. Why this works: the slot and the snapshot are coupled by a shared LSN. Without that coupling, rows committed between the SELECT and the slot creation would be missed; rows captured by both would be double-delivered. The MVCC snapshot is what closes the seam.
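The no-gap, no-overlap property can be sketched as a simulation: fix a boundary LSN, deliver the table state at that LSN as op = 'r' rows, then deliver only WAL records strictly after it (toy data structures, not Debezium internals):

```python
changes = [  # (lsn, op, row) — the full change history of one table
    (10, "c", {"id": 1, "status": "pending"}),
    (20, "c", {"id": 2, "status": "pending"}),
    (30, "u", {"id": 1, "status": "captured"}),
    (40, "u", {"id": 2, "status": "captured"}),
]
boundary = 30  # slot created here; snapshot sees everything committed <= 30

def table_at(lsn):
    """MVCC stand-in: the table state visible at a given LSN."""
    state = {}
    for l, op, row in changes:
        if l <= lsn:
            state[row["id"]] = row
    return state

snapshot = [("r", row) for row in table_at(boundary).values()]   # phase 1
stream = [(op, row) for l, op, row in changes if l > boundary]   # phase 2

view = {}
for op, row in snapshot + stream:
    view[row["id"]] = row

assert view == table_at(changes[-1][0])   # consumer converges to source truth
assert len(snapshot) + len(stream) == 3   # each change seen exactly once
```

Decouple the two phases — snapshot at one LSN, stream from another — and either a change falls into the seam (gap) or into both phases (duplicate). The shared boundary is the whole trick.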
For very large tables this still hurts — a SELECT * on a 200 GB table holds a snapshot open for hours, which holds back the xmin horizon, stops autovacuum from removing dead tuples, and bloats the table. Modern Debezium adds incremental snapshots (the watermark-based snapshotting technique from Netflix's 2019 DBLog paper, adopted by Debezium in 2021): the snapshot is broken into chunks, each chunk reads a key range, and watermark events are written to the WAL (via a signalling table) so the streaming consumer knows when a chunk is "done." This lets snapshots and streaming run concurrently and lets you re-snapshot one table without restarting the whole connector.
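The watermark trick, stripped to its core (a deliberately simplified sketch of the DBLog idea, not Debezium's actual implementation): mark a low watermark, read the chunk, mark a high watermark; any key the stream also delivered between the two watermarks is dropped from the chunk, because the streamed version is at least as new.

```python
def dedup_chunk(chunk_rows: dict, stream_keys_in_window: set) -> dict:
    """Keep a chunk row only if the stream did not already deliver a fresher
    version of that key between the low and high watermarks."""
    return {k: v for k, v in chunk_rows.items() if k not in stream_keys_in_window}

# The chunk read returned ids 1..3; while it was running, the stream
# delivered an update to id 2 — the chunk's copy of id 2 may be stale.
chunk = {1: {"status": "pending"}, 2: {"status": "pending"}, 3: {"status": "pending"}}
streamed_during_chunk = {2}

assert dedup_chunk(chunk, streamed_during_chunk) == {
    1: {"status": "pending"},
    3: {"status": "pending"},
}
```

This is why snapshot and stream can run concurrently: conflicts inside the watermark window are resolved per key instead of by freezing the whole table.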
MySQL has no native MVCC snapshot mechanism for the binlog the way Postgres does. Debezium uses a FLUSH TABLES WITH READ LOCK (briefly, to record the binlog position), then SELECT * from each table inside a REPEATABLE READ transaction (InnoDB MVCC), then releases the lock and streams from the recorded position. The lock is held for milliseconds — long enough to call SHOW MASTER STATUS. This is why MySQL CDC with Debezium has a "blocking snapshot" cost that Postgres does not.
Schema evolution: the silent killer
A migration runs at 02:30 IST: ALTER TABLE orders ADD COLUMN refund_id BIGINT;. Three things happen.
In Postgres logical decoding, the schema change itself does not appear in the logical stream. pgoutput only emits row-level events. But the next INSERT after the ALTER produces a row event whose tuple has one extra column. Debezium handles this by inspecting the publication's schema on each relation message — when the schema id changes, it re-reads the table definition and updates the in-memory schema for that table. The connector then writes a schema change event to a separate topic (<prefix>.<db>.schema_changes) so downstream consumers can react.
For MySQL the binlog is even better-behaved: ALTER TABLE statements appear as QUERY_EVENT records right in the binlog stream, in commit order with the row events. Debezium parses them, updates its internal schema, and writes a schema change event.
What can go wrong:
- Backwards-incompatible changes. Dropping a column, renaming a column, changing a type from INT to TEXT — these break consumers that expect the old schema. Debezium will pass the change through, but your OpenSearch mapping or your Avro schema registry will reject the new event.
- Columns dropped before the snapshot finishes. A DROP COLUMN while a snapshot is in progress can leave snapshot rows with fewer columns than streaming rows. Debezium 2.x handles this for Postgres; older versions need a connector restart.
- pg_repack or table rewrites. Running pg_repack on a published table changes the underlying relfilenode. Old connectors with cached schemas would crash; modern ones re-read on the next relation message.
- Tables added to the publication after slot creation. ALTER PUBLICATION ... ADD TABLE works, but Debezium needs publication.autocreate.mode = filtered or disabled, and you must take a fresh snapshot of the new table — usually via the signalling table mechanism (INSERT INTO debezium_signal ...), which triggers an ad-hoc incremental snapshot.
The single most useful operational discipline is: register every event schema in a schema registry (Confluent, Apicurio) with BACKWARD compatibility mode. Producers (Debezium) cannot publish schemas that break consumers. The registry rejects the change at publish time, before the bad event spreads.
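What BACKWARD compatibility means mechanically, in a deliberately simplified form (real registries apply the full Avro schema-resolution rules; this sketch models only required fields and defaults):

```python
def backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Consumers on the new schema must be able to read data written with the
    old one: every field the new schema adds must carry a default."""
    for name, field in new_schema.items():
        if name not in old_schema and "default" not in field:
            return False  # old events lack this field and there is no fallback
    return True

old = {"order_id": {"type": "long"}, "status": {"type": "string"}}

# Adding a nullable column with a default: old events still readable.
ok = dict(old, refund_id={"type": ["null", "long"], "default": None})
assert backward_compatible(old, ok)

# Adding a required column: old events cannot supply it — registry rejects.
bad = dict(old, refund_id={"type": "long"})
assert not backward_compatible(old, bad)
```

The 02:30 migration that adds refund_id as a nullable column sails through; the one that adds it NOT NULL without a default is stopped at publish time, before any consumer sees a bad event.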
Common confusions
- "CDC is the same as replication." Both tail a log, but their consumers are different. Replication consumers are other instances of the same database — they need byte-level WAL or row-image binlog. CDC consumers are arbitrary downstream systems (Kafka, search, warehouse) that need logical row events with stable schemas. Postgres exposes both: physical streaming replication for replicas, logical decoding for CDC.
- "Replication slots are just bookmarks." They are bookmarks the database is contractually obligated to honour. An idle slot pins WAL retention and can fill the disk on the primary, halting all writes. Always alarm on pg_replication_slots.confirmed_flush_lsn lag and disk usage. Drop slots that belong to dead consumers — Debezium does not do this for you.
- "Statement-based binlog is fine for CDC if my queries are simple." It is not. Triggers, defaults like NOW(), and stored procedures can all turn one statement into different rows on different replays. Even if your queries look deterministic today, a future schema change can introduce non-determinism silently. Always use binlog_format = ROW.
- "Debezium guarantees exactly-once delivery." It does not. Debezium produces to Kafka with at-least-once semantics by default. Duplicates can happen on connector restart, before the LSN was confirmed. Effectively-once outcomes require either Kafka transactions (Connect's exactly.once.source.support) or idempotent consumers keyed by primary key.
- "I can just use triggers and write to a Kafka producer from inside Postgres." You cannot. Triggers run inside the transaction; if the Kafka producer hangs, your transaction hangs and your application stalls. The classic fix is the transactional outbox: triggers (or application code) write to an outbox table in the same transaction, and a separate poller reads the outbox and publishes — but at that point you have rebuilt a worse Debezium.
- "Logical decoding works on Postgres replicas." Until Postgres 16, no — slots could only be created on the primary, so a primary failover meant losing the slot and re-snapshotting. Postgres 16 added logical decoding on standbys, finally letting CDC survive failover. Most production deployments still pin Debezium to the primary or use connector-level redundancy.
Going deeper
Why pgoutput and not wal2json
Early Debezium used wal2json, an external output plugin that emits JSON. It has to be installed on the server (apt install postgresql-15-wal2json), which makes it a non-starter on managed databases like RDS or Cloud SQL where you cannot install C extensions. pgoutput is built into Postgres core since version 10, so it works everywhere managed Postgres is offered. The format is binary and slightly more compact, but you almost never read it directly — Debezium handles the parsing. The decoderbufs plugin emits Protocol Buffers and is faster but, again, requires installation. Modern recommendation: pgoutput everywhere unless you have a specific reason not to.
Replication slot operations and disasters
The slot is a debt. Three ways it bites:
- Disk fills. The slot stops being consumed (connector down, paused, or buggy) and WAL accumulates in pg_wal/. Postgres 13+ added max_slot_wal_keep_size to cap this — slots that exceed the cap are marked lost and invalidated, sacrificing the consumer to save the primary. Pre-13 deployments must monitor pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) and alarm before the disk dies.
- Failover loses the slot. Pre-Postgres 16, slots are local to a node. A failover to a replica creates a fresh node with no slot. You either re-snapshot, or use pg_failover_slots (a third-party extension, now mainlined as failover slots in PG 17), or a tool like Patroni that copies slots across.
- Long transactions stall the slot. Logical decoding only emits a transaction's row events after it commits, and an open transaction keeps the slot's restart LSN from advancing, pinning WAL. A 4-hour BEGIN on the primary pins 4 hours of WAL. The pathology shows up as pg_replication_slots.active = true while the slot's LSNs stop advancing. Always set idle_in_transaction_session_timeout.
Outbox pattern: when you cannot tail the database
Some shops run on managed databases (Aurora MySQL prior to its CDC support, older RDS Postgres versions) where you cannot enable logical decoding or get binlog access. The fallback is the transactional outbox: every business write also inserts a row into an outbox table within the same transaction. A separate process polls (or, when supported, CDC-tails) the outbox and publishes events. Debezium's EventRouter SMT (single message transform) is built for exactly this — it consumes from the outbox table, routes by aggregate_type, and unwraps the payload. It is more code than direct CDC, but it gives the application explicit control over what becomes an event and lets you serialise the payload at write time. Useful when CDC is forbidden or when you want to hide schema changes from consumers.
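A minimal transactional outbox fits in stdlib sqlite3 — the shape is identical with Postgres and a real Kafka producer (table and column names here are illustrative, loosely following Debezium's EventRouter conventions):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        aggregate_type TEXT, aggregate_id TEXT, payload TEXT,
        published INTEGER DEFAULT 0
    );
""")

# Business write and event write in ONE transaction — the whole point.
with db:
    db.execute("INSERT INTO orders VALUES (1, 'captured')")
    db.execute(
        "INSERT INTO outbox (aggregate_type, aggregate_id, payload) VALUES (?, ?, ?)",
        ("order", "1", json.dumps({"order_id": 1, "status": "captured"})),
    )

def poll_outbox(conn):
    """The relay: read unpublished rows, 'publish' them, mark them done."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, _payload in rows:
        # producer.send(topic, _payload) would go here
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return [json.loads(p) for _, p in rows]

assert poll_outbox(db) == [{"order_id": 1, "status": "captured"}]
assert poll_outbox(db) == []  # nothing left once published
```

If the process dies between publishing and marking published, the row is re-published on the next poll — at-least-once again, so downstream idempotency still matters.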
What Kafka topics look like and why
Debezium follows a strict naming convention: <topic.prefix>.<schema>.<table>, one topic per table. Within a topic, the key is the row's primary key, encoded as JSON or Avro. This is the load-bearing detail. Kafka guarantees per-key ordering within a partition, and partitioning by primary key means all changes for a given row land on the same partition in commit order. A consumer processing row order_id=42 always sees its inserts, updates, and delete in the right sequence, even with hundreds of consumer instances. Without that key, a parallel consumer would see updates out of order, which is unrecoverable for a materialised view.
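The ordering guarantee can be illustrated with a stand-in partitioner (Kafka's default actually hashes keys with murmur2; any stable hash of the serialised key gives the same property):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Stand-in for Kafka's key partitioner: stable hash mod partition count.
    (Kafka uses murmur2; the stability, not the hash, is what matters.)"""
    return zlib.crc32(key) % num_partitions

# Four changes to the same row, keyed by its primary key.
events = [(b'{"order_id": 42}', op) for op in ("c", "u", "u", "d")]

# Every change for order 42 lands on the same partition, so any consumer
# of that partition sees insert, updates, and delete in commit order.
partitions = {partition_for(key, 12) for key, _ in events}
assert len(partitions) == 1
```

Drop the key (or key by something unstable, like updated_at) and the four events scatter across partitions, where parallel consumers can apply the delete before the insert.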
The pgoutput protocol details
For the curious: pgoutput speaks a binary streaming protocol over the standard Postgres replication connection. Messages include Begin (B), Commit (C), Origin (O), Relation (R, which sends the table schema once per session per relation), Insert (I), Update (U), Delete (D), Truncate (T), Type (Y), and Stream Start/Stop/Commit/Abort for in-progress transaction streaming (Postgres 14+). Each tuple is encoded with a tag byte per column: n = null, u = unchanged TOAST value (for updates that did not modify a large column), t = text-format value. The "unchanged TOAST" tag is the source of subtle CDC bugs: an update that did not touch a TOASTed column produces an after event with placeholder values, which downstream consumers must merge with prior state. The standard workaround is setting REPLICA IDENTITY FULL on the table, which forces Postgres to log the full old row image, at the cost of more WAL.
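Handling the placeholder downstream is a merge with prior state. A sketch — the placeholder string below is an assumption (Debezium conventionally uses __debezium_unavailable_value, but check your connector's configuration):

```python
# Assumed placeholder; verify against your Debezium connector config.
UNAVAILABLE = "__debezium_unavailable_value"

def merge_after(prior: dict, after: dict) -> dict:
    """Replace unchanged-TOAST placeholders with the value we already hold
    from earlier events for the same key."""
    return {
        col: (prior.get(col) if val == UNAVAILABLE else val)
        for col, val in after.items()
    }

prior = {"order_id": 1, "status": "pending", "invoice_pdf": "<4 MB of base64>"}
after = {"order_id": 1, "status": "captured", "invoice_pdf": UNAVAILABLE}

assert merge_after(prior, after) == {
    "order_id": 1,
    "status": "captured",
    "invoice_pdf": "<4 MB of base64>",
}
```

Note this only works if the consumer actually holds prior state per key — a stateless consumer must either use REPLICA IDENTITY FULL upstream or treat the placeholder as "do not touch this field" in its sink.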
Where this leads next
CDC is the bridge that makes the rest of streaming databases possible. From here:
- Where this is all going: the database as a materialized view — chapter 181: if every database is a CDC source, every dependent system becomes a derived view, and the line between "database" and "view" dissolves.
- Materialize and Differential Dataflow — chapter 179: a database engine that consumes CDC streams and exposes incrementally maintained SQL views.
- Kafka as a distributed log — chapter 174: the substrate Debezium publishes to and why per-key ordering is the foundation everything else stands on.
- The stream / table duality — chapter 175: why a CDC topic and a database table are two views of the same object.
- Exactly-once semantics — chapter 176: how to make idempotent consumers turn at-least-once delivery into effectively-once outcomes.
- Change streams: CDC built in — chapter 141: MongoDB's first-class CDC API, contrasted with the bolt-on connector pattern.
References
- Debezium project, Debezium documentation — the canonical reference for connector configuration, snapshot modes, and SMTs across Postgres, MySQL, MongoDB, SQL Server, Oracle, and Cassandra. debezium.io/documentation.
- PostgreSQL Global Development Group, Logical Decoding — the official Postgres docs on wal_level = logical, replication slots, output plugins, and the protocol. postgresql.org/docs/current/logicaldecoding.html.
- Oracle, MySQL 8.0 Reference Manual: Replication — binlog formats, GTIDs, and the replication protocol Debezium speaks. dev.mysql.com/doc/refman/8.0/en/replication.html.
- Jiri Pechanec, DBZ-3157: Watermark-based incremental snapshots (Debezium blog, 2021) — the chunked-snapshot design that lets large tables be loaded without holding a Postgres snapshot for hours. debezium.io/blog/2021/10/07/incremental-snapshots/.
- Gunnar Morling, Reliable Microservices Data Exchange With the Outbox Pattern (Debezium blog, 2019) — the outbox pattern and how Debezium's EventRouter SMT consumes it. debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 11 Stream Processing — the textbook treatment of CDC, log-based message brokers, and database-inside-out. dataintensive.net.
- Confluent, Schema Registry — schema evolution rules (BACKWARD, FORWARD, FULL) for Avro/JSON/Protobuf; the operational discipline that keeps CDC pipelines from breaking. docs.confluent.io/platform/current/schema-registry.
- Kafka as a distributed log — chapter 174 of this track, the substrate Debezium publishes to.