Change Data Capture: Debezium and Logical Decoding

A Razorpay payment row flips from pending to captured at 11:42:03.117. By 11:42:03.118 the search index, the fraud model, the merchant dashboard cache, and the BigQuery warehouse all need to know — without anyone polling, without dual writes, and without a trigger that can take down the primary. Change Data Capture is the bridge that makes this possible, and it is built on a single audacious idea: the database already wrote a durable, ordered log of the change. Just go read it.

Change Data Capture turns an OLTP database (Postgres, MySQL) into a streaming source by tailing its replication log — pgoutput logical decoding for Postgres, row-format binlog for MySQL. Debezium is the connector that does this and writes each change as a Kafka record. The hard parts are not the parsing: they are snapshot-plus-stream initial loads, schema evolution, and operating replication slots, which retain WAL until it is consumed and can fill the primary's disk.

What CDC is actually for

You have a Postgres database holding the truth about orders. You also have the search index, the fraud model, the merchant dashboard cache, and the warehouse, each needing its own copy of that truth.

Several downstream systems, one source of truth. The naive ways to keep them in sync all fail in production:

Polling every minute with SELECT * FROM orders WHERE updated_at > :last_seen fails three ways: it misses deletes (a deleted row matches no predicate), it collapses multiple updates between polls into one, and it puts constant scan load on the primary. The rot is structural: the latency floor is your poll interval, deletes are invisible because the row is gone, and the load on the writer scales with how fresh consumers want the data to be, which is exactly backwards from what you want.

Dual writes — the application writes to Postgres, then to Kafka — are worse. A network blip between the two leaves the systems permanently inconsistent. There is no atomic boundary across two systems without a distributed transaction protocol you do not want to operate at scale. Riya from the merchant dashboard team finds, three months in, that 0.04% of "captured" payments never made it to Kafka.

Triggers are tighter but rigid. Postgres AFTER INSERT OR UPDATE OR DELETE triggers can call a function that writes to an outbox table, but every change pays the cost synchronously, and a slow downstream slows every commit. They also do not survive logical replication or major-version upgrades cleanly.

The good way is log-based CDC. Postgres already writes a durable, ordered Write-Ahead Log of every change — that is how replication works. If you can tail that log, decode it, and publish the decoded events to Kafka, you get every change, in commit order, with no extra load on the primary (the log was being written anyway), and bounded latency. The database is now a streaming source. Debezium is the canonical implementation of this idea.

How Postgres logical decoding works

Postgres has two log levels. The physical WAL is byte-level: "block 4242 of relation file 16384/16385/2604 had bytes 1024–1196 overwritten with these values." That is what streaming replication ships, and it requires the replica to have an identical on-disk layout. Useless for downstream systems.

Logical decoding is a layer on top. You ask Postgres "give me the WAL, but as logical row-level events: this insert into orders, that update on payments, that delete on cart_items." Postgres reads the physical WAL, runs it through an output plugin that knows how to translate page changes into row events, and streams them to whoever is subscribed.

To use logical decoding you need three things:

  1. wal_level = logical in postgresql.conf. Default is replica. Changing it requires a restart and writes slightly more WAL (roughly 5–15% more, because logical decoding needs the old row image for updates and deletes).
  2. A replication slot. A named cursor that says "I am consumer X, the last LSN I confirmed was Y, do not garbage-collect WAL newer than Y." Slots are persistent on disk.
  3. A publication. A list of tables you want changes for, optionally filtered by INSERT, UPDATE, DELETE, TRUNCATE. This is what CREATE PUBLICATION builds.

The default output plugin since Postgres 10 is pgoutput — the same plugin used by built-in logical replication between Postgres instances. Debezium uses pgoutput by default in modern versions; older deployments used wal2json or decoderbufs.
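Before wiring up Debezium, you can watch logical decoding by hand with the contrib test_decoding plugin, which emits human-readable text instead of pgoutput's binary format. A sketch, assuming wal_level = logical and an orders table like the one created in the next section:

```sql
-- A throwaway slot using the contrib test_decoding plugin
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

INSERT INTO orders (merchant, amount_p, status)
VALUES ('demo-merchant', 100, 'pending');

-- Read and consume the decoded changes; the data column looks roughly like:
--   BEGIN 4827200
--   table public.orders: INSERT: order_id[bigint]:2 merchant[text]:'demo-merchant' ...
--   COMMIT 4827200
SELECT lsn, xid, data
FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

-- Drop it when done, or it will retain WAL forever
SELECT pg_drop_replication_slot('demo_slot');
```

The same WAL, decoded through a different output plugin: the plugin only changes the wire format, not the events.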

[Figure: Debezium logical decoding pipeline. Left to right: the application issues UPDATE orders ...; the Postgres backend writes the WAL record; the WAL sender plus pgoutput decode it into row events; the Debezium connector (the slot subscriber, emitting JSON/Avro) publishes to the Kafka topic; the replication slot debezium_slot tracks the confirmed LSN. A feedback arrow from Kafka back to the slot carries the confirmed LSN that lets Postgres garbage-collect old WAL. Inset, the two log levels in Postgres: the physical WAL (byte-level page changes, for streaming replicas) and the logical stream (row-level events via pgoutput, for CDC).]
The change flows left-to-right. The LSN confirmation flows right-to-left, and that loop is what lets Postgres safely free old WAL.

The LSN — Log Sequence Number — is a 64-bit byte position in the WAL stream, written in hex as two 32-bit halves, like 0/16B3748. Every event has one. The slot remembers the last LSN the consumer has confirmed. WAL older than the oldest active slot's confirmed LSN can be recycled; WAL newer than it cannot. Why this matters: a Debezium connector that crashes for two days while writes continue causes Postgres to retain two days of WAL on disk. If pg_wal/ fills the disk, the primary stops accepting writes. The slot is a debt the primary holds against the consumer.
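Watching that debt is one query against pg_replication_slots; alert long before retained WAL approaches the size of the pg_wal volume:

```sql
-- Bytes of WAL each slot is forcing the primary to keep
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```

A slot with active = false and a growing retained_wal is a connector that died without anyone noticing.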

A working Debezium pipeline in 50 lines of config

Concrete is better than abstract. Here is an end-to-end pipeline: a Postgres orders table on a Hyderabad host, Debezium running as a Kafka Connect plugin, Kafka local to the connector, and a tiny Python consumer printing every change.

-- run in psql as a superuser
ALTER SYSTEM SET wal_level = 'logical';
-- restart Postgres after this; wal_level changes need a restart.

CREATE TABLE orders (
  order_id   BIGSERIAL PRIMARY KEY,
  merchant   TEXT NOT NULL,
  amount_p   BIGINT NOT NULL,        -- amount in paise; ₹1 = 100 paise
  status     TEXT NOT NULL,
  updated_at TIMESTAMPTZ DEFAULT now()
);

CREATE PUBLICATION dbz_pub FOR TABLE orders;
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

Now a Debezium Postgres connector configuration, posted to the Kafka Connect REST API:

{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "pg-hyderabad-1.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "***",
    "database.dbname": "razorpay_prod",
    "topic.prefix": "rp1",
    "plugin.name": "pgoutput",
    "publication.name": "dbz_pub",
    "slot.name": "debezium_slot",
    "table.include.list": "public.orders",
    "snapshot.mode": "initial",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter"
  }
}

Run INSERT INTO orders (merchant, amount_p, status) VALUES ('cafe-coimbatore', 25000, 'pending'); and a record lands on the Kafka topic rp1.public.orders. Here is the value, formatted (key is {"order_id": 1}):

{
  "before": null,
  "after": {
    "order_id": 1,
    "merchant": "cafe-coimbatore",
    "amount_p": 25000,
    "status": "pending",
    "updated_at": "2026-04-25T11:42:03.117Z"
  },
  "source": {
    "version": "2.6.1.Final",
    "connector": "postgresql",
    "name": "rp1",
    "ts_ms": 1745581323117,
    "snapshot": "false",
    "db": "razorpay_prod",
    "schema": "public",
    "table": "orders",
    "txId": 4827193,
    "lsn": 23847362
  },
  "op": "c",
  "ts_ms": 1745581323120
}

The schema is the Debezium envelope: before, after, source, op, ts_ms. op is c for create, u for update, d for delete, r for read (used during snapshots), t for truncate. For an update you get before and after populated; for a delete only before. The source.lsn is the LSN of the change in the source WAL. The source.snapshot flag tells you whether this row came from the initial snapshot or the streaming phase.

A consumer is now trivial. Here is one that materialises a fresh OpenSearch document on every change:

# orders_indexer.py — consume Debezium CDC into OpenSearch
import json
import opensearchpy
from kafka import KafkaConsumer

es = opensearchpy.OpenSearch("https://opensearch-mum.internal:9200")
c = KafkaConsumer(
    "rp1.public.orders",
    bootstrap_servers="kafka-1.internal:9092",
    group_id="orders-indexer",
    enable_auto_commit=False,
    # Debezium follows each delete with a null-value tombstone record
    value_deserializer=lambda b: json.loads(b) if b is not None else None,
    key_deserializer=lambda b: json.loads(b),
)

for msg in c:
    if msg.value is None:        # tombstone; the delete was already handled
        c.commit()
        continue
    op = msg.value["op"]
    if op in ("c", "u", "r"):    # create, update, snapshot read: upsert full row
        doc = msg.value["after"]
        es.index(index="orders", id=doc["order_id"], body=doc)
    elif op == "d":              # delete: drop the document, tolerate absence
        es.delete(index="orders", id=msg.value["before"]["order_id"], ignore=[404])
    c.commit()  # commit the offset only after the side effect succeeded

That is the whole shape of CDC in production. The key is idempotency by primary key: replaying the same record twice produces the same final state, because we always write the full after row to OpenSearch keyed by order_id. Combined with at-least-once delivery from Kafka, this gives effectively-once outcomes without exactly-once delivery semantics.

MySQL binlog row format and the wider connector family

MySQL has its own log: the binary log (binlog). It exists for replication, point-in-time recovery, and auditing. There are three formats: STATEMENT logs the SQL text of each statement, ROW logs the before/after images of each changed row, and MIXED lets the server pick per statement.

Debezium's MySQL connector requires binlog_format = ROW and binlog_row_image = FULL (so updates carry the full before image, not just the changed columns). It connects to the MySQL server using the replication protocol — same protocol a replica would use — and reads binlog events: WRITE_ROWS_EVENT, UPDATE_ROWS_EVENT, DELETE_ROWS_EVENT. Each event is decoded into the same Debezium envelope you saw for Postgres. The position in the log is a (binlog_filename, binlog_position) pair, or a GTID set if GTIDs are enabled.
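Checking (and, on MySQL 8+, persisting) those settings is a couple of statements; the values shown are what Debezium expects:

```sql
-- Confirm the binlog prerequisites for Debezium's MySQL connector
SHOW VARIABLES WHERE Variable_name IN
  ('log_bin', 'binlog_format', 'binlog_row_image');
-- Want: log_bin = ON, binlog_format = ROW, binlog_row_image = FULL

-- On MySQL 8+, persist the dynamic settings without editing my.cnf
SET PERSIST binlog_format = 'ROW';
SET PERSIST binlog_row_image = 'FULL';
```

log_bin itself is not dynamic; if it is OFF, enabling it requires a server restart.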

Beyond Postgres and MySQL, the same pattern shows up everywhere a database has a durable log: MongoDB's oplog (and the change streams layered on it), SQL Server's transaction log surfaced through its built-in CDC tables, Oracle's redo log read via LogMiner, Cassandra's commit log.

Every connector boils down to: find the database's log, get permission to tail it, decode it, hand it to Kafka. Different log shape, same shape of solution.

[Figure: Snapshot + stream, how Debezium does an initial load. Timeline, left to right. Top track: the WAL/binlog with markers at LSNs L0, L1, L2. Middle track: the snapshot phase (SELECT * FROM the table at L0, emitting op=r), then the streaming phase (tailing the WAL from L0 onward, emitting op=c/u/d). Bottom track: the resulting Kafka topic, snapshot records first (op=r), then streaming records with monotonically increasing LSNs. Snapshot rows are tagged op=r so consumers know they came from a SELECT, not the WAL.]
Initial load is the trickiest part of CDC. The connector takes a snapshot at LSN L0, then streams the WAL from L0 forward — so every row is delivered exactly once across the boundary, in order.

Snapshot plus stream: the initial-load problem

When you point Debezium at an existing table with five years of data and ten million rows, what happens? You cannot just start tailing the WAL — you would only see future changes, never the rows already in the table. You also cannot do a separate SELECT * and then start tailing — between the snapshot and the stream there is a gap, or worse, an overlap with double delivery.

The trick: freeze a single LSN as the boundary, snapshot at that LSN, then stream from the same LSN.

In Postgres, Debezium does, in essence, this:

  1. Over the replication connection, issue CREATE_REPLICATION_SLOT debezium_slot LOGICAL pgoutput EXPORT_SNAPSHOT, which creates the slot at a consistent LSN and exports a snapshot ID like 00000003-0000001F-1.
  2. Open a transaction: BEGIN ISOLATION LEVEL REPEATABLE READ READ ONLY;
  3. SET TRANSACTION SNAPSHOT '00000003-0000001F-1'; so the transaction sees the database exactly as of the slot's creation LSN.
  4. In the same transaction, SELECT * FROM orders and stream rows to Kafka with op = 'r'.
  5. Commit the transaction. Now start consuming the slot, which begins exactly where the snapshot ended.

The transactional snapshot is the magic. Postgres guarantees that SELECT * inside the transaction sees the database state as of the snapshot LSN, and the slot starts streaming WAL records strictly after that LSN. No gap, no overlap. Why this works: the slot and the snapshot are coupled by a shared LSN. Without that coupling, rows committed between the SELECT and the slot creation would be missed; rows captured by both would be double-delivered. The MVCC snapshot is what closes the seam.

For very large tables this still hurts — a SELECT * on a 200 GB table holds a snapshot for hours, blocks autovacuum from cleaning up dead tuples, and bloats the table. Modern Debezium adds incremental snapshots (the watermark-based snapshotting technique described on the Debezium blog in 2021): the snapshot is broken into chunks, each chunk reads a key range, and watermark events are written to the WAL so the streaming consumer knows when a chunk is "done." This lets snapshots and streaming run concurrently and lets you re-snapshot one table without restarting the whole connector.
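The deduplication rule at the heart of the technique fits in a few lines of Python. This is a heavy simplification of what the connector actually does (real Debezium coordinates the chunk window with watermark events written through a signal table), but it shows why chunks and streaming can run concurrently:

```python
# Simplified sketch of watermark-based chunk deduplication (DBLog-style).
# While a snapshot chunk is being read, streaming continues. Any key that
# also changed in the stream during the chunk's window is dropped from the
# chunk: the streamed event is newer and will arrive in commit order anyway.

def dedup_chunk(chunk_rows, stream_events_in_window):
    """chunk_rows: rows read by the chunk SELECT, each with a 'key'.
    stream_events_in_window: CDC events seen between the low and high
    watermarks that bracket the chunk read."""
    changed_keys = {e["key"] for e in stream_events_in_window}
    return [row for row in chunk_rows if row["key"] not in changed_keys]

chunk = [{"key": 1, "status": "pending"}, {"key": 2, "status": "pending"}]
window = [{"key": 2, "op": "u"}]
print(dedup_chunk(chunk, window))  # key 2 is dropped; the stream's update wins
```

The chunk therefore never emits a stale image for a row the stream has already superseded, which is exactly the no-overlap property the single-LSN snapshot gave us, rebuilt one key range at a time.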

MySQL has no native MVCC snapshot mechanism for the binlog the way Postgres does. Debezium uses a FLUSH TABLES WITH READ LOCK (briefly, to record the binlog position), then SELECT * from each table inside a REPEATABLE READ transaction (InnoDB MVCC), then releases the lock and streams from the recorded position. The lock is held for milliseconds — long enough to call SHOW MASTER STATUS. This is why MySQL CDC with Debezium has a "blocking snapshot" cost that Postgres does not.
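The MySQL dance can be sketched as a handful of SQL steps (schematic; Debezium drives these over its own connections):

```sql
-- 1. Briefly block writes and pin a binlog position
FLUSH TABLES WITH READ LOCK;
-- 2. Open an InnoDB MVCC snapshot at this exact position (lock still held)
START TRANSACTION WITH CONSISTENT SNAPSHOT;
SHOW MASTER STATUS;   -- record (File, Position), or the executed GTID set
UNLOCK TABLES;        -- the lock was held for milliseconds
-- 3. Read the tables inside the snapshot transaction, emitting op = 'r'
SELECT * FROM orders;
COMMIT;
-- 4. Connect with the replication protocol and stream binlog events
--    from the recorded position.
```

The consistent snapshot must be opened while the lock is held, so that the recorded binlog position and the MVCC view describe the same instant.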

Schema evolution: the silent killer

A migration runs at 02:30 IST: ALTER TABLE orders ADD COLUMN refund_id BIGINT;. Three things happen.

In Postgres logical decoding, the schema change itself does not appear in the logical stream. pgoutput only emits row-level events. But the next INSERT after the ALTER produces a row event whose tuple has one extra column. Debezium handles this by inspecting the publication's schema on each relation message — when the schema id changes, it re-reads the table definition and updates the in-memory schema for that table. The connector then writes a schema change event to a separate topic (<prefix>.<db>.schema_changes) so downstream consumers can react.

For MySQL the binlog is even better-behaved: ALTER TABLE statements appear as QUERY_EVENT records right in the binlog stream, in commit order with the row events. Debezium parses them, updates its internal schema, and writes a schema change event.

What can go wrong: a dropped or renamed column breaks every consumer that reads that field by name; a narrowed type (BIGINT to INT) silently truncates values; a new NOT NULL column with no default leaves old and new events structurally incompatible, so a replay from the earliest offset no longer deserialises cleanly.

The single most useful operational discipline is: register every event schema in a schema registry (Confluent, Apicurio) with BACKWARD compatibility mode. Producers (Debezium) cannot publish schemas that break consumers. The registry rejects the change at publish time, before the bad event spreads.
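With Confluent's registry, pinning that mode for the orders value schema is a single call to the config endpoint. A fragment, with an illustrative hostname; the subject name assumes the default TopicNameStrategy of <topic>-value:

```
curl -s -X PUT \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"compatibility": "BACKWARD"}' \
  http://schema-registry.internal:8081/config/rp1.public.orders-value
```

From then on, any schema Debezium tries to register for that subject is checked against the previous version, and an incompatible change is rejected at publish time.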

Common confusions

CDC delivery is at-least-once, not exactly-once: consumers must be idempotent, as the OpenSearch indexer above is. The envelope carries two timestamps: source.ts_ms is when the database committed the change, the top-level ts_ms is when Debezium processed it, and the gap between them is your pipeline lag. And a replication slot is not a Kafka consumer offset: the offset is cheap metadata inside Kafka, while the slot lives on the primary and holds WAL hostage until it is consumed.

Going deeper

Why pgoutput and not wal2json

Early Debezium used wal2json, an external output plugin that emits JSON. It has to be installed on the server (apt install postgresql-15-wal2json), which makes it awkward on managed databases where you cannot install C extensions yourself (some, like RDS, preinstall it; many do not). pgoutput is built into Postgres core since version 10, so it works everywhere managed Postgres is offered. The format is binary and slightly more compact, but you almost never read it directly, since Debezium handles the parsing. The decoderbufs plugin emits Protocol Buffers and is faster but, again, requires installation. Modern recommendation: pgoutput everywhere unless you have a specific reason not to.

Replication slot operations and disasters

The slot is a debt. Three ways it bites:

  1. Disk fills. Slot stops being consumed (connector down, paused, or buggy). WAL accumulates in pg_wal/. Postgres 13+ added max_slot_wal_keep_size to cap this — slots that exceed the cap are marked lost and dropped, sacrificing the consumer to save the primary. Pre-13 deployments must monitor pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) and alarm before the disk dies.
  2. Failover loses the slot. Pre-Postgres 16, slots are local to a node. A failover to a replica creates a fresh node with no slot. You either re-snapshot, or use pg_failover_slots (a third-party extension, now mainlined as failover slots in PG 17), or a tool like Patroni that copies slots across.
  3. Long transactions stall the slot. Logical decoding emits a transaction's row events only after it commits, and the slot must retain WAL back to the start of the oldest in-progress transaction — so a 4-hour BEGIN on the primary means hours of WAL the slot cannot release. The pathology shows up as pg_replication_slots.active = true while restart_lsn never advances. Always set an idle_in_transaction_session_timeout.
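On Postgres 13+, the disk-fills failure mode can be capped directly; the size here is illustrative:

```sql
-- Cap how much WAL any slot may retain. A slot that falls further behind
-- is invalidated ("lost"), sacrificing the consumer to save the primary.
ALTER SYSTEM SET max_slot_wal_keep_size = '64GB';
SELECT pg_reload_conf();   -- reloadable; no restart required

-- Postgres 13+ also reports slot health directly
SELECT slot_name, wal_status, safe_wal_size FROM pg_replication_slots;
```

A wal_status of 'unreserved' or 'lost' means the cap is biting; a lost slot forces the connector into a fresh snapshot.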

Outbox pattern: when you cannot tail the database

Some shops run on managed databases (Aurora MySQL prior to its CDC support, older RDS Postgres versions) where you cannot enable logical decoding or get binlog access. The fallback is the transactional outbox: every business write also inserts a row into an outbox table within the same transaction. A separate process polls (or, when supported, CDC-tails) the outbox and publishes events. Debezium's EventRouter SMT (single message transform) is built for exactly this — it consumes from the outbox table, routes by aggregate_type, and unwraps the payload. It is more code than direct CDC, but it gives the application explicit control over what becomes an event and lets you serialise the payload at write time. Useful when CDC is forbidden or when you want to hide schema changes from consumers.
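A minimal outbox, using the column names Debezium's EventRouter looks for by default (a sketch; the payload serialisation and column types are up to you):

```sql
CREATE TABLE outbox (
  id            UUID PRIMARY KEY,
  aggregatetype TEXT NOT NULL,   -- default routing field: topic outbox.event.<aggregatetype>
  aggregateid   TEXT NOT NULL,   -- becomes the Kafka record key
  type          TEXT NOT NULL,   -- event type, carried as a header
  payload       JSONB NOT NULL   -- the event body, serialised at write time
);

-- The business write and the event are atomic: one transaction
BEGIN;
UPDATE orders SET status = 'captured' WHERE order_id = 42;
INSERT INTO outbox (id, aggregatetype, aggregateid, type, payload)
VALUES (gen_random_uuid(), 'order', '42', 'OrderCaptured',
        '{"order_id": 42, "status": "captured"}');
COMMIT;
```

Because both writes commit or abort together, there is no dual-write seam; the publisher only ever sees events whose business change actually happened.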

What Kafka topics look like and why

Debezium follows a strict naming convention: <topic.prefix>.<schema>.<table>, one topic per table. Within a topic, the key is the row's primary key, encoded as JSON or Avro. This is the load-bearing detail. Kafka guarantees per-key ordering within a partition, and partitioning by primary key means all changes for a given row land on the same partition in commit order. A consumer processing row order_id=42 always sees its inserts, updates, and delete in the right sequence, even with hundreds of consumer instances. Without that key, a parallel consumer would see updates out of order, which is unrecoverable for a materialised view.
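The ordering guarantee is easy to see with a toy partitioner. This is a simplification — the real Java client hashes the serialised key bytes with murmur2 — but any deterministic hash gives the same property:

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Toy stand-in for Kafka's default partitioner: a stable hash of the
    key bytes modulo the partition count. Same key, same partition, always."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

key = b'{"order_id": 42}'
p = partition_for(key, 12)
# Every event for order 42 maps to the same partition, so the one consumer
# of that partition sees its insert, updates, and delete in commit order.
assert all(partition_for(key, 12) == p for _ in range(1000))
```

The flip side is that resizing the partition count reshuffles keys, which is why CDC topics are usually created with their final partition count up front.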

The pgoutput protocol in detail

For the curious: pgoutput speaks a binary streaming protocol over the standard Postgres replication connection. Messages include Begin (B), Commit (C), Origin (O), Relation (R, which sends a table's schema once per session per relation), Insert (I), Update (U), Delete (D), Truncate (T), Type (Y), and Stream Start/Stop/Commit/Abort for in-progress transaction streaming (Postgres 14+). Each tuple is encoded with a tag byte per column: n for null, u for an unchanged TOAST value (an update that did not modify a large column), t for a text-format value. The "unchanged TOAST" tag is the source of subtle CDC bugs: an update that did not touch a TOASTed column produces an after event with placeholder values, which downstream consumers must merge with prior state. The blunt workaround is ALTER TABLE ... REPLICA IDENTITY FULL, which forces the full before-image into the WAL, at the cost of writing more of it.
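A consumer-side merge for those placeholders might look like this. The placeholder string below is, to my knowledge, Debezium's default; it is configurable, so verify it against your connector version:

```python
# Merge "unchanged TOAST" placeholders in an 'after' image with the
# last-known state of the row. Placeholder assumed to be Debezium's
# default "__debezium_unavailable_value" (configurable; check your version).
PLACEHOLDER = "__debezium_unavailable_value"

last_known = {}  # order_id -> last fully-resolved row image

def merge_after(after):
    """Resolve placeholder columns from the prior row state, then cache."""
    prior = last_known.get(after["order_id"], {})
    merged = {col: (prior.get(col) if val == PLACEHOLDER else val)
              for col, val in after.items()}
    last_known[after["order_id"]] = merged
    return merged

# First event carries the full row, including the large TOASTed column...
merge_after({"order_id": 1, "status": "pending", "invoice_html": "<huge blob>"})
# ...a later update that did not touch it carries only the placeholder:
row = merge_after({"order_id": 1, "status": "captured",
                   "invoice_html": PLACEHOLDER})
assert row["invoice_html"] == "<huge blob>"
```

This only works if the consumer has seen the row's full image at least once (a snapshot read or an earlier full update), which is exactly why REPLICA IDENTITY FULL is the safer, more expensive alternative.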

Where this leads next

CDC is the bridge that makes the rest of streaming databases possible: once every change is an event on a durable log, each downstream system becomes a materialised view over that log.

References

  1. Debezium project, Debezium documentation — the canonical reference for connector configuration, snapshot modes, and SMTs across Postgres, MySQL, MongoDB, SQL Server, Oracle, and Cassandra. debezium.io/documentation.
  2. PostgreSQL Global Development Group, Logical Decoding — the official Postgres docs on wal_level=logical, replication slots, output plugins, and the protocol. postgresql.org/docs/current/logicaldecoding.html.
  3. Oracle, Replication and Binary Logging (MySQL 8.0 Reference Manual) — binlog formats, GTIDs, and the replication protocol Debezium speaks. dev.mysql.com/doc/refman/8.0/en/replication.html.
  4. Jiri Pechanec, DBZ-3157: Watermark-based incremental snapshots (Debezium blog, 2021) — the chunked-snapshot design that lets large tables be loaded without holding a Postgres snapshot for hours. debezium.io/blog/2021/10/07/incremental-snapshots/.
  5. Gunnar Morling, Reliable Microservices Data Exchange With the Outbox Pattern (Debezium blog, 2019) — the outbox pattern and how Debezium's EventRouter SMT consumes it. debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/.
  6. Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 11 Stream Processing — the textbook treatment of CDC, log-based message brokers, and database-inside-out. dataintensive.net.
  7. Confluent, Schema Registry — schema evolution rules (BACKWARD, FORWARD, FULL) for Avro/JSON/Protobuf; the operational discipline that keeps CDC pipelines from breaking. docs.confluent.io/platform/current/schema-registry.
  8. Kafka as a distributed log — chapter 174 of this track, the substrate Debezium publishes to.