Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Change Data Capture: Debezium and Logical Decoding

A PaisaBridge payment row flips from pending to captured at 11:42:03.117. By 11:42:03.118 the search index, the fraud model, the merchant dashboard cache, and the BigQuery warehouse all need to know — without anyone polling, without dual writes, and without a trigger that can take down the primary. Change Data Capture is the bridge that makes this possible, and it is built on a single audacious idea: the database already wrote a durable, ordered log of the change. Just go read it.

Change Data Capture turns an OLTP database (Postgres, MySQL) into a streaming source by tailing its replication log — pgoutput logical decoding for Postgres, row-format binlog for MySQL. Debezium is the connector that does this and writes each change as a Kafka record. The hard parts are not the parsing: they are snapshot-plus-stream initial loads, schema evolution, and replication slot operations that do not retain WAL forever and crash the primary.

What CDC is actually for

You have a Postgres database holding the truth about orders. You also have:

  • An OpenSearch cluster that needs every order indexed for the merchant search box.
  • A Redis cache holding the latest order status keyed by order_id.
  • A BigQuery warehouse running daily revenue dashboards for finance.
  • A Flink job computing real-time fraud features from the last five minutes of activity.
  • A webhook fan-out that fires on every status change to merchant integrations.

Five downstream systems, one source of truth. The naive ways to keep them in sync all fail in production:

Polling every minute with SELECT * FROM orders WHERE updated_at > :last_seen misses deletes (a deleted row matches no predicate), collapses two updates between polls into one, and puts constant scan load on the primary that grows with your data and your freshness target. Why polling rots: latency floor is your poll interval, deletes are invisible because the row is gone, and the load on the writer scales with how fresh consumers want to be — exactly backwards from what you want.

Dual writes — the application writes to Postgres, then to Kafka — are worse. A network blip between the two leaves the systems permanently inconsistent. There is no atomic boundary across two systems without a distributed transaction protocol you do not want to operate at scale. Riya from the merchant dashboard team finds, three months in, that 0.04% of "captured" payments never made it to Kafka.

Triggers are tighter but rigid. Postgres AFTER INSERT OR UPDATE OR DELETE triggers can call a function that writes to an outbox table, but every change pays the cost synchronously, and a slow downstream slows every commit. They also do not survive logical replication or major-version upgrades cleanly.

The good way is log-based CDC. Postgres already writes a durable, ordered Write-Ahead Log of every change — that is how replication works. If you can tail that log, decode it, and publish the decoded events to Kafka, you get every change, in commit order, with no extra load on the primary (the log was being written anyway), and bounded latency. The database is now a streaming source. Debezium is the canonical implementation of this idea.

How Postgres logical decoding works

Postgres has two log levels. The physical WAL is byte-level — "page 4242 in file 16384/16385/2604 had bytes 1024–1196 changed to these values." That is what streaming replication ships, and it requires the replica to have an identical on-disk layout. Useless for downstream systems.

Logical decoding is a layer on top. You ask Postgres "give me the WAL, but as logical row-level events: this insert into orders, that update on payments, that delete on cart_items." Postgres reads the physical WAL, runs it through an output plugin that knows how to translate page changes into row events, and streams them to whoever is subscribed.

To use logical decoding you need three things:

  1. wal_level = logical in postgresql.conf. Default is replica. Changing it requires a restart and writes slightly more WAL (roughly 5–15% more, because logical decoding needs the old row image for updates and deletes).
  2. A replication slot. A named cursor that says "I am consumer X, the last LSN I confirmed was Y, do not garbage-collect WAL newer than Y." Slots are persistent on disk.
  3. A publication. A list of tables you want changes for, optionally filtered by INSERT, UPDATE, DELETE, TRUNCATE. This is what CREATE PUBLICATION builds.

The default output plugin since Postgres 10 is pgoutput — the same plugin used by built-in logical replication between Postgres instances. Debezium uses pgoutput by default in modern versions; older deployments used wal2json or decoderbufs.

Debezium logical decoding pipelineFive stacked boxes showing the flow from a SQL update statement on the left, through the Postgres backend writing WAL, the WAL sender process reading the WAL, the pgoutput plugin decoding it into row events, the Debezium connector subscribed to the replication slot, and finally a Kafka topic on the right. An arrow loops back from Kafka to the slot showing the LSN confirmation that lets Postgres garbage-collect old WAL.ApplicationUPDATE orders ...Postgres backendwrites WAL recordWAL sender + pgoutputdecodes to row eventsDebezium connectorslot subscriber, JSON/AvroKafka topicdb1.public.ordersReplication slotdebezium_slot @ LSNTwo log levels in Postgresphysical WAL: byte-level page changes — for streaming replicas. logical: row-level events via pgoutput — for CDC.confirmed LSN feedback
The change flows left-to-right. The LSN confirmation flows right-to-left, and that loop is what lets Postgres safely free old WAL.

The LSN — Log Sequence Number — is a 64-bit byte offset into the WAL stream, written like 0/16B3748 (segment/byte). Every event has one. The slot remembers the last LSN the consumer has confirmed. WAL older than the oldest active slot's confirmed LSN can be recycled. WAL newer than it cannot. Why this matters: a Debezium connector that crashes for two days while writes continue causes Postgres to retain two days of WAL on disk. If pg_wal/ fills the disk, the primary stops accepting writes. The slot is a debt the primary holds against the consumer.

A working Debezium pipeline in 50 lines of config

Concrete is better than abstract. Here is an end-to-end pipeline: a Postgres orders table on a Hyderabad host, Debezium running as a Kafka Connect plugin, Kafka local to the connector, and a tiny Python consumer printing every change.

-- run in psql as a superuser
ALTER SYSTEM SET wal_level = 'logical';
-- restart Postgres after this; wal_level changes need a restart.

CREATE TABLE orders (
  order_id   BIGSERIAL PRIMARY KEY,
  merchant   TEXT NOT NULL,
  amount_p   BIGINT NOT NULL,        -- amount in paise; ₹1 = 100 paise
  status     TEXT NOT NULL,
  updated_at TIMESTAMPTZ DEFAULT now()
);

CREATE PUBLICATION dbz_pub FOR TABLE orders;
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

Now a Debezium Postgres connector configuration, posted to the Kafka Connect REST API:

{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "pg-hyderabad-1.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "***",
    "database.dbname": "razorpay_prod",
    "topic.prefix": "rp1",
    "plugin.name": "pgoutput",
    "publication.name": "dbz_pub",
    "slot.name": "debezium_slot",
    "table.include.list": "public.orders",
    "snapshot.mode": "initial",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter"
  }
}

Run INSERT INTO orders (merchant, amount_p, status) VALUES ('cafe-coimbatore', 25000, 'pending'); and a record lands on the Kafka topic rp1.public.orders. Here is the value, formatted (key is {"order_id": 1}):

{
  "before": null,
  "after": {
    "order_id": 1,
    "merchant": "cafe-coimbatore",
    "amount_p": 25000,
    "status": "pending",
    "updated_at": "2026-04-25T11:42:03.117Z"
  },
  "source": {
    "version": "2.6.1.Final",
    "connector": "postgresql",
    "name": "rp1",
    "ts_ms": 1745581323117,
    "snapshot": "false",
    "db": "razorpay_prod",
    "schema": "public",
    "table": "orders",
    "txId": 4827193,
    "lsn": 23847362
  },
  "op": "c",
  "ts_ms": 1745581323120
}

The schema is the Debezium envelope: before, after, source, op, ts_ms. op is c for create, u for update, d for delete, r for read (used during snapshots), t for truncate. For an update you get before and after populated; for a delete only before. The source.lsn is the LSN of the change in the source WAL. The source.snapshot flag tells you whether this row came from the initial snapshot or the streaming phase.

A consumer is now trivial. Here is one that materialises a fresh OpenSearch document on every change:

# orders_indexer.py — consume Debezium CDC into OpenSearch
from kafka import KafkaConsumer
import json, opensearchpy

es = opensearchpy.OpenSearch("https://opensearch-mum.internal:9200")
c = KafkaConsumer(
    "rp1.public.orders",
    bootstrap_servers="kafka-1.internal:9092",
    group_id="orders-indexer",
    enable_auto_commit=False,
    value_deserializer=lambda b: json.loads(b),
    key_deserializer=lambda b: json.loads(b),
)

for msg in c:
    op = msg.value["op"]
    if op in ("c", "u", "r"):
        doc = msg.value["after"]
        es.index(index="orders", id=doc["order_id"], body=doc)
    elif op == "d":
        es.delete(index="orders", id=msg.value["before"]["order_id"], ignore=[404])
    c.commit()  # only after the side effect succeeded

That is the whole shape of CDC in production. The key is idempotency by primary key: replaying the same record twice produces the same final state, because we always write the full after row to OpenSearch keyed by order_id. Combined with at-least-once delivery from Kafka, this gives effectively-once outcomes without exactly-once delivery semantics.

MySQL binlog row format and the wider connector family

MySQL has its own log: the binary log (binlog). It exists for replication, point-in-time recovery, and auditing. There are three formats:

  • Statement-based (SBR). The binlog stores the SQL statement that ran (UPDATE orders SET status = 'captured' WHERE order_id < 1000). Replays may diverge — NOW(), RAND(), or non-deterministic triggers can produce different rows on the replica. Useless for CDC.
  • Row-based (RBR). The binlog stores the actual row images: which row, before image, after image. Deterministic, replay-safe, exactly what CDC wants.
  • Mixed. Statement when safe, row when not. Avoid for CDC — predictability matters more than binlog size.

Debezium's MySQL connector requires binlog_format = ROW and binlog_row_image = FULL (so updates carry the full before image, not just the changed columns). It connects to the MySQL server using the replication protocol — same protocol a replica would use — and reads binlog events: WRITE_ROWS_EVENT, UPDATE_ROWS_EVENT, DELETE_ROWS_EVENT. Each event is decoded into the same Debezium envelope you saw for Postgres. The position in the log is a (binlog_filename, binlog_position) pair, or a GTID set if GTIDs are enabled.

Beyond Postgres and MySQL, the same pattern shows up everywhere a database has a durable log:

  • MongoDB: the oplog. Debezium has a MongoDB connector, but MongoDB also exposes change streams natively — you usually do not need Debezium.
  • SQL Server: CDC tables and Change Tracking, fed by Debezium's SQL Server connector.
  • Oracle: LogMiner, or XStream Out for licensed deployments. Painful, but it works.
  • Cassandra: the commit log via Debezium's Cassandra connector — though most Cassandra users emit events from the application layer instead.

Every connector boils down to: find the database's log, get permission to tail it, decode it, hand it to Kafka. Different log shape, same shape of solution.

Snapshot plus stream initial loadTimeline diagram. Time runs left to right. The top track shows the WAL/binlog with markers at LSNs labelled L0, L1, L2. The middle track shows three phases: an initial snapshot reading the table at LSN L0, a transition phase, and a streaming phase reading WAL from L0 onward. The bottom track shows the resulting Kafka topic with snapshot records first (op equals r) followed by streaming records with monotonically increasing LSNs.Snapshot + stream: how Debezium does an initial loadLSN L0LSN L1LSN L2WALSnapshot phaseSELECT * FROM table at L0; emit op=rStreaming phasetail WAL from L0; emit op=c/u/dr r rr r rr r r rr r rcucduuc→ Kafkasnapshot rows tagged op=r so consumers know they came from a SELECT, not the WAL
Initial load is the trickiest part of CDC. The connector takes a snapshot at LSN L0, then streams the WAL from L0 forward — so every row is delivered exactly once across the boundary, in order.

Snapshot plus stream: the initial-load problem

When you point Debezium at an existing table with five years of data and ten million rows, what happens? You cannot just start tailing the WAL — you would only see future changes, never the rows already in the table. You also cannot do a separate SELECT * and then start tailing — between the snapshot and the stream there is a gap, or worse, an overlap with double delivery.

The trick: freeze a single LSN as the boundary, snapshot at that LSN, then stream from the same LSN.

In Postgres, Debezium does this:

  1. Open a transaction with REPEATABLE READ ISOLATION LEVEL READ ONLY.
  2. SELECT pg_export_snapshot() — gives a snapshot ID like 00000003-0000001F-1.
  3. pg_create_logical_replication_slot(...) USING SNAPSHOT '00000003-0000001F-1' — creates the slot at exactly the LSN of that snapshot.
  4. In the same transaction, SELECT * FROM orders and stream rows to Kafka with op = 'r'.
  5. Commit the transaction. Now start consuming the slot, which begins exactly where the snapshot ended.

The transactional snapshot is the magic. Postgres guarantees that SELECT * inside the transaction sees the database state as of the snapshot LSN, and the slot starts streaming WAL records strictly after that LSN. No gap, no overlap. Why this works: the slot and the snapshot are coupled by a shared LSN. Without that coupling, rows committed between the SELECT and the slot creation would be missed; rows captured by both would be double-delivered. The MVCC snapshot is what closes the seam.

For very large tables this still hurts — a SELECT * on a 200 GB table holds a snapshot for hours, blocks autovacuum from cleaning up dead tuples, and bloats the table. Modern Debezium adds incremental snapshots (the Watermark-Based Snapshotting technique from a 2019 Debezium blog post): the snapshot is broken into chunks, each chunk reads a key range, and watermark events are written to the WAL so the streaming consumer knows when a chunk is "done." This lets snapshots and streaming run concurrently and lets you re-snapshot one table without restarting the whole connector.

MySQL has no native MVCC snapshot mechanism for the binlog the way Postgres does. Debezium uses a FLUSH TABLES WITH READ LOCK (briefly, to record the binlog position), then SELECT * from each table inside a REPEATABLE READ transaction (InnoDB MVCC), then releases the lock and streams from the recorded position. The lock is held for milliseconds — long enough to call SHOW MASTER STATUS. This is why MySQL CDC with Debezium has a "blocking snapshot" cost that Postgres does not.

Schema evolution: the silent killer

A migration runs at 02:30 IST: ALTER TABLE orders ADD COLUMN refund_id BIGINT;. Three things happen.

In Postgres logical decoding, the schema change itself does not appear in the logical stream. pgoutput only emits row-level events. But the next INSERT after the ALTER produces a row event whose tuple has one extra column. Debezium handles this by inspecting the publication's schema on each relation message — when the schema id changes, it re-reads the table definition and updates the in-memory schema for that table. The connector then writes a schema change event to a separate topic (<prefix>.<db>.schema_changes) so downstream consumers can react.

For MySQL the binlog is even better-behaved: ALTER TABLE statements appear as QUERY_EVENT records right in the binlog stream, in commit order with the row events. Debezium parses them, updates its internal schema, and writes a schema change event.

What can go wrong:

  • Backwards-incompatible changes. Dropping a column, renaming a column, changing a type from INT to TEXT — these break consumers that expect the old schema. Debezium will pass the change through, but your OpenSearch mapping or your Avro schema registry will reject the new event.
  • Columns dropped before the snapshot finishes. A DROP COLUMN while a snapshot is in progress can leave snapshot rows with fewer columns than streaming rows. Debezium 2.x handles this for Postgres; older versions need a connector restart.
  • pg_repack or table rewrites. Running pg_repack on a published table changes the underlying relfilenode. Old connectors with cached schemas would crash; modern ones re-read on the next relation message.
  • Tables added to the publication after slot creation. ALTER PUBLICATION ... ADD TABLE works, but Debezium needs publication.autocreate.mode = filtered or disabled, and you must take a fresh snapshot of the new table — usually via the signalling table mechanism (INSERT INTO debezium_signal ...), which triggers an ad-hoc incremental snapshot.

The single most useful operational discipline is: register every event schema in a schema registry (Confluent, Apicurio) with BACKWARD compatibility mode. Producers (Debezium) cannot publish schemas that break consumers. The registry rejects the change at publish time, before the bad event spreads.

Common confusions

  • "CDC is the same as replication." Both tail a log, but their consumers are different. Replication consumers are other instances of the same database — they need byte-level WAL or row-image binlog. CDC consumers are arbitrary downstream systems (Kafka, search, warehouse) that need logical row events with stable schemas. Postgres exposes both: physical streaming replication for replicas, logical decoding for CDC.

  • "Replication slots are just bookmarks." They are bookmarks the database is contractually obligated to honour. An idle slot pins WAL retention and can fill the disk on the primary, halting all writes. Always alarm on pg_replication_slots.confirmed_flush_lsn lag and disk usage. Drop slots that belong to dead consumers — Debezium does not do this for you.

  • "Statement-based binlog is fine for CDC if my queries are simple." It is not. Triggers, defaults like NOW(), and stored procedures can all turn one statement into different rows on different replays. Even if your queries look deterministic today, a future schema change can introduce non-determinism silently. Always use binlog_format = ROW.

  • "Debezium guarantees exactly-once delivery." It does not. Debezium produces to Kafka with at-least-once semantics by default. Duplicates can happen on connector restart, before the LSN was confirmed. Effectively-once outcomes require either Kafka transactions (Connect's exactly.once.source.support) or idempotent consumers keyed by primary key.

  • "I can just use triggers and write to a Kafka producer from inside Postgres." You cannot. Triggers run inside the transaction; if the Kafka producer hangs, your transaction hangs and your application stalls. The classic fix is the transactional outbox: triggers (or application code) write to an outbox table in the same transaction, and a separate poller reads the outbox and publishes — but at that point you have rebuilt a worse Debezium.

  • "Logical decoding works on Postgres replicas." Until Postgres 16, no — slots could only be created on the primary, so a primary failover meant losing the slot and re-snapshotting. Postgres 16 added logical decoding on standbys, finally letting CDC survive failover. Most production deployments still pin Debezium to the primary or use connector-level redundancy.

Going deeper

Why pgoutput and not wal2json

Early Debezium used wal2json, an external output plugin that emits JSON. It has to be installed on the server (apt install postgresql-15-wal2json), which makes it a non-starter on managed databases like RDS or Cloud SQL where you cannot install C extensions. pgoutput is built into Postgres core since version 10, so it works everywhere managed Postgres is offered. The format is binary and slightly more compact, but you almost never read it directly — Debezium handles the parsing. The decoderbufs plugin emits Protocol Buffers and is faster but, again, requires installation. Modern recommendation: pgoutput everywhere unless you have a specific reason not to.

Replication slot operations and disasters

The slot is a debt. Three ways it bites:

  1. Disk fills. Slot stops being consumed (connector down, paused, or buggy). WAL accumulates in pg_wal/. Postgres 13+ added max_slot_wal_keep_size to cap this — slots that exceed the cap are marked lost and dropped, sacrificing the consumer to save the primary. Pre-13 deployments must monitor pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) and alarm before the disk dies.
  2. Failover loses the slot. Pre-Postgres 16, slots are local to a node. A failover to a replica creates a fresh node with no slot. You either re-snapshot, or use pg_failover_slots (a third-party extension, now mainlined as failover slots in PG 17), or a tool like Patroni that copies slots across.
  3. Long transactions block decoding. Logical decoding can only emit a transaction's row events after the transaction commits — and it cannot decode any later transaction until the earlier one is committed. A 4-hour BEGIN on the primary blocks CDC for 4 hours. The pathology shows up as pg_replication_slots.active = true but confirmed_flush_lsn never advances. Always have a idle_in_transaction_session_timeout set.

Outbox pattern: when you cannot tail the database

Some shops run on managed databases (Aurora MySQL prior to its CDC support, older RDS Postgres versions) where you cannot enable logical decoding or get binlog access. The fallback is the transactional outbox: every business write also inserts a row into an outbox table within the same transaction. A separate process polls (or, when supported, CDC-tails) the outbox and publishes events. Debezium's EventRouter SMT (single message transform) is built for exactly this — it consumes from the outbox table, routes by aggregate_type, and unwraps the payload. It is more code than direct CDC, but it gives the application explicit control over what becomes an event and lets you serialise the payload at write time. Useful when CDC is forbidden or when you want to hide schema changes from consumers.

What Kafka topics look like and why

Debezium follows a strict naming convention: <topic.prefix>.<schema>.<table>, one topic per table. Within a topic, the key is the row's primary key, encoded as JSON or Avro. This is the load-bearing detail. Kafka guarantees per-key ordering within a partition, and partitioning by primary key means all changes for a given row land on the same partition in commit order. A consumer processing row order_id=42 always sees its inserts, updates, and delete in the right sequence, even with hundreds of consumer instances. Without that key, a parallel consumer would see updates out of order, which is unrecoverable for a materialised view.

The Pgoutput protocol details

For the curious: pgoutput speaks a binary streaming protocol over the standard Postgres replication connection. Messages include Begin (B), Commit (C), Origin (O), Relation (R) — sends the table schema once per session per relation, Insert (I), Update (U), Delete (D), Truncate (T), Type (Y), and Stream Start/Stop/Commit/Abort for in-progress transaction streaming (Postgres 14+). Each tuple is encoded with a tag byte per column: n = null, u = unchanged TOAST value (for updates that did not modify a large column), t = text-format value. The "unchanged TOAST" tag is the source of subtle CDC bugs: an update that did not touch a TOASTed column produces an after event with placeholder values, which downstream consumers must merge with prior state. Debezium has REPLICA IDENTITY FULL (which forces the full before-image) as a workaround, at the cost of more WAL.

Where this leads next

CDC is the bridge that makes the rest of streaming databases possible. From here:

References

  1. Debezium project, Debezium documentation — the canonical reference for connector configuration, snapshot modes, and SMTs across Postgres, MySQL, MongoDB, SQL Server, Oracle, and Cassandra. debezium.io/documentation.
  2. PostgreSQL Global Development Group, Logical Decoding — the official Postgres docs on wal_level=logical, replication slots, output plugins, and the protocol. postgresql.org/docs/current/logicaldecoding.html.
  3. MySQL AB, Replication and Binary Logging — binlog formats, GTIDs, and the replication protocol Debezium speaks. dev.mysql.com/doc/refman/8.0/en/replication.html.
  4. Jiri Pechanec, DBZ-3157: Watermark-based incremental snapshots (Debezium blog, 2021) — the chunked-snapshot design that lets large tables be loaded without holding a Postgres snapshot for hours. debezium.io/blog/2021/10/07/incremental-snapshots/.
  5. Gunnar Morling, Reliable Microservices Data Exchange With the Outbox Pattern (Debezium blog, 2019) — the outbox pattern and how Debezium's EventRouter SMT consumes it. debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/.
  6. Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 11 Stream Processing — the textbook treatment of CDC, log-based message brokers, and database-inside-out. dataintensive.net.
  7. Confluent, Schema Registry — schema evolution rules (BACKWARD, FORWARD, FULL) for Avro/JSON/Protobuf; the operational discipline that keeps CDC pipelines from breaking. docs.confluent.io/platform/current/schema-registry.
  8. Kafka as a distributed log — chapter 174 of this track, the substrate Debezium publishes to.