Postgres logical decoding from scratch

The previous chapter ended with the polling pipeline at a Mumbai e-commerce startup taking out the source replica during Diwali week. The fix is not "poll faster". The fix is to stop polling. Postgres has been writing every committed change to a file on disk since version 7.1 in 2001 — the write-ahead log, the WAL. Logical decoding is the official, in-server, replication-slot-backed API that lets your pipeline read that file as a stream of row-level events, in commit order, with millisecond freshness, costing the source database approximately nothing more than what it was already paying to keep its own replicas alive. This chapter builds the consumer side of that API by hand, with psql, before any tool sits in front of it.

The WAL already records every committed change as a physical record. Logical decoding is a server-side plugin pipeline that translates those physical records into row-level INSERT/UPDATE/DELETE events keyed by primary key. A replication slot is the bookmark that guarantees the WAL is retained until your consumer has read past it. Once you understand the slot, the output plugin, and the LSN as a position pointer, every CDC tool — Debezium, AWS DMS, Materialize, Estuary — is a thin wrapper.

What the WAL is, and why decoding exists

Postgres writes every change to the WAL before it touches the heap. That is the durability contract: a commit returns success only after its WAL records are fsync'd to disk. Crash recovery and physical replication both replay the WAL by reading those records and re-applying them byte-for-byte. The records are physical — "tuple at offset 47 of page 13892 of relation 16384, replace these 312 bytes with these 312 bytes". A standby Postgres can replay them; a Python script cannot, because the script does not know what relation 16384 is or what the tuple's columns are called.

Logical decoding is the bridge. It loads an output plugin inside the Postgres backend, feeds the physical WAL records through it, and the plugin emits logical records: "table public.orders, primary key id=98421, action UPDATE, old row {...}, new row {...}". Your consumer reads the logical records over the standard Postgres replication protocol and never has to think about page offsets again.

[Figure: From WAL bytes to logical events — the decoding pipeline. Three lanes trace one row change: the application commits UPDATE orders SET status='paid' WHERE id = 98421; the WAL records it physically (rel=16384, page=13892, offset=47, opaque bytes at LSN 0/1A2B3C4D); the output plugin (pgoutput / wal2json / test_decoding), running in the backend, decodes it via REPLICA IDENTITY into the logical event the consumer sees: { "action": "U", "schema": "public", "table": "orders", "lsn": "0/1A2B3C4D", "xid": 50231, "identity": { "id": 98421 }, "columns": { "id": 98421, "status": "paid", "amount": 4999, ... } }. No page offsets; primary-key identity; replayable in commit order.]
The output plugin runs inside the Postgres backend, not in your application. Your consumer reads the right-hand box. The first two boxes are private to the database.

This separation matters because the decoding cost is paid on the source, per slot, regardless of how many downstream subscribers there are. Why the cost is per slot, not per subscriber: the WAL records exist for crash recovery anyway, and the walsender serving a slot decodes each record once as its consumer reads. Multiple slots mean multiple decoding passes over the same record, but a single slot fanning out to many downstream subscribers (via Kafka, say) pays the decoding cost once.

Three pieces: WAL, output plugin, replication slot

Three concepts, plus their interactions, are the entire mental model.

The WAL is a byte-addressed log

WAL is a sequence of files in pg_wal/, each 16 MB. They concatenate into one logical stream. Every byte in that stream has an LSN — log sequence number, an 8-byte position written as XX/YYYYYYYY (e.g. 0/1A2B3C4D). LSNs are monotonic across the lifetime of the cluster. A consumer that says "I have processed up to LSN 0/1A2B3C4D, give me what comes after" gets exactly the next bytes in commit order.
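The LSN arithmetic is simple enough to sketch in plain Python. This is illustrative code, not a Postgres API — it just shows that `XX/YYYYYYYY` is a 64-bit byte offset (high 32 bits before the slash, low 32 bits after), and that `pg_wal_lsn_diff()` is subtraction on those offsets:

```python
def parse_lsn(text: str) -> int:
    """Convert 'XX/YYYYYYYY' into a single 64-bit byte offset."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def format_lsn(lsn: int) -> str:
    """Inverse of parse_lsn, matching Postgres's %X/%X display format."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"

start = parse_lsn("0/1A2B3C4D")
end   = parse_lsn("1/00000000")
print(format_lsn(start))   # 0/1A2B3C4D
print(end - start)         # bytes of WAL between the two positions
```

Every lag and retention number later in this chapter is exactly this subtraction.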

The output plugin is loaded into the backend

The plugin is a .so shared library that registers callbacks: begin, change, commit. When the decoder hits a WAL record for a table you have selected, it calls the plugin's change callback with a structured representation. The plugin formats the output (a compact binary protocol for pgoutput, JSON for wal2json, debug text for test_decoding) and pushes bytes into the slot's outgoing buffer.
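Real output plugins are C shared libraries, but the callback contract is small enough to model in Python. Everything below is a toy — the class and its output format are invented, loosely in the spirit of test_decoding's text output — but the begin/change/commit shape is the real contract:

```python
class TextDecodingPlugin:
    """Toy model of an output plugin's callbacks (illustrative, not real C)."""
    def __init__(self):
        self.out = []                      # stands in for the slot's buffer

    def begin(self, xid):
        self.out.append(f"BEGIN {xid}")

    def change(self, table, action, row):
        cols = " ".join(f"{k}={v!r}" for k, v in row.items())
        self.out.append(f"table {table}: {action}: {cols}")

    def commit(self, xid, lsn):
        self.out.append(f"COMMIT {xid} (lsn {lsn})")

# The decoder drives the callbacks; the plugin only chooses the wire format.
plugin = TextDecodingPlugin()
plugin.begin(50231)
plugin.change("public.orders", "UPDATE", {"id": 98421, "status": "paid"})
plugin.commit(50231, "0/1A2B3C4D")
print("\n".join(plugin.out))
```

Swap the string formatting for JSON and you have the shape of wal2json; swap it for a binary encoder and you have the shape of pgoutput.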

Three plugins are worth knowing:

  1. test_decoding — ships with Postgres; emits human-readable debug text. Good for psql experiments, wrong for production parsing.
  2. wal2json — a widely packaged third-party plugin; emits JSON. The easiest target for a hand-written consumer.
  3. pgoutput — ships with Postgres 10+; the binary format behind built-in logical replication, and the one Debezium speaks natively.

The replication slot is the bookmark

A slot is server-side state that says: "consumer X has acknowledged up to LSN 0/1A2B3C4D; do not delete WAL files containing bytes after this." It is durable across restarts. It is the reason the consumer can disconnect for an hour, reconnect, and resume exactly where it left off.

Slots come in two flavours:

  1. Physical slots feed raw WAL bytes to a standby for streaming replication; no plugin is involved.
  2. Logical slots attach an output plugin and feed decoded row events to a consumer. Everything in this chapter uses logical slots.

The slot is also the dangerous primitive. Why slots can take down a Postgres: the slot pins the WAL on disk. If the consumer dies and never reconnects, WAL files accumulate forever. A slot that has fallen behind by 3 days, on a write-heavy database, can fill the disk. The fix is not "increase disk"; the fix is to alert when pg_replication_slots.confirmed_flush_lsn falls more than N MB behind pg_current_wal_lsn() and either resurrect the consumer or drop the slot.

[Figure: Replication slot — the bookmark that retains WAL. A horizontal timeline of WAL segments (seg N-3 … seg N+3) with two markers: confirmed_flush_lsn, where the consumer last acknowledged, and current_wal_lsn, where the writer is now. Everything between them is retained WAL the slot will not let Postgres delete; it grows linearly while the consumer is offline — the disk-fill failure mode. slot lag (bytes) = current_wal_lsn − confirmed_flush_lsn.]
A healthy slot has lag in the megabytes. A dying slot has lag in the gigabytes. A fatal slot has lag exceeding the disk's free space — at which point Postgres refuses new writes.

Build it: a logical-decoding consumer in Python

Walk through it end to end. The setup is one psql session for postgresql.conf settings, one for the slot, and one Python consumer that reads.

Step 1 — configure Postgres

In postgresql.conf (or via ALTER SYSTEM):

wal_level = logical
max_replication_slots = 10
max_wal_senders = 10

Restart afterwards — all three are postmaster-level settings that require one. Why wal_level matters: replica (the default for streaming replication) writes enough information to physically replay pages. logical adds extra information — old-tuple data for UPDATEs and DELETEs based on REPLICA IDENTITY — that the output plugin needs to emit row-level events. Without logical, the WAL is missing the bytes the plugin reads.

Step 2 — choose REPLICA IDENTITY per table

Each table tells the WAL writer how much old-row data to record on UPDATE/DELETE. The default — DEFAULT — uses the primary key. For a payments table this is right. For a table without a primary key, you must set FULL (records the entire old row) or USING INDEX <name>. Without one of these, the decoded UPDATE/DELETE events arrive with null identity, and the consumer cannot tell which row changed.

ALTER TABLE orders REPLICA IDENTITY DEFAULT;  -- the default; uses PK
ALTER TABLE wallet_balances REPLICA IDENTITY FULL;  -- if no PK
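What REPLICA IDENTITY changes is visible in the decoded payloads themselves. The dictionaries below sketch wal2json's format-version 1 layout with invented values — under DEFAULT, oldkeys carries only the primary key; under FULL, it carries every old column, which is why FULL works for tables without a primary key:

```python
update_with_default_identity = {
    "kind": "update", "schema": "public", "table": "orders",
    "columnnames": ["id", "status"], "columnvalues": [98421, "paid"],
    # DEFAULT: oldkeys carries only the primary key of the old row.
    "oldkeys": {"keynames": ["id"], "keyvalues": [98421]},
}

update_with_full_identity = {
    "kind": "update", "schema": "public", "table": "wallet_balances",
    "columnnames": ["user_id", "balance"], "columnvalues": ["u_7", 1500],
    # FULL: every old column is recorded, so no primary key is required.
    "oldkeys": {"keynames": ["user_id", "balance"], "keyvalues": ["u_7", 2000]},
}

def identity(change):
    """Which row changed, as the consumer would recover it."""
    old = change.get("oldkeys", {})
    return dict(zip(old.get("keynames", []), old.get("keyvalues", [])))

print(identity(update_with_default_identity))   # {'id': 98421}
print(identity(update_with_full_identity))
```

With no primary key and DEFAULT identity, `oldkeys` is absent and `identity()` returns an empty dict — the "null identity" failure described above.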

Step 3 — create a slot and a publication

-- Pick wal2json for human-readable JSON output during development.
SELECT pg_create_logical_replication_slot('orders_cdc_slot', 'wal2json');

-- A publication lists which tables a pgoutput slot will see. wal2json
-- filters with its add-tables option instead, but creating the publication
-- now makes the later switch to pgoutput/Debezium a one-line change.
CREATE PUBLICATION orders_pub FOR TABLE orders, payments, refunds;

Step 4 — write the consumer

import json
import psycopg2
import psycopg2.extras

# Connection string for Razorpay-shaped local dev:
DSN = "dbname=razorpay_test user=cdc_reader replication=database host=localhost"
SLOT_NAME = "orders_cdc_slot"

# psycopg2's LogicalReplicationConnection uses the streaming-replication
# protocol under the hood. The 'replication=database' DSN flag is required —
# without it, Postgres rejects the START_REPLICATION command.
conn = psycopg2.connect(DSN, connection_factory=psycopg2.extras.LogicalReplicationConnection)
cur = conn.cursor()

class Consumer:
    """One change at a time. In production you'd batch and pipeline this."""
    def __init__(self):
        self.last_processed_lsn = None
        self.events_seen = 0

    def __call__(self, msg):
        # msg.payload is the bytes the output plugin wrote — JSON for wal2json.
        event = json.loads(msg.payload)
        for change in event.get("change", []):
            action = change["kind"]              # "insert" | "update" | "delete"
            table  = f'{change["schema"]}.{change["table"]}'
            keys   = dict(zip(change.get("oldkeys", {}).get("keynames", []),
                              change.get("oldkeys", {}).get("keyvalues", []))) or \
                     dict(zip(change.get("columnnames", []), change.get("columnvalues", [])))
            print(f"{msg.data_start:>12}  {action:<6}  {table:<24}  {keys}")
            self.events_seen += 1
        # Tell Postgres we have durably handled everything up to msg.data_start.
        # In production, do this AFTER the downstream sink confirms commit.
        msg.cursor.send_feedback(flush_lsn=msg.data_start)
        self.last_processed_lsn = msg.data_start

cur.start_replication(slot_name=SLOT_NAME, decode=True, options={
    "include-xids": "1",
    "include-lsn":  "1",
    # format-version 1 (the default) wraps each transaction in a
    # {"change": [...]} object — the shape Consumer.__call__ parses.
    # Version 2 emits one JSON object per action with different keys.
    "format-version": "1",
})
consumer = Consumer()
print("listening on slot:", SLOT_NAME)
try:
    cur.consume_stream(consumer)
except KeyboardInterrupt:
    print(f"\nstopped after {consumer.events_seen} events at lsn {consumer.last_processed_lsn}")
finally:
    cur.close(); conn.close()

Sample run, after a separate session does INSERT, UPDATE, DELETE on orders:

listening on slot: orders_cdc_slot
   1758932480  insert  public.orders        {'id': 98421, 'merchant_id': 'razorpay-test', 'amount': 4999, 'status': 'created'}
   1758932640  update  public.orders        {'id': 98421}
   1758932812  delete  public.orders        {'id': 98421}
^C
stopped after 3 events at lsn 1758932812

A walkthrough of the lines that matter:

  1. replication=database in the DSN switches the session to the replication protocol while still binding it to one database; without it, START_REPLICATION is rejected.
  2. start_replication(..., decode=True) makes psycopg2 decode the payload bytes to text; the options dict is passed through to wal2json unchanged.
  3. send_feedback(flush_lsn=...) is the acknowledgement that advances the slot's confirmed_flush_lsn. Acknowledge only after the downstream sink has durably committed — ack earlier, and a crash between the ack and the sink commit silently loses events.

The thing to internalise: the consumer is small — roughly 30 lines of Python. Everything sophisticated about CDC tools — Debezium's connector framework, AWS DMS's task model, Materialize's incremental views — is built on top of this one primitive. Understand the primitive and you understand every tool.

Operating the slot in production

The hardest part of running logical decoding in production at payments scale (Razorpay processes on the order of 10 crore transactions a day; PhonePe sustains on the order of 50 crore UPI events a day) is not the consumer code. The consumer code rarely changes. The slot management does.

Three numbers to alert on:

  1. pg_replication_slots.confirmed_flush_lsn lag: how far behind the slot is. Threshold: 1 GB or 5 minutes of WAL, whichever comes first. Past 1 GB, it means the consumer is meaningfully behind; past 5 minutes, it means freshness SLAs are at risk.
  2. active = false on the slot: the consumer is not connected. A slot that is inactive AND lagging is the disk-fill scenario.
  3. pg_replication_slots.restart_lsn: the oldest LSN the slot still needs. The difference between this and pg_current_wal_lsn() is the WAL retention size. Some teams set a hard cap (max_slot_wal_keep_size = 10GB since Postgres 13) to refuse to retain WAL beyond a limit — at which point the slot is invalidated and the consumer must rebootstrap, but the database stays alive. Why this trade-off is real: without max_slot_wal_keep_size, a stuck slot will fill disk and crash the database. With it, a stuck slot will eventually be killed and force the consumer to do a full snapshot bootstrap. Most teams choose the second failure mode because it does not page the on-call DBA at 3 a.m.
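The three alerts can be sketched as one pure function over a row shaped like pg_replication_slots (the column names are real; the thresholds and the function itself are illustrative, tuned per deployment):

```python
def parse_lsn(text):
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

GB = 1 << 30

def slot_alerts(row, current_wal_lsn):
    """Return the alert conditions a monitoring loop would page on."""
    now = parse_lsn(current_wal_lsn)
    alerts = []
    # Alert 1: confirmed_flush_lsn lag — the consumer is behind.
    if now - parse_lsn(row["confirmed_flush_lsn"]) > 1 * GB:
        alerts.append("consumer behind")
    # Alert 2: no consumer connected — with lag, the disk-fill scenario.
    if not row["active"]:
        alerts.append("consumer disconnected")
    # Alert 3: retained WAL approaching the max_slot_wal_keep_size cap.
    if now - parse_lsn(row["restart_lsn"]) > 10 * GB:
        alerts.append("retention near cap")
    return alerts

row = {"slot_name": "orders_cdc_slot", "active": False,
       "restart_lsn": "4/00000000", "confirmed_flush_lsn": "4/80000000"}
print(slot_alerts(row, current_wal_lsn="7/00000000"))
```

A slot tripping alerts 1 and 2 together is the one to act on first: nothing is draining it, and the lag can only grow.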

A common Razorpay-style monitoring query:

SELECT slot_name, active, restart_lsn, confirmed_flush_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag_bytes,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retain_bytes
FROM pg_replication_slots;

Sample output during a healthy run:

   slot_name      | active | restart_lsn | confirmed_flush_lsn | lag_bytes | retain_bytes
------------------+--------+-------------+---------------------+-----------+--------------
 orders_cdc_slot  | t      | 0/1A2B3000  | 0/1A2B3C4D          | 412 kB    | 3 MB
 analytics_slot   | t      | 0/1A2B3500  | 0/1A2B3B00          | 1.2 MB    | 2 MB

retain_bytes is the disk pressure metric. lag_bytes is the freshness metric. Both are alertable independently.

Common confusions

  1. Logical decoding is not logical replication. Built-in logical replication (CREATE PUBLICATION / CREATE SUBSCRIPTION between two Postgres servers) is one consumer of logical decoding; a CDC pipeline talks to the decoding layer directly.
  2. The slot stores a position, not the events. Changes are decoded from retained WAL as the consumer reads; drop the slot and the unread history is gone for good.
  3. Setting wal_level = logical does not retroactively make old WAL decodable — a slot streams from its creation point, never earlier.

Going deeper

How the output plugin sees a transaction

pgoutput and wal2json both buffer a transaction's changes and emit them at COMMIT, in commit order — meaning a 30-minute transaction emits nothing until it commits, then emits the entire batch in a burst. Why this matters for freshness: the long transaction's changes are already half an hour stale when they arrive, and because the stream is serialised in commit order, that burst must finish before the events of anything that committed after it — one huge transaction pushes a lag spike onto every consumer of the slot. An open transaction also pins the slot's restart_lsn, growing WAL retention while it runs. The fix is to keep transactions short on tables in the publication, or to use streaming of in-progress transactions (Postgres 14+, the streaming = on option), which emits events incrementally and signals an abort downstream if the transaction rolls back.
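The buffer-until-commit behaviour can be modelled in a few lines. This is a toy sketch of the decoder's per-transaction buffering (the class name and methods are invented, loosely after Postgres's internal reorder buffer), showing that emission order is commit order, not write order, and that aborted work never reaches the consumer:

```python
class ReorderBuffer:
    """Toy model: changes are held per transaction, released at COMMIT."""
    def __init__(self):
        self.pending = {}      # xid -> buffered changes
        self.emitted = []      # what the consumer actually sees

    def change(self, xid, event):
        self.pending.setdefault(xid, []).append(event)

    def commit(self, xid):
        self.emitted.extend(self.pending.pop(xid, []))

    def abort(self, xid):
        self.pending.pop(xid, None)    # rolled-back work is never emitted

buf = ReorderBuffer()
buf.change(100, "long-txn write 1")    # the 30-minute transaction starts
buf.change(101, "short-txn write")
buf.commit(101)                        # commits first, so it streams first
buf.change(100, "long-txn write 2")
buf.commit(100)                        # only now does the long txn appear
print(buf.emitted)
```

The streaming = on mode of Postgres 14+ changes `change()` to emit immediately, at the price of needing `abort()` semantics on the consumer side.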

pgoutput versus wal2json versus a bespoke plugin

Production CDC at scale (Razorpay, Stripe, Shopify) uses pgoutput because it is the binary format Debezium understands natively, and Debezium handles the reconnection / serialisation / Kafka-sink layer. wal2json is the right choice for one-off integrations, custom Python consumers, or tools that don't want to implement the binary protocol. A bespoke plugin (.so you write yourself) is rare; it is the right answer when you have an unusual emit format need (Avro directly, or a shape that maps onto your warehouse's CDC convention) and you can afford to maintain C code against Postgres major-version changes.

Snapshot bootstrap — the missing half

Logical decoding from a slot's creation point gives you everything after. To bootstrap a CDC pipeline against a 10 crore-row table, you also need a consistent snapshot of the table at the same LSN the slot started. Postgres provides this by exporting a snapshot at slot creation: the replication-protocol command CREATE_REPLICATION_SLOT ... LOGICAL ... EXPORT_SNAPSHOT returns a snapshot name, which other sessions adopt with SET TRANSACTION SNAPSHOT to read the table — in one or many parallel transactions — at exactly the slot's starting point. The consumer then attaches to the slot, and every event it receives is guaranteed to be after the snapshot. The next chapter (/wiki/snapshot-cdc-the-bootstrapping-problem) walks the bootstrap protocol Debezium implements and the failure modes (slot creation succeeds, snapshot read takes 4 hours, WAL fills during the read, slot becomes invalid).
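The hand-off rule between snapshot and stream is pure logic and worth seeing without a database. A sketch under stated assumptions — `bootstrap`, the event shapes, and the LSN values are all invented for illustration; the invariant is that an event at or before the snapshot's LSN is already reflected in the snapshot and must be skipped:

```python
def parse_lsn(text):
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def bootstrap(snapshot_rows, stream, slot_start_lsn):
    """Apply snapshot first, then only events past the slot's start LSN."""
    table = {row["id"]: row for row in snapshot_rows}   # consistent snapshot
    cutoff = parse_lsn(slot_start_lsn)
    for event in stream:
        if parse_lsn(event["lsn"]) <= cutoff:
            continue            # already reflected in the snapshot
        if event["action"] == "delete":
            table.pop(event["id"], None)
        else:                   # insert or update
            table[event["id"]] = {"id": event["id"], **event["row"]}
    return table

# The snapshot, taken at 0/20, already contains the 0/10 update.
snapshot = [{"id": 1, "status": "paid"}]
stream = [
    {"lsn": "0/10", "action": "update", "id": 1, "row": {"status": "paid"}},
    {"lsn": "0/30", "action": "insert", "id": 2, "row": {"status": "created"}},
]
print(bootstrap(snapshot, stream, slot_start_lsn="0/20"))
```

Skipping the 0/10 event is what makes replays safe: applying it twice would be harmless here, but for non-idempotent sinks the cutoff is what turns at-least-once delivery into a correct table.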

Multi-tenant: one slot per database, not per schema

A common confusion at SaaS shops in Bengaluru with one Postgres serving 30 customer schemas: a logical slot is per-database, not per-schema. The publication mechanism filters by table, so you can put one customer's tables in one publication and another's in another — but they share the same WAL retention. If one consumer goes down, the other consumer's WAL gets retained too. The cleanest pattern is to run one Postgres per customer if isolation is critical, or one publication per customer with shared WAL but per-publication monitoring of consumer liveness.

What changes between Postgres 10, 13, 14, and 16

Postgres 10 introduced in-core logical replication and the pgoutput plugin. Postgres 13 added max_slot_wal_keep_size for the disk-fill failsafe described above. Postgres 14 added streaming of in-progress transactions for the long-transaction problem. Postgres 16 added subscriber-side parallelism via max_parallel_apply_workers_per_subscription (relevant when one Postgres is a CDC sink for another). When choosing a Postgres version for CDC, 13 is the minimum that is operationally safe; 14+ is preferable; 16+ is current best practice. Razorpay's published architecture pins on 14 with max_slot_wal_keep_size = 50GB and streaming = on.

Where this leads next

By the end of Build 11, the freshness floor disappears. A pipeline that subscribes to a slot sees commits within milliseconds, costs the source ~constant CPU regardless of table size, and survives consumer restarts via the slot's confirmed_flush_lsn. The polling pipeline that paged on-call at Diwali becomes a footnote.
