Postgres logical decoding from scratch

The previous chapter ended with the polling pipeline at a Mumbai e-commerce startup taking out the source replica during Diwali week. The fix is not "poll faster". The fix is to stop polling. Postgres has been writing every committed change to a file on disk since version 7.1 in 2001 — the write-ahead log, the WAL. Logical decoding is the official, in-server, replication-slot-backed API that lets your pipeline read that file as a stream of row-level events, in commit order, with millisecond freshness, costing the source database approximately nothing more than what it was already paying to keep its own replicas alive. This chapter builds the consumer side of that API by hand, with psql, before any tool sits in front of it.

The WAL already records every committed change as a physical record. Logical decoding is a server-side plugin pipeline that translates those physical records into row-level INSERT/UPDATE/DELETE events keyed by primary key. A replication slot is the bookmark that guarantees the WAL is retained until your consumer has read past it. Once you understand the slot, the output plugin, and the LSN as a position pointer, every CDC tool — Debezium, AWS DMS, Materialize, Estuary — is a thin wrapper.

What the WAL is, and why decoding exists

Postgres writes every change to the WAL before it touches the heap. That is the durability contract: a commit returns success only after its WAL records are fsync'd to disk. Crash recovery and physical replication both replay the WAL by reading those records and re-applying them byte-for-byte. The records are physical — "tuple at offset 47 of page 13892 of relation 16384, replace these 312 bytes with these 312 bytes". A standby Postgres can replay them; a Python script cannot, because the script does not know what relation 16384 is or what the tuple's columns are called.

Logical decoding is the bridge. It loads an output plugin inside the Postgres backend, feeds the physical WAL records through it, and the plugin emits logical records: "table public.orders, primary key id=98421, action UPDATE, old row {...}, new row {...}". Your consumer reads the logical records over the standard Postgres replication protocol and never has to think about page offsets again.

[Figure: From WAL bytes to logical events — the decoding pipeline. Three lanes trace one row change: the application commits UPDATE orders SET status='paid' WHERE id = 98421; the WAL records it physically (rel=16384, page=13892, offset=47, opaque bytes at LSN 0/1A2B3C4D); the output plugin (pgoutput / wal2json / test_decoding), running in the backend, decodes it via REPLICA IDENTITY into the logical event the consumer sees: { "action": "U", "schema": "public", "table": "orders", "lsn": "0/1A2B3C4D", "xid": 50231, "identity": { "id": 98421 }, "columns": { "id": 98421, "status": "paid", "amount": 4999, ... } }. No page offsets; primary-key identity; replayable in commit order.]
The output plugin runs inside the Postgres backend, not in your application. Your consumer reads the right-hand box. The first two boxes are private to the database.

This separation matters because the decoding cost is paid on the source, per slot, regardless of how many downstream subscribers there are. Why the cost is per slot, not per subscriber: the WAL records exist for crash recovery anyway, and the walsender serving a slot decodes each record once as its consumer reads. Multiple slots mean multiple decoding passes over the same record, but a single slot fanning out to many downstream subscribers (via Kafka, say) pays the decoding cost once.

Three pieces: WAL, output plugin, replication slot

Three concepts, plus their interactions, are the entire mental model.

The WAL is a byte-addressed log

WAL is a sequence of files in pg_wal/, each 16 MB. They concatenate into one logical stream. Every byte in that stream has an LSN — log sequence number, an 8-byte position written as XX/YYYYYYYY (e.g. 0/1A2B3C4D). LSNs are monotonic across the lifetime of the cluster. A consumer that says "I have processed up to LSN 0/1A2B3C4D, give me what comes after" gets exactly the next bytes in commit order.
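The LSN arithmetic is simple enough to sketch in plain Python. This is illustrative code, not a Postgres API — it just shows that `XX/YYYYYYYY` is a 64-bit byte offset (high 32 bits before the slash, low 32 bits after), and that `pg_wal_lsn_diff()` is subtraction on those offsets:

```python
def parse_lsn(text: str) -> int:
    """Convert 'XX/YYYYYYYY' into a single 64-bit byte offset."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def format_lsn(lsn: int) -> str:
    """Inverse of parse_lsn, matching Postgres's %X/%X display format."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"

start = parse_lsn("0/1A2B3C4D")
end   = parse_lsn("1/00000000")
print(format_lsn(start))   # 0/1A2B3C4D
print(end - start)         # bytes of WAL between the two positions
```

Every lag and retention number later in this chapter is exactly this subtraction.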

The output plugin is loaded into the backend

The plugin is a .so shared library that registers callbacks: begin, change, commit. When the decoder hits a WAL record for a table you have selected, it calls the plugin's change callback with a structured representation. The plugin formats the output (a compact binary protocol for pgoutput, JSON for wal2json, debug text for test_decoding) and pushes bytes into the slot's outgoing buffer.
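Real output plugins are C shared libraries, but the callback contract is small enough to model in Python. Everything below is a toy — the class and its output format are invented, loosely in the spirit of test_decoding's text output — but the begin/change/commit shape is the real contract:

```python
class TextDecodingPlugin:
    """Toy model of an output plugin's callbacks (illustrative, not real C)."""
    def __init__(self):
        self.out = []                      # stands in for the slot's buffer

    def begin(self, xid):
        self.out.append(f"BEGIN {xid}")

    def change(self, table, action, row):
        cols = " ".join(f"{k}={v!r}" for k, v in row.items())
        self.out.append(f"table {table}: {action}: {cols}")

    def commit(self, xid, lsn):
        self.out.append(f"COMMIT {xid} (lsn {lsn})")

# The decoder drives the callbacks; the plugin only chooses the wire format.
plugin = TextDecodingPlugin()
plugin.begin(50231)
plugin.change("public.orders", "UPDATE", {"id": 98421, "status": "paid"})
plugin.commit(50231, "0/1A2B3C4D")
print("\n".join(plugin.out))
```

Swap the string formatting for JSON and you have the shape of wal2json; swap it for a binary encoder and you have the shape of pgoutput.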

Three plugins are worth knowing:

  1. test_decoding — ships with Postgres; emits human-readable debug text. Good for psql experiments, wrong for production parsing.
  2. wal2json — a widely packaged third-party plugin; emits JSON. The easiest target for a hand-written consumer.
  3. pgoutput — ships with Postgres 10+; the binary format behind built-in logical replication, and the one Debezium speaks natively.

The replication slot is the bookmark

A slot is server-side state that says: "consumer X has acknowledged up to LSN 0/1A2B3C4D; do not delete WAL files containing bytes after this." It is durable across restarts. It is the reason the consumer can disconnect for an hour, reconnect, and resume exactly where it left off.

Slots come in two flavours:

  1. Physical slots feed raw WAL bytes to a standby for streaming replication; no plugin is involved.
  2. Logical slots attach an output plugin and feed decoded row events to a consumer. Everything in this chapter uses logical slots.

The slot is also the dangerous primitive. Why slots can take down a Postgres: the slot pins the WAL on disk. If the consumer dies and never reconnects, WAL files accumulate forever. A slot that has fallen behind by 3 days, on a write-heavy database, can fill the disk. The fix is not "increase disk"; the fix is to alert when pg_replication_slots.confirmed_flush_lsn falls more than N MB behind pg_current_wal_lsn() and either resurrect the consumer or drop the slot.

[Figure: Replication slot — the bookmark that retains WAL. A horizontal timeline of WAL segments (seg N-3 … seg N+3) with two markers: confirmed_flush_lsn, where the consumer last acknowledged, and current_wal_lsn, where the writer is now. Everything between them is retained WAL the slot will not let Postgres delete; it grows linearly while the consumer is offline — the disk-fill failure mode. slot lag (bytes) = current_wal_lsn − confirmed_flush_lsn.]
A healthy slot has lag in the megabytes. A dying slot has lag in the gigabytes. A fatal slot has lag exceeding the disk's free space — at which point Postgres refuses new writes.

Build it: a logical-decoding consumer in Python

Walk through it end to end. The setup is one psql session for postgresql.conf settings, one for the slot, and one Python consumer that reads.

Step 1 — configure Postgres

In postgresql.conf (or via ALTER SYSTEM):

wal_level = logical
max_replication_slots = 10
max_wal_senders = 10

Restart afterwards — all three are postmaster-level settings that require one. Why wal_level matters: replica (the default for streaming replication) writes enough information to physically replay pages. logical adds extra information — old-tuple data for UPDATEs and DELETEs based on REPLICA IDENTITY — that the output plugin needs to emit row-level events. Without logical, the WAL is missing the bytes the plugin reads.

Step 2 — choose REPLICA IDENTITY per table

Each table tells the WAL writer how much old-row data to record on UPDATE/DELETE. The default — DEFAULT — uses the primary key. For a payments table this is right. For a table without a primary key, you must set FULL (records the entire old row) or USING INDEX <name>. Without one of these, the decoded UPDATE/DELETE events arrive with null identity, and the consumer cannot tell which row changed.

ALTER TABLE orders REPLICA IDENTITY DEFAULT;  -- the default; uses PK
ALTER TABLE wallet_balances REPLICA IDENTITY FULL;  -- if no PK
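What REPLICA IDENTITY changes is visible in the decoded payloads themselves. The dictionaries below sketch wal2json's format-version 1 layout with invented values — under DEFAULT, oldkeys carries only the primary key; under FULL, it carries every old column, which is why FULL works for tables without a primary key:

```python
update_with_default_identity = {
    "kind": "update", "schema": "public", "table": "orders",
    "columnnames": ["id", "status"], "columnvalues": [98421, "paid"],
    # DEFAULT: oldkeys carries only the primary key of the old row.
    "oldkeys": {"keynames": ["id"], "keyvalues": [98421]},
}

update_with_full_identity = {
    "kind": "update", "schema": "public", "table": "wallet_balances",
    "columnnames": ["user_id", "balance"], "columnvalues": ["u_7", 1500],
    # FULL: every old column is recorded, so no primary key is required.
    "oldkeys": {"keynames": ["user_id", "balance"], "keyvalues": ["u_7", 2000]},
}

def identity(change):
    """Which row changed, as the consumer would recover it."""
    old = change.get("oldkeys", {})
    return dict(zip(old.get("keynames", []), old.get("keyvalues", [])))

print(identity(update_with_default_identity))   # {'id': 98421}
print(identity(update_with_full_identity))
```

With no primary key and DEFAULT identity, `oldkeys` is absent and `identity()` returns an empty dict — the "null identity" failure described above.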

Step 3 — create a slot and a publication

-- Pick wal2json for human-readable JSON output during development.
SELECT pg_create_logical_replication_slot('orders_cdc_slot', 'wal2json');

-- A publication lists which tables a pgoutput slot will see. wal2json
-- filters with its add-tables option instead, but creating the publication
-- now makes the later switch to pgoutput/Debezium a one-line change.
CREATE PUBLICATION orders_pub FOR TABLE orders, payments, refunds;

Step 4 — write the consumer

import json
import psycopg2
import psycopg2.extras

# Connection string for Razorpay-shaped local dev:
DSN = "dbname=razorpay_test user=cdc_reader replication=database host=localhost"
SLOT_NAME = "orders_cdc_slot"

# psycopg2's LogicalReplicationConnection uses the streaming-replication
# protocol under the hood. The 'replication=database' DSN flag is required —
# without it, Postgres rejects the START_REPLICATION command.
conn = psycopg2.connect(DSN, connection_factory=psycopg2.extras.LogicalReplicationConnection)
cur = conn.cursor()

class Consumer:
    """One change at a time. In production you'd batch and pipeline this."""
    def __init__(self):
        self.last_processed_lsn = None
        self.events_seen = 0

    def __call__(self, msg):
        # msg.payload is the bytes the output plugin wrote — JSON for wal2json.
        event = json.loads(msg.payload)
        for change in event.get("change", []):
            action = change["kind"]              # "insert" | "update" | "delete"
            table  = f'{change["schema"]}.{change["table"]}'
            keys   = dict(zip(change.get("oldkeys", {}).get("keynames", []),
                              change.get("oldkeys", {}).get("keyvalues", []))) or \
                     dict(zip(change.get("columnnames", []), change.get("columnvalues", [])))
            print(f"{msg.data_start:>12}  {action:<6}  {table:<24}  {keys}")
            self.events_seen += 1
        # Tell Postgres we have durably handled everything up to msg.data_start.
        # In production, do this AFTER the downstream sink confirms commit.
        msg.cursor.send_feedback(flush_lsn=msg.data_start)
        self.last_processed_lsn = msg.data_start

cur.start_replication(slot_name=SLOT_NAME, decode=True, options={
    "include-xids": "1",
    "include-lsn":  "1",
    # format-version 1 (the default) wraps each transaction in a
    # {"change": [...]} object — the shape Consumer.__call__ parses.
    # Version 2 emits one JSON object per action with different keys.
    "format-version": "1",
})
consumer = Consumer()
print("listening on slot:", SLOT_NAME)
try:
    cur.consume_stream(consumer)
except KeyboardInterrupt:
    print(f"\nstopped after {consumer.events_seen} events at lsn {consumer.last_processed_lsn}")
finally:
    cur.close(); conn.close()

Sample run, after a separate session does INSERT, UPDATE, DELETE on orders:

listening on slot: orders_cdc_slot
   1758932480  insert  public.orders        {'id': 98421, 'merchant_id': 'razorpay-test', 'amount': 4999, 'status': 'created'}
   1758932640  update  public.orders        {'id': 98421}
   1758932812  delete  public.orders        {'id': 98421}
^C
stopped after 3 events at lsn 1758932812

A walkthrough of the lines that matter:

  1. replication=database in the DSN switches the session to the replication protocol while still binding it to one database; without it, START_REPLICATION is rejected.
  2. start_replication(..., decode=True) makes psycopg2 decode the payload bytes to text; the options dict is passed through to wal2json unchanged.
  3. send_feedback(flush_lsn=...) is the acknowledgement that advances the slot's confirmed_flush_lsn. Acknowledge only after the downstream sink has durably committed — ack earlier, and a crash between the ack and the sink commit silently loses events.

The thing to internalise: the consumer is small — roughly 30 lines of Python. Everything sophisticated about CDC tools — Debezium's connector framework, AWS DMS's task model, Materialize's incremental views — is built on top of this one primitive. Understand the primitive and you understand every tool.

Operating the slot in production

The hardest part of running logical decoding in production at payments scale (Razorpay processes on the order of 10 crore transactions a day; PhonePe sustains on the order of 50 crore UPI events a day) is not the consumer code. The consumer code rarely changes. The slot management does.

Three numbers to alert on:

  1. pg_replication_slots.confirmed_flush_lsn lag: how far behind the slot is. Threshold: 1 GB or 5 minutes of WAL, whichever comes first. Past 1 GB, it means the consumer is meaningfully behind; past 5 minutes, it means freshness SLAs are at risk.
  2. active = false on the slot: the consumer is not connected. A slot that is inactive AND lagging is the disk-fill scenario.
  3. pg_replication_slots.restart_lsn: the oldest LSN the slot still needs. The difference between this and pg_current_wal_lsn() is the WAL retention size. Some teams set a hard cap (max_slot_wal_keep_size = 10GB since Postgres 13) to refuse to retain WAL beyond a limit — at which point the slot is invalidated and the consumer must rebootstrap, but the database stays alive. Why this trade-off is real: without max_slot_wal_keep_size, a stuck slot will fill disk and crash the database. With it, a stuck slot will eventually be killed and force the consumer to do a full snapshot bootstrap. Most teams choose the second failure mode because it does not page the on-call DBA at 3 a.m.
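The three alerts can be sketched as one pure function over a row shaped like pg_replication_slots (the column names are real; the thresholds and the function itself are illustrative, tuned per deployment):

```python
def parse_lsn(text):
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

GB = 1 << 30

def slot_alerts(row, current_wal_lsn):
    """Return the alert conditions a monitoring loop would page on."""
    now = parse_lsn(current_wal_lsn)
    alerts = []
    # Alert 1: confirmed_flush_lsn lag — the consumer is behind.
    if now - parse_lsn(row["confirmed_flush_lsn"]) > 1 * GB:
        alerts.append("consumer behind")
    # Alert 2: no consumer connected — with lag, the disk-fill scenario.
    if not row["active"]:
        alerts.append("consumer disconnected")
    # Alert 3: retained WAL approaching the max_slot_wal_keep_size cap.
    if now - parse_lsn(row["restart_lsn"]) > 10 * GB:
        alerts.append("retention near cap")
    return alerts

row = {"slot_name": "orders_cdc_slot", "active": False,
       "restart_lsn": "4/00000000", "confirmed_flush_lsn": "4/80000000"}
print(slot_alerts(row, current_wal_lsn="7/00000000"))
```

A slot tripping alerts 1 and 2 together is the one to act on first: nothing is draining it, and the lag can only grow.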

A common Razorpay-style monitoring query:

SELECT slot_name, active, restart_lsn, confirmed_flush_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag_bytes,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retain_bytes
FROM pg_replication_slots;

Sample output during a healthy run:

   slot_name      | active | restart_lsn | confirmed_flush_lsn | lag_bytes | retain_bytes
------------------+--------+-------------+---------------------+-----------+--------------
 orders_cdc_slot  | t      | 0/1A2B3000  | 0/1A2B3C4D          | 412 kB    | 3 MB
 analytics_slot   | t      | 0/1A2B3500  | 0/1A2B3B00          | 1.2 MB    | 2 MB

retain_bytes is the disk pressure metric. lag_bytes is the freshness metric. Both are alertable independently.

Common confusions

  1. Logical decoding is not logical replication. Built-in logical replication (CREATE PUBLICATION / CREATE SUBSCRIPTION between two Postgres servers) is one consumer of logical decoding; a CDC pipeline talks to the decoding layer directly.
  2. The slot stores a position, not the events. Changes are decoded from retained WAL as the consumer reads; drop the slot and the unread history is gone for good.
  3. Setting wal_level = logical does not retroactively make old WAL decodable — a slot streams from its creation point, never earlier.

Going deeper

How the output plugin sees a transaction

pgoutput and wal2json both buffer a transaction's changes and emit them at COMMIT, in commit order — meaning a 30-minute transaction emits nothing until it commits, then emits the entire batch in a burst. Why this matters for freshness: the long transaction's changes are already half an hour stale when they arrive, and because the stream is serialised in commit order, that burst must finish before the events of anything that committed after it — one huge transaction pushes a lag spike onto every consumer of the slot. An open transaction also pins the slot's restart_lsn, growing WAL retention while it runs. The fix is to keep transactions short on tables in the publication, or to use streaming of in-progress transactions (Postgres 14+, the streaming = on option), which emits events incrementally and signals an abort downstream if the transaction rolls back.
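The buffer-until-commit behaviour can be modelled in a few lines. This is a toy sketch of the decoder's per-transaction buffering (the class name and methods are invented, loosely after Postgres's internal reorder buffer), showing that emission order is commit order, not write order, and that aborted work never reaches the consumer:

```python
class ReorderBuffer:
    """Toy model: changes are held per transaction, released at COMMIT."""
    def __init__(self):
        self.pending = {}      # xid -> buffered changes
        self.emitted = []      # what the consumer actually sees

    def change(self, xid, event):
        self.pending.setdefault(xid, []).append(event)

    def commit(self, xid):
        self.emitted.extend(self.pending.pop(xid, []))

    def abort(self, xid):
        self.pending.pop(xid, None)    # rolled-back work is never emitted

buf = ReorderBuffer()
buf.change(100, "long-txn write 1")    # the 30-minute transaction starts
buf.change(101, "short-txn write")
buf.commit(101)                        # commits first, so it streams first
buf.change(100, "long-txn write 2")
buf.commit(100)                        # only now does the long txn appear
print(buf.emitted)
```

The streaming = on mode of Postgres 14+ changes `change()` to emit immediately, at the price of needing `abort()` semantics on the consumer side.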

pgoutput versus wal2json versus a bespoke plugin

Production CDC at scale (Razorpay, Stripe, Shopify) uses pgoutput because it is the binary format Debezium understands natively, and Debezium handles the reconnection / serialisation / Kafka-sink layer. wal2json is the right choice for one-off integrations, custom Python consumers, or tools that don't want to implement the binary protocol. A bespoke plugin (.so you write yourself) is rare; it is the right answer when you have an unusual emit format need (Avro directly, or a shape that maps onto your warehouse's CDC convention) and you can afford to maintain C code against Postgres major-version changes.

Snapshot bootstrap — the missing half

Logical decoding from a slot's creation point gives you everything after. To bootstrap a CDC pipeline against a 10 crore-row table, you also need a consistent snapshot of the table at the same LSN the slot started. Postgres provides this by exporting a snapshot at slot creation: the replication-protocol command CREATE_REPLICATION_SLOT ... LOGICAL ... EXPORT_SNAPSHOT returns a snapshot name, which other sessions adopt with SET TRANSACTION SNAPSHOT to read the table — in one or many parallel transactions — at exactly the slot's starting point. The consumer then attaches to the slot, and every event it receives is guaranteed to be after the snapshot. The next chapter (/wiki/snapshot-cdc-the-bootstrapping-problem) walks the bootstrap protocol Debezium implements and the failure modes (slot creation succeeds, snapshot read takes 4 hours, WAL fills during the read, slot becomes invalid).
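The hand-off rule between snapshot and stream is pure logic and worth seeing without a database. A sketch under stated assumptions — `bootstrap`, the event shapes, and the LSN values are all invented for illustration; the invariant is that an event at or before the snapshot's LSN is already reflected in the snapshot and must be skipped:

```python
def parse_lsn(text):
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def bootstrap(snapshot_rows, stream, slot_start_lsn):
    """Apply snapshot first, then only events past the slot's start LSN."""
    table = {row["id"]: row for row in snapshot_rows}   # consistent snapshot
    cutoff = parse_lsn(slot_start_lsn)
    for event in stream:
        if parse_lsn(event["lsn"]) <= cutoff:
            continue            # already reflected in the snapshot
        if event["action"] == "delete":
            table.pop(event["id"], None)
        else:                   # insert or update
            table[event["id"]] = {"id": event["id"], **event["row"]}
    return table

# The snapshot, taken at 0/20, already contains the 0/10 update.
snapshot = [{"id": 1, "status": "paid"}]
stream = [
    {"lsn": "0/10", "action": "update", "id": 1, "row": {"status": "paid"}},
    {"lsn": "0/30", "action": "insert", "id": 2, "row": {"status": "created"}},
]
print(bootstrap(snapshot, stream, slot_start_lsn="0/20"))
```

Skipping the 0/10 event is what makes replays safe: applying it twice would be harmless here, but for non-idempotent sinks the cutoff is what turns at-least-once delivery into a correct table.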

Multi-tenant: one slot per database, not per schema

A common confusion at SaaS shops in Bengaluru with one Postgres serving 30 customer schemas: a logical slot is per-database, not per-schema. The publication mechanism filters by table, so you can put one customer's tables in one publication and another's in another — but they share the same WAL retention. If one consumer goes down, the other consumer's WAL gets retained too. The cleanest pattern is to run one Postgres per customer if isolation is critical, or one publication per customer with shared WAL but per-publication monitoring of consumer liveness.

What changes between Postgres 10, 13, 14, and 16

Postgres 10 introduced in-core logical replication and the pgoutput plugin. Postgres 13 added max_slot_wal_keep_size for the disk-fill failsafe described above. Postgres 14 added streaming of in-progress transactions for the long-transaction problem. Postgres 16 added subscriber-side parallelism via max_parallel_apply_workers_per_subscription (relevant when one Postgres is a CDC sink for another). When choosing a Postgres version for CDC, 13 is the minimum that is operationally safe; 14+ is preferable; 16+ is current best practice. Razorpay's published architecture pins on 14 with max_slot_wal_keep_size = 50GB and streaming = on.

Where this leads next

By the end of Build 11, the freshness floor disappears. A pipeline that subscribes to a slot sees commits within milliseconds, costs the source ~constant CPU regardless of table size, and survives consumer restarts via the slot's confirmed_flush_lsn. The polling pipeline that paged on-call at Diwali becomes a footnote.
