Wall: OLTP databases are a source you don't control

It is 2:47 a.m. on a Wednesday and the on-call data engineer at a Bengaluru fintech has been paged. The merchant-payouts dashboard says ₹14.2 crore moved overnight, but Finance reconciled against the Postgres source and the real number is ₹14.6 crore. Forty lakh rupees vanished between the source-of-truth and the warehouse, and the only clue is that someone, on some service, ran an UPDATE merchants SET payout_status = 'settled' WHERE batch_id IN (...) two hours ago — and the warehouse pipeline has no idea that update happened. The pipeline reads the merchants table at 1 a.m. with SELECT * WHERE updated_at > '2026-04-24' and trusts the developers who wrote the OLTP service to update updated_at on every change. They didn't. This is the wall that opens Build 11.

Every pipeline you have built so far assumes the source emits a faithful change stream. The OLTP databases that hold most production state — Postgres, MySQL, Oracle — do not, by default, emit anything. They mutate rows in place, and the only complete record of what changed is buried in their write-ahead log. Until you can read that log, your pipeline is guessing at the truth.

What "a source you don't control" actually means

Every chapter through Build 10 has assumed inputs you can shape: CSVs you generate, message logs you append to, stream events you publish from your own service. The producer cooperates because the producer is you. The contract is whatever you wrote.

OLTP databases break this assumption in five specific ways:

  1. The source is owned by another team. The Razorpay payments service writes to a Postgres cluster operated by the platform team. You, the data engineer, get a read replica at best. You cannot ask them to add a Kafka publish on every transaction — that doubles the failure modes of their critical-path code.
  2. Rows mutate in place. An UPDATE orders SET status = 'cancelled' WHERE id = 91823 overwrites the previous status. The history of the row exists nowhere a SQL reader can see.
  3. updated_at lies. Some writes update it. Some don't. ORMs, raw SQL fixes, batch jobs migrating data — every code path is a chance for updated_at to disagree with reality. A Flipkart catalogue table once had 2.3 million rows where updated_at was older than inserted_at because a script bulk-corrected SKUs and only touched the SKU column.
  4. Deletes are invisible. DELETE FROM cart_items WHERE user_id = 4517 removes the row entirely. There is no row left for SELECT WHERE updated_at > X to find. Tomorrow's snapshot will simply be missing rows that yesterday's snapshot had — and the pipeline has no signal that they were intentionally deleted versus filtered by a predicate change.
  5. Schema evolves without warning. A developer adds a column, renames another, drops a third. Your pipeline's SELECT * either silently widens (if the warehouse is permissive) or breaks at 7 a.m. (if it isn't).
[Figure: The wall — five gaps the pipeline cannot see across. On the left, the OLTP source: Postgres/MySQL owned by the payments team, with orders (8.4M rows), merchants (240k), payouts (1.1M); UPDATE and DELETE happen in place. On the right, your pipeline: warehouse/lake and streaming jobs that read at 01:00 IST, trust updated_at, cannot see deletes, and whose SELECT * breaks on schema drift. Between them, five labelled gaps: 1. ownership boundary, 2. in-place mutation, 3. updated_at lies, 4. invisible deletes, 5. silent schema drift.]
Five gaps. Every CDC pattern in Build 11 closes one or more of them.

Why this list matters: the data-quality bug at 2:47 a.m. is never one gap on its own. It is two or three stacked. A DELETE (gap 4) on a row that was previously updated (gap 2) by a script that didn't bump updated_at (gap 3) — the row vanishes from your warehouse and the only forensic evidence is on the OLTP team's WAL retention window, which is 4 hours.

The "incremental SELECT" pattern and why it fails

Most pipelines start with the seemingly reasonable plan: every hour, SELECT * FROM source WHERE updated_at > :last_high_water_mark. This is the pattern of Build 3 (incremental processing) applied to OLTP. Walk through what it actually misses.

# The naive incremental pull, the way most pipelines start.
import psycopg2, json, datetime

def pull_incremental(conn, table: str, watermark: datetime.datetime):
    """Return new and updated rows since `watermark`."""
    # Table name interpolated directly: acceptable only for a trusted constant,
    # never for untrusted input.
    sql = f"SELECT * FROM {table} WHERE updated_at > %s ORDER BY updated_at"
    with conn.cursor() as cur:
        cur.execute(sql, (watermark,))
        cols = [d.name for d in cur.description]
        rows = [dict(zip(cols, r)) for r in cur.fetchall()]
    new_watermark = max((r['updated_at'] for r in rows), default=watermark)
    return rows, new_watermark


# --- Demo against a Razorpay-style merchants table ---
# Pretend the table has 240,000 rows, with these recent activities:
# 17:00 IST  someone runs:   UPDATE merchants SET kyc_status='verified' WHERE id=8141
#                           (single column update, ORM bumps updated_at — captured.)
# 17:14 IST  a script runs:  UPDATE merchants SET pincode = LPAD(pincode, 6, '0')
#                                  WHERE LENGTH(pincode) = 5
#                           (raw SQL, did NOT touch updated_at — INVISIBLE to pull.)
# 17:31 IST  someone runs:   DELETE FROM merchants WHERE id = 8902 AND status='banned'
#                           (row vanishes — INVISIBLE to pull.)

last_wm = datetime.datetime.fromisoformat("2026-04-24 16:00:00+05:30")
conn = psycopg2.connect("postgresql://reader@payments-replica:5432/payments")

rows, new_wm = pull_incremental(conn, "merchants", last_wm)
print(f"Caught {len(rows)} changes between {last_wm} and {new_wm}")
print("Visible changes:")
for r in rows:
    print(f"  id={r['id']} kyc_status={r['kyc_status']} updated_at={r['updated_at']}")

# Sample run output:
# Caught 1 changes between 2026-04-24 16:00:00+05:30 and 2026-04-24 17:00:14+05:30
# Visible changes:
#   id=8141 kyc_status=verified updated_at=2026-04-24 17:00:14+05:30
#
# What the warehouse will silently get wrong:
#  - 4,217 merchants whose pincode was zero-padded look the same as before.
#  - merchant id=8902 still exists in the warehouse, but is gone from source.
#  - if id=8902's daily payout was ₹83,000, the warehouse keeps emitting
#    that payout in reports until someone notices the orphan.

A walkthrough of what just happened: the 17:00 single-column update was caught because the ORM bumped updated_at, so the watermark query found it. The 17:14 bulk pincode fix rewrote 4,217 rows through raw SQL without touching updated_at, so every one of them fell below the watermark and the pull returned none of them. The 17:31 DELETE left no row behind for the WHERE clause to match at all. One change in three reached the warehouse, and the pipeline exited with a success code.

This is the failure mode CDC exists to fix. Every approach in Build 11 — log-based replication, Debezium connectors, snapshot-and-stream, schema registries — is a different wedge driven into one of the five gaps in the figure above.

There is a second-order failure that the same code creates, worth naming explicitly. Imagine Riya, an SDE-2 on the data team, discovers the missing pincode updates two months later and writes a fix: a nightly job that compares row-hashes between source and warehouse and reconciles the diff. The fix works. But it costs a full-table scan of the source every night — 240,000 rows for merchants is fine, 8.4 million rows for orders is borderline, 100 million rows for events is impossible without dedicated infrastructure. Reconciliation is a tax on top of the leaky primary path; it does not replace the path. Most teams that try this approach end up with two pipelines (the primary updated_at pull and the reconciliation diff) and the wrong answer when both disagree. The lesson the Razorpay platform team published in their 2024 post-incident review: stop trying to make updated_at faithful; read the WAL.
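
Riya's reconciliation job, reduced to its core, is a hash-diff between two full table reads. A minimal sketch of the idea — the table contents and hashing scheme here are illustrative, not Razorpay's actual implementation:

```python
import hashlib
import json

def row_hash(row: dict) -> str:
    """Stable hash of a row's contents, independent of column order."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()

def reconcile(source_rows: dict, warehouse_rows: dict):
    """Compare {pk: row} maps; return what the warehouse must change.

    This is the nightly full-scan tax: both sides must be read in full.
    """
    upserts = [pk for pk, row in source_rows.items()
               if pk not in warehouse_rows
               or row_hash(row) != row_hash(warehouse_rows[pk])]
    deletes = [pk for pk in warehouse_rows if pk not in source_rows]
    return upserts, deletes

# The zero-padded pincode (invisible to updated_at) and the deleted
# merchant (invisible to SELECT) both surface here -- at full-scan cost.
source = {8141: {"pincode": "560001"}, 1014: {"pincode": "056001"}}
warehouse = {8141: {"pincode": "560001"}, 1014: {"pincode": "56001"},
             8902: {"pincode": "400001"}}
upserts, deletes = reconcile(source, warehouse)
print(upserts, deletes)  # [1014] [8902]
```

Note that the diff catches all five gaps after the fact, but only by re-reading everything — which is exactly why it does not scale past the small tables.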

What the source actually has, that you can't see

Every modern OLTP database keeps a write-ahead log: an append-only file that records every change in commit order, before it is applied to the data files. Postgres calls it the WAL. MySQL calls it the binlog. Oracle calls it the redo log. SQL Server calls it the transaction log. The names differ; the shape is identical:

[Figure: What the WAL knows that SELECT cannot see — the same hour, two views. WAL (truth): 17:00 UPDATE id=8141; 17:14 bulk UPDATE of 4,217 rows; 17:31 DELETE id=8902. SELECT view: 17:00 caught; 17:14 invisible (no updated_at); 17:31 invisible (DELETE). The WAL has all three; the SELECT path has one of three. CDC is the discipline of consuming the first view, not the second.]
The WAL is the truth. Build 11 is about reading from it without breaking the source database that needs it for crash recovery.

The WAL is also load-bearing for the source. Postgres needs it for crash recovery. The WAL retention is finite — typically 4 to 24 hours on a busy production cluster. If you, the consumer, fall behind by more than that, the source either stalls (because it can't recycle WAL segments you haven't read) or you lose data (if the source is configured to recycle anyway). This tension — the WAL is the truth, but reading it puts back-pressure on the source's write path — is the constraint that shapes every CDC system. Why this is operationally heavy: the platform team that owns the Postgres cluster has to enable logical replication slots, give them retention budget, and trust that your pipeline will keep up. If your CDC consumer crashes for 6 hours during Big Billion Days, you can wedge the source database. The platform team has scars from this — that is why they say no to first asks.

A useful mental model: the WAL is the bank ledger and the table is the running balance. The teller's app shows you the balance — fast, indexed, easy to query — but if you want to know "what changed in the last hour", the balance alone cannot tell you. You need the ledger entries. Every database that participates in serious CDC is one where the ledger has been opened to authorised consumers. This is why MongoDB's change streams, DynamoDB's streams, Cassandra's CDC log, MySQL's binlog, Postgres's logical decoding, and Oracle's LogMiner all exist as distinct named features — each vendor decided, separately, that the ledger had to be exposed in a controlled way.

A small concrete walkthrough: what one WAL record looks like

Run pg_logical_slot_peek_changes on a slot configured with the test_decoding plugin against a Postgres that has just received the three statements in the figure above. The output is roughly:

 lsn        | xid    | data
------------+--------+-------------------------------------------------------------------
 0/3A1B240  | 18402  | BEGIN 18402
 0/3A1B280  | 18402  | table public.merchants: UPDATE: id[integer]:8141
            |        |   kyc_status[text]:'verified' updated_at[timestamptz]:'2026-04-24 17:00:14+05:30'
 0/3A1B2C0  | 18402  | COMMIT 18402
 0/3A1F100  | 18403  | BEGIN 18403
 0/3A1F140  | 18403  | table public.merchants: UPDATE: id[integer]:1014
            |        |   pincode[text]:'560001'  -- (4,216 more UPDATE rows in this xact)
 0/3A1F1A0  | 18403  | COMMIT 18403
 0/3A2C040  | 18404  | BEGIN 18404
 0/3A2C080  | 18404  | table public.merchants: DELETE: id[integer]:8902
 0/3A2C0C0  | 18404  | COMMIT 18404

Three things stand out. The bulk update is one transaction with thousands of row events — the WAL preserves transactional grouping, so a downstream consumer can apply all 4,217 rows atomically or none. The DELETE record carries the primary key (because REPLICA IDENTITY is default), so the consumer can route the tombstone correctly. And every record has an LSN — a monotonically-increasing 64-bit pointer into the WAL that becomes the cursor for the next read. Why the LSN is the right cursor and updated_at is the wrong cursor: the LSN is set by the database itself at commit time, atomically with the change. There is no code path that mutates a row without advancing the LSN, because the LSN is the WAL's own write order. The application cannot forget to update it; the database physically cannot.
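
The LSN's role as a cursor can be made concrete. Postgres prints an LSN as two hex halves ('0/3A1B240'); numerically it is a single 64-bit WAL position, so comparing two LSNs is plain integer comparison. A small sketch — the parsing mirrors pg_lsn's textual format, and the helper name is ours:

```python
def parse_lsn(lsn: str) -> int:
    """Convert Postgres's 'hi/lo' textual LSN to a 64-bit WAL position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# The BEGIN records from the peek above, in commit order:
lsns = ["0/3A1B240", "0/3A1F100", "0/3A2C040"]
positions = [parse_lsn(l) for l in lsns]

# The WAL's own write order is strictly increasing -- this is what makes
# the LSN a safe cursor where updated_at is not.
assert positions == sorted(positions)

# Resuming a consumer means: keep the last acked LSN, ask for everything after it.
last_acked = parse_lsn("0/3A1B2C0")   # the COMMIT of xid 18402
to_replay = [l for l in lsns if parse_lsn(l) > last_acked]
print(to_replay)  # ['0/3A1F100', '0/3A2C040']
```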

Three patterns Build 11 will teach

Knowing the wall, the chapters ahead in Build 11 each build a piece of the bridge:

  1. Logical decoding (Postgres) and binlog readers (MySQL). How the database exposes its WAL as a stream of structured events. The mechanical primitive on which everything else stands.
  2. Snapshot-and-stream. You cannot start a CDC consumer at WAL position 0; the WAL only goes back hours. So the bootstrap is: take a consistent snapshot of the table now, start streaming WAL from the LSN at the moment of the snapshot, and stitch them together so no row is missed and no row is double-counted.
  3. Debezium and the schema registry. How the open-source standard wraps logical decoding into Kafka-shaped change events, with embedded schema, and how schema changes propagate to the warehouse without breaking pipelines at 7 a.m.
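
Pattern 2's stitching rule can be sketched in a few lines. The core invariant: every change with an LSN at or below the snapshot's LSN is already inside the snapshot, so the stream is applied only from strictly after that point. The event shapes and names here are illustrative, not Debezium's actual envelope:

```python
def bootstrap(snapshot_rows: dict, snapshot_lsn: int, wal_events: list) -> dict:
    """Stitch a consistent snapshot with the WAL stream that follows it.

    snapshot_rows: {pk: row} as of snapshot_lsn.
    wal_events:    [(lsn, op, pk, row_or_None)] in commit order.
    """
    state = dict(snapshot_rows)
    for lsn, op, pk, row in wal_events:
        if lsn <= snapshot_lsn:
            continue              # already folded into the snapshot: no double-count
        if op == "DELETE":
            state.pop(pk, None)   # the tombstone a SELECT could never see
        else:                     # INSERT or UPDATE
            state[pk] = row
    return state

snapshot = {8141: {"kyc": "pending"}, 8902: {"status": "banned"}}
events = [
    (100, "UPDATE", 8141, {"kyc": "pending"}),   # lsn <= 100: in the snapshot already
    (140, "UPDATE", 8141, {"kyc": "verified"}),
    (180, "DELETE", 8902, None),
]
print(bootstrap(snapshot, 100, events))
# {8141: {'kyc': 'verified'}}
```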

By the end of Build 11, the 2:47 a.m. on-call page from the start of this chapter has a different shape: instead of "the warehouse is wrong, here is a diff", it is "the CDC connector for merchants lagged by 11 minutes around 17:14, here is the gap, here are the 4,217 events to replay".

Edge cases the wall produces in real production

A few specific failure shapes show up over and over in incident reports. Recognising them is half the diagnosis.

Common confusions

Going deeper

Why SELECT FOR UPDATE and "outbox at write time" don't fully escape the wall

A common second attempt at the problem is the transactional outbox: in the same transaction that writes to orders, also write to an outbox table. A separate process polls the outbox and forwards events. This works — if every service participates. The hard part is that a multi-tenant Postgres serves dozens of services, some not under your control, and any one of them can update orders directly without touching the outbox. You have replaced "trust updated_at" with "trust every service writer to remember the outbox", which is the same wall in different clothes. CDC reading the WAL is the only approach where the data engineer does not depend on application-side discipline.
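
What "in the same transaction" means mechanically, sketched with sqlite3 so it is self-contained — a real deployment would use Postgres, and the table and column names are illustrative:

```python
import sqlite3
import json

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT);
""")

def cancel_order(order_id: int):
    # One transaction: the state change and its event commit or roll back together.
    with conn:
        conn.execute("UPDATE orders SET status = 'cancelled' WHERE id = ?",
                     (order_id,))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (json.dumps({"event": "order_cancelled", "id": order_id}),))

conn.execute("INSERT INTO orders VALUES (91823, 'open')")
cancel_order(91823)

# The poller forwards outbox rows downstream -- but only writes that went
# through cancel_order() ever reach it. A raw UPDATE bypasses it entirely:
conn.execute("UPDATE orders SET status = 'refunded' WHERE id = 91823")  # no event!
print([row[0] for row in conn.execute("SELECT payload FROM outbox")])
# one payload (the cancellation); the refund left no trace in the outbox
```

The last line is the wall again: the outbox only captures writers who opted in, while the WAL captures everyone.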

The Razorpay incident that named "the wall"

In a public 2024 engineering writeup, Razorpay described a Sunday night where their merchant-onboarding pipeline (Postgres → BigQuery) silently dropped 12,400 merchants from a daily snapshot because a back-office tool ran a soft-delete via UPDATE merchants SET deleted_at = now() — and the warehouse model was loading on deleted_at IS NULL from the snapshot table, but had no event for the transitions. The fix was a 3-month migration to Debezium. The 12,400 merchants represented ~₹38 crore of stuck onboarding. The lesson the team published: "Treat the OLTP database as a source you do not control. Read the log, not the table."

What the WAL retention budget actually costs

A Postgres logical replication slot pins WAL segments until the consumer acks them. WAL segments are 16 MB each by default, and without active slots retention is bounded by settings like wal_keep_size — but a lagging slot pins segments without bound (unless max_slot_wal_keep_size caps it, at the cost of invalidating the slot). On a Razorpay-scale cluster doing 8,000 commits/second, WAL accumulates at roughly 2 GB/minute. A consumer that lags 30 minutes pins 60 GB of disk on the primary. Lag for 6 hours during a connector failure during Big Billion Days, and you have pinned 700+ GB on the primary's volume, risking a no-space-left-on-device error and a primary outage. This is why the platform team's ask, before they hand you a slot, is: "what is your maximum tolerable lag, and what is the alert threshold at half that?"
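
The paragraph's arithmetic, written out — the 2 GB/minute accumulation rate is the figure assumed above:

```python
wal_rate_gb_per_min = 2   # ~8,000 commits/second on the cluster above

def pinned_wal_gb(lag_minutes: float) -> float:
    """Disk pinned on the primary by a slot whose consumer lags this far behind."""
    return wal_rate_gb_per_min * lag_minutes

print(pinned_wal_gb(30))      # 60 GB: alert territory
print(pinned_wal_gb(6 * 60))  # 720 GB: primary at risk

# In production the lag is read from the source itself, e.g.:
#   SELECT slot_name,
#          pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
#   FROM pg_replication_slots;
```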

Why MySQL's binlog and Postgres's WAL look similar but require different consumers

MySQL's binlog has three formats: STATEMENT (replays SQL strings, terrible for CDC because non-deterministic functions like NOW() give different answers downstream), ROW (logs the actual changed rows, the format CDC wants), and MIXED. Postgres has had logical decoding since 9.4, with pgoutput becoming the standard plugin. The shape of the events differs — MySQL emits binlog events with table maps + row images, Postgres emits typed messages per logical decoding plugin. Debezium normalises both into the same Kafka envelope, which is why it dominates the CDC ecosystem despite being heavyweight.
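
The switch each side needs before CDC can start, as configuration — these are the standard setting names, but exact values depend on version and workload:

```ini
# postgresql.conf -- expose the WAL to logical decoding consumers
wal_level = logical            # default is 'replica'; 'logical' adds row data
max_replication_slots = 10     # one per CDC consumer, plus headroom
max_wal_senders = 10

# my.cnf -- binlog in the row-image format CDC needs
[mysqld]
binlog_format = ROW            # not STATEMENT: NOW() etc. must not be replayed
binlog_row_image = FULL        # log complete before/after row images
```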

What the on-call runbook for a CDC pipeline actually contains

A team running CDC at production scale keeps a runbook that maps from "alert fired" to "first three things to check". The shape repeats across Razorpay, Cred, Zerodha, and Swiggy public writeups: (1) is the slot still active on the source, and what is its lag in WAL bytes? (2) is the connector consuming, or has it crashed and left the slot pinned? (3) is the downstream sink (warehouse, Kafka, S3) accepting writes, or is back-pressure propagating up? The runbook closes with a numbered escalation: at 30 minutes of lag, page the data team; at 2 hours, page the platform team because the source is at risk; at 4 hours, drop the slot and plan a re-snapshot rather than risk the primary. Build 11 ends with this runbook as a deliverable.
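
The escalation ladder at the end of that runbook is simple enough to encode directly — the thresholds are the ones quoted above, and the function name is ours:

```python
def escalation(lag_minutes: float) -> str:
    """Map CDC lag to the runbook's escalation step."""
    if lag_minutes >= 240:
        return "drop the slot, plan a re-snapshot"  # protect the primary
    if lag_minutes >= 120:
        return "page the platform team"             # the source is now at risk
    if lag_minutes >= 30:
        return "page the data team"
    return "watch"

print(escalation(45))   # page the data team
```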

Where this leads next

The next chapters of Build 11 take this wall apart, mechanism by mechanism:

References