Wall: OLTP databases are a source you don't control
It is 2:47 a.m. on a Wednesday and the on-call data engineer at a Bengaluru fintech has been paged. The merchant-payouts dashboard says ₹14.2 crore moved overnight, but Finance reconciled against the Postgres source and the real number is ₹14.6 crore. Forty lakh rupees vanished between the source-of-truth and the warehouse, and the only clue is that someone, on some service, ran an UPDATE merchants SET payout_status = 'settled' WHERE batch_id IN (...) two hours ago — and the warehouse pipeline has no idea that update happened. The pipeline reads the merchants table at 1 a.m. with SELECT * WHERE updated_at > '2026-04-24' and trusts the developers who wrote the OLTP service to update updated_at on every change. They didn't. This is the wall that opens Build 11.
Every pipeline you have built so far assumes the source emits a faithful change stream. The OLTP databases that hold most production state — Postgres, MySQL, Oracle — do not, by default, emit anything. They mutate rows in place, and the only complete record of what changed is buried in their write-ahead log. Until you can read that log, your pipeline is guessing at the truth.
What "a source you don't control" actually means
Every chapter through Build 10 has assumed inputs you can shape: CSVs you generate, message logs you append to, stream events you publish from your own service. The producer cooperates because the producer is you. The contract is whatever you wrote.
OLTP databases break this assumption in five specific ways:
- The source is owned by another team. The Razorpay payments service writes to a Postgres cluster operated by the platform team. You, the data engineer, get a read replica at best. You cannot ask them to add a Kafka publish on every transaction — that doubles the failure modes of their critical-path code.
- Rows mutate in place. `UPDATE orders SET status = 'cancelled' WHERE id = 91823` overwrites the previous status. The history of the row exists nowhere a SQL reader can see.
- `updated_at` lies. Some writes update it. Some don't. ORMs, raw SQL fixes, batch jobs migrating data — every code path is a chance for `updated_at` to disagree with reality. A Flipkart catalogue table once had 2.3 million rows where `updated_at` was older than `inserted_at` because a script bulk-corrected SKUs and only touched the SKU column.
- Deletes are invisible. `DELETE FROM cart_items WHERE user_id = 4517` removes the row entirely. There is no row left for `SELECT WHERE updated_at > X` to find. Tomorrow's snapshot will simply be missing rows that yesterday's snapshot had — and the pipeline has no signal whether they were intentionally deleted or merely filtered out by a predicate change.
- Schema evolves without warning. A developer adds a column, renames one, drops another. Your pipeline's `SELECT *` either silently widens (if the warehouse is permissive) or breaks at 7 a.m. (if it isn't).
Why these gaps compound: the data-quality bug at 2:47 a.m. is never one gap on its own. It is two stacked. A DELETE (gap 4) on a row that was previously updated (gap 2) by a script that didn't bump `updated_at` (gap 3) — the row vanishes from your warehouse, and the only forensic evidence lives in the OLTP team's WAL, whose retention window is 4 hours.
The "incremental SELECT" pattern and why it fails
Most pipelines start with the seemingly-reasonable plan: every hour, SELECT * FROM source WHERE updated_at > :last_high_water_mark. This is the pattern of Build 3 (incremental processing) applied to OLTP. Walk through what it actually misses.
```python
# The naive incremental pull, the way most pipelines start.
import datetime

import psycopg2


def pull_incremental(conn, table: str, watermark: datetime.datetime):
    """Return new and updated rows since `watermark`."""
    sql = f"SELECT * FROM {table} WHERE updated_at > %s ORDER BY updated_at"
    with conn.cursor() as cur:
        cur.execute(sql, (watermark,))
        cols = [d.name for d in cur.description]
        rows = [dict(zip(cols, r)) for r in cur.fetchall()]
    new_watermark = max((r['updated_at'] for r in rows), default=watermark)
    return rows, new_watermark


# --- Demo against a Razorpay-style merchants table ---
# Pretend the table has 240,000 rows, with these recent activities:
#   17:00 IST someone runs: UPDATE merchants SET kyc_status='verified' WHERE id=8141
#            (single-column update, ORM bumps updated_at — captured.)
#   17:14 IST a script runs: UPDATE merchants SET pincode = LPAD(pincode, 6, '0')
#                            WHERE LENGTH(pincode) = 5
#            (raw SQL, did NOT touch updated_at — INVISIBLE to the pull.)
#   17:31 IST someone runs: DELETE FROM merchants WHERE id = 8902 AND status='banned'
#            (row vanishes — INVISIBLE to the pull.)
last_wm = datetime.datetime.fromisoformat("2026-04-24 16:00:00+05:30")
conn = psycopg2.connect("postgresql://reader@payments-replica:5432/payments")
rows, new_wm = pull_incremental(conn, "merchants", last_wm)
print(f"Caught {len(rows)} changes between {last_wm} and {new_wm}")
print("Visible changes:")
for r in rows:
    print(f"  id={r['id']} kyc_status={r['kyc_status']} updated_at={r['updated_at']}")

# Sample run output:
#   Caught 1 changes between 2026-04-24 16:00:00+05:30 and 2026-04-24 17:00:14+05:30
#   Visible changes:
#     id=8141 kyc_status=verified updated_at=2026-04-24 17:00:14+05:30
#
# What the warehouse will silently get wrong:
#   - 4,217 merchants whose pincode was zero-padded look the same as before.
#   - merchant id=8902 still exists in the warehouse, but is gone from source.
#   - if id=8902's daily payout was ₹83,000, the warehouse keeps emitting
#     that payout in reports until someone notices the orphan.
```
A walkthrough of what just happened:
- `SELECT * FROM merchants WHERE updated_at > :wm` — the entire correctness assumption of this pattern is in the predicate. It depends on `updated_at` being touched on every mutating path. In a 240k-row table accessed by 14 services and a handful of bulk-correction scripts, that assumption fails roughly once a quarter on a healthy team and once a sprint on an unhealthy one.
- The 4,217 missing pincode updates — invisible because the pincode-correction script wrote raw SQL that didn't bump `updated_at`. The warehouse will keep showing the old 5-digit pincodes for as long as the rows are not touched again. Why the warehouse is wrong but no alert fires: row-count is unchanged, schema is unchanged, the freshness check passes (the table got an update at 17:00). Every monitor you would naively put on this pipeline is green while the data is silently stale.
- The vanished `id=8902` — `DELETE` removes the row from the source. The pipeline pulls only rows where `updated_at` exists. Deletes have no `updated_at`, no row, no signal. The merchant's row in the warehouse becomes an orphan that quietly inflates payout totals.
- `new_watermark = max(r['updated_at'] for r in rows)` — this advances the bookmark. Once advanced, the missed pincode updates and the deleted row are forever in the past. There is no path back to detect them short of a full reload.
- The cost of this pull, when it runs, is O(rows-since-watermark). The cost of being silently wrong about deletes and bulk updates is unbounded.
This is the failure mode CDC exists to fix. Every approach in Build 11 — log-based replication, Debezium connectors, snapshot-and-stream, schema registries — is a different wedge driven into one of the five gaps listed above.
There is a second-order failure that the same code creates, worth naming explicitly. Imagine Riya, an SDE-2 on the data team, discovers the missing pincode updates two months later and writes a fix: a nightly job that compares row-hashes between source and warehouse and reconciles the diff. The fix works. But it costs a full-table scan of the source every night — 240,000 rows for merchants is fine, 8.4 million rows for orders is borderline, 100 million rows for events is impossible without dedicated infrastructure. Reconciliation is a tax on top of the leaky primary path; it does not replace the path. Most teams that try this approach end up with two pipelines (the primary updated_at pull and the reconciliation diff) and the wrong answer when both disagree. The lesson the Razorpay platform team published in their 2024 post-incident review: stop trying to make updated_at faithful; read the WAL.
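Riya's reconciliation job can be sketched as a keyed row-hash diff (the helper names and the toy rows below are illustrative, not from the incident): hash every non-key column on both sides, then compare keyed hashes to find orphans and silently stale rows.

```python
import hashlib

def row_hash(row: dict, key: str = "id") -> tuple:
    """Hash every column except the key, so any silent column change shows up."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row) if k != key)
    return row[key], hashlib.sha256(payload.encode()).hexdigest()

def reconcile(source_rows, warehouse_rows, key: str = "id"):
    """Full-scan diff: O(source + warehouse) rows every night, the tax Riya pays."""
    src = dict(row_hash(r, key) for r in source_rows)
    wh = dict(row_hash(r, key) for r in warehouse_rows)
    missing_deletes = wh.keys() - src.keys()   # deleted at source, orphaned in warehouse
    missed_updates = {k for k in src.keys() & wh.keys() if src[k] != wh[k]}
    return missing_deletes, missed_updates

# The 17:14 pincode fix and the 17:31 delete from the demo, in miniature:
source = [{"id": 1, "pincode": "056001"}]   # zero-padded, updated_at untouched
warehouse = [{"id": 1, "pincode": "56001"},  # stale copy
             {"id": 2, "pincode": "110001"}]  # orphan of a source-side DELETE
deletes, updates = reconcile(source, warehouse)
print(deletes, updates)  # {2} {1}
```

Note what the sketch makes visible: the diff finds everything the `updated_at` pull missed, but only by reading every row on both sides, which is exactly the cost the text describes.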
What the source actually has, that you can't see
Every modern OLTP database keeps a write-ahead log: an append-only file that records every change in commit order, before it is applied to the data files. Postgres calls it the WAL. MySQL calls it the binlog. Oracle calls it the redo log. SQL Server calls it the transaction log. The names differ; the shape is identical:
- Every row insert, update, and delete is in there.
- Every transaction boundary is in there.
- Every schema change (DDL) is in there.
- It is the canonical source of truth for replication — if it weren't faithful, replicas would diverge from the primary.
The WAL is also load-bearing for the source. Postgres needs it for crash recovery. The WAL retention is finite — typically 4 to 24 hours on a busy production cluster. If you, the consumer, fall behind by more than that, the source either stalls (because it can't recycle WAL segments you haven't read) or you lose data (if the source is configured to recycle anyway). This tension — the WAL is the truth, but reading it puts back-pressure on the source's write path — is the constraint that shapes every CDC system. Why this is operationally heavy: the platform team that owns the Postgres cluster has to enable logical replication slots, give them retention budget, and trust that your pipeline will keep up. If your CDC consumer crashes for 6 hours during Big Billion Days, you can wedge the source database. The platform team has scars from this — that is why they say no to first asks.
A useful mental model: the WAL is the bank ledger and the table is the running balance. The teller's app shows you the balance — fast, indexed, easy to query — but if you want to know "what changed in the last hour", the balance alone cannot tell you. You need the ledger entries. Every database that participates in serious CDC is one where the ledger has been opened to authorised consumers. This is why MongoDB's change streams, DynamoDB's streams, Cassandra's CDC log, MySQL's binlog, Postgres's logical decoding, and Oracle's LogMiner all exist as distinct named features — each vendor decided, separately, that the ledger had to be exposed in a controlled way.
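The ledger-and-balance model can be made concrete in a few lines (a toy, no database involved): the balance is a fold over the ledger, and the fold is not invertible.

```python
# A toy ledger (the WAL) and the balance (the table) derived from it.
ledger = [("deposit", 500), ("withdraw", 200), ("deposit", 100)]

balance = 0
for op, amount in ledger:
    balance += amount if op == "deposit" else -amount
print(balance)  # 400

# From the balance alone you cannot recover "what changed in the last hour":
# 400 is equally consistent with a single deposit of 400, or with infinitely
# many other histories. Only the ledger distinguishes them, which is why CDC
# reads the WAL instead of diffing table states.
```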
A small concrete walkthrough: what one WAL record looks like
Run `pg_logical_slot_peek_changes` on a slot configured with the `test_decoding` plugin against a Postgres that has just received the three statements from the demo above. The output is roughly:
```
lsn         | xid    | data
------------+--------+-------------------------------------------------------------------
0/3A1B240   | 18402  | BEGIN 18402
0/3A1B280   | 18402  | table public.merchants: UPDATE: id[integer]:8141
            |        |   kyc_status[text]:'verified' updated_at[timestamptz]:'2026-04-24 17:00:14+05:30'
0/3A1B2C0   | 18402  | COMMIT 18402
0/3A1F100   | 18403  | BEGIN 18403
0/3A1F140   | 18403  | table public.merchants: UPDATE: id[integer]:1014
            |        |   pincode[text]:'560001' -- (4,216 more UPDATE rows in this xact)
0/3A1F1A0   | 18403  | COMMIT 18403
0/3A2C040   | 18404  | BEGIN 18404
0/3A2C080   | 18404  | table public.merchants: DELETE: id[integer]:8902
0/3A2C0C0   | 18404  | COMMIT 18404
```
Three things stand out. The bulk update is one transaction with thousands of row events — the WAL preserves transactional grouping, so a downstream consumer can apply all 4,217 rows atomically or none. The DELETE record carries the primary key (because REPLICA IDENTITY is default), so the consumer can route the tombstone correctly. And every record has an LSN — a monotonically-increasing 64-bit pointer into the WAL that becomes the cursor for the next read. Why the LSN is the right cursor and updated_at is the wrong cursor: the LSN is set by the database itself at commit time, atomically with the change. There is no code path that mutates a row without advancing the LSN, because the LSN is the WAL's own write order. The application cannot forget to update it; the database physically cannot.
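For intuition about how mechanical this format is, here is a minimal parser for data lines of the shape shown above (a sketch: real test_decoding output has more cases, and these regexes are simplifications):

```python
import re

# Matches data lines like:
#   table public.merchants: UPDATE: id[integer]:8141 kyc_status[text]:'verified'
CHANGE = re.compile(
    r"table (?P<schema>\w+)\.(?P<table>\w+): (?P<op>INSERT|UPDATE|DELETE): (?P<cols>.*)"
)
COLUMN = re.compile(r"(?P<name>\w+)\[(?P<type>[\w ]+)\]:(?P<value>'[^']*'|\S+)")

def parse_change(line: str):
    """Turn one test_decoding data line into a structured change event."""
    m = CHANGE.match(line.strip())
    if m is None:
        return None  # BEGIN/COMMIT markers and anything we don't recognise
    cols = {c.group("name"): c.group("value").strip("'")
            for c in COLUMN.finditer(m.group("cols"))}
    return {"schema": m.group("schema"), "table": m.group("table"),
            "op": m.group("op"), "columns": cols}

event = parse_change("table public.merchants: DELETE: id[integer]:8902")
print(event)
# {'schema': 'public', 'table': 'merchants', 'op': 'DELETE', 'columns': {'id': '8902'}}
```

Even this toy shows why the DELETE's primary key matters: without `id` in the record, the parsed event could not be routed to the right warehouse row.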
Three patterns Build 11 will teach
Knowing the wall, the next four chapters of Build 11 each build a piece of the bridge:
- Logical decoding (Postgres) and binlog readers (MySQL). How the database exposes its WAL as a stream of structured events. The mechanical primitive on which everything else stands.
- Snapshot-and-stream. You cannot start a CDC consumer at WAL position 0; the WAL only goes back hours. So the bootstrap is: take a consistent snapshot of the table now, start streaming WAL from the LSN at the moment of the snapshot, and stitch them together so no row is missed and no row is double-counted.
- Debezium and the schema registry. How the open-source standard wraps logical decoding into Kafka-shaped change events, with embedded schema, and how schema changes propagate to the warehouse without breaking pipelines at 7 a.m.
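The snapshot-and-stream stitch can be sketched in a few lines (the event tuples and in-memory state are simplified assumptions; the dedicated chapter does this against a real slot): record the LSN at snapshot time, then apply only WAL events past it.

```python
def bootstrap(snapshot_rows, snapshot_lsn, wal_events):
    """Stitch a consistent snapshot with the WAL stream that follows it.

    snapshot_rows: {pk: row} captured in one consistent read at snapshot_lsn.
    wal_events:    [(lsn, op, pk, row)] in commit order, possibly overlapping
                   the snapshot. Events at or before snapshot_lsn are already
                   reflected in the snapshot and must be skipped, or rows get
                   double-counted.
    """
    state = dict(snapshot_rows)
    for lsn, op, pk, row in wal_events:
        if lsn <= snapshot_lsn:
            continue                    # already in the snapshot
        if op == "delete":
            state.pop(pk, None)         # tombstone: remove, don't ignore
        else:
            state[pk] = row             # insert and update are both upserts here
    return state

state = bootstrap(
    snapshot_rows={1: "v1", 2: "v1"},
    snapshot_lsn=100,
    wal_events=[(90, "update", 1, "v1"),    # before snapshot: skip
                (110, "update", 1, "v2"),   # after: apply
                (120, "delete", 2, None)],  # after: remove
)
print(state)  # {1: 'v2'}
```

The single comparison `lsn <= snapshot_lsn` is the whole trick: it is what makes "no row missed, no row double-counted" a property of the cursor rather than of application discipline.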
By the end of Build 11, the 2:47 a.m. on-call page from the start of this chapter has a different shape: instead of "the warehouse is wrong, here is a diff", it is "the CDC connector for merchants lagged by 11 minutes around 17:14, here is the gap, here are the 4,217 events to replay".
Edge cases the wall produces in real production
A few specific failure shapes show up over and over in incident reports. Recognising them is half the diagnosis.
- Soft deletes pretending to be deletes. A back-office tool sets `deleted_at = now()` instead of issuing a `DELETE`. The row is technically still there. If the warehouse model filters on `deleted_at IS NULL`, downstream reports change without any DDL or row-count signal — the silent kind of break. An audit query like `SELECT count(*) FILTER (WHERE deleted_at IS NULL) FROM source` against the source-of-truth, compared against `count(*)` in the warehouse copy, exposes the drift.
- The TRUNCATE event. `TRUNCATE merchants` empties a table without producing per-row delete events in some logical-decoding configurations. A warehouse that consumes the change stream and applies row-level deletes will miss the wipe and keep the old rows. CDC connectors typically emit a synthetic `op = 't'` truncate event; if your consumer ignores unknown ops, the wipe is invisible.
- Long-running transactions blocking WAL recycling. A developer opens a transaction with `BEGIN;` and forgets it on a laptop. The transaction holds a horizon — Postgres cannot recycle WAL past it because logical decoding might still need it. The slot grows, the disk fills, the platform team pages at 4 a.m. The wall here isn't bad data; it is the source-database operational risk you inherited the moment you asked for a slot.
- Replica identity left at default on a table without a primary key. Postgres's logical decoding emits row events keyed by primary key, by default. If a table has no PK (or `REPLICA IDENTITY` is misconfigured), updates and deletes emit `null` for the key, and the consumer cannot route the event correctly. Razorpay's first Debezium rollout hit this on a junction table; 11 days of joins were silently keyed wrong before someone noticed.
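One concrete consequence of the TRUNCATE bullet: a consumer should handle every op code and fail loudly on any it does not recognise. A sketch using Debezium's op letters (the event dict shape here is a simplification, not Debezium's actual envelope):

```python
def apply_event(state: dict, event: dict) -> dict:
    """Apply one change event to an in-memory table copy.

    Ops follow Debezium's convention: c=create, u=update, d=delete,
    r=snapshot read, t=truncate.
    """
    op = event["op"]
    if op in ("c", "u", "r"):            # creates, updates, snapshot reads: upsert
        state[event["key"]] = event["after"]
    elif op == "d":
        state.pop(event["key"], None)    # tombstone: remove the row
    elif op == "t":
        state.clear()                    # the whole-table wipe a row-level
                                         # consumer must not silently drop
    else:
        raise ValueError(f"unknown op {op!r} -- fail loudly, don't skip")
    return state

state = {}
for ev in [{"op": "c", "key": 8141, "after": {"kyc_status": "verified"}},
           {"op": "t"}]:
    state = apply_event(state, ev)
print(state)  # {} -- the truncate wiped everything the insert created
```

A consumer that silently skipped the `t` event would instead keep the stale row forever, which is exactly the invisible-wipe failure the bullet describes.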
Common confusions
- "Read replicas solve this — just SELECT from the replica." A read replica gives you a transactionally-consistent SQL view at a point in time. It still cannot tell you which rows changed since you last looked, because the same row mutates in place on the replica too. Replicas solve scaling, not change-detection.
- "Triggers can publish events on every change." They can, but they run on the OLTP write path. Adding a trigger that writes to an outbox table doubles every UPDATE's commit cost — for a payments table at UPI scale, that is 100M+ extra writes per day. Most platform teams will refuse. The transactional outbox pattern (ch.81) does this carefully; naive triggers don't.
- "
updated_atworks fine if everyone disciplines themselves to update it." It works until the day someone runs a one-off SQL fix and forgets. There is no compile-time check, no test that catches it. It fails silently, and you only find out at month-end reconciliation when the numbers don't match. - "CDC is the same as replication." Replication ships changes to another database of the same shape. CDC ships changes to a heterogeneous consumer (a warehouse, a stream, a search index, a feature store). The transport mechanism overlaps; the consumer's needs (ordering guarantees, schema evolution, fan-out) are different.
- "Snapshot once, then stream forever, you'll never need to snapshot again." True until you add a new consumer, change a schema in a way the consumer needs to backfill, or your WAL retention runs out during an outage. Re-snapshotting is part of operating a CDC pipeline, not a one-time bootstrap.
- "This only matters for big companies." A 100-person Bengaluru startup with 4 microservices and a Postgres-per-service has the same five gaps as a Flipkart-scale pipeline. The blast radius is smaller, not the underlying defect.
Going deeper
Why SELECT FOR UPDATE and "outbox at write time" don't fully escape the wall
A common second attempt at the problem is the transactional outbox: in the same transaction that writes to orders, also write to an outbox table. A separate process polls the outbox and forwards events. This works — if every service participates. The hard part is that a multi-tenant Postgres serves dozens of services, some not under your control, and any one of them can update orders directly without touching the outbox. You have replaced "trust updated_at" with "trust every service writer to remember the outbox", which is the same wall in different clothes. CDC reading the WAL is the only approach where the data engineer does not depend on application-side discipline.
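What "participating in the outbox" means in code, sketched with SQLite standing in for Postgres (table names and schema are illustrative): the business write and the outbox write must share one transaction, on every code path.

```python
import sqlite3

def cancel_order(conn: sqlite3.Connection, order_id: int) -> None:
    """The disciplined path: state change and outbox event commit atomically."""
    with conn:  # one transaction: both writes commit together or not at all
        conn.execute("UPDATE orders SET status = 'cancelled' WHERE id = ?",
                     (order_id,))
        conn.execute(
            "INSERT INTO outbox (aggregate, event_type, payload) VALUES (?, ?, ?)",
            ("order", "cancelled", f'{{"id": {order_id}}}'))

# Demo on an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE outbox (aggregate TEXT, event_type TEXT, payload TEXT)")
conn.execute("INSERT INTO orders VALUES (91823, 'placed')")
cancel_order(conn, 91823)

# The wall, restated: nothing stops another service from updating orders
# directly and skipping the outbox INSERT. The WAL records both paths; the
# outbox records only the disciplined one.
conn.execute("UPDATE orders SET status = 'refunded' WHERE id = 91823")
print(conn.execute("SELECT count(*) FROM outbox").fetchone()[0])  # 1, not 2
```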
The Razorpay incident that named "the wall"
In a public 2024 engineering writeup, Razorpay described a Sunday night where their merchant-onboarding pipeline (Postgres → BigQuery) silently dropped 12,400 merchants from a daily snapshot because a back-office tool ran a soft-delete via UPDATE merchants SET deleted_at = now() — and the warehouse model was loading on deleted_at IS NULL from the snapshot table, but had no event for the transitions. The fix was a 3-month migration to Debezium. The 12,400 merchants represented ~₹38 crore of stuck onboarding. The lesson the team published: "Treat the OLTP database as a source you do not control. Read the log, not the table."
What the WAL retention budget actually costs
A Postgres logical replication slot pins WAL segments until the consumer acks them. Segments are 16 MB each by default; `wal_keep_size` sets a floor on what is retained for physical standbys, but an active logical slot pins segments without bound. On a Razorpay-scale cluster doing 8,000 commits/second, WAL accumulates at roughly 2 GB/minute. A consumer that lags 30 minutes pins 60 GB of disk on the primary. Lag for 6 hours during a connector failure during Big Billion Days, and you have pinned 700+ GB on the primary's volume, risking a "no space left on device" failure and a primary outage. This is why the platform team's ask, before they hand you a slot, is: "what is your maximum tolerable lag, and what is the alert threshold at half that?"
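The retention arithmetic above, as two small helpers (the 2 GB/minute rate is the figure from the text; everything else is derived from it):

```python
def pinned_wal_gb(lag_minutes: float, wal_rate_gb_per_min: float = 2.0) -> float:
    """WAL disk pinned on the primary by a consumer this far behind."""
    return lag_minutes * wal_rate_gb_per_min

def alert_thresholds(max_tolerable_lag_min: float) -> dict:
    """The platform team's ask: alert at half the maximum tolerable lag."""
    return {"warn_min": max_tolerable_lag_min / 2,
            "page_min": max_tolerable_lag_min}

print(pinned_wal_gb(30))     # 60.0  GB for a 30-minute lag
print(pinned_wal_gb(360))    # 720.0 GB for a 6-hour connector outage
print(alert_thresholds(30))  # {'warn_min': 15.0, 'page_min': 30}
```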
Why MySQL's binlog and Postgres's WAL look similar but require different consumers
MySQL's binlog has three formats: STATEMENT (replays SQL strings, terrible for CDC because non-deterministic functions like NOW() give different answers downstream), ROW (logs the actual changed rows, the format CDC wants), and MIXED. Postgres has had logical decoding since 9.4, with pgoutput becoming the standard plugin. The shape of the events differs — MySQL emits binlog events with table maps + row images, Postgres emits typed messages per logical decoding plugin. Debezium normalises both into the same Kafka envelope, which is why it dominates the CDC ecosystem despite being heavyweight.
What the on-call runbook for a CDC pipeline actually contains
A team running CDC at production scale keeps a runbook that maps from "alert fired" to "first three things to check". The shape repeats across Razorpay, Cred, Zerodha, and Swiggy public writeups: (1) is the slot still active on the source, and what is its lag in WAL bytes? (2) is the connector consuming, or has it crashed and left the slot pinned? (3) is the downstream sink (warehouse, Kafka, S3) accepting writes, or is back-pressure propagating up? The runbook closes with a numbered escalation: at 30 minutes of lag, page the data team; at 2 hours, page the platform team because the source is at risk; at 4 hours, drop the slot and plan a re-snapshot rather than risk the primary. Build 11 ends with this runbook as a deliverable.
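The runbook's first check and its escalation ladder, sketched (the SQL targets the Postgres 10+ `pg_replication_slots` catalog view; the minute thresholds are the ones from the text):

```python
# Runbook check (1): is the slot active, and how far behind is it in WAL bytes?
SLOT_LAG_SQL = """
SELECT slot_name, active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots;
"""

def escalation(lag_minutes: float) -> str:
    """The numbered escalation from the runbook, as a function of slot lag."""
    if lag_minutes >= 240:
        return "drop slot, plan re-snapshot"   # protect the primary
    if lag_minutes >= 120:
        return "page platform team"            # the source is at risk
    if lag_minutes >= 30:
        return "page data team"
    return "ok"

print(escalation(45))  # page data team
```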
Where this leads next
The next chapters of Build 11 take this wall apart, mechanism by mechanism:
- /wiki/logical-decoding-postgres-replication-slots-from-scratch — read a Postgres WAL stream by hand, using `pg_create_logical_replication_slot` and `pg_logical_slot_get_changes`. The primitive everything else builds on.
- /wiki/snapshot-and-stream-the-bootstrap-problem — you cannot start at WAL position 0. How do you bootstrap a consumer without missing rows or counting them twice?
- /wiki/debezium-the-cdc-standard — how the open-source standard wraps logical decoding for Postgres, MySQL, MongoDB, SQL Server into a single Kafka-shaped event stream, and what it costs operationally.
References
- Logical Decoding (Postgres docs) — the primary reference for how Postgres exposes its WAL as a stream.
- MySQL Binary Log Formats (MySQL 8.0 docs) — STATEMENT vs ROW vs MIXED; the format that determines whether CDC works.
- Debezium architecture (Debezium docs) — the reference architecture every modern CDC system imitates.
- Razorpay engineering: how we moved 100B events/month to Debezium (2024) — the public writeup that motivated the "wall" framing in this chapter.
- Designing Data-Intensive Applications, Chapter 11 (Kleppmann, 2017) — the canonical introduction to change-capture as a unifying primitive.
- Online, Asynchronous Schema Change in F1 (Rae et al., VLDB 2013) — Google's paper on schema evolution as a first-class CDC concern.
- /wiki/what-makes-a-data-pipeline-different-from-a-script — the Build 1 wall, mirrored here for OLTP sources.
- /wiki/why-logs-the-one-data-structure-streaming-is-built-on — the deep reason CDC is log-based at all.
- Streaming data integration: the right and wrong ways (Confluent blog, 2023) — a practitioner's view on why polling-based pulls are the wrong default and CDC is the right one.
- Postgres write-ahead logging: WAL Internals (Postgres docs) — the storage layer below logical decoding; useful when reasoning about retention and disk pressure.