Append-only vs mutable source tables

Karan inherits a 4-year-old pipeline at a Bengaluru fintech. Every 5 minutes it pulls "new rows" from the OLTP payments table using WHERE created_at > :last_seen, advances a cursor, and writes to ClickHouse. For three years it ran clean. Then the risk team shipped a feature where a payment can be re-classified from success to chargeback up to 90 days later — they update the existing row in place. The pipeline keeps running. The cursor keeps advancing. ClickHouse keeps showing the original success for every chargeback the OLTP has since corrected. Karan's first instinct is "the cursor is broken." It is not. The source table changed shape — from effectively append-only to mutable — and the cursor that worked for one shape silently produces wrong answers in the other.

An append-only source table never updates rows after insert; a mutable source table can. That single property decides whether a created_at cursor is sufficient, whether you need an updated_at cursor with overlap, whether you need CDC, and whether you can avoid full snapshots. Pipelines break when the source quietly shifts from one shape to the other and the load logic does not.

The two shapes, and what makes one "append-only"

A source table is append-only when, after a row is inserted, it is never updated and never deleted. Every row is final. New events become new rows. Corrections become new rows that supersede earlier ones logically (in the warehouse layer, downstream). The OLTP enforces this either by convention — application code never issues UPDATE against the table — or by schema — no updated_at, no deleted_at, sometimes a CHECK constraint or a database trigger that rejects updates.

A source table is mutable when the application freely updates rows after insert. The same primary key can exist with status='pending' at 14:00 and status='success' at 14:03. Deletes (hard or soft) are allowed. The row's identity is the primary key; its content is whatever the latest write said.

The distinction is not a database type — Postgres, MySQL, MongoDB, DynamoDB can each be either, depending on how the application uses them. The distinction is a property of the access pattern, and most production tables drift over their lifetime: a table starts append-only because the V1 product only inserts, then a V2 feature adds an UPDATE, and from that day onward the table is mutable. Nothing about the schema changed; the access pattern did. The pipeline reading it has no way to know — unless the data engineer reads the application code or watches the database's WAL.

[Figure: Append-only vs mutable source tables. Two side-by-side panels. Left, append-only: three rows inserted (id=1 created=14:00, id=2 created=14:02, id=3 created=14:05), each with a monotonic id and created_at, never updated; cursor WHERE id > :last_id suffices. Right, mutable: three rows inserted, then row 2 updated in place at 14:08 (status ok → cb), the same id with different content over time; cursor WHERE updated_at > :ts works, with caveats.]
Two access patterns, same database. The left panel never updates; a monotonic-id cursor is sufficient. The right panel rewrites row 2 in place at 14:08; a `created_at` cursor would never see the change.

The first instinct of every data engineer who has been bitten by this is "I'll just check information_schema or the database's update logs." That works for some sources and not others — a Postgres UPDATE shows up in pg_stat_user_tables.n_tup_upd, a MongoDB update shows up in the oplog, but a MySQL update against an InnoDB table without binlog gives you nothing visible from the outside. The reliable path is the application contract: ask the team that owns the source whether their write path uses UPDATE, get a written answer, and re-ask every six months because the answer changes when features ship. The unreliable path is to assume — most of the chapter exists because that assumption is what produces the silent-correctness bugs that take three months to detect and a week to clean up.

Why the schema doesn't tell you: a column named updated_at is suggestive but not authoritative — many teams add the column "just in case" and never update it. Conversely, a table with no updated_at can still be mutated; the application just doesn't track when. The presence or absence of an UPDATE statement in the application's code path is the only reliable signal, and that signal is application-private. The schema is a hint, not a contract.

What each shape forces you to build

The cursor pattern from the previous two chapters — WHERE updated_at > :high_water_mark plus a tracked checkpoint — is the universal incremental load. But the cursor that works depends on the shape of the source. Pick the wrong cursor for the shape and the pipeline silently produces wrong answers; pick the right one and the load is straightforward.

Append-only source: monotonic-id cursor is sufficient

If the source guarantees no updates, a monotonic identifier (auto-incrementing id, snowflake ID, sequence) is a complete cursor. The pipeline reads WHERE id > :last_id ORDER BY id LIMIT N, processes the rows, advances last_id to the maximum seen. Three properties make this clean:

  1. No overlap window is needed: a row the cursor has already passed can never change, so nothing behind the checkpoint goes stale.
  2. The checkpoint is a single integer, trivially durable and restart-safe.
  3. No dedup is needed: each row enters the cursor's window exactly once, because rows are never rewritten.

The Razorpay payment_events table is the canonical Indian-fintech example. Every state change to a payment writes a new row — payment.created, payment.authorized, payment.captured, payment.failed, payment.refunded — keyed on (payment_id, sequence). The downstream warehouse aggregates the events into a payments_current table with the latest state per payment_id. The source table is append-only by contract; the warehouse computes the mutable view. The cursor is WHERE event_id > :last_event_id, simple and correct.
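The whole loop fits in a few lines. A minimal in-memory sketch, with fetch_batch and run_id_cursor as illustrative names and a list standing in for the source table:

```python
# Sketch of the id-cursor loop. fetch_batch stands in for
# SELECT * FROM src WHERE id > :last_id ORDER BY id LIMIT :n.
def fetch_batch(rows, last_id, limit):
    return sorted((r for r in rows if r["id"] > last_id),
                  key=lambda r: r["id"])[:limit]

def run_id_cursor(rows, last_id=0, limit=2):
    loaded = []
    while True:
        batch = fetch_batch(rows, last_id, limit)
        if not batch:
            break
        loaded.extend(batch)        # the warehouse write goes here
        last_id = batch[-1]["id"]   # advance the checkpoint after the write
    return loaded, last_id

events = [{"id": i, "status": "ok"} for i in (1, 2, 3, 4, 5)]
loaded, ckpt = run_id_cursor(events)
```

Because rows never change, re-running from an old checkpoint re-reads rows but never produces a wrong value; an idempotent warehouse write absorbs the duplicates.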

Mutable source with updated_at: cursor with overlap and dedup

If the source updates rows in place but maintains a reliable updated_at column on every write, you can use WHERE updated_at > :high_water_mark. Three caveats from the previous chapter ("Cursors, updated_at columns, and their lies") apply:

  1. updated_at is application-maintained, so every write path must set it; one ORM bypass or raw-SQL migration that skips the column makes those rows invisible to the cursor.
  2. Timestamps are not monotonic with commit order; a long-running transaction can commit a row whose updated_at is already behind the high-water mark. Read with an overlap window (the mark minus a few minutes) to catch late commits.
  3. The overlap re-reads rows, so the load must be an idempotent MERGE keyed on the primary key, not a blind INSERT.
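A toy sketch of the overlap-plus-MERGE load, with a dict standing in for the warehouse; load_incremental and the 5-minute OVERLAP are illustrative, not prescriptive:

```python
from datetime import datetime, timedelta

OVERLAP = timedelta(minutes=5)   # re-read a window behind the checkpoint

def load_incremental(source_rows, warehouse, hwm):
    """Read rows with updated_at above hwm minus the overlap, then MERGE by
    primary key. The overlap catches rows that committed late; the keyed
    upsert makes re-reading them idempotent."""
    cutoff = hwm - OVERLAP
    batch = [r for r in source_rows if r["updated_at"] > cutoff]
    for r in batch:
        warehouse[r["id"]] = r   # MERGE / upsert: last write per key wins
    return max((r["updated_at"] for r in batch), default=hwm)

t0 = datetime(2026, 1, 1, 14, 0)
rows = [
    {"id": 1, "updated_at": t0, "status": "success"},
    {"id": 2, "updated_at": t0 + timedelta(minutes=8), "status": "chargeback"},
]
wh = {}
hwm = load_incremental(rows, wh, t0)
hwm = load_incremental(rows, wh, hwm)   # second run re-reads the overlap; harmless
```

The second run re-reads row 2 (it falls inside the overlap window) and rewrites the same value, which is exactly why the write must be keyed rather than appended.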

Mutable source with deletes: cursor cannot detect deletions

A row that existed yesterday and is deleted today does not appear in any updated_at > :ts query — the row is gone. The pipeline that uses cursor-based incremental loading silently retains the old row in the warehouse forever. There are three escapes:

  1. Soft deletes at the source: the application sets deleted_at instead of issuing DELETE, the delete becomes an update, and the cursor sees it as a tombstone.
  2. CDC: the database's WAL records every DELETE, so a log-based stream delivers deletions the table itself no longer shows.
  3. Periodic full-snapshot reconciliation: diff the source's primary keys against the warehouse's and remove the keys that have vanished.
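The reconciliation escape is the simplest to sketch: keys present in the warehouse but gone from the source were hard-deleted upstream. find_hard_deletes is a hypothetical helper name:

```python
def find_hard_deletes(source_ids, warehouse_ids):
    """Keys the warehouse still holds but the source no longer has were
    hard-deleted upstream; emit them as tombstones downstream."""
    return sorted(set(warehouse_ids) - set(source_ids))

# warehouse loaded rows 1-5 earlier; rows 2 and 4 were deleted at the source since
gone = find_hard_deletes(source_ids=[1, 3, 5], warehouse_ids=[1, 2, 3, 4, 5])
```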

Mutable source without updated_at: full snapshot or CDC, no cursor

If the source has no reliable update timestamp, no cursor can incrementally load it. The choices are:

  1. Full snapshot: re-read the entire table every run and replace, or diff against, the warehouse copy. Correct by construction; the cost grows linearly with table size, so it stops being viable somewhere in the tens of millions of rows per run.
  2. CDC: subscribe to the WAL/binlog/oplog and let the database itself report every change, at the cost of operating the capture infrastructure (Build 11).

Many production pipelines combine three of these — cursor for the 99% append/update path, CDC subscription for deletes, and a weekly full-snapshot reconciliation as the catch-all for everything the first two missed. The combination is more code than any single approach but it is also the only combination that survives a 3-year audit.
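The snapshot-reconciliation piece can be sketched as an (id, row-hash) diff. row_hash and reconcile are illustrative names, and a real implementation would exclude volatile columns from the hash:

```python
import hashlib
import json

def row_hash(row):
    """Stable content hash of a row (keys sorted so dict order doesn't matter)."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()

def reconcile(source, warehouse):
    """Diff (id, hash) between source and warehouse: a differing or new hash
    becomes an upsert, a key missing from the source becomes a delete."""
    src = {r["id"]: row_hash(r) for r in source}
    wh = {r["id"]: row_hash(r) for r in warehouse}
    upserts = sorted(i for i, h in src.items() if wh.get(i) != h)
    deletes = sorted(i for i in wh if i not in src)
    return upserts, deletes

ups, dels = reconcile(
    source=[{"id": 1, "v": "a"}, {"id": 2, "v": "b-corrected"}],
    warehouse=[{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}],
)
```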

A worked example: detecting the shape and choosing the cursor

Below is a small Python utility that introspects a Postgres source table and recommends the cursor pattern. It checks three things: whether an updated_at column exists, whether a deleted_at column exists, and the table's insert/update/delete counts from pg_stat_user_tables. Whether updated_at is reliably maintained (whether it ever lags the row's actual modification) is not something statistics can prove; that still needs the application-contract conversation. The output is the cursor recommendation.

# detect_table_shape.py
# Introspects a Postgres source and recommends the cursor pattern.

import psycopg2
import textwrap

DSN = "host=localhost dbname=razorpay_test user=app password=app"

def has_updated_at(cur, schema, table):
    cur.execute("""
        SELECT 1 FROM information_schema.columns
        WHERE table_schema=%s AND table_name=%s AND column_name='updated_at'
    """, (schema, table))
    return cur.fetchone() is not None

def has_deleted_at(cur, schema, table):
    cur.execute("""
        SELECT 1 FROM information_schema.columns
        WHERE table_schema=%s AND table_name=%s AND column_name='deleted_at'
    """, (schema, table))
    return cur.fetchone() is not None

def update_rate(cur, schema, table):
    cur.execute("""
        SELECT n_tup_ins, n_tup_upd, n_tup_del
        FROM pg_stat_user_tables
        WHERE schemaname=%s AND relname=%s
    """, (schema, table))
    row = cur.fetchone() or (0, 0, 0)
    ins, upd, dele = row
    total = ins + upd + dele
    return (ins, upd, dele, (upd / total) if total else 0.0)

def recommend(schema, table):
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        upd_at = has_updated_at(cur, schema, table)
        del_at = has_deleted_at(cur, schema, table)
        ins, upd, dele, frac = update_rate(cur, schema, table)
        print(f"table: {schema}.{table}")
        print(f"  inserts={ins:,}  updates={upd:,}  deletes={dele:,}")
        print(f"  update fraction={frac:.2%}  updated_at column={upd_at}  deleted_at column={del_at}")
        # Order matters: check the soft-delete column before the delete count;
        # soft deletes are UPDATEs, so n_tup_del stays 0 for those tables.
        if frac < 0.001 and not del_at and not upd_at:
            print("  -> append-only — use monotonic-id cursor")
        elif upd_at and del_at:
            print("  -> mutable, soft deletes — updated_at cursor + tombstone logic")
        elif upd_at and dele == 0:
            print("  -> mutable, no deletes — use updated_at cursor with 5-min overlap + MERGE")
        elif upd_at and dele > 0:
            print("  -> mutable + hard deletes — cursor + weekly full-snapshot reconciliation")
        else:
            print("  -> no reliable cursor — use CDC (Build 11) or full snapshot")

if __name__ == "__main__":
    recommend("public", "payment_events")
    recommend("public", "users")
    recommend("public", "orders")
# Output:
table: public.payment_events
  inserts=148,203,419  updates=0  deletes=0
  update fraction=0.00%  updated_at column=False  deleted_at column=False
  -> append-only — use monotonic-id cursor
table: public.users
  inserts=8,972,114  updates=21,480,331  deletes=0
  update fraction=70.54%  updated_at column=True  deleted_at column=True
  -> mutable, soft deletes — updated_at cursor + tombstone logic
table: public.orders
  inserts=48,219,007  updates=9,240,118  deletes=412,889
  update fraction=15.97%  updated_at column=True  deleted_at column=False
  -> mutable + hard deletes — cursor + weekly full-snapshot reconciliation

Three signals do the work. pg_stat_user_tables gives you the live insert/update/delete counts per table — a free signal from Postgres that most teams ignore. update fraction = upd / total is the heuristic — under 0.1% is "essentially append-only", over 1% is "definitely mutable", in between is "ask the application team". The recommend() branches map the shape combinations to cursor patterns; the table's behaviour, not its name, decides which branch fires.

Why pg_stat_user_tables is the right signal: it's maintained by Postgres on every write, costs zero to query, and gives the actual update rate over the table's lifetime — not a guess from sampling, not a self-reported number from the application team. The same data is available in MySQL via INFORMATION_SCHEMA.TABLES.UPDATE_TIME plus binlog statistics, and in MongoDB via db.collection.stats().wiredTiger.cursor.update calls. Every database exposes this; almost no pipeline checks it. The first run of this script on a new pipeline routinely surfaces "we thought it was append-only but it's 30% updates."

A second-order check most pipelines skip: even when pg_stat_user_tables shows n_tup_upd = 0, ask the source team whether the application uses INSERT ... ON CONFLICT DO UPDATE (Postgres UPSERT). The upsert path increments n_tup_upd only on the conflict branch; if the application's pattern is "always insert, conflicts are rare", the update fraction reads as near-zero even though the table is technically mutable. The fix is to read the application's INSERT statements directly, not infer from statistics. The script above is a starting heuristic, not the final word — the final word is always the application contract.

Why the 0.1% / 1% thresholds in the recommend() function are not arbitrary: at under 0.1% updates, the cost of treating the table as mutable (overlap reads, dedup, MERGE) outweighs the cost of the rare missed update — most teams choose to ignore that long tail and accept it in the weekly reconciliation. Above 1%, the cost of running a cursor without dedup is prohibitive — the missed updates accumulate fast enough to corrupt downstream aggregates within days. The 0.1–1% band is the awkward middle where the right answer depends on the cost of being wrong: payments tables go to MERGE at 0.1%, marketing-event tables tolerate 1%. The thresholds are heuristics calibrated against Indian-fintech production data; tune them for your own cost-of-wrong.

When the source quietly changes shape

The most painful production failures in this category are not "the source was always mutable and we used the wrong cursor" — those are caught in design review. The painful failures are "the source was append-only for 3 years and quietly became mutable because someone shipped a feature, and the pipeline didn't notice." Two real shapes of this drift:

The risk-classification feature. A payments OLTP table is append-only by V1 contract — every payment is a final row. V2 ships a fraud-classification system that, after manual review, can re-classify a payment from success to chargeback up to 90 days later. The team that ships this is the risk team; the data team that owns the warehouse pipeline is not consulted. The pipeline keeps using WHERE id > :last_id. Every chargeback re-classification is a silent miss. Three months later, the finance team notices that the warehouse total of chargebacks is 18% lower than the OLTP total, files a ticket, and the data team spends a week reconstructing the right answer from the WAL retention.

The address-edit feature. A users table starts append-only — sign-up creates a row, no edits. V2 ships profile edit. The application updates users.address, users.phone, users.email in place. The pipeline that loads users uses WHERE id > :last_id and never sees the edits. The marketing team's email campaign uses the warehouse copy and sends to old email addresses. The first complaint is a customer whose new email never received the verification; the investigation reveals the pipeline is loading from a years-old snapshot of the row.

The defence is structural, not procedural. Make the cursor's correctness depend on a property of the source that the source team must consciously break. Two patterns:

  1. A monitor that compares row counts. The pipeline runs SELECT COUNT(*), MAX(id), MAX(updated_at) FROM source once an hour and stores the result. If n_tup_upd from pg_stat_user_tables ever crosses a threshold (say, 1% of n_tup_ins), the monitor pages the data team. The cost is one query per hour; the win is that the day the source becomes mutable is the day the data team finds out, not three months later.

  2. A schema-contract registry. The data team owns a contract document — a YAML file in version control — that lists for every source table its declared shape (append-only, mutable-with-updated-at, mutable-with-cdc). Every source-team pull request that touches the table runs a CI check against this contract. A PR that adds an UPDATE to an append-only table fails CI; the source team has to either downgrade the shape in the contract (with the data team's review) or restructure the change. The cost is one CI hook; the win is that the contract is enforced before the change ships, not after the warehouse drifts.

Most Indian fintechs run both. The runtime monitor catches the case where a source team forgets the contract or where the contract was wrong from day one. The CI hook catches the deliberate-but-uncommunicated change. Together they reduce silent drift from "discovered three months later in an audit" to "caught in the PR or in the next hour".
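The threshold check at the heart of the runtime monitor is tiny. drift_alarm is an illustrative name, and in production the two counts come from pg_stat_user_tables rather than arguments:

```python
def drift_alarm(n_tup_ins, n_tup_upd, threshold=0.01):
    """Fire once lifetime updates cross the threshold fraction of inserts.
    The counts are the same n_tup_ins / n_tup_upd the detection script
    reads; keeping them as arguments makes the check itself testable."""
    frac = n_tup_upd / n_tup_ins if n_tup_ins else 0.0
    return frac >= threshold   # True -> page the data team
```

Wire the return value to whatever paging system the team already runs; the point is that the check costs one catalog query per hour.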

[Figure: Source-shape drift over time. A timeline from year 0 to year 3: v1 ships (append-only) at year 0; a risk feature lands the first UPDATE at year 1; the update rate reaches 3%, then 18% by year 3, when finance escalates. A horizontal line marks the monitor threshold (1% updates) where the alarm should have fired; the shaded span labels 2.5 years of silent drift.]
The same table, three years apart. The monitor that detects "the source has become mutable" should fire at the first crossing of the 1% threshold — about 6 months after the feature ships. Without the monitor, the discovery happens during finance reconciliation 2.5 years later.

Common confusions

"Append-only" is a property of the access pattern, not of the database engine: Postgres, MySQL, MongoDB, and DynamoDB can each host either shape, and the shape can drift when a feature ships. An updated_at column does not prove the table is mutable (many teams add it "just in case" and never maintain it), and its absence does not prove the table is append-only (the application may update rows without tracking when). And soft deletes are updates, not deletes: a table whose application only ever sets deleted_at shows n_tup_del = 0 while rows still vanish from every downstream query that filters tombstones.

Going deeper

Append-only as a design discipline (event sourcing)

Event sourcing is the explicit choice to make every source table append-only. Instead of storing users.address, you store an append-only user_events log with rows like {user_id: 7, event: 'address_changed', new_address: 'Koramangala', ts: 14:00}. The current address is computed as the latest event of that type for that user. The benefits compound: every cursor is a monotonic-id cursor, every backfill reads from the same log, the audit trail is the data, and "what was Riya's address at 14:30 last Tuesday" is a query the system answers natively.

The cost is everywhere. Every read becomes a fold over the event log, which is expensive enough that production systems precompute "current-state" projections from the events into denormalised tables — and now you have two tables to keep in sync, with the projection as the cache. The discipline pays off for high-audit, regulated domains (payments, trading, healthcare); it's overkill for a CRUD admin panel.
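The fold from event log to current state, as a toy in-memory sketch; the event fields follow the address-change example above, and nothing here is a real API:

```python
def current_state(events):
    """Fold an append-only event log into the current-state projection:
    the latest value per (user_id, field), in timestamp order."""
    state = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        state.setdefault(e["user_id"], {})[e["field"]] = e["value"]
    return state

log = [
    {"user_id": 7, "ts": 1, "field": "address", "value": "HSR Layout"},
    {"user_id": 7, "ts": 2, "field": "address", "value": "Koramangala"},
]
```

Production systems run this fold once and cache the result as the projection table; the sketch shows why the projection is derivable at any time from the log alone.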

Razorpay's payment_events table is the canonical Indian example — every state transition is a new row, the warehouse computes the current view, and a 3-year-old payment is queryable as "what state was it in on 2023-08-12?" without a backfill. Zerodha's order_book_events is the same shape at higher scale (1.5 crore events/day during peak market hours). Both companies converged on event sourcing for the audit and replay properties; both also operate the projection/cache layer that the read path needs.

Slowly Changing Dimensions (SCD) — the warehouse's response

The data warehouse community formalised the mutable-source problem decades ago as Slowly Changing Dimensions (SCD). The original Kimball typology has six types; in modern practice three matter:

  1. Type 1 — overwrite: the warehouse row is updated in place and history is discarded. Cheap, and acceptable when only the current value matters.
  2. Type 2 — versioned rows: every change appends a new row with valid_from/valid_to (or an is_current flag); the full history stays queryable.
  3. Type 4 — history table: the current table stays a Type 1 overwrite target, and a separate history table accumulates every superseded version.

SCD Type 2 is the bridge between mutable sources and append-only warehouses: the source team can update rows freely, the data team's load logic computes the diff against the warehouse's current row, and writes a new versioned row when the diff is non-empty. The cursor is WHERE updated_at > :hwm, the load is "merge into Type-2 destination", and the warehouse's table is internally append-only even though the source's table is mutable. The whole pattern is the data warehouse community's answer to "the source is mutable; the warehouse should keep the history anyway".
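A minimal in-memory sketch of the Type-2 merge; scd2_merge and the OPEN_END sentinel are illustrative names, and production implementations run this as a warehouse MERGE statement rather than Python:

```python
from datetime import datetime

OPEN_END = datetime(9999, 12, 31)   # sentinel valid_to for the current version

def scd2_merge(history, incoming, now):
    """When the incoming row differs from the current version, close the
    current version (valid_to = now) and append a new open-ended one."""
    current = next((h for h in history
                    if h["id"] == incoming["id"] and h["valid_to"] == OPEN_END),
                   None)
    if current and current["attrs"] == incoming["attrs"]:
        return history                      # no change: write nothing
    if current:
        current["valid_to"] = now           # close the superseded version
    history.append({"id": incoming["id"], "attrs": incoming["attrs"],
                    "valid_from": now, "valid_to": OPEN_END})
    return history

h = []
scd2_merge(h, {"id": 1, "attrs": {"status": "success"}}, datetime(2026, 1, 1))
scd2_merge(h, {"id": 1, "attrs": {"status": "chargeback"}}, datetime(2026, 3, 1))
```

After the second call the table holds two rows for id 1: the closed success version and the open chargeback version, which is the internally append-only shape the paragraph describes.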

Why CDC is the universal answer when you can afford it

CDC (Change Data Capture) reads from the database's WAL/binlog/oplog instead of polling the table. Every insert, update, and delete becomes an event in a stream. The cursor is the WAL position (LSN in Postgres, GTID in MySQL, oplog timestamp in MongoDB). The pipeline subscribes to the stream and processes events in commit order.
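What the subscriber does with the stream can be sketched in a few lines. The three-tuple events are illustrative; real CDC payloads (Debezium's, for instance) carry far more metadata:

```python
def apply_cdc(events):
    """Apply a CDC stream (insert/update/delete events, already in commit
    order) to a keyed replica. Deletes arrive as explicit events, the one
    change a polling cursor can never observe."""
    replica = {}
    for op, key, row in events:
        if op == "delete":
            replica.pop(key, None)
        else:                    # "insert" and "update" are both upserts here
            replica[key] = row
    return replica

stream = [
    ("insert", 2, {"status": "success"}),
    ("update", 2, {"status": "chargeback"}),
    ("delete", 2, None),
]
```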

The wins:

  1. Deletes arrive as explicit events, the one change no table-polling cursor can see.
  2. Events come in commit order with an exact log position (LSN, GTID, oplog timestamp) as the cursor, so there is no overlap window and no clock skew to reason about.
  3. The source tables take no extra query load; the pipeline reads the log, not the rows.

The costs:

  1. Infrastructure: a connector (Debezium or similar), usually a message broker in front of the consumers, and the operational surface of both.
  2. Log retention and slot management: an unconsumed Postgres replication slot retains WAL and can fill the source's disk.
  3. Schema evolution: every ALTER TABLE at the source becomes an event-schema change every consumer must handle.

Build 11 covers the mechanism in detail. The rule for now: CDC is correct, expensive to operate, and the right answer when the source is genuinely mutable and the cost of getting deletes is significant. For append-only sources, polling is cheaper and just as correct.

The Flipkart catalog: a real shape-detection story

In 2022 Flipkart's catalog team migrated their product database from a legacy MySQL store to a sharded Vitess deployment. The data engineering team's load pipeline was a 3-year-old Airflow DAG using WHERE updated_at > :hwm against MySQL. The migration cutover happened over a weekend; the pipeline kept running against the new Vitess endpoint with no code change.

For three weeks, the warehouse copy of the catalog was 12% smaller than the source. Investigation revealed that Vitess's updated_at column was application-maintained, and the migration code had not consistently set it on the rewrite path — the column was "best effort" in the new world. The legacy MySQL pipeline trusted updated_at; the Vitess pipeline could not.

The fix landed in three layers. Short-term: a daily reconciliation job that computed the diff between source and warehouse (id, hash_of_row) and emitted upserts for the missing rows. Medium-term: a Debezium CDC stream on the Vitess shards, replacing the cursor entirely. Long-term: a CI hook on the catalog repo that fails any PR which writes to the catalog tables without going through a maintained updated_at setter. The combined fix took 4 weeks of engineering time and stopped the drift permanently.

The post-mortem's published lesson: the shape of a source is not preserved across migrations unless someone explicitly preserves it. The pipeline that worked against MySQL was correct for MySQL's contract; the same code was wrong against Vitess's. The classification — "is this source append-only, mutable-with-updated-at, or mutable-without?" — must be re-asked at every migration, every major version bump, and every architectural change to the source.

The bitemporal generalisation

The previous chapter introduced the bitemporal model — every row has valid_time (when it was true in the world) and transaction_time (when the system saw it). Append-only and mutable are the two poles of how a source represents the second dimension. An append-only source emits a new row for every change; the row itself is final, and transaction_time is the row's created_at. A mutable source overwrites the row; only the latest transaction_time is preserved.

A bitemporal source — rare but powerful — preserves both. Every row has (valid_from, valid_to, transaction_time), and corrections produce new rows with later transaction_time rather than overwrites. The cursor is straightforward (WHERE transaction_time > :hwm), the audit trail is native, and SCD Type 2 happens at the source rather than the warehouse. Iceberg and Delta tables can be configured to behave bitemporally; some Postgres extensions (temporal_tables, for example) approximate it. The cost is double the storage and meaningfully complex queries; the win is that the warehouse-team-vs-source-team contract is unambiguous.
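A toy sketch of the bitemporal lookup, with small integers standing in for timestamps; as_of is an illustrative name:

```python
def as_of(rows, key, valid_at, seen_by):
    """Bitemporal lookup: what did the system believe, as of transaction
    time seen_by, about the value at valid time valid_at? The latest
    qualifying transaction wins."""
    candidates = [r for r in rows
                  if r["key"] == key
                  and r["valid_from"] <= valid_at < r["valid_to"]
                  and r["tx_time"] <= seen_by]
    return max(candidates, key=lambda r: r["tx_time"], default=None)

# the success->chargeback correction from the chapter, as two rows:
payment = [
    {"key": "p1", "valid_from": 0, "valid_to": 100, "tx_time": 10, "status": "success"},
    {"key": "p1", "valid_from": 0, "valid_to": 100, "tx_time": 50, "status": "chargeback"},
]
```

Asked before the correction landed (seen_by=30) the answer is success; asked after (seen_by=60) it is chargeback, and both answers stay reproducible forever.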

For most production pipelines in 2026, bitemporal sources remain the exception. The right baseline is to classify your sources, instrument the drift monitors, and pick the cursor pattern that matches the shape — and to revisit the classification at every major source change.

Where this leads next

Build 11 will return to this conversation with CDC as the universal answer for mutable sources — Postgres logical decoding, MySQL binlog, MongoDB oplog, and Debezium as the integration layer. The current chapter sets up why CDC is sometimes worth the operational cost: when the source is genuinely mutable, polling either misses deletes or pays for full snapshots, and CDC is the structural answer. Build 5 will return with data contracts as the formal version of the cross-team agreement that this chapter has been describing informally — a YAML schema in a registry, validated in CI, that turns "the source is append-only" from a Slack message into an enforced invariant.

References

  1. Martin Kleppmann, DDIA, Chapter 11: Stream Processing — the canonical treatment of append-only event logs and the mutable-view-derived-from-immutable-events pattern.
  2. Pat Helland, Immutability Changes Everything — the essay that crystallised "immutable data + computed views" as a design principle.
  3. Ralph Kimball, The Data Warehouse Toolkit — the classic reference for SCD Types 1, 2, and 4.
  4. Debezium documentation: Change Data Capture for Postgres — the practitioner's reference for the CDC alternative to cursor-based loading.
  5. Razorpay engineering blog: payment_events at scale — the team's published account of an event-sourced payments table and the warehouse projections derived from it.
  6. Cursors, updated_at columns, and their lies — the previous chapter, which dissects the four failure modes of an updated_at cursor.
  7. Vitess documentation: schema management — context for the Flipkart migration story; Vitess's update-tracking model differs from vanilla MySQL.
  8. PostgreSQL: pg_stat_user_tables — the system catalog that the shape-detection script in §"A worked example" relies on.

The honest summary: the first question to ask of any source table is "are you append-only or mutable?" because the answer decides every downstream choice — the cursor, the dedup, the backfill, the CDC investment, the SCD strategy in the warehouse. The second question is "how do you know the answer is still correct?" because sources drift over their lifetime and a pipeline that does not detect the drift is a pipeline that produces silent wrong answers for years. The mature data team owns the contract on every source it reads from, instruments the drift monitors that fire when the contract slips, and treats the migration of any source as an event that re-asks the classification rather than assuming it carried over. Skip the classification and you'll discover all four cursor patterns the hard way — one production failure at a time.

A practical exercise: list every source table your team's pipeline reads from. For each, write down the declared shape (append-only / mutable-with-updated-at / mutable-without / CDC). Run the pg_stat_user_tables check on each. The first run almost always finds at least one source whose declared shape does not match its actual behaviour — and that source is your highest-priority cleanup item this quarter.

The follow-up exercise, once the classification is in version control: write a one-page contract per source table, co-signed by the data team and the source-owning team, that states the declared shape, the consequences of breaking it, and the review process for changing it. The contract sits alongside the API documentation that the source-owning team already maintains for their service consumers — because the data pipeline is a consumer of that service's data, on the same footing as any HTTP client. Teams that elevate "we read from this table" to "we have a contract with the team that writes to this table" stop discovering source-shape changes through finance reconciliations; they discover them in the PR review where the change is proposed.