Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

What "idempotent" actually means for data (and why it's hard)

At 02:14 a refund-settlement job at PaisaBridge times out on its third Postgres write and the Airflow scheduler fires the on_failure_callback retry. The worker pod that takes the retry is a different machine in a different rack; the INSERT it issues for txn_8714002 is the same INSERT the dead pod issued thirty seconds ago — except the dead pod's INSERT might have committed before the network broke, or might not have. The on-call engineer types SELECT count(*) FROM settlements WHERE txn_id = 'txn_8714002' and sees 2. That extra row, multiplied across seventeen retried jobs over the following six hours, is a ₹46 lakh accounting drift that costs a week of reconciliation. Nobody wrote a bug. The pipeline did exactly what its author told it to. It just wasn't idempotent.

This chapter unpacks what idempotency means for data — not for the textbook function f(f(x)) = f(x), but for a real INSERT against a real database when the wire might cut at any byte. You will see why the math is the easy part, why operational idempotency requires a key and not just a property, and why "idempotent retries" is the foundation that Builds 2 through 9 each rebuild in their own primitives.

Idempotency for data means a write executed twice produces the same database state as executing it once — under any retry pattern, partial failure, or clock skew. Achieving it requires a stable idempotency key derived from the input (not the wall clock or a UUID), a destination operation that is conditional on that key (UPSERT, MERGE, INSERT-IGNORE, conditional PUT), and a contract that the consumer of the data tolerates re-emission. Get any of the three wrong and you ship duplicates.

The math is one line; the contract is three pieces

Mathematically, a function f is idempotent if f(f(x)) = f(x). Setting a variable to 5 is idempotent: doing it twice leaves the variable at 5. Appending to a list is not idempotent: doing it twice leaves the list with two new elements. Multiplying by 1 is idempotent; multiplying by 2 is not. This is the version taught in CS101 and it is correct, but it leaves the hard part unsaid: in a real pipeline, what is x, what is f, and what does "twice" mean when the network might have already committed the first call?

The data-engineering version of the definition has three operational pieces, and a write is idempotent only when all three hold simultaneously.

The three pieces. A write is idempotent only when the key is stable, the destination operation respects the key, and the consumer's read pattern doesn't double-count if the producer accidentally emits twice.

Skip any one and you ship a system that is idempotent on a whiteboard and broken in production. The PaisaBridge incident at the top of this chapter failed on piece 2 — the INSERT had no ON CONFLICT clause, so the retry created a duplicate row even though the key (txn_id) was perfectly stable.

Why all three are non-negotiable: the producer cannot know whether the previous attempt's bytes reached the database. The TCP ACK may have been lost on the return path, the database may have committed and then crashed before logging, the network partition may have healed milliseconds after the timeout fired. The only honest assumption is "I don't know if my last attempt landed" — and the only way to retry safely under that assumption is a key-conditional write that the database can deduplicate against its current state.

What "twice" actually looks like on the wire

Junior engineers picture retries as "the function ran, threw, and ran again". Production retries are much messier — they happen at five different layers, each with its own definition of "twice", and a pipeline must be idempotent against all five.

The five retry layers. The cost of duplicates rises with each layer because the consumer has had more time to read the duplicated rows. The discipline: design for L5, and L1–L4 fall out for free.

A pipeline that handles L1 retries by naively wrapping a requests.post() in a for attempt in range(3) is not idempotent — it just appears so because the retry happens fast enough that no consumer has read the in-between state. When the same bug surfaces at L4 (a tired engineer clicks "Clear and Re-run" at 03:00 on a job that ran successfully at 02:00), the duplicates are guaranteed. Designing for L5 is what makes the system actually idempotent; the lower layers come along for free.

Why the cost of duplicates rises with retry depth: at L1 (millisecond retry on a TCP timeout), no consumer has had time to read the duplicate. At L4 (manual re-run hours later), a downstream dashboard, a recommendation model, a finance team's daily report, and a regulator-facing audit trail have all already consumed the original. Cleaning up after an L4 duplicate means notifying every consumer, re-running every downstream job, and explaining to a finance lead why yesterday's report changed. The blast radius scales with how long the duplicates have been visible.

Building one: the dedup-key + UPSERT pattern

The smallest complete idempotent write in production is six lines, and they hide every subtle requirement of the previous section. The example writes refund settlements to Postgres. Same pattern works for Snowflake MERGE, BigQuery MERGE, S3 conditional PutObject, Iceberg row-level deletes — every destination Build 2 through Build 12 will visit.

# idempotent_write.py — the smallest correct idempotent INSERT.
import hashlib
import json
import psycopg2
from typing import Iterable

CONN = "host=db.razorpay.internal dbname=settlements user=etl"

def dedup_key(row: dict, run_date: str, source: str) -> str:
    payload = json.dumps({
        "txn_id": row["txn_id"],
        "amount_paise": row["amount_paise"],
        "merchant": row["merchant"],
        "source": source,
        "run_date": run_date,
    }, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()

def write_settlements(rows: Iterable[dict], run_date: str, source: str) -> int:
    sql = """
        INSERT INTO settlements (dedup_key, txn_id, amount_paise, merchant,
                                 settled_at, source, run_date)
        VALUES (%(dedup_key)s, %(txn_id)s, %(amount_paise)s, %(merchant)s,
                %(settled_at)s, %(source)s, %(run_date)s)
        ON CONFLICT (dedup_key) DO NOTHING
    """
    written = 0
    with psycopg2.connect(CONN) as conn, conn.cursor() as cur:
        for row in rows:
            row_with_key = {**row, "dedup_key": dedup_key(row, run_date, source),
                            "source": source, "run_date": run_date}
            cur.execute(sql, row_with_key)
            written += cur.rowcount  # 1 if inserted, 0 if conflict
        conn.commit()
    return written

if __name__ == "__main__":
    sample = [
        {"txn_id": "txn_8714002", "amount_paise": 1_24_99_900,
         "merchant": "flipkart", "settled_at": "2026-04-24T06:00:00+05:30"},
        {"txn_id": "txn_8714003", "amount_paise": 12_47_50_000,
         "merchant": "swiggy", "settled_at": "2026-04-24T06:00:00+05:30"},
    ]
    n = write_settlements(sample, run_date="2026-04-24", source="razorpay-prod")
    print(f"inserted={n}  duplicates_ignored={len(sample) - n}")

Sample run on a fresh table, then re-run on the same input:

$ python idempotent_write.py
inserted=2  duplicates_ignored=0
$ python idempotent_write.py
inserted=0  duplicates_ignored=2
$ python idempotent_write.py
inserted=0  duplicates_ignored=2

A few load-bearing lines.

hashlib.sha256(payload.encode()).hexdigest() computes a stable 64-character key from the input fields. Same input row, same run_date, same source → same key, every time, on every machine, in any process. SHA-256 is overkill for collision resistance against millions of keys; the choice is for "no surprises" rather than performance. The bottleneck of this pipeline is Postgres, not the hash function.

json.dumps(..., sort_keys=True, separators=(",", ":")) is the part most engineers get wrong on the first try. Without sort_keys=True, two Python dicts with the same fields in different insertion order produce different JSON and different hashes. Without separators=(",", ":"), the output has variable whitespace depending on the Python version. Both bugs surface only after you upgrade Python or refactor the row construction — silent until the day everything starts double-inserting.

ON CONFLICT (dedup_key) DO NOTHING is the destination's contribution. Postgres checks the unique index on dedup_key and silently ignores the second INSERT if a row with that key already exists. The transactional guarantee is that either the row inserts or it doesn't, never half — and the unique index makes the check atomic with the write. A MERGE in Snowflake or BigQuery achieves the same effect with different syntax; the underlying primitive is "conditional on key".

written += cur.rowcount is the observability hook. In Postgres, cur.rowcount is 1 when an INSERT actually wrote a row and 0 when ON CONFLICT ignored it. The pipeline can log "wrote 14,218 rows, ignored 7 duplicates from a partial previous run", which is exactly the diagnostic on-call needs at 02:14.

Why the dedup_key includes source and run_date: the same txn_id can legitimately appear from two sources (the bank reconciliation file and PaisaBridge's internal ledger) and on two run dates (a backfill of yesterday's data on top of today's). Including both in the key ensures that "same logical event" groups together while "same id from different upstream paths or different runs" stay distinct. This is the most important key-design decision in any pipeline, and the most common bug is to make the key too narrow.

Why "use a UUID" doesn't work

Engineers reaching for idempotency for the first time often write dedup_key = uuid.uuid4() and call it done. This is wrong, and the failure mode is silent for a long time before it bites.

A UUIDv4 is random. Every retry generates a fresh UUID. The whole point of the dedup key is that the same input generates the same key, so that the second write recognises the first. A random UUID gives every retry a fresh, unique key, which means ON CONFLICT never fires, which means duplicates are inserted exactly as if no idempotency mechanism were present.

The same trap applies to:

uuid.uuid1() — encodes the current MAC address and timestamp, so retries on a different machine or at a different time generate different UUIDs.
time.time_ns() — a different nanosecond on retry means a different key.
Database auto-increment IDs — generated at INSERT time, not at row-construction time, so retries get different IDs.
Random seeds — random.seed() plus random.randint() is reproducible only if the seed is itself derived from the input. If you're going to do that, just hash the input directly.

The discipline is: the dedup key must be a pure function of the input fields. No clocks, no random, no environment, no machine state. The same row constructed on a Mumbai dev laptop and a Bengaluru production worker, on different days of different months, must produce the exact same hash. If your key generator depends on anything else, you have a different bug than you think.

Why client-supplied UUIDs sometimes appear in idempotency-key APIs (Slipway, AWS) without contradicting this rule: the client generates the UUID once, locally, before the first attempt, and then re-uses the same UUID across all retries of that one logical request. The UUID is random, but it is constant per-request from the client's perspective. The pattern is functionally equivalent to "hash the input" — both produce a stable per-request key. The hash version is preferred when the input is a row in a batch pipeline; the client-UUID version is preferred when the input is an external API call where the request body alone might not uniquely identify "this user's intent right now".

Operations that are naturally idempotent

Some destination operations are idempotent without a dedup key — knowing which is which lets you simplify pipelines that don't need the full pattern.

Setting a row to a value. UPDATE balances SET amount_paise = 12750000 WHERE user_id = 'rahul-89' is idempotent. Running it twice leaves the same value. Unless the value depends on the current value (amount_paise = amount_paise + 100), which is the trap — that one is not idempotent and needs an explicit guard.

Atomic file rename to a deterministic path. os.replace(tmp, "/data/out/refunds_2026-04-24.csv") is idempotent because the destination filename is determined by the input (run_date), not by the time the rename happens. Re-running publishes to the same path; the consumer sees the same file. This is the chapter-3 staging-rename pattern, and it is the simplest idempotent Load you can build.

S3 PutObject to a deterministic key. Same logic as os.replace. S3 will overwrite an existing object atomically. As long as the key is s3://bucket/refunds/dt=2026-04-24/file.parquet and not s3://bucket/refunds/<random-uuid>.parquet, the second PUT replaces the first cleanly.

Conditional writes with If-Match / If-None-Match. S3 conditional PUT (introduced in late 2024) lets you say "only write if the object's ETag is X" or "only write if no object exists at this key". This converts a non-idempotent operation (overwrite without checking) into an idempotent one with a precondition.

Set membership. SADD users:active rahul-89 in Redis is idempotent — the set either contains rahul-89 or doesn't, after either one or many calls. SADD returns 1 the first time and 0 thereafter, which doubles as a "this is new" signal.

Operations that are not naturally idempotent and need the dedup-key pattern: plain INSERT, list LPUSH/APPEND, counter INCR, queue SEND without a sequence number, idempotent-by-payload-but-non-idempotent-by-side-effect operations like sending an email.

The dividing line between the two categories is "does this operation observe the current state before deciding what to write?". UPDATE ... SET amount = 5 observes that the row exists and writes 5 regardless of what was there. INSERT does not observe; it just appends. INCR reads the current value and writes value + 1, which is observation, but the observation is consumed — the next call sees a new state. The natural-idempotence operations are the ones whose post-state depends only on the input, not on the pre-state.

Common confusions

"uuid4() is the dedup key — it's universally unique." UUIDv4 is unique across keys; the point of a dedup key is to be the same across retries of the same logical write. Use a hash of the input fields, not a fresh random ID. This is the single most common idempotency bug a junior engineer ships.
"Postgres INSERT ... ON CONFLICT DO NOTHING is the same as INSERT IGNORE in MySQL." They look similar but the failure modes differ. ON CONFLICT in Postgres requires you to specify the conflict target (a column or unique index); INSERT IGNORE in MySQL silently swallows all errors including type mismatches and constraint violations you'd want to see. Use ON CONFLICT DO NOTHING with an explicit target column; never use bare INSERT IGNORE without auditing what it's ignoring.
"At-least-once delivery plus idempotent processing equals exactly-once." It equals exactly-once processing of accepted messages, which is what most consumers want. It does not equal exactly-once delivery on the wire — the message is still re-sent. The distinction matters because the cost of at-least-once-with-idempotence (extra network traffic, dedup state) is paid every retry, even though the user-visible outcome looks exactly-once. Build 9 has the long version.
"If my pipeline is idempotent at the destination, the source can be anything." False. A source that emits the same event with different content on retry — say, a clock-stamped JSON with processing_time baked into each row — produces different dedup hashes and bypasses the idempotency check. The source contract must be that retries produce byte-identical (or at least semantically-identical-after-normalisation) rows, or the idempotency at the destination is meaningless.
"MERGE handles idempotency for me, I don't need a dedup key." MERGE requires a match condition (ON target.txn_id = source.txn_id); without the right key in the source data, the match will be wrong. The dedup-key discipline is what makes the MERGE match condition robust to retries that happened on different machines or at different times.
"I don't need idempotency because I run with retries=0 in Airflow." L3 is one of five retry layers. L1 (HTTP client) and L4 (manual re-run) happen regardless of your scheduler config. The first time a tired engineer at 03:00 clicks "Clear and Re-run" on a job that previously succeeded, you discover this the hard way.

Going deeper

Idempotency keys vs natural keys vs surrogate keys

Three kinds of "key" travel through a data pipeline and they are easy to confuse. The natural key is the business-meaning identifier of the row — txn_id for a transaction, order_id for an order, user_id + event_time for a user event. The surrogate key is what the database generates on insert — a Postgres BIGSERIAL, a Snowflake IDENTITY column. The idempotency key is the hash used to detect retries — derived from the natural key plus context (source, run_date).

These three should not be the same column. Conflating "natural key" and "idempotency key" causes a bug when the same txn_id legitimately appears from two sources and gets deduplicated incorrectly. Conflating "idempotency key" and "surrogate key" defeats the purpose because surrogate keys are generated by the database, after the dedup decision has already been made. The production schema typically has all three:

CREATE TABLE settlements (
    settlement_id    BIGSERIAL PRIMARY KEY,        -- surrogate
    dedup_key        CHAR(64) UNIQUE NOT NULL,     -- idempotency
    txn_id           VARCHAR(64) NOT NULL,         -- natural
    amount_paise     BIGINT NOT NULL,
    merchant         VARCHAR(64) NOT NULL,
    settled_at       TIMESTAMPTZ NOT NULL,
    source           VARCHAR(32) NOT NULL,
    run_date         DATE NOT NULL,
    inserted_at      TIMESTAMPTZ DEFAULT now(),
    INDEX (txn_id)
);

The unique index on dedup_key is what makes ON CONFLICT (dedup_key) cheap; the natural-key index on txn_id is what downstream queries actually use; the surrogate settlement_id is for joins and ordering. Each column earns its keep.

What goes into the key — and what doesn't

A dedup key is a hash, but the input to the hash is the design decision that takes thought. The canonical recipe at production scale is: business identifier(s) + source identifier + temporal scope. Each piece is in the key for a reason.

The business identifier groups retries of the same logical event. For payments, that's (merchant, txn_id). For event tracking, it's (user_id, event_id). For CDC, it's the table's primary key.

The source identifier prevents collisions when two upstream paths legitimately produce the same business id. PaisaBridge's internal ledger and the bank reconciliation file may both contain txn_8714002 representing different sides of the same transaction; treating them as duplicates would erase one. Tagging the key with source = "razorpay-internal" vs source = "bank-recon" keeps them distinct.

The temporal scope is the part most easily mis-set. run_date (the logical date of the data, not the wall-clock processing time) is usually right. processing_time (when the row happened to be processed) is usually wrong because it changes between retries — the very thing the key is supposed to be invariant against. Build 3's chapter on event-time vs processing-time generalises this distinction.

What does not go in the key: hostname, PID, environment ("prod" vs "staging" — that should be a different table), Airflow run id, Kubernetes pod id, IP address. None of these are stable across retries. If you find yourself adding them "just to make sure", you have probably misunderstood what the key is for.

Scaling the dedup index: when a hash table doesn't fit in memory

The pattern above scales to a few hundred million rows in Postgres without any tuning. Beyond that, the unique index on dedup_key becomes a write bottleneck — each INSERT does a B-tree lookup, and with 10 billion rows the index doesn't fit in shared buffers.

Three patterns scale further:

Bloom-filter prefilter. Maintain a Bloom filter of recent dedup keys (say, last 7 days) in front of the database. A negative result is definitive (not a duplicate, do the INSERT); a positive result triggers the full check. DigiPaisa's UPI pipeline uses this to handle ~100M tx/day with a Postgres index on only the trailing 24 hours of keys.

Time-partitioned dedup tables. Partition the destination by run_date. The dedup-key uniqueness constraint applies within a partition, not globally. A retry of yesterday's data only competes with yesterday's keys, not all keys ever. The trade-off is that a stale retry from three months ago could re-insert data, but in practice retries that old don't happen.

Separate dedup store with TTL. Redis or DynamoDB with a short TTL (24–48 hours) acts as the gatekeeper. The producer checks the dedup store before writing to the warehouse; the warehouse doesn't carry the dedup index at all. KhelKing uses this pattern for match-time event ingestion at peak rates of 1.2M events/sec — Postgres can't keep up, but Redis with a 1-hour TTL and 8 nodes can.

The right pattern depends on the retry window the upstream reasonably retries within. For a source where retries finish within minutes, a 24-hour TTL store is sufficient. For sources that backfill weeks of history, you need partition-scoped uniqueness. Pick the smallest mechanism that covers your real retry pattern, not the largest one that covers theoretical retries.

A fourth pattern, increasingly common in lakehouse-shaped pipelines, is content-addressable storage as the dedup primitive itself. If the data file's name is the SHA-256 of its content (s3://bucket/data/<sha>.parquet), a re-write of identical content produces the same path and naturally overwrites; the dedup table is replaced by S3's own object semantics. The cost is that file naming is locked to content; the benefit is that the dedup state grows zero — there is no separate index to maintain. Build 6 returns to this when discussing Iceberg's manifest-based file referencing, which uses content addressing under the hood.

A consideration that surfaces only at scale: the dedup index size and the dedup retention window are independent levers. A 7-day Postgres index over 100M tx/day is 700M rows and a few hundred GB; the same window in Redis is bounded by RAM. The cheapest design is often a hybrid — a hot Redis layer for the last 1–2 days of keys (where 99% of retries land), backed by a cooler Postgres or S3 layer for the remaining 5 days. DigiPaisa's UPI ingestion has run this shape since 2024, with a daily reconciliation job that compares the two layers and surfaces any divergence.

When idempotency conflicts with truth

A subtle case: the source legitimately re-emits the same event id with different content — a transaction whose status changed from pending to settled, but the upstream identifies both states with the same txn_id. A naive dedup-by-txn_id would drop the settled row because the pending row arrived first.

The fix depends on which row should win. If "latest wins" is the contract, the dedup key includes a version field and the destination operation is INSERT ... ON CONFLICT (txn_id) DO UPDATE — overwrite on conflict, with the source providing a version_id or updated_at so the UPDATE is conditional. If "first wins" is the contract (audit-style append-only), the dedup key includes the content hash and a new content-version simply adds a new row. The choice is a contract decision, not a code decision; the bug is making it implicitly.

ParakhTrade's order-book ingestion is a good example: each order_id can flip through up to 12 states (placed → modified → partial-fill → filled), and each state is a distinct row. The dedup key is (order_id, state, transition_time), and the destination is append-only — never UPDATE. Reconstructing the latest state at query time is a SELECT DISTINCT ON (order_id) ... ORDER BY transition_time DESC. This is more storage but never loses an audit trail; the regulatory contract makes it the right choice.

Why distributed idempotency is harder than single-machine

Everything above assumes the destination is one logical entity that can enforce a unique constraint. When the destination is a distributed system without a single arbiter — a sharded database, an eventually-consistent KV store, a Kafka topic with multiple producers — idempotency requires more machinery.

Kafka's idempotent producer (KIP-98) attaches a (producer_id, sequence_number) tuple to every message. The broker keeps a small per-producer state ("last seen sequence number") and rejects re-delivery within a window. The math is the same as the dedup-key pattern, but the state (the sequence number tracker) lives on the broker rather than in a database index. Build 9 walks the full protocol; the takeaway here is that the property — "same producer-input-sequence produces the same broker state, even on retry" — is identical to what ON CONFLICT does in Postgres. The implementation differs because the state machine is distributed.

The harder case is two producers, two destinations, one logical write — the canonical "transfer ₹1000 from Rahul's account at SetuBank to Riya's account at SBI" example. No single arbiter can enforce the dedup, and the protocol must be a two-phase commit with both sides agreeing or both rolling back. Build 9 covers this; for now, the takeaway is that single-destination idempotency (this chapter's domain) is a building block, and multi-destination atomicity is a strictly harder problem that uses idempotency as one of its primitives.

Where this leads next

Chapter 7 builds on this foundation by introducing checkpoint files — the on-disk state that lets a long-running pipeline restart without re-doing work that already committed. The dedup-key pattern handles "did this row already write?"; checkpointing handles "what was the last batch I successfully completed?" — orthogonal problems that compose.

State files and checkpoints: the poor man's job queue — chapter 7
Hash-based deduplication on load — chapter 8 generalises the SHA-256 pattern to streaming sources
Crashing mid-run: which state are you in? — chapter 4, the disk-state taxonomy that motivates this chapter's "key + condition" pattern
Idempotent producers and sequence numbers — chapter 66, the Kafka-distributed version of this chapter's pattern

Build 9 (chapters 65–73) returns to idempotency in the streaming world, where "the destination is a Kafka topic" replaces "the destination is a Postgres table" and the dedup state moves from database index to producer-broker protocol. The shape stays identical: stable key, conditional accept, consumer tolerant of re-emission.

Before moving on, it is worth re-reading the chapter-3 fifty-line pipeline through this chapter's lens. Every architectural choice in those fifty lines was setting up the idempotency story this chapter formalises: the deterministic run_id is the temporal scope in the dedup key; the staging-rename is the conditional write at the destination; the pure-function Transform is what makes the source contract "byte-identical retries". The doctrine generalises one layer at a time across the next 130 chapters, and the place to start internalising it is on the smallest pipeline you have already written.

References

PostgreSQL: INSERT ... ON CONFLICT — the canonical reference for the conditional-write primitive used throughout this chapter.
Idempotence in distributed systems — Pat Helland, "Life beyond Distributed Transactions" — the foundational essay arguing that idempotent operations are the substrate of every reliable distributed system.
Slipway engineering: designing robust APIs with idempotency keys — the production playbook for client-side idempotency keys in payment APIs, applied at scale.
Kafka KIP-98: Idempotent Producer and Transactional Messaging — the distributed-producer version of the same pattern, with (producer_id, sequence) replacing the SHA-256 hash.
Designing Data-Intensive Applications, Chapter 8 — The Trouble with Distributed Systems — Martin Kleppmann on partial failure, the underlying reason idempotency is non-negotiable.
PaisaBridge engineering: building a settlement pipeline at 100M tx/day — production patterns for dedup keys, retry windows, and the Postgres-vs-Redis dedup-store trade-off.
The append-only log: simplest store — cross-domain reference. An append-only log makes "first write wins" trivial because every write has a unique offset, eliminating the need for an explicit dedup key.
Snowflake MERGE statement — the warehouse-scale equivalent of Postgres ON CONFLICT, used for batch-MERGE patterns in Build 12's lakehouse chapters.
AWS S3 conditional writes (If-None-Match) — late-2024 addition that lets you express "PUT only if no object exists" without a separate lock; the cloud-storage equivalent of ON CONFLICT DO NOTHING.

A practical exercise to lock the concept in: take the idempotent_write.py script above and break each of the three pieces in turn — replace the SHA-256 with uuid4() and watch duplicates accumulate; remove the ON CONFLICT clause and watch the same; change the consumer query from SUM(amount) GROUP BY dedup_key to SELECT COUNT(*) and watch the count diverge from reality. Each broken version is a different production incident waiting to happen, and feeling the failure mode in your own laptop is what makes the pattern stick.

A second exercise, harder and more revealing: induce the partial-commit case. Run the pipeline against a small input, sleep for two seconds in the middle of the loop, and during the sleep kill -9 the worker. Re-run. Inspect the settlements table — every row whose INSERT had committed before the kill is present, every row that hadn't is absent, and the re-run silently fills in only the missing ones because the existing keys hit ON CONFLICT. This is the property that makes the pattern survive the messiest production failure mode: a partial commit is indistinguishable from a complete commit that's been retried, and both heal correctly without any operator intervention.

A third exercise, useful for senior engineers calibrating their own systems: walk a pipeline you already own and find every write to a destination. For each write, ask the three questions — is the key stable, is the destination operation conditional on that key, and does the consumer tolerate re-emission? Score each write on a 0–3 scale. The pipeline's idempotency is the minimum across all its writes, not the average. The first time a senior data engineer at a Bengaluru fintech ran this audit on a 40-job DAG, they found 11 jobs at score 3, 22 at score 2, and 7 at score 0 — and the 7 at score 0 were the entirety of the team's incident backlog over the previous quarter. The exercise pays for itself.