Hash-based deduplication on load

At 23:14 a Razorpay settlement-load job retries a failed batch of 4.2 lakh transactions because the network blipped halfway through the previous attempt. Some of those transactions are already in the settlements table — the previous attempt got them in before the connection died. Others are not. The job has no idea which are which, and asking the destination "did you receive transaction TX8af9c2d4?" 4.2 lakh times would take longer than the original load. The fix the team's lead engineer Dipti reached for ten years ago is the same one every mature data team converges on: compute a content hash for each row, write it as a unique key, let the destination decide. A retry that ships row TX8af9c2d4 a second time hits a key collision and is silently dropped. No duplicate, no questions asked, no extra round trip. That hash — what it is, what it covers, and where it lives — is the subject of this chapter.

A dedup hash is a deterministic fingerprint of a row's identity-bearing content, computed once at the source and checked at the destination. It is the quietest, cheapest, and most-misconfigured idempotency primitive in data engineering. It composes directly with the checkpoint pattern from the previous chapter: the checkpoint cuts re-work in the common case, the dedup hash catches the rare boundary case where the checkpoint and the data write fell out of sync.

A dedup hash turns "did I already write this row?" from a question that requires a remote lookup into a question the destination's unique-index machinery answers for free. Compute the hash from the fields that define row identity, encode them in a canonical form (sorted keys, fixed timezone, normalised types), use a non-cryptographic-but-collision-resistant function (SHA-256 is the safe default), and store the hash as a unique constraint or partition key. The hash is the contract; everything else — ON CONFLICT DO NOTHING, MERGE-into-no-op, duplicate-marker tables — is mechanism.

The two failure modes are: hashing too few fields (and silently overwriting legitimate updates) and hashing fields that vary across runs (and producing fresh "duplicates" of every row, every time).

What a dedup hash actually is

A row arrives at the loader. It has some columns: txn_id, amount, currency, merchant_id, created_at, status, gateway_ref, plus a few fields the source system added to its export — a _extracted_at timestamp, a _source_run_id, an envelope checksum the SFTP transport added. The loader has to answer one question: is this the same row I have already written, or is it a new row that happens to share some of those values?

"The same row" is not a property the database can decide for you. You have to define it. The fields that constitute identity for this load — typically the immutable business identifiers — go into the hash. Everything else is excluded. The hash function turns that ordered tuple of fields into a fixed-size string that the destination can index, compare, and reject duplicates on.

[Figure: Anatomy of a dedup hash. Left: a source row with many fields (txn_id: TX8af9c2d4, amount: 14250, currency: INR, created_at, plus _extracted_at, _source_run_id, _envelope_chk — solid outlines mark identity fields, dashed outlines mark run noise). Middle: a canonicaliser that selects identity fields, sorts keys, fixes the timezone, and normalises types, feeding SHA-256. Right: the destination row with a 64-character hex dedup_key under a UNIQUE constraint. A retry of the same row computes the same hash, collides, and is dropped.]
The dedup hash funnel: many fields enter, an opinionated subset is normalised and hashed, and a single fixed-width key carries the row's identity into the destination. The dashed source fields (extraction timestamps, run ids) are excluded — they vary across runs of the same logical row.

The fingerprint has three properties that matter. It is deterministic — the same input bytes always produce the same output. It is collision-resistant — two different inputs almost certainly produce different outputs (no SHA-256 collision has ever been found, and its output space of 2^256 ≈ 10^77 values is within a few orders of magnitude of the number of atoms in the observable universe). And it is fixed-width — the destination index size is independent of how many fields you hashed, which matters when you have 70-column rows and a B-tree on the dedup key.

A dedup hash is not a primary key in the source-system sense. The source's txn_id is already a primary key over there. The dedup hash is the destination's unique constraint, computed by the loader, and it lets the loader treat retries as no-ops without any coordination back to the source.

Why a hash and not just the source primary key as the dedup key: in many real loads there is no single source primary key. Razorpay aggregates settlements from 40+ payment gateways and 14 banks; each has its own id space; collisions are guaranteed if you use raw txn_id as the dedup key across sources. The hash composes the source-system identifier with the source itself: sha256("razorpay-rzp" + "|" + "TX8af9c2d4" + "|" + ...). The output is unique across the merged universe in a way the raw ids are not.
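A minimal sketch of that composition — the gateway labels here are illustrative, and the separator choice is the one load-bearing detail:

```python
import hashlib

def composed_key(source_system: str, txn_id: str) -> str:
    # The "|" separator prevents ambiguity: ("ab", "c") and ("a", "bc")
    # must not concatenate to the same input bytes.
    return hashlib.sha256(f"{source_system}|{txn_id}".encode("utf-8")).hexdigest()

# Two gateways that both happen to emit txn_id "TX8af9c2d4" get distinct keys.
k1 = composed_key("razorpay-rzp", "TX8af9c2d4")
k2 = composed_key("payu-prod", "TX8af9c2d4")
assert k1 != k2
# A retry from the same source computes the identical key.
assert k1 == composed_key("razorpay-rzp", "TX8af9c2d4")
```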

What goes into the hash, and what does not

This is where the bug lives. Pick the wrong fields and the dedup either lets duplicates through or rejects legitimate updates as duplicates. There are exactly three field categories, and each has a fixed rule.

Identity-bearing fields go in. These are the fields that, taken together, define what this row means in the source system. For a transaction: txn_id, merchant_id, gateway. For a catalogue update: sku_id, vendor_id, effective_date. For a clickstream event: user_id, session_id, event_id, event_ts. The test is: if any of these change, it is a different row. If none of them change but other fields do, it is the same row updated.

Run-noise fields go out. Anything the extraction pipeline added — _extracted_at, _source_run_id, _pipeline_version, the envelope checksum the transport layer wrote, the SFTP filename, the Kafka partition number — must be excluded. These vary every time the same logical row is read again. Including them means every retry computes a fresh hash, every retry inserts a fresh "duplicate", and within a week the destination has 10x the row count it should.

Mutable-but-meaningful fields are the trap. A status column that goes PENDING → SUCCESS → REFUNDED. An updated_at that ticks forward every time the source row is modified. A last_login_at on a users dimension. Should these be in the hash? It depends on the load semantics, and getting it wrong is the most common dedup bug in production.

If the load is append-only event capture (every change is a new row), include the mutable field — (txn_id, status, updated_at) produces a fresh hash for every state transition, which is correct. If the load is upsert into a current-state table (the destination always shows the latest state per id), exclude the mutable field — (txn_id, merchant_id) produces a stable hash, the unique constraint catches retries of the same logical update, and a separate UPSERT mechanism handles the state change. Mixing the two is what produces "I see three copies of the same transaction with three different statuses, why?" tickets.
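The two semantics can be made concrete with two field tuples over the same logical row — a sketch using the chapter's field names, with `key_for` as an illustrative helper:

```python
import hashlib, json

def key_for(row: dict, fields: tuple) -> str:
    payload = json.dumps({f: row[f] for f in fields}, sort_keys=True,
                         separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

UPSERT_FIELDS = ("txn_id", "merchant_id")            # upsert-into-current-state
APPEND_FIELDS = ("txn_id", "merchant_id", "status")  # append-only event capture

pending = {"txn_id": "TX1", "merchant_id": "M9", "status": "PENDING"}
success = {"txn_id": "TX1", "merchant_id": "M9", "status": "SUCCESS"}

# Upsert semantics: the state change hashes the same, so a retry is a
# no-op and a separate UPSERT mechanism applies the new status.
assert key_for(pending, UPSERT_FIELDS) == key_for(success, UPSERT_FIELDS)

# Append-only semantics: every state transition is a fresh row.
assert key_for(pending, APPEND_FIELDS) != key_for(success, APPEND_FIELDS)
```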

[Figure: Three field categories for the dedup hash, in three columns. Identity (always in): txn_id, merchant_id, gateway, created_at — if it changes, it is a different row. Run noise (always out): _extracted_at, _source_run_id, _pipeline_version, _kafka_partition — varies across runs of the same row. Mutable (it depends, drawn dashed): status, amount, updated_at, last_login_at — append-only: in; upsert current state: out.]
Three categories. The middle column never goes in (every retry would produce a fresh "duplicate"). The left column always does. The dashed third column is where the bug lives — its inclusion is a load-semantic decision, not a default.

A subtler case is null and missing fields. A row missing the gateway_ref field hashes differently from a row whose gateway_ref is the empty string, which hashes differently from one whose gateway_ref is the literal string "null". The canonicaliser must define a deterministic representation for missing values — usually omitting the key from the JSON entirely, or always including it with the literal null JSON value. Inconsistent treatment within a single load is the second most common dedup bug.
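A short demonstration of those distinct cases, plus one possible canonicaliser rule — always include the key and map missing to JSON null. The rule itself is a choice; what matters is picking one and applying it everywhere:

```python
import hashlib, json

def h(payload: dict) -> str:
    return hashlib.sha256(json.dumps(payload, sort_keys=True,
                          separators=(",", ":")).encode("utf-8")).hexdigest()

# Four superficially "empty" shapes, four different hashes.
assert len({h({"gateway_ref": None}),     # explicit JSON null
            h({"gateway_ref": ""}),       # empty string
            h({"gateway_ref": "null"}),   # the literal string "null"
            h({})}) == 4                  # key absent entirely

# One deterministic rule collapses absent and explicitly-null to the
# same representation, so both rows hash identically.
def canonical_ref(row: dict) -> dict:
    return {"gateway_ref": row.get("gateway_ref")}

assert h(canonical_ref({})) == h(canonical_ref({"gateway_ref": None}))
```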

Numeric formatting is the third trap. The integer 42, the floating-point 42.0, and the string "42" are three different hashes. A vendor that switches its export from amount: 14250 (paise as integer) to amount: 142.50 (rupees as float) silently invalidates every dedup hash from the changeover forward, even though the underlying transaction is unchanged. The canonicaliser must normalise types — coerce all numerics to a fixed representation (paise as integer, never a float; never trust the source's choice) — and the unit-test suite for the canonicaliser must include rows that exercise every type the source can emit. The Swiggy order-events ingest learned this when an upstream Java service began emitting amounts as BigDecimal.toString() ("142.50") instead of Long ("14250"); the dupe-skip rate stayed pinned at 0% through the replays that followed, because every replayed "duplicate" hashed fresh and landed as a new insert.
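One way to pin the numeric representation is to coerce every amount shape the source might emit to integer paise via Decimal, never float arithmetic — `to_paise` here is an illustrative helper, not a library function:

```python
from decimal import Decimal

def to_paise(amount, unit: str) -> int:
    # Coerce every representation the source might emit to integer paise.
    # Decimal(str(...)) avoids float rounding: "142.50" -> exactly 14250.
    if unit == "paise":
        return int(amount)
    if unit == "rupees":
        return int(Decimal(str(amount)) * 100)
    raise ValueError(f"unknown unit: {unit}")

# Integer paise, float rupees, and string rupees all normalise identically,
# so the vendor's format change no longer changes the hash input.
assert to_paise(14250, "paise") == 14250
assert to_paise(142.50, "rupees") == 14250
assert to_paise("142.50", "rupees") == 14250
```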

A fourth trap, more subtle still, is field-presence drift. A vendor adds a new field risk_score to their export. The canonicaliser, written before the field existed, ignores it — correctly, because adding a new field shouldn't invalidate existing hashes. But a junior engineer "improves" the canonicaliser six months later to include the new field for "completeness", and from that day forward every row produces a fresh hash. The discipline is to enumerate identity fields explicitly, never to compute the hash from "all fields present in the row", and to require an explicit version bump for any change to the field list — exactly the version-tag mechanism described later in the Going-deeper section.

A complete, runnable dedup-on-load implementation

The pattern is small. The compute_dedup_key function below is the load-bearing one — get its canonicaliser right and the rest is mechanical. The example loads payment-settlement rows from a JSON file into a Postgres table with a unique constraint on the dedup key, retries a deliberately-half-completed previous run, and observes that the second attempt is silently a no-op for already-loaded rows.

# dedup_load.py — content-hash deduplication for idempotent retries.
import json, hashlib, datetime
import psycopg2

# Identity-bearing fields ONLY. Mutable status/amount/updated_at are excluded
# because this load is upsert-into-current-state. Run-noise fields are also
# excluded.
IDENTITY_FIELDS = ("source_system", "txn_id", "merchant_id", "created_at")

def canonicalise(row: dict) -> bytes:
    payload = {}
    for k in IDENTITY_FIELDS:
        v = row.get(k)
        if isinstance(v, datetime.datetime):
            v = v.astimezone(datetime.timezone.utc).isoformat()
        payload[k] = v
    return json.dumps(payload, sort_keys=True,
                      separators=(",", ":"), default=str).encode("utf-8")

def compute_dedup_key(row: dict) -> str:
    return hashlib.sha256(canonicalise(row)).hexdigest()

DDL = """
CREATE TABLE IF NOT EXISTS settlements (
    dedup_key   TEXT PRIMARY KEY,
    source_system TEXT NOT NULL,
    txn_id      TEXT NOT NULL,
    merchant_id TEXT NOT NULL,
    amount_paise BIGINT NOT NULL,
    status      TEXT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL,
    loaded_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);
"""

INSERT_SQL = """
INSERT INTO settlements (dedup_key, source_system, txn_id, merchant_id,
                         amount_paise, status, created_at)
VALUES (%(dedup_key)s, %(source_system)s, %(txn_id)s, %(merchant_id)s,
        %(amount_paise)s, %(status)s, %(created_at)s)
ON CONFLICT (dedup_key) DO NOTHING
"""

def load(rows: list[dict], conn, commit_every: int = 100) -> tuple[int, int]:
    inserted, skipped = 0, 0
    with conn.cursor() as cur:
        for n, row in enumerate(rows, start=1):
            row["dedup_key"] = compute_dedup_key(row)
            cur.execute(INSERT_SQL, row)
            if cur.rowcount == 1:
                inserted += 1
            else:
                skipped += 1
            # Commit periodically so a mid-run crash leaves the rows it
            # already wrote behind — exactly the partial state the dedup
            # key exists to absorb on retry.
            if n % commit_every == 0:
                conn.commit()
    conn.commit()
    return inserted, skipped

if __name__ == "__main__":
    with open("settlements.json") as f:
        rows = json.load(f)
    with psycopg2.connect("host=localhost dbname=razorpay user=etl") as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)
        ins, skp = load(rows, conn)
        print(f"loaded: inserted={ins}, skipped(dupes)={skp}")

A two-run sample, where the first run dies after writing 1,200 of 4,200 rows (network drop), and the second run replays the entire input file:

$ python dedup_load.py
psycopg2.OperationalError: connection reset by peer
  (crash mid-load; 1,200 rows were already committed)

$ python dedup_load.py
loaded: inserted=3000, skipped(dupes)=1200

Total rows in settlements: exactly 4,200. No duplicates, no missing rows, no operator intervention. The skipped(dupes)=1200 line in the second run is the only signal that the first run partially succeeded — and it is a piece of useful observability rather than an error condition.

The four load-bearing lines deserve a careful look.

payload[k] = v with IDENTITY_FIELDS only. This is the field-selection decision from "What goes into the hash" made concrete. Adding status to that tuple would turn this into an append-only event log; removing created_at would mean two transactions on different days with the same (source, txn_id, merchant_id) collide. Every line of the rest of this script is mechanical; this tuple is the contract.

json.dumps(payload, sort_keys=True, separators=(",", ":")). The canonical encoding. sort_keys=True ensures {"a":1,"b":2} and {"b":2,"a":1} produce the same hash. separators=(",", ":") removes the default whitespace, so different Python versions or different JSON libraries produce the same byte output. Without this, an upgrade from Python 3.11 to 3.12 (or a switch from json to orjson) silently changes every hash, the unique constraint sees every row as new, and the destination grows by one full input every run.

v.astimezone(datetime.timezone.utc).isoformat(). Datetime normalisation. A row with created_at = "2026-04-25 17:14:00+05:30" and a row with created_at = "2026-04-25 11:44:00Z" represent the same instant; their isoformat() strings are different until you normalise. This single line eliminates a class of dedup bugs that took the team weeks to surface — settlements arriving from a vendor in Asia/Kolkata while the staging system uses UTC produced fresh hashes for the same logical row every time the timezone library upgraded.

ON CONFLICT (dedup_key) DO NOTHING. The destination's enforcement. The Postgres unique-index machinery catches the collision, returns a rowcount of 0, and the application interprets that as "we already have this row". Other engines have the same primitive under different names: MySQL INSERT ... ON DUPLICATE KEY UPDATE, BigQuery MERGE, Snowflake MERGE, Iceberg MERGE INTO. The pattern is universal; the keyword is local.

Why sort_keys is non-negotiable: dict ordering in CPython 3.7+ is insertion-order, which means the order keys appear in the final JSON depends on how the row was constructed. Two functionally identical pipelines that build the row dict in different orders produce different hashes. sort_keys=True decouples hash output from row-construction order, which decouples it from upstream code changes that don't actually change the row's content.
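The decoupling takes two lines to demonstrate:

```python
import hashlib, json

def canon(payload: dict) -> bytes:
    return json.dumps(payload, sort_keys=True,
                      separators=(",", ":")).encode("utf-8")

# Two pipelines build the "same" row in different insertion orders.
a = {"txn_id": "TX1", "merchant_id": "M9"}
b = {"merchant_id": "M9", "txn_id": "TX1"}

# sort_keys decouples the byte output — and hence the hash — from
# dict construction order.
assert canon(a) == canon(b)
assert hashlib.sha256(canon(a)).hexdigest() == hashlib.sha256(canon(b)).hexdigest()
```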

Why SHA-256 and not MD5 or CRC32: collision rate. CRC32 has 2^32 ≈ 4 billion possible outputs; by the birthday bound, collisions become likely at around 2^16 ≈ 65k rows, and at a billion rows they are a certainty — millions of them. MD5 is collision-broken in the cryptographic sense, but for non-adversarial dedup the practical rate is still vanishingly low — except on inputs an attacker controls. SHA-256's collision space is so large the question doesn't arise. The hash compute is ~500 ns per call in Python, far less in C; you will not bottleneck on it.

Where the dedup key lives in the destination

The hash is only useful if the destination indexes it as a unique key. There are four places to put it, in increasing order of operational weight, and the right choice depends on the destination's storage model.

Unique constraint on a row-store table. The pattern shown above. Postgres, MySQL, SQL Server, SQLite. The destination engine maintains a B-tree index on the dedup column; insert-with-conflict goes through the index and is rejected at the storage layer. Throughput is bounded by index-maintenance cost (typical Postgres: 30k–80k inserts/sec on a single node, well above what most batch loaders need). This is the right default for any OLTP-style destination up to about a billion rows.

Partition key on a columnar table. BigQuery, Snowflake, Redshift, Iceberg, Delta. These engines do not natively enforce uniqueness on insert (the storage model makes it expensive — you would have to scan every partition). Instead, the loader does the dedup in two steps: write to a staging area, then MERGE staging INTO target ON target.dedup_key = staging.dedup_key WHEN NOT MATCHED INSERT. The MERGE runs once per batch, not once per row, so the cost is amortised. Production lakehouse pipelines all use this pattern; Iceberg's MERGE INTO and Delta Lake's MERGE syntax are direct ports.
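The two-step staging pattern can be sketched as the loader would issue it, following the style of the script's INSERT_SQL constant — the staging table name settlements_staging is an assumption, and the exact MERGE dialect varies slightly by engine:

```python
# Columnar destinations enforce dedup per batch, not per row: load the
# batch into a staging table, then MERGE on the dedup key. Table names
# here are illustrative.
MERGE_SQL = """
MERGE INTO settlements AS t
USING settlements_staging AS s
  ON t.dedup_key = s.dedup_key
WHEN NOT MATCHED THEN
  INSERT (dedup_key, source_system, txn_id, merchant_id,
          amount_paise, status, created_at)
  VALUES (s.dedup_key, s.source_system, s.txn_id, s.merchant_id,
          s.amount_paise, s.status, s.created_at)
"""
```

The MERGE runs once per staged batch, so the per-row cost of the dedup check is amortised across the whole file.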

Bloom-filter approximate dedup. When the destination cannot efficiently maintain a unique index — for example, a Kafka topic with millions of messages per second — the dedup happens in the consumer, against an approximate set structure. A Bloom filter never produces false negatives for keys it has recorded, but it has a tunable false-positive rate — and in dedup, a false positive ("yes, I have seen this row") drops a genuinely new row, which is data loss. The filter is therefore sized for a vanishingly small false-positive rate, and any duplicates that leak through after the filter is rotated or evicted are absorbed by ON CONFLICT at a downstream destination. This is the regime Build 8's stream-processing chapters return to.
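A minimal in-consumer Bloom filter looks like this — the sizing (2^20 bits, 5 hash positions) and the hashing scheme are illustrative, not tuned:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for dedup pre-checks (illustrative sizing)."""
    def __init__(self, bits: int = 1 << 20, hashes: int = 5):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key: str):
        # Derive k independent bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            d = hashlib.sha256(f"{i}|{key}".encode("utf-8")).digest()
            yield int.from_bytes(d[:8], "big") % self.bits

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
bf.add("txn-TX8af9c2d4")
assert bf.might_contain("txn-TX8af9c2d4")  # no false negatives for stored keys
# A "False" answer is definitive; a "True" answer is only probable.
```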

External dedup table. A separate seen_keys (dedup_key TEXT, first_seen_at TIMESTAMPTZ) table that the loader checks before inserting into the main fact table. Useful when the main table cannot afford the unique-index overhead, or when multiple loaders share a dedup namespace, or when you need to record when a duplicate was first seen. Operationally heavier — two transactions per row, two indexes to maintain — but composable with destinations that don't support ON CONFLICT.

The decision tree is straightforward: row-store OLTP destination → unique constraint; columnar warehouse destination → MERGE on dedup key; streaming destination → bloom filter; cross-system shared dedup → external table. The hash itself is identical across all four; only the enforcement layer changes.

The cost of a unique constraint deserves one more sentence. On a Postgres table with a B-tree on dedup_key, every insert pays roughly an O(log n) cost for the index lookup plus the conflict check; in practice on a 100M-row table that is ~12 microseconds of CPU per insert — comfortably compatible with the tens of thousands of rows per second a single batch loader pushes. The cost grows with table size, but only logarithmically — moving from 100M to 1B rows adds about 2 microseconds per insert. For batch-loaded fact tables that grow by a few million rows per day, the dedup index is essentially free. For high-rate streaming loads, it is not free, which is why the streaming variant moves the dedup state into the consumer's memory and accepts the bloom-filter approximation.

A pattern that surfaces in practice but is rarely written down is the secondary dedup window. The unique constraint enforces dedup forever — once a row's hash is in the index, the row will never be re-inserted. But many production loads only need dedup over a window: the last 24 hours of clickstream events, the last week of CDC changes, the last month of settlement retries. After that window, the dedup index is dead weight. The fix is partition pruning: partition the destination table by created_at, drop the unique constraint at the table level, add it per-partition, and drop old partitions. The index size is bounded by the dedup window, not by the table's lifetime. Zerodha's tick-data ingest uses this with a 7-day partition window; their dedup index is consistently 4 GiB regardless of how long the system has been running.

Going deeper

Composing the dedup hash with the checkpoint pattern

The previous chapter's checkpoint and this chapter's dedup hash are not redundant — they cover different failure modes, and a robust pipeline uses both. The checkpoint cuts the common case where a job restarts after a crash and skips the work it already committed; without the checkpoint, every restart re-processes everything from the top, which on a 14-lakh-row job is unacceptable. The dedup hash catches the boundary case where the data committed but the checkpoint did not advance, or where the same source row arrives twice from a re-export — situations the checkpoint alone cannot reason about because they straddle the gap between commit and checkpoint write.

A pipeline with only the checkpoint and no dedup hash will write duplicates after a crash between commit() and write_checkpoint() — the restart replays the batch that committed but was never checkpointed. A pipeline with only the dedup hash and no checkpoint will re-process every batch from the top on every restart, hitting the unique constraint a few hundred million times before finally getting back to where it died. Together, the checkpoint cuts 99.99% of redundant work and the dedup catches the 0.01% the checkpoint missed. The Razorpay settlement-load pipeline that survives 06:00 deadlines uses both; the teams that learned this lesson the hard way now use both too.
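The composition can be sketched with in-memory stand-ins — a Python set plays the unique constraint and a single cell plays the checkpoint store (read_checkpoint and write_checkpoint are hypothetical helpers standing in for the previous chapter's pattern):

```python
def run_pipeline(batches, load, read_checkpoint, write_checkpoint):
    start = read_checkpoint()          # checkpoint: skip committed work
    for i, batch in enumerate(batches):
        if i < start:
            continue
        load(batch)                    # dedup hash: absorbs the replay of a
                                       # committed-but-uncheckpointed batch
        write_checkpoint(i + 1)

dest, ckpt = set(), [0]
def load(batch): dest.update(batch)    # set membership == unique constraint
def read_checkpoint(): return ckpt[0]
def write_checkpoint(n): ckpt[0] = n

batches = [["a", "b"], ["c", "d"], ["e"]]
# Simulate the boundary failure: batches 0 and 1 committed, but the crash
# hit between the commit of batch 1 and its checkpoint write.
load(batches[0]); write_checkpoint(1)
load(batches[1])                       # committed, never checkpointed

run_pipeline(batches, load, read_checkpoint, write_checkpoint)
assert dest == {"a", "b", "c", "d", "e"}   # no loss, no duplicates
assert read_checkpoint() == 3
```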

When the dedup hash needs a salt or version

A dedup hash computed today must produce the same output a year from now, against the same input, or every retry of a year-old row will look fresh. This is the stability requirement, and it puts pressure on the canonicaliser: it cannot evolve, except by an explicit version bump.

The version is a small string prepended to the canonicalised payload before hashing: sha256("v3|" + canonical_payload). When the canonicaliser changes — a new identity field is added, a normalisation rule is fixed, a bug in the timezone handling is patched — the version increments and every newly-loaded row gets the new hash. Older rows in the destination keep their v2-era hashes; new retries of the same logical row produce a v3 hash, which collides with itself but not with the older v2 row. The two rows now coexist in the destination as a known duplicate, which a one-time backfill job can reconcile.
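A sketch of the version prefix, assuming the canonical payload bytes have already been computed upstream:

```python
import hashlib

CANON_VERSION = "v3"  # bump on ANY change to the canonicaliser

def compute_dedup_key(canonical_payload: bytes,
                      version: str = CANON_VERSION) -> str:
    return hashlib.sha256(version.encode("utf-8") + b"|"
                          + canonical_payload).hexdigest()

payload = b'{"merchant_id":"M9","txn_id":"TX1"}'
# The same payload under two canonicaliser versions yields two keys —
# old rows and new rows coexist as a known, reconcilable duplicate:
assert compute_dedup_key(payload, "v2") != compute_dedup_key(payload, "v3")
# Retries under the current version still collide with themselves:
assert compute_dedup_key(payload) == compute_dedup_key(payload)
```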

Without a version, fixing a canonicaliser bug means every previously-loaded row will appear as a duplicate of itself on the next retry, and the team has no way to distinguish "this is a real change in canonicaliser semantics" from "this is a corruption". The Flipkart catalogue pipeline learned this in 2023 when an upstream JSON library's number-formatting change silently produced fresh hashes for every SKU; without a version field, the team had to rebuild dedup state from a source-of-truth backup. Their post-incident change was to add the version prefix, and to require any change to the canonicaliser to bump it.

Hashing in streaming versus batch

In a batch load, every row's hash is computed at load time and the destination's unique constraint enforces dedup synchronously. The constraint check is cheap — a B-tree lookup — and the worst case is the loader stalls on the index for a few milliseconds.

In a streaming load, the same pattern would force the consumer to do a remote lookup against the destination per message, and at 100k messages/second per consumer that is unworkable. The streaming variant moves the dedup state into the consumer's local memory: a fixed-capacity bloom filter or dict[hash, expiry_ts] cache that records which keys have been seen in the last N minutes. The cache size and the retention window are tuned to the rate of expected duplicates and the consumer's memory budget.

Two complications arise. First, the in-memory dedup state must be checkpointed (back to the previous chapter) so a consumer restart does not re-emit messages it already deduplicated. Flink's RocksDB-backed state is the production-grade version of this; build-your-own variants typically write the dedup-key set to S3 every minute. Second, the cache eventually overflows and must drop entries; once an entry is dropped, a duplicate that arrives after the eviction is no longer detectable. The defence is to size the cache for the expected duplicate window — for retries that happen within minutes, a five-minute cache suffices; for replays that span days, the dedup must move to a persistent store. Apache Pulsar's deduplication feature, Kafka's idempotent producer, and Flink's exactly-once semantics are all variations on this theme; Build 8 and Build 9 unpack them.

Hash collisions and adversarial inputs

For non-adversarial inputs — rows arriving from a trusted source via an authenticated channel — SHA-256 collisions are not a practical concern. By the birthday bound, the probability of any collision among a billion rows is roughly n^2 / 2^257 ≈ 4 × 10^-60, dozens of orders of magnitude below the probability of a cosmic-ray bit flip during the hash computation itself.

For adversarial inputs — user-controlled content, untrusted file uploads, web-scraped data where the source can choose its own content — the situation is different. SHA-256 is currently considered safe against adversarial collision-finding (no practical attack is known), but historically MD5 was considered safe in 1995 and broken by 2004. The discipline for adversarial workloads is to use a hash with a reasonable safety margin (SHA-256 today, BLAKE3 if you want a faster modern option) and to add an integrity check beyond the hash itself — for example, requiring an HMAC computed by the source with a secret the destination knows.

A subtler issue is that even without an adversary, two genuinely-different rows can map to the same hash if the canonicaliser is buggy. A canonicaliser that drops a field by mistake, or that normalises two different timestamps to the same value because of a timezone error, will produce a real (semantic) collision that has nothing to do with the hash function. The mitigation is to keep the original payload alongside the hash and check on conflict: INSERT ... ON CONFLICT (dedup_key) DO UPDATE SET ... WHERE settlements.payload = EXCLUDED.payload turns genuine duplicates into no-op updates, while a conflicting key with a different payload matches neither branch and returns a rowcount of 0 — the signal the loader turns into an alert. This costs a column of storage and a conditional on insert; the payoff is that canonicaliser bugs are caught at write time rather than discovered three months later.

Operating the dedup key in production: monitoring, alerts, and rotations

A dedup key is a piece of running infrastructure, not a one-time design decision. Three operational signals are worth tracking from day one. The dupe-skip rate — the fraction of incoming rows that hit ON CONFLICT DO NOTHING and are silently dropped — should sit close to zero in steady state, with a few-percent spike during the recovery from a failed run. A persistently high dupe-skip rate (say, 30% of rows on every healthy run) usually means an upstream is replaying a window for some non-obvious reason — a stuck CDC consumer, a misconfigured Kafka offset reset, a vendor SFTP that re-uploads yesterday's file alongside today's. The dedup is doing exactly its job, but the upstream behaviour costs throughput and noise.

The canonicaliser-version distribution is the second signal. A SELECT canon_version, count(*) FROM settlements GROUP BY canon_version query should show a clean transition: rows loaded before a canonicaliser change are at version N, rows loaded after are at N+1, with a brief overlap during the deploy. A long tail of mixed versions — rows from last week still arriving as v2 alongside today's v3 — means a stuck consumer or a backfill running against the old code. The Cred rewards-ledger team caught a three-month-old stuck Lambda this way; the version tag was the only signal.

The payload-mismatch rate for systems using the optional payload tie-breaker column is the third. Two rows with the same dedup_key but different payload are either a hash collision (essentially never with SHA-256) or, far more likely, a canonicaliser bug that is mapping different rows to the same key. Even one mismatch per million rows is worth investigating, because canonicaliser bugs tend to pile up — once you find one, you usually find six.

The dedup column itself, indexed and unique-constrained, is also a source of operational pain at scale. A 64-character hex dedup_key column is 64 bytes per row; on a billion-row table that is 60 GiB before the index, plus another 30 GiB for the B-tree itself. Storing the hash as raw bytes (BYTEA in Postgres, BINARY(32) in MySQL) cuts this to 32 bytes plus index overhead, but you lose the ability to copy-paste the value into ad-hoc queries. The PhonePe transaction-history team chose BYTEA and added a one-line view that hex-encodes the column for human use; the storage savings paid for themselves the first quarter the table grew past 5 billion rows.

When the dedup hash is wrong, what happens

It is worth spending a paragraph on the incident pattern, because the dedup hash failure modes are unusually quiet — they do not throw exceptions, they do not page the on-call; they corrupt the data slowly, and they are usually caught by a downstream analyst who notices that revenue numbers are off.

A canonicaliser that excludes too many fields maps multiple genuinely-different rows to the same hash. The unique constraint silently rejects all but one. The destination is missing rows. The on-call sees no error. The analyst sees that the daily revenue from one merchant is suspiciously low, three weeks later, and files a ticket. The forensics require comparing source-system row counts against destination row counts per (merchant, day) and discovering the gaps.

A canonicaliser that includes too many fields — particularly run-noise fields that vary across runs — produces a fresh hash for every retry of the same logical row. The destination has duplicates. The on-call sees no error (the unique constraint is happy with the unique-but-redundant rows). The analyst sees that order counts are roughly 2x what the orderbook says they should be, and files a ticket. The forensics require running SELECT count(*), count(distinct (source_system, txn_id, merchant_id, created_at)) FROM settlements and discovering that the second count is much smaller.

The defensive habit that catches both is to add the payload-stability test to the canonicaliser's unit-test suite: take a row, build it two ways (different key insertion orders, different timezone formats for the timestamp, both with-and-without the optional risk-score field if the source might emit either), assert that all variants produce the same hash. Run this test on every commit. The first time someone "improves" the canonicaliser in a way that breaks dedup, the test fails before the change reaches production. The Razorpay team that runs this test has caught three canonicaliser regressions in two years; without the test, each would have been an analyst-discovered incident weeks later.
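A sketch of that payload-stability test, repeating the chapter's canonicaliser so the test is self-contained:

```python
import datetime, hashlib, json

IDENTITY_FIELDS = ("source_system", "txn_id", "merchant_id", "created_at")

def canonicalise(row: dict) -> bytes:
    payload = {}
    for k in IDENTITY_FIELDS:
        v = row.get(k)
        if isinstance(v, datetime.datetime):
            v = v.astimezone(datetime.timezone.utc).isoformat()
        payload[k] = v
    return json.dumps(payload, sort_keys=True,
                      separators=(",", ":"), default=str).encode("utf-8")

def compute_dedup_key(row: dict) -> str:
    return hashlib.sha256(canonicalise(row)).hexdigest()

def test_hash_stability():
    ist = datetime.timezone(datetime.timedelta(hours=5, minutes=30))
    base = {"source_system": "rzp", "txn_id": "TX1", "merchant_id": "M9",
            "created_at": datetime.datetime(2026, 4, 25, 17, 14, tzinfo=ist)}
    variants = [
        dict(reversed(list(base.items()))),      # different insertion order
        {**base, "created_at":                   # same instant, expressed in UTC
         datetime.datetime(2026, 4, 25, 11, 44,
                           tzinfo=datetime.timezone.utc)},
        {**base, "risk_score": 0.7},             # optional non-identity field
    ]
    keys = {compute_dedup_key(v) for v in variants} | {compute_dedup_key(base)}
    assert len(keys) == 1, f"hash not stable across variants: {keys}"

test_hash_stability()
```

Run it on every commit; the first "improvement" that breaks dedup fails here instead of in production.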

Where this leads next

Chapter 9 takes the dedup hash as input and builds the destination-side primitive that turns "insert with conflict" into "merge with replace" — the UPSERT and MERGE patterns used when the load is current-state rather than append-only. Chapter 10 studies the at-least-once contract that this chapter assumes (the loader will sometimes ship the same row twice; the destination must absorb it gracefully) and the at-most-once and exactly-once contracts that other patterns aim for.

Build 7 returns to dedup from the producer side — Kafka's idempotent-producer protocol, which assigns each message a (producerId, sequenceNumber) and lets the broker reject duplicate sequences. The mechanism is structurally the same as the content hash here: a deterministic identifier computed at the source, checked at the destination, used to make retries safe. Build 11 returns again, this time in the change-data-capture context where the dedup key must be derived from the source database's transaction log identifier (Postgres LSN, MySQL binlog position) so that a CDC consumer can resume from any point without re-emitting changes the destination already absorbed.
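The structural parallel with the content hash can be sketched in a few lines. This toy broker — not Kafka's actual protocol, which tracks sequence windows per partition and detects gaps — shows only the dedup decision: a deterministic identifier from the source, checked at the destination.

```python
class ToyBroker:
    """Toy sketch of sequence-number dedup, loosely modelled on the
    idempotent-producer idea: remember the highest sequence accepted
    per producer, drop anything at or below it."""

    def __init__(self):
        self.last_seq = {}   # producer_id -> highest sequence accepted
        self.log = []

    def append(self, producer_id, seq, message):
        if seq <= self.last_seq.get(producer_id, -1):
            return False     # duplicate retry: silently dropped, like a key collision
        self.last_seq[producer_id] = seq
        self.log.append(message)
        return True

broker = ToyBroker()
broker.append("p1", 0, "txn_8af9c2d4")
broker.append("p1", 0, "txn_8af9c2d4")   # network blip, producer retries: rejected
broker.append("p1", 1, "txn_77b1e0aa")
assert broker.log == ["txn_8af9c2d4", "txn_77b1e0aa"]
```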

References

  1. Postgres ON CONFLICT documentation — the canonical reference for conflict-driven dedup at insert time, including the difference between DO NOTHING and DO UPDATE.
  2. SHA-256 in FIPS 180-4 — the specification for the hash function used as the default in this chapter, and the basis for its collision-resistance guarantees.
  3. Iceberg MERGE INTO — production-grade MERGE syntax for columnar lakehouse destinations, used by every team running large-scale dedup-on-load against Parquet files in S3.
  4. Kafka idempotent producer (KIP-98) — the streaming variant of dedup, where the producer supplies a sequence number and the broker rejects duplicates at the topic level.
  5. BLAKE3 specification — a modern fast hash with comparable collision resistance to SHA-256, useful when hashing throughput is a measurable bottleneck.
  6. Apache Pulsar message deduplication — broker-side dedup with a configurable window, illustrating the streaming-cache pattern in production.
  7. What "idempotent" actually means for data (and why it's hard) — the chapter-6 conceptual frame this chapter operationalises.
  8. Designing Data-Intensive Applications, Chapter 11 — Stream Processing — Martin Kleppmann's treatment of dedup in stream processing, including the difference between idempotent producers and exactly-once processing.

A practical exercise to internalise the canonicaliser: take the compute_dedup_key function above and feed it two rows that you believe are the same but that have minor formatting differences — different ISO timestamp formats, different key orderings, one with a null field present and one with the field absent. Compare the hashes. They should match if your canonicaliser is correct. If they don't, add normalisation steps until they do, then write a unit test that pins the canonicaliser's output for a fixed input. That test is the contract; without it, a future "minor refactor" of the canonicaliser will silently re-hash every row and double your destination's row count.

A second exercise: in the destination table, run SELECT COUNT(*), COUNT(DISTINCT dedup_key) FROM settlements after a few weeks of production runs. The two numbers must be equal — COUNT(*) counts physical rows, COUNT(DISTINCT dedup_key) counts logical rows, and the unique constraint ensures they cannot diverge. They will diverge the day someone disables the unique constraint "temporarily" and forgets to re-enable it. Set up an alert on COUNT(*) > COUNT(DISTINCT dedup_key) and you will catch that incident the same hour it happens, not three months later when finance reconciles the books.
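The invariant is easy to watch in miniature. A sketch using an in-memory SQLite table as a stand-in for the settlements destination — SQLite's INSERT OR IGNORE playing the role of Postgres's ON CONFLICT DO NOTHING:

```python
import sqlite3

# In-memory stand-in for the settlements table: the unique constraint on
# dedup_key is what keeps COUNT(*) equal to COUNT(DISTINCT dedup_key).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE settlements (txn_id TEXT, dedup_key TEXT UNIQUE)")

rows = [("txn_8af9c2d4", "hash_a"),
        ("txn_8af9c2d4", "hash_a"),   # the retry ships the same row again
        ("txn_77b1e0aa", "hash_b")]
con.executemany("INSERT OR IGNORE INTO settlements VALUES (?, ?)", rows)

physical, logical = con.execute(
    "SELECT COUNT(*), COUNT(DISTINCT dedup_key) FROM settlements").fetchone()
assert physical == logical == 2   # the constraint absorbed the retry
```

Drop the UNIQUE constraint and rerun: `physical` becomes 3 while `logical` stays 2, which is exactly the divergence the alert should fire on.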

A third exercise, for senior engineers calibrating their own pipelines: walk every dedup hash in a system you currently own and answer three questions. Is the field selection documented and code-reviewed? Is the canonicaliser version-tagged? Are the rows on each side of every retry boundary verified to produce identical hashes by an automated test? A "no" to any of these is a quiet bug waiting for the next infra upgrade or library bump to surface. The most common "no" in the field is the third one — the canonicaliser is correct today, but no test pins its behaviour, and the day a Python upgrade changes some default formatting, every retry produces fresh duplicates and nobody knows why.