Partial failures and the at-least-once contract
At 02:47 on a Wednesday, Riya gets the page she has been dreading. The Razorpay settlement-aggregation pipeline has crashed at the third stage of seven, with the message psycopg2.OperationalError: connection terminated unexpectedly. The first stage (extract from the OLTP replica) finished. The second stage (load 1.8 lakh rows into the staging schema) finished. The third stage (the daily MERGE into settlements_current) was halfway through when the pooler killed the connection. Some rows are merged. Some rows are still in staging waiting to be merged. The downstream stages — fee calculation, GST split, NEFT payout file — have not started. The pipeline reports FAILED. The phrase "partial failure" is doing a lot of work here, because the system is not in either of the two states the on-call runbook covers ("everything succeeded" or "nothing happened"). It is in a third state: some progress, no clear summary. And the only safe move is to retry — which is only safe if every stage was designed for it.
The previous chapter showed the MERGE primitive that absorbs duplicates inside a single statement. This chapter widens the lens: the pipeline as a whole has multiple stages, multiple processes, multiple network hops, and any of them can fail at any point. The contract that keeps the system honest is at-least-once delivery with idempotent processing — the upstream promises every event will be delivered at least once (sometimes more), and the downstream promises that processing the same event N times produces the same observable state as processing it once.
Partial failure is the default state of any non-trivial pipeline: a stage crashes mid-run, and the downstream cannot tell which rows made it through. The at-least-once contract says retries are allowed and expected, and the responsibility for correctness moves into the receiver — every stage must be idempotent so that re-processing the same input produces the same observable state. Exactly-once is mostly a marketing term; the production reality is at-least-once delivery plus end-to-end idempotency, and the pipelines that survive years in production are the ones whose every stage absorbs duplicates without complaint.
The three failure shapes
A pipeline stage interacts with the outside world through three kinds of operations: it reads from a source, it does compute, and it writes to a destination. Failures are categorised by which of those three was in flight when the crash happened.
The leftmost case — crash before any external write — is the easy one. Nothing has changed in the destination, the source still has the unread data, and the retry simply does the work for the first time. Most local exceptions (parsing errors, division by zero, schema mismatches caught before the write) live here. The try/except block that surrounds the read-and-compute phase catches them and the orchestrator retries the stage with the same input. Correctness is automatic.
The middle case — crash during the write — is where the production bugs live. Some rows are committed, some rows are not, and the stage cannot tell which is which without re-reading the destination. If the write was a single SQL transaction (BEGIN; INSERT...; INSERT...; COMMIT;) the database's atomicity guarantee saves you — either all rows are visible or none are, and the retry is the same as case one. If the write was many separate transactions, or a streamed COPY flushed in chunks, or a sequence of HTTP calls to a downstream API, then the destination is in a partial state and the retry has to be designed for it. This is the case the at-least-once contract addresses.
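The single-transaction claim above is easy to demonstrate. A minimal sketch, with Python's stdlib sqlite3 standing in for the warehouse (table name and rows invented for illustration): a batch wrapped in one transaction either lands completely or not at all, so the retry collapses back into the easy first case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settlements (id TEXT PRIMARY KEY, amount INTEGER NOT NULL)")

def write_batch(rows) -> bool:
    """All-or-nothing: one transaction wraps the whole batch."""
    try:
        with conn:  # BEGIN ... COMMIT, or ROLLBACK on any error
            conn.executemany("INSERT INTO settlements VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False

# A batch that dies on its third row: the first two must NOT survive.
assert write_batch([("s1", 100), ("s2", 200), ("s3", None)]) is False
assert conn.execute("SELECT count(*) FROM settlements").fetchone()[0] == 0

# The retry is then identical to a first attempt — case one, not case two.
assert write_batch([("s1", 100), ("s2", 200), ("s3", 300)]) is True
assert conn.execute("SELECT count(*) FROM settlements").fetchone()[0] == 3
```

If the same three inserts ran as three separate autocommit statements, the failed batch would leave two rows behind — the partial state this case is about.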
The rightmost case — write succeeded, but the acknowledgment to the source was lost — is the subtlest. The destination has the data. The source does not know. On retry the source ships the data again, and unless the destination can recognise "I have already seen this" the row appears twice. This is the case that turns a well-meaning at-least-once retry into a corrupted ledger. Idempotent processing — the destination's ability to absorb duplicates — is the only defence.
Why "exactly-once" is mostly marketing: every distributed system has at least one network hop where the sender cannot tell whether the receiver got the message — was the message lost, or was the ack lost? The two cases are indistinguishable from the sender's side. The sender's options are "retry and risk a duplicate" (at-least-once) or "don't retry and risk a loss" (at-most-once). There is no third option at the network layer. The "exactly-once" systems you read about — Kafka transactions, Flink's two-phase-commit sink — implement at-least-once delivery plus idempotent or transactional processing on the receive side. The exactly-once is end-to-end behaviour, not a wire-protocol property. Build 9 returns to this in detail.
The contract, written out
The at-least-once contract is two clauses. Read them as a contract, because that is what they are: a producer side and a consumer side, each promising something the other depends on.
Producer side. "I will deliver every record at least once. I may deliver some records more than once. I will not lose records that I have acknowledged. After a crash or restart, I will resume from a checkpoint that may include records I have already delivered, because that is safer than skipping records I haven't."
Consumer side. "I will process every record I receive. I will produce the same observable state for the destination whether I receive a given record once, twice, or N times. I will commit my progress in a way that survives my own crashes, so that I do not silently re-process and double-write."
Each clause is necessary; neither alone is sufficient. A producer that delivers exactly-once with a non-idempotent consumer is fine until the producer's "exactly-once" assumption fails (which it will, the first time someone restores from a snapshot). A consumer that is idempotent with an at-most-once producer silently loses data — the consumer is happy to absorb duplicates, but the producer dropped the only copy on its way over.
The pattern that makes this work is the dedup key: a column or expression on the receive side that uniquely identifies a record's logical identity, independent of how many times it was delivered or what timestamp it arrived with. The previous two chapters covered the two primitives that use it — INSERT ... ON CONFLICT DO NOTHING for append-only event tables and MERGE INTO for current-state tables. This chapter steps up one level and asks: how do you decide what the dedup key is, and how do you commit consumer-side progress without losing the at-least-once guarantee?
A pipeline you can crash and re-run
The example below builds a small extract-load stage that demonstrates the contract end to end. The producer reads payment events from a JSON-lines file, fakes a network glitch by crashing with high probability once past 60% of the file — deliberately in the window after a batch's COMMIT but before its checkpoint save — and uses a checkpoint file to remember the byte offset of the last batch it acknowledged. The consumer writes into Postgres with an idempotent insert keyed on event_id. After a crash, the producer resumes from the last checkpoint — which may overlap with rows already delivered, by design — and the consumer absorbs the duplicates.
# alo_pipeline.py — at-least-once producer + idempotent consumer.
import json, os, random, sys, time
import psycopg2
from psycopg2.extras import execute_values

CHECKPOINT = "checkpoint.json"
INPUT = "payments.jsonl"

DDL = """
CREATE TABLE IF NOT EXISTS payments_loaded (
    event_id     TEXT PRIMARY KEY,
    merchant_id  TEXT NOT NULL,
    amount_paise BIGINT NOT NULL,
    event_ts     TIMESTAMPTZ NOT NULL,
    inserted_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);
"""

INSERT_SQL = """
INSERT INTO payments_loaded (event_id, merchant_id, amount_paise, event_ts)
VALUES %s
ON CONFLICT (event_id) DO NOTHING
RETURNING event_id;
"""

def load_checkpoint() -> int:
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["byte_offset"]
    return 0

def save_checkpoint(offset: int) -> None:
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"byte_offset": offset}, f)
    os.replace(tmp, CHECKPOINT)  # atomic on POSIX

def run_batch(conn, rows: list[dict]) -> int:
    tuples = [(r["event_id"], r["merchant_id"], r["amount_paise"], r["event_ts"])
              for r in rows]
    with conn.cursor() as cur:
        # fetch=True makes execute_values return the RETURNING rows; only
        # rows actually inserted (not conflict-dropped) come back.
        inserted = len(execute_values(cur, INSERT_SQL, tuples, fetch=True))
    conn.commit()
    return inserted

def maybe_crash(progress: float) -> None:
    # Simulated glitch: die with high probability once past 60% of the file.
    # The caller invokes this AFTER the batch commit and BEFORE the
    # checkpoint save — the window that strands an unacknowledged batch.
    if progress > 0.6 and random.random() < 0.7:
        print(f"[CRASH] simulated network drop at progress={progress:.2f}",
              file=sys.stderr)
        sys.exit(2)

def main() -> None:
    start_offset = load_checkpoint()
    total_size = os.path.getsize(INPUT)
    inserted_total = seen_total = 0
    conn = psycopg2.connect("host=localhost dbname=razorpay user=etl")
    with conn.cursor() as cur:
        cur.execute(DDL)
    conn.commit()
    with open(INPUT, "rb") as f:
        f.seek(start_offset)
        buf = []
        while True:
            line = f.readline()
            if not line:
                break
            buf.append(json.loads(line))
            seen_total += 1
            if len(buf) >= 500:
                inserted_total += run_batch(conn, buf)
                if start_offset == 0:               # glitch only on a fresh run
                    maybe_crash(f.tell() / total_size)
                save_checkpoint(f.tell())           # AFTER the commit, never before
                buf = []
        if buf:
            inserted_total += run_batch(conn, buf)
            save_checkpoint(f.tell())
    conn.close()
    print(f"start_offset={start_offset} seen={seen_total} "
          f"inserted={inserted_total} duplicates={seen_total - inserted_total}")

if __name__ == "__main__":
    main()
A two-run sample, where the first run crashes and the second resumes:
$ python alo_pipeline.py
[CRASH] simulated network drop at progress=0.80
$ python alo_pipeline.py
start_offset=512000 seen=747 inserted=247 duplicates=500
$ psql -c "SELECT count(*) FROM payments_loaded"
 count
-------
  1247
$ wc -l payments.jsonl
1247 payments.jsonl
The first run committed two batches — 1,000 of the 1,247 input rows — and crashed in the window after the second batch's COMMIT but before its checkpoint save, so the checkpoint on disk still points at the end of the first batch. The second run resumed from that checkpoint and re-read 747 rows: the 500 committed-but-unacknowledged rows of the in-flight batch, plus the 247 rows the first run never reached. The consumer's ON CONFLICT DO NOTHING recognises the 500 redeliveries by event_id, inserts the 247 new rows, and the row count of payments_loaded ends up exactly equal to the row count of the input file. Zero loss, zero duplicates visible to downstream queries, despite a crash and a re-run.
Three lines deserve a careful walkthrough.
os.replace(tmp, CHECKPOINT). The checkpoint file is updated atomically. A naive with open(CHECKPOINT, "w") would truncate the existing file before writing the new contents, so a crash between truncate and write would leave the checkpoint empty — and the next run would restart from offset 0, replaying the entire input. Writing to a temp file first and then renaming is a POSIX-atomic operation: either the rename completes and the new content is visible, or it doesn't and the old content stays. The same pattern applies to every checkpoint store in the field, from Spark's _committed files to Kafka Connect's offset commits.
ON CONFLICT (event_id) DO NOTHING RETURNING event_id. The ON CONFLICT clause is the receive-side dedup. When the producer redelivers an event the consumer has already absorbed, the insert is silently dropped — no error, no transaction abort, no downstream wake-up. The RETURNING clause lets the consumer count how many rows actually made it in versus how many were duplicates, which is the most useful observability signal in any at-least-once pipeline.
Save the checkpoint after the commit, not before. The order matters. If the consumer saves the checkpoint first and then commits the batch, a crash between the two leaves the checkpoint claiming progress that wasn't made — the rows in that batch are lost. If the consumer commits first and then saves the checkpoint, a crash between the two re-delivers a batch already absorbed — the consumer's idempotency catches the duplicates. The receive side biases toward the safer error: redeliver rather than drop.
Why batch-boundary checkpointing is enough: the alternative is per-row checkpointing, which divides the throughput by the per-row commit cost. A 1.8 lakh-row settlement load batched at 500 rows commits 360 times instead of 1,80,000 times; the throughput goes from a few hundred rows per second to roughly 50,000 rows per second on the same hardware. The cost of batch-boundary checkpointing is that a crash inside a batch redelivers the in-flight batch on retry, which the receiver's idempotency absorbs. The trade-off is overwhelmingly favourable for any pipeline running more than a thousand rows per run.
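The commit-count arithmetic above is worth making explicit — a two-line sketch (the row and batch figures are the paragraph's own illustrations, not measurements):

```python
import math

def commits_needed(rows: int, batch: int) -> int:
    """Checkpoint/commit count for a run of `rows` at a given batch size."""
    return math.ceil(rows / batch)

assert commits_needed(180_000, 500) == 360       # batch-boundary checkpointing
assert commits_needed(180_000, 1) == 180_000     # per-row checkpointing
```

The redelivery cost on crash is bounded by one batch — at most 500 rows here — which the receive-side dedup absorbs for free.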
Why the dedup key is event_id and not the natural primary key of payments_loaded: the natural primary key would be (merchant_id, settlement_date) or similar, which describes the row's logical content. A duplicate event from the producer carries the same event_id but, after schema evolution, may carry a slightly different content hash. Keying dedup on event_id — a stable producer-assigned identifier — guarantees a redelivery is recognised as a duplicate even if the row's other columns drift between deliveries. This is one of the few cases where the producer's contract trumps the destination's natural key.
Where the bugs hide: timeouts, partial commits, and ack loss
Three specific failure modes inside a single stage account for almost every "the load looks wrong" ticket the on-call data engineer answers. Each one is a small variation on the at-least-once contract, and each one has a specific defence.
The first is the timeout that wasn't really a timeout. The consumer's database call takes longer than the producer's HTTP timeout — say the producer waits 30 seconds, the database commits at 31 seconds. The producer treats this as a failure and retries. The retry's commit succeeds again, and now the destination has the row twice — unless the dedup key catches it. The defence is twofold: make the dedup key stable across retries (the event_id discipline), and align timeouts so the consumer's commit budget is shorter than the producer's wait. If the consumer is allowed to take 60 seconds and the producer waits 30, every long commit becomes a retry and every retry becomes a duplicate-attempt. The Zerodha trade-tick ingest pipeline hit this exact pattern in 2024 — the database fix was a server-side statement_timeout = 25s setting, which converted slow queries into clean failures the producer's retry could absorb cleanly.
The second is the partial commit inside a multi-statement transaction. The consumer issues BEGIN; INSERT INTO staging ... ; INSERT INTO target ...; COMMIT; and the connection drops between the two inserts. The pooler kills the session, the engine rolls back the open transaction, and the destination is left with neither insert visible — which sounds clean, but the upstream's checkpoint may have already advanced if the consumer committed it before the database transaction. Same fix as the example earlier: commit the database transaction first, then save the upstream checkpoint. The two writes happen in the order that biases toward a redelivery rather than a loss.
The third is ack-loss after success. The destination committed the row. The destination tried to send the success ack to the producer. The ack was lost on the network. The producer treats the request as failed and retries. The destination's idempotency catches the duplicate. This is the textbook case the contract is designed for, and it is also the reason every well-designed at-least-once pipeline emits an inserted count separate from a seen count: the gap between them is the duplicate-redelivery rate, and tracking it over time gives you the operational health of the contract. A stable 0–1% gap is normal; a sudden spike to 30% is a sign of a broken upstream that is mis-acknowledging.
Why duplicate rate is the operational signal: the rate of duplicates absorbed by the consumer is a direct measurement of how often the producer is redelivering. Zero duplicates over weeks usually means the consumer's dedup is broken (and silent corruption is happening). Stable low duplicates mean the contract is working as designed. Sudden spikes mean an upstream is in a retry storm and the pipeline is masking it. Plot inserted / seen per stage on a dashboard and the operational health of the at-least-once contract becomes legible.
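The three operational states above can be sketched as a classifier over a window of recent runs. The thresholds are the paragraph's illustrations, not canonical values; a real monitor would plot the rate rather than bucket it:

```python
def dedup_health(window: list[tuple[int, int]]) -> str:
    """window: (seen, inserted) pairs for recent runs of one stage."""
    seen = sum(s for s, _ in window)
    inserted = sum(i for _, i in window)
    if seen == 0:
        return "idle"
    dup_rate = (seen - inserted) / seen
    if dup_rate == 0.0:
        # Fine for a single run; suspicious over a long window.
        return "zero duplicates: check that the dedup key is live"
    if dup_rate <= 0.01:
        return "healthy"
    if dup_rate >= 0.30:
        return "retry storm: upstream is mis-acknowledging"
    return "elevated"

assert dedup_health([(1000, 995), (1000, 998)]) == "healthy"       # ~0.35%
assert dedup_health([(1000, 650)]).startswith("retry storm")       # 35%
assert dedup_health([(1000, 1000)] * 50).startswith("zero duplicates")
```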
Stage-level versus pipeline-level idempotency
The stage in the example above is idempotent on its own. A real pipeline has many stages chained together — extract, stage, transform, dimension upsert, fact load, aggregate refresh, payout file generation — and each one has its own contract with its upstream and downstream. A pipeline is end-to-end idempotent only if every stage is idempotent and the orchestrator that retries them respects the contract.
The orchestrator's job is to retry the stage that failed, not the whole pipeline, and to not start downstream stages until the failed stage's retry has succeeded. This is what Build 4 calls a DAG executor: a directed graph of stages with dependencies and retry policies. For now, the relevant point is that retrying the right stage is a choice, and getting it wrong creates inconsistency. If the dimension-upsert stage crashes after partial commit and the orchestrator restarts the entire pipeline from extract, the extract may pick up new rows that the half-finished upsert was not designed to handle. Retrying only the failed stage — with the same input parameters as the original run — is the discipline that keeps the contract intact.
Two stage-level patterns matter for end-to-end correctness:
Run-id propagation. Every pipeline run gets a single immutable identifier (run_2026_04_25T02_47_11Z), and every stage's input and output is tagged with it. A retry of stage 3 carries the same run-id as the original; the row count of stage 3's output for that run-id is what the orchestrator checks before declaring the stage "done". Without a run-id, a retry produces output that mingles with the original output and the orchestrator cannot tell whether the stage finished cleanly.
Output overwrite versus append. Each stage decides whether its retry should overwrite the previous attempt's output (idempotent overwrite) or append to it (idempotent insert with dedup). A staging table that holds the day's input is overwrite — the retry truncates it first and reloads everything. A long-running fact table is append-with-dedup — the retry inserts only the rows that aren't already present, keyed by event-id. The choice is per-stage; mixing them up is a common production bug.
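Both retry shapes fit in a few lines — a sketch with sqlite3 standing in for the warehouse, INSERT OR IGNORE playing the role of Postgres's ON CONFLICT DO NOTHING, and invented table names. Running each stage three times must leave the same state as running it once:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (event_id TEXT, amount INTEGER)")
conn.execute("CREATE TABLE fact (event_id TEXT PRIMARY KEY, amount INTEGER)")

def load_staging(rows):
    """Idempotent overwrite: the retry truncates first, then reloads all."""
    with conn:
        conn.execute("DELETE FROM staging")
        conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)

def load_fact(rows):
    """Idempotent append: only rows not already present land (dedup by key)."""
    with conn:
        conn.executemany("INSERT OR IGNORE INTO fact VALUES (?, ?)", rows)

rows = [("e1", 100), ("e2", 200)]
for _ in range(3):            # "retry" each stage three times
    load_staging(rows)
    load_fact(rows)

# Same observable state as one clean run — the definition of idempotent.
assert conn.execute("SELECT count(*) FROM staging").fetchone()[0] == 2
assert conn.execute("SELECT count(*) FROM fact").fetchone()[0] == 2
```

Mixing the shapes up — appending into the staging table, or truncating the fact table — is the production bug the paragraph above warns against.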
Common confusions
- "At-least-once is what you fall back to when you can't get exactly-once." No, it's the foundation. Exactly-once is at-least-once delivery plus idempotent processing on the receive side. The "exactly-once" frameworks (Kafka transactions, Flink two-phase-commit, Beam's effectively-once) are sophisticated implementations of the consumer side, not different from at-least-once at the wire level. Build the at-least-once foundation and you can layer exactly-once on top later.
- "If my producer is at-least-once, my consumer can be naive." Wrong direction. The producer at-least-once is the cause; the consumer's idempotency is the effect that makes the contract work. A naive consumer behind an at-least-once producer is exactly the configuration that double-writes ledgers. The pair is what defines correctness — neither party alone is enough.
- "My producer commits its checkpoint before sending the batch." This is the at-most-once anti-pattern: a crash between checkpoint save and batch send loses the batch entirely, and the receiver never sees it. The consumer's idempotency cannot recover lost data — only the producer's commit-after-send-and-ack discipline can. Kafka consumer groups commit offsets after the messages are processed, not before; that is the at-least-once shape.
- "Idempotency means processing the row produces no side effect." It means processing the row N times produces the same observable state as processing it once. The first time produces the side effect; subsequent times do not. The destination ends up in the same state either way. A noop is one valid implementation of idempotency, but writes that are correctly deduplicated are also idempotent.
- "I can make a stage idempotent later." Adding idempotency to a stage with a year of data already loaded is the most expensive refactor in data engineering. It usually requires a backfill, sometimes a schema change, and always a careful migration that doesn't double-count the existing rows. Build it in from the first commit; pay the small cost upfront instead of the enormous cost in year two.
- "Retries should be unlimited." Bounded retries with backoff prevent a permanent downstream failure from amplifying into a thundering herd. A typical policy is 3 retries with exponential backoff (1m, 5m, 25m), then fail loudly and page on-call. Unlimited retries hide the failure long enough for it to become someone else's problem at 4 a.m.
Going deeper
Idempotency tokens at the API boundary
When the consumer side is an HTTP API rather than a SQL destination, the idempotency primitive is an idempotency key in the request header — a UUID the producer generates once per logical operation and reuses on every retry. Stripe, Razorpay, and Adyen all expose this; the API server stores recently-seen keys with their results, and a retry with a previously-seen key returns the cached result instead of re-executing the operation. This is the same pattern as INSERT ... ON CONFLICT DO NOTHING, lifted from the database to the API.
Razorpay's Payments API requires an idempotency key on every POST /payments/create call. A retry of the same key returns the original payment object; a different key creates a new payment. Without this, a network glitch during checkout that times out the response would cause the merchant's retry to charge the customer twice — and customer-double-charging tickets are the single most expensive class of payment-platform support call. The idempotency key is the small piece of header that pays for itself a thousand times over.
The implementation on the server side is exactly the receive-side dedup pattern: a Postgres table keyed on idempotency_key with a unique index, and an INSERT ... ON CONFLICT (idempotency_key) DO NOTHING RETURNING result that either returns the cached result or commits a new one inside the same transaction as the underlying operation. The cache is bounded by a TTL — typically 24 hours — after which the key may be reused for an unrelated operation.
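The server-side shape can be sketched with sqlite3 standing in for Postgres (the payment fields, key format, and TTL handling are invented for illustration, and the TTL is elided). A single-threaded select-then-insert is used for brevity; a concurrent-safe version leans on the unique index plus ON CONFLICT, as the paragraph describes. The essential property is that the key's row and the operation's effect commit in the same transaction:

```python
import json, sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idem_keys (key TEXT PRIMARY KEY, result TEXT)")
conn.execute("CREATE TABLE payments (payment_id TEXT PRIMARY KEY, amount INTEGER)")

def create_payment(idempotency_key: str, amount: int) -> dict:
    with conn:  # one transaction wraps both the gate and the effect
        row = conn.execute("SELECT result FROM idem_keys WHERE key = ?",
                           (idempotency_key,)).fetchone()
        if row:                                  # seen before: cached result
            return json.loads(row[0])
        payment = {"payment_id": f"pay_{idempotency_key[:8]}", "amount": amount}
        conn.execute("INSERT INTO payments VALUES (?, ?)",
                     (payment["payment_id"], amount))
        conn.execute("INSERT INTO idem_keys VALUES (?, ?)",
                     (idempotency_key, json.dumps(payment)))
        return payment

first = create_payment("a1b2c3d4-demo-key", 49900)
retry = create_payment("a1b2c3d4-demo-key", 49900)   # the network-glitch retry
assert retry == first                                # same payment object back
assert conn.execute("SELECT count(*) FROM payments").fetchone()[0] == 1
```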
When at-least-once becomes "approximately once" — the pragmatic compromise
A truly idempotent receive side is sometimes too expensive. A real-time analytics pipeline ingesting one million events per second cannot afford a SQL ON CONFLICT round trip per event. The pragmatic compromise that production systems run is probabilistic deduplication: a Bloom filter or a bounded LRU on the consumer remembers recently-seen event_id values with bounded memory, dropping duplicates that arrive within the window. Duplicates that arrive outside the window slip through, but in practice the producer's redelivery happens within seconds, so the window only needs to be tens of seconds wide.
The Dream11 game-events pipeline runs a 60-second LRU on each consumer task with a 1 million-entry capacity per partition. A redelivered event almost always arrives within seconds; the LRU catches it. An event redelivered after a 10-minute consumer outage may slip through, but the downstream aggregation is itself idempotent — it joins on event_id before counting — so the slip-through is absorbed at a later stage. This layered approach is what "approximately once" means in real systems: every layer applies the cheapest dedup it can, and the layers compose.
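A time-windowed, size-bounded LRU of the kind described above fits in a short class — a sketch with toy capacity and window (the text's production figures are 60 seconds and a million entries per partition):

```python
from collections import OrderedDict

class LruDedup:
    """Probabilistic dedup: remembers recently seen event_ids with bounded
    memory; duplicates outside the window or evicted by capacity slip
    through, to be absorbed by an idempotent later stage."""
    def __init__(self, capacity: int, window_s: float):
        self.capacity, self.window_s = capacity, window_s
        self.seen: "OrderedDict[str, float]" = OrderedDict()

    def is_duplicate(self, event_id: str, now: float) -> bool:
        ts = self.seen.get(event_id)
        if ts is not None and now - ts < self.window_s:
            self.seen.move_to_end(event_id)
            return True                        # redelivery inside the window
        self.seen[event_id] = now
        self.seen.move_to_end(event_id)
        if len(self.seen) > self.capacity:     # bounded memory: evict oldest
            self.seen.popitem(last=False)
        return False

d = LruDedup(capacity=3, window_s=60.0)
assert d.is_duplicate("e1", now=0.0) is False    # first delivery
assert d.is_duplicate("e1", now=5.0) is True     # redelivered within window
assert d.is_duplicate("e1", now=120.0) is False  # outside window: slips through
for i in range(3):
    d.is_duplicate(f"x{i}", now=130.0)           # three new ids evict e1
assert d.is_duplicate("e1", now=131.0) is False  # evicted: slips through again
```

The two assertion failures at the end are the "approximately" in approximately once — they are the slip-throughs a downstream idempotent stage absorbs.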
The poison-pill pattern: a row that fails forever
Every long-running at-least-once pipeline eventually meets the row that the consumer can never process — a malformed JSON document, a row whose foreign key was deleted upstream, an event whose schema disagrees with the destination's contract. The naive retry behaviour, "retry forever", turns this single row into a blocker for everything behind it: the consumer crashes on row N, restarts, crashes on row N again, never advances past it.
The defence is the dead-letter queue (DLQ): after K failed retries on a single row, route the row to a side-channel destination (a Kafka topic, an S3 bucket, a quarantined database table) and advance the main pipeline past it. The on-call engineer or a separate batch job triages the DLQ later, fixes the root cause, and replays the rows. The main pipeline keeps moving; the bad row gets a careful look. The Razorpay disputes-ingest pipeline runs with K = 5 and a 7-day DLQ retention; the DLQ catches roughly 40 rows a day out of 60 lakh, all of which are the result of a specific bug class on the source side that gets patched out within a release cycle.
The DLQ is a small piece of infrastructure with a large effect on the operational shape of the system. Without it, every poison pill is a 3-a.m. page; with it, every poison pill is a 9-a.m. ticket triaged during business hours.
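The routing loop itself is small — a sketch with an in-memory list standing in for the DLQ destination and a malformed JSON line as the poison pill (a deterministic failure burns through its retries immediately; the retry budget earns its keep on transient failures):

```python
import json

MAX_RETRIES = 5        # the K from the text
dlq: list[dict] = []   # stand-in for a Kafka topic / S3 bucket / quarantine table

def process(raw: str) -> dict:
    return json.loads(raw)            # the poison pill raises here every time

def consume(records: list[str]) -> list[dict]:
    out = []
    for raw in records:
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                out.append(process(raw))
                break
            except json.JSONDecodeError as exc:
                if attempt == MAX_RETRIES:   # give up: quarantine and advance
                    dlq.append({"raw": raw, "error": str(exc)})

    return out

good = consume(['{"event_id": "e1"}', '{not json', '{"event_id": "e2"}'])
assert [r["event_id"] for r in good] == ["e1", "e2"]   # pipeline kept moving
assert len(dlq) == 1 and dlq[0]["raw"] == "{not json"  # bad row quarantined
```

The key behaviour is in the last two asserts: e2 was processed even though the row before it can never succeed.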
Backfills are the ultimate test of the contract
A backfill — re-running the pipeline over a historical window because of a bug, schema change, or upstream correction — is the operation that exposes every failure of the at-least-once contract. The backfill processes the same input the original run processed, days or weeks later. If the destination is not idempotent, the backfill double-writes; if the producer is not deterministic, the backfill produces different rows than the original; if the orchestrator retries the wrong stages, the backfill diverges from the canonical state.
The discipline is to treat the backfill as the same operation as a retry, just over a wider window. Same run-id pattern (one run-id per backfill window), same stage idempotency, same checkpoint logic. A pipeline that can be backfilled cleanly is a pipeline that has internalised the at-least-once contract; a pipeline that cannot is one that will eventually need a Saturday afternoon of manual SQL surgery to fix.
The Flipkart catalogue team treats backfill-ability as a release-gate: every new pipeline must demonstrate a successful backfill of the past 30 days before it is allowed into production. The backfill is not the rare event; it is the test that the everyday contract holds.
Side effects that escape the contract: emails, files, and irreversible writes
The at-least-once contract is comfortable when the destination is a database — INSERT ... ON CONFLICT and MERGE make the consumer side cheap. The contract gets uncomfortable when the destination is an external side effect: an email sent to a customer, a payout file uploaded to NPCI, a webhook fired to a merchant's server. The receive-side dedup pattern still works, but the implementation moves: an idempotency table on the producer, written in the same transaction as the side effect, that records "this idempotency key has already triggered the side effect; do not trigger again on retry".
The pattern is the same shape as the API idempotency-key pattern from earlier in this section, but the table now lives on the producer side and gates a non-database action. PhonePe's payout-file generator uses exactly this — a Postgres table payout_file_runs (run_id, file_name, uploaded_at, npci_ack_id) with a unique index on run_id, written in the same transaction as the SFTP upload's "we sent the file" log. A retry of the same run_id finds the row, skips the upload, and returns the cached npci_ack_id. The cost of getting this wrong — uploading the same crore-rupee payout file twice — is a finance-team escalation that nobody wants. The contract holds because the gate is in the same transaction as the action it gates.
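The gate can be sketched end to end with sqlite3 standing in for Postgres; the table and column names follow the example above, but the upload function and ack format are stand-ins invented for illustration. The row recording the upload commits in the same transaction as the "we sent it" decision, so a retry of the same run_id finds the row and skips the side effect:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE payout_file_runs (
    run_id TEXT PRIMARY KEY, file_name TEXT, npci_ack_id TEXT)""")

uploads: list[str] = []            # counts "real" SFTP uploads in this sketch

def upload_to_npci(file_name: str) -> str:
    uploads.append(file_name)      # the irreversible side effect
    return f"ack_{len(uploads):04d}"

def generate_payout_file(run_id: str, file_name: str) -> str:
    with conn:  # gate and action decided inside one transaction
        row = conn.execute(
            "SELECT npci_ack_id FROM payout_file_runs WHERE run_id = ?",
            (run_id,)).fetchone()
        if row:
            return row[0]          # already uploaded: return the cached ack
        ack = upload_to_npci(file_name)
        conn.execute("INSERT INTO payout_file_runs VALUES (?, ?, ?)",
                     (run_id, file_name, ack))
        return ack

first = generate_payout_file("run_2026_04_25", "payout_20260425.txt")
retry = generate_payout_file("run_2026_04_25", "payout_20260425.txt")
assert first == retry
assert len(uploads) == 1           # the crore-rupee file went out exactly once
```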
Where this leads next
Chapter 11 picks up the pieces and asks: when retrying is not enough, when do you need a manual full backfill, and how do you run one without bringing the warehouse to its knees? The full-backfill primitive sits one layer above this chapter's at-least-once contract — it is the operation you reach for when the everyday retry path can't fix what's wrong.
- Hash-based deduplication on load — chapter 8, the append-only side of the consumer's idempotency story
- Upserts and the MERGE pattern — chapter 9, the current-state-table side
- What "idempotent" actually means for data (and why it's hard) — chapter 6, the conceptual frame
- Crashing mid-run: which state are you in? — chapter 4
- State files and checkpoints: the poor man's job queue — chapter 7
Build 7 returns to this contract from the message-log side: Kafka's consumer-offset commit semantics, the choice between auto.offset.reset = earliest and latest, the implications of enable.auto.commit = true. Build 9 returns yet again as exactly-once-semantics — the layer of two-phase commits and idempotent producers built on top of the at-least-once foundation this chapter establishes.
The contract here is the one the rest of the curriculum keeps stress-testing. Every pipeline you build from this point on either honours it or quietly betrays it; the difference shows up in the 3 a.m. ticket queue.
References
- Designing Data-Intensive Applications, Chapter 8 — The Trouble with Distributed Systems — Martin Kleppmann's treatment of partial failure, network unreliability, and the impossibility of perfect coordination, which is the theoretical underpinning of why at-least-once is unavoidable.
- Kafka documentation: message delivery semantics — the canonical reference for the three semantics (at-most-once, at-least-once, exactly-once) and the trade-offs each implies.
- Stripe: idempotency in the API — the production reference for how API-level idempotency keys are designed and enforced.
- Flink: end-to-end exactly-once with two-phase commit — the original blog post explaining how at-least-once delivery plus a transactional sink composes into end-to-end exactly-once.
- Razorpay engineering: building idempotent payment APIs — Razorpay's engineering posts on how the idempotency-key pattern is operationalised at UPI scale.
- Hash-based deduplication on load — the receive-side primitive this chapter's contract depends on.
- Pat Helland: Life beyond Distributed Transactions — the foundational paper that argues idempotent operations and entity identifiers are the path forward when transactions don't scale.
- The Tail at Scale (Dean & Barroso, CACM 2013) — why retries are not free, why bounded retries with backoff matter, and why the consumer side has to be cheap on the duplicate path.
A practical exercise: take the alo_pipeline.py example above and remove the os.replace(tmp, CHECKPOINT) atomic-rename. Replace it with a direct with open(CHECKPOINT, "w") as f: json.dump(...). Crash the pipeline by killing it with SIGKILL between the truncate and the write — easy to reproduce by adding a time.sleep(2) between the two. The next run loads an empty checkpoint, restarts from offset 0, and reprocesses the entire input. The destination's ON CONFLICT clause absorbs the duplicates, so the row count is still correct; but the work done — the CPU, the IOPS, the network bytes — is exactly twice what it should have been. The atomic rename is the difference between paying for the work once and paying for it twice on every crash. On a 1-crore-row daily pipeline, that is the difference between running on one node and running on two.