Wall: exactly-once is a lie everyone tells
A 2 a.m. page from a Razorpay payments pipeline once read: "customer charged twice for order_id=BLR-9847122. Stream processor logs say processed once. Database says inserted twice. Kafka says delivered twice. Sink says committed once. Who is lying?" Aditi, the on-call engineer, opened the Flink job's UI and saw the green EXACTLY_ONCE badge next to the checkpoint mode. The badge was telling the truth. The duplicate row was also telling the truth. Both can be true — and that contradiction is what every "exactly-once" claim in the streaming world rests on. This chapter is the wall between Build 8 (you now understand state, windows, watermarks, checkpoints) and Build 9 (you are about to learn how exactly-once is actually engineered). Before the engineering, you need to know what the words mean.
"Exactly-once" in streaming never means "the operator's side-effects fire exactly once on the world". It means "the observable end state of a closed system looks as if every record were processed once, even though many records were physically processed multiple times." The lie is subtle and the truth is mechanical: idempotent writes plus transactional commits plus replayable sources. Skip those three and the badge is decorative.
What people hear when you say "exactly-once"
Read a Kafka or Flink marketing page and the phrase "exactly-once processing" lands as: each event flows through the pipeline exactly one time, end to end, no matter what fails. That mental model is wrong in two specific ways, and the wrongness is the entire point of this chapter.
Wrong-way #1: physical execution is not what gets guaranteed. When a Flink task crashes mid-window-aggregate, the next attempt re-reads the same Kafka offsets and re-processes the same records. The aggregator's add(event) method physically runs twice on the same input. The framework does not — cannot — un-execute the first attempt. What it does is roll the state back so the second attempt produces the same final state as if only one execution had happened.
Wrong-way #2: the world outside the pipeline is not part of the guarantee unless you make it so. If your sink is INSERT INTO payments_ledger (order_id, paise) VALUES (...) without a uniqueness constraint, the same record reprocessed twice writes two rows. The streaming framework does not know whether your sink is idempotent, transactional, or neither. The "exactly-once" badge sits on the boundary of the framework, not on the boundary of the database.
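The sink-side difference is easy to make concrete. Below is a minimal sketch using an in-memory SQLite database (table and column names are illustrative, not a real payments schema): the same replayed record lands twice in a table without a uniqueness constraint, and once in a table whose primary key makes the write idempotent.

```python
# Sketch: the same replayed write against a constraint-free table vs. a keyed one.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE naive_ledger (order_id TEXT, paise INTEGER)")
db.execute("CREATE TABLE safe_ledger (order_id TEXT PRIMARY KEY, paise INTEGER)")

record = ("BLR-9847122", 49900)

# The framework replays the record after a crash: two physical write attempts.
for _ in range(2):
    db.execute("INSERT INTO naive_ledger VALUES (?, ?)", record)           # duplicates
    db.execute("INSERT OR IGNORE INTO safe_ledger VALUES (?, ?)", record)  # dedups

print(db.execute("SELECT COUNT(*) FROM naive_ledger").fetchone()[0])  # 2
print(db.execute("SELECT COUNT(*) FROM safe_ledger").fetchone()[0])   # 1
```

The framework's badge is identical in both cases; only the table definition decides whether the replay is visible.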
So when a vendor ships an EXACTLY_ONCE mode, what they ship is a contract that says: if you also do these three things on the source side and these two things on the sink side, the observable end state is what you'd see from one logical execution. The "if" clause is the lie's whole hiding place.
What "exactly-once" actually delivers — the precise statement
The honest version of the guarantee, spelled out, is this:
Given: (a) a source that lets the framework re-read any prefix of records by offset, (b) operator state checkpointed atomically with consumed offsets, (c) a sink that supports either idempotent writes keyed by (checkpoint_id, record_key) or a two-phase commit protocol coordinated by the framework — then: for every record that enters the source, the cumulative effect on the sink and on operator state is identical to a hypothetical execution where each record was processed exactly once, even though physical re-execution happened during failures.
Read that twice. The guarantee is on the cumulative effect, not on the count of executions. The operator's process(record) method might run 4 times for a single record across crash recoveries. The sink might receive 4 write attempts for the same record. The guarantee is that the observable end state — the row count in the database, the value of the windowed aggregate, the offset committed back to Kafka — looks like the record was processed once.
Why this distinction matters operationally: if your process(record) method has a side-effect that the framework can't see — calling an external HTTP API, sending an email, charging a credit card — that side-effect fires once per physical execution, not once per record. Exactly-once does not extend to side-effects beyond the framework's transactional boundary. Razorpay's payment-charge step is deliberately placed outside the streaming pipeline for exactly this reason.
Why duplicates are the default — a small simulation
To feel why the lie exists, build the simplest possible streaming system with naive at-least-once semantics and watch what happens when the operator crashes. The code below simulates a 5-record stream, an in-memory operator that maintains a running count, a sink that writes to a "database" (a list), and a forced crash mid-processing.
```python
# at_least_once_demo.py — what happens without exactly-once machinery

class NaiveStreamProcessor:
    """A bare at-least-once processor: source, operator, sink, no checkpoint."""

    def __init__(self):
        self.source_offset = 0     # where the source thinks we are
        self.operator_count = 0    # operator state (running count)
        self.committed_offset = 0  # last offset the source has committed
        self.sink_rows = []        # the "database"

    def process_record(self, offset, payload, crash_after=None):
        # Step 1: operator updates its in-memory state
        self.operator_count += 1
        # Step 2: operator emits a result downstream to the sink
        self.sink_rows.append(
            {"offset": offset, "payload": payload, "running_count": self.operator_count}
        )
        # Step 3: simulated crash BEFORE we tell the source we're done
        if crash_after == offset:
            raise RuntimeError(f"crashed after writing offset {offset} but before committing")
        # Step 4: source advances the committed offset
        self.committed_offset = offset + 1

stream = [(i, f"order-MUM-{1000+i}") for i in range(5)]
proc = NaiveStreamProcessor()

# First attempt — crashes after offset 2 was written but before commit
print("=== Attempt 1 (will crash) ===")
try:
    for off, payload in stream:
        proc.process_record(off, payload, crash_after=2)
except RuntimeError as e:
    print(f"  CAUGHT: {e}")
print(f"  committed_offset = {proc.committed_offset}")
print(f"  sink rows so far = {len(proc.sink_rows)}")

# Restart — source replays from committed_offset, operator state is still in memory
print("=== Attempt 2 (restart from committed offset) ===")
for off, payload in stream:
    if off < proc.committed_offset:
        continue  # source skips already-committed offsets
    proc.process_record(off, payload, crash_after=None)
print(f"  committed_offset = {proc.committed_offset}")
print(f"  total sink rows = {len(proc.sink_rows)}")
print(f"  final running_count in operator = {proc.operator_count}")
print(f"  rows by offset = {[(r['offset'], r['running_count']) for r in proc.sink_rows]}")
```
```
=== Attempt 1 (will crash) ===
  CAUGHT: crashed after writing offset 2 but before committing
  committed_offset = 2
  sink rows so far = 3
=== Attempt 2 (restart from committed offset) ===
  committed_offset = 5
  total sink rows = 6
  final running_count in operator = 6
  rows by offset = [(0, 1), (1, 2), (2, 3), (2, 4), (3, 5), (4, 6)]
```
The points worth dwelling on:
- `committed_offset = 2` after the crash, even though the sink has 3 rows. The source has no way to know that offset 2 was successfully written downstream, because the commit step never ran. On restart, the source replays starting at offset 2 — and the sink gets a duplicate row for offset 2.
- `running_count` is wrong twice over. Offset 2 appears with `running_count=3` (first attempt) and `running_count=4` (replay) — neither is what a single-execution semantics would produce, because the operator state was not rolled back to its pre-offset-2 value when the crash happened. Why this is the heart of the problem: at-least-once delivery without snapshotted state means operator state diverges further from the "as-if-once" state on every retry. The aggregate is now over-counted by exactly one event.
- The sink has 6 rows for 5 records. Offset 2 is duplicated. Without exactly-once machinery, this is the default outcome of every crash that happens after the sink write but before the source commit. Why this gap is unavoidable in a naive design: there is no atomic primitive across "operator state + sink write + source commit" in a system without 2PC. The framework can choose which non-atomic ordering to use, but every ordering leaks.
The 6th row is the lie made visible. Real production streaming jobs experience this every time a task manager dies mid-checkpoint — which, at scale, is several times a day per cluster.
A second look: where in time the duplicate appears
The duplicate in the simulation isn't random. It happens at one specific moment in the failure timeline — between the sink write succeeding and the source offset being committed. Naming that moment makes the engineering fix obvious.
The width of this window is also why "we just commit the offset before the sink write" doesn't fix it — that ordering produces the opposite failure (offset committed, sink write never happened, record silently dropped). At-most-once is what you get when you flip the ordering and hope. Neither ordering alone is correct; you need atomicity across both steps.
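The flipped failure can be shown with a hypothetical commit-first variant of the simulation above: same 5-record stream, same crash at offset 2, but the offset is committed before the sink write.

```python
# Sketch of the flipped ordering: commit the offset FIRST, then write to the sink.
# A crash between the two steps silently drops the record — at-most-once.
class CommitFirstProcessor:
    def __init__(self):
        self.committed_offset = 0
        self.sink_rows = []

    def process_record(self, offset, payload, crash_after=None):
        self.committed_offset = offset + 1  # Step 1: commit BEFORE the sink write
        if crash_after == offset:
            raise RuntimeError(f"crashed after committing offset {offset}")
        self.sink_rows.append((offset, payload))  # Step 2: sink write (never runs)

stream = [(i, f"order-MUM-{1000+i}") for i in range(5)]
proc = CommitFirstProcessor()

try:
    for off, payload in stream:
        proc.process_record(off, payload, crash_after=2)
except RuntimeError:
    pass

for off, payload in stream:
    if off < proc.committed_offset:
        continue  # replay skips offset 2 — committed, but never written
    proc.process_record(off, payload)

print(len(proc.sink_rows))             # 4 rows for 5 records
print([o for o, _ in proc.sink_rows])  # [0, 1, 3, 4] — offset 2 is gone
```

Same crash, opposite symptom: the naive ordering over-counts, the flipped one under-counts.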
What the engineering actually does
Build 9 will go deep on the mechanism. For the wall, what matters is the shape of the fix:
Source side: replayable + offset-committed-with-state. The source must let the framework start reading from any prior offset. Kafka does this natively; HTTP polling APIs typically don't, which is why CDC pipelines from Postgres use pg_create_logical_replication_slot instead of LISTEN/NOTIFY.
Operator side: state checkpointed atomically with consumed offset. The Chandy-Lamport checkpoint algorithm (covered in /wiki/checkpointing-the-consistent-snapshot-algorithm) writes the operator's state and the source's committed offset to durable storage in a single atomic step. On crash recovery, the operator restores its state from the last successful checkpoint and the source resumes from the matching offset — so the operator never sees the same record twice from its own perspective.
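A toy sketch of that snapshot-and-restore idea, in the style of the earlier simulation. One deliberate simplification to flag: this sketch rolls the in-memory "sink" back along with operator state and offset, which a real external sink cannot do; closing that gap is exactly the sink-side problem.

```python
# Sketch: snapshot (operator state, source offset) as one atomic unit,
# and roll BOTH back on crash recovery.
import copy

class CheckpointedProcessor:
    def __init__(self):
        self.offset = 0
        self.count = 0
        self.sink_rows = []
        self.checkpoint = {"offset": 0, "count": 0, "sink_rows": []}

    def snapshot(self):
        # One atomic step: state and consumed offset saved together.
        self.checkpoint = {"offset": self.offset, "count": self.count,
                           "sink_rows": copy.deepcopy(self.sink_rows)}

    def restore(self):
        self.offset = self.checkpoint["offset"]
        self.count = self.checkpoint["count"]
        self.sink_rows = copy.deepcopy(self.checkpoint["sink_rows"])

    def run(self, stream, crash_at=None):
        for off, payload in stream:
            if off < self.offset:
                continue  # already covered by the last checkpoint
            self.count += 1
            self.sink_rows.append((off, self.count))
            if off == crash_at:
                raise RuntimeError("crash after sink write, before checkpoint")
            self.offset = off + 1
            self.snapshot()  # per-record here; real systems checkpoint in batches

stream = [(i, f"order-MUM-{1000+i}") for i in range(5)]
proc = CheckpointedProcessor()
try:
    proc.run(stream, crash_at=2)
except RuntimeError:
    proc.restore()  # roll state AND offset back together
proc.run(stream)

print(proc.count)                      # 5, not 6
print([r[0] for r in proc.sink_rows])  # [0, 1, 2, 3, 4] — no duplicate offset 2
```

Because state and offset move back in lockstep, the replayed offset 2 produces the same `running_count` it would have produced in a single execution.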
Sink side: idempotent writes OR a two-phase commit handle. The sink is the part the framework doesn't own, so the framework needs the sink to play along. Two patterns dominate:
- Idempotent writes keyed by `(checkpoint_id, sink_partition_key)`. A duplicate write is detected and dropped at the sink. Works for relational databases with unique constraints, key-value stores with conditional puts (DynamoDB `PutItem` with `attribute_not_exists`, S3 with `If-None-Match`).
- Transactional / two-phase commit sinks. The framework calls `pre_commit(checkpoint_id, batch)` before checkpointing, then `commit(checkpoint_id)` after the checkpoint succeeds, then `abort(checkpoint_id)` if the checkpoint fails. The sink must hold the pre-committed batch in a not-yet-visible state between those two calls. Kafka's transactional producer (`init_transactions` + `commit_transaction`) is the canonical example.
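The second pattern's contract fits in a few lines. The sketch below is an illustration of the lifecycle, not Flink's actual `TwoPhaseCommitSinkFunction` API: pre-committed batches stay invisible until `commit`, and an `abort` makes them vanish.

```python
# Sketch of the transactional-sink contract: pre_commit stages a batch
# invisibly, commit publishes it, abort discards it.
class TransactionalSink:
    def __init__(self):
        self.visible = []  # what a reader of the sink can see
        self.pending = {}  # pre-committed batches, keyed by checkpoint_id

    def pre_commit(self, checkpoint_id, batch):
        self.pending[checkpoint_id] = list(batch)  # staged, not yet visible

    def commit(self, checkpoint_id):
        self.visible.extend(self.pending.pop(checkpoint_id))

    def abort(self, checkpoint_id):
        self.pending.pop(checkpoint_id, None)  # staged batch vanishes

sink = TransactionalSink()

# Checkpoint 7: pre-commit succeeds, checkpoint succeeds -> commit.
sink.pre_commit(7, ["order-MUM-1000", "order-MUM-1001"])
sink.commit(7)

# Checkpoint 8: pre-commit succeeds, checkpoint FAILS -> abort. The replayed
# records will be pre-committed again under the retried checkpoint's id.
sink.pre_commit(8, ["order-MUM-1002"])
sink.abort(8)

print(sink.visible)  # ['order-MUM-1000', 'order-MUM-1001'] — aborted batch never seen
```

The crucial property is that a reader of `visible` never observes a batch whose checkpoint later failed, so replays cannot surface twice.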
Without one of those two sink patterns, the framework boundary leaks — and the badge is decorative.
The two operational consequences
Once you internalise that exactly-once is a closed-system guarantee, two operational truths fall out:
You cannot bolt exactly-once onto a non-replayable source or a non-idempotent sink. Engineering teams routinely try, and routinely write incident reports about "duplicate orders in the ledger". The fix is never a config flag — it's redesigning the source or the sink. A common Razorpay/Flipkart pattern: even when the streaming engine offers exactly-once, the team explicitly designs sink writes as idempotent UPSERTs keyed on a deterministic event-id, because they don't trust the boundary.
Side-effects must be lifted out of operators. A Flink operator that sends a Slack notification, charges a card, or makes a non-idempotent HTTP call will fire that effect multiple times across crash retries. The exactly-once contract has nothing to say about it. The standard Indian-fintech pattern: operators write a "command intent" row to a transactional sink (Kafka with EOS, or Postgres with a unique constraint), and a separate idempotent worker reads the intent and performs the real-world action with its own dedup key. Two stages, one boundary, no double-charges.
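The worker half of that two-stage pattern reduces to a dedup check. In this sketch the dedup store is an in-memory set and the "real-world action" is appending to a list; in production both would be durable, and the dedup key would come from the intent row written under EOS.

```python
# Sketch of the dedup-aware worker: at most one side-effect per intent key,
# no matter how many times the intent is delivered or replayed.
performed = set()  # the worker's dedup store (a keyed DB table in production)
actions = []       # stand-in for the real-world side-effect (charge, Slack, email)

def worker_handle(intent):
    key = intent["event_id"]  # deterministic dedup key from the intent row
    if key in performed:
        return                # replayed intent: side-effect already fired
    performed.add(key)
    actions.append(f"charge {intent['paise']} paise for {key}")

intent = {"event_id": "BLR-9847122", "paise": 49900}
worker_handle(intent)  # first delivery
worker_handle(intent)  # crash-replay delivery of the same intent
print(len(actions))    # 1 — one charge despite two deliveries
```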
Common confusions
- "Exactly-once means each record is processed exactly once." No — each record may be physically processed many times during retries. The guarantee is that the cumulative observable effect equals one processing.
- "Kafka exactly-once and Flink exactly-once are the same thing." Kafka EOS (KIP-98) covers producer→broker→consumer with transactional offsets. Flink EOS extends that into operator state via Chandy-Lamport plus 2PC sinks. They compose, but they're solving different sub-problems and you can have one without the other.
- "If my sink is a database with primary keys, exactly-once is automatic." Only if every write uses
INSERT ... ON CONFLICT DO NOTHING(or equivalent UPSERT) and the conflict key is deterministic across retries. A naiveINSERTwill still duplicate on retry. - "Exactly-once is too expensive for production." The cost is real (5–15% throughput hit, longer checkpoints) but for payment ledgers, GST filings, fraud counters, and inventory updates, "expensive" is cheaper than the alternative. Razorpay's reconciliation pipeline runs EOS; the analytics pipeline runs at-least-once with idempotent dbt models. Different SLAs, different choices.
- "At-least-once + idempotent sink = exactly-once." Almost. You also need consumed offsets committed atomically with the sink write — otherwise the source can replay records the sink has already processed, the sink's idempotency dedups them, and the count is right but the latency and throughput take repeated hits during recovery. EOS makes that recovery cheap by syncing offsets with state.
- "The badge in my Flink UI saying `EXACTLY_ONCE` proves the pipeline is correct." It proves the framework is in EOS mode. Whether the sink honours the contract is on you. A green badge with a non-transactional JDBC sink is a polite way of saying "good luck".
Going deeper
The original Tyler Akidau definition
The clearest statement of the lie is in Tyler Akidau's 2015 Streaming 101 essay (later expanded into the Streaming Systems book): "Exactly-once delivery is not what most people think it is. It is effectively-once processing, where the externally visible side-effects of the system look as though every record were processed exactly once." Akidau introduced the term effectively-once precisely because exactly-once was already loaded with the wrong intuition. The streaming community largely chose to keep using exactly-once anyway because marketing won, but every serious paper since spends a paragraph clarifying. When you hear an engineer say effectively-once in conversation, they're being precise on purpose — they want you to know they know.
Why Kafka EOS uses transactions, not idempotent producers alone
Kafka shipped idempotent producers (KIP-98 part 1) before transactional producers (KIP-98 part 2). An idempotent producer assigns a sequence number per (producer_id, partition) so the broker drops duplicates on retry. That solves duplicates from producer-broker network retries. It does not solve "the producer crashed and another instance took over with a different producer_id" — the new producer's sequence numbers don't match, the broker can't dedup, and you get duplicates anyway. Transactions added a transactional_id that survives restarts: the new producer instance fences the old one (init_transactions aborts any in-flight transaction from a previous instance) and continues with the same transactional identity. That fencing is what makes EOS survive producer crashes, not just network retries.
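A toy broker model shows why the first half of KIP-98 isn't enough. The producer ids, sequence numbers, and dedup rule below are simplified illustrations of the real protocol, not Kafka's wire format.

```python
# Sketch: a broker that dedups on (producer_id, sequence_number).
# Network retries from the SAME producer instance are dropped; a replacement
# producer gets a fresh producer_id, so its resend looks like a new record.
class Broker:
    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest sequence number seen

    def append(self, producer_id, seq, record):
        if self.last_seq.get(producer_id, -1) >= seq:
            return  # duplicate from a network retry: dropped
        self.last_seq[producer_id] = seq
        self.log.append(record)

broker = Broker()

# Producer instance 1 sends seq 0; its network retry of seq 0 is deduped.
broker.append(producer_id=101, seq=0, record="order-BLR-9847122")
broker.append(producer_id=101, seq=0, record="order-BLR-9847122")

# Instance 1 crashes; instance 2 takes over with a NEW producer_id and
# resends the in-doubt record from seq 0 — the broker cannot dedup it.
broker.append(producer_id=102, seq=0, record="order-BLR-9847122")

print(len(broker.log))  # 2 — duplicate despite the idempotent producer
```

The `transactional_id` fixes this by giving the replacement instance the same logical identity, plus the fencing that stops the old instance from writing after the takeover.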
The Razorpay double-charge incident pattern
A 2024 internal Razorpay write-up described an incident where a payment-status update Flink job had EOS enabled but the sink was a REST call to the payment gateway's status-callback endpoint. The endpoint had no idempotency key support; each retry of a checkpoint after a task crash sent a fresh callback, and the gateway interpreted multiple callbacks as multiple settlement instructions. The engineering fix was twofold: (1) add a deterministic idempotency-key: {checkpoint_id}-{order_id} header that the gateway started honouring; (2) move the callback out of the Flink operator into a separate dedup-aware worker that consumed from a Kafka topic written by the Flink job under EOS. The outcome: 0 duplicate settlements per quarter, down from 12–18 in the bad quarter. The architectural lesson — which mirrors §10 of Streaming 101 — is that a side-effecting call belongs in a worker fed by an EOS-protected log, never inside the operator itself.
What "effectively-once" still doesn't fix
Even with EOS end-to-end, three things break the guarantee:
- Non-deterministic operators. If `process(record)` calls `random.random()` or reads `time.time()`, the second physical execution produces a different output than the first. The framework can't replay non-determinism. You either snapshot the random seed alongside state (Apache Beam's `RestrictionTracker` does this for source splits) or accept that downstream observers see different values across replays.
- External reads. An operator that calls `requests.get("https://gst.gov.in/api/...")` for each event sees fresh data on each retry; the rest of the pipeline computes a different result. The fix is to fold the external read into the source as a first-class CDC stream, so retries replay the same external value.
- Watermark side-channels. Two parallel jobs that both consume from the same source can diverge during recovery if their watermarks advance differently — a classic case is one job triggers a downstream alert at watermark 11:00:00 IST while the other doesn't because its watermark is still at 10:59:30 IST. Both jobs are individually EOS; their joint behaviour is not.
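For the first failure mode, the seed-snapshotting fix is small enough to sketch: checkpoint the RNG state next to operator state, and restore both on replay, so a crash-recovered execution re-draws the same "random" values.

```python
# Sketch: checkpoint the RNG state alongside operator state so that a
# crash-replay reproduces the same random draws as the first attempt.
import random

rng = random.Random(42)
checkpoint = {"count": 0, "rng_state": rng.getstate()}  # state + RNG, together

# Attempt 1: three draws, then (imagine) a crash before the next checkpoint.
rng.setstate(checkpoint["rng_state"])
first_run = [rng.random() for _ in range(3)]

# Recovery: restore the RNG state from the checkpoint, replay the same records.
rng.setstate(checkpoint["rng_state"])
replay = [rng.random() for _ in range(3)]

print(first_run == replay)  # True — the replay sees identical "random" values
```

Without the `rng_state` entry in the checkpoint, the replay would draw fresh values and the downstream result would differ across attempts.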
Where this leads next
The next chapter — /wiki/what-exactly-once-actually-means-its-delivery-not-processing — replaces the lie with the precise definition: delivery vs processing semantics, and why the streaming community uses both phrases interchangeably and confusingly.
After that, the mechanism gets built piece by piece:
- /wiki/idempotent-producers-and-sequence-numbers — Kafka's first attempt: dedup at the broker by sequence number.
- /wiki/transactional-writes-2pc-wearing-a-hat — the second leg of the contract: the sink-side transaction.
- /wiki/end-to-end-exactly-once-source-operator-sink — the three pieces composed into a working pipeline.
The mental model: exactly-once is a coordinated story between a replayable source, a checkpointed operator, and a transactional or idempotent sink. Take any one of those pieces away and the badge is theatre. The rest of Build 9 is the script for that coordinated story.
References
- Streaming 101 — Tyler Akidau (O'Reilly, 2015) — the essay that introduced effectively-once as the precise term.
- Streaming Systems — Akidau, Chernyak, Lax (O'Reilly, 2018) — chapter 5 covers the boundary between framework guarantee and external side-effect.
- KIP-98: Exactly Once Delivery and Transactional Messaging — the original Kafka design doc, including the producer fencing and transactional-id rationale.
- An Overview of End-to-End Exactly-Once Processing in Apache Flink — the canonical Flink + Kafka EOS walk-through, with diagrams that match the dashed-boundary mental model.
- Apache Spark — Structured Streaming output modes and idempotent sinks — Spark's variation on the same problem; uses idempotent sinks more than 2PC.
- Confluent — Exactly-once semantics are possible: here's how Kafka does it — vendor post, useful for the producer-fencing diagrams.
- Apache Beam — exactly-once and side effects — explicit on which guarantees Beam offers and which it deliberately does not.
- /wiki/checkpointing-the-consistent-snapshot-algorithm — the previous chapter; the algorithm that makes EOS possible on the operator side.