Wall: exactly-once is a lie everyone tells

A 2 a.m. page from a Razorpay payments pipeline once read: "customer charged twice for order_id=BLR-9847122. Stream processor logs say processed once. Database says inserted twice. Kafka says delivered twice. Sink says committed once. Who is lying?" Aditi, the on-call engineer, opened the Flink job's UI and saw the green EXACTLY_ONCE badge next to the checkpoint mode. The badge was telling the truth. The duplicate row was also telling the truth. Both can be true — and that contradiction is what every "exactly-once" claim in the streaming world rests on. This chapter is the wall between Build 8 (you now understand state, windows, watermarks, checkpoints) and Build 9 (you are about to learn how exactly-once is actually engineered). Before the engineering, you need to know what the words mean.

"Exactly-once" in streaming never means "the operator's side-effects fire exactly once on the world". It means "the observable end state of a closed system looks as if every record were processed once, even though many records were physically processed multiple times." The lie is subtle and the truth is mechanical: idempotent writes plus transactional commits plus replayable sources. Skip those three and the badge is decorative.

What people hear when you say "exactly-once"

Read a Kafka or Flink marketing page and the phrase "exactly-once processing" lands as: each event flows through the pipeline exactly one time, end to end, no matter what fails. That mental model is wrong in two specific ways, and the wrongness is the entire point of this chapter.

Wrong-way #1: physical execution is not what gets guaranteed. When a Flink task crashes mid-window-aggregate, the next attempt re-reads the same Kafka offsets and re-processes the same records. The aggregator's add(event) method physically runs twice on the same input. The framework does not — cannot — un-execute the first attempt. What it does is roll the state back so the second attempt produces the same final state as if only one execution had happened.

Wrong-way #2: the world outside the pipeline is not part of the guarantee unless you make it so. If your sink is INSERT INTO payments_ledger (order_id, paise) VALUES (...) without a uniqueness constraint, the same record reprocessed twice writes two rows. The streaming framework does not know whether your sink is idempotent, transactional, or neither. The "exactly-once" badge sits on the boundary of the framework, not on the boundary of the database.
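The double-row failure is easy to reproduce with a toy ledger. A minimal sketch using sqlite3; the table names and the uniqueness key are illustrative, not a real schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE naive_ledger (order_id TEXT, paise INTEGER)")
db.execute("CREATE TABLE safe_ledger  (order_id TEXT PRIMARY KEY, paise INTEGER)")

record = ("BLR-9847122", 49900)

# A crash-and-replay delivers the same record twice.
for _ in range(2):
    # Non-idempotent sink: every attempt appends a fresh row.
    db.execute("INSERT INTO naive_ledger VALUES (?, ?)", record)
    # Idempotent sink: the key makes the second attempt a no-op.
    db.execute("INSERT OR IGNORE INTO safe_ledger VALUES (?, ?)", record)

naive = db.execute("SELECT COUNT(*) FROM naive_ledger").fetchone()[0]
safe  = db.execute("SELECT COUNT(*) FROM safe_ledger").fetchone()[0]
print(naive, safe)  # 2 1
```

The only difference between the two tables is the PRIMARY KEY, which is exactly the uniqueness constraint the naive INSERT above is missing.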

So when a vendor ships an EXACTLY_ONCE mode, what they ship is a contract that says: if you also do these three things on the source side and these two things on the sink side, the observable end state is what you'd see from one logical execution. The "if" clause is the lie's whole hiding place.

Where the exactly-once boundary actually sits

[Figure: a streaming pipeline. A dashed "streaming framework boundary" encloses the replayable source with its offset store, the stateful checkpointed operator, and the sink staging area with its 2PC pre-commit; the "exactly-once" guarantee applies inside this dashed line — and nowhere else. Outside the boundary sit the upstream producer (labelled "your problem: don't double-emit") and the sink database or API (labelled "your problem: idempotent commit"). Both hops across the boundary are at-least-once delivery.]
The framework can guarantee exactly-once between a replayable source and a transactional sink. Outside that dashed boundary, you are responsible for the duplicates yourself. Most "exactly-once is broken" stories are stories of someone who put a non-transactional sink past the dashed line.

What "exactly-once" actually delivers — the precise statement

The honest version of the guarantee, spelled out, is this:

Given: (a) a source that lets the framework re-read any prefix of records by offset, (b) operator state checkpointed atomically with consumed offsets, (c) a sink that supports either idempotent writes keyed by (checkpoint_id, record_key) or a two-phase commit protocol coordinated by the framework — then: for every record that enters the source, the cumulative effect on the sink and on operator state is identical to a hypothetical execution where each record was processed exactly once, even though physical re-execution happened during failures.

Read that twice. The guarantee is on the cumulative effect, not on the count of executions. The operator's process(record) method might run 4 times for a single record across crash recoveries. The sink might receive 4 write attempts for the same record. The guarantee is that the observable end state — the row count in the database, the value of the windowed aggregate, the offset committed back to Kafka — looks like the record was processed once.

Why this distinction matters operationally: if your process(record) method has a side-effect that the framework can't see — calling an external HTTP API, sending an email, charging a credit card — that side-effect fires once per physical execution, not once per record. Exactly-once does not extend to side-effects beyond the framework's transactional boundary. Razorpay's payment-charge step is deliberately placed outside the streaming pipeline for exactly this reason.

Why duplicates are the default — a small simulation

To feel why the lie exists, build the simplest possible streaming system with naive at-least-once semantics and watch what happens when the operator crashes. The code below simulates a 5-record stream, an in-memory operator that maintains a running count, a sink that writes to a "database" (a list), and a forced crash mid-processing.

# at_least_once_demo.py — what happens without exactly-once machinery
import random, time

class NaiveStreamProcessor:
    """A bare at-least-once processor: source, operator, sink, no checkpoint."""

    def __init__(self):
        self.source_offset = 0           # where the source thinks we are
        self.operator_count = 0          # operator state (running count)
        self.committed_offset = 0        # last offset the source has committed
        self.sink_rows = []              # the "database"

    def process_record(self, offset, payload, crash_after=None):
        # Step 1: operator updates its in-memory state
        self.operator_count += 1
        # Step 2: operator emits a result downstream to the sink
        self.sink_rows.append(
            {"offset": offset, "payload": payload, "running_count": self.operator_count}
        )
        # Step 3: simulated crash BEFORE we tell the source we're done
        if crash_after == offset:
            raise RuntimeError(f"crashed after writing offset {offset} but before committing")
        # Step 4: source advances the committed offset
        self.committed_offset = offset + 1

stream = [(i, f"order-MUM-{1000+i}") for i in range(5)]
proc = NaiveStreamProcessor()

# First attempt — crashes after offset 2 was written but before commit
print("=== Attempt 1 (will crash) ===")
try:
    for off, payload in stream:
        proc.process_record(off, payload, crash_after=2)
except RuntimeError as e:
    print(f"  CAUGHT: {e}")

print(f"  committed_offset = {proc.committed_offset}")
print(f"  sink rows so far = {len(proc.sink_rows)}")

# Restart — source replays from committed_offset, operator state is still in memory
print("=== Attempt 2 (restart from committed offset) ===")
for off, payload in stream:
    if off < proc.committed_offset:
        continue                          # source skips already-committed offsets
    proc.process_record(off, payload, crash_after=None)

print(f"  committed_offset = {proc.committed_offset}")
print(f"  total sink rows  = {len(proc.sink_rows)}")
print(f"  final running_count in operator = {proc.operator_count}")
print(f"  rows by offset  = {[(r['offset'], r['running_count']) for r in proc.sink_rows]}")
Running both attempts prints:

=== Attempt 1 (will crash) ===
  CAUGHT: crashed after writing offset 2 but before committing
  committed_offset = 2
  sink rows so far = 3
=== Attempt 2 (restart from committed offset) ===
  committed_offset = 5
  total sink rows  = 6
  final running_count in operator = 6
  rows by offset  = [(0, 1), (1, 2), (2, 3), (2, 4), (3, 5), (4, 6)]

Three points worth dwelling on:

Offset 2 appears twice in the sink. Its write succeeded before the crash, but the offset commit never happened, so the replay wrote it again.

The operator's running count ends at 6, not 5. In this toy the in-memory state survived the restart and counted offset 2 twice; a real crash would have lost the state entirely, which is worse, not better.

The 6th row is the lie made visible. Every component behaved correctly and the end state is still wrong. Real production streaming jobs experience this every time a task manager dies mid-checkpoint — which, at scale, is several times a day per cluster.
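For contrast, here is the same toy with the two missing pieces bolted on: operator state snapshotted atomically with the offset, and a sink keyed by offset so a replayed write overwrites instead of appending. This is a sketch of the idea only, not Flink's actual checkpoint machinery; CheckpointedProcessor and its field names are invented for illustration:

```python
class CheckpointedProcessor:
    """Toy sketch: state checkpointed atomically with the offset, sink keyed
    by offset. Not a real framework's mechanism, just the shape of the fix."""

    def __init__(self):
        self.operator_count = 0
        self.committed_offset = 0
        self.sink_rows = {}            # keyed by offset -> replays overwrite
        self._checkpoint = (0, 0)      # (operator_count, offset): one atomic pair

    def process_record(self, offset, payload, crash_after=None):
        self.operator_count += 1
        # Idempotent sink write: same offset, same slot.
        self.sink_rows[offset] = {"payload": payload,
                                  "running_count": self.operator_count}
        if crash_after == offset:
            raise RuntimeError(f"crashed after writing offset {offset}")
        # State and offset snapshotted together, never separately.
        self._checkpoint = (self.operator_count, offset + 1)
        self.committed_offset = offset + 1

    def recover(self):
        # Roll BOTH back to the last checkpoint before replaying.
        self.operator_count, self.committed_offset = self._checkpoint

stream = [(i, f"order-MUM-{1000+i}") for i in range(5)]
proc = CheckpointedProcessor()

try:
    for off, payload in stream:
        proc.process_record(off, payload, crash_after=2)
except RuntimeError:
    proc.recover()

for off, payload in stream:
    if off < proc.committed_offset:
        continue
    proc.process_record(off, payload)

print(len(proc.sink_rows), proc.operator_count)  # 5 5
```

Same crash, same replay, but the recovery rolls the count back to the checkpoint and the keyed sink absorbs the duplicate write: five rows, a count of five.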

A second look: where in time the duplicate appears

The duplicate in the simulation isn't random. It happens at one specific moment in the failure timeline — between the sink write succeeding and the source offset being committed. Naming that moment makes the engineering fix obvious.

The vulnerability window in at-least-once delivery

[Figure: a timeline of the four steps for processing one record: read offset N from the source, update operator state (the operator's in-memory map), write to the sink (INSERT downstream), commit offset N+1 back to the source. A highlighted "vulnerability window" sits between the sink write and the offset commit: a crash inside it forces a replay that duplicates the sink write; a crash anywhere else produces no duplicate. EOS closes the window either by making "write sink" and "commit offset" atomic via 2PC, or by making the sink dedup the duplicate using a key that is deterministic across retries.]
Every at-least-once pipeline has a vulnerability window between sink-write and offset-commit. The window's width is small in human terms — milliseconds — but at 50k events/sec a 200ms outage leaves roughly 10,000 records in flight inside it. EOS (exactly-once semantics) shrinks the window to zero by making the two steps atomic.

The width of this window is also why "we just commit the offset before the sink write" doesn't fix it — that ordering produces the opposite failure (offset committed, sink write never happened, record silently dropped). At-most-once is what you get when you flip the ordering and hope. Neither ordering alone is correct; you need atomicity across both steps.
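The flipped ordering is worth simulating once to watch the drop happen. A minimal sketch (at_most_once_run is a hypothetical helper, not a real API):

```python
def at_most_once_run(stream, crash_at):
    """Commit the offset BEFORE the sink write: a crash now drops the record."""
    sink, committed = [], 0

    def attempt(start):
        nonlocal committed
        for off, payload in stream:
            if off < start:
                continue
            committed = off + 1            # commit first...
            if off == crash_at:
                raise RuntimeError("crash before the sink write")
            sink.append(off)               # ...sink write second

    try:
        attempt(0)
    except RuntimeError:
        attempt(committed)                 # restart: offset 2 is never retried

    return sink

result = at_most_once_run([(i, None) for i in range(5)], crash_at=2)
print(result)  # [0, 1, 3, 4] -- offset 2 silently dropped
```

Offset 2 vanishes: it was committed as done before the sink ever saw it.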

What the engineering actually does

Build 9 will go deep on the mechanism. For the wall, what matters is the shape of the fix:

Source side: replayable + offset-committed-with-state. The source must let the framework start reading from any prior offset. Kafka does this natively; HTTP polling APIs typically don't, which is why CDC pipelines from Postgres use pg_create_logical_replication_slot instead of LISTEN/NOTIFY.

Operator side: state checkpointed atomically with consumed offset. The Chandy-Lamport checkpoint algorithm (covered in /wiki/checkpointing-the-consistent-snapshot-algorithm) writes the operator's state and the source's committed offset to durable storage in a single atomic step. On crash recovery, the operator restores its state from the last successful checkpoint and the source resumes from the matching offset — so the operator never sees the same record twice from its own perspective.
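The "single atomic step" is the load-bearing phrase. One way to feel the requirement is the classic write-temp-then-rename trick, which makes a (state, offset) pair appear on disk all-or-nothing. This is only a sketch of the atomicity contract; real frameworks snapshot distributed state to object stores, not a local JSON file:

```python
import json, os, tempfile

def write_checkpoint(path, operator_state, source_offset):
    """Persist state and offset as ONE unit: write to a temp file, then
    rename. The checkpoint is either fully present or absent, never torn."""
    snapshot = {"state": operator_state, "offset": source_offset}
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump(snapshot, f)
    os.replace(tmp, path)                  # atomic rename on POSIX filesystems

def read_checkpoint(path):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"state": 0, "offset": 0}   # cold start: empty state, offset 0

ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
write_checkpoint(ckpt_path, operator_state=42, source_offset=7)
restored = read_checkpoint(ckpt_path)
print(restored)  # {'state': 42, 'offset': 7}
```

Because the rename is atomic, a crash mid-checkpoint leaves the previous snapshot intact; there is never a state-without-offset or offset-without-state file to recover from.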

Sink side: idempotent writes OR a two-phase commit handle. The sink is the part the framework doesn't own, so the framework needs the sink to play along. Two patterns dominate:

Idempotent writes: every write carries a key that is deterministic across retries, typically an UPSERT keyed on something like (checkpoint_id, record_key), so a replayed write overwrites or no-ops instead of appending.

Two-phase commit: the framework stages writes during the checkpoint interval, pre-commits them when the checkpoint completes its first phase, and publishes them only once the checkpoint is confirmed; on failure the staged writes are aborted and the replay re-stages them.

Without one of those two sink patterns, the dashed boundary in the diagram leaks — and the badge is decorative.
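The second pattern's shape, reduced to a toy. TwoPhaseCommitSink and its method names are illustrative; they loosely mirror the pre-commit/commit/abort lifecycle but are not any real framework's interface:

```python
class TwoPhaseCommitSink:
    """Toy 2PC sink: writes accumulate in a staging area per checkpoint and
    reach the durable store only after the checkpoint is confirmed."""

    def __init__(self):
        self.staging = {}                  # checkpoint_id -> pending writes
        self.durable = []                  # the "database"

    def write(self, checkpoint_id, record):
        self.staging.setdefault(checkpoint_id, []).append(record)

    def pre_commit(self, checkpoint_id):
        # Phase 1: the framework asks every participant to promise.
        return checkpoint_id in self.staging

    def commit(self, checkpoint_id):
        # Phase 2: checkpoint confirmed, publish atomically.
        self.durable.extend(self.staging.pop(checkpoint_id, []))

    def abort(self, checkpoint_id):
        # Checkpoint failed: drop the staged writes; the replay re-stages them.
        self.staging.pop(checkpoint_id, None)

sink = TwoPhaseCommitSink()
sink.write(7, {"order_id": "BLR-1", "paise": 100})
sink.abort(7)                              # crash before checkpoint 7 completed
sink.write(8, {"order_id": "BLR-1", "paise": 100})   # replayed write
assert sink.pre_commit(8)
sink.commit(8)
print(len(sink.durable))  # 1 -- the aborted attempt never reached the database
```

The staging area is why the diagram's boundary can include the sink's pre-commit: uncommitted writes are invisible to readers, so a replay after an abort produces no observable duplicate.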

The two operational consequences

Once you internalise that exactly-once is a closed-system guarantee, two operational truths fall out:

You cannot bolt exactly-once onto a non-replayable source or a non-idempotent sink. Engineering teams routinely try, and routinely write incident reports about "duplicate orders in the ledger". The fix is never a config flag — it's redesigning the source or the sink. A common Razorpay/Flipkart pattern: even when the streaming engine offers exactly-once, the team explicitly designs sink writes as idempotent UPSERTs keyed on a deterministic event-id, because they don't trust the boundary.

Side-effects must be lifted out of operators. A Flink operator that sends a Slack notification, charges a card, or makes a non-idempotent HTTP call will fire that effect multiple times across crash retries. The exactly-once contract has nothing to say about it. The standard Indian-fintech pattern: operators write a "command intent" row to a transactional sink (Kafka with EOS, or Postgres with a unique constraint), and a separate idempotent worker reads the intent and performs the real-world action with its own dedup key. Two stages, one boundary, no double-charges.
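The two-stage pattern can be sketched in a few lines; intent_log, operator_emit_intent, and worker_run are invented names, and a dict stands in for the transactional sink with its unique constraint:

```python
intent_log = {}     # stands in for a transactional sink with a unique constraint
performed = []      # real-world side-effects actually fired

def operator_emit_intent(event_id, action):
    # A replayed operator writes the same key; the "unique constraint"
    # (setdefault) turns the duplicate into a no-op.
    intent_log.setdefault(event_id, action)

def worker_run():
    done = set()
    for event_id, action in intent_log.items():
        if event_id in done:               # worker-side dedup key
            continue
        performed.append(action)           # the ONE real-world action
        done.add(event_id)

# A crash/replay makes the operator emit the same intent twice:
for _ in range(2):
    operator_emit_intent("evt-BLR-9847122", "settle order BLR-9847122")
worker_run()
print(len(performed))  # 1 -- two physical executions, one settlement
```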

Common confusions

"Exactly-once means each record is physically processed once." No. Re-execution is normal; the guarantee covers the observable end state, not the execution count.

"The EXACTLY_ONCE badge covers my database." No. The badge stops at the framework boundary; the sink must be idempotent or transactional to extend it.

"Side-effects inside operators are safe under EOS." No. An HTTP call, email, or card charge fires once per physical execution; lift it out into a dedup-aware worker.

"Exactly-once and effectively-once are different guarantees." They are the same guarantee; effectively-once is the more honest name for it.

Going deeper

The original Tyler Akidau definition

The clearest statement of the lie is in Tyler Akidau's 2015 Streaming 101 essay (later expanded into the Streaming Systems book): "Exactly-once delivery is not what most people think it is. It is effectively-once processing, where the externally visible side-effects of the system look as though every record were processed exactly once." Akidau introduced the term effectively-once precisely because exactly-once was already loaded with the wrong intuition. The streaming community largely chose to keep using exactly-once anyway because marketing won, but every serious paper since spends a paragraph clarifying. When you hear an engineer say effectively-once in conversation, they're being precise on purpose — they want you to know they know.

Why Kafka EOS uses transactions, not idempotent producers alone

Kafka shipped idempotent producers (KIP-98 part 1) before transactional producers (KIP-98 part 2). An idempotent producer assigns a sequence number per (producer_id, partition) so the broker drops duplicates on retry. That solves duplicates from producer-broker network retries. It does not solve "the producer crashed and another instance took over with a different producer_id" — the new producer's sequence numbers don't match, the broker can't dedup, and you get duplicates anyway. Transactions added a transactional_id that survives restarts: the new producer instance fences the old one (init_transactions aborts any in-flight transaction from a previous instance) and continues with the same transactional identity. That fencing is what makes EOS survive producer crashes, not just network retries.
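A toy model makes the fencing logic concrete. This simulates only the epoch bookkeeping, not Kafka's actual protocol; ToyBroker and its method names are invented for illustration:

```python
class ToyBroker:
    """Simulates only the epoch bookkeeping behind fencing, nothing else."""

    def __init__(self):
        self.epochs = {}                   # transactional_id -> current epoch
        self.log = []

    def init_transactions(self, txn_id):
        # Each new producer instance bumps the epoch and must present it.
        self.epochs[txn_id] = self.epochs.get(txn_id, 0) + 1
        return self.epochs[txn_id]

    def produce(self, txn_id, epoch, record):
        if epoch != self.epochs[txn_id]:
            raise RuntimeError("fenced: a newer producer instance took over")
        self.log.append(record)

broker = ToyBroker()
old_epoch = broker.init_transactions("payments-job")   # instance 1 starts
new_epoch = broker.init_transactions("payments-job")   # instance 2 after a crash
broker.produce("payments-job", new_epoch, "order-1")   # accepted
try:
    broker.produce("payments-job", old_epoch, "order-1")  # zombie instance 1
except RuntimeError as e:
    print(e)
print(len(broker.log))  # 1 -- the zombie's duplicate was fenced out
```

The transactional_id survives restarts while the epoch does not, which is exactly why the zombie's write can be recognised and rejected.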

The Razorpay double-charge incident pattern

A 2024 internal Razorpay write-up described an incident where a payment-status update Flink job had EOS enabled but the sink was a REST call to the payment gateway's status-callback endpoint. The endpoint had no idempotency key support; each retry of a checkpoint after a task crash sent a fresh callback, and the gateway interpreted multiple callbacks as multiple settlement instructions. The engineering fix was twofold: (1) add a deterministic idempotency-key: {checkpoint_id}-{order_id} header that the gateway started honouring; (2) move the callback out of the Flink operator into a separate dedup-aware worker that consumed from a Kafka topic written by the Flink job under EOS. The outcome: 0 duplicate settlements per quarter, down from 12–18 in the bad quarter. The architectural lesson — which mirrors §10 of [Streaming 101] — is that a side-effecting call belongs in a worker fed by an EOS-protected log, never inside the operator itself.

What "effectively-once" still doesn't fix

Even with EOS end-to-end, three things break the guarantee:

  1. Non-deterministic operators. If process(record) calls random.random() or reads time.time(), the second physical execution produces a different output than the first. The framework can't replay non-determinism. You either snapshot the random seed alongside state (Apache Beam's RestrictionTracker does this for source splits) or accept that downstream observers see different values across replays.
  2. External reads. An operator that calls requests.get("https://gst.gov.in/api/...") for each event sees fresh data on each retry; the rest of the pipeline computes a different result. The fix is to fold the external read into the source as a first-class CDC stream, so retries replay the same external value.
  3. Watermark side-channels. Two parallel jobs that both consume from the same source can diverge during recovery if their watermarks advance differently — a classic case is one job triggers a downstream alert at watermark 11:00:00 IST while the other doesn't because its watermark is still at 10:59:30 IST. Both jobs are individually EOS; their joint behaviour is not.
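Point 1 is easy to demonstrate: snapshot the RNG state alongside operator state, and a replay becomes deterministic. A minimal sketch using Python's random module:

```python
import random

def process(rng):
    return rng.random()        # stand-in for non-deterministic operator logic

rng = random.Random(1234)
snapshot = rng.getstate()      # checkpointed ALONGSIDE the operator state
first = process(rng)

rng.setstate(snapshot)         # crash recovery restores the RNG too
replay = process(rng)
print(first == replay)  # True -- the replay now reproduces the first attempt
```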

Where this leads next

The next chapter — /wiki/what-exactly-once-actually-means-its-delivery-not-processing — replaces the lie with the precise definition: delivery vs processing semantics, and why the streaming community uses both phrases interchangeably and confusingly.

After that, the rest of Build 9 builds the mechanism piece by piece: replayable sources, checkpoints taken atomically with offsets, and transactional or idempotent sinks.

The mental model: exactly-once is a coordinated story between a replayable source, a checkpointed operator, and a transactional or idempotent sink. Take any one of those pieces away and the badge is theatre. The rest of Build 9 is the script for that coordinated story.

References