What exactly-once actually means (it's delivery, not processing)
A Zerodha trade-tick pipeline runs on Flink in EXACTLY_ONCE mode and produces 1,84,72,310 ticks one Tuesday. The downstream Pinot table shows 1,84,72,317 rows for the same window. Seven extra rows, one green badge, one engineer with a query open at 11:42pm trying to figure out which of the three "exactly-once" guarantees the system actually shipped — and which one she assumed it shipped. The previous chapter established that the badge is a contract, not a fact. This one tells you what the contract actually says, word by word, so you can read a Flink job config or a Kafka producer setting and know which sub-guarantee is on and which is bluff.
"Exactly-once" hides three distinct guarantees: delivery (each record reaches the next stage exactly once), processing (each record's effect on operator state happens exactly once), and effect (the externally observable end state matches a one-execution baseline). Vendors usually ship delivery + processing inside the framework boundary and call it exactly-once. Effect-once on the outside world is your job.
Three meanings hiding inside one phrase
Every time someone says "exactly-once", they are picking one of three guarantees and hoping you fill in the same blank. The three are not interchangeable, and confusing them is how the seven extra rows show up.
Delivery semantics describe what happens between two stages — say, between a Kafka broker and a consumer, or between a Flink operator and its sink. Exactly-once delivery means: for every record produced by the upstream stage, the downstream stage receives it exactly one time, regardless of network retries, broker failovers, or consumer crashes. This is a transport-layer property. The unit of measurement is "one record sent → one record received".
Processing semantics describe what happens inside an operator. Exactly-once processing means: for every record an operator receives, the cumulative effect on that operator's state — its windowed aggregates, its keyed-state map, its accumulators — looks as if the record had been processed one time, even if the framework physically called process(record) four times during recovery. The unit of measurement is "one record received → one logical state transition".
Effect semantics (sometimes called effectively-once — Tyler Akidau's preferred term) describe what the outside world observes. Exactly-once effect means: the cumulative side-effects on systems outside the framework — rows in a database, emails sent, payments charged — equal what one logical execution would have produced. The unit of measurement is "one record entered the pipeline → one externally visible mutation".
The three compose, but they don't imply each other. You can have exactly-once delivery between Kafka and Flink while still doing at-least-once processing if the operator's state isn't checkpointed. You can have exactly-once processing inside a Flink operator while still doing at-most-once effect if the sink silently drops on conflict. The badge usually refers to the first two — the framework's two-thirds. The third one is on you.
Why "delivery" is the word the engineering papers use
Read KIP-98 (the Kafka exactly-once design doc) and the phrase is exactly-once delivery, not exactly-once processing. There's a precise reason for that word choice. The Kafka brokers are the transport layer; they have no view into what a downstream consumer does with a record after receiving it. They can only guarantee what happens up to the moment of receipt — which is delivery. Why this matters for reading vendor docs: when a system promises "exactly-once delivery", it is saying nothing about what happens after the consumer's poll() returns. If your consumer crashes after poll() but before processing, the broker has discharged its delivery guarantee — the record was delivered exactly once. The duplicate (or loss) on restart is now the consumer's problem.
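The boundary is easy to see in a toy consumer loop. The sketch below is not the real Kafka client; `consume` is a hypothetical function that models the one choice that matters once the broker has discharged its delivery guarantee: whether the offset commit lands before or after processing. Commit-first loses the in-flight record on a crash; process-first duplicates it on restart.

```python
# Toy model (not the real Kafka client) of a consumer loop. The placement of
# the offset commit relative to processing decides loss vs duplication.

def consume(log, start, commit_first, crash_at=None):
    """Consume from offset `start`; simulate a crash at index `crash_at`.
    Returns (records processed this run, committed offset at exit)."""
    processed, committed = [], start
    for offset in range(start, len(log)):
        if commit_first:                     # commit, then process (loss risk)
            committed = offset + 1
            if offset == crash_at:
                return processed, committed  # crashed before processing: record lost
            processed.append(log[offset])
        else:                                # process, then commit (duplicate risk)
            processed.append(log[offset])
            if offset == crash_at:
                return processed, committed  # crashed before commit: will reprocess
            committed = offset + 1
    return processed, committed

log = ["t0", "t1", "t2", "t3"]

# Commit-first: the restart resumes past the crashed record, so t2 is lost.
run1, resume = consume(log, 0, commit_first=True, crash_at=2)
run2, _ = consume(log, resume, commit_first=True)
print(run1 + run2)   # ['t0', 't1', 't3']

# Process-first: the restart replays the crashed record, so t2 is duplicated.
run1, resume = consume(log, 0, commit_first=False, crash_at=2)
run2, _ = consume(log, resume, commit_first=False)
print(run1 + run2)   # ['t0', 't1', 't2', 't2', 't3']
```

Neither ordering gives you exactly-once past `poll()`; that is precisely why the broker's "exactly-once delivery" claim stops where the consumer's logic starts.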
Flink, on the other hand, advertises exactly-once processing semantics (EOS-P). Its job is to extend the guarantee inward — past the point of delivery, through operator state, and out the other side via 2PC sinks. Flink's contract is stronger than Kafka's by construction: it covers delivery from the source plus processing inside the operator. But it stops at the sink boundary unless the sink itself is transactional.
Spark Structured Streaming uses the phrase exactly-once fault-tolerance and means: the sink writes are idempotent (because of the way micro-batch IDs are encoded) so reprocessing the same micro-batch produces the same row in the sink table. Different word, similar guarantee, different mechanism. Why the wording shifted: Spark's micro-batch model means the operator runs the same batch deterministically on retry — the framework doesn't need a 2PC sink because it forces sink writes to be idempotent by construction. Flink's continuous model needs 2PC because individual records flow through one at a time and there is no natural batch boundary to make idempotent.
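The micro-batch trick fits in a few lines. This is a deliberate simplification, not Spark's actual sink API: the point is only that keying the write by a deterministic batch ID makes a retry overwrite the same slot instead of appending a second copy.

```python
# Toy sketch of the micro-batch idempotency idea (a simplification, not
# Spark's sink API): writes are keyed by the deterministic batch ID, so a
# retried batch lands in the same slot instead of appending a duplicate.
sink = {}  # batch_id -> rows written for that batch

def write_batch(batch_id, rows):
    sink[batch_id] = rows  # same batch ID on retry => overwrite, not append

write_batch(7, ["tick-1", "tick-2"])
write_batch(7, ["tick-1", "tick-2"])  # crash-and-retry of the same batch
all_rows = [r for rows in sink.values() for r in rows]
print(all_rows)  # ['tick-1', 'tick-2'], no duplicates
```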
So when an interviewer asks "does Flink offer exactly-once?", the precise answer is: "Flink offers exactly-once processing within its operators and exactly-once delivery to its sinks if the sink supports 2PC; effect-once on the outside world depends on what the sink does next." That's the right level of pedantry. Anything less specific is hand-waving.
A side-by-side simulation: delivery vs processing vs effect
The three guarantees show up as three different counters. Build a tiny pipeline that tracks all three and watch what happens during a crash.
```python
# three_guarantees.py — count what each "exactly-once" actually counts
from collections import defaultdict

class Pipeline:
    def __init__(self):
        # Counters for the three guarantees
        self.delivery_count = defaultdict(int)    # times each record was delivered to the operator
        self.processing_count = defaultdict(int)  # times each record was process()'d
        self.effect_count = defaultdict(int)      # times each record produced a sink mutation
        self.sink_rows = []                       # the actual sink table

    def deliver(self, record_id, payload):
        # Source → operator: this is the "delivery" boundary
        self.delivery_count[record_id] += 1
        return record_id, payload

    def process(self, record_id, payload, force_state_replay=False):
        # Operator: this is the "processing" boundary
        self.processing_count[record_id] += 1
        if force_state_replay:
            # Framework rolled back state but is replaying the record
            return None  # state-restore replay does not count toward effect
        # Sink: this is the "effect" boundary
        self.effect_count[record_id] += 1
        self.sink_rows.append({"order_id": record_id, "amount": payload})

stream = [(f"BLR-{8000 + i}", 100 + i * 10) for i in range(5)]
pipe = Pipeline()

# Normal run, no crash: all five records flow through
for rid, amt in stream:
    pipe.process(*pipe.deliver(rid, amt))

# Now simulate a crash: record 2 was processed, the sink wrote, then the
# framework crashed before checkpoint. On restart it replays from offset 2.
crash_record = stream[2]
pipe.deliver(*crash_record)                           # delivery happens again
pipe.process(*crash_record, force_state_replay=True)  # processing happens, no sink effect

# Records 3 and 4 also get re-delivered, since the source replayed from
# offset 2, but the restored checkpoint marks them as already processed,
# so the operator skips process() for them.
for rid, amt in stream[3:]:
    pipe.deliver(rid, amt)

print(f"Records in stream: {len(stream)}")
print(f"Delivery counts: {dict(pipe.delivery_count)}")
print(f"Processing counts: {dict(pipe.processing_count)}")
print(f"Effect (sink) counts: {dict(pipe.effect_count)}")
print(f"Final sink row count: {len(pipe.sink_rows)}")
```

The run prints:

```
Records in stream: 5
Delivery counts: {'BLR-8000': 1, 'BLR-8001': 1, 'BLR-8002': 2, 'BLR-8003': 2, 'BLR-8004': 2}
Processing counts: {'BLR-8000': 1, 'BLR-8001': 1, 'BLR-8002': 2, 'BLR-8003': 1, 'BLR-8004': 1}
Effect (sink) counts: {'BLR-8000': 1, 'BLR-8001': 1, 'BLR-8002': 1, 'BLR-8003': 1, 'BLR-8004': 1}
Final sink row count: 5
```
The three counters tell different stories about the same crash:
- `delivery_count` shows duplicates for `BLR-8002` through `BLR-8004`. The source replayed those offsets after the crash, so the operator's `deliver()` was called twice for them. Why this is fine: at-least-once delivery is the natural floor for any replayable source. Building exactly-once delivery requires producer-side dedup (Kafka's idempotent producer with sequence numbers, KIP-98) — without that, replays naturally double-deliver.
- `processing_count` shows a duplicate only for `BLR-8002`. The framework replayed that record after rolling back state, so `process()` ran twice. But the second run was marked as a state-restore replay (`force_state_replay=True`), so it didn't trigger a sink effect.
- `effect_count` is exactly 1 for every record. Even though delivery happened twice for some records and processing happened twice for one, the sink only saw one mutation per record. That is what the user actually cares about — and that is what the engineering of exactly-once is fighting for.
If the framework had not distinguished state-restore replays from first-time processing — i.e. if `force_state_replay=False` for the second `process()` call — the effect count would be 2 for `BLR-8002` and the Pinot table would have an extra row. That gap, between "processing replayed during recovery" and "effect should not replay", is exactly what 2PC sinks and idempotent writes solve.
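The 2PC side of that fix can be sketched in miniature. The class below is a toy model, not Flink's actual `TwoPhaseCommitSinkFunction`: writes land in a pending transaction, a completed checkpoint promotes them to the visible table, and crash recovery aborts the pending writes before the replay arrives.

```python
# Toy two-phase-commit sink (illustrative only): pre-committed writes stay
# invisible until the checkpoint completes; a crash aborts them, so the
# replayed writes don't double up in the visible table.
class TwoPhaseSink:
    def __init__(self):
        self.visible = []   # what downstream readers see
        self.pending = []   # pre-committed, not yet visible

    def write(self, row):
        self.pending.append(row)

    def commit(self):       # checkpoint completed: publish atomically
        self.visible.extend(self.pending)
        self.pending = []

    def abort(self):        # crash recovery: discard the open transaction
        self.pending = []

sink = TwoPhaseSink()
sink.write({"order_id": "BLR-8002", "amount": 120})
sink.abort()    # crashed before the checkpoint completed
sink.write({"order_id": "BLR-8002", "amount": 120})  # replayed record
sink.commit()   # this time the checkpoint succeeds
print(len(sink.visible))  # 1
```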
How vendors map to the three guarantees
A working table for reading product docs:
| System | What it actually guarantees |
|---|---|
| Kafka producer with `enable.idempotence=true` | Exactly-once delivery for one (producer, partition) session — sequence numbers dedup retries. |
| Kafka transactional producer + transactional consumer (`isolation.level=read_committed`) | Exactly-once delivery that survives producer crashes via `transactional.id` fencing. |
| Flink with `EXACTLY_ONCE` checkpoint mode + Kafka source | Exactly-once processing inside operators, plus exactly-once delivery to a 2PC-capable sink. |
| Flink with `AT_LEAST_ONCE` checkpoint mode | At-least-once only — records may be replayed and operator state may double-count on crash recovery. |
| Spark Structured Streaming + idempotent sink (Delta, Iceberg, JDBC with composite PK) | Exactly-once effect via deterministic micro-batch IDs and idempotent sink writes. |
| Beam (Dataflow runner) | Exactly-once processing via per-bundle deduplication; effect-once depends on the sink connector. |
| Materialize | Exactly-once effect by construction — the materialised view is the sink, and IVM updates are deterministic from the input log. |
The "what to read first when you adopt a system" rule: open the output or sink section of the docs and read the words around exactly-once. If the docs only talk about delivery and processing without saying anything about the sink behaving idempotently or transactionally, your effect-once is on you.
What "delivery, not processing" means for your code
The pragmatic translation of all this for an engineer writing a Flink or Kafka pipeline today:
Don't conflate Kafka's exactly-once with end-to-end exactly-once. Kafka's transactional producer + consumer gives you exactly-once delivery within Kafka. If your consumer reads from Kafka, transforms, and writes to Postgres, Kafka's guarantee stops at the consumer's poll() — the Postgres write is a separate concern. The mistake at Zerodha was assuming Kafka's badge covered Pinot. It didn't, and couldn't.
Pick the boundary you care about and engineer for that boundary specifically. If the metric you optimise is "duplicate rows in Pinot per quarter", you need exactly-once effect and the engineering work happens at the sink. If the metric is "double-counted aggregates in Flink state during recovery", you need exactly-once processing and the work is in checkpoint configuration. If the metric is "duplicate Kafka records consumed by downstream services", you need exactly-once delivery and the work is in producer config + transactional consumer. Three different failure modes, three different fixes, three different config knobs.
The sink connector is the load-bearing piece. Every "exactly-once is broken" production story I've read traces back to a sink that didn't honour the contract — a JDBC sink with naive INSERT, an HTTP sink with no idempotency key, a S3 writer that appended instead of conditionally-putting. The streaming framework is rarely the bug. It's the connector.
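The difference between a connector that honours the contract and one that doesn't fits in a few lines. Both sinks below are toy stand-ins, not real connectors: one appends blindly, like a naive `INSERT`; the other does a put-if-absent keyed on the record ID.

```python
# Two toy sinks (illustrative stand-ins, not real connectors). Replaying a
# batch duplicates rows in the append sink but not in the keyed sink.
class AppendSink:                      # naive INSERT: duplicates on replay
    def __init__(self): self.rows = []
    def write(self, key, row): self.rows.append(row)

class KeyedSink:                       # put-if-absent on a dedup key
    def __init__(self): self.rows = {}
    def write(self, key, row): self.rows.setdefault(key, row)

batch = [("BLR-8002", 120), ("BLR-8003", 130)]
naive, keyed = AppendSink(), KeyedSink()
for _ in range(2):                     # the framework replays the batch once
    for key, amount in batch:
        naive.write(key, amount)
        keyed.write(key, amount)
print(len(naive.rows), len(keyed.rows))  # 4 2
```

The framework behaved identically toward both sinks; the duplicate rows come entirely from the connector's write semantics.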
Common confusions
- "Exactly-once delivery is the same as exactly-once processing." No — delivery is between stages, processing is inside one operator. Kafka offers delivery only; Flink offers processing, plus delivery to the sink only if you also configure 2PC sinks; the two compose but neither implies the other.
- "If Flink says EXACTLY_ONCE, my downstream Pinot won't have duplicates." Only if the Pinot sink connector is transactional or idempotent. The default JDBC connector with a naive `INSERT` will duplicate rows on Flink's checkpoint replays. Use the dedicated upsert connector or write through a Kafka topic with EOS.
- "Exactly-once effect is a vendor's responsibility." The vendor can give you the framework guarantees (delivery + processing). Effect-once depends on the external system the framework writes to — which the vendor doesn't own. They can ship a 2PC sink connector for common cases (Kafka, JDBC with composite PK, S3 with conditional put); for everything else you write the idempotency key yourself.
- "Idempotent producer = exactly-once delivery." Only within one producer session. The idempotent producer's sequence numbers reset when the producer process restarts under a new `producer_id`. To survive restarts you also need transactional producers with a stable `transactional.id` — that's the fencing layer.
- "Exactly-once processing implies deterministic operators." No — but non-determinism limits how useful exactly-once is. If `process()` reads `time.time()` or calls a remote API, the cumulative effect is only reproducible when a re-execution sees the same inputs and the same clock — and EOS doesn't snapshot the clock; it snapshots state. Determinism is a separate concern, layered on top.
- "Effect-once is impossible in distributed systems." It's not impossible — it's a closed-system guarantee. Inside a closed system (Kafka source, Flink operator, Iceberg sink with snapshot isolation), effect-once holds. Outside a closed system (HTTP calls to a third-party API with no idempotency key), it doesn't. The trick is to draw the closed-system boundary explicitly and make every external effect cross it through an idempotent worker.
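Drawing that boundary usually means stamping every external effect with an idempotency key. A minimal sketch of the idempotent-worker idea — the `charge` function and its key scheme are hypothetical, standing in for any third-party call:

```python
# Toy idempotent worker at the closed-system boundary (hypothetical API):
# every external effect carries an idempotency key, and the worker refuses
# to repeat a key it has already applied.
applied = set()
charges = []

def charge(idempotency_key, amount):
    if idempotency_key in applied:
        return "duplicate-ignored"     # replay from the pipeline: no second charge
    applied.add(idempotency_key)
    charges.append(amount)             # stand-in for the real external call
    return "charged"

print(charge("order:BLR-8002", 120))   # charged
print(charge("order:BLR-8002", 120))   # duplicate-ignored, replay absorbed
print(len(charges))                    # 1
```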
Going deeper
Akidau's "effectively-once" — the term that should have won
Tyler Akidau spent a chunk of Streaming Systems (chapter 5, sections 5.1–5.3) arguing that exactly-once is the wrong word and effectively-once is the right one. His argument: exactly-once implies that physical execution happens once, which is false in any system that does retry-based fault tolerance. The real guarantee — that the cumulative effect equals one execution — is better described by the adverb effectively. The streaming community largely stuck with exactly-once because Kafka and Flink had already shipped that brand, and rebranding mid-flight is a marketing nightmare. But most papers since 2017 include a paragraph clarifying that exactly-once really means effectively-once. When you see effectively-once in a paper, the author is being precise on purpose. When you see exactly-once, you have to read the next paragraph to figure out which of the three sub-guarantees they mean.
Why Kafka's idempotent producer alone isn't enough
The Kafka idempotent producer (KIP-98 part 1) assigns a (producer_id, partition, sequence_number) triple to every record. The broker tracks the highest sequence number per (producer_id, partition) and silently drops duplicates whose sequence number is <= last_seen. This handles network retries within one producer session. It does not handle: producer process restarts (the new process gets a fresh producer_id, so its sequence numbers start at zero and the broker can't dedup against the old session); long-running transactions (no atomic boundary across multiple records); or consumer-side replays (offsets are committed separately from records).
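The broker-side bookkeeping is small enough to sketch. This toy model is my simplification of KIP-98 part 1, not broker code; it shows both the dedup within a session and why a fresh `producer_id` defeats it.

```python
# Toy model of broker-side idempotent-producer dedup (a simplification of
# KIP-98 part 1): track the highest sequence number per
# (producer_id, partition) and silently drop anything at or below it.
last_seq = {}   # (producer_id, partition) -> highest sequence number seen
log = []

def append(producer_id, partition, seq, record):
    key = (producer_id, partition)
    if seq <= last_seq.get(key, -1):
        return False                 # duplicate retry within the session: dropped
    last_seq[key] = seq
    log.append(record)
    return True

append("pid-1", 0, 0, "tick-a")
append("pid-1", 0, 0, "tick-a")      # network retry: deduped
append("pid-1", 0, 1, "tick-b")
# Process restart: a fresh producer_id, sequences reset to 0, and the broker
# has no way to match the re-sent record to the old session.
append("pid-2", 0, 0, "tick-a")      # duplicate sails through
print(log)  # ['tick-a', 'tick-b', 'tick-a']
```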
Transactional producers (KIP-98 part 2) added the transactional.id config — a stable, application-chosen identifier that survives process restarts. When a new producer instance starts up with the same transactional.id, it calls init_transactions() which fences out any zombie producer sessions still holding open transactions. This is what makes exactly-once delivery survive producer crashes. The fencing is implemented as a per-transactional.id epoch number on the broker; bumping the epoch invalidates all previous producer instances. Read the Confluent transactions blog post for the protocol diagram — it's the clearest published walk-through of how the epoch fencing actually works under the hood.
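Stripped of the protocol details, the fencing reduces to an epoch check. The sketch below is a toy model of that one idea — the function names loosely mirror the client API, and the broker-side logic is my simplification of the KIP-98 part 2 design, not the real implementation.

```python
# Toy sketch of transactional.id epoch fencing: initialising transactions
# bumps a per-id epoch on the "broker", and writes carrying an older epoch
# are rejected as zombies.
epochs = {}   # transactional.id -> current epoch on the broker

def init_transactions(txn_id):
    epochs[txn_id] = epochs.get(txn_id, 0) + 1
    return epochs[txn_id]            # the epoch this producer instance holds

def send(txn_id, epoch, record, log):
    if epoch < epochs[txn_id]:
        raise RuntimeError("ProducerFencedException")  # zombie session
    log.append(record)

log = []
old_epoch = init_transactions("tick-pipeline")   # first instance: epoch 1
new_epoch = init_transactions("tick-pipeline")   # restart: epoch 2, fences epoch 1
send("tick-pipeline", new_epoch, "tick-b", log)  # new instance writes fine
try:
    send("tick-pipeline", old_epoch, "tick-a", log)  # zombie write
except RuntimeError as err:
    print(err)  # ProducerFencedException
print(log)      # ['tick-b']
```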
The Zerodha tick-pipeline post-mortem
In late 2024, Zerodha's market-data team wrote up an internal post-mortem about a Pinot table that consistently had 4–9 extra rows per million during NSE peak hours. The Flink job was in EXACTLY_ONCE mode; the Pinot sink was a custom HTTP-based connector that posted batches of rows to Pinot's realtime ingestion endpoint. The connector had been written before Pinot supported upserts, so it had no dedup key. When a Flink task crashed mid-checkpoint, the framework replayed the checkpoint's pre-committed batch, and the connector POSTed the same batch a second time. Pinot dutifully ingested both copies. The fix was twofold: (1) move the sink behind a Kafka topic with EOS-enabled producer, and let Pinot's first-class Kafka upsert connector dedup using (symbol, exchange_ts) as the primary key; (2) drop the custom HTTP connector. After the fix, duplicate rows fell to zero across three quarters. The architectural lesson — which is the lesson of this entire build — is: trust the closed system, and bridge to the outside world only through idempotent or transactional bridges. Custom HTTP sinks almost always violate that rule.
How Beam's runner-specific exactly-once differs
Apache Beam tries to abstract away the runner — Dataflow, Flink, Spark, Samza, Direct — but exactly-once semantics are inherently runner-specific because they depend on how the runner does fault tolerance. Beam's compromise is to let the runner declare which guarantee it offers via the Pipeline.requireUnboundedSources() and per-PTransform deduplication options. The Dataflow runner uses per-bundle dedup keyed on a content-hash, so reprocessing produces the same bundle ID and the framework drops it. The Flink runner uses Flink's native EOS plus 2PC sinks. The Spark runner uses Spark Structured Streaming's micro-batch IDs. The Beam programmer writes one pipeline; the runner picks the mechanism. The catch: the per-PTransform DoFn.processElement may be called multiple times by the runner during recovery, and Beam's docs explicitly say so. If your DoFn has external side-effects (writes to a non-Beam sink), you have to make them idempotent — Beam will not.
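The per-bundle dedup idea is easy to model. A toy sketch, keyed on a content hash — my simplification of the Dataflow mechanism, not Beam's actual API:

```python
import hashlib

# Toy per-bundle dedup (a simplification of the Dataflow idea, not Beam's
# API): reprocessing the same bundle yields the same bundle ID, so the
# second commit is recognised and dropped.
committed_bundles = set()
output = []

def commit_bundle(records):
    bundle_id = hashlib.sha256("|".join(records).encode()).hexdigest()
    if bundle_id in committed_bundles:
        return False                  # retry of an already-committed bundle
    committed_bundles.add(bundle_id)
    output.extend(records)
    return True

commit_bundle(["t0", "t1"])
commit_bundle(["t0", "t1"])           # runner retries the bundle after a crash
print(output)  # ['t0', 't1']
```

Note the dedup happens at commit, not at `processElement` — which is exactly why a `DoFn` with external side-effects still needs its own idempotency.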
Where this leads next
The three guarantees are the vocabulary you need to read the next four chapters without confusion:
- /wiki/idempotent-producers-and-sequence-numbers — how Kafka delivers the delivery leg via sequence numbers and producer-id fencing.
- /wiki/transactional-writes-2pc-wearing-a-hat — how the processing leg extends into the sink via two-phase commit.
- /wiki/end-to-end-exactly-once-source-operator-sink — the three legs composed, with one closed-system diagram.
- /wiki/why-eos-is-still-not-end-to-end-real-world-side-effects — the effect leg outside the closed system, and the patterns Indian fintech uses for it.
Hold on to this map: when someone says "exactly-once", ask "delivery, processing, or effect?" before agreeing or disagreeing. The conversation will get shorter and the production bug will get easier to fix.
References
- Streaming Systems — Akidau, Chernyak, Lax (O'Reilly, 2018) — chapter 5 is the canonical treatment of effectively-once and the framework-vs-effect boundary.
- KIP-98: Exactly Once Delivery and Transactional Messaging — the design doc that introduced idempotent and transactional producers; clear distinction between session-local and cross-session dedup.
- Transactions in Apache Kafka — Confluent blog — the producer-id fencing diagram and protocol walk-through.
- An Overview of End-to-End Exactly-Once Processing in Apache Flink with Kafka — Flink's mapping of processing-side EOS to 2PC sinks.
- Spark Structured Streaming — output sinks and idempotency — Spark's micro-batch-ID approach to effect-once.
- Apache Beam — fault tolerance and exactly-once — runner-specific guarantees and the `DoFn` retry contract.
- Materialize — strict serializability and effect-once — IVM-based exactly-once-by-construction.
- /wiki/wall-exactly-once-is-a-lie-everyone-tells — the previous wall chapter; the "lie" framing this article unpacks.