Exactly-once semantics: how it actually works

A producer sends a payment event over the network. The TCP connection drops just as the broker is writing the bytes. Did the broker get it? Did it get it twice? You retry. Now you might have charged Riya ₹500 once, twice, or zero times. "Exactly-once" is the name of the trick that makes this stop being your problem.

Exactly-once delivery is a network impossibility — but exactly-once effect is achievable by combining two cheap primitives. First, give every message a stable id and have the broker reject duplicates (idempotence). Second, write the message and the consumer's offset to the same log atomically, so a partial failure leaves no half-applied state (transactional commit). Together these turn an at-least-once pipe into a system where each input event causes its downstream effect exactly once, even across crashes.

Why "exactly-once delivery" is a lie, and what we actually want

Open a terminal and run nc -l 9000 in one window. In another, run echo "pay-Riya-500" | nc localhost 9000. The byte stream arrives. Now pull your laptop's network cable mid-echo. The sender saw an error. The receiver saw — what? Maybe the full message. Maybe nothing. Maybe half a message. The sender has no way to tell which. This is the Two Generals Problem, and it is older than computers: no finite exchange of messages over a lossy channel can give both sides certainty that the other side received a specific message.

That has a brutal consequence. There are three things a sender can do with a message it is unsure about:

  1. Send it once and never retry. If the network ate it, it is gone forever. This is at-most-once.
  2. Retry until acknowledged. If the original got through but the ack was lost, the receiver gets it twice. This is at-least-once.
  3. Refuse to choose. Pretend the dilemma does not exist. This is what most engineers do until they ship to production and a customer calls about a double charge.

There is no fourth option at the delivery layer. Every wire-level guarantee in real systems — Kafka, RabbitMQ, gRPC, HTTP retries, your phone's UPI app — is either at-most-once or at-least-once. Pick one. Kafka picks at-least-once by default because losing payments is worse than charging twice (and the second can be fixed in software).

The trick of "exactly-once" is to give up on the delivery layer entirely and move the guarantee one level up — to the effect of the message. The wire delivers the bytes some non-zero number of times; the application makes sure that no matter how many copies arrive, the outcome is identical to the outcome of receiving exactly one copy. The classical name for this property is idempotence — applying the operation twice is indistinguishable from applying it once.
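The dedupe side of that property fits in a few lines. This toy wallet (the class and the op_id format are illustrative, not any real API) absorbs a duplicate delivery by remembering which operation ids it has already applied:

```python
# Effect-level idempotence in miniature: a duplicate delivery of the same
# operation id changes nothing. (Illustrative sketch, not a real wallet API.)
class Wallet:
    def __init__(self, balance):
        self.balance = balance
        self.applied = set()            # op_ids whose effect already happened

    def debit(self, op_id, amount):
        if op_id in self.applied:       # duplicate delivery: absorb it
            return self.balance
        self.applied.add(op_id)
        self.balance -= amount
        return self.balance

w = Wallet(2000)
w.debit("pay-2026-04-25-001", 500)      # first delivery: balance 1500
w.debit("pay-2026-04-25-001", 500)      # network retry of the same op
print(w.balance)                        # 1500, not 1000
```

The receiver never learns how many copies the wire delivered; it only guarantees that copies beyond the first have no effect.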

[Figure: at-most-once, at-least-once, and exactly-once compared. Three panels each show a producer, a network cloud, and a consumer. Panel 1, at-most-once: the arrow is dropped in the cloud and there is no retry; effect 0, data loss possible. Panel 2, at-least-once: retries on ack loss; effect 1, 2, 3..., duplicates possible. Panel 3, exactly-once (effect): at-least-once plus a dedupe filter in front of the consumer; effect exactly 1, duplicates absorbed.]
The wire never gives you exactly-once delivery. Exactly-once is at-least-once delivery with a dedupe step that absorbs the duplicates before they cause a second side effect.

So when Kafka, Flink, or Razorpay claim "exactly-once", read it as: the producer retries until the broker acks; the broker tags every message and rejects duplicates; the consumer checkpoints its progress in the same atomic step that produces the output. No message is delivered exactly once — but each message's effect lands exactly once. This chapter is about how those three pieces fit together.

At-least-once with idempotent retries — the producer side

Build it from scratch. A producer sends payment events to a broker. Each event is independent: {user: "riya", amount: 500, op_id: "pay-2026-04-25-001"}. The producer wants two things:

  1. If the broker received the event, the broker stores it.
  2. If the broker did not receive the event, the producer keeps retrying until it does.

The naive producer just retries on timeout. That gives at-least-once on the wire. The problem is that the broker has no way to tell "this is a retry of the message I already stored" apart from "this is a fresh message that happens to be identical". So if the producer's first send succeeded but the ack was eaten by a router, the retry duplicates the record on the broker. Now the consumer reads two pay-Riya-500 events. Riya's wallet gets debited twice.
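To see the failure concretely, here is a toy broker with no memory of what it has stored; the lost ack is simulated by the producer simply retrying the identical payload:

```python
# A naive broker with no dedupe: a lost ack plus a retry yields a duplicate
# record. (Toy sketch; the "lost ack" is simulated by retrying.)
class NaiveBroker:
    def __init__(self):
        self.log = []

    def produce(self, payload):
        self.log.append(payload)
        return "ack"

broker = NaiveBroker()
broker.produce("pay-Riya-500")   # stored, but pretend the ack was lost
broker.produce("pay-Riya-500")   # producer saw a timeout, so it retries
print(broker.log)                # ['pay-Riya-500', 'pay-Riya-500']
```

The broker did nothing wrong at its own layer; the duplicate is created by the retry policy that at-least-once requires.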

The fix is for the broker to remember what it has already seen. That's it. The whole "idempotent producer" feature in Kafka is: each producer gets a producer id, stamps every record with a per-partition sequence number, and the broker acks (without re-appending) any sequence number at or below the highest one it has already stored.

That's the entire mechanism. Twenty-five lines of Python make it concrete.

# idempotent_broker.py — a tiny illustration of Kafka's idempotent-producer trick
class IdempotentBroker:
    def __init__(self):
        self.log = []                   # the append-only log per partition
        self.high_seq = {}              # (pid, partition) -> highest seq seen

    def produce(self, pid, partition, seq, payload):
        key = (pid, partition)
        last = self.high_seq.get(key, -1)
        if seq == last + 1:
            self.log.append((pid, seq, payload))
            self.high_seq[key] = seq
            return "ack-stored"          # first time we see this seq
        if seq <= last:
            return "ack-duplicate"       # retry of already-stored message
        return "error-out-of-order"      # gap; producer must reset

# the producer retries until it gets an ack
broker = IdempotentBroker()
print(broker.produce("p-7", 0, 0, "pay-Riya-500"))    # ack-stored
print(broker.produce("p-7", 0, 0, "pay-Riya-500"))    # ack-duplicate (retry)
print(broker.produce("p-7", 0, 1, "pay-Rahul-200"))   # ack-stored
print(broker.produce("p-7", 0, 3, "pay-Asha-800"))    # error-out-of-order
print(broker.log)
# [('p-7', 0, 'pay-Riya-500'), ('p-7', 1, 'pay-Rahul-200')]

Walk it line by line.

high_seq[key] = seq. The broker's only durable state for the dedupe is one integer per (producer, partition) pair. The whole idempotence guarantee rides on this counter being persisted alongside the log. If the broker crashes and forgets the counter, an old retry will be accepted again — so this counter is part of the partition's metadata that gets snapshotted with the segment files.

seq == last + 1. The strict +1 rule, not "any seq we have not seen", is what protects against silent gaps. If a producer sent seq 0, 1, 2 and the broker only got 0 and 2, accepting 2 would silently lose seq 1 forever. The producer would think everything is fine. The +1 rule forces the broker to reject the gap, the producer sees the error, and it falls back to a recovery path (re-send the missing one or abort the session).

seq <= last returns ack-duplicate. From the producer's point of view, this looks identical to a successful first store. It does not retry forever; it considers the message delivered. The broker did not double-write, the producer did not crash, the consumer will not see a duplicate. This single comparison is the whole idempotent-producer feature in Kafka 0.11+.

Why a sequence number is enough, when message content is not: the producer assigns sequence numbers in send-order, before the network. Two physically identical payloads — pay-Riya-500 legitimately twice — have different sequence numbers and are stored as two separate records. The dedupe is on the producer's intent (which send is this?), not on the bytes.
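A condensed copy of the IdempotentBroker above (redefined so this snippet runs standalone) makes the intent-vs-content point runnable: two identical payloads with fresh sequence numbers are both stored, while a retry reusing its sequence number is absorbed:

```python
# Dedupe on intent, not content. (Condensed copy of the sketch above.)
class IdempotentBroker:
    def __init__(self):
        self.log, self.high_seq = [], {}

    def produce(self, pid, partition, seq, payload):
        last = self.high_seq.get((pid, partition), -1)
        if seq == last + 1:
            self.log.append(payload)
            self.high_seq[(pid, partition)] = seq
            return "ack-stored"
        return "ack-duplicate" if seq <= last else "error-out-of-order"

b = IdempotentBroker()
b.produce("p-7", 0, 0, "pay-Riya-500")   # Riya really pays twice:
b.produce("p-7", 0, 1, "pay-Riya-500")   # same bytes, new seq -> stored again
b.produce("p-7", 0, 1, "pay-Riya-500")   # network retry, same seq -> absorbed
print(b.log)   # ['pay-Riya-500', 'pay-Riya-500'] -- two intents, two records
```

Content-based dedupe would wrongly collapse the two legitimate payments into one; sequence-based dedupe keeps them apart while still absorbing the retry.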

This solves the producer→broker leg. A retry over a flaky network can no longer cause the broker to store the record twice. But that is only one of three legs in a streaming pipeline.

The three legs of an exactly-once pipeline

A real streaming pipeline — Razorpay's payments, Swiggy's order fulfilment, ONDC's transaction events — has three places where a duplicate can sneak in:

  1. Producer → broker. Solved above: idempotent producer + sequence numbers.
  2. Broker → consumer (read). The consumer reads message at offset 42, processes it, then crashes before saving "I am at offset 43". On restart it re-reads 42, processes it again, and produces a duplicate downstream effect.
  3. Consumer → output (write). The consumer wrote the result to a downstream system (a database row, another Kafka topic, a Redis cache) and then crashed before checkpointing the offset. Same problem, different layer.

Solving leg 1 alone is what Kafka calls "idempotent producer". Solving all three together is what Kafka calls "exactly-once semantics" (EOS). The mechanism is one extra trick on top of idempotence: bind the output write and the consumer offset commit into one atomic transaction, written to the same log.

[Figure: the three duplicate-introducing legs of a streaming pipeline. Four boxes in a row: Producer, Broker (input topic), Consumer/Processor, Output (DB or output topic). Three numbered arrows mark the gaps: leg 1, producer retry duplicates; leg 2, re-read after crash; leg 3, write succeeds but the offset commit is lost. A wider arrow labelled "transactional commit" binds the offset commit and the output write into one atomic step: output records and consumer offsets land in the same atomic log entry.]
The three legs and the single trick that closes legs 2 and 3 at once: bundle the output write and the offset commit into one atomic transaction on the same log.

The reason this trick works is the stream-table duality from chapter 175. The consumer's "current offset" is just another piece of state. Storing it in a database is a write. Storing it in a Kafka topic is a write. If output records and the offset commit go into the same log and the same transaction, then either both are visible to the world or neither is. There is no in-between state where the output got produced but the offset did not advance — the very state that creates duplicates.
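The duality is easy to demonstrate: replaying an offsets topic into a dict recovers the "current offset" table, with later records overwriting earlier ones. This is a toy version of how a compacted offsets topic is read back into memory; the group and field names are illustrative:

```python
# Stream/table duality for offsets: the "current offset" table is just the
# latest record per key in an offsets stream. (Names are illustrative.)
offsets_topic = [
    {"group": "fraud-scorer", "partition": 0, "offset": 40},
    {"group": "fraud-scorer", "partition": 1, "offset": 17},
    {"group": "fraud-scorer", "partition": 0, "offset": 41},  # newer wins
]

table = {}
for rec in offsets_topic:   # replay the stream into a table
    table[(rec["group"], rec["partition"])] = rec["offset"]

print(table)   # {('fraud-scorer', 0): 41, ('fraud-scorer', 1): 17}
```

Because the offset is just another record, it can ride in the same transaction as any other record; that is the lever the next section pulls.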

Atomic offset + output: the consumer-side commit

Make the trick concrete. A streaming job reads from payments-input, applies a fraud-score function, writes the result to payments-scored. With at-least-once and a naive offset commit, you get duplicates on every consumer crash. With a transactional commit, you do not. Here is the toy.

# transactional_consumer.py — bundle output writes and offset commit atomically
class Log:
    def __init__(self):
        self.records = []      # each record is a dict
    def append(self, rec):
        self.records.append(rec)

class TxnEngine:
    """One log holds both data records and offset-commit records. A txn is a
    contiguous block of records bracketed by BEGIN and COMMIT markers; on
    crash, any block without a COMMIT is invisible to readers."""
    def __init__(self, output_log, offset_log):
        self.output, self.offsets = output_log, offset_log
        self._buffer = []      # uncommitted writes, kept in memory only

    def begin(self):
        self._buffer = []

    def write_output(self, rec):
        self._buffer.append(("OUT", rec))

    def commit_offset(self, partition, off):
        self._buffer.append(("OFF", partition, off))

    def commit(self):
        # phase 1: append BEGIN to both logs
        self.output.append({"type": "BEGIN"})
        self.offsets.append({"type": "BEGIN"})
        # phase 2: flush buffered records
        for r in self._buffer:
            if r[0] == "OUT":
                self.output.append({"type": "DATA", "rec": r[1]})
            else:
                self.offsets.append({"type": "DATA", "p": r[1], "off": r[2]})
        # phase 3: append COMMIT — this is the moment the txn becomes visible
        self.output.append({"type": "COMMIT"})
        self.offsets.append({"type": "COMMIT"})
        self._buffer = []

# simulate processing one input record exactly-once
out_log, off_log = Log(), Log()
engine = TxnEngine(out_log, off_log)

engine.begin()
engine.write_output({"user": "riya", "score": 0.91})  # the fraud score
engine.commit_offset(partition=0, off=42)             # we just consumed off 42
engine.commit()

print(out_log.records)
# [{'type':'BEGIN'}, {'type':'DATA','rec':{'user':'riya','score':0.91}}, {'type':'COMMIT'}]
print(off_log.records)
# [{'type':'BEGIN'}, {'type':'DATA','p':0,'off':42}, {'type':'COMMIT'}]

Walk it.

begin() clears the in-memory buffer. Until commit() is called, nothing is on the log — not the output record, not the offset commit. This is critical: a crash mid-processing throws away the buffer and leaves the log untouched. On restart, the consumer re-reads offset 42 (the last committed offset was 41), reprocesses it, and produces an identical output record. The downstream system sees the output exactly once.

commit() writes BEGIN markers, then data, then COMMIT markers. The COMMIT marker is the atomic line. A reader (downstream consumer or read_committed consumer in Kafka) is configured to ignore any record sequence that does not have a matching COMMIT. So a crash between the BEGIN and the COMMIT — even one with the data records already on disk — is invisible to readers. The dangling fragment will be ignored or aborted on the next start.

Both logs receive BEGIN/DATA/COMMIT. The output topic and the offset-commit topic are written within the same transaction. In Kafka, the offset-commit topic is the special __consumer_offsets topic — yes, offsets are themselves messages on a Kafka topic, an example of the stream/table duality. Writing both topics under one transaction id means either both records become visible or neither does.

Why the COMMIT marker is enough — no two-phase commit, no coordinator: a Kafka transaction spans only Kafka topics. All the partitions involved live in the same cluster, with a single transaction coordinator broker. The "atomic across systems" problem of classic two-phase commit does not appear. The coordinator writes a transaction state machine to its own internal topic; the COMMIT marker is the visible signal to consumers, written after the coordinator durably knows the txn is done. There is no blocking 2PC because there is no second resource manager — only Kafka.
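A read_committed reader can be sketched in the same toy style: buffer records between BEGIN and COMMIT, release them only on COMMIT, and drop any dangling fragment. (Real Kafka uses control records and aborted-transaction metadata rather than inline markers; this is a simplification.)

```python
# A read_committed reader in miniature: only records inside a completed
# BEGIN..COMMIT block become visible; a dangling fragment is dropped.
def read_committed(log):
    visible, buffer, in_txn = [], [], False
    for rec in log:
        if rec["type"] == "BEGIN":
            buffer, in_txn = [], True
        elif rec["type"] == "COMMIT":
            visible.extend(buffer)
            buffer, in_txn = [], False
        elif in_txn:
            buffer.append(rec["rec"])
    return visible   # a trailing BEGIN without COMMIT never reaches visible

log = [
    {"type": "BEGIN"},
    {"type": "DATA", "rec": {"user": "riya", "score": 0.91}},
    {"type": "COMMIT"},
    {"type": "BEGIN"},                                          # crash here:
    {"type": "DATA", "rec": {"user": "rahul", "score": 0.12}},  # no COMMIT
]
print(read_committed(log))   # [{'user': 'riya', 'score': 0.91}]
```

The second block is physically on the log but logically invisible, which is exactly the property the COMMIT marker buys.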

The astute reader will object: what if the output is not a Kafka topic? What if I am writing to Postgres, or to Razorpay's settlements table? Then the trick breaks down — Kafka cannot atomically commit a Postgres row alongside its own offset. There are two industry workarounds, both of which fall under "exactly-once":

  1. Idempotent writes. Give every output record a stable key (an upsert key, an idempotency id) and let the sink reject or overwrite duplicates. The pipeline stays at-least-once, and the sink absorbs the extras.
  2. Offsets in the sink. Write the output row and the consumer offset in the same database transaction; on restart, resume from the offset the sink recorded. The sink's own transaction plays the role the Kafka transaction played above.

Both shapes are special cases of the same idea: at-least-once delivery + a key the receiver can use to detect duplicates. The Kafka transaction is just the version where the broker itself is the receiver and the key is the txn id.
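One such workaround can be sketched with sqlite3: store the consumer offset in the sink itself, in the same database transaction as the output row, so a crash leaves either both or neither. Table and column names here are illustrative, not from any real schema:

```python
# Offsets-in-the-sink: output row and consumer offset commit atomically
# in one DB transaction. (Illustrative schema; sqlite3 stands in for the
# real sink database.)
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scores (username TEXT, score REAL)")
db.execute("CREATE TABLE offsets (part INTEGER PRIMARY KEY, last_off INTEGER)")

def process(part, off, username, score):
    # 'with db' is one transaction: both rows land together or not at all
    with db:
        db.execute("INSERT INTO scores VALUES (?, ?)", (username, score))
        db.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?)", (part, off))

process(0, 42, "riya", 0.91)
print(db.execute("SELECT * FROM offsets").fetchall())   # [(0, 42)]
```

On restart the consumer reads `offsets` from the sink and seeks to `last_off + 1`; a crash before the transaction committed means neither the score nor the offset exists, so reprocessing is safe.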

Common confusions

Going deeper

KIP-98: how Kafka actually implements transactions

Kafka exactly-once was added in version 0.11 (2017) under KIP-98. The design has three new pieces inside the broker:

  1. Producer epochs. Each transactional.id gets a monotonically-increasing epoch on every reconnection. An older epoch's writes are fenced out. This is what stops a "zombie" producer (one that hung for ten minutes, then woke up thinking it could commit) from corrupting state — its epoch is stale and the broker rejects it.
  2. Transaction coordinator. A broker chosen per-transactional.id (by hashing) that owns the transaction state machine — Empty → Ongoing → PrepareCommit → CompleteCommit — and writes that machine to a special compacted topic __transaction_state. This is the equivalent of the transaction coordinator in 2PC but with no second resource manager: only Kafka.
  3. Control records. Every transaction ends with a COMMIT or ABORT control record written to every involved data partition. Consumers in read_committed mode buffer records they have received but skip ones whose txn was aborted, using these control records as the truth.

The full state machine has nine states and is described in §4 of the KIP. The key fact: there is no blocking. If the coordinator broker dies mid-transaction, a follower takes over, reads the coordinator's __transaction_state log, and replays the state machine. Because the state machine is itself on a Kafka log, the new coordinator has full information and can drive every in-flight transaction to commit or abort without consulting any participant.
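The epoch-fencing rule from piece 1 fits in a few lines. This toy coordinator (class and names are illustrative, not Kafka's internals) bumps the epoch on every registration of a transactional id and rejects writes carrying a stale one:

```python
# Producer-epoch fencing in miniature: a zombie producer holding an old
# epoch is rejected once a newer incarnation has registered. (Toy sketch
# of the KIP-98 fencing rule; names are illustrative.)
class Coordinator:
    def __init__(self):
        self.epochs = {}   # transactional.id -> current epoch

    def register(self, txn_id):
        self.epochs[txn_id] = self.epochs.get(txn_id, -1) + 1
        return self.epochs[txn_id]

    def write(self, txn_id, epoch, rec):
        if epoch < self.epochs[txn_id]:
            return "fenced"   # zombie: a newer incarnation exists
        return "ok"

c = Coordinator()
old = c.register("fraud-scorer-1")   # epoch 0; this producer then hangs
new = c.register("fraud-scorer-1")   # restarted instance gets epoch 1
print(c.write("fraud-scorer-1", new, "score-riya"))  # ok
print(c.write("fraud-scorer-1", old, "score-riya"))  # fenced
```

The hung producer that wakes up ten minutes later cannot tell it is a zombie; the coordinator can, because its epoch is stale.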

Why FLP impossibility doesn't kill the idea

You may have heard the FLP impossibility result (Fischer, Lynch, Paterson 1985): in an asynchronous system with even one faulty process, no deterministic consensus is possible. Doesn't that doom exactly-once?

It does not, for two reasons. First, FLP applies to consensus over the message, not to idempotent application of the message. Every system that retries until it gets an ack and uses a sequence number is sidestepping consensus — it does not need agreement on whether the message was received, only an upper bound on how many times its effect can occur. Second, real systems aren't fully asynchronous: they have failure detectors (heartbeats, timeouts), so they can make progress in practice even though FLP says they cannot in the worst case.

The cost shows up elsewhere. A network partition that lasts longer than the coordinator's session timeout will force the coordinator to abort in-flight transactions; the producer has to start a new transaction. Liveness is not guaranteed (FLP forbids it), but safety — no duplicate effects, no half-applied transactions — always is.

TigerBeetle, Razorpay, and idempotence as the universal fix

TigerBeetle is a financial-grade ledger that takes idempotence to its logical extreme: every account transfer must carry a 128-bit id chosen by the client, and the ledger rejects any transfer whose id it has already seen. The whole "exactly-once" story collapses into one rule. Combined with deterministic state-machine replication (Viewstamped Replication consensus over an LSM-tree storage engine), this gets exactly-once effect without any transactions, any 2PC, any control records.

Razorpay's Payment Gateway API ships the same pattern at the HTTP layer. Every POST /v1/payments requires an Idempotency-Key header. The server stores the result of the first successful call keyed on that header; subsequent calls with the same key replay the stored result without re-charging. UPI itself uses transaction reference numbers for the same purpose — the fact that retries on a flaky 4G connection don't double-debit a Bengaluru auto-rickshaw fare is exactly this mechanism.

The lesson: exactly-once is not a Kafka-specific feature. It is a design pattern — at-least-once delivery + a stable id + a server-side dedupe. Kafka's transactions are one industrial implementation of it, optimised for the case where producer, broker, and consumer all live in the same cluster.
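The HTTP-layer shape of the pattern can be sketched as a toy server: the first result is stored under the idempotency key and replayed on retries, so only one charge ever happens. The handler and response fields here are illustrative, not Razorpay's actual API:

```python
# Idempotency-Key replay in miniature: retries with the same key get the
# stored response and never re-charge. (Illustrative, not a real API.)
class PaymentServer:
    def __init__(self):
        self.results = {}   # idempotency key -> stored response
        self.charges = 0

    def create_payment(self, idem_key, amount):
        if idem_key in self.results:    # retry: replay, do not re-charge
            return self.results[idem_key]
        self.charges += 1               # the real side effect happens once
        resp = {"status": "captured", "amount": amount, "id": self.charges}
        self.results[idem_key] = resp
        return resp

s = PaymentServer()
r1 = s.create_payment("key-abc", 500)
r2 = s.create_payment("key-abc", 500)   # flaky-network retry
print(r1 == r2, s.charges)              # True 1
```

Note the key must be chosen by the client before the first send, exactly like the producer's sequence number: it names the intent, not the bytes.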

Performance cost of transactional EOS

The exactly-once mode is not free. Each transaction adds:

  1. Coordinator round trips. Registering partitions with the transaction coordinator and ending the transaction are extra RPCs on every commit interval.
  2. Control records. A COMMIT or ABORT marker is written to every partition the transaction touched, inflating the log.
  3. Consumer-side buffering. A read_committed consumer cannot release a record until its transaction's outcome is known, adding latency roughly proportional to the commit interval.

Confluent's measurements put the throughput overhead of enable.idempotence=true alone at roughly 3–10%, and full transactional EOS at about 20–25% relative to plain acks=all. For a Razorpay-class workload of ~5,000 transactions per second, the overhead is invisible. For a UPI-class workload of ~13,000 transactions per second per node it matters: the operational decision is whether the cost of a duplicate justifies the throughput haircut, and for money it always does.

The "exactly-once" debate that won't die

Tyler Treat and others have argued that "exactly-once" is a marketing term — that everything reduces to at-least-once + idempotence, and the phrase obscures the mechanism. They are right about the reduction. Where the phrase still earns its keep is to denote the configuration boundary: a system has crossed it when the user can rely on each input event causing exactly one logical output, without writing dedupe logic themselves. Kafka with enable.idempotence=true + transactional.id + read_committed is past that boundary; Kafka with bare acks=1 is not. Knowing where the line is, more than the name, is what matters.

Where this leads next

References

  1. Apurva Mehta, Jason Gustafson et al., KIP-98: Exactly-Once Delivery and Transactional Messaging in Kafka (2017) — the design doc for the producer epoch, transaction coordinator, and control records. cwiki.apache.org/confluence.
  2. Neha Narkhede, Exactly-Once Semantics Are Possible: Here's How Kafka Does It (Confluent, 2017) — the canonical engineering blog post. confluent.io/blog.
  3. Fischer, Lynch, Paterson, Impossibility of Distributed Consensus with One Faulty Process (JACM, 1985) — the foundational impossibility result that exactly-once-delivery cannot circumvent. groups.csail.mit.edu.
  4. Tyler Treat, You Cannot Have Exactly-Once Delivery (Brave New Geek, 2015) — the polemic that frames the entire debate; reduce to at-least-once plus idempotence. bravenewgeek.com.
  5. Razorpay, Idempotency in the Payment Gateway API — the production example at the HTTP layer, with Idempotency-Key header and replay semantics. razorpay.com/docs/api/idempotency.
  6. Stephan Ewen et al., Lightweight Asynchronous Snapshots for Distributed Dataflows (Apache Flink design) — how Flink layers two-phase commit over Chandy–Lamport snapshots to extend exactly-once across operator state. arxiv.org/abs/1506.08603.
  7. Kafka as a distributed log — the previous chapter; the substrate every transactional record sits on.
  8. The stream / table duality — chapter 175; the reason __consumer_offsets being a topic is the whole point.