Replication and ISR: how Kafka stays up
A partition, so far, is a single log on a single broker's local disk. The producer writes to it, consumers read from it, the offset cursor advances. Then that broker's NVMe drive dies during Diwali traffic, and 64 partitions of payments.tx_events go dark — Razorpay's fraud detector lags by 2 million records, the analytics warehouse stops ingesting, and the on-call engineer is paged at 11:47pm. Replication is the answer to that failure: copy each partition onto multiple brokers, elect one copy as the leader, and when the leader dies, promote a replica fast enough that nobody notices.
Each partition has a leader broker that handles all reads and writes, plus N-1 follower brokers that pull records and stay caught up. The set of followers that are not lagging is the in-sync replica set (ISR). A producer with acks=all only sees a write succeed once every ISR member has the record on disk. When the leader dies, the controller promotes one ISR member to leader; followers outside the ISR are forbidden from being promoted because their log might be missing records the leader had already acknowledged. ISR membership is the single mechanism that keeps Kafka durable, available, and consistent at the same time.
What gets replicated, and where
Every Kafka topic has two numbers: partitions (how many independent logs make up the topic) and replication.factor (how many copies of each partition exist across the cluster). For Razorpay's payments.tx_events topic with 64 partitions and replication.factor=3, the cluster holds 192 partition replicas total — three copies of each of 64 logs, each copy on a different broker. The cluster controller (a designated broker, elected via KRaft in Kafka 3.3+ or via ZooKeeper before that) decides which broker hosts which replica, trying to spread replicas across racks so a single rack-power-failure doesn't lose a quorum.
For each partition, exactly one of the three replicas is the leader; the other two are followers. All producer writes and all consumer reads (default config; follower-fetch is opt-in via KIP-392) go through the leader. The followers do nothing except pull records from the leader and write them to their own log. They are not load-bearing for throughput — they exist purely so the partition survives the leader's death. Why send all reads to the leader instead of load-balancing across replicas: the followers are eventually consistent with the leader, lagging by milliseconds. A consumer reading from a follower might see record 4203 as the latest while the leader already has 4210 — for a fraud detector tracking the high-water mark, that 7-record gap is a correctness bug, not a latency win. The leader-only-reads rule sidesteps the consistency question entirely; KIP-392 reintroduces follower reads only for the cross-region case where the latency saving outweighs the staleness.
A follower keeps up by issuing a Fetch request to the leader — the same RPC consumers use, but with a few flags set. The leader replies with whatever records the follower hasn't seen yet. The follower writes them to its own log, fsyncs (depending on flush.messages and flush.ms settings), and sends the next Fetch. This loop runs continuously, and the gap between the leader's log-end-offset and the follower's log-end-offset — measured in records — is the replica's lag.
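The loop is easy to model in miniature. Below is a toy simulation of the fetch cycle, with an invented `Log` class and record values (not broker internals), showing the lag shrinking to zero as the follower catches up:

```python
# Toy model of the follower fetch loop. Class names and numbers are
# invented for illustration; this is not real broker code.
class Log:
    def __init__(self):
        self.records = []

    @property
    def log_end_offset(self):  # next offset to be written
        return len(self.records)

    def append(self, batch):
        self.records.extend(batch)

def follower_fetch(leader: Log, follower: Log, max_records: int = 2):
    """One Fetch round trip: pull what we haven't seen, append, report lag."""
    start = follower.log_end_offset
    follower.append(leader.records[start:start + max_records])
    # A real follower also fsyncs here, per flush.messages / flush.ms.
    return leader.log_end_offset - follower.log_end_offset

leader, follower = Log(), Log()
leader.append([f"tx-{i}" for i in range(5)])  # producer wrote 5 records

lags = []
while True:
    lag = follower_fetch(leader, follower)
    lags.append(lag)
    if lag == 0:
        break

print(lags)  # lag shrinks on each round trip: [3, 1, 0]
```

The same loop, run continuously against a leader that never stops appending, is why lag is a moving quantity: the follower is always chasing a log-end-offset that keeps advancing.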
Followers replicate with the same Fetch RPC consumers use. The cluster controller maintains the (leader, replicas, ISR) tuple for every partition; a leader change is one write to the controller's metadata log, broadcast to all clients in their next metadata refresh. That is the entire replication topology. The producer and consumer don't see the followers at all; they discover the leader through a metadata refresh and send everything there. The followers exist only to keep up with the leader's log and to be promotable when the leader dies. Everything in the rest of this chapter — ISR, high-water marks, leader election, unclean elections — is mechanism layered on top of this picture.
One sharper way to read the topology: the leader is the truth of the partition; followers are promises to become the truth if the current leader dies. A follower that has every record up to the leader's HWM is a credible promise; one that's been silent for 30 seconds is not. Kafka's whole replication subsystem is the bookkeeping that distinguishes credible promises from broken ones, in real time, while producers and consumers keep flowing through the leader. The ISR is just the name for "the current set of credible promises", and min.insync.replicas is how many credible promises you require before a write is allowed to count as durable.
The high-water mark and what acks=all actually means
A producer with acks=0 doesn't wait for the broker to ack at all — it sends and forgets. With acks=1 it waits for the leader's ack; the followers haven't replicated yet, but the producer moves on. With acks=all it waits for every member of the current ISR to ack the record. The setting names hide what's happening underneath: acks=all does not mean "all replicas", it means "all in-sync replicas right now". This subtlety is what makes acks=all durable without making it forever-blocking when a follower is slow.
The leader doesn't expose every record it has written to consumers. There's a second offset called the high-water mark (HWM) — the largest offset that every ISR member has fsynced. Consumers can only read up to the HWM. So if the leader's log-end-offset is 4210 but only B2 has caught up to 4209 and B5 is at 4208, the HWM is min(4210, 4209, 4208) = 4208. Consumers see records 0..4208; records 4209 and 4210 exist on the leader but are invisible until the followers catch up.
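The HWM computation itself is one line: a `min()` over the ISR's log-end-offsets. A minimal sketch using the offsets from the example above (broker names are illustrative):

```python
# HWM = the largest offset every ISR member has replicated:
# a min() over log-end-offsets, using the offsets from the text's example.
isr_log_end_offsets = {"B1 (leader)": 4210, "B2": 4209, "B5": 4208}

hwm = min(isr_log_end_offsets.values())
print(hwm)  # 4208: consumers see 0..4208; 4209 and 4210 are invisible
```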
```python
# replicate_demo.py — what acks=all means in code
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers="kafka.razorpay.internal:9092",
    acks="all",                               # wait for all ISR members to ack
    enable_idempotence=True,                  # KIP-98: dedup on the broker side
    max_in_flight_requests_per_connection=5,  # safe with idempotence on
    retries=2147483647,                       # retry forever (idempotence makes this safe)
    delivery_timeout_ms=120_000,              # but cap total time end-to-end
    compression_type="zstd",
)

def send(tx):
    fut = producer.send(
        "payments.tx_events",
        key=tx["merchant_id"].encode(),
        value=json.dumps(tx).encode(),
    )
    md = fut.get(timeout=10)  # blocks until ISR has acked
    print(f"committed: partition={md.partition} offset={md.offset} hwm-implied")

for tx in load_pending_payments():
    send(tx)  # blocks ~5–15 ms typical, longer if ISR shrinks
producer.flush()
```
```
# Sample output — typical Razorpay production cluster, 3 brokers,
# replication.factor=3, min.insync.replicas=2:
committed: partition=3 offset=4209 hwm-implied
committed: partition=3 offset=4210 hwm-implied
committed: partition=27 offset=8843 hwm-implied

# A broker hits a brief disk slowdown — fetch latency rises:
committed: partition=3 offset=4211 hwm-implied   (took 38ms — one ISR member at p99)

# B5 falls out of ISR (replica.lag.time.max.ms exceeded):
committed: partition=3 offset=4212 hwm-implied   (took 4ms — only B1 and B2 in ISR now)
```
The lines that decide everything:

- `acks="all"` — required for durability. With `acks=1`, the leader can ack a record, then immediately crash before any follower replicates it; the new leader has no record of it, and the producer thinks it succeeded. `acks=all` blocks the ack until the ISR has the record. Why "all ISR" and not "all replicas": if a follower crashes, "all replicas" would block the producer until the dead replica recovers — minutes, sometimes hours. ISR membership is dynamic; a slow replica is removed from the ISR after `replica.lag.time.max.ms` (default 30 s), at which point it stops blocking the producer. Durability is preserved because the controller refuses to elect a non-ISR replica as leader during a normal failover.
- `enable_idempotence=True` — required for safe retries. Without it, a retry can produce a duplicate when the original ack was lost on the network but the broker did write the record. With idempotence, the producer attaches a sequence number to every record; the broker rejects duplicates by `(producerId, sequenceNumber)`. KIP-98 introduced the mechanism; Kafka 3.0+ enables it by default.
- `min.insync.replicas=2` (broker config, not shown in client code) — the durability floor. With `replication.factor=3` and `min.insync.replicas=2`, the producer can tolerate one replica being out of ISR and still produce; if two replicas drop out and only the leader is left, the producer's `acks=all` sends fail with `NotEnoughReplicasException` rather than silently downgrading to `acks=1`. This is the single most important knob in the entire replication subsystem.
- `fut.get(timeout=10)` — the future blocks until the ISR acks. In healthy operation this takes 5–15 ms (network round trip plus fsync on the slowest ISR member). When the cluster is unhealthy, this latency spikes; an ISR-shrink event correlates with a latency *drop*, because there are fewer replicas to wait for — but durability drops with it.
- `compression_type="zstd"` — compression happens once on the producer; the leader stores compressed batches and ships compressed batches to followers. Replication bandwidth is the cluster's silent cost driver, and zstd is the right knob: LZ4 is faster but compresses less, and gzip is slow.
ISR membership: when a follower falls out, when it comes back
A follower stays in the ISR if it satisfies one rule: it has fetched from the leader within the last replica.lag.time.max.ms (default 30 seconds since Kafka 2.5, via KIP-537; 10 seconds before that). Note the rule is time-based, not offset-based — a follower that's 1 million records behind but is fetching aggressively stays in the ISR; a follower that's only 5 records behind but hasn't sent a Fetch in 35 seconds gets kicked out. Why time-based and not offset-based: an offset-based threshold (e.g. "more than 4000 records behind") was the pre-0.9 design and turned out to be unworkable. During a producer burst — Razorpay's UPI flood at 9pm, when payments spike from 8k to 22k tx/sec — every follower briefly falls 50,000+ records behind even though they're all healthy and catching up at full speed. An offset-based ISR would shrink the ISR to {leader} during every burst and force min.insync.replicas violations for no good reason. Time-based ISR — "are you actively fetching?" — survives bursts cleanly because all the followers are still talking to the leader, just temporarily behind on the offset axis.
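The rule itself is a one-line predicate. A sketch, with the function and field names invented for illustration:

```python
# Sketch of the time-based ISR rule. Names are invented for illustration.
REPLICA_LAG_TIME_MAX_MS = 30_000  # replica.lag.time.max.ms default

def in_sync(last_caught_up_ms: int, now_ms: int) -> bool:
    """A follower stays in the ISR if it has kept up with the leader
    within the window, regardless of how many records behind it is."""
    return now_ms - last_caught_up_ms <= REPLICA_LAG_TIME_MAX_MS

now = 1_700_000_060_000
# 1M records behind but fetching aggressively (caught up 2 s ago): stays in.
print(in_sync(last_caught_up_ms=now - 2_000, now_ms=now))   # True
# Only 5 records behind but silent for 35 s: kicked out.
print(in_sync(last_caught_up_ms=now - 35_000, now_ms=now))  # False
```

Note what the predicate does not look at: the offset gap. That is the whole point of the post-0.9 design.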
When the controller notices that a follower hasn't fetched within the threshold, it removes that follower from the ISR via a write to the controller metadata log. The leader stops counting that follower's acks toward acks=all; the HWM may advance because there's one fewer constraint. When the follower starts fetching again — process restart, network recovers — it catches up to the leader's log-end-offset, and the controller adds it back to the ISR. Both transitions trigger an ISR_CHANGE notification that propagates to clients on their next metadata refresh.
The signal you watch in production is kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions (the per-partition form is kafka.cluster:type=Partition,name=UnderReplicated,topic=...,partition=...). A non-zero value means at least one partition has fewer in-sync replicas than replication.factor. During a rolling broker restart — say one of the three brokers is being patched — every partition led by another broker that has the patched broker as a follower will briefly show UnderReplicatedPartitions=1 while the patched broker catches up after it returns. The metric should return to zero within a minute or two; a partition that stays under-replicated for hours is a sign of a slow follower or an actually-dead broker that the controller hasn't given up on yet.
A second signal is min.insync.replicas violations: producer acks=all requests failing with NotEnoughReplicasException, visible on the broker as a rising kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec. Any non-zero value here means the cluster is operating in a degraded state — durability has dropped below the configured tolerance, and producers are visibly failing. The right runbook is "stop new traffic to that topic, find the broker that's out, fix it, let the ISR recover".
The HWM itself isn't a separate field shipped between brokers; it's piggybacked on Fetch responses. When a follower sends a Fetch to the leader, the response includes the leader's current HWM along with the records. The follower writes the new records to its log, then applies the new HWM locally — exposing those records to consumers reading from this follower (in follower-fetch mode). The same mechanism keeps consumer reads consistent: a consumer's Fetch to the leader returns records up to the leader's HWM, computed each time as min(log-end-offset across current ISR). This piggybacking is why HWM updates are essentially free — no separate RPC round-trip is needed; the HWM rides on the response that the follower or consumer was already going to receive. For Zerodha's market data fanout where every microsecond of Fetch latency matters during pre-open auction at 9:00 AM IST, this design choice keeps replication overhead under 200 µs per round-trip on a healthy cluster.
Watching the right metrics over time also catches the slow-degrading cases that don't show up as outright failures. A broker whose disk is gradually slowing down — bad NVMe block, filesystem fragmentation, EBS throttling — often manifests as ISR oscillation: that broker's followers keep falling out for a few seconds, getting back in, falling out again. The cluster's IsrShrinksPerSec and IsrExpandsPerSec both rise from their normal near-zero baseline to several per minute. Producers are still succeeding because the ISR has the other replicas, so no failed-produce alarm fires — but the durability margin is silently being eaten. Catching this requires alerting on the rate of ISR change events, not just on outright failure metrics. The operational rule is: if a broker is ISR-oscillating for more than 10 minutes, take it out of the cluster (controlled shutdown) before it triggers a real incident; a planned replacement during business hours is cheaper than a 3 a.m. page.
Leader election: clean and unclean
When a leader dies — heartbeat to controller stops, KRaft session times out — the controller picks a new leader from the ISR. The protocol is straightforward: take the first replica in the ISR ordering that's still alive, mark it as the new leader, broadcast a metadata update. Producers and consumers see NotLeaderForPartitionException on their next request, refresh metadata, find the new leader, retry. The whole process takes ~6 seconds on default config, dominated by the broker session timeout.
The reason ISR-only election is safe is the durability invariant. Every record below the HWM was acknowledged by every ISR member; therefore, every ISR member has every record below the HWM on disk. Any ISR replica can become leader and the new log-end-offset will be ≥ the old HWM. Records above the old HWM might be lost — they were on the old leader but not necessarily on every ISR member — but those records were never acknowledged to the producer, so the producer is free to retry them. Acknowledged records are never lost; non-acknowledged records may be. That's the contract.
The dangerous case is unclean leader election: the controller is told to elect a non-ISR replica as leader because every ISR member is unreachable. This can happen when the entire ISR fails — a rack power outage taking out all three brokers hosting a partition's replicas — and only an out-of-sync replica is left. The out-of-sync replica is by definition missing some records that were below the old HWM, which means promoting it loses already-acknowledged data. The trade-off is availability vs durability: refuse the unclean election (default since Kafka 0.11) and the partition stays offline until an ISR member returns; allow it (unclean.leader.election.enable=true) and the partition stays available but lossy.
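The durability arithmetic behind the two election modes fits in a few lines. A toy model with invented offsets, not broker code:

```python
# Toy election model. Offsets are invented for illustration.
# HWM = records acknowledged to producers; every ISR member has 0..HWM.
old_hwm = 4208

candidates = {
    "B2 (in ISR)": 4209,            # log-end-offset >= old HWM, guaranteed
    "B5 (out of ISR, stale)": 4100,  # missing acknowledged records
}

def acked_records_lost(candidate_leo: int, hwm: int) -> int:
    """Acknowledged records lost if this candidate becomes leader."""
    return max(0, hwm - candidate_leo)

print(acked_records_lost(candidates["B2 (in ISR)"], old_hwm))            # 0, clean
print(acked_records_lost(candidates["B5 (out of ISR, stale)"], old_hwm)) # 108, unclean
```

The clean case loses zero acknowledged records by construction; the unclean case loses exactly the gap between the old HWM and the stale replica's log-end-offset, which is why it is a deliberate, per-topic opt-in.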
For Razorpay's payments.tx_events, unclean.leader.election.enable=false is non-negotiable — losing payment records is not a survivable outcome. For an analytics topic carrying clickstream events that are already aggregated downstream, the same team sets it to true because a 4-hour partition outage during an ISR-collapse incident hurts more than losing 30 seconds of clickstream events. The right setting is per-topic, not per-cluster, and the conversation that decides it should happen during topic creation, not during the incident.
The leader-election step itself is not free; it's coordinated by the controller, and during the window between the old leader's death and the new leader's promotion, every producer write to that partition fails with NotLeaderForPartitionException. Modern producers retry on this error after a metadata refresh, so the user-visible effect is a brief latency spike (~6–10 seconds on default config) rather than dropped writes. But during this window, consumers also get the same error and pause — for a fraud detector this is the difference between processing 22,000 transactions per second and dropping to zero for ten seconds. Tuning controller.quorum.election.timeout.ms and replica.lag.time.max.ms together is how you trade off "how fast does the controller declare the leader dead" against "how often does a hiccup look like a death". The defaults are conservative — they prefer waiting longer over false-positive elections — but for very-low-latency systems the right call is to tighten them and accept more election churn.
Per-topic durability tuning: the three knobs that matter
Three configurations together decide a topic's durability/availability profile:
| Knob | Where | Default | Razorpay payments | Razorpay analytics |
|---|---|---|---|---|
| `replication.factor` | per-topic | 1 (test); 3 (prod) | 3 | 3 |
| `min.insync.replicas` | per-topic | 1 | 2 | 1 |
| `unclean.leader.election.enable` | per-topic or broker | false | false | true |
The replication.factor=3 + min.insync.replicas=2 combination is the canonical "tolerate one replica failure with no impact, fail fast on two replica failures" setup. Producers with acks=all succeed when at least 2 of 3 ISR members are alive; they fail loudly when only 1 is left, preventing the silent loss that happens when the cluster degrades to single-replica-on-a-dying-disk. This is the configuration every payments-grade topic should use. Why not min.insync.replicas=3: it sounds safer but is operationally fragile. With replication.factor=3 and min.insync.replicas=3, any follower falling out of ISR — even briefly during a rolling restart — fails every producer write. The cluster has no tolerance for normal lifecycle events. The standard ratio is replication.factor - 1, leaving headroom for one normal failure.
For the analytics topic, min.insync.replicas=1 plus unclean.leader.election.enable=true flips the trade-off: producer writes succeed even with only the leader alive, and a total ISR collapse is recoverable by promoting whatever stale replica exists. The cost is up to a few minutes of clickstream gaps in pathological cases. For data that's aggregated 5-minute-windowed downstream and then dropped, this is fine.
There's one other knob that matters less but trips people up: replica.lag.time.max.ms. Default 30 s since Kafka 2.5 (KIP-537), 10 s before that. The longer the threshold, the more lag a follower can have before being kicked out — which means more risk of data loss on a leader failure (because the ack-confirmed records may not yet be on a slow follower) but fewer ISR shrink-grow oscillations during normal traffic bursts. The 30 s default is calibrated for cloud-block-storage-backed brokers (EBS, persistent disks) where p99 fsync can exceed 200 ms during the noisy-neighbour minute; on local NVMe the old 10 s default would also work fine. The monitoring tells you which: if kafka.server:type=ReplicaFetcherManager,name=MaxLag is regularly above 1000 records during normal operations, you're seeing replication noise that the long threshold is hiding.
A worked example of how the three knobs interact: BookMyShow's seat-booking topic during a Coldplay-tour ticket release at 11am IST. Producer traffic spikes from 200 records/sec to 18,000 records/sec in under 5 seconds. With replication.factor=3 and min.insync.replicas=2, a single follower briefly falling out of ISR during the burst causes no producer failure — the leader plus the other follower still satisfy min.insync.replicas=2. With replication.factor=3 and min.insync.replicas=3, that same brief blip would have failed every producer write across the spike, which is the worst possible time for the topic to fail. Choosing min.insync.replicas = replication.factor - 1 is what makes the cluster bend without breaking under load, and the Coldplay-tour incident report from the BookMyShow SRE team in 2024 was the case study that propagated this rule across a dozen Indian companies.
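The broker-side check that decides bend-or-break can be sketched directly; the function name and burst numbers are illustrative, not broker code:

```python
# Does an acks=all write succeed? Sketch of the broker-side check,
# with the seat-booking burst scenario as invented example numbers.
def write_accepted(isr_size: int, min_insync_replicas: int) -> bool:
    """The broker rejects acks=all produces with NotEnoughReplicasException
    when the current ISR is smaller than min.insync.replicas."""
    return isr_size >= min_insync_replicas

# replication.factor=3; one follower blips out of the ISR during the spike:
isr_during_burst = 2  # leader plus one follower still in sync

print(write_accepted(isr_during_burst, min_insync_replicas=2))  # True: bends
print(write_accepted(isr_during_burst, min_insync_replicas=3))  # False: breaks
```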
Common confusions
- "
acks=allwaits for all replicas, not just ISR." No — it waits for all in-sync replicas, the dynamic ISR set, not the static replicas list. A replica that's been kicked out of ISR doesn't block producer acks; that's deliberate, otherwise a single dead replica would block all producer writes forever. Durability comes frommin.insync.replicas, not fromacks=allalone. - "Followers serve reads to balance load." Default no — all reads go to the leader. Follower-fetch (KIP-392) exists for cross-region cases where reading a same-region follower beats round-tripping to a leader in another region, but it's an opt-in feature with explicit consistency caveats; the followers serve a slightly stale view of the log up to their own HWM.
- "More replicas means more durability." Up to a point —
replication.factor=3survives one broker failure,=5survives two. Beyond that, the extra replicas are mostly cost (storage, replication bandwidth, fsync amplification) without proportional safety, since correlated failures (rack power, kernel bug rolling out cluster-wide) dominate independent disk failures. - "Unclean leader election is just a safety net." It is catastrophic data loss. Every record between the failed leader's HWM and the unclean replica's log-end-offset is gone. Set
unclean.leader.election.enable=falsefor any topic where data loss is unrecoverable; set it totrueonly when you've explicitly decided availability matters more than durability for that topic. - "The HWM is the same as the leader's log-end-offset." Different. The leader's log-end-offset is the next offset it would write; the HWM is
min(log-end-offset across ISR). Records between HWM and log-end-offset exist on the leader but are invisible to consumers, because they aren't yet replicated to all ISR members. - "
min.insync.replicasis a producer setting." No — it's a topic-level (or broker-level default) setting. The producer just sendsacks=all; the broker enforcesmin.insync.replicasand replies withNotEnoughReplicasExceptionif the ISR is too small. Setting it on the producer side does nothing.
Going deeper
KRaft and the controller's metadata log
Pre-3.3, leader and ISR membership lived in ZooKeeper — a separate cluster, with its own quorum, replication, and operational story. KRaft mode (KIP-500, GA in 3.3, default in 4.0) folds that metadata into a Kafka topic of its own: __cluster_metadata. The controller is now one of the brokers, elected via Raft from a quorum of "controller-eligible" brokers (typically 3–5). Every leader change, every ISR shrink, every topic creation is a Raft-replicated record in the metadata log. Brokers tail this log to learn the current cluster state; clients refresh metadata via the standard Metadata request. The payoff: one fewer cluster to operate, faster failover (the controller doesn't need to read ZooKeeper at startup; it just reads its in-memory state), and metadata changes that scale with the broker count instead of the ZooKeeper write quorum. By 4.0, ZooKeeper support is removed entirely; KRaft is the only option.
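A toy model of the idea, with invented record shapes rather than the real `__cluster_metadata` schema: a leader change is one appended record, and each broker folds the log into its in-memory view:

```python
# Toy of the KRaft metadata log. Record shapes are invented for
# illustration; the real __cluster_metadata schema differs.
metadata_log = []  # Raft-replicated, append-only

def apply(state, record):
    """Each broker tails the log and folds records into its in-memory state."""
    key = (record["topic"], record["partition"])
    state[key] = {"leader": record["leader"], "isr": record["isr"]}
    return state

# A leader change is a single record appended to the log:
metadata_log.append({"topic": "payments.tx_events", "partition": 3,
                     "leader": "B2", "isr": ["B2", "B1"]})

broker_view = {}
for rec in metadata_log:
    broker_view = apply(broker_view, rec)

print(broker_view[("payments.tx_events", 3)]["leader"])  # B2
```

The point of the structure: because the log is totally ordered by Raft, every broker that has applied the same prefix has the same view, and "broadcast a metadata change" reduces to "append one record".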
Rack awareness and the cost of correlated failure
Kafka's default replica placement spreads replicas across brokers but not necessarily across racks. With broker.rack=ap-south-1a configured per broker, Kafka uses the rack-aware placement strategy: among the candidate brokers, pick one from each rack first. For a 3-replica partition in a cluster spanning three racks, this guarantees one replica per rack. The reason: AWS Mumbai (ap-south-1) availability-zone outages happen — ap-south-1c had a 4-hour outage in 2023. Without rack awareness, a 3-broker cluster can put all 3 replicas in ap-south-1c and lose the partition entirely; with rack awareness, you lose 1 replica out of 3 and the cluster keeps running. The configuration cost is one line in server.properties; the operational saving is "did not get paged during the AZ outage". For any production Kafka, rack awareness is non-negotiable.
Leader epochs: the subtle bug that KIP-101 fixed
Pre-Kafka-0.11, leader-failover used the HWM alone to decide which records a recovering follower should truncate. The problem: imagine leader L1 dies, follower F gets promoted to leader L2, accepts new writes, then dies. L1 comes back. L1 has records 100-105 from its old tenure that L2 never saw; L2 had records 100-103 from its own tenure that L1 doesn't have on disk. Both leaders had the same offset range "100-105" but with different bytes at those offsets. The pure HWM-truncation logic couldn't distinguish them; the recovering replica might keep the wrong fork. The fix in KIP-101 was the leader epoch: every time a leader is elected, the controller increments an epoch number, and every record carries the epoch under which it was written. On recovery, a replica asks the current leader "what's the largest offset I had under epoch E?" and truncates everything above that point. The fork is detected and resolved deterministically. This is the kind of bug that's invisible in normal operation and only shows up under back-to-back leader failures — which is exactly what happens during a rack power flap. Every modern Kafka deployment runs with KIP-101 enabled (default since 0.11); it's the unsung hero of Kafka's actual durability.
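A toy reconstruction of epoch-based truncation, with logs as (epoch, value) lists and the divergence-resolution RPC simplified to one local function; none of this is real broker code:

```python
# Toy model of KIP-101 epoch-based truncation. Data and functions are
# invented for illustration; the real protocol is an RPC to the leader.
# A log is a list of (epoch, value); the list index is the offset.
l1 = [(1, "a"), (1, "b"), (1, "c"), (1, "d")]  # old leader: all epoch 1
l2 = [(1, "a"), (1, "b"), (2, "x"), (2, "y")]  # new leader: diverged at offset 2

def last_offset_for_epoch(log, epoch):
    """The current leader answers: largest offset written under
    epochs <= `epoch` (simplified version of the real query)."""
    offsets = [i for i, (e, _) in enumerate(log) if e <= epoch]
    return max(offsets) if offsets else -1

def recover(follower, leader):
    """Recovering replica truncates its fork past the divergence point,
    then re-fetches the rest from the leader."""
    my_last_epoch = follower[-1][0]
    safe = last_offset_for_epoch(leader, my_last_epoch)
    return follower[:safe + 1] + leader[safe + 1:]

print(recover(l1, l2))  # l1 drops its epoch-1 fork and converges on l2's log
```

Note what pure HWM truncation could not do here: offsets 2 and 3 exist on both replicas, so an offset comparison alone sees no conflict; only the epoch tag reveals that the bytes at those offsets belong to different leader tenures.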
Cross-region replication: MirrorMaker 2 vs Confluent Replicator vs custom Connect sinks
In-cluster replication keeps the cluster up; cross-region replication keeps the region up. The standard tool is MirrorMaker 2 (KIP-382), which is itself a set of Kafka Connect source connectors that read from one cluster and write to another. The trade-off: cross-region replication is asynchronous (acks=all between regions would mean every write blocks on a 30 ms India-Singapore round-trip), so the destination cluster lags by 1–10 seconds in steady state and may lose data in that window if the source region collapses. For Razorpay's DR setup — Mumbai primary, Singapore standby — MirrorMaker 2 replicates payments.tx_events continuously; if Mumbai goes down, the Singapore cluster has roughly 5 seconds of lag, which is recovered by the producer's reconciliation logic against the upstream payment gateway logs. The DR runbook isn't "no data loss"; it's "1-minute window of records to manually reconcile after region failover", which is workable.
Operating replication: offline partitions, bandwidth, and leader skew
A partition with no in-sync leader is OFFLINE from the cluster's perspective. Producer writes fail with NotLeaderForPartitionException and don't retry to a different broker because there's nowhere to go. Consumer fetches fail the same way — with no leader, nothing can serve reads past the now-frozen HWM. The fix is one of: wait for an ISR member to recover; force a clean election once one returns (kafka-leader-election.sh --election-type preferred --topic ... --partition ...); or, in extremis, force an unclean election (--election-type unclean). The dashboard signal kafka.controller:type=KafkaController,name=OfflinePartitionsCount should be zero; any non-zero value is a P1 incident.
The other operational reality is bandwidth. A producer batch of 1 MB compressed lands on the leader and is sent verbatim to every follower; with replication.factor=3, outbound replication traffic is 2 × inbound producer traffic. For Razorpay's payments.tx_events ingesting 80 MB/sec of compressed records during peak UPI hours, the cluster is moving 160 MB/sec of replication traffic on top of producer ingest, plus consumer reads. On a 10 Gbps NIC this is fine; on older 1 Gbps bare-metal deployments it bottlenecks before disk IO does. The mitigation is even leader placement — spread leaders so no single broker is leader for too many high-traffic partitions. kafka-reassign-partitions.sh plus the leader-skew metric (kafka.controller:type=KafkaController,name=PreferredReplicaImbalanceCount) is how operators rebalance; auto.leader.rebalance.enable=true (default on) automates it once skew exceeds a threshold.
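The bandwidth multiplication is worth writing out, using the ingest figure from the text:

```python
# Replication bandwidth arithmetic for the cluster as a whole.
replication_factor = 3
producer_ingest_mb_s = 80  # compressed, the text's peak-UPI example

# Each record lands on the leader once, then ships to RF-1 followers:
replication_out_mb_s = (replication_factor - 1) * producer_ingest_mb_s
print(replication_out_mb_s)  # 160 MB/s of follower-bound traffic

# Total write traffic before a single consumer read is served:
total_mb_s = producer_ingest_mb_s + replication_out_mb_s
print(total_mb_s)            # 240 MB/s hitting NICs and disks
```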
Where this leads next
Replication and ISR give you a partition that survives a broker death. They don't tell you what happens when the records themselves get old — when the topic's retention boundary hits, or when log compaction kicks in to reclaim space.
That's the next chapter, /wiki/retention-compaction-tiered-storage: how time-based and size-based retention interact with consumer lag, how compaction works on __consumer_offsets and other compacted topics, and how tiered storage (KIP-405, GA in 3.6) lets a topic hold years of history without putting it all on broker disks.
Beyond Build 7, the durability story you just learned is the foundation for /wiki/exactly-once-semantics-the-three-words-and-what-they-mean (Build 9). Idempotent producers — the enable_idempotence=True line in the code block — are one half of the puzzle; transactional commits across multiple partitions are the other half, and both rely on the same acks=all + ISR + HWM machinery you just walked through. Every higher-level guarantee in Kafka reduces to "what does it take to make this durable?", and the answer is always "every ISR member has it on disk, and the HWM has advanced past it".
The replication primitives also feed straight into stream processing. When a Flink or Spark Streaming job checkpoints its state, the consumer offsets it commits to Kafka are durable only if __consumer_offsets itself is replicated correctly — that internal topic ships with replication.factor=3 and min.insync.replicas=2 by default for exactly the reason this chapter walked through. The same machinery that protects payments.tx_events also protects the consumer-progress metadata of every team running off it.
The cleanest mental model to carry forward: a Kafka partition is a single log, replicated synchronously across N brokers via the leader-follower protocol, with the in-sync replica set acting as a dynamic quorum. Every other Kafka feature — consumer groups, transactions, log compaction, tiered storage — assumes this primitive works, and is structured around it.
One more lens worth carrying forward: replication in Kafka is synchronous within a region and asynchronous across regions. The synchronous side is what gives you the durability guarantees this chapter walked through — acks=all blocks until the ISR has the record. The asynchronous side, which the Going-Deeper section sketched, is the operational reality of disaster recovery: you cannot afford a 30 ms cross-region round trip on every write, so the cross-region copy lags. Most production Kafka decisions reduce to "where do I want synchronous replication, and where am I willing to accept asynchronous?" — which is the same question every distributed system has to answer, with the same trade-off curve.
References
- Apache Kafka — Replication — the canonical docs covering ISR semantics, leader election, and the high-water mark.
- KIP-101: Alter Replication Protocol to use Leader Epoch rather than High Watermark for Truncation — fixes a subtle data-loss bug that pre-0.11 replication had during back-to-back leader failures; the leader-epoch field is the mechanism.
- KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum — the KRaft design doc; explains why ZooKeeper had to go and how the controller now stores metadata in a Kafka topic.
- KIP-392: Allow consumers to fetch from closest replica — the follower-fetch design; spells out the consistency caveats and the rack-routing logic.
- Jay Kreps — A Few Notes on Kafka and Jepsen — Jepsen's analysis of Kafka's replication protocol and the durability guarantees it actually provides.
- /wiki/consumer-groups-and-offset-management — the previous chapter; the offset cursor that this chapter's HWM-based machinery makes durable.
- /wiki/retention-compaction-tiered-storage — the next chapter; what governs how long a record stays in the partition once it's safely replicated.
- Cloudflare — How we built a durable Kafka deployment on top of unreliable EBS — production-grade walkthrough of `min.insync.replicas`, rack-aware placement, and the failure modes that drive the configuration choices.