Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Kafka as a distributed log

CricStream's live-events team spent the first six months of their 2025 rebuild trying to use Kafka the way they had used RabbitMQ — as a queue. They created a topic per match, set enable.auto.commit=true, and wired one consumer per microservice. During the IPL final at over-26 between Mumbai and Chennai, 41 million phones were holding open WebSocket connections to their edge fleet, the broker cluster was sitting at 70% network out, and their dashboards started showing the play-by-play feed lagging by 90 seconds. The on-call engineer, Karthik, pulled kafka-consumer-groups.sh and saw the lag for match-events-v3 partition 12 sitting at 2.3 million records. Restarting the consumer didn't help. Scaling out to ten consumers didn't help — they all rebalanced and one of them got stuck reading the same partition. The problem was not throughput. The problem was that they had thought Kafka was a queue. It is not. Kafka is a distributed, partitioned, append-only log — and the moment you internalise that, almost every operational mystery resolves.

This chapter is about what Kafka actually is, mechanically, on disk and on the wire. Not "what's the difference between a queue and a stream" (covered in /wiki/queues-vs-streams-the-fundamental-split) — but the storage layout, the partition-leadership protocol, the in-sync replica set, and why a single Kafka cluster can ingest more than a terabyte per minute on commodity hardware while a cluster of "real" message brokers struggles past a few gigabytes per second.

Kafka is a distributed append-only log, sharded into partitions, replicated across brokers, with consumer position stored externally as offsets. Producers append to the leader of a partition; followers replicate from the leader; the in-sync replica set (ISR) determines which writes are durable. Reads go to the leader (by default) and stream sequentially from the consumer's offset. Throughput comes from sequential disk I/O, zero-copy sendfile, and partition-level parallelism. The price you pay is that ordering is per-partition, not per-topic, and rebalancing consumer groups stalls the partition until reassignment completes.

What is on disk — the segment file

A Kafka topic is a logical name. A Kafka partition is a directory of files on a broker's disk. Open /var/lib/kafka/data/match-events-v3-12/ on the leader broker for partition 12 of match-events-v3 and you will see something like this:

00000000000000000000.log     1.0G   the actual records
00000000000000000000.index   10M    offset → byte-position lookup
00000000000000000000.timeindex  10M  timestamp → offset lookup
00000000000002738451.log     1.0G   next segment
00000000000002738451.index   10M
00000000000002738451.timeindex  10M
00000000000005478923.log     0.4G   active segment (currently being written)
00000000000005478923.index   10M
leader-epoch-checkpoint
partition.metadata

Each .log file is a segment — a contiguous slice of the log, named by the offset of its first record. Records are appended to the active segment until it hits the configured roll size (1 GB by default) or roll time (7 days by default), at which point a new segment is opened. The .index file is a sparse map from offset to byte position within the segment, indexed every 4 KB so that a consumer asking for offset 4,219,876 can mmap the index, binary-search to the nearest indexed offset, then linearly scan a few KB of the .log to find the exact record. The .timeindex is the same idea but keyed by timestamp.

Kafka partition layout — segments, index, and active write headA horizontal layout of three log segments on disk, each with associated .index and .timeindex sidecar files, and an arrow marking the active write head at the end of the third segment. Illustrative. partition 12 directory on leader broker segment 0 offsets 0 .. 2,738,450 closed, immutable, 1.0G segment 1 offsets 2,738,451 .. 5,478,922 closed, immutable, 1.0G segment 2 (active) offsets 5,478,923 .. ? open, append-only .index — sparse offset→byte one entry per 4KB written .timeindex — ts→offset enables time-based seek .index + .timeindex (live) updated on every roll W write head Producers append to segment 2. Consumers stream from any offset across any segment, using .index for O(log n) seek + sequential read. Illustrative — defaults shown (1.0 GB segment, 4 KB index granularity).
A Kafka partition is a directory of immutable closed segments plus one active segment being appended to. The sparse index sidecars give O(log n) lookup followed by sequential scan — the same trick LSM-tree storage uses. Illustrative.

Why segments instead of one big file: retention, compaction, and crash recovery. Retention by time or size means deleting old segments, which is unlink() on a closed file — atomic, fast, no rewriting needed. Log compaction (covered later) rewrites only the closed segments, never the active one. Crash recovery only has to scan the active segment, not the whole 100 GB partition. And segment rolling means the index files stay small enough to mmap cheaply — a 1 GB segment with 4 KB index granularity has roughly 250,000 index entries, about 4 MB of index — fits in page cache easily. If Kafka used a single growing file per partition, every one of these operations would degrade as the partition grew.

Why this is fast — sequential disk and the page cache

The single most important performance fact about Kafka is that it does not maintain a per-message index, does not have a B-tree, and does not implement its own caching layer. It writes records sequentially to a file and reads them sequentially from a file. The Linux page cache does the rest.

Here is what happens on a producer write:

  1. Producer sends a batch of records over TCP to the leader broker for the partition.
  2. Broker appends the batch to the active segment via a single write() syscall — sequential append to the end of the file.
  3. The OS buffers the write in the page cache. Kafka does not call fsync() per write (by default).
  4. Broker waits until the followers in the ISR have replicated the batch (for acks=all).
  5. Broker sends the ack back to the producer.

And on a consumer read:

  1. Consumer sends a Fetch request with (topic, partition, offset).
  2. Broker uses the .index file to find the byte position of offset in the segment.
  3. Broker calls sendfile(socket_fd, file_fd, offset, length) — the zero-copy path. The data goes from page cache to the network card's DMA buffer without ever entering Kafka's user-space process.
  4. Consumer receives the bytes, decodes records, hands them to the application.

Why sequential I/O is so much faster than random I/O on modern hardware: spinning disks famously had ~5 ms seek latency vs ~100 MB/s sequential throughput, a 500× gap. NVMe SSDs have closed the gap dramatically — a modern Samsung 990 Pro does ~7 GB/s sequential and ~700 MB/s random for 4 KB reads, only a 10× gap. But the page-cache effect is even bigger. A sequential scan of a hot Kafka partition reads almost entirely from RAM at ~30 GB/s, while a random-read OLTP workload misses the cache constantly and bottlenecks on the SSD's IOPS budget. Kafka's design extracts every byte of this advantage — appends are sequential, reads stream forward, the active segment is always hot in page cache, and sendfile skips user-space copying entirely. The net result is that one Kafka broker on commodity hardware can sustain 1 GB/s producer ingest and simultaneously serve 3 GB/s consumer reads — three to five times what a queue broker like RabbitMQ delivers on the same hardware.

Partitions, leaders, and the in-sync replica set

A topic with replication-factor 3 and 24 partitions has 72 partition replicas — three copies of each partition spread across the broker fleet. For each partition, exactly one replica is the leader. All producer writes for that partition go to the leader. All follower replicas pull from the leader (Fetch requests, the same protocol consumers use). When a follower has caught up to the leader's high watermark, it is in the in-sync replica set (ISR).

The ISR is the heart of Kafka's durability story. A write with acks=all is acknowledged to the producer only after every replica in the ISR has appended it. If the leader crashes, the controller picks a new leader from the ISR — never from a follower outside the ISR, because that would lose data.

Partition leader, follower replication, and the in-sync replica setA diagram showing one partition leader and three follower replicas. The leader pushes records to followers via the Fetch protocol. Two followers are inside the ISR (in-sync replica set) shown by a green border, and one slow follower is outside the ISR shown by a dashed border. Illustrative. ISR (3 replicas) Broker 1 — Leader HW = 5,478,923 Broker 2 — Follower caught up Broker 3 — Follower caught up Outside ISR (slow) Broker 4 — Follower lag > 30s, kicked Producer (acks=all) ack returned only after all 3 ISR replicas append Broker 4 (out of ISR) is NOT required for ack but a leader-election cannot pick it as leader Illustrative — defaults: replica.lag.time.max.ms = 30000.
The ISR is the set of replicas that have caught up to the leader's high watermark within `replica.lag.time.max.ms` (default 30 s). A producer with `acks=all` waits for every ISR member to append; an ISR shrink (a follower falling behind) reduces the durability cost but increases blast radius if the leader fails. Illustrative.

A follower drops out of the ISR if it has not caught up to the leader within replica.lag.time.max.ms (default 30 s). The drop is automatic — the controller updates ZooKeeper / KRaft metadata and producers waiting on acks=all see the ack happen sooner because the slow follower is no longer required. This is a critical Kafka behaviour to understand: the durability guarantee silently weakens when followers fall behind. A topic configured with replication-factor 3 and min.insync.replicas=2 is durable even if one follower drops, but if two drop (ISR shrinks to 1, just the leader), acks=all writes start failing with NotEnoughReplicasException. PaySetu's payment-event topic is configured replication-factor=5, min.insync.replicas=3 precisely so they can lose two brokers and still keep ingesting.

A code walkthrough — producer, consumer, and observing the offset

Here is a runnable end-to-end example that produces records to a partition, consumes them with offset tracking, and prints the partition layout. It uses kafka-python for the client; you can run it against a single-node Kafka in Docker.

# kafka_log_demo.py — produce, consume, observe offsets.
# Requires: pip install kafka-python  (and a running Kafka broker)
from kafka import KafkaProducer, KafkaConsumer, TopicPartition
from kafka.admin import KafkaAdminClient, NewTopic
import json, time, uuid

BOOTSTRAP = "localhost:9092"
TOPIC = "match-events-demo"
PARTITIONS = 4

# 1. Create the topic with 4 partitions, replication-factor 1 (single broker demo).
admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
try:
    admin.create_topics([NewTopic(name=TOPIC, num_partitions=PARTITIONS,
                                  replication_factor=1)])
except Exception:
    pass  # topic already exists

# 2. Producer — keyed by match_id so all events for one match land on one partition.
producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP,
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
    acks="all",                # wait for full ISR
    enable_idempotence=True,   # producer-side dedup via (pid, seq)
)

matches = ["MUM-vs-CSK", "RCB-vs-KKR", "DC-vs-PBKS"]
for i in range(15):
    match = matches[i % 3]
    event = {"event_id": str(uuid.uuid4()), "match": match,
             "type": "boundary" if i % 4 == 0 else "single",
             "ts": time.time()}
    fut = producer.send(TOPIC, key=match, value=event)
    md = fut.get(timeout=5)  # block for the ack
    print(f"  {match:13s} -> partition={md.partition} offset={md.offset}")

producer.flush()

# 3. Consumer — assigned all partitions, reads from the beginning.
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP,
                         enable_auto_commit=False,
                         group_id="demo-group",
                         auto_offset_reset="earliest")
parts = [TopicPartition(TOPIC, p) for p in range(PARTITIONS)]
consumer.assign(parts)
consumer.seek_to_beginning(*parts)

print("\nreading back:")
for _ in range(15):
    msg = next(consumer)
    print(f"  p={msg.partition} off={msg.offset:3d} key={msg.key.decode():13s}"
          f" type={json.loads(msg.value)['type']}")

consumer.close()
producer.close()

Sample run:

  MUM-vs-CSK    -> partition=2 offset=0
  RCB-vs-KKR    -> partition=0 offset=0
  DC-vs-PBKS    -> partition=3 offset=0
  MUM-vs-CSK    -> partition=2 offset=1
  RCB-vs-KKR    -> partition=0 offset=1
  ... (10 more)

reading back:
  p=0 off=  0 key=RCB-vs-KKR    type=boundary
  p=0 off=  1 key=RCB-vs-KKR    type=single
  p=0 off=  2 key=RCB-vs-KKR    type=single
  p=0 off=  3 key=RCB-vs-KKR    type=single
  p=0 off=  4 key=RCB-vs-KKR    type=single
  p=2 off=  0 key=MUM-vs-CSK    type=boundary
  p=2 off=  1 key=MUM-vs-CSK    type=single
  ...

Walkthrough of the load-bearing lines:

  • key=match, value=event — Kafka's default partitioner hashes the key to pick a partition (murmur2(key) % num_partitions). All events for one match land on one partition, which gives per-match ordering without sacrificing parallelism across matches. This is the right way to use partitions: pick the key whose ordering matters and accept that ordering across keys is not preserved.
  • acks="all" + enable_idempotence=Trueacks=all waits for every ISR replica; enable.idempotence adds the producer-side (pid, seq) dedup so retries don't create duplicates inside the broker. Together, they are the production default.
  • enable_auto_commit=False — the demo manages offsets manually. Production code should commit only after the side effect is durable (database write succeeded, downstream call returned 200, etc.) — otherwise a crash between processing and commit replays correctly, but a crash between commit and processing loses the message.
  • consumer.seek_to_beginning(*parts) — replaying from offset 0 demonstrates that the log retains records, not just delivers them. This is the queue-vs-log distinction in one line: you can rewind, replay, and have a second consumer group read the same records from the same offsets.
  • Output ordering — note that records arrive grouped by partition, not by time. A consumer reading multiple partitions sees records in per-partition order, not global order. This is the constraint that surprises every team migrating from RabbitMQ.

Why offset commits and side-effect durability must be coordinated: the consumer's offset is the cursor for "where did I leave off". If you commit the offset before the side effect (e.g. the database write), and the process crashes between commit and side effect, the message is lost — on restart you read from the committed offset, past the un-applied record. If you commit after the side effect, and the process crashes between side effect and commit, the message is replayed — your idempotency layer absorbs the duplicate (covered in /wiki/at-least-once-idempotency-in-practice). The only safe ordering is side-effect first, commit second — and the side-effect must be idempotent.

Throughput numbers — what one cluster can actually do

Numbers from a real production setup are useful for calibrating expectations. Below are figures from CricStream's match-events-v3 topic during a peak IPL final, on a 12-broker cluster (each broker: 24 cores, 192 GB RAM, 4× 3.84 TB NVMe in RAID-10, 25 Gbps networking).

Metric Steady state (regular match) Peak (IPL final, last over)
Producer throughput 180 MB/s 1.4 GB/s
Records per second 420,000 3.1 million
Consumer throughput (sum of all groups) 540 MB/s 4.2 GB/s
Disk write IOPS (per broker) 8,000 62,000
Page-cache hit rate 99.7% 99.4%
99p produce latency 12 ms 38 ms
99p consumer fetch latency 6 ms 24 ms

The ratio of consumer-out to producer-in is roughly 3× because three downstream consumer groups read each record (the live-feed pusher, the analytics pipeline, the highlights generator). Page-cache hit rate stays above 99% because the active segment of every partition fits in RAM — 12 brokers × 192 GB ≈ 2.3 TB of cache, against an active working set of ~600 GB.

Why this matters for sizing: the binding resource for Kafka is rarely CPU. It is either network bandwidth (when consumer fan-out is high) or page-cache size (when consumers fall behind and have to read older segments from SSD). A common sizing mistake at PaySetu and KapitalKite was provisioning brokers with high CPU but only 32 GB of RAM, then watching consumer-lag-induced cache misses turn 1ms reads into 8ms reads, which cascaded into producer back-pressure. The rule of thumb: size RAM to hold (active segment + last few hours of retention) per broker. SSD capacity is for retention; RAM is for throughput.

Common confusions

  • "A topic preserves order." No — a topic is an unordered union of its partitions. Order is preserved per partition only. If you need global ordering, you need a single-partition topic (which caps throughput at one broker's write rate, ~1 GB/s).
  • "Consumers are pushed messages." No — Kafka consumers pull. They send Fetch(topic, partition, offset) requests and the broker streams records back. This pull model is why one slow consumer cannot back-pressure the producer or other consumers — the slow consumer just falls behind in the log.
  • "Kafka is exactly-once." Kafka delivers at-least-once with acks=all, and the idempotent producer eliminates broker-side duplicates from a single producer instance. End-to-end exactly-once semantics requires Kafka transactions plus the consumer/producer being part of the same transactional read-process-write loop. See /wiki/exactly-once-and-the-semantics-debate.
  • "Replication-factor 3 means I survive any 2 broker losses." Only if min.insync.replicas=1. With min.insync.replicas=2 (the production default), losing 2 of 3 brokers stops accepting writes — the cluster prefers unavailability over silently weakening durability. This is by design.
  • "Adding partitions scales any topic." Adding partitions does not redistribute existing records. Old partitions keep their data; new keys only go to new partitions if the partitioner maps them there. For an existing topic, adding partitions changes the key→partition mapping for every key, breaking per-key ordering for in-flight consumers — you usually want to create a new topic instead.
  • "Consumer groups load-balance like RabbitMQ queues." Almost — but a single partition can be read by at most one consumer in the group. If you have 4 partitions and 8 consumers in one group, 4 consumers sit idle. The unit of consumer parallelism is the partition, not the message.

Going deeper

Log compaction — keys, not records

A Kafka topic configured with cleanup.policy=compact retains, for each key, only the most recent record. The broker periodically rewrites closed segments, dropping records whose key has been overwritten by a later record. Compacted topics are the foundation of changelog topics (Kafka Streams' state stores), __consumer_offsets (the topic Kafka itself uses to store consumer-group offsets), and event-sourced "current state" projections. The trade-off: a compacted topic is no longer a complete history — you lose the ability to replay every event — but you gain bounded storage proportional to distinct-key count rather than event count. PaySetu uses a compacted topic for merchant-config-v2 (the latest config per merchant) and a non-compacted topic for payment-events-v3 (every payment, retained 30 days for audit).

KRaft — replacing ZooKeeper with Raft

Until Kafka 2.8, cluster metadata (topic list, partition assignments, ISR membership, broker liveness) lived in ZooKeeper, a separate distributed coordination service. From 3.3 onwards, Kafka's KRaft mode embeds a Raft consensus log in the controller brokers themselves, eliminating the ZooKeeper dependency. This simplifies operations (one system to deploy, one set of metrics) and lifts the metadata-scale ceiling — ZooKeeper-based clusters topped out at ~200,000 partitions per cluster because the controller had to read the whole metadata snapshot from ZooKeeper at startup; KRaft can sustain millions. CricStream migrated from ZooKeeper to KRaft in 2024 and saw broker startup times drop from 4 minutes to 22 seconds.

Why Kafka does not call fsync per write

Kafka relies on replication for durability, not fsync. A write acknowledged with acks=all is in the page cache of every ISR replica — but no fsync has been issued, so a simultaneous crash of all ISR replicas can lose the data. The argument is: simultaneous crashes of N machines in N different racks (or N different AZs) is rarer than the throughput cost of fsync, which would cap the broker at single-disk fsync IOPS (~30k/s on NVMe) rather than the ~10× higher write rate it can sustain otherwise. You can force fsync per N records or per T ms via flush.messages and flush.ms, but almost no production deployment does — the durability story is "two-AZ + ISR + low MTTR", not "fsync".

The unclean-leader-election trade-off

If the entire ISR is lost (all in-sync replicas dead), Kafka has two choices: wait for an ISR member to come back (potentially indefinite unavailability, no data loss) or promote an out-of-ISR follower to leader (immediate availability, lose any records the dead leader had that the new leader didn't). The default since Kafka 1.0 is unclean.leader.election.enable=false — prefer unavailability over data loss. PaySetu keeps it false on payment topics (lose nothing) and true on observability topics (lose some metrics, stay available). The choice is per-topic and reflects the cost of a lost record.

Reproduce on your laptop

# Single-broker Kafka in Docker (KRaft mode, no ZooKeeper).
docker run -d -p 9092:9092 --name kafka apache/kafka:3.7.0
pip install kafka-python
python3 kafka_log_demo.py
# Inspect the partition directories:
docker exec kafka ls -la /var/lib/kafka/data/match-events-demo-0/

Where this leads next

The next chapter in this part covers Kafka consumer groups and rebalancing — how a group of consumers divides partitions among themselves, how the rebalance protocol stalls reads during reassignment, and the cooperative rebalance protocol that fixes the worst of it.

References