Kafka as a distributed log
Imagine one file open in append mode, with a hundred consumers tailing it via tail -f, each remembering its own byte offset and reading at its own pace — and now spread that file across three machines so it survives a disk failure. That is Kafka, and Build 23 starts here because every streaming idea in the next seven chapters is a rule about how to use this one primitive.
Kafka is the append-only log from chapter 2, turned into a distributed service: producers append records to a topic partition, and each consumer remembers its own offset and pulls forward independently. Replication makes the log survive machine loss; partitioning lets it scale. The Build 23 thesis is that once you have a durable shared log, every database becomes one of several views you can compute from it.
What "distributed log" actually means
The name Kafka is a marketing nightmare. People hear "message queue" and pattern-match to RabbitMQ, ActiveMQ, SQS — systems where a producer drops a message, a consumer picks it up, and the message is acknowledged and deleted. Kafka is not that. Kafka is much closer to a giant shared file that nobody is allowed to delete from. Jay Kreps's 2013 essay "The Log" spends its opening paragraphs arguing exactly this distinction. The word log in the title was chosen on purpose.
Three facts define what makes Kafka a log and not a queue:
- Records are not deleted on consumption. A consumer reading a record does not "remove" it. The record stays in the file. Another consumer can come along tomorrow and read the same record. The data is deleted only by retention policy — "delete segments older than seven days" or "keep the last 100 GB" — not by being read.
- Consumers track their own position. Every consumer maintains a number called the offset — "I have read up to record 4,217,892 in this partition." The offset is owned by the consumer, not by Kafka. If the consumer crashes and restarts, it resumes from the offset it last committed. If a new consumer joins, it can choose to start from the very beginning, from the latest, or from any offset in between.
- The order inside a partition is the source of truth. Inside a single partition, records have a strict, total, unchangeable order. Record 4,217,892 was written after 4,217,891 and before 4,217,893. Forever. This ordering is what every downstream system — stream processors, change-data-capture sinks, materialized views — uses to reason about time.
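These three facts fit in a dozen lines of Python. A sketch (the consumer-group names are made up) of two consumers tailing the same in-memory log, each at its own offset:

```python
# A list is already a one-partition log: reads remove nothing,
# each consumer owns its offset, and the order never changes.
log = ['{"user":"riya"}', '{"user":"rahul"}', '{"user":"asha"}']

offsets = {"fraud-checker": 0, "email-notifier": 2}  # per-consumer positions

def poll(consumer: str) -> list[str]:
    """Return everything past this consumer's offset, then advance it."""
    start = offsets[consumer]
    records = log[start:]
    offsets[consumer] = len(log)
    return records

print(poll("fraud-checker"))   # all three records, from offset 0
print(poll("email-notifier"))  # only the last record, from offset 2
print(poll("fraud-checker"))   # [] — caught up; nothing to re-deliver
print(len(log))                # 3 — reading deleted nothing
```

Both consumers read the same records independently, and the log is unchanged afterward — the queue mental model has no way to express this.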
Compare this to a queue: a queue is a FIFO buffer where a record's lifetime ends when one consumer takes it. Kafka's log is more like the Ganga — every consumer dips a different cup at a different bend, the river keeps flowing, nothing is consumed in the take-it-out-of-the-system sense.
Why this difference matters: if records survive being read, you can plug a new downstream system in next month, point it at offset 0, and let it rebuild its own state from the entire history. That is how Build 23's later chapters — materialized views, change data capture, stream/table duality — are even possible. A queue makes you choose at write time who will consume; a log defers that decision indefinitely.
Topics, partitions, and segments
A Kafka topic is a logical name — say, payments or orders. A topic is split into one or more partitions, and each partition is the actual append-only log. Partitions are the unit of parallelism, ordering, and storage.
If you have a topic with one partition, you have one log — exactly the file from chapter 2, except it lives on a Kafka broker and is replicated to two more. Records inside that partition are totally ordered. There is one writer at a time (the leader replica) and many readers.
If you have a topic with sixteen partitions, you have sixteen separate logs. Each is independently ordered, independently consumed, independently replicated. There is no global order across partitions — Kafka does not promise that record A in partition 3 happened before record B in partition 7. If you need ordering for two records, you must put them on the same partition (typically by hashing a shared key).
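A sketch of why keying works, with md5 standing in for Kafka's murmur2 (any stable hash shows the property: equal keys always land on the same partition):

```python
import hashlib

NUM_PARTITIONS = 16

def partition_for(key: bytes) -> int:
    # Real Kafka's default partitioner is murmur2(key) % numPartitions;
    # md5 here is a deterministic stand-in with the same property.
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_PARTITIONS

events = [(b"user-871234", "placed"), (b"user-500", "placed"),
          (b"user-871234", "shipped")]
placed = [(partition_for(key), what) for key, what in events]

# Both of user-871234's events hash to the same partition, so they
# share one totally ordered log; user-500 may land anywhere.
assert placed[0][0] == placed[2][0]
print(placed)
```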
Inside a single partition on disk, the log is broken into segment files: 00000000000000000000.log, 00000000000041523912.log, and so on. Each segment is a flat append-only file capped at, say, 1 GB. Old segments get deleted by the retention policy; new appends always go to the latest segment.
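Naming each segment after its base offset makes "which file holds offset N?" a binary search. A sketch, with hypothetical base offsets:

```python
import bisect

# Hypothetical base offsets of three segment files in one partition.
segment_bases = [0, 41523912, 83190771]

def segment_for(offset: int) -> str:
    """Find the segment whose base offset is the largest one <= offset."""
    i = bisect.bisect_right(segment_bases, offset) - 1
    return f"{segment_bases[i]:020d}.log"

print(segment_for(7))          # 00000000000000000000.log
print(segment_for(41523912))   # 00000000000041523912.log
print(segment_for(60000000))   # 00000000000041523912.log
```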
A consumer group is just a name shared by multiple consumer processes that want to split the work. Kafka assigns each partition in the topic to exactly one member of the group at a time. If the group has three members and the topic has sixteen partitions, the assignment might be 6+5+5 across them. If a member dies, Kafka rebalances: the surviving members pick up the orphaned partitions and resume from the dead member's last committed offset.
Two consumer groups on the same topic see all the records, independently. The fraud-checker group reads every payment for fraud rules; the analytics-loader group reads every payment for the warehouse; the email-notifier group reads every payment to send Riya her receipt. None of them interfere with the others, because the records are not removed and each group has its own offset bookkeeping.
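The 6+5+5 split above is just an even division of partitions over group members. A simplified sketch of range-style assignment (real Kafka offers several assignment strategies, with extra rules for stickiness across rebalances):

```python
def assign(partitions: int, members: list[str]) -> dict[str, list[int]]:
    """Split partition ids across members as evenly as possible."""
    base, extra = divmod(partitions, len(members))
    out, start = {}, 0
    for i, member in enumerate(members):
        count = base + (1 if i < extra else 0)  # first `extra` members get one more
        out[member] = list(range(start, start + count))
        start += count
    return out

print(assign(16, ["c1", "c2", "c3"]))
# c1 gets partitions 0-5, c2 gets 6-10, c3 gets 11-15
```

On a rebalance the same computation reruns with the surviving member list, and each member resumes its new partitions from the last committed offsets.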
A toy distributed log in Python
Talk is cheap. Here is a Kafka-shaped log built in a few dozen lines of Python over a TCP socket. It is one partition, no replication, no consumer groups — just the offset model. Save this as tinylog.py and type it in, do not paste. Each line is a Kafka design decision in miniature.
```python
# tinylog.py — a one-partition Kafka in miniature
import os
import socket
import struct
import threading

LOG_PATH = "tiny.log"
INDEX = []  # offset -> (file_byte_offset, length)

def load_index():
    """Rebuild the in-memory offset index with one pass over the log."""
    if not os.path.exists(LOG_PATH):
        return
    with open(LOG_PATH, "rb") as f:
        pos = 0
        while True:
            hdr = f.read(4)
            if len(hdr) < 4:
                break
            (n,) = struct.unpack(">I", hdr)
            INDEX.append((pos + 4, n))
            f.seek(n, 1)
            pos += 4 + n

def append(payload: bytes) -> int:
    """Append a record, return its assigned offset."""
    with open(LOG_PATH, "ab") as f:
        pos = f.tell()
        f.write(struct.pack(">I", len(payload)))
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # durable before we ack
    INDEX.append((pos + 4, len(payload)))
    return len(INDEX) - 1

def fetch(start_offset: int, max_records: int = 100):
    """Return up to max_records starting at start_offset."""
    out = []
    with open(LOG_PATH, "rb") as f:
        for off in range(start_offset, min(start_offset + max_records, len(INDEX))):
            byte_pos, n = INDEX[off]
            f.seek(byte_pos)
            out.append((off, f.read(n)))
    return out

def recv_exact(conn, n: int) -> bytes:
    """recv() may return fewer bytes than asked; loop until we have n."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf += chunk
    return buf

def serve(conn):
    cmd = recv_exact(conn, 1).decode()
    if cmd == "P":  # PRODUCE: 4-byte length, then payload
        (n,) = struct.unpack(">I", recv_exact(conn, 4))
        payload = recv_exact(conn, n)
        off = append(payload)
        conn.send(struct.pack(">Q", off))
    elif cmd == "F":  # FETCH: 8-byte start offset
        (start,) = struct.unpack(">Q", recv_exact(conn, 8))
        recs = fetch(start)
        conn.send(struct.pack(">I", len(recs)))
        for off, data in recs:
            conn.send(struct.pack(">QI", off, len(data)) + data)
    conn.close()

if __name__ == "__main__":
    load_index()
    s = socket.socket()
    s.bind(("127.0.0.1", 9090))
    s.listen(8)
    print(f"tinylog ready on :9090, log has {len(INDEX)} records")
    while True:
        c, _ = s.accept()
        threading.Thread(target=serve, args=(c,)).start()
```
Walk through every piece, because each one mirrors a real Kafka decision.
INDEX = []. An in-memory list mapping offset → (byte_position, length). Offset 0 is the first record ever written, offset 1 the second, and so on. Real Kafka keeps the same structure on disk in a .index file alongside the .log segment, so the broker can seek straight to any offset without scanning. The structure is the same hash-index-over-an-append-log idea from the chapter 2 follow-ups — Kafka just calls it the offset index.
struct.pack(">I", len(payload)) then f.write(payload). Length-prefixed binary framing. Read four bytes, learn how long the record is, read that many bytes. Real Kafka's wire format is more elaborate (it has a checksum, a timestamp, optional headers, batch compression) but the shape is identical: header + body, header tells you body length.
f.flush(); os.fsync(f.fileno()). Both calls. flush() pushes Python's buffer to the kernel; fsync() pushes the kernel's buffer to the disk controller. Without fsync the broker can ack a write that vanishes on power loss. Real Kafka's acks=all durability story rests on this being honest — the leader writes, the followers write, all of them fsync before the leader sends the producer "ok". Chapter 3 walks the layer stack in detail.
Why we keep an in-memory INDEX at all: without it, finding offset N would require scanning the whole tiny.log file, reading each length prefix, and counting. That is the O(n) scan problem all over again. The index makes fetch(N) a single seek + read. On startup load_index() rebuilds it in one full pass — Kafka does the same thing, called log recovery.
cmd == "P" (produce) and cmd == "F" (fetch). The whole protocol is two operations, mirroring Kafka's Produce and Fetch API requests. Everything else in the real Kafka protocol — ListOffsets, OffsetCommit, JoinGroup, Heartbeat — is bookkeeping around these two.
Consumer offset tracking is the consumer's problem. Notice this server does not store any consumer state. It does not know who has read what. The fetch caller passes the start offset every time. In real Kafka the broker also stores offsets in a special compacted topic (__consumer_offsets) as a convenience, but conceptually the offset belongs to the consumer.
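A sketch of what "the offset belongs to the consumer" means in practice, with a local file (name invented) playing the role of the commit store:

```python
# The consumer persists its own position; the server never sees it.
import os

COMMIT_FILE = "fraud-checker.offset"   # hypothetical per-group commit store

def committed_offset() -> int:
    if not os.path.exists(COMMIT_FILE):
        return 0                       # no commit yet: start from the beginning
    with open(COMMIT_FILE) as f:
        return int(f.read())

def commit(offset: int) -> None:
    with open(COMMIT_FILE, "w") as f:
        f.write(str(offset))

log = ["a", "b", "c", "d"]
start = committed_offset()             # a fresh consumer starts at 0
processed = log[start:]                # ... process each record here ...
commit(start + len(processed))         # commit only after processing
print(committed_offset())              # a "restarted" consumer resumes here
os.remove(COMMIT_FILE)                 # cleanup so the demo reruns cleanly
```

Committing after processing (not before) is the same choice real consumers face: commit-early risks skipping records on a crash, commit-late risks reprocessing them.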
Run it and talk to it from another terminal:
```python
# test_tinylog.py — a tiny client
import socket
import struct

def recv_exact(s, n: int) -> bytes:
    """Loop until exactly n bytes arrive (recv may return short)."""
    buf = b""
    while len(buf) < n:
        chunk = s.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("server closed mid-message")
        buf += chunk
    return buf

def produce(payload: bytes) -> int:
    s = socket.socket()
    s.connect(("127.0.0.1", 9090))
    s.send(b"P" + struct.pack(">I", len(payload)) + payload)
    return struct.unpack(">Q", recv_exact(s, 8))[0]

def fetch(start: int):
    s = socket.socket()
    s.connect(("127.0.0.1", 9090))
    s.send(b"F" + struct.pack(">Q", start))
    n = struct.unpack(">I", recv_exact(s, 4))[0]
    out = []
    for _ in range(n):
        off, length = struct.unpack(">QI", recv_exact(s, 12))
        out.append((off, recv_exact(s, length)))
    return out

print(produce(b'{"user":"riya","amt":2500}'))    # 0
print(produce(b'{"user":"rahul","amt":15000}'))  # 1
print(produce(b'{"user":"asha","amt":499}'))     # 2
for off, payload in fetch(0):
    print(off, payload)
```
Output on a 2024 M2 MacBook:
```
0
1
2
0 b'{"user":"riya","amt":2500}'
1 b'{"user":"rahul","amt":15000}'
2 b'{"user":"asha","amt":499}'
```
Three records produced, three records fetched from offset 0. Run the fetch again with fetch(2) and you get one record back — offset 2 onward. Run it with fetch(0) from a different terminal and you get all three again, because the records did not go anywhere. That is the log property in action.
The tiny benchmark on the same machine, with one client, no batching:
```
$ python bench.py
produced 100,000 records in 19.4s — 5,150 produces/sec
fetched 100,000 records in 0.31s — 322,000 fetches/sec
file size: 5.4 MB
```
Production Kafka does roughly 100× better on produces (millions/sec per broker on hot batches) because it batches records, uses zero-copy sendfile for fetches, and avoids fsync on every write — but the produce-fetch asymmetry is already visible: appends are paced by fsync round-trips, fetches stream out at SSD/network speed. This is the same write-fast / read-fast tradeoff every log-structured system inherits from chapter 2.
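bench.py itself is not shown above, but the asymmetry can be reproduced locally without sockets. This sketch times fsynced appends against streaming reads; absolute numbers will differ by machine, the shape should not:

```python
# Appends pay an fsync round-trip per record; reads stream back
# from the page cache. This is the asymmetry from the benchmark.
import os
import struct
import time

PATH = "bench.log"
N = 500
payload = b"x" * 50

t0 = time.perf_counter()
with open(PATH, "wb") as f:
    for _ in range(N):
        f.write(struct.pack(">I", len(payload)) + payload)
        f.flush()
        os.fsync(f.fileno())          # durability on every single record
write_s = time.perf_counter() - t0

t0 = time.perf_counter()
count = 0
with open(PATH, "rb") as f:
    while hdr := f.read(4):           # read length prefix, then body
        (n,) = struct.unpack(">I", hdr)
        f.read(n)
        count += 1
read_s = time.perf_counter() - t0

print(f"{N} fsynced appends in {write_s:.3f}s; {count} reads in {read_s:.3f}s")
os.remove(PATH)
```

Batching is the standard cure on the write side: amortize one fsync over hundreds of records, which is exactly what Kafka's producer batches do.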
Replication: surviving a broken disk
One log on one machine is fragile. The disk fails, the building loses power, the leader broker kill -9s itself — the data is gone and every consumer is stuck. Kafka solves this by replicating each partition to N brokers and electing one as the leader; the others are followers that pull-replicate from the leader and stay byte-for-byte in sync.
The data flow on a produce with acks=all:
- Producer sends a batch of records to the partition's leader broker.
- Leader appends to its local log, fsyncs, and advances its log-end offset (LEO).
- Followers, which are continuously fetching, pull the new bytes, append to their local logs, fsync, and acknowledge to the leader.
- Once every follower in the in-sync replica (ISR) set has acknowledged, the leader advances the high-water mark (HWM) — the offset up to which records are considered committed.
- Only then does the leader send "ok" to the producer.
The HWM is the offset that consumers are allowed to see. A record is appended to the leader's log instantly but is invisible to consumers until the HWM passes it. Why? Because if the leader crashes before the followers catch up, that record might be in the leader's local log but not on any follower; whichever follower wins the leader election would not have it, and consumers who had read it would see a "phantom" record that, after failover, no longer exists. Hiding records below the HWM prevents this lie.
Why ISR membership matters: a follower that falls behind by more than replica.lag.time.max.ms (default 30 s) is kicked out of the ISR. The leader stops waiting for it before advancing the HWM. This is the trade Kafka makes between durability (more replicas in the ISR = safer commits) and availability (kicking slow followers prevents one sick disk from stalling the whole cluster). Setting min.insync.replicas=2 on a 3-replica topic is the sweet spot: a write needs the leader plus one follower to commit, so a single broker failure is survivable without data loss.
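The HWM bookkeeping is small enough to model directly. A sketch, with invented broker names:

```python
# The high-water mark is the minimum log-end offset (LEO) across
# the in-sync replica set — leader included.
def high_water_mark(leos: dict[str, int], isr: set[str]) -> int:
    """Committed offset: every ISR member has everything below this."""
    return min(leos[replica] for replica in isr)

leos = {"leader": 105, "follower-1": 103, "follower-2": 97}
isr = {"leader", "follower-1", "follower-2"}
print(high_water_mark(leos, isr))   # 97 — consumers may read up to here

# follower-2's disk stalls and it drops out of the ISR: the HWM is
# no longer held back by it, at the cost of one fewer safe copy.
isr.discard("follower-2")
print(high_water_mark(leos, isr))   # 103
```

The trade in the prose is visible in the two numbers: shrinking the ISR advances the HWM (availability) while reducing how many replicas hold each committed record (durability).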
If the leader dies, the controller (a designated broker, since KIP-500 / KRaft replaced ZooKeeper) elects a new leader from the ISR. The new leader was guaranteed to have every record up to the old HWM, so consumers see no gaps. Records that were written to the old leader but had not yet replicated — sitting in the old leader's local log above the HWM — are lost. This is called the uncommitted-data-on-leader-failover window, and it is why acks=all plus min.insync.replicas=2 is the correct setting for any topic where you cannot afford to lose records (every payment, every order, every PhonePe transaction event).
This is also where Kafka stops being just "the log" and starts being a distributed system. Replication via leader-and-followers, leader election, ISR maintenance, controller failover — these are the same primitives Raft and Paxos operate on, and Kafka's KRaft mode literally uses Raft for the cluster metadata. The distributed log is built on a smaller distributed log (the metadata log) which is built on consensus.
Why every modern system grew a Kafka shape
The thing that surprises engineers the first time they see a serious Kafka deployment at, say, Flipkart or Razorpay, is that the company does not really use Kafka as a "queue between two services". They use it as the central nervous system between every database and every other database. Orders go into Kafka. The MySQL OLTP, the Postgres analytics replica, the Elasticsearch search index, the Redis cache, the Snowflake data warehouse, the fraud-rules Flink job, the email service, the loyalty-points service — every one of these is a consumer of the same orders topic.
The reason this pattern keeps emerging: a database is a particular view of a stream. The MySQL row for order #871234 is a view computed by replaying every event ever sent to that order. The Elasticsearch index is a different view of the same stream, optimized for full-text search. The Redis cache is a third view, optimized for low-latency lookup. If you have the stream, every view becomes derivable.
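A sketch of three views derived by replaying one stream (the by_status view here is kept append-only, recording every status an order has passed through):

```python
# One stream of order events, three derived "databases".
events = [
    {"order": 871234, "user": "riya",  "amt": 2500,  "status": "placed"},
    {"order": 871235, "user": "rahul", "amt": 15000, "status": "placed"},
    {"order": 871234, "user": "riya",  "amt": 2500,  "status": "shipped"},
]

row_store = {}      # OLTP-ish view: the current row per order id
user_orders = {}    # analytics-ish view: which orders each user has
by_status = {}      # search-ish view: every status an order passed through

for e in events:
    row_store[e["order"]] = e                              # last write wins
    user_orders.setdefault(e["user"], set()).add(e["order"])
    by_status.setdefault(e["status"], set()).add(e["order"])

print(row_store[871234]["status"])   # shipped
print(sorted(user_orders["riya"]))   # [871234]
print(sorted(by_status["placed"]))   # [871234, 871235]
```

Delete all three dicts and replay the stream from offset 0: you get the same views back. That rebuild-from-history move is the whole pattern.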
This is the thesis of Jay Kreps's "log essay" and the entirety of Build 23. The next chapter — stream/table duality — makes the equivalence formal: a table is a stream snapshotted at a point in time, and a stream is the table's change log. Once you accept that, four things follow:
- Change Data Capture (Debezium, logical decoding) turns every existing OLTP database into a stream producer, by tailing its WAL.
- Materialized views (Materialize, Differential Dataflow) turn a stream into a query-able table by maintaining the answer incrementally as the stream advances.
- Stateful stream processing (Flink) lets you compute over the stream with windows, joins, and aggregations, with exactly-once semantics.
- The "database as materialized view" architecture (final chapter of Build 23) flips the conventional picture upside down: the stream is the source of truth, every database is a derived, disposable, rebuild-able view of it.
You cannot get to any of those without first having the durable distributed log. That is why Build 23 starts here.
Common confusions
- "Kafka is a message queue." It is not. Records are not removed when read; consumers track their own offsets; multiple independent consumer groups can replay the same records weeks later. The mental model is "a shared file with retention", not "an inbox". RabbitMQ and SQS are queues. Kafka, Pulsar, and Redpanda are logs. Building a queue on top of a log is straightforward (one consumer group, short retention, ack-after-process); building a log on top of a queue is essentially impossible, because the queue throws records away.
- "acks=1 is durable enough because it's faster." acks=1 means the leader has written the record to its local log (kernel page cache, not necessarily fsynced), but no follower has it yet. If the leader's machine dies in the next 100 ms, the record vanishes — there is no copy anywhere else. This is acceptable for click-stream telemetry where a 0.001% loss rate is fine; it is catastrophic for payments. For anything money-shaped, use acks=all plus min.insync.replicas=2.
- "Partitioning by user ID gives me ordered events per user." Only if every event for a given user lands on the same partition, which only happens if the producer uses userId as the partition key (not the message key for some other purpose). Kafka's default partitioner hashes the key with murmur2 mod numPartitions. Choose the wrong key and a single user's events get spread across partitions, with no ordering between them.
- "More partitions = more throughput." Up to a point — more partitions mean more parallelism for both producers and consumers. But every partition is replicated to N brokers, every partition has its own offset metadata, every partition is a leader-election candidate during failover. A cluster with 200,000 partitions takes minutes to recover from a controller restart. Confluent's rough rule: stay under ~4,000 partitions per broker, ~200,000 per cluster. The right number is "enough to parallelise your slowest consumer", not "as many as possible".
- "Kafka is exactly-once." Kafka brokers are at-least-once by default. Producer idempotence (enable.idempotence=true, on by default since 3.0) prevents the same producer from writing the same record twice on retry. Transactions (transactional.id=...) atomically commit a batch of writes across multiple partitions. Together — idempotence + transactions + read-committed consumers + careful offset commits — you can build exactly-once processing, the topic of chapter 176. It is not a flag; it is a discipline.
- "Consumer offsets are stored on the consumer." Modern Kafka consumers commit their offsets back to a special internal compacted topic called __consumer_offsets, so a restarted consumer can ask the broker "what offset was I at?" and resume. The truth is still owned by the consumer (it can override and seek anywhere it wants), but the durability of offset commits is, by default, on the broker. This is why kafka-consumer-groups.sh --reset-offsets works.
Going deeper
Kafka the codebase: brokers, ZooKeeper, KRaft
Kafka was originally built at LinkedIn in 2010, open-sourced via Apache in 2011. The cluster needed an external coordination service for things like "who is the controller?", "which brokers are alive?", "what is the topic configuration?" — and it used ZooKeeper for that purpose. ZooKeeper was a separate system, with separate operations, separate failure modes, and a notoriously brittle connection-state machine.
Starting around 2020, KIP-500 (Kafka Improvement Proposal) replaced ZooKeeper with KRaft — a Raft-based metadata quorum running inside Kafka itself, in special "controller" brokers. By Kafka 3.3 (2022) KRaft was production-ready; by 4.0 (2025) ZooKeeper was removed entirely. A modern Kafka cluster is one binary, one operation story, one consensus protocol. The metadata is itself a Kafka topic, replicated by Raft. Logs all the way down.
Tiered storage: separating compute from durability
Holding a year of events on local broker SSDs is expensive. Tiered storage (Confluent's commercial feature, also implemented in Apache Kafka via KIP-405 since 3.6) moves cold segments to S3 (or GCS, Azure Blob) and keeps only the recent segments local. A consumer reading offset 0 reads from S3; a consumer reading the latest reads from the local SSD. This decouples compute (broker count) from durability (object-store bytes), and is what makes ten-year-retention topics economically possible.
The same idea has spawned a whole class of "Kafka-on-S3" rewrites — WarpStream, AutoMQ, Bufstream, Redpanda Cloud — which dispense with broker-local disks entirely and write straight to object storage. Latency is higher (10–100 ms vs sub-millisecond) but throughput is unbounded and cross-AZ replication is free (S3 already does it).
Compaction: when "delete on retention" is wrong
The default retention policy is time-based: delete segments older than 7 days. For an event stream this is the right answer — you do not need yesterday's clicks forever.
But for some topics — customer-profile-changes, inventory-state — you want to retain the latest value for every key forever, even though you do not need the history. Kafka offers log compaction as a second retention mode: keep at most one record per key, discard older records with the same key. Compaction makes a topic into an indefinitely-retained current-state map, which is exactly what a consumer rebuilding a materialized view needs to bootstrap from offset 0.
The mechanism: a background "log cleaner" thread scans the log and rewrites segments, dropping records whose key has been superseded by a later record. Tombstones (records with null value) signal "delete this key" and are themselves cleaned out after delete.retention.ms. Compacted topics underpin Kafka's own __consumer_offsets, every Kafka Streams state store, every Debezium snapshot.
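A sketch of the compaction pass itself: keep the newest value per key, and drop keys whose newest record is a tombstone:

```python
# Compaction collapses a keyed log to its current-state map.
from typing import Optional

def compact(records: list[tuple[str, Optional[str]]]) -> dict[str, str]:
    state = {}
    for key, value in records:       # later records supersede earlier ones
        if value is None:
            state.pop(key, None)     # tombstone: delete the key
        else:
            state[key] = value
    return state

log = [
    ("user:riya",  "pin=400001"),
    ("user:rahul", "pin=560001"),
    ("user:riya",  "pin=110001"),    # supersedes riya's first record
    ("user:rahul", None),            # tombstone: rahul is deleted
]
print(compact(log))   # {'user:riya': 'pin=110001'}
```

The real cleaner does this segment-by-segment in the background rather than in one pass, and keeps tombstones around for delete.retention.ms so slow consumers still see the deletes — but the collapse-to-latest semantics are exactly these.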
Pulsar, Redpanda, Kinesis: the same idea, different tradeoffs
Kafka is one implementation of "the distributed log" but not the only one.
Apache Pulsar (Yahoo, 2016) separates compute from storage by design: Pulsar brokers are stateless and offload all data to BookKeeper (a distributed log primitive). This makes scaling brokers and scaling storage independent operations, at the cost of two layers to operate.
Redpanda (2020) is a from-scratch C++ rewrite of the Kafka protocol with no JVM, no ZooKeeper, single-binary deployment, thread-per-core architecture, and lower tail latencies. It speaks the Kafka wire protocol so existing clients work unchanged. Used by teams that want Kafka's API without Kafka's operations.
AWS Kinesis is Amazon's fully-managed take on the same idea, with a slightly different API and a shard-rebalancing model that is gentler than Kafka's partition rebalancing — at the cost of being AWS-only.
The wire protocols differ; the underlying primitive — partitioned, replicated, append-only logs with consumer-tracked offsets — is identical across all of them.
Why Kafka stays single-leader per partition
Multi-leader writes to one log are an unsolved problem in the general case, because two leaders accepting writes simultaneously cannot agree on the order of those writes without coordinating, which makes them not-really-two-leaders. So Kafka is per-partition single-leader, like nearly every replicated log primitive (Raft, Paxos, ZooKeeper's atomic broadcast, BookKeeper's ledgers). To get more aggregate write throughput, you do not get a second leader for the same partition — you add more partitions, each with its own leader, on a different broker. Horizontal scale comes from partitioning, not from multi-master.
The handful of designs that actually do multi-leader logs (Calvin's deterministic ordering, certain CRDT-based event stores) sacrifice strong per-key ordering or accept eventual consistency; both are detailed in chapter 145 on Calvin and the CRDT chapter. They are interesting and they are not Kafka.
Where this leads next
Build 23 builds upward from this chapter, one layer at a time:
- Chapter 175 — The stream/table duality. The formal equivalence: every table is a stream snapshotted at a point in time; every stream is a table's change log. Once you see this, "database" and "stream" stop being separate concepts.
- Chapter 176 — Exactly-once semantics: how it actually works. Idempotent producers, transactional writes, read-committed consumers, and the discipline of processing-once even when the network retries.
- Chapter 177 — Windowing, watermarks, and event time. What "the last 5 minutes" means when records arrive out of order — and what watermarks promise about late data.
- Chapter 178 — Flink: stateful stream processing. The query engine for streams: joins, aggregations, large keyed state with RocksDB backing, exactly-once via distributed snapshots.
- Chapter 179 — Materialize and Differential Dataflow. SQL-as-streaming-view: a database whose tables are materialized incrementally from streams, kept fresh forever.
- Chapter 180 — Change Data Capture: Debezium, logical decoding. Turning every existing OLTP database into a Kafka producer by tailing its write-ahead log.
- Chapter 181 — Where this is all going: the database as a materialized view. The endpoint of the argument: every database is a derived view; the stream is the source of truth.
For the substrate this chapter rests on, see the append-only log (chapter 2) and fsync, write barriers, and durability (chapter 3). The replication mechanism connects to Raft and to the idea that consensus is a log, not a database.
References
- Jay Kreps, The Log: What every software engineer should know about real-time data's unifying abstraction (LinkedIn Engineering, 2013) — the canonical essay; reads as a manifesto and is the conceptual bedrock of Build 23.
- Apache Kafka design documentation — the authoritative description of partitions, replication, and the ISR protocol, written by the engineers who built it.
- Jay Kreps, Neha Narkhede, and Jun Rao, Kafka: a Distributed Messaging System for Log Processing (NetDB 2011) — the original paper, before Kafka had replication. Useful for seeing what was minimal.
- KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum — the KRaft proposal; explains why a separate coordinator service is no longer the right design.
- Confluent: Designing Event-Driven Systems by Ben Stopford — full free book on building real systems on top of Kafka; pairs well with this chapter and the rest of Build 23.
- The append-only log: simplest store — the chapter-2 primitive Kafka generalises.
- Consensus is a log, not a database — why every replicated state machine, including Kafka's KRaft metadata, ends up looking like an agreed-upon log.