Why logs: the one data structure streaming is built on
The fraud team at Razorpay sits down to design a real-time fraud-block service. They sketch a queue: producers push, consumers pop, the consumer that pops the message processes it, the message disappears. Two days later they discover that when the rule-engine consumer crashes mid-message, the message is gone. They add an "ack". Then they discover that when a second rule-engine wants to read the same stream — the analytics team also needs every transaction, in the same order — the queue can only deliver each message once, so they fork it, badly. Then they discover that during the IPL final, the queue grows to 40 GB in memory and OOM-kills the broker. Each fix moves them, step by step, toward the same data structure every other team converges on: an append-only log on disk. The lesson the industry learnt the hard way is that you cannot dodge the log; the queue is just the log with three of its best properties amputated.
A log is an immutable, totally-ordered, append-only sequence where every record gets a position (an offset) and is never modified after being written. That sounds simple, almost trivial. It is the substrate of every real streaming system because it is the only data structure that simultaneously gives you replay, multiple independent consumers, durability, and total order — without having to coordinate any of the readers. Kafka, Pulsar, Kinesis, Redpanda, Postgres WAL, MySQL binlog, and Redis Streams are all the same idea wearing different clothes.
What "a log" actually means in this chapter
In a data-engineering context a log is not the line your print statement writes to stderr. It is a specific, ancient data structure with three rules: records are appended in order, the order is total (every record has a unique position called the offset, and the offsets are monotonically increasing), and once written a record is immutable — never modified, never deleted out of order, only ever truncated from the oldest end when retention says so. There is no "update record 47". There is "append a new record that supersedes record 47". The log does not care what supersedes means; that's the consumer's job.
Pat Helland, in his 2015 paper Immutability Changes Everything, made the case bluntly: when records never change, every consequence we worry about in mutable systems — concurrency, locks, cache invalidation, lost updates, write skew — collapses into a non-problem. The append-only log is the most extreme expression of that idea. The cost is that you give up the ability to "fix" a record in place; the buy is that everything downstream becomes radically simpler.
This picture hides one detail that is doing all the work: the log does not push messages to readers. Readers pull — they hold their own offset, ask the log "give me what's after offset X", and advance their offset on their own clock. This is the inversion that makes everything else possible.
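To make the pull loop concrete, here is a minimal sketch, ours rather than any library's, with a plain Python list standing in for the log; the on-disk version comes later in the chapter. The reader owns its cursor; the log keeps no per-reader state at all.

# pull_model.py — the pull inversion in miniature. The list stands in for the
# log; each reader holds its own offset and advances it on its own clock.
log_records = []                         # the "broker" side: append-only, never mutated

def append(record):
    log_records.append(record)           # producers append; that is their whole job

def poll(offset, max_records=10):
    # The reader's question: "give me what's after offset X."
    batch = log_records[offset : offset + max_records]
    return batch, offset + len(batch)    # the reader, not the log, keeps the cursor

append({"tx": 1}); append({"tx": 2}); append({"tx": 3})

my_offset = 0
batch, my_offset = poll(my_offset)       # -> 3 records, my_offset == 3
batch, my_offset = poll(my_offset)       # -> [], nothing new yet; poll again later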
The three properties that no other primitive gives you all at once
A queue (RabbitMQ-style), a database table, a pub/sub topic, a shared file — each gives you some subset of what streaming needs. The log is the only primitive that gives you all three at once.
Property 1: total order, established once at write time. When the producer writes record 4 after record 3, every reader sees record 4 after record 3. There is no race, no "did A see this update before B did" question, no consensus needed at read time. The order was decided when the broker assigned the offset, full stop. For workloads where order matters — and "order matters" is the default in any stream of state changes (transactions for one user, edits to one document, GPS pings for one rider) — this single property eliminates an entire class of bugs that distributed databases spend whole chapters teaching you to work around.
Property 2: replay from any offset. Because records are immutable and offsets are stable, a reader can ask "give me everything from offset 14,239,772 onwards" and get exactly that. A new fraud-rules version that needs to be tested against the last week of traffic just starts a new reader from the offset that corresponds to a week ago. A backfill of a downstream warehouse just starts a reader from offset 0. A bug-fix re-run of an analytics job rewinds its offset by 6 hours and re-reads. None of this needs the producer to do anything special. The producer wrote the data once; it is sitting there, immutable, waiting to be re-read by whoever asks. Why this is harder than it sounds in a queue: queues, by design, delete a message once it is acked. To replay, you have to either build a "dead-letter" plus "redrive" mechanism, which is a partial log glued onto the queue, or persist the messages elsewhere — which is just admitting you needed a log.
Property 3: multiple independent consumers, with their own offsets, no coordination. The fraud team's reader, the analytics team's reader, the warehouse-loader, the search-index updater, the audit archive — all five read the same log. None of them slow down the others. None of them need to know about each other. Each holds its own offset; the log doesn't even know how many readers it has. A new team that joins six months later just creates a new offset starting wherever they want and reads. This decoupling — the producer doesn't need to know who reads, and the readers don't need to know about each other — is the heart of why logs scale to organisations with thousands of streaming consumers.
- A queue has property 1 (kind of — within one queue, on one broker, with one consumer group). It does not have property 2 (messages are deleted on ack), and it does not have property 3 cleanly (each fan-out subscriber needs its own copy of every message).
- A database has property 1 (transaction order via the WAL — itself a log!) and property 2 (you can replay history from a snapshot plus the WAL). It does not have property 3 cleanly: readers running long queries lock or version-stack against writers, and the system spends effort hiding the WAL from you.
- A shared file has property 1 (file order) and property 2 (re-readable) but not property 3 (no offset bookkeeping; nobody coordinates retention).

The log is the data structure where all three are first-class.
Building the smallest log that demonstrates the idea
A log fits in 60 lines of Python. We won't replicate it, partition it, or give it network I/O — those are Build 7's later chapters. Right now: just the data structure, on disk, with three readers running at different offsets to prove they don't interfere.
# tiny_log.py — the data structure, stripped to its essentials.
# Append-only, ordered, replayable, multi-reader. No replication, no network.
import os, struct, json, threading

class TinyLog:
    """A log file: 4-byte length header, then JSON payload, repeating."""

    def __init__(self, path):
        self.path = path
        self.lock = threading.Lock()
        if not os.path.exists(path):
            open(path, "wb").close()

    def append(self, record: dict) -> int:
        # Returns the offset of the just-written record.
        payload = json.dumps(record).encode("utf-8")
        with self.lock:
            with open(self.path, "ab") as f:
                offset = f.tell()                  # byte position = our offset
                f.write(struct.pack(">I", len(payload)))
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())               # durable before we return
        return offset

    def read_from(self, start_offset: int):
        # Generator: yields (offset, record) for every record at or after start.
        with open(self.path, "rb") as f:
            f.seek(start_offset)
            while True:
                here = f.tell()
                header = f.read(4)
                if len(header) < 4:                # reached current end-of-log
                    return
                (length,) = struct.unpack(">I", header)
                payload = f.read(length)
                yield here, json.loads(payload.decode("utf-8"))

# --- Demo: one producer, three readers at three different offsets ---
log_path = "/tmp/tiny.log"
if os.path.exists(log_path):
    os.remove(log_path)                            # fresh file, so the offsets below line up
log = TinyLog(log_path)
for i in range(8):
    log.append({"tx_id": f"tx_{i:03d}", "amount_paise": 19999_00, "merchant": "razorpay-test"})

print("--- Reader A: from start (analytics backfill) ---")
for off, rec in log.read_from(0):
    print(f"  offset={off:>5}  {rec}")

print("--- Reader B: from offset 0, takes first 3 then pauses ---")
gen = log.read_from(0)
for _ in range(3):
    print(f"  {next(gen)}")
print("  (Reader B's offset is now wherever it stopped; it does not affect Reader A.)")

print("--- Reader C: tail mode, only NEW records written after now ---")
tail_start = os.path.getsize(log.path)
log.append({"tx_id": "tx_008", "amount_paise": 50000_00, "merchant": "razorpay-test"})
for off, rec in log.read_from(tail_start):
    print(f"  offset={off:>5}  {rec}   (Reader C only saw this one)")
# Sample run (each record frames as 4 header bytes + a 73-byte JSON payload = 77 bytes):
--- Reader A: from start (analytics backfill) ---
  offset=    0  {'tx_id': 'tx_000', 'amount_paise': 1999900, 'merchant': 'razorpay-test'}
  offset=   77  {'tx_id': 'tx_001', 'amount_paise': 1999900, 'merchant': 'razorpay-test'}
  offset=  154  {'tx_id': 'tx_002', 'amount_paise': 1999900, 'merchant': 'razorpay-test'}
  offset=  231  {'tx_id': 'tx_003', 'amount_paise': 1999900, 'merchant': 'razorpay-test'}
  offset=  308  {'tx_id': 'tx_004', 'amount_paise': 1999900, 'merchant': 'razorpay-test'}
  offset=  385  {'tx_id': 'tx_005', 'amount_paise': 1999900, 'merchant': 'razorpay-test'}
  offset=  462  {'tx_id': 'tx_006', 'amount_paise': 1999900, 'merchant': 'razorpay-test'}
  offset=  539  {'tx_id': 'tx_007', 'amount_paise': 1999900, 'merchant': 'razorpay-test'}
--- Reader B: from offset 0, takes first 3 then pauses ---
  (0, {'tx_id': 'tx_000', 'amount_paise': 1999900, 'merchant': 'razorpay-test'})
  (77, {'tx_id': 'tx_001', 'amount_paise': 1999900, 'merchant': 'razorpay-test'})
  (154, {'tx_id': 'tx_002', 'amount_paise': 1999900, 'merchant': 'razorpay-test'})
  (Reader B's offset is now wherever it stopped; it does not affect Reader A.)
--- Reader C: tail mode, only NEW records written after now ---
  offset=  616  {'tx_id': 'tx_008', 'amount_paise': 5000000, 'merchant': 'razorpay-test'}   (Reader C only saw this one)
The lines that matter:
- offset = f.tell() — the offset is just the byte position. We do not maintain a separate "logical offset" table; the file's geometry is the offset space. This is the same trick Kafka uses internally, with one subtlety: Kafka serves a logical offset (a record number) to clients but maintains a sparse index from logical to byte offset. The byte offset is the truth.
- os.fsync(f.fileno()) — durability before append returns. If the process crashes one nanosecond after append returns, the record is still on disk and a fresh reader will see it. Without fsync the buffer cache could lose 30 seconds of writes on a kernel panic, which is exactly the kind of bug that turns a 99.99% durable system into a 99.5% one once a year.
- self.lock — only the producer holds the lock; readers don't lock at all. Because records never change once written, a reader can read concurrently with the producer appending, and there is no torn-read concern as long as the reader respects the length header and only reads bytes that were already there when it started. Why no reader lock works: the producer extends the file at the end; it never modifies bytes the reader has already seen. The kernel's page-cache semantics guarantee that bytes written and fsync-ed are visible to a subsequent read from any process. Readers that race past the current end just see EOF, exactly what read_from already handles.
- tail_start = os.path.getsize(log.path) — Reader C's "tail mode" trick. Read from the current end of the file forward; only see new records. This is how every "subscribe to live events" feature in every streaming system is implemented at the byte level.
What this 60-line program already gives you, that a queue does not: replay from any offset, multiple independent readers, durable on-disk storage, total order, and the ability for a brand-new reader to walk the entire history. What it does not yet give you: partitioning across machines (Chapter 49), replication for fault tolerance (Chapter 51), retention with bounded storage (Chapter 52), or the producer/consumer protocols a network system needs (Chapters 48 and 50). Build 7 fills in those gaps one at a time. Today's chapter is just the why.
Why the log keeps reappearing inside other systems
Here is a list of systems that, internally, are built on a log even when they look like something else:
- Postgres WAL. Every write to a Postgres table is first written to the write-ahead log, a sequence of records on disk. Crash recovery rebuilds the heap and index pages by replaying the WAL. Replication is "ship the WAL to a follower". Point-in-time recovery is "replay the WAL up to timestamp T".
- MySQL binlog. The same idea, exposed for downstream consumption. Every CDC tool (Debezium, Maxwell, Canal) is a log reader for the binlog. Build 11 covers this in detail.
- Redis Streams (XADD/XREAD). Redis added Streams in 5.0 specifically because pub/sub had no replay, and customers kept reinventing logs in user code on top of Redis. The Streams data type is just a log with consumer groups bolted on.
- Kafka. The log is the headline product. Producers append to topic-partitions; consumers pull by offset; everything else (Kafka Streams, ksqlDB, Connect) reads and writes logs.
- AWS Kinesis Data Streams. Marketed as a "real-time data stream". Internally: a partitioned, sharded, append-only log with a 24-hour-to-365-day retention window.
- Apache Pulsar. Looks like a queue + topic system on the outside. Internally: a managed ledger built on Apache BookKeeper, which is — yes — an append-only log abstraction.
- etcd / Raft. The Raft consensus algorithm, which etcd uses to provide strongly-consistent KV storage to Kubernetes, is at heart "agree on the order of entries in a replicated log". The KV store is a state machine fed by the log.
- Git. Each commit is an append to a per-branch log; the working tree is a materialisation of the log up to HEAD. Every time you git pull, you fetch the new log entries.
- Event-sourced systems (Axon, EventStoreDB). The application's database is a log of domain events; current state is a fold over the log.
The pattern is so consistent that Jay Kreps's 2013 essay The Log: What every software engineer should know about real-time data's unifying abstraction (the founding text of Kafka's design) makes a stronger claim: the log is the universal substrate of distributed data systems, and most of what we call "the database" is a particular fold over the log. The CDC chapter (Build 11) makes this point concrete — once you can read the database's internal log, you can rebuild the database's table state anywhere else.
The deep idea here is that the log decouples the source of truth from the materialisation. The log says what happened; the search index, the SQL mirror, and the live counter each say what currently is. If a materialisation is corrupted by a bug, you don't recover from a backup — you delete it and replay the log. This is why logs go hand-in-hand with idempotent consumers (Build 2): replay only works if processing the same record twice produces the same downstream state.
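As code, the fold looks like this; it reuses the TinyLog demo above, and the function name materialise and the choice of tx_id as the key are ours, purely illustrative:

# materialise.py — current state as a fold over the log (reuses TinyLog above).
def materialise(log, start_offset=0):
    # "What currently is" is a fold over "what happened". The view is
    # disposable: delete it and replay from 0 to rebuild it exactly.
    latest_per_key = {}
    for offset, record in log.read_from(start_offset):
        latest_per_key[record["tx_id"]] = record   # later records supersede earlier ones
    return latest_per_key

view = materialise(log)
# The assignment above is idempotent: folding the same record twice yields the
# same state, which is why replay and idempotent consumers travel together.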
What "queue" gives up that "log" keeps
Indian platform teams that grew up on RabbitMQ or AWS SQS often arrive at Kafka with the wrong mental model. They expect "a faster queue". They get a log, and the differences trip them up for the first three months. The trip-ups are worth listing because they are the same trip-ups, in the same order, every time.
Acks delete messages. In RabbitMQ, a successful ack removes the message. In Kafka, an "ack" by a consumer is just an offset commit — the message stays on disk, available for re-read, available for the next consumer group, available for replay. Engineers who worked on RabbitMQ for years instinctively reach for "redrive" or "dead letter" patterns in Kafka and then realise they don't need them; the log already handles it.
Fan-out is a separate concept. In RabbitMQ, fan-out to multiple consumers requires a fan-out exchange and one queue per consumer. In Kafka, every consumer group independently reads the same topic; fan-out is free. A new use case for the data is a new consumer group, full stop, no producer change.
Order is per-partition, not per-topic. In a single-queue system, order is whatever the queue decided. In Kafka, order is preserved within a partition but not across partitions. This is the price of horizontal scale and is the topic of Chapter 49.
Retention is by time/size, not by ack. A queue's retention is "until acked". A log's retention is "until the time-or-size policy says delete it". Set retention to 7 days and a backfill engineer can rewind a week. Set retention to 1 hour and you cannot. The choice is operational, not architectural.
The reader, not the broker, tracks progress. In Kafka the consumer commits its offset to a special internal topic (__consumer_offsets). The broker's job is just to hold the data; offset bookkeeping is the consumer's. This is why a buggy consumer can be reset by simply rewinding its committed offset — the broker doesn't care.
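In TinyLog terms the division of labour is a few lines the reader owns, sketched below with an assumed sidecar file (/tmp/tiny.offset) standing in for Kafka's __consumer_offsets topic; the helper names are ours.

# offsets.py — progress tracking lives with the reader, not the log.
import os, json

OFFSET_FILE = "/tmp/tiny.offset"        # the reader's private bookkeeping

def load_offset():
    return int(open(OFFSET_FILE).read()) if os.path.exists(OFFSET_FILE) else 0

def commit_offset(offset):
    with open(OFFSET_FILE, "w") as f:
        f.write(str(offset))            # the "ack": a number the reader wrote down

# Process-then-commit loop. Re-run after a crash and it resumes from the last
# commit; to "redrive", overwrite the file with an earlier offset. The log
# itself never changes either way.
for offset, record in log.read_from(load_offset()):
    print("processing", record["tx_id"])
    # next record starts after the 4-byte header plus the payload; re-encoding
    # reproduces the stored length because append used json.dumps defaults too
    commit_offset(offset + 4 + len(json.dumps(record).encode("utf-8")))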
Common confusions
- "A log is the same as a queue." A queue deletes a message when it's acknowledged; a log keeps every message until retention says otherwise. A queue typically pushes; a log is pulled. A queue is consumed once; a log is consumed by every reader independently. They are different shapes solving different problems, and conflating them is the most common Kafka onboarding mistake at Indian teams migrating from RabbitMQ or SQS.
- "A log is the same as a database table." A log is append-only and immutable; a table supports update and delete in place. A table answers "what is the current state of row X"; a log answers "what is the entire history of changes to row X". You can build a table from a log (apply every event in order); you cannot easily build a log from a table without a CDC mechanism that mines the table's own internal log.
- "Multiple readers means each gets a different message." Wrong by default. Multiple readers, by default, each see every message from the offset they started at. If you want load-balancing across readers — only one reader processes each message — you join them into a consumer group, which Chapter 50 covers. Consumer groups are an opt-in pattern on top of the log; the log itself is broadcast.
- "If I write a record and it's the most recent, it has the highest offset." Only on a single partition. Across partitions, offsets are independent — partition 0's offset 1000 has no time relationship with partition 1's offset 1000. Cross-partition ordering is something you have to either give up or recover with timestamps + watermarks (Build 8).
- "The log is just an array; I could replace it with a Postgres table with an
id SERIALcolumn." You could, and Postgres would even give you replay (SELECT * WHERE id > X). What you'd lose: write throughput (Postgres pays for indexes you're not using), retention (you'd build the truncate yourself), partitioning (Postgres has it but it's heavyweight), and the fundamental cost model (one S3-backed Kafka broker handles 1M events/sec; one Postgres table doing 1M inserts/sec is a war room). - "Logs are only for streaming use cases." The log is the substrate of consensus (Raft), replication (every database does WAL shipping), event sourcing (the app's truth IS the log), CDC (mining the database's internal log to build a stream), and version control (git). It is the most-reused data structure in distributed systems, full stop. Streaming is where the log surfaces as a product, but it has been hiding inside everything you use for decades.
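The broadcast-versus-work-sharing contrast promised above, as runnable code against the TinyLog demo; the modulo-on-key split is a deliberately crude stand-in for the rebalancing protocol Chapter 50 describes.

# broadcast_vs_group.py — broadcast is the default; work-sharing is opt-in.
# Broadcast: two readers, two offsets, each sees EVERY record.
for reader in (log.read_from(0), log.read_from(0)):
    print([rec["tx_id"] for _, rec in reader])   # both print all nine records

# Work-sharing: two workers split records by key, a poor man's consumer group.
# Each record is processed by exactly one worker (hash is stable within a run).
def worker(worker_id, n_workers=2):
    for _, record in log.read_from(0):
        if hash(record["tx_id"]) % n_workers == worker_id:
            print(f"worker {worker_id} handles {record['tx_id']}")

worker(0)
worker(1)                                        # together: each record exactly once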
Going deeper
Why offsets are byte positions, and how Kafka indexes them
In our 60-line TinyLog, the offset is the byte position. Kafka is more sophisticated: each record gets a logical offset (a 64-bit integer that increments per record), and Kafka maintains a separate sparse index file that maps logical offsets to byte positions every 4 KB or so. To serve a fetch request like "give me starting at offset 14,239,772", Kafka does a binary search in the index file (O(log N)), seeks to the resulting byte position, and scans forward a few records until it finds the exact one. This costs roughly one disk seek per fetch — not per record, per fetch. Once the seek is done, the kernel's sendfile(2) system call streams the bytes from page cache to network socket without copying through user-space memory. This is the basis of Kafka's famous throughput-per-broker numbers. Why sendfile is the unlock: a naive "read into Java buffer, write to socket" pipeline copies bytes four times (disk → kernel page cache → JVM heap → socket buffer → NIC). sendfile is one DMA from the page cache directly to the NIC. On a 2026 NVMe + 25 Gbps NIC node, the difference is roughly 6× throughput.
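In TinyLog terms, a sketch of that lookup path (the byte positions below come from the sample run earlier; Kafka's real .index files are binary, per-segment, and memory-mapped, but the search has the same shape):

# sparse_index.py — logical offset -> byte position, sparsely indexed.
import bisect

# One entry every few records; most records are deliberately NOT indexed.
# (logical offset, byte position), values taken from the sample run above.
index = [(0, 0), (4, 308), (8, 616)]
logicals = [logical for logical, _ in index]

def locate(target_logical):
    # Binary search (O(log N)) for the greatest indexed entry <= the target;
    # the caller then seeks to that byte and scans forward a few records.
    i = bisect.bisect_right(logicals, target_logical) - 1
    return index[i]

print(locate(6))   # -> (4, 308): seek to byte 308, skip 2 records, serve from there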
Sequential I/O is what makes append-only fast
A common worry from engineers seeing the log idea for the first time is: "Disk is slow. How can a system that's basically a flat file beat an in-memory queue?" The answer is that disks are slow at random I/O and very, very fast at sequential I/O. A modern NVMe SSD does about 2–7 GB/s of sequential writes; scatter the same bytes as random 4 KB writes and throughput drops by an order of magnitude or more, and a spinning disk collapses to roughly 1 MB/s. The log structure is only sequential writes, by design. The producer always appends at the tail. The consumer always reads forward. There is no random seek in the steady state. This is the same insight LSM trees (Build 12, on Iceberg/Delta merge-on-read) use; it is the same insight database WALs use; it is the same insight the JFS, ZFS, and ext4 journals use. The log is the data structure that maximises the use of disks' best mode and avoids their worst.
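You can probe the gap on your own hardware with a crude script like the one below; absolute numbers vary enormously with the disk, filesystem, and fsync behaviour, so treat it as a demonstration rather than a benchmark. It assumes a POSIX system (it uses os.pwrite).

# seq_vs_random.py — crude probe of sequential vs random write cost (POSIX).
import os, random, time

N, BLOCK = 2000, 4096
buf = os.urandom(BLOCK)

def timed_writes(sequential, path):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.ftruncate(fd, N * BLOCK)
    start = time.perf_counter()
    for i in range(N):
        pos = i * BLOCK if sequential else random.randrange(N) * BLOCK
        os.pwrite(fd, buf, pos)
        os.fsync(fd)                     # hit the disk, not just the page cache
    os.close(fd)
    return N * BLOCK / (time.perf_counter() - start) / 1e6

print(f"sequential: {timed_writes(True,  '/tmp/io_seq.bin'):.1f} MB/s")   # log-style appends
print(f"random:     {timed_writes(False, '/tmp/io_rand.bin'):.1f} MB/s")  # scattered 4 KB writes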
The economic consequence is that on a single Kafka-class broker (one machine, 24 cores, 256 GB RAM, NVMe drives), throughput numbers like 1 GB/s ingest, 4 GB/s egress (because of fan-out), and a million events/sec are routine. Razorpay's payments-events Kafka cluster, as of public 2024 talks, ran around 8 brokers handling 4 million events per minute with peak headroom for IPL-finals traffic at roughly 12× normal. The cost per event, including replication, was about ₹0.0008 — less than a tenth of a paisa.
Compaction: the log that "remembers state, not history"
Sometimes you don't want every event ever; you want only the latest event per key. Kafka offers a second retention mode called log compaction where, instead of deleting records older than 7 days, the broker periodically rewrites the log to keep only the most recent record for each key. The result is still an append-only log, but old superseded records get garbage-collected. This is how Kafka stores its own __consumer_offsets topic: each commit is a record, and only the most recent commit per consumer-group-and-partition matters. Chapter 52 (retention-compaction-tiered-storage) covers the mechanism in detail. The point for now: compaction does not break any of the three properties — order, replay, multi-reader. It only changes what "the log's history" means. Why this matters for state stores: a compacted topic is the canonical way to seed a new state store. The state of a stream operator's materialised view is just the latest value per key; that's exactly what a compacted log gives you. Reseeding a crashed operator is "replay the compacted topic from offset 0". This is the design of Kafka Streams' state-store backups.
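A toy version of the mechanism against TinyLog, ours and deliberately simplified (a real cleaner compacts segment-by-segment in the background and never rewrites the active segment): keep the newest record per key, preserve the survivors' relative order, swap atomically.

# compact.py — toy log compaction: the newest record per key survives.
import os

def compact(log, key_field="tx_id"):
    latest = {}                                        # key -> (offset, record)
    for offset, record in log.read_from(0):
        latest[record[key_field]] = (offset, record)   # later appends supersede earlier
    compacted = TinyLog(log.path + ".compacting")
    for _, record in sorted(latest.values(), key=lambda kv: kv[0]):
        compacted.append(record)                       # survivors keep their order
    os.replace(compacted.path, log.path)               # atomic swap on POSIX

# Still a log: ordered, append-only, replayable. Only the history shrank.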
Where the log idea breaks down
The log is not a free lunch; there are workloads it does not fit.
- Random-access lookups by primary key. "Give me the row for user_id=42" is a database operation, not a log operation. The log can feed a downstream KV store that answers it, but the log itself is not the right primitive for point lookups. (Some systems, like Pulsar's tiered storage, layer indexes on top — but at that point you are no longer using the log for lookups; you are using the index.)
- Mutable in-place state. A live game-state with thousands of players changing position 60 times a second per player is better expressed as a state store with mutable cells, not a log of every position change. (You can still log significant state transitions, for replay, but the hot path is the state store.)
- Latency floors below the network round-trip. Reading from a log involves a network call to the broker. If your decision must be made in 50 microseconds, the log is too slow; you want an in-process structure (a ring buffer, a memory-mapped file). HFT systems do this. Most application latency budgets are 10–100 ms, well above the log's overhead.
- Schema-less garbage with no consumer. Don't put data on a log just because you can. The log demands ordered, schema'd events; if your producer emits "whatever, sometimes JSON, sometimes nothing", you've built a write-only swamp, not a log. Build 5's data contracts apply here in full force.
Where this leads next
The next chapter, /wiki/a-single-partition-log-in-python, takes the 60-line TinyLog you just read and grows it into a single-partition broker — adding a record format, segment files, an offset index, and a clean producer/consumer protocol. Read it next; it makes the abstraction concrete enough that Kafka's later complications stop being scary.
After that, /wiki/partitions-and-parallelism introduces the second axis: how do you split a single logical log across many machines so that throughput scales horizontally, while preserving per-key order? That's where the partitioning function and the per-key consistent-hash idea enter. /wiki/consumer-groups-and-offset-management handles the how do many consumers cooperate question — the load-balancing fan-out pattern that turns a broadcast log into a work-sharing queue. /wiki/replication-and-isr-how-kafka-stays-up gives the partition fault tolerance: leader/follower replication and the ISR (in-sync replica) protocol that decides when a write is "committed". /wiki/retention-compaction-tiered-storage bounds the storage cost — time-based retention, log compaction, and the 2024-era trick of tiering old segments to S3 to make brokers nearly stateless. /wiki/kafka-vs-pulsar-vs-kinesis-vs-redpanda closes Build 7 with a head-to-head of the four real-world implementations, so you can pick the right one for your team.
The whole arc has one thesis: every streaming system is the same primitive (the log) plus a different choice on three knobs (partitioning, replication, retention). Once you internalise that, the vendor docs stop reading like marketing and start reading like a small set of design choices made on top of a structure you already understand.
References
- Jay Kreps — The Log: What every software engineer should know about real-time data's unifying abstraction (LinkedIn Engineering, 2013) — the foundational essay; required reading for the rest of Build 7.
- Pat Helland — Immutability Changes Everything (CIDR 2015) — the philosophical case for append-only data, beyond just streaming.
- Apache Kafka — Documentation: Design — the canonical reference for how the partitioned log substrate is implemented at scale.
- Diego Ongaro & John Ousterhout — In Search of an Understandable Consensus Algorithm (Raft, USENIX 2014) — the consensus algorithm that powers etcd, ZooKeeper-replacements, and modern Kafka's metadata layer; built on a replicated log.
- Tyler Akidau — The world beyond batch: Streaming 101 (O'Reilly Radar, 2015) — useful background on event-time vs processing-time, which the log makes possible.
- Confluent — Kafka Internals: Storage layer — vendor-leaning but accurate on segment files, indexes, and sendfile.
- /wiki/wall-daily-batches-are-too-slow-for-the-business — the Build 7 wall chapter; why we needed the log in the first place.
- /wiki/a-single-partition-log-in-python — the next chapter; takes the toy log here and grows it into a real one.